The Art and Science of Data Analysis, Chapter 2: Uncovering the Common Errors in Data Analysis

Dodging the data analysis blunders

Introduction

Data distributions are a crucial element in data analysis, used to evaluate the characteristics of a dataset. They can help identify patterns and trends and provide insight used to design and implement an effective data analysis strategy. That is why many mistakes in data analysis start with an incorrect evaluation of the data distribution.

In this post, I will highlight the most common mistakes caused by inappropriate evaluation of data distributions. In my entire career, I have never met a data analyst (myself included) who has not made at least one of the mistakes described in this post.

You can skip the next two sections if you are already aware of what a data distribution is and what the most common types of distributions are in real-world use cases. But if not, welcome under the hood!

What is Data Distribution

Let's start from the beginning: what is a data distribution in general?

In simple terms, you can think of a data distribution as the range of values an observed variable can take, together with a function that describes how frequently each value occurs.

To make it even more straightforward, let's take a look at the following example:

Let's say I would like to measure the height of my colleagues; in that case, the observable variable is "height." If I collect that data via a survey from my colleagues, most likely, I will get a broad range of values (e.g., from 165 to 210 centimeters). Still, some values will be more frequent than others (let's say between 172.1-178.5 centimeters), and this frequency of values is what we call data distribution.

Data distribution is essential in understanding data sets and in making data-driven decisions. It can be either continuous or discrete.

  • Continuous distributions are ones where the values can take any value between two given points, such as a person's height in the example above.
  • Discrete distributions, on the other hand, are ones where the values can only take a limited set of values, such as the number of children a person has.

Data distributions can also be classified into various types, such as Normal, Binomial, Poisson, and Exponential.

  • Normal data distributions are the most common type of data distribution, and they are symmetrical with a bell-shaped curve.
  • Binomial data distributions are ones where there are only two possible outcomes, such as heads or tails.
  • Poisson data distributions describe counts of events in a fixed interval, while Exponential distributions describe the time between events; both are typically right-skewed.

Data distribution values come from the observations that the researcher makes. Depending on the type of research and the type of data being collected, the researcher collects the data and then analyzes it to determine the data distribution. This data can come from surveys, experiments, field observations, website logs, or other sources.

Most common distributions in data analysis

When it comes to data analysis and modeling, one of the most important topics is understanding the underlying data distribution. In this section, we consider the most common distributions in real-world use cases.

Normal distribution

The normal distribution is also known as a bell-shaped curve because of its symmetrical shape. This distribution is often used to model the behavior of random variables (in particular in physical processes), such as height, weight, and age. The Normal distribution is so widespread because of the central limit theorem, which states that the average of a large number of independent random variables is approximately normally distributed.

A normal distribution can be fully described by just two parameters: its mean and its standard deviation (the mean, median, and mode all coincide). This distribution is the true love of any data analyst because it is very easy to detect outliers, run statistical tests, and sample from it. But one should be careful: if you confuse it with another distribution, your conclusions may be wrong.

# Python example of Normal distribution:

import numpy as np
import matplotlib.pyplot as plt

# Generate a sample of 1000 numbers from the normal distribution
data = np.random.normal(0, 1, 1000)

# Plot the histogram
plt.hist(data)
plt.show()

Poisson distribution

The Poisson distribution is a discrete probability distribution used to model the number of events that occur within a given time or space: the number of purchases made by customers in a given time range, the number of customers who clicked on a banner, the number of re-posts of a given post, and so on.

The Poisson distribution is characterized by a single parameter, the mean rate of occurrence. This parameter is typically estimated from observed data and is used to calculate the probability of a given number of events occurring within a specified time or space. The Poisson distribution takes only non-negative integer values.

Python example of Poisson distribution:

import numpy as np
import matplotlib.pyplot as plt

# Generate a sample of 10,000 purchase counts from the Poisson distribution
data = np.random.poisson(2, 10000)

# Plot the histogram
plt.hist(data)
plt.show()

If you are wondering how that may look in SQL, here is a simple example that produces per-customer order counts (a quantity that often follows a Poisson distribution):

SELECT COUNT(*) AS orders_count
FROM orders
WHERE purchase_date BETWEEN '2019-01-01' AND '2019-01-31'
GROUP BY customer_id;

Binomial distribution

The Binomial distribution is a discrete probability distribution that is used to model the probability of success on a given number of Bernoulli trials. This distribution is commonly used to model the probability of success when there are only two outcomes (success or failure) for each trial.

For example, retail stores use the binomial distribution to model the probability that they receive a certain number of shopping returns each week: suppose it is known that 10% of all orders get returned at a certain store each week. If there are 50 orders that week, we can use a Binomial Distribution to find the probability that the store receives more than a certain number of returns that week.

The Binomial distribution is characterized by two parameters: the number of trials (n) and the probability of success (p). The number of successes is always an integer between 0 and n, while p is a probability between 0 and 1.
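
For the retail example above, the probability of seeing more than a given number of returns can also be computed directly rather than by simulation. Here is a small sketch using scipy.stats; the threshold k is a hypothetical value chosen for illustration:

from scipy.stats import binom

n = 50   # orders in the week
p = 0.1  # probability that any single order is returned
k = 8    # hypothetical threshold: "more than 8 returns"

# P(X > k) is the survival function of the Binomial distribution evaluated at k
prob_more_than_k = binom.sf(k, n, p)
print(f"P(more than {k} returns out of {n} orders) = {prob_more_than_k:.3f}")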

Python example of Binomial distribution for a retail use case:

import numpy as np
import matplotlib.pyplot as plt

# Set Binomial distribution parameters
n = 50
p = 0.1

# Generate a sample of returned orders for 100 stores using the Binomial distribution
data = np.random.binomial(n, p, 100)

# Plot the histogram
plt.hist(data)
plt.show()

and a SQL example:

SELECT week, COUNT(*) AS returned_orders
FROM orders
WHERE returns = 1
GROUP BY week
HAVING COUNT(*) > X

This query returns, for each week, the number of returned orders, keeping only the weeks with more than X returns.

Exponential distribution

The Exponential distribution is a continuous probability distribution used to model the time between events, such as the time between customer purchases or the time between failures of a machine.

The most common use case is "time-to-event" analysis, where the analyst is trying to predict how many cycles will pass before a machine experiences a failure, or how much time is left before the next order from a customer.

The Exponential distribution is characterized by a single parameter, the rate of occurrence. This parameter is typically estimated from observed data and is used to calculate the probability of a given amount of time passing between events. The Exponential distribution is always positive and unbounded, and its density decreases exponentially with time; it is also memoryless, meaning the probability of an event in the next interval does not depend on how much time has already passed.

#Python example of Exponential distribution:

import numpy as np
import matplotlib.pyplot as plt

# Generate a sample of 1000 numbers from the Exponential distribution
data = np.random.exponential(scale = 0.5, size = 1000)

# Plot the histogram
plt.hist(data)
plt.show()

And a SQL example:

SELECT COUNT(*)
FROM orders
WHERE time_between_orders > X

This query returns the number of orders for which the time since the previous order exceeds X.

At DataGPT, we pay close attention to the distributions we observe in dashboards and handle each one appropriately, avoiding the common mistakes I will highlight in the next section.

For instance, there are a few mistakes that we avoid when comparing data using our Algo:

  1. Failing to consider the context of the data: It's important to consider the context in which the data was collected and to understand any potential factors that may have influenced the results.
  2. Comparing apples to oranges: Make sure that you are comparing data that is truly comparable. If you are comparing data from different sources or collected using different methods, it may not be a fair comparison.
  3. Ignoring statistical significance: Be sure to consider whether the differences you are observing in the data are statistically significant.
  4. Overgeneralizing: Be careful not to overgeneralize based on your comparative analysis. It's important to consider the limitations of your data and to be mindful of the factors that may have influenced your results.

Common mistakes

Now it's time to find out how deep the rabbit hole goes. I will walk you through the most common mistakes and share insights that can help you avoid them.

Wrongly defining your metric - the distribution depends on the metric (the outcome variable you are going to track)

This is the number one mistake in data analysis. To understand it better, let's look at a couple of examples:

Example 1: How to measure "big"?

There are various ways to measure how "big" an object, person, or animal is. We might measure a person's "bigness" by their height, or by their weight, or by their body mass index (BMI). Each of these metrics can give us a different answer, as each metric is measuring something different. [Holisticonline.com]

For example, a person could be considered tall (measured by height) but not necessarily heavy (measured by weight).

Example 2: How to measure the "unemployment rate"?

The official US unemployment rate is defined as "total unemployed persons, as a percent of the civilian labor force." However, this raises the question of how "unemployed" and "labor force" are defined. Different definitions of these two terms can lead to very different unemployment rates.

For example, some people might include people who are "underemployed" (working part-time but would like to work full-time) in the calculations, while others might not.[Labor Force Characteristics (CPS) :  U.S. Bureau of Labor Statistics ]

The insights here are:

Do not assume you understand what a measure is measuring before you start running any analysis.

Different metrics can give you different results, so it is important to define your metric properly and make sure you understand what it is measuring before you start.

Additionally, make sure you are aware of any potential biases that might be tied to your metric, and that you are taking those into account when interpreting the results.

Also, be sure to find and read the definition carefully; it may not be what you think.

Be especially careful when making comparisons between different data sets, as they may have different definitions of the same metric. This can lead to inaccurate conclusions.

For example, if two countries are reporting different unemployment rates, you need to make sure that the definitions of "unemployment" are the same for both countries before you can draw any meaningful conclusions.

Example 3: What is a good outcome variable for deciding whether cancer treatment in a country has been improving?

A first thought might be the "number of deaths from cancer". But this is not a very good metric, since it does not take into account the number of people who are diagnosed with cancer. Instead, a better metric might be the mortality rate (number of deaths from cancer per 100,000 people).

Besides, the number of deaths might increase simply because the population is increasing. Or it might go down if cancer incidence is decreasing. "Percent of the population that dies of cancer in one year" would take care of the first problem, but not the second.

The insight here is:

A rate is often a better measure than a raw count.

Example 4: What is a good outcome variable for answering the question, "Do males or females suffer more traffic fatalities?"

In light of the considerations in Example 3, a rate is probably better than a count. But what rate? Deaths per hour traveled or deaths per vehicle owned could be misleading because, e.g., males might own more vehicles than females.

The best metric might be "fatalities per thousand miles traveled". That way, the mortality rate is not affected by the number of vehicles owned and can be easily compared across genders.
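
As a small illustration of why a rate can tell a different story than a raw count, here is a sketch with purely made-up numbers:

# Hypothetical figures: raw fatality counts vs. a rate per thousand miles traveled
fatalities = {"male": 900, "female": 450}
thousands_of_miles = {"male": 300_000, "female": 100_000}

for group in fatalities:
    rate = fatalities[group] / thousands_of_miles[group]
    print(f"{group}: {fatalities[group]} fatalities, "
          f"{rate:.4f} per thousand miles traveled")

# The raw count is twice as high for males, but in this made-up example the
# rate per thousand miles is higher for females - the two measures can disagree.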

Example 5: What is a good outcome variable for research on the effect of medication on bone fractures? The outcome that is really of interest here is the "number of fractures" (or possibly "number of hip fractures," "number of vertebral fractures," etc.), or, taking into account the lesson from Example 3, the "percentage of people in this category who have this type of fracture."

But often, bone density is taken as an outcome of such research. Bone density is correlated with fracture risk, but is not the same as fracture risk. This is an example of what is called a proxy measure (or a surrogate measure).

Sometimes it is impossible (or impractical) to use the real measure, so a proxy measure is used instead. [Outcome Variables]

The insights here are:

It is important to remember that a proxy measure is not the same as a real measure and that one has to be careful when interpreting the results.

Be aware of the difference between the outcome you are interested in, and the measure you are using as a proxy.

Additionally, it is important to understand the definition of the metric you are using and to be aware of any potential biases that might be associated with it.

To overcome all these issues, at DataGPT we have an onboarding service where we carefully review customers' metrics together and work to deeply understand and define them, ensuring each metric is used for the best results.

*For privacy reasons, we cannot share real examples from our customers; however, the examples provided are illustrative and have been adapted from sources 1, 2, 3, 4, and 5 in our reference list.

Analysis of mean

Probably the most common summary measure used in data analysis is the mean. Many common analysis techniques (e.g., t-tests, linear regression, analysis of variance) concern the mean. In many circumstances, focusing on the mean is appropriate. But there are also many circumstances where focusing on the mean can lead us astray.

Here are some of them.

1. The mean may be misleading in highly skewed distributions

In highly skewed distributions, the mean can be misleading because extreme values in one tail of the distribution can have a large influence on the value of the mean. For example, if the data is skewed to the left, the mean will be pulled down, while if the data is skewed to the right, the mean will be pulled up. In a skewed distribution, the mean is usually not in the middle. [Summary Statistics for Skewed Distributions]

Example: The mean of the ten numbers 1, 1, 1, 2, 2, 3, 5, 8, 12, 17  is 52/10 = 5.2. Seven of the ten numbers are less than the mean, with only three of the ten numbers greater than the mean.

A better measure of the center for this distribution would be the median, which in this case is (2+3)/2 = 2.5. Five of the numbers are less than 2.5, and five are greater.
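
You can verify these numbers with a couple of lines of numpy:

import numpy as np

data = [1, 1, 1, 2, 2, 3, 5, 8, 12, 17]

print(np.mean(data))    # 5.2 - pulled up by the long right tail
print(np.median(data))  # 2.5 - a better description of a "typical" value here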

How to deal with skewed data?

Fortunately, many of the skewed random variables that arise in applications are lognormal. That means that the logarithm of the variable is normal, and hence most common statistical techniques can be applied to the logarithm of the original variable. (With robust techniques, approximately lognormal distributions can also be handled by taking logarithms.) However, doing this may require some care in interpretation. There are two common routes to interpretation when dealing with logarithms.

  1. In many fields, it is common to work with the log of the original outcome variable, rather than the original variable. Thus one might do a hypothesis test for equality of the means of the logs of the variables. A difference in the means of the logs will tell you that the original distributions are different, which in some applications may answer the question of interest.
  2. For situations that require interpretation in terms of the original variable, we can often exploit the fact that the logarithm transformation and its inverse, the exponential transformation, preserve order and therefore take the median of one variable to the median of the transformed variable. So if a variable X is lognormal and we take its logarithm, Y = log X, then Y is normal; inferences about the median of Y can then be transformed back, by exponentiation, into inferences about the median of X.

Another option here is quantile regression. Standard regression estimates the mean of the conditional distribution (conditioned on the values of the predictors) of the response variable. For example, in simple linear regression with one predictor X and response variable Y, we calculate an equation y = a + bx that tells us that when X takes on the value x, the mean of Y is approximately a + bx. Quantile regression, by contrast, estimates conditional quantiles, including the median.
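
Here is a minimal sketch of quantile regression with statsmodels, estimating the conditional median of a right-skewed response; the data below is simulated purely for illustration:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a predictor and a right-skewed (lognormal) response
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = np.exp(0.3 * x + rng.normal(0, 0.8, 500))
df = pd.DataFrame({"x": x, "y": y})

# Median (0.5 quantile) regression vs. ordinary least squares (mean regression)
median_fit = smf.quantreg("y ~ x", df).fit(q=0.5)
mean_fit = smf.ols("y ~ x", df).fit()

print(median_fit.params)  # conditional median model
print(mean_fit.params)    # conditional mean model, pulled by the skewed tail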

Python example of a log transformation for skewed data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create skewed distribution
data = np.random.normal(loc=0, scale=1, size=1000)
data[100:] = np.random.random_sample(900)+0.8

# Calculate the mean
mean = np.mean(data)

# Plot histogram
plt.hist(data, bins=20)
plt.show()

# Output
print('Mean of the Data: {}'.format(mean))

# The mean (about 1.17 in the sample output below) is pulled down by the left tail and is an inaccurate representation of the bulk of the data due to the skewed distribution.

# Solution: Log Transformation

# Transform the data using log
data_log = np.log(data + np.abs(np.min(data)) + 0.1)

# Calculate the mean
mean_log = np.mean(data_log)

# Plot histogram
plt.hist(data_log, bins=20)
plt.show()

# Output
print('Mean of the Data (Log): {}'.format(mean_log))

Mean of the Data: 1.1663752245118233

Mean of the Data (Log): 1.2582441644630327

2. The mean may be misleading in bimodal distributions or ordinal distributions

In bimodal distributions, the mean can be misleading for the same reasons that it can be misleading in highly skewed distributions. Extreme values in one tail of the distribution can have a large influence on the value of the mean. It happens because the two modes of the data will pull the mean in opposite directions, resulting in a value that does not accurately represent either mode.

For example, if a bimodal distribution consists of one mode of 2 and another mode of 20, the mean may be around 11, which does not accurately represent either mode.

You can see the detailed example on Python below:

import matplotlib.pyplot as plt

# create a bimodal population (two sub-populations pooled together)
population = list(range(1, 61))
population_a = list(range(1, 21))
population_b = list(range(21, 61))

# calculate the mean
mean = sum(population) / len(population)

# plot the bimodal distribution
plt.hist(population_a, bins=20, color = 'blue', alpha=0.5)
plt.hist(population_b, bins=20, color = 'red', alpha=0.5)
plt.axvline(mean, color='green', label='Mean')
plt.title('Bimodal Distribution')
plt.legend()
plt.show()

The mean is misleading in this bimodal distribution because it is not representative of the underlying data. The mean is pulled up by the larger population in the second peak, while the first peak is not represented in the mean.

To deal with a bimodal distribution, it is better to split the data by mode and analyze each part separately. A simple real-world example might be an e-commerce website: if you analyze B2C and B2B sales together, you will most likely end up with a bimodal case. You may think this example is very straightforward and that one would not usually make such a mistake, but there may be underlying properties that cannot be observed directly from the data (for instance, whether customers are male/female, TikTokers/Instagrammers, or any other customer attribute that is not present in the data you have).
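
A minimal sketch of this idea in pandas, assuming a hypothetical customer_type column that separates the two modes (for example, B2C vs. B2B orders):

import numpy as np
import pandas as pd

# Simulated order values: B2C orders cluster around 50, B2B orders around 500
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "customer_type": ["B2C"] * 800 + ["B2B"] * 200,
    "order_value": np.concatenate([rng.normal(50, 10, 800),
                                   rng.normal(500, 80, 200)]),
})

# The overall mean falls between the two modes and describes neither segment well
print(df["order_value"].mean())

# Per-segment summaries are meaningful
print(df.groupby("customer_type")["order_value"].describe())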

Another example is ordinal distribution. To get into the issue, you can take a look at the example below:

An ordinal variable is a categorical variable for which the possible values are ordered. Ordinal variables can be considered “in-between” categorical and quantitative variables.

Example: Educational level might be categorized as
   1: Elementary school education
   2: High school graduate
   3: Some college
   4: College graduate
   5: Graduate degree

  • In this example (and for many other ordinal variables), the quantitative differences between the categories are uneven, even though the differences between the labels are the same (e.g., the difference between 1 and 2 is four years, whereas the difference between 2 and 3 could be anything from part of a year to several years).

Thus it does not make sense to take a mean of the values.

The insight here is: a common mistake is to treat ordinal variables like quantitative variables without thinking about whether this is appropriate in the particular situation at hand.

  • For example, the “floor effect” can produce the appearance of an interaction when using least squares regression, even when no interaction is present.

Let’s illustrate with Python why the mean for ordinal data is misleading:

import matplotlib.pyplot as plt

# Create data
data_ordinal = [3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5]

# Calculate mean
mean = sum(data_ordinal)/len(data_ordinal)

# Plot data
plt.hist(data_ordinal, bins=5, edgecolor='black')
plt.title('Ordinal Distribution')
plt.xlabel('Categories')
plt.ylabel('Frequency')
plt.show()

print(f'The mean of this distribution is {mean:.2f}')

The mean of this distribution is 4.00, which might suggest that most of the data falls in the 4th category. However, looking at the histogram, it is clear that the data is spread evenly across categories 3, 4, and 5, with no clear majority, and the distances between the category labels carry no quantitative meaning. Therefore, the mean of this distribution is misleading and does not accurately represent the data.
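
For ordinal data, category frequencies and the median are usually safer summaries than the mean; a quick sketch:

import numpy as np

data_ordinal = [3] * 10 + [4] * 10 + [5] * 10

# Frequency of each category and the median category
values, counts = np.unique(data_ordinal, return_counts=True)
print({int(v): int(c) for v, c in zip(values, counts)})  # {3: 10, 4: 10, 5: 10}
print(int(np.median(data_ordinal)))  # 4 - the middle category, with no claim about distances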

In DataGPT, we use a univariate segmentation technique to address the multi-modal distribution problem, and we treat ordinal data as categorical instead.

3. The mean is usually a wrong measure for extreme cases

In some cases, we may be interested in extreme values rather than in the mean or median.

Examples

  1. If you are deciding what capacity air conditioner you need, the average yearly (or even average summer) temperature will not give you guidance in choosing an air conditioner that will keep your house cool on the hottest days. For this purpose, it would be much more helpful to know the highest temperature you might encounter, or how many days you can expect to be above a certain temperature.
  2. Traffic safety interventions are typically aimed at high-speed situations. So the average speed is not as useful as, say, the 85th percentile of speed (see the short sketch after this list).
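
A short sketch of the percentile idea with simulated speeds (the numbers are made up for illustration):

import numpy as np

# Simulated vehicle speeds on a road segment (mph)
rng = np.random.default_rng(2)
speeds = rng.normal(45, 8, 10_000)

print(np.mean(speeds))            # average speed
print(np.percentile(speeds, 85))  # 85th percentile - more relevant for high-speed interventions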

In DataGPT, we carefully assess the distributions and select appropriate metrics.

Interpreting comparisons between two periods without comparing them to the baseline/control group

This is a very common issue when two periods are compared. Analysts often base their conclusions regarding the impact of some action on noting that some recent actions (such as marketing campaigns) yield a significant effect in the experimental condition or group, whereas the corresponding effect in the control group or in the previous period is not significant.

Based on these two separate test outcomes, analysts will sometimes suggest that the metric in the experimental group is larger than the metric in the control group or the same group in the previous period.

This type of inference is very common but incorrect. For instance, consider two variables X and Y, each measured in two different groups of 20 participants. The correlation between the two variables might be statistically significant in group A (i.e., have p ≤ 0.05), whereas a similar correlation coefficient might not be statistically significant in group B. This can happen even if the relationship between the two variables is virtually identical for the two groups, so one should not infer that one correlation is greater than the other.

A similar issue occurs when estimating the effect of marketing campaigns in two different groups (current period vs. past period): the campaign could yield a significant effect on current behavior, while in fact the increase in value for the current period is observed for all groups (not just the group of interest). One can conclude that the effect of a campaign differs from the other groups' effects only through a direct statistical comparison between the two effects. [Science Forum: Ten common statistical mistakes to watch out for when writing or reviewing a manuscript]

The insights here are:

Comparing two periods without a baseline or reference group can lead to inaccurate or misleading conclusions because there is no context for understanding the magnitude or importance of the differences between the two periods.

For example, if two groups are compared and one group shows a significant difference in a particular metric, it is important to know how that difference compares to the baseline or reference groups in order to accurately interpret the results.

Without this comparison, it is not clear whether the difference is significant, whether it represents a real change or just random variation, or whether there is a common trend in the data. By comparing the two periods to a baseline or reference group, analysts can more accurately determine the impact of a particular action or factor on the metric being studied.
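
A minimal sketch of such a direct comparison on simulated data: a difference-in-differences regression, where the interaction term directly tests whether the change in the group of interest differs from the change in the reference group (all column names and effect sizes below are made up for illustration):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1000

# Simulated per-user values: both groups drift upward between periods,
# and the treated group gets an extra lift in the current period
df = pd.DataFrame({
    "treated": np.repeat([0, 1], 2 * n),
    "current_period": np.tile([0, 1], 2 * n),
})
df["value"] = (
    10
    + 2.0 * df["current_period"]                    # common trend in both groups
    + 1.0 * df["treated"]                           # baseline difference between groups
    + 1.5 * df["treated"] * df["current_period"]    # true campaign effect
    + rng.normal(0, 3, len(df))
)

# The treated:current_period interaction is the direct test of
# "did the group of interest change more than the reference group?"
model = smf.ols("value ~ treated * current_period", df).fit()
print(model.summary().tables[1])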

In DataGPT, we run a set of statistical tests to ensure that the change we observe between two periods is statistically significant for the specific group of interest compared with the other groups.

Spurious correlations

Correlations are an important tool in science in order to assess the magnitude of an association between two variables or two groups.

Yet the use of parametric correlations, such as Pearson's, relies on a set of assumptions that are important to consider, as violating them may give rise to spurious correlations.

Spurious correlations most commonly arise if one or several outliers exist for one of the two variables.

A single value far away from the rest of the distribution can inflate the correlation coefficient.

Spurious correlations can also arise from clusters, e.g. if the data from two groups are pooled together when the two groups differ in those two variables. [Science Forum: Ten common statistical mistakes to watch out for when writing or reviewing a manuscript]

Why do the assumptions behind Pearson's correlation, and outliers in particular, matter for spurious correlations?

Parametric correlations, such as Pearson's correlation coefficient, are statistical methods that are used to measure the strength and direction of a linear relationship between two variables. These correlations rely on a set of assumptions, including that the data is normally distributed and that the relationship between the variables is linear. If these assumptions are not met, the correlation coefficient may not accurately reflect the true relationship between the variables, and may instead give rise to a spurious correlation.

Outliers, or observations that are unusually large or small compared to the rest of the data, can have a particularly significant impact on parametric correlations. If one or several outliers are present for one of the two variables, they can heavily influence the correlation coefficient and lead to a spurious correlation. This is because the outlier observations may not fit the assumptions of the parametric correlation and may not accurately represent the relationship between the two variables.
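
A small sketch of how a single outlier can inflate Pearson's correlation, and how a rank-based alternative such as Spearman's correlation is far less affected (simulated data):

import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(4)

# Two genuinely unrelated variables
x = rng.normal(0, 1, 50)
y = rng.normal(0, 1, 50)
print(pearsonr(x, y)[0])   # close to zero

# Add a single extreme observation to both variables
x_out = np.append(x, 10)
y_out = np.append(y, 10)

print(pearsonr(x_out, y_out)[0])   # Pearson's r is strongly inflated by one point
print(spearmanr(x_out, y_out)[0])  # the rank-based correlation barely moves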

Correlation does not mean causation

One common mistake in the analysis of time series data is to assume that a correlation between two variables means that one variable is causing the other. While it is true that a strong correlation between two variables can indicate a causal relationship, it is also possible that other factors may be causing the correlation, or that the correlation is simply a coincidence. In order to establish a causal relationship, it is necessary to consider additional evidence and to use experimental or quasi-experimental methods to control for other potential factors that may be influencing the relationship.

For example, suppose there is a strong correlation between the number of ice cream sales and the number of drownings in a particular region. In that case, it does not necessarily mean that eating ice cream causes drownings. There may be other factors, such as hot weather or the presence of public pools, that are causing both the increase in ice cream sales and the increase in drownings. [Spurious correlations ]

Without considering these other factors, it is not possible to determine whether the correlation between the two variables is due to a causal relationship or to some other factor.

This is especially the case when you compare time-series-like variables.

For time series, it makes more sense to evaluate the causal relationship directly instead of a correlation. One approach is to use the Granger causality test, or to train a time-series model to see whether one metric can predict another. [https://arxiv.org/pdf/2105.02675.pdf]

Here is a Python example that illustrates how to apply a Granger causality test to determine whether one time series, BJSales_Lead, helps predict another, BJSales:

import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

# Load the time series data time_series_data.csv into a pandas DataFrame (2 time series BJSales and BJSales_Lead)
df = pd.read_csv('time_series_data.csv')

# Run the granger causality test on the two time series
result = grangercausalitytests(df[['BJSales', 'BJSales_Lead']], maxlag=2, verbose=True)

# Check the p-values of the test to determine whether there is causality between the two time series
for lag, (test_results, _) in result.items():
    f_stat, p_value, _, _ = test_results['ssr_ftest']
    print(f"Lag {lag}:  p-value: {p_value:.3f}  F-statistic: {f_stat:.3f}")

This code runs the Granger causality test with a maximum lag of 2, which means it considers whether the past 2 values of BJSales_Lead contain information that helps predict BJSales. The test returns a dictionary with results for each lag. A low p-value (less than 0.05) indicates that BJSales_Lead Granger-causes BJSales, i.e., its past values significantly help predict BJSales.

The insights here are:

In order to avoid spurious correlations, it is important to carefully consider the assumptions of parametric correlations and to ensure that the data meets these assumptions or to use a different statistical method if the assumptions are not met.

If the assumptions behind a correlation coefficient are violated, one can use other methods such as regression analysis, cross-correlation for time series, or non-parametric correlations.

It is important to be cautious when interpreting correlations, especially in the analysis of time series data, and to consider other evidence and potential confounding factors in order to establish a causal relationship.

In DataGPT we have a concept of Related Metrics where we make sure that we avoid reporting spurious correlations.

Circular analysis


Circular analysis is any form of analysis that retrospectively selects features of the data to characterize the dependent variables, resulting in a distortion of the resulting statistical test (Kriegeskorte et al., 2010). Circular analysis can take many shapes and forms, but it inherently involves recycling the same data to first characterize the test variables and then to make statistical inferences from them, and is thus often referred to as 'double dipping' [Everything You Never Wanted to Know about Circular Analysis, but Were Afraid to Ask - Nikolaus Kriegeskorte, Martin A Lindquist, Thomas E Nichols, Russell A Poldrack, Edward Vul, 2010].

It is a common mistake that can lead to biased or misleading results. It occurs when you use the same data to both select the variables that you will use in your analysis and test the relationships between those variables.

Here's an example of how circular analysis might occur in the context of sales data:

Imagine that you have a large dataset of sales data, and you want to identify the factors that are most important in predicting sales. You start by selecting a few variables that you think might be related to sales, such as the price of the product, the number of advertisements that were run for the product, and the number of competitors in the market.

You then run a statistical analysis to test whether these variables are significantly related to sales. If the analysis shows that one or more of these variables are indeed related to sales, you might be tempted to conclude that these variables are important predictors of sales.

However, this conclusion might be biased or misleading because you used the same data to both select the variables that you tested and to test the relationships between those variables. This is the circular analysis problem: by selecting variables based on the data, you are potentially introducing bias into the analysis.

Another example of the same problem arises when there is a multivariate relationship in your data. Imagine you have three overlapping groups of users in the sales data for an e-commerce website: users with Android devices, users from China, and users who are non-English speakers. You see that Android users generated a lot more traffic this week. It may be the case that these users share the same properties - most users from China use Android and do not speak English - so you could misleadingly conclude that your traffic grew because of Android users rather than because of users from China.

The insights here are:

To avoid the circular analysis problem, it's important to be mindful of how you are selecting the variables that you will use in your analysis. One way to do this is to use a separate dataset to select the variables that you will test, and then use another dataset to test the relationships between those variables. This will help ensure that your results are not biased by the selection of the variables.
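
A minimal sketch of this split-then-test idea, with hypothetical column names: candidate predictors are chosen on one half of the data and tested on the held-out half:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical sales dataset with several candidate predictors (simulated here)
rng = np.random.default_rng(5)
df = pd.DataFrame(rng.normal(size=(2000, 4)),
                  columns=["price", "ads", "competitors", "sales"])

# Split once, up front
selection_set = df.sample(frac=0.5, random_state=0)
test_set = df.drop(selection_set.index)

# Step 1: choose variables on the selection set only
candidates = ["price", "ads", "competitors"]
correlations = selection_set[candidates].corrwith(selection_set["sales"]).abs()
selected = correlations.sort_values(ascending=False).head(2).index.tolist()

# Step 2: test the selected variables on data that played no role in choosing them
model = smf.ols(f"sales ~ {' + '.join(selected)}", test_set).fit()
print(model.summary().tables[1])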

Another way is to perform multivariate segmentation, which can help you detect cross-relations between different variables; this is what we do in DataGPT to avoid this problem.

Failing to correct for multiple comparisons

When analysts explore whether there is a noticeable metric increase across multiple groups, they are effectively exploring the effect of multiple conditions on multiple variables.

This practice is termed exploratory analysis, as opposed to confirmatory analysis, which by definition is more restrictive. When performed with frequentist statistics, conducting multiple comparisons during exploratory analysis can have profound consequences for the interpretation of significant findings.

In any experimental design involving more than two conditions (or a comparison of two groups), the exploratory analysis will involve multiple comparisons and will increase the probability of detecting an effect even if no such effect exists (false positive, type I error).

In this case, the larger the number of groups, the greater the number of tests that can be performed. As a result, the probability of observing a false-positive increases (family-wise error rate). For example, in a 2 × 3 × 3 experimental design the probability of finding at least one significant main or interaction effect is 30%, even when there is no effect.

The most famous illustration of the multiple comparisons issue is the study in which researchers detected seemingly significant brain activity in a dead salmon [Bennett et al., 2009].

This issue can be handled using p-value corrections during the statistical tests. P-value correction is a statistical technique that is used to adjust the p-values of multiple statistical tests in order to account for the fact that you are conducting multiple comparisons. This is necessary because when you conduct multiple statistical tests, the probability of finding at least one false-positive result (i.e., a result that appears significant due to chance alone) increases.

In the context of multiple variables, p-value correction is particularly important when you are conducting exploratory analysis, as opposed to confirmatory analysis. In an exploratory analysis, you are often exploring the relationships between multiple variables without a specific hypothesis in mind. This can lead to an increase in the number of statistical tests that you conduct, which in turn increases the probability of finding at least one false-positive result.[Science Forum: Ten common statistical mistakes to watch out for when writing or reviewing a manuscript ]

To correct this problem, you can use one of several p-value correction methods, such as the Bonferroni correction, the Holm-Bonferroni correction, or the Benjamini-Hochberg correction. These methods adjust the p-values of the statistical tests to account for the fact that you are conducting multiple comparisons, and can help to reduce the probability of finding a false-positive result.

Python example of how to perform multitest correction:

import numpy as np
from statsmodels.stats.multitest import multipletests

# Set the number of comparisons
n_comparisons = 100

# Generate some random p-values
pvals = np.random.uniform(size=n_comparisons)

# Apply the Bonferroni correction to the p-values at a 0.05 significance level
rejected, corrected_pvals, alphac_sidak, alphac_bonf = multipletests(pvals, alpha=0.05, method='bonferroni')

# Check which comparisons were significant after correction
significant_comparisons = np.where(rejected)[0]

This applies the Bonferroni correction to a series of n_comparisons p-values. The multipletests function returns a Boolean array indicating which comparisons remain significant after correction (rejected), as well as the corrected p-values (corrected_pvals).

There are many other p-value correction methods available, such as the Holm-Bonferroni correction and the Benjamini-Hochberg correction.
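
These alternatives use the same multipletests interface; a quick sketch:

import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.random.uniform(size=100)

# Holm-Bonferroni and Benjamini-Hochberg (FDR) corrections
rejected_holm, pvals_holm, _, _ = multipletests(pvals, alpha=0.05, method='holm')
rejected_bh, pvals_bh, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')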

The insight here is:

It's important to note that p-value correction should be used with caution, as it can also increase the probability of finding a false-negative result (i.e., failing to find a significant result that is actually present). It's always a good idea to consider the context of your data and the specific goals of your analysis when deciding whether and how to apply p-value correction. In DataGPT we always do a p-value correction in case of multiple testing.

Ignoring effect size when interpreting results from statistical tests

Read this joke as an intro:

Two statisticians are out hunting.

The first one fires at the deer but overshoots by 5 feet. The second one fires and undershoots the deer by 5 feet.

"Got it!" they both yell.

When using statistical tests to validate a hypothesis, analysts apply a significance threshold (normally alpha = .05) to the p-value when adjudicating statistical significance.

There is a large body of research on the arbitrariness of this threshold [Wasserstein et al., 2019], and alternatives have been proposed [Colquhoun, 2014; Lakens et al., 2018; Benjamin et al., 2018].

Misinterpreting the results of a statistical test when the outcome is not significant is extremely common. This is because a non-significant p-value does not distinguish between the lack of an effect that is objectively absent (contradictory evidence for the hypothesis) and an effect that is simply too small, or measured too imprecisely, for the test to detect.

In other words, non-significant results can mean very different things: a true null result, an underpowered genuine effect, or an ambiguous outcome [Altman and Bland, 1995].

Therefore, if the analysts wish to interpret a non-significant result as supporting evidence against the hypothesis, they need to demonstrate that this evidence is meaningful. It's important to understand that the p-value alone does not provide a complete picture of the statistical significance of a result. While the p-value tells you the probability of obtaining a result that is at least as extreme as the one you observed, if the null hypothesis is true, it does not tell you anything about the size or practical importance of the effect.

This can be a dangerous mistake because it means that you might be ignoring results that are actually important or meaningful, simply because they don't meet the threshold for statistical significance. For example, you might see a non-significant effect and conclude that there is no relationship between the variables, when in fact there is a small but meaningful relationship that was not detected due to the limitations of your study (e.g., low statistical power or small sample size).[Science Forum: Ten common statistical mistakes to watch out for when writing or reviewing a manuscript ]

To illustrate this mistake, let's look at the following example: imagine that you are conducting a study to examine the relationship between the price of a product and the number of units sold. You collect data on the price and sales of 30 different products and run a linear regression to test whether there is a significant relationship between the two variables.

The p-value of the regression is 0.1, which means that there is a 10% probability of obtaining a result at least this extreme if there is no relationship between the variables. Based on the p-value alone, you might conclude that there is no significant relationship between price and sales.

However, if you look at the effect size of the relationship, you might see a different story. For example, you might find that the Pearson correlation coefficient (r) between price and sales is 0.3. This indicates that there is a moderate positive relationship between the two variables, even though the p-value is not statistically significant.

And the same example with a code piece in Python:

import pandas as pd
from scipy.stats import pearsonr
import statsmodels.api as sm

# Load the sales data into a pandas DataFrame
df = pd.read_csv('sales_data.csv')

# Extract the price and sales variables
price = df['price']
sales = df['sales']

# Calculate the Pearson correlation coefficient between price and sales
r, _ = pearsonr(price, sales)

# Fit a linear regression model to the data
model = sm.OLS(sales, sm.add_constant(price)).fit()

# Get the p-value of the price coefficient from the OLS model
pval = model.pvalues['price']

# Print the p-value and the effect size (r)
print(f"p-value: {pval:.3f}")
print(f"r: {r:.3f}")


# Print the summary of the model
print(model.summary())

If you were to ignore this effect size and focus only on the p-value, you might make the mistake of concluding that there is no relationship between price and sales, when in fact there is a meaningful relationship that is being missed due to the limitations of your study (e.g., low statistical power or small sample size). This could lead you to make incorrect decisions about pricing or marketing strategies, which could have a significant impact on the success of your business.

In this case, it's clear that looking at the p-value alone and ignoring effect size is a very dangerous mistake because it could lead you to draw incorrect conclusions about the relationship between the variables. It's important to consider both the p-value and the effect size when interpreting the results of your statistical analyses.

The insight here is:

To avoid this mistake, it's important to consider other measures of effect size in addition to the p-value. Some common measures of effect size include Cohen's d, which is a measure of the standardized difference between two means, and r, which is the Pearson correlation coefficient. These measures can help you get a sense of the magnitude of the effect that you are seeing, even if the p-value is not statistically significant.
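
A quick sketch of computing Cohen's d for two simulated groups, alongside the t-test p-value:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(6)
group_a = rng.normal(10.0, 2.0, 40)
group_b = rng.normal(11.0, 2.0, 40)

# Cohen's d: difference between the means, standardized by the pooled standard deviation
pooled_sd = np.sqrt((np.var(group_a, ddof=1) + np.var(group_b, ddof=1)) / 2)
cohens_d = (np.mean(group_b) - np.mean(group_a)) / pooled_sd

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"p-value: {p_value:.3f}, Cohen's d: {cohens_d:.2f}")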

It's also important to consider the context of your study and the specific goals of your analysis. For example, you might be more interested in detecting large effects that have practical significance, rather than small effects that are statistically significant but not practically important.

In DataGPT we always look at the effect size when doing statistical tests.

Is data that looks normal always normally distributed?

This question would definitely be a winner in a top-10 list of data analysis mistakes. It is also my favorite question when I interview candidates for a data analyst position.

First, I assume you are familiar with the normal distribution and its properties; if not, you can find more details in the introductory sections at the beginning of the post. To understand the problem better, let's consider the following case study.

Case study: Medicare set out to improve healthcare quality by rewarding hospitals that balanced patient safety and overall experience with lower insurance claims.

The Department of Health and Human Services created a Value-Based Purchasing (VBP) Total Performance Score (TPS) to compare how well each of about 2,700 hospitals performed each year.[CMS Hospital Value-Based Purchasing Program Results for Fiscal Year 2020 | CMS ]

The score has a complex calculation that includes many heterogeneous inputs, like patient surveys, insurance claims, and readmission rates.

As an incentive, the government withholds 2% of national Medicare hospital reimbursement and redistributes that amount according to each hospital’s TPS.

Higher-scoring hospitals receive a bonus while lower-scoring ones are penalized.

The VBP system sparked a research idea by a hospital executive. He proposed to research a possible link between a hospital’s quality and its leadership style. The TPS was used as a proxy for quality, and hospital quality managers were surveyed using a Likert-type instrument.

The data was downloaded, and a questionnaire was emailed to 2,777 quality managers. Managers that worked for more than one hospital received only one questionnaire. Because prior TPS data was analyzed by qualified experts and thought to be “a fairly normal distribution centered around a score of 37, with a small number of exceptional hospitals scoring above 80,”  it was assumed the current data was similar.

A total of 150 completed questionnaires were returned, and the executive ran a multiple linear regression to look for any correlation between quality and leadership style.

Several outliers were eliminated so error residuals looked more normal, and different transformations on the leadership data were attempted.

The error residuals were then deemed normally distributed by appeal to the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). The results were disappointing, with leadership style explaining less than 2% of the variation in scores.

The associated p-values for the independent variables were far from significant. What went wrong?

In this case study, the analysts assumed that the data on hospital quality (as measured by the TPS) was normally distributed, which led them to make several mistakes in their analysis.

First, by assuming that the data were normally distributed, the researchers were able to apply the Central Limit Theorem (CLT) and the Law of Large Numbers (LLN) to justify their use of multiple linear regression. However, if the data was not actually normally distributed, these assumptions would not hold, and the results of the regression might be biased or misleading.

Second, the assumption of normality led the researchers to eliminate several outliers in the data, which could have affected the results of the analysis. Outliers can often have a significant impact on the shape of a distribution, and eliminating them can change the characteristics of the data in ways that are difficult to predict.

To correct these problems, the researchers could start by examining the distribution of the TPS data to see if it is actually normally distributed.

That can be easily done in Python:

import pandas as pd
from scipy.stats import normaltest
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data into a pandas DataFrame
df = pd.read_csv('hospital_data.csv')

# Extract the TPS variable
tps = df['tps']

# Check if the TPS data is normally distributed
stat, pval = normaltest(tps)
if pval < 0.05:
    print("The TPS data is not normally distributed")
else:
    print("The TPS data is normally distributed")

# Plot the distribution of the TPS data
sns.histplot(tps, kde=True)
plt.show()

The TPS data is not normally distributed

If it is not, they might need to evaluate a distribution type and use a different statistical method that is more appropriate for the shape of the data.

In Python, this can be done with the Fitter library:

import pandas as pd
from fitter import Fitter

# Load the data into a pandas DataFrame
df = pd.read_csv('hospital_data.csv')

# Extract the TPS variable
tps = df['tps']

# Fit a distribution to the TPS data
fitter = Fitter(tps)
fitter.fit()

# Print the name of the best-fitting distribution
print(fitter.summary())

Fitting 106 distributions: 99%|█████████▉| 105/106 [01:18<00:01, 1.48s/it]

 

              sumsquare_error         aic          bic  kl_div  ks_statistic  ks_pvalue
gamma                0.250335  651.843669  -944.306927     inf      0.026579   0.999850

For this specific data, the best-fitting distribution was Gamma, which suggests using a GLM with a Gamma family instead of ordinary linear regression.
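
A minimal sketch of fitting such a model with statsmodels; the leadership_score predictor is a hypothetical column name used only for illustration:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load the data into a pandas DataFrame
df = pd.read_csv('hospital_data.csv')

# Gamma GLM with a log link instead of ordinary least squares
# (assumes a positive, Gamma-like outcome 'tps' and a hypothetical predictor 'leadership_score')
model = smf.glm(
    "tps ~ leadership_score",
    data=df,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit()

print(model.summary())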

What else you can do:

They could also consider whether it was appropriate to eliminate the outliers in the data and consider the potential impact of these outliers on the results of the analysis. It's also important for the researchers to consider the limitations of the study and the specific goals of the analysis.

For example, they might want to consider whether the sample size of 150 is sufficient to provide reliable results, or whether other factors (such as the complexity of the TPS calculation) could be affecting the results. By taking these factors into account, the researchers can help to ensure that their analysis is more accurate and meaningful.

Conclusion

In conclusion, data analysis is an essential part of many research projects and can help to uncover valuable insights and trends. However, it is necessary to be aware of common mistakes that can lead to incorrect or misleading results.

These mistakes can include wrongly defining the metric or outcome variable, relying on the mean as the sole measure of central tendency in certain distributions, failing to consider baseline comparisons, interpreting correlations as causations, committing circular analysis, failing to correct for multiple comparisons, and ignoring effect size when interpreting statistical tests.

By understanding these common pitfalls, researchers can avoid making these mistakes and increase the reliability and validity of their results. It is also important to remember that even data that appears to be normally distributed may not always be so, and it is essential to carefully examine the distribution of the data and consider appropriate statistical methods to ensure the accuracy of the results.

Being aware of the most common distribution mistakes helps you know where to look when something seems suspicious, so you never make these mistakes yourself; alternatively, you can use a tool like DataGPT, where you do not have to worry about them at all because we handle them automatically.

The content of this blog post has also been influenced by various sources, included in the reference list below, which have informed our understanding of the topic.

References

  1. Holisticonline.com. (n.d.). Obesity: Introduction (definition, how to measure obesity, Body Mass Index (BMI)). Retrieved from Holisticonline.com
  2. U.S. Bureau of Labor Statistics. (n.d.). Frequently Asked Questions (FAQs). Retrieved from ILC Frequently Asked Questions :  U.S. Bureau of Labor Statistics
  3. BMJ. (2014, December 2). Statsminiblog: Surrogate, proxy or process? Retrieved from StatsMiniBlog: Surrogate, proxy or process? - ADC Online Blog
  4. Altman, D. G., & Bland, J. M. (1995). Absence of evidence is not evidence of absence. BMJ, 311, 485.
  5. Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533, 452–454. https://doi.org/10.1038/533452a
  6. Bennett, C. M., Wolford, G. L., & Miller, M. B. (2009). The principled control of false positives in neuroimaging. Social Cognitive and Affective Neuroscience, 4, 417–422.
  7. Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376. https://doi.org/10.1038/nrn3475
  8. Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1, 140216. https://doi.org/10.1098/rsos.140216
  9. Calin-Jageman, R. J., & Cumming, G. (2019). The new statistics for better science: Ask how much, how uncertain, and what else is known. The American Statistician, 73, 271–280. https://doi.org/10.1080/00031305.2018.1518266
  10. Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology, 5, 781.Using Bayes to get the most out of non-significant results
  11. Han, H., & Glenn, A. L. (2018). Evaluating methods of correcting for multiple comparisons implemented in SPM12 in social neuroscience fMRI studies: An example from moral psychology. Social Neuroscience, 13, 257–267. https://doi.org/10.1080/17470919.2017.1324521
  12. Holman, L., Head, M. L., Lanfear, R., & Jennions, M. D. (2015). Evidence of experimental bias in the life sciences: Why we need blind data recording. PLOS Biology, 13(7), e1002190.Evidence of Experimental Bias in the Life Sciences: Why We Need Blind Data Recording
  13. Makin, T. R., & Orban de Xivry, J. J. (2019). Ten common statistical mistakes to watch out for when writing or reviewing a manuscript. eLife, 8, e48175.Science Forum: Ten common statistical mistakes to watch out for when writing or reviewing a manuscript
  14. Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196–217. https://doi.org/10.1207/s15327957pspr0203_4
  15. Kmetz, J. L. (2019). Correcting corrupt research: Recommendations for the profession to stop misuse of p-values. The American Statistician, 73(sup1), 36–45. https://doi.org/10.1080/00031305.2018.1518271
  16. Krueger, J. I., & Heck, P. R. (2019). Putting the p-value in its place. The American Statistician, 73(sup1), 122–128. https://doi.org/10.1080/00031305.2018.1470033
  17. Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs
  18. Noble, W. S. (2009). How does multiple testing correction work? Nature Biotechnology, 27(12), 1135–1137. https://doi.org/10.1038/nbt1209-1135
  19. Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146.Causal inference in statistics: An overview
  20. Wilson, R. C., & Collins, A. G. E. (2019). Ten simple rules for the computational modeling of behavioral data. eLife, 8, e49547. Ten simple rules for the computational modeling of behavioral data
  21. The Data Nudge. (n.d.). Common mistakes in data analysis. Retrieved from Common Mistakes in Data Analysis
  22. Wright, D. (n.d.). Normal distribution problem: Common mistake. Retrieved from Normal Distribution Problems- Two Common Mistakes - Dawn Wright, Ph.D.
  23. Towards Data Science. (n.d.). Your data isn't normal. Retrieved from https://towardsdatascience.com/your-data-isnt-normal-54fe98b1f322