One of the most common questions I get while mentoring GB/BB projects and training LSS belts is: “What should I do when I have non-normal data?” Normally distributed data takes centre stage in statistics. A large number of statistical tests are based on the assumption of normality, which instills a lot of fear in project leaders when their data is not normally distributed.
Do read about what the Normal Distribution and probability distributions are before you go on.
A few years ago, some statisticians held the belief that when a process produced non-normally distributed data, there was something wrong with the process or that it was ‘out of control’. In their view, the purpose of a control chart was to determine when processes were non-normal so they could be “corrected” and returned to normality. Fortunately, most statisticians and LSS practitioners today do not adhere to this belief. We recognise that there is nothing wrong with non-normal data, and that the preference for normally distributed data in statistics is due to its simplicity and nothing more.
Many processes naturally follow a non-normal distribution, often a specific, well-known type. Cycle time, calls per hour, customer waiting time and shrinkage are a few examples.
Types of Non Normal Data Distributions
There are many types of non-normal distributions that a data set can follow, based on the nature of the process, the data collection methodology used, the sample size, outliers in the data and so on. A few of the major non-normal distributions are listed below (a small sampling sketch follows the list):
- Beta Distribution
- Exponential Distribution
- Gamma Distribution
- Inverse Gamma Distribution
- Log Normal Distribution
- Logistic Distribution
- Maxwell-Boltzmann Distribution
- Poisson Distribution
- Skewed Distribution
- Symmetric Distribution
- Uniform Distribution
- Unimodal Distribution
- Weibull Distribution
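To get a feel for a few of these shapes, here is a minimal sketch in Python with SciPy and Matplotlib (tools the article itself does not use; all parameter values are arbitrary illustrations) that draws and plots samples from some of the distributions above.

```python
# Illustrative only: sample a few of the non-normal distributions listed above
# and plot their histograms. Parameter values are arbitrary assumptions.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
samples = {
    "Exponential": stats.expon.rvs(scale=2.0, size=500, random_state=rng),
    "Lognormal": stats.lognorm.rvs(s=0.6, scale=np.exp(1.0), size=500, random_state=rng),
    "Weibull": stats.weibull_min.rvs(c=1.5, scale=3.0, size=500, random_state=rng),
    "Poisson": stats.poisson.rvs(mu=4, size=500, random_state=rng),
    "Uniform": stats.uniform.rvs(loc=0, scale=10, size=500, random_state=rng),
}

fig, axes = plt.subplots(1, len(samples), figsize=(15, 3))
for ax, (name, data) in zip(axes, samples.items()):
    ax.hist(data, bins=30)  # histogram shows the characteristic shape
    ax.set_title(name)
plt.tight_layout()
plt.show()
```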
Reasons for Non Normal Data Distribution
Before you decide which distribution your data follows, ensure that your data is free of measurement system variation and does not have any data stability issues. Please read my previous posts on What variation is and what Measurement System Variation is.
Many processes or data sets naturally fit a non-normal distribution. For example, the number of accidents tends to fit a Poisson distribution, and product lifetimes usually fit a Weibull distribution. However, there may be times when your data is supposed to fit a normal distribution but does not. For example, the time taken to travel from home to the office is usually expected to follow a normal distribution. If you see a non-normal distribution for such data sets, check for the reasons below and correct them if needed.
- Outliers / Extreme values: Outliers can skew your distribution. The central tendency of your data set (the mean) is especially sensitive to outliers and may result in a non-normal distribution. Identify the outliers, which may be extremely high or extremely low values in the data set or special causes in the process, remove them, and then check for normality again. It is important to confirm that outliers are truly special causes before eliminating them: normally distributed data is expected to contain a small percentage of extreme values, and not every outlier has a special cause. Extreme values should be removed from the data only if there are more of them than expected under normal conditions. (A screening sketch follows this list.)
- Subgroups / Overlap of two or more processes: A data set that combines data from two or more processes can also lead to a non-normal distribution. If you take two data sets that each follow a normal distribution and merge them into one, the result will follow a bimodal distribution. The remedial action is to determine the reasons for the bimodal or multimodal distribution and then stratify the data. Ensure that your data set is coherent and is not a mixture of multiple subgroups.
- Insufficient data discrimination: Round-off errors or measurement devices with poor resolution/precision can make truly continuous and normally distributed data look discrete and non-normal. Use a more accurate measurement system, or collect more data points, to overcome insufficient data discrimination or an insufficient number of distinct values.
- Smaller sample size: A small sample can make normally distributed data look scattered. For example, if you look at the distribution of the heights of 50 students in a particular class, you will see that it follows a normal distribution. However, if you randomly choose just three students from the same class, their heights may appear to follow a uniform distribution, or a skewed one, depending on which students you chose. Increasing your sample size usually resolves this issue.
- Values close to process boundaries: If a process has many values close to zero or close to a natural process boundary, the data distribution will skew to the right or left. In this case, a transformation such as the Box-Cox power transformation may help make the data normal. When we compare transformed data, we should transform everything under comparison in the same way.
- Sorted data: Data collected from a normally distributed process can also fit a non-normal distribution if it represents just a sample / subset of the total output of the process. This happens when we sort the data before analysing it. Suppose there is a ring manufacturing process where the target is to produce rings with a diameter of 10 cm, with a USL of 10.25 cm and an LSL of 9.75 cm. If ring diameter data were collected from such a process and all values outside the specification limits were removed, the remaining data would show a non-normal (roughly uniform) distribution, even though the data as originally collected was normally distributed.
- Data follows a different distribution: In addition to the above-mentioned reasons, where normally distributed process data can appear non-normal, there are many data types that follow a non-normal distribution by nature. In such cases, you should analyse the data using tests that do not assume normality.
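As a rough illustration of the first two checks above, here is a hedged sketch in Python with SciPy (the article itself works in Minitab; the data, the thresholds and the 1.5 x IQR rule are illustrative assumptions, not fixed prescriptions).

```python
# A sketch of two screening steps: a normality test and an IQR-based
# outlier screen. The data array is made up for demonstration.
import numpy as np
from scipy import stats

data = np.array([9.8, 10.1, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0, 14.5])

# 1. Normality test (Shapiro-Wilk); p < 0.05 suggests non-normality.
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk p-value: {p:.4f}")

# 2. IQR screen: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
flags = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("Flagged as potential outliers:", data[flags])

# As the article stresses, flagged points should only be removed after
# confirming a special cause; re-test normality after any removal.
```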
How to deal with Non Normal Distribution
Once you have ensured that your data is non-normal due to the nature of the data / process itself, and not due to any of the above-mentioned reasons, you can proceed with analysing it. There are two ways to analyse non-normal data: either use non-parametric tests, which do not assume normality, or transform the data using an appropriate function to make it fit a normal distribution.
Several tests, such as the t-test, ANOVA, regression and DOE, assume normality. Strictly, we should use such tests only for normally distributed data; however, these tests are fairly robust to departures from normality, so you may still be able to run them, with caution, if your sample size is large enough.
If you have a very small sample, a sample that is skewed, or one that naturally fits another distribution type, you should run a non-parametric test. A non-parametric test is one that does not assume that the data fits any specific distribution type. Non-parametric tests include the Wilcoxon test, the Mann-Whitney test, Mood's median test and the Kruskal-Wallis test. Below is a list of tests that assume normality and their non-parametric equivalents:
- 1-sample t-test → 1-sample Wilcoxon test / 1-sample sign test
- 2-sample t-test → Mann-Whitney test
- Paired t-test → Wilcoxon signed-rank test
- One-way ANOVA → Kruskal-Wallis test / Mood's median test
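For reference, the non-parametric tests above are all available in SciPy. The sketch below is a minimal illustration (made-up data, Python rather than the Minitab workflow the article uses) of how each might be called.

```python
# Illustrative calls to the non-parametric tests named above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
group_a = rng.exponential(scale=5.0, size=30)  # e.g. cycle times, line A
group_b = rng.exponential(scale=6.5, size=30)  # e.g. cycle times, line B
group_c = rng.exponential(scale=5.5, size=30)

# Mann-Whitney U: two independent samples (non-parametric 2-sample t-test).
print("Mann-Whitney:", stats.mannwhitneyu(group_a, group_b, alternative="two-sided"))

# Wilcoxon signed-rank: paired samples (non-parametric paired t-test).
print("Wilcoxon:", stats.wilcoxon(group_a, group_b))

# Kruskal-Wallis: three or more independent samples (non-parametric one-way ANOVA).
print("Kruskal-Wallis:", stats.kruskal(group_a, group_b, group_c))

# Mood's median test: compares medians across groups.
stat, p, grand_median, table = stats.median_test(group_a, group_b, group_c)
print(f"Mood's median test p-value: {p:.4f}")
```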
Generally, there are two parts to statistically analysing project data. The first involves tests to check whether the data is stable and to calculate the process capability / sigma level: in the Measure phase using pre-improvement project Y data, and in the Control phase using post-improvement project Y data. The second involves hypothesis testing in the Analyse phase and control charts in the Control phase. The equivalent non-parametric tests mentioned above are suitable for hypothesis testing.
Let us also look at how to calculate process capability and sigma level when the data is non-normal.
Process Capability for Non Normal Data
We perform ‘Capability Analysis > Normal’ in Minitab to calculate process sigma when we have normally distributed data. This capability analysis test assumes that the data is normal and accordingly calculates the Process Sigma (short term) and Cpk values.
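As a rough illustration of what that analysis computes, here is a hedged Python sketch of the textbook Cp/Cpk formulas for normal data. The data and specification limits are assumed, and note that Minitab's short-term sigma uses a within-subgroup estimate rather than the overall standard deviation used here.

```python
# Textbook Cp/Cpk for normally distributed data (illustrative only).
import numpy as np

rng = np.random.default_rng(seed=7)
data = rng.normal(loc=10.0, scale=0.07, size=200)  # e.g. ring diameters in cm
lsl, usl = 9.75, 10.25                             # assumed spec limits

mean = data.mean()
sigma = data.std(ddof=1)  # overall sigma; Minitab's short-term value
                          # comes from a within-subgroup estimate

cp = (usl - lsl) / (6 * sigma)
cpk = min(usl - mean, mean - lsl) / (3 * sigma)
print(f"Cp = {cp:.2f}, Cpk = {cpk:.2f}")
```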
However, we cannot run the same test when the data is non-normal. The alternative is ‘Capability Analysis > Non-Normal’. The figure below shows the path for this test.
One of the prerequisites for this test is knowing the exact distribution that the data follows, as evident from the figure below.
Hence, the first task is to identify the distribution of the data by using ‘Individual Distribution Identification’ test in Minitab.
Choose the column with your non-normal data set in the dialogue box and run the test.
The output of this test is multiple probability plots with p-values for each distribution that it tests.
The distribution with the highest p-value is the best-fit distribution for the data. We should ignore the p-values for the transformations (Box-Cox and Johnson) while identifying the best fit. In our case, the p-value for lognormal is the highest, so the best-fit distribution for our data set is the lognormal distribution. Select this distribution in the dialogue box of the capability analysis test to calculate process sigma.
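For readers without Minitab, here is a rough Python/SciPy analogue of this step: fit several candidate distributions and compare goodness-of-fit. It is a sketch under stated assumptions, not a reproduction of Minitab's method, which uses Anderson-Darling statistics with adjusted tables.

```python
# Fit candidate distributions and compare goodness-of-fit (illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
data = rng.lognormal(mean=1.0, sigma=0.5, size=200)  # made-up non-normal data

candidates = ["norm", "lognorm", "weibull_min", "gamma", "expon"]

for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)  # maximum-likelihood fit
    # Kolmogorov-Smirnov test against the fitted distribution. Caveat: the
    # p-value is optimistic when the parameters were estimated from the
    # same data, so treat it as a rough ranking only.
    ks_stat, p = stats.kstest(data, name, args=params)
    print(f"{name:12s} KS p-value = {p:.3f}")

# As in the article, the candidate with the highest p-value (ideally above
# 0.05) is taken as the best-fit distribution.
```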
Re-run the capability analysis test for non-normal data, this time selecting the lognormal distribution as the best fit.
Once you run this test, you will get the process capability for this data set.
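For context, non-normal capability analysis is commonly based on the percentile (ISO) method: the mean ± 3 sigma span of the normal case is replaced by the 0.135th and 99.865th percentiles of the fitted distribution. Below is a hedged sketch of that method, with assumed data and specification limits.

```python
# Percentile-method capability for non-normal (here, lognormal) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=11)
data = rng.lognormal(mean=1.0, sigma=0.4, size=200)
lsl, usl = 0.5, 9.0  # illustrative spec limits

shape, loc, scale = stats.lognorm.fit(data, floc=0)  # fix loc at 0, a common choice
dist = stats.lognorm(shape, loc=loc, scale=scale)

p_low = dist.ppf(0.00135)    # stand-in for mean - 3*sigma
p_med = dist.ppf(0.5)        # the distribution median
p_high = dist.ppf(0.99865)   # stand-in for mean + 3*sigma

pp = (usl - lsl) / (p_high - p_low)
ppk = min((usl - p_med) / (p_high - p_med), (p_med - lsl) / (p_med - p_low))
print(f"Pp = {pp:.2f}, Ppk = {ppk:.2f}")
```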
There are instances where no distribution has a p-value above 0.05, meaning the data set does not follow any of the distributions that the test checks. In such scenarios, one of the preferred remedial actions is to transform the non-normal data into normal data using a data transformation method. The Box-Cox power transformation and the Johnson transformation are the most commonly preferred methods. More about data transformation using these methods in the next article.
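As a brief preview of that article, here is a minimal sketch of the Box-Cox route in Python with SciPy. Note that SciPy's related yeojohnson transformation handles zeros and negatives, but it is not the same as Minitab's Johnson system transformation.

```python
# Box-Cox transformation of skewed, strictly positive data (illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)
data = rng.lognormal(mean=1.0, sigma=0.6, size=200)  # skewed, positive data

transformed, lam = stats.boxcox(data)  # lambda chosen by maximum likelihood
print(f"Box-Cox lambda: {lam:.3f}")

_, p_before = stats.shapiro(data)
_, p_after = stats.shapiro(transformed)
print(f"Shapiro-Wilk p before: {p_before:.4f}")
print(f"Shapiro-Wilk p after:  {p_after:.4f}")

# Remember: transform everything under comparison (data AND spec limits)
# in the same way before interpreting results.
```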