Observed variables often contain outliers that have unusually large or small values when compared with others in a data set.
Outliers can arise from incorrect measurements, including data entry errors, or from observations that come from a different population than the rest of the data. If the measurement is correct, the outlier represents a rare event. There are two reasons to study outliers. The first reason is to find outliers that violate the assumptions of a statistical test (for example, outliers that conflict with the normal distribution assumption in an ANOVA test) and to deal with them properly in order to improve the statistical analysis.
This can be considered a preliminary step in data analysis. The second reason is to use the outliers themselves to obtain critical information about the data. There are two kinds of outlier detection methods: formal tests and informal tests, which are usually called tests of discordancy and outlier labeling methods, respectively. Most formal tests need a test statistic for hypothesis testing. They usually assume some well-behaved distribution and test whether the target extreme value is an outlier of that distribution.
The choice among these tests depends mainly on the number and type of target outliers and on the type of data distribution. Hoaglin, Iglewicz, and Tukey (1986) reviewed and compared, through simulations, five selected formal tests that are applicable to the normal distribution: the Generalized ESD, kurtosis statistics, Shapiro-Wilk, the boxplot rule, and the Dixon test. Even though formal tests are quite powerful under well-behaved statistical assumptions such as a distributional assumption, the distributions of most real-world data may be unknown or may not follow a specific distribution such as the normal, gamma, or exponential.
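To make the structure of a formal test concrete, the following is a minimal sketch of the simplest ratio form of the Dixon test mentioned above: a test statistic is computed for the most extreme observation and compared with a critical value derived under the normality assumption. The function name `dixon_q`, the sample data, and the critical value are illustrative assumptions; in particular, the critical value should be taken from a published table for the sample size and significance level in use.

```python
def dixon_q(values):
    """Return (Q, suspect): the gap-to-range ratio for the most isolated extreme."""
    xs = sorted(values)
    if len(xs) < 3:
        raise ValueError("need at least three observations")
    data_range = xs[-1] - xs[0]
    if data_range == 0:
        raise ValueError("all observations are identical")
    low_gap = xs[1] - xs[0]      # distance from the minimum to its neighbour
    high_gap = xs[-1] - xs[-2]   # distance from the maximum to its neighbour
    if high_gap >= low_gap:
        return high_gap / data_range, xs[-1]
    return low_gap / data_range, xs[0]


sample = [10.1, 10.3, 9.9, 10.2, 10.0, 14.0]  # hypothetical measurements
q, suspect = dixon_q(sample)

# Assumed critical value: a commonly tabulated two-sided 95% value for n = 6.
# Verify against a published table before relying on it.
critical = 0.625

print(f"Q = {q:.3f} for suspect value {suspect}")  # Q = 0.902 for suspect value 14.0
print("reject H0 (value is discordant)" if q > critical else "do not reject H0")
```

The null hypothesis here is that the suspect value comes from the same normal population as the rest of the sample, which is exactly the kind of distributional assumption that may not hold for real-world data.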
On the other hand, most outlier labeling methods (informal tests) generate an interval or criterion for outlier detection instead of testing a hypothesis, and any observation beyond the interval or criterion is considered an outlier. Each labeling method typically uses location and scale parameters to define a reasonable interval or criterion for outlier detection. There are two reasons for using an outlier labeling method. One is to find possible outliers as a screening device before conducting a formal test. The other is to find extreme values far from the majority of the data regardless of the distribution. While formal tests usually require a test statistic based on distributional assumptions and a hypothesis to determine whether the target extreme value is a true outlier of the distribution, most outlier labeling methods present an interval built from the location and scale parameters of the data. Although a labeling method is usually simple to use, some observations outside the interval may turn out to be falsely identified outliers after a formal test when outliers are defined only as observations that deviate from the assumed distribution.
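As an illustration of a labeling interval built from a location and a scale parameter, the sketch below uses the sample mean and the sample standard deviation. The function names, the multiplier `k`, and the data are assumptions made for this example and are not prescribed by the text.

```python
from statistics import mean, stdev


def sd_interval(values, k=2.0):
    """Return (lower, upper) = mean -/+ k * sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return m - k * s, m + k * s


def label_outliers(values, k=2.0):
    """Return the observations that fall outside the mean +/- k*SD interval."""
    lower, upper = sd_interval(values, k)
    return [x for x in values if x < lower or x > upper]


data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 40]  # hypothetical sample
print(label_outliers(data, k=2.0))  # [40]
print(label_outliers(data, k=3.0))  # []
```

With `k=3.0` the value 40 is no longer labeled, because the extreme value itself inflates the standard deviation used to build the interval. This shows how sensitive the result is to the choice of location and scale parameters.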
For a large data set that is statistically problematic, for example when it is difficult to identify the distribution of the data or to transform it into a proper distribution such as the normal distribution, labeling methods can be used to detect outliers. Outlier labeling methods such as the Standard Deviation (SD) method and the boxplot are commonly used and easy to apply. These methods are quite reasonable when the data distribution is symmetric and mound-shaped, such as the normal distribution. The boxplot, which was developed by Tukey (1993), is another very helpful method since it makes no distributional assumptions and does not depend on a mean or standard deviation. The lower quartile (q1) is the 25th percentile, and the upper quartile (q3) is the 75th percentile of the data.
The inter-quartile range (IQR) is defined as the interval between q1 and q3. Tukey defined q1 - (1.5 * IQR) and q3 + (1.5 * IQR) as “inner fences”, q1 - (3 * IQR) and q3 + (3 * IQR) as “outer fences”, the observations between an inner fence and its nearby outer fence as “outside”, and anything beyond the outer fences as “far out”. High renamed the “outside” observations potential outliers and the “far out” observations problematic outliers. The “outside” and “far out” observations can also be called possible outliers and probable outliers, respectively.
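The fence definitions above can be sketched in a few lines. This is a minimal example under stated assumptions: the function names and data are hypothetical, the quartiles use the "inclusive" interpolation rule of Python's `statistics.quantiles` (several quartile conventions exist and can give slightly different fences), and a value lying exactly on a fence is treated as not beyond it.

```python
from statistics import quantiles


def tukey_fences(values):
    """Return the inner and outer fences as (lower, upper) pairs."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer


def label_by_fences(values):
    """Split observations into 'outside' (possible) and 'far_out' (probable) outliers."""
    (in_lo, in_hi), (out_lo, out_hi) = tukey_fences(values)
    labels = {"outside": [], "far_out": []}
    for x in values:
        if x < out_lo or x > out_hi:
            labels["far_out"].append(x)   # beyond an outer fence
        elif x < in_lo or x > in_hi:
            labels["outside"].append(x)   # between an inner and an outer fence
    return labels


data = [2, 4, 4, 5, 6, 7, 8, 9, 18, 30]  # hypothetical sample
print(tukey_fences(data))     # ((-2.5, 15.5), (-9.25, 22.25))
print(label_by_fences(data))  # {'outside': [18], 'far_out': [30]}
```

For this sample q1 = 4.25 and q3 = 8.75, so the IQR is 4.5; the value 18 lies between the upper inner fence (15.5) and the upper outer fence (22.25), while 30 lies beyond the outer fence.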
Tukey’s method is quite effective, especially when working with large continuous data sets that are not highly skewed or that are fairly normally distributed. Many distributions of real-world data, however, do not follow a normal distribution.
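The effect of skewness on Tukey's inner fences can be illustrated by computing, from population quantiles, the share of a distribution that falls beyond the fences. The sketch below compares a standard normal distribution with a unit-rate exponential distribution. The choice of these two distributions and the function names are assumptions made for illustration only.

```python
import math
from statistics import NormalDist


def normal_share_beyond_inner_fences(k=1.5):
    """Probability mass of a standard normal beyond q1 - k*IQR and q3 + k*IQR."""
    z = NormalDist()
    q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return z.cdf(lower) + (1.0 - z.cdf(upper))


def exponential_share_beyond_inner_fences(k=1.5):
    """Same quantity for an exponential distribution with rate 1."""
    q1 = -math.log(1.0 - 0.25)   # quantile function of Exp(1) is -ln(1 - p)
    q3 = -math.log(1.0 - 0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    below = 1.0 - math.exp(-lower) if lower > 0 else 0.0  # no mass below zero
    above = math.exp(-upper)
    return below + above


print(f"normal:      {normal_share_beyond_inner_fences():.4f}")       # 0.0070
print(f"exponential: {exponential_share_beyond_inner_fences():.4f}")  # 0.0481
```

Under these assumptions roughly 0.7% of normally distributed values fall beyond the inner fences, compared with about 4.8% of exponentially distributed values, all of them in the long right tail. For skewed data, many legitimate observations would therefore be labeled as possible outliers.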