#6 - How to Detect Outliers Like a Detective?

In this article, we will learn different techniques for detecting outliers, which can affect the data in many ways.

Outliers - In a Nutshell

To understand what an outlier is, consider a simple example that explains the concept in a nutshell. Imagine it is summer and you are chilling with your friends. Everyone in your group is wearing normal clothes (the light clothes people usually wear in summer) except for one weirdo who is wearing a sweater. Clearly, that person is not like the rest and doesn't fit in with your group.

The explanation can be clearly understood by seeing the cover image of this article.

Credits of Cover Image - Photo by Will Myers on Unsplash

Question in your Mind

  • Wait a second. Let's say I have numerical data. How can I be a detective and catch an outlier in my data? Obviously, the data will not be like the cover image. (dang it)

  • In that case what should I do? (double dang it)

Answer to Your Question

Yes, what you are saying is absolutely logical. To detect outliers in numerical data (or any data, for that matter), we can use various statistical methods.

Definition of Outlier - Statistical term

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

There are three easy ways to detect outliers in the data.

  1. Visual Exploration
    • Scatter Plot
    • Box Plot
  2. Z-score Method
  3. IQR - (Inter Quartile Range)

Note - This article assumes that readers know the basic statistical measures and how to calculate them. If you do, you can proceed further. Otherwise, you may want to refer to this article.

Outliers can also occur because of a wrong data entry or a typo in the original dataset. Now that we know the different ways of detecting them, let's understand each with an example.

Dummy Data

Let's have data values say -

$$x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]$$

and

$$y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]$$

Necessary Imports

imports.png

1. Visual Exploration

By representing the data as a graph, one can easily spot variability or differences in the data pattern that may point to an outlier. Visual exploration is a quick first check, but it is less precise than the other two methods.

Scatter Plot

A scatter plot visualizes the data as points along the coordinate axes, scaled to the range of the values.

Code

scatter_xy.png

Output

xy_scatter.png
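
The original plotting code survives only as an image, so here is a minimal sketch of how such a scatter plot might be produced. It assumes matplotlib (the library actually used in the article cannot be recovered from the text), and `save_scatter` and the output filename are hypothetical names chosen for illustration.

```python
# Dummy data from the article.
x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]
y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]


def save_scatter(x_values, y_values, filename="xy_scatter_sketch.png"):
    """Save a scatter plot of y_values against x_values (hypothetical helper)."""
    # Assumption: matplotlib is installed. It is imported here so the data
    # above stays usable even where plotting is unavailable.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(x_values, y_values)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title("Scatter plot of x and y")
    fig.savefig(filename)
    plt.close(fig)
    return filename
```

Calling `save_scatter(x, y)` writes the plot to a file. With this data, the points (80, 23) and (13, 85) should sit far away from the main cluster.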

Box Plot

A box plot visualizes the statistical measures of the data (the median, the quartiles, and the min and max lines). Outliers are shown automatically as separate points beyond the min and max lines. This makes it more reliable than the scatter plot for detecting outliers. The explanation of the box plot is below.

box_plot_explanation.jpg

Credits - The above image is taken from the Internet.
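
To make the measures behind a box plot concrete, the sketch below computes them for `x` with Python's standard library and shows one way to draw the plot. It assumes matplotlib for the drawing part and linear interpolation for the quartiles. `five_number_summary` and `save_box_plot` are hypothetical helper names, not the article's original code.

```python
from statistics import quantiles

# Dummy data from the article.
x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]
y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) of the raw data."""
    # method="inclusive" interpolates linearly between neighbouring points.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def save_box_plot(values, title, filename):
    """Save a box plot of values (hypothetical helper, assumes matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.boxplot(values)  # by default, whiskers stop within 1.5 * IQR of the box
    ax.set_title(title)
    fig.savefig(filename)
    plt.close(fig)
    return filename


print(five_number_summary(x))  # (10, 14.5, 19.0, 23.0, 80)
```

Note that the raw maximum here is the suspicious value itself. On the plot, the max line is drawn at the largest point that is still within 1.5 times the IQR of the box, so the extreme value shows up as a separate point beyond it.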

Code - x

x_box.png

Output - x

x_box_o.png

The point (80) outside the max line is an outlier.

Code - y

y_box.png

Output - y

y_box_o.png

The point (85) outside the max line is an outlier.

Dang you (back)

2. Z-score Method

The Z-score method standardizes the data: each value is re-expressed as the number of standard deviations it lies from the mean. One important property is that the standardized values always have a mean of 0 and a standard deviation of 1. (If the original data is normally distributed, the result follows the standard normal distribution.)

Formula

Let

$$X = [x_1, x_2, x_3, \dots, x_n]$$

and

$$\mu_x \rightarrow \text{Mean}$$

$$\sigma_x \rightarrow \text{Standard Deviation}$$

then

$$z_i = \frac{x_i - \mu_x}{\sigma_x}$$

When we compute the above for every data point, we get a sequence of standardized values.

  • Any value greater than +3 is an outlier.
  • Similarly, any value less than -3 is an outlier.
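
As a rough worked example on the dummy data, take the value 80 in x. This assumes the population standard deviation (dividing by n); the figures are rounded.

$$\mu_x = \frac{335}{15} \approx 22.33, \qquad \sigma_x = \sqrt{\frac{3877.33}{15}} \approx 16.08$$

$$z = \frac{80 - 22.33}{16.08} \approx 3.59 > 3$$

So 80 is flagged as an outlier. With the sample standard deviation (dividing by n - 1), the score is about 3.47, which is still above 3.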

Code

zscore.png

The above code returns a sequence of normalized values.

outliervals.png

The above code returns the outliers from the original data by getting the indices.
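
Since the original code is only available as images, here is a minimal sketch of the same two steps using Python's standard library. It assumes the population standard deviation and a threshold of 3; `z_scores` and `zscore_outliers` are hypothetical names, not the article's original functions.

```python
from statistics import mean, pstdev

# Dummy data from the article.
x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]
y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]


def z_scores(values):
    """Return the z-score of every value (population standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values, mu)  # assumption: divide by n, not n - 1
    return [(v - mu) / sigma for v in values]


def zscore_outliers(values, threshold=3.0):
    """Return the original values whose z-score lies beyond +/- threshold."""
    scores = z_scores(values)
    # Collect the indices first, then look the values up in the original data.
    indices = [i for i, z in enumerate(scores) if abs(z) > threshold]
    return [values[i] for i in indices]


print(zscore_outliers(x))  # [80]
print(zscore_outliers(y))  # [85]
```

The threshold is a parameter because 3 is a convention rather than a law. On very small datasets, a single extreme value inflates the standard deviation, so its z-score may never reach 3.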

Testing

zscore-out.png

The above result clearly shows that 80 and 85 are the outliers of x and y, respectively.

Double dang you (back)

3. IQR - (Inter Quartile Range)

The IQR is easy to read off the box plot chart: it is the length of the box, i.e., the distance between the first and third quartiles. The formula for IQR is

$$IQR = Q_3 - Q_1$$

where

$$Q_1 \rightarrow \text{First Quartile}$$

$$Q_3 \rightarrow \text{Third Quartile}$$

and

$$Q_2 \rightarrow \text{Median}$$

Code

q1q3.png

Once we compute the IQR, we need to find the bounds, i.e., the lower bound and the upper bound. We use a constant (1.5) to get them:

$$\text{Lower Bound} = Q_1 - 1.5 \times IQR$$

$$\text{Upper Bound} = Q_3 + 1.5 \times IQR$$

  • A data point that is less than the lower bound is an outlier.
  • A data point that is greater than the upper bound is an outlier.

Code

iqr-outlier.png
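
The code above is only available as an image, so the following is a minimal sketch of the IQR method using Python's standard library. It assumes linearly interpolated quartiles and the constant 1.5; `iqr_bounds` and `iqr_outliers` are hypothetical names.

```python
from statistics import quantiles

# Dummy data from the article.
x = [10, 14, 15, 12, 18, 20, 19, 17, 22, 25, 80, 21, 25, 24, 13]
y = [13, 24, 17, 22, 26, 21, 14, 11, 10, 19, 23, 27, 11, 15, 85]


def iqr_bounds(values, k=1.5):
    """Return (lower_bound, upper_bound) computed from Q1, Q3 and the IQR."""
    # Assumption: quartiles are linearly interpolated (method="inclusive").
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


print(iqr_bounds(x))    # (1.75, 35.75)
print(iqr_outliers(x))  # [80]
print(iqr_outliers(y))  # [85]
```

Different quartile conventions give slightly different bounds, so another tool may report bounds that differ a little from these while still flagging the same points on this data.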

Testing

zscore-out.png

The above result clearly shows that 80 and 85 are the outliers of x and y, respectively.

Triple dang you (back)

Hence, the questions are answered. You have now become a detective who can detect outliers in the data.


Well, that's all for now. This article is included in the series Exploratory Data Analysis, where I share tips and tutorials helpful to get started. We will learn step-by-step how to explore and analyze the data. As a pre-condition, knowing the fundamentals of programming would be helpful. The link to this series can be found here.
