# How to Detect Outliers in a Dataset

**Detecting and handling outliers** is one of the biggest and most challenging tasks in **machine learning**.

Outliers directly affect **model accuracy**.

First, let's understand: what is an **outlier** in a **dataset**?

An **outlier** is a data point that is **distant** from all other **observations**. It is a data point that **lies outside** the overall **distribution** of the dataset.

**Now**, let's understand this with the help of an example.

In an **organization**, the **salaries** of all **employees** range **between $10k and $50k**.

So, in the salary column, all **employees' salaries** fall within the given range.

**Suppose** we have 10 **employees** in the organization, along with the list of their **salaries**.

Looking at the list of **employees' salaries**, it is clearly visible that **$150,000** is not in range: it doesn't fall between **$10k** and **$50k**. So it is an **outlier** in the **salary** column.

**Outliers** occur because of **human errors** such as wrong entries, **variability** in the data, **experimental measurement errors**, and so on. But in our case it is possible that the **CEO's salary really is $150,000**. How can you say this was a **human mistake**?

In our case there were only **10** entries, so we could easily find the **outlier** manually, just by looking. But if we have **millions** of **entries**, how will we find the **outliers** among them?

For that, there are some widely used techniques:

1. **Using scatter plots**
2. **Using the Z-score**
3. **Using the IQR (interquartile range)**
4. **Using box plots**

We will discuss all of these in detail.

# What are the impacts of having outliers in a dataset?

- They cause various problems during **statistical analysis**.
- They may have a **significant impact** on the **mean** and the **standard deviation**.
- They directly affect the **model's accuracy**.
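To make the effect on the mean and the standard deviation concrete, here is a minimal sketch. The salary figures are hypothetical (they are not the article's original list); they only assume nine salaries inside the $10k to $50k range plus one $150,000 entry. It uses Python's standard `statistics` module so it runs without extra installs.

```python
from statistics import mean, pstdev

# Hypothetical salaries in dollars, all within the 10k-50k range
salaries = [12_000, 18_000, 22_000, 25_000, 30_000, 34_000, 40_000, 45_000, 50_000]

# The same list with a single 150,000 entry added
with_outlier = salaries + [150_000]

# Compare the mean and the (population) standard deviation of both lists
print(f"without outlier: mean={mean(salaries):.0f}, std={pstdev(salaries):.0f}")
print(f"with outlier:    mean={mean(with_outlier):.0f}, std={pstdev(with_outlier):.0f}")
```

With these assumed numbers, a single extreme entry pulls the mean above most of the individual salaries and makes the standard deviation several times larger.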

# 1. Detecting outliers using a Scatter Plot

    import matplotlib.pyplot as plt

    # Sample data points
    x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 9, 8, 9, 6]
    y = [86, 86, 87, 88, 111, 86, 103, 87, 88, 81, 80, 85, 86]

    # Plot y against x and display the figure
    plt.scatter(x, y)
    plt.show()

**x** and **y** are our data points, and here we are trying to find the **outliers** in the **data using a scatter plot**.

**Here,** **three outliers** are clearly visible in the **dataset**. **These three dots lie outside the range in which the other data points vary.**

# 2. Detecting outliers using the Z-score

Formula for the Z-score: Z-score = (Observation - Mean) / Standard Deviation

**z = (X - μ) / σ**

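**A data point is treated as an outlier when it falls more than 3 standard deviations from the mean. We compute the z-score of each point and flag it when its absolute value exceeds that threshold. Some analyses use a stricter cutoff of 2 standard deviations.**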

    import numpy as np

    dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
               10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
               14, 13, 15, 10]

    def detect_outliers(data):
        outliers = []
        threshold = 3          # flag points more than 3 standard deviations away
        mean = np.mean(data)
        std = np.std(data)
        for i in data:
            z_score = (i - mean) / std
            if np.abs(z_score) > threshold:
                outliers.append(i)   # append the data point itself
        return outliers

    outlier_pt = detect_outliers(dataset)
    print(outlier_pt)

These three values, **[102, 107, 108]**, are the **outliers**.
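It can also help to see how far each flagged value sits from the threshold. The sketch below is one possible variant, not the article's original code: the function name `z_score_outliers` and its `threshold` parameter are illustrative, and it uses the standard `statistics` module (`pstdev` is the population standard deviation, which is also what `np.std` computes by default).

```python
from statistics import mean, pstdev

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

def z_score_outliers(data, threshold=3.0):
    """Return (value, z_score) pairs whose absolute z-score exceeds threshold."""
    mu = mean(data)
    sigma = pstdev(data)  # population standard deviation
    flagged = []
    for x in data:
        z = (x - mu) / sigma
        if abs(z) > threshold:
            flagged.append((x, z))
    return flagged

flagged = z_score_outliers(dataset)
for value, z in flagged:
    print(f"{value}: z = {z:.2f}")
```

On this dataset the three flagged values land only a little above 3, because the outliers themselves raise the mean and the standard deviation they are measured against.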

# 3. Detecting outliers using the Interquartile Range (IQR)

The **IQR** is the difference between the **75th** and **25th** percentile values in a dataset.

*Steps:*

1. *Arrange the data in increasing order.*
2. *Calculate the first quartile (q1) and the third quartile (q3).*
3. *Find the interquartile range: IQR = q3 - q1.*
4. *Find the lower bound: q1 - 1.5 × IQR.*
5. *Find the upper bound: q3 + 1.5 × IQR.*

Anything that **lies outside** the **lower** and **upper bounds** is an **outlier**.

    import numpy as np

    dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
               10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
               14, 13, 15, 10]

    # First and third quartiles (25th and 75th percentiles)
    quantile1, quantile3 = np.percentile(dataset, [25, 75])
    print("range between quantile1 to quantile3")
    print(quantile1, quantile3)

    # Interquartile range
    print("IQR")
    iqr_value = quantile3 - quantile1
    print(iqr_value)

    print("Find the lower bound value and the higher bound value")
    lower_bound_val = quantile1 - (1.5 * iqr_value)
    upper_bound_val = quantile3 + (1.5 * iqr_value)
    print(lower_bound_val, upper_bound_val)

**So**, data points below **7.5** or above **19.5** are considered **outliers**.
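Once the bounds are known, the last step is to pick out the values that fall outside them. The following is a minimal sketch of that filtering step. As an assumption, it swaps in the standard library's `statistics.quantiles` with `method="inclusive"` in place of `np.percentile`; for this dataset both give the same quartiles (12 and 15).

```python
from statistics import quantiles

dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

# Quartiles: quantiles(..., n=4) returns [q1, median, q3]
q1, _, q3 = quantiles(dataset, n=4, method="inclusive")
iqr = q3 - q1

# Bounds from the 1.5 * IQR rule
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values that lie outside the bounds
outliers = [x for x in dataset if x < lower_bound or x > upper_bound]

print(lower_bound, upper_bound)  # 7.5 19.5
print(outliers)                  # [102, 107, 108]
```

Note that 19 stays inside the upper bound of 19.5, so only the three large values are reported.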

# 4. Detecting the outliers using Box Plots

Draw a **box plot** of the given dataset and detect the **outliers** from it: the points drawn beyond the whiskers are the **outliers**.

    import matplotlib.pyplot as plt

    value1 = [82, 76, 24, 40, 67, 62, 75, 78, 71, 32, 98, 89, 78, 67, 72, 82, 87, 66, 56, 52]

    # boxplot expects a list of datasets, one box per dataset
    box_plot_data = [value1]
    plt.boxplot(box_plot_data)
    plt.show()
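To read the plot, it helps to know where the whiskers end. By default, matplotlib's box plot draws each whisker out to the most extreme data point within 1.5 × IQR of the box, and plots anything beyond that as an individual point. The sketch below recomputes those values by hand for `value1`. It is an illustration under that default rule, not matplotlib's own code, and the names `lower_fence`, `upper_fence`, and `fliers` are chosen here for clarity.

```python
from statistics import quantiles

value1 = [82, 76, 24, 40, 67, 62, 75, 78, 71, 32, 98, 89, 78, 67, 72, 82, 87, 66, 56, 52]

# Box edges: first and third quartiles (linear interpolation)
q1, median, q3 = quantiles(value1, n=4, method="inclusive")
iqr = q3 - q1

# Limits beyond which a point is drawn individually
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = [v for v in value1 if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Points outside the fences are the outliers shown on the plot
fliers = sorted(v for v in value1 if v < lower_fence or v > upper_fence)

print(whisker_low, whisker_high)  # 40 98
print(fliers)                     # [24, 32]
```

So for this data, the two low values 24 and 32 are the points that appear below the lower whisker.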

These are the **four** most commonly used techniques to **detect outliers** in a dataset.

**In the next tutorial, we will discuss how to handle outliers.**

**If you have any doubts about this tutorial, feel free to ask on LinkedIn:** http://linkedin.com/in/puneet166

**GitHub workspace link:** https://github.com/puneet166?tab=repositories