Data Science Project Life Cycle Part 1 (Data Analysis And Visualization)

Photo by Carlos Muza on Unsplash

This is a series on how to develop a data science project.

This is part 1.

The complete life cycle of a data science project:
1. Data Analysis and visualization.
2. Feature Engineering.
3. Feature Selection.
4. Model Building.
5. Model Deployment.

This data science project life cycle is divided into 4 parts.

Part 1: How to analyze and visualize the data.

Part 2: How to perform feature engineering on the data.

Part 3: How to perform feature selection.

Part 4: Model building and deployment.

This tutorial is for anyone who has been confused about how to develop a data science project and what the process for developing one looks like.

So if you are one of them, you are in the right place.

In this series, we are going to discuss the complete life cycle of data science projects, starting from scratch.

In this Part 1, we will see how to analyze and visualize the data.

For all of this, we will use the simple dataset from Titanic: Machine Learning from Disaster.

Using this dataset, we will see how the complete data science life cycle works.

In future tutorials we will take a more complex dataset.

Dataset link: https://www.kaggle.com/c/titanic/data

This dataset has 891 records with 11 features (columns).

So, let's begin with Part 1: analyzing and visualizing the data in the data science life cycle.

STEP 1-

Import pandas to perform manipulations on the dataset.

Import numpy to deal with mathematical operations.

Import matplotlib to visualize the dataset in the form of graphs.
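A minimal sketch of these imports (standard aliases assumed):

import pandas as pd                 # data manipulation
import numpy as np                  # mathematical operations
import matplotlib.pyplot as plt     # plotting / visualization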

STEP 2-

After loading the dataset, it looks like this:
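A sketch of loading and previewing the data; the file name train.csv and the dataframe name dataset are assumptions, adjust them to your setup:

dataset = pd.read_csv('train.csv')  # the Kaggle Titanic training file
dataset.head()                      # preview the first few rows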

There are 11 features in the dataset:

1-PassengerId

2-Pclass

3-Name

4-Sex

5-Age

6-Sibsp

7-Parch

8-Ticket

9-Fare

10-Cabin

11-Embarked

ABOUT THE DATA –

survival: Survival (0 = No, 1 = Yes)
pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
sex: Sex
Age: Age in years
sibsp: # of siblings / spouses aboard the Titanic
parch: # of parents / children aboard the Titanic
ticket: Ticket number
fare: Passenger fare
cabin: Cabin number
embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

STEP 3 - After importing the dataset, it is time to analyze the data.

First, we will check for null values in the dataset.

Here we are checking how many features contain null values; the list of features with missing values is shown in the output below.
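A minimal sketch of that check, assuming the dataframe is named dataset as above:

dataset.isnull().sum()              # number of null values in each column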

Output

STEP 4-

In the 1st line of code, we extract only those features that contain one or more null values, along with their null counts.

In the 2nd line of code, we print those features.

In the 3rd line of code, we extract only the names of the features that contain one or more null values, using a list comprehension.

In the 4th line of code, we print all of those features. A sketch of these four lines follows.
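A sketch of those four lines; the variable names features_na and features_with_na are my own:

features_na = dataset.isnull().sum()[dataset.isnull().sum() > 0]   # 1st line: features with 1+ nulls and their counts
print(features_na)                                                  # 2nd line: print them
features_with_na = [col for col in dataset.columns if dataset[col].isnull().sum() > 0]   # 3rd line: names only, via a list comprehension
print(features_with_na)                                             # 4th line: print those feature names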

Output

We will deal with the null values in the feature engineering part (Part 2).

STEP 5-

Check the shape of the dataset.
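For example:

dataset.shape                       # (number of rows, number of columns)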

Output

STEP 6 - Check the unique values in the PassengerId, Name, and Ticket columns.
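A sketch of the check:

dataset[['PassengerId', 'Name', 'Ticket']].nunique()   # count of unique values in each of the three columns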

Output

There are 891 unique values in the PassengerId and Name columns and 681 unique values in the Ticket column.

So these three columns will not have any useful effect on our prediction: their values are (almost) all unique, so we drop them from the dataset.

STEP 7- Drop these three columns
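A sketch of the drop (axis=1 means we drop columns, not rows):

dataset = dataset.drop(['PassengerId', 'Name', 'Ticket'], axis=1)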

STEP 8-
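For example:

dataset.shape                       # expected (891, 9) after dropping the three columns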

Output

Check the shape of the dataframe once again: it is now 891 x 9.

After dropping the three features, 9 columns (features) remain in the dataset.

STEP 9 - In this step, we will check whether the dataset is balanced or not.

If the dataset is not balanced, it might be a reason for bad accuracy: if most of the data points belong to a single class, the model can end up overfitting to that class.
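A sketch of the balance check on the target column:

dataset['Survived'].value_counts()                 # counts of class 0 (not survived) and class 1 (survived)
dataset['Survived'].value_counts(normalize=True)   # the same as proportions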

Output

Here the dataset is not perfectly balanced; the ratio is roughly 60-40.

So it should not affect our predictions much. If it does affect the predictions, we will deal with it later.

STEP 10 -

1st line of code — Copy the dataset into a new variable (datasett). We will perform some manipulations on the data, and copying ensures that those manipulations do not affect the real data.

2nd line of code - Split the data by the dependent feature ("Survived") into two parts:

1. Extract all the data points of passengers who did not survive from the "Survived" column and store them in the not_survived variable.

2. Extract all the data points of passengers who survived and store them in the survived variable.

This is done to make the dataset easier to visualize; a sketch follows.
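A sketch of those lines; the copy name datasett and the subset names survived / not_survived follow the description above:

datasett = dataset.copy()                            # work on a copy so the original dataframe stays intact
not_survived = datasett[datasett['Survived'] == 0]   # passengers who did not survive
survived = datasett[datasett['Survived'] == 1]       # passengers who survived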

STEP 11 - Now we check how the features that contain null values relate to the survived and not-survived passengers.

First of all, we check the null-value features' dependency for the survived passengers only.

1st line of code - Import matplotlib for visualization.

2nd line of code - Copy only the survived passengers into a working dataframe for further processing; we copy it so that the manipulation does not affect the real dataframe.

3rd line of code - A loop that goes over only the features that contain null values.

4th line of code — Replace null values with 1 and non-null values with 0, in order to check the dependency. A sketch of the whole block follows.
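A sketch of this loop, assuming the null-value features are Age, Cabin and Embarked and using the survived subset from Step 10:

data = survived.copy()                                            # survivors only
for feature in ['Age', 'Cabin', 'Embarked']:                      # the features that contain null values
    indicator = np.where(data[feature].isnull(), 1, 0)            # 1 = null, 0 = not null
    pd.Series(indicator).value_counts().sort_index().plot.bar()   # how many survivors have / are missing this value
    plt.title(feature)
    plt.xlabel('0 = not null, 1 = null')
    plt.show()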

Graph 1: Age
Graph 2: Cabin
Graph 3: Embarked

Look at the graphs above: a value of 1 represents a null value and 0 represents a non-null value.

In Age's graph, the dependency on null values is low.

In Cabin's graph, the dependency on null values is high.

In Embarked's graph, there is approximately no dependency on null values.

STEP 12 - Here we check the same dependency for the passengers who did not survive.
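The same sketch as in Step 11, but on the not_survived subset:

data = not_survived.copy()                                        # non-survivors only
for feature in ['Age', 'Cabin', 'Embarked']:
    indicator = np.where(data[feature].isnull(), 1, 0)            # 1 = null, 0 = not null
    pd.Series(indicator).value_counts().sort_index().plot.bar()
    plt.title(feature)
    plt.xlabel('0 = not null, 1 = null')
    plt.show()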

Graph 1: Age
Graph 2: Cabin
Graph 3: Embarked

Same as Step 11, but here we are dealing with the passengers who did not survive.

STEP 13 - Let's find the relationship between all the categorical variables (features) and the survived passengers.

Survived is the dependent variable (feature).

Fare is a numeric variable.

Age is also a numeric variable.

So we are not considering these three features (columns) here; a sketch of the categorical plots follows.
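One possible sketch, plotting the category counts of the remaining categorical features for the survivors; the exact feature list is an assumption:

categorical_features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']
for feature in categorical_features:
    survived[feature].value_counts().plot.bar()      # category counts among survivors
    plt.title(feature + ' (survived)')
    plt.show()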

These are the relationships between the categorical variables (features) and the survived passengers.

STEP 14 - Let's find the relationship between all the categorical variables and the passengers who did not survive.
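The same sketch for the non-survivors, reusing categorical_features from Step 13:

for feature in categorical_features:
    not_survived[feature].value_counts().plot.bar()  # category counts among non-survivors
    plt.title(feature + ' (not survived)')
    plt.show()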

These are the relationships between the categorical variables (features) and the passengers who did not survive.

STEP 15 - Now let's check the numeric variables.
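One way to compare the numeric variables of the two groups is a scatter plot of Age against Fare; this is only a sketch, and the original figures may have been produced differently:

plt.scatter(survived['Age'], survived['Fare'], alpha=0.5, label='survived')
plt.scatter(not_survived['Age'], not_survived['Fare'], alpha=0.5, label='not survived')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.legend()
plt.show()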

From this, it clearly shows:

Those who were above 31 years old had very few chances of surviving.

Those whose fare was less than 48 had very few chances of surviving.

STEP 16 - Analyze the numeric features: check the distribution of the numeric variables (features).
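A sketch of the distribution check using histograms:

dataset['Age'].hist(bins=30)        # roughly bell-shaped
plt.title('Age distribution')
plt.show()

dataset['Fare'].hist(bins=30)       # heavily right-skewed
plt.title('Fare distribution')
plt.show()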

Age approximately follows a normal distribution, but Fare does not follow a normal distribution.

STEP 17 - Check for outliers in the dataset.

There are many ways to check for outliers:

1- Box plot.

2- Z-score.

3- Scatter plot.

etc.

First of all, let's find the outliers using a scatter plot; a sketch follows.
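A sketch of the scatter check, plotting each record's value against its index so extreme points stand out:

for feature in ['Age', 'Fare']:
    plt.scatter(dataset.index, dataset[feature])
    plt.title(feature)
    plt.xlabel('record index')
    plt.ylabel(feature)
    plt.show()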

Scatter plots of Age and Fare

Age has a small number of outliers, but in Fare you can see there are a lot of outliers.

STEP 18 - Find outliers using a box plot.

A box plot is a better way to check for outliers in a distribution; a sketch follows.
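A sketch of the box plots (Age contains null values, so they are dropped first):

plt.boxplot(dataset['Age'].dropna())
plt.title('Box plot of Age')
plt.show()

plt.boxplot(dataset['Fare'].dropna())
plt.title('Box plot of Fare')
plt.show()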

Box plot of Age
Box plot of Fare

The box plots show the same picture as the scatter plots.

Age has a few outliers, but Fare has a lot of outliers.

The first part ends here.

In this part we saw how to analyze and visualize a dataset.

In the next part we will see how to perform feature engineering.

