# Data Science Project Life Cycle Part 1 (Data Analysis And Visualization)

This is a series on how to develop a data science project.

This is Part 1.

The stages in a data science project life cycle:
1. Data analysis and visualization.
2. Feature engineering.
3. Feature selection.
4. Model building.
5. Model deployment.

This whole data science project life cycle is divided into four parts.

Part 1: How to analyze and visualize the data.

Part 2: How to perform feature engineering on the data.

Part 3: How to perform feature selection.

Part 4: Model building and deployment.

This tutorial is for anyone who has been confused about how to develop data science projects and what the process looks like.

If you are one of them, you are in the right place.

In this series we will discuss the complete life cycle of a data science project, from scratch.

In Part 1, we will see how to analyze and visualize the data.

For all of this, we will use a simple dataset: Titanic: Machine Learning from Disaster.

Using this dataset, we will see how the complete data science life cycle works.

In a future tutorial we will take a more complex dataset.

This dataset has 891 records with 11 features (columns).

So, let's begin with Part 1: analyzing and visualizing the data in the data science life cycle.

STEP 1-

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

Import pandas to perform manipulations on the dataset.

Import numpy to handle mathematical operations.

Import matplotlib to visualize the dataset as graphs.

STEP 2-

```python
data = pd.read_csv('path.csv')
data.head(5)
```

The dataset looks like this:

There are 11 features in the dataset:

1-PassengerId

2-Pclass

3-Name

4-Sex

5-Age

6-SibSp

7-Parch

8-Ticket

9-Fare

10-Cabin

11-Embarked

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

# Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

STEP 3 - After importing the dataset, it is time to analyze the data.

First, we check for null values in the dataset.

```python
null_count = data.isnull().sum()
null_count
```

This code counts how many null values each feature contains.

The output lists the null count for each feature.

STEP 4-

```python
null_features = null_count[null_count > 0].sort_values(ascending=False)
print(null_features)
null_features = [features for features in data.columns if data[features].isnull().sum() > 0]
print(null_features)
```

Line 1 extracts only the features that contain one or more null values, along with their null counts.

Line 2 prints those features.

Line 3 extracts the same features using a list comprehension.

Line 4 prints all of them.

We will deal with null values in feature engineering part 2.

STEP 5-

```python
data.shape
```

Check the shape of the dataset.

STEP 6 - Check the unique values in the PassengerId, Name, and Ticket columns.

```python
print("{} unique values in passenger id columns ".format(len(data.PassengerId.unique())))
print("{} unique values in Name columns ".format(len(data.Name.unique())))
print("{} unique values in Ticket columns ".format(len(data.Ticket.unique())))
```

There are 891 unique values in the PassengerId and Name columns and 681 unique values in the Ticket column.

These three columns will not contribute to our prediction: their values are unique (or nearly unique) identifiers, so we drop all three from the dataset.
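As an aside, `DataFrame.nunique()` reports the unique count for every column at once, which makes identifier-like columns easy to spot: a count equal to the number of rows means every value is unique. A minimal sketch on a hypothetical mini-frame (not the actual Titanic data):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the Titanic data.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],   # unique per row, like an ID column
    "Pclass":      [3, 1, 3],   # repeated values, a real category
})

counts = df.nunique()
print(counts)

# Columns whose unique count equals len(df) carry no predictive signal.
print(counts[counts == len(df)].index.tolist())
```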

STEP 7- Drop these three columns

```python
data = data.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
```

STEP 8-

```python
data.shape
```

Check the shape of the dataframe once again: it is now 891 × 9.

After dropping three features, 9 columns (features) remain in the dataset.

STEP 9 - In this step, we check whether the dataset is balanced.

An unbalanced dataset can cause misleadingly bad accuracy: if most data points belong to a single class, the model can become biased toward that class.

```python
data["Survived"].value_counts().plot(kind='bar')
```

The dataset is not perfectly balanced; the split is roughly 60-40.

That should not affect our predictions much. If it does, we will deal with it later.
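To put a number on the imbalance instead of reading it off the bar chart, `value_counts(normalize=True)` gives the class proportions directly. A sketch on a hypothetical mini-sample with the same rough 60-40 split:

```python
import pandas as pd

# Hypothetical mini-sample mirroring the Titanic 'Survived' split.
sample = pd.DataFrame({"Survived": [0, 0, 0, 1, 1]})

# normalize=True converts raw counts into class proportions.
ratios = sample["Survived"].value_counts(normalize=True)
print(ratios)
```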

STEP 10 -

```python
datasett = data.copy()
not_survived = data[data['Survived'] == 0]
survived = data[data['Survived'] == 1]
```

Line 1 copies the dataset into the datasett variable, so that the manipulations we perform do not affect the original data.

Lines 2 and 3 split the data by the dependent feature ("Survived") into two parts:

1. All non-survivor data points, stored in the not_survived variable.

2. All survivor data points, stored in the survived variable.

This is done to make visualizing the dataset easier.

STEP 11 - Now we check how the features that contain null values relate to survived and non-survived people.

First, we check the null-value features against the survivors only.

```python
import matplotlib.pyplot as plt
%matplotlib inline

dataset = survived.copy()
for features in null_features:
    dataset[features] = np.where(dataset[features].isnull(), 1, 0)
    dataset.groupby(features)['Survived'].count().plot.bar()
    plt.xlabel(features)
    plt.ylabel('Survived')
    plt.title(features)
    plt.show()
```

The import brings in matplotlib.pyplot for visualization.

We copy the survivors-only data into the dataset variable, so further processing does not affect the real dataframe.

The loop iterates over only the features that contain null values.

Inside the loop, np.where replaces null values with 1 and non-null values with 0, so we can check the dependency.

In the graphs above, 1 represents null values and 0 represents non-null values.

In the Age graph, the dependency on null values is low.

In the Cabin graph, the dependency on null values is high.

In the Embarked graph, there is almost no dependency on null values.

STEP 12 - Here we check the same dependency for the people who did not survive.

```python
import matplotlib.pyplot as plt
%matplotlib inline

for features in null_features:
    dataset = not_survived.copy()
    dataset[features] = np.where(dataset[features].isnull(), 1, 0)
    dataset.groupby(features)['Survived'].count().plot.bar()
    plt.xlabel(features)
    plt.ylabel('Survived')
    plt.title(features)
    plt.show()
```

This is the same as step 11, but here we are dealing with the people who did not survive.

STEP 13 - Let's find the relationship between each categorical variable (feature) and the survivors.

```python
num_features = data.columns
for feature in num_features:
    if feature != 'Survived' and feature != 'Fare' and feature != 'Age':
        survived.groupby(feature)['Survived'].count().plot.bar()
        plt.xlabel(feature)
        plt.ylabel('Survived')
        plt.title(feature)
        plt.show()
```

Survived is the dependent variable (feature).

Fare is a numeric variable.

Age is also a numeric variable.

So we exclude these three features (columns) here.

The plots show the relationship between each categorical variable (feature) and the survivors.

STEP 14 - Let's find the relationship between each categorical variable and the non-survivors.

```python
for feature in num_features:
    if feature != 'Survived' and feature != 'Fare' and feature != 'Age':
        not_survived.groupby(feature)['Survived'].count().plot.bar()
        plt.xlabel(feature)
        plt.ylabel('Not Survived')
        plt.title(feature)
        plt.show()
```

The plots show the relationship between each categorical variable (feature) and the non-survivors.

STEP 15 - Now let's check the numeric variables.

```python
non_sur_fare_mean = round(not_survived['Fare'].mean())
non_sur_age_mean = round(not_survived['Age'].mean())
print('Not-survived people fare average:', non_sur_fare_mean)
print('Not-survived people age average:', non_sur_age_mean)

sur_fare_mean = round(survived['Fare'].mean())
sur_age_mean = round(survived['Age'].mean())
print('Survived people fare average:', sur_fare_mean)
print('Survived people age average:', sur_age_mean)
```

From this, it is clear that, on average:

Passengers older than about 31 had a much lower chance of survival.

Passengers whose fare was below about 48 had a much lower chance of survival.
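The four averages above can also be computed in one shot with a groupby. A minimal sketch on hypothetical rows chosen to reproduce the article's rounded figures (the real values come from the full 891-row dataset):

```python
import pandas as pd

# Hypothetical rows mimicking the Titanic columns used here; the values
# are chosen so the group means match the article's rounded figures.
sample = pd.DataFrame({
    "Survived": [0, 0, 1, 1],
    "Age":      [40.0, 22.0, 30.0, 26.0],
    "Fare":     [8.0, 36.0, 60.0, 36.0],
})

# One groupby produces both per-class averages in a single table.
means = sample.groupby("Survived")[["Age", "Fare"]].mean()
print(means)
```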

STEP 16 - Analyze the numeric features: check the distribution of each numeric variable (feature).

```python
for feature in num_features:
    if feature == 'Fare' or feature == 'Age':
        data[feature].hist(bins=25)
        plt.xlabel(feature)
        plt.ylabel("Count")
        plt.title(feature)
        plt.show()
```

Age roughly follows a normal distribution, but Fare does not.
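A quick numeric check can back up the histogram reading: `DataFrame.skew()` measures asymmetry, where a value near 0 suggests a roughly symmetric (normal-like) shape and a large positive value means a long right tail like Fare's. A sketch with hypothetical stand-in values:

```python
import pandas as pd

# Hypothetical values standing in for the real columns.
sample = pd.DataFrame({
    "Age":  [22, 26, 30, 34, 38],   # symmetric around 30
    "Fare": [7, 8, 9, 10, 500],     # one huge value drags the tail right
})

# Skewness per column: ~0 for Age, strongly positive for Fare.
print(sample.skew())
```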

STEP 17 - Check for outliers in the dataset.

There are many ways to detect outliers:

1. Box plot

2. Z-score

3. Scatter plot

etc.

First, let's find outliers using a scatter plot:

```python
for feature in num_features:
    if feature == 'Fare' or feature == 'Age':
        plt.scatter(data[feature], data['Survived'])
        plt.xlabel(feature)
        plt.ylabel('Survived')
        plt.title(feature)
        plt.show()
```

Age has few outliers, but Fare clearly has a lot of them.

STEP 18 - Find outliers using a box plot.

A box plot is a better way to check for outliers in a distribution.

```python
for feature in num_features:
    if feature == 'Fare' or feature == 'Age':
        data.boxplot(column=feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()
```

The box plot shows the same picture as the scatter plot: Age has few outliers, while Fare has many.
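The z-score method from the list above was not demonstrated, so here is a minimal sketch on a hypothetical Fare-like series. The 3-standard-deviation cutoff is a common convention, not something fixed by the dataset:

```python
import pandas as pd

# Hypothetical Fare-like values: 19 ordinary fares plus one extreme one.
fare = pd.Series(list(range(1, 20)) + [500])

# Standardize: how many standard deviations each value sits from the mean.
z = (fare - fare.mean()) / fare.std()

# Flag anything beyond 3 standard deviations as an outlier.
outliers = fare[z.abs() > 3]
print(outliers)
```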

The first part ends here.

In this part, we saw how to analyze and visualize a dataset.

In the next part, we will see how to perform feature engineering.
