Data Science Project Life Cycle Part 2 (Feature Engineering)

Photo by Dan Meyers on Unsplash

We have already covered data analysis and visualization in part 1.

In this second part, we will see how feature engineering is performed on data, using the same dataset as in part 1.

What we will do in feature engineering:

1- Handle null values.

2- Perform label encoding.

3- Perform feature scaling.

So let's start.

STEP 1- Once again, we check how many columns (features) contain null values.

Before that, let's discuss the techniques for handling null values.

There are various ways to handle null values:

1. Delete the rows that contain null values.

2. Replace null values with the most frequently occurring value.

3. Predict null values with an ML algorithm.

4. Replace null values with a measure of central tendency such as the mean or median (numeric data only).
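To make techniques 1, 2 and 4 from the list above concrete, here is a minimal sketch with pandas. The small `toy` DataFrame and its values are made up for illustration and are not the dataset from part 1. Technique 3 is left out because it needs a trained model.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption, not the dataset from part 1).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 38.0],
    "Embarked": ["S", "C", None, "S"],
})

# Technique 1: delete every row that contains a null value.
dropped = toy.dropna()

# Technique 2: replace nulls with the most frequent value (the mode).
embarked_filled = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Technique 4: replace nulls with a central tendency (numeric data only).
age_mean_filled = toy["Age"].fillna(toy["Age"].mean())
age_median_filled = toy["Age"].fillna(toy["Age"].median())

print(dropped)
print(embarked_filled.tolist())
print(age_mean_filled.tolist())
```

Dropping rows is the simplest option but throws away data, so it tends to suit cases where only a few rows are affected.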


null_features=[features for features in data.columns if data[features].isnull().sum()>0]

Cabin contains 687 null values.

Age contains 177 null values.

Embarked contains 2 null values.
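If you also want the count for each column, not only the column names, one possible way is to filter the result of `isnull().sum()`. This is a small sketch on a made-up `toy` DataFrame that stands in for `data`; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data` (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Cabin": [None, None, "C85"],
    "Fare": [7.25, 71.28, 8.05],
})

# Count the nulls in every column, then keep only columns that have any.
null_counts = toy.isnull().sum()
null_features = null_counts[null_counts > 0]

print(null_features)
```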


Cabin has 687 null values out of 891 total values, so filling them with any of the techniques mentioned above would be meaningless.

So, we are going to drop this column.

Age has 177 null values and is a numeric column, so we fill its null values with a central tendency, the mean.

Embarked has 2 null values, so we fill these two with the most frequent value in the Embarked column.

STEP 2- Fill the missing values and drop the Cabin column.

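# 1st step: fill the missing Age values with the column mean
nn = data['Age'].mean(axis=0, skipna=True)
data['Age'] = data['Age'].fillna(nn)
# 2nd step: drop the Cabin column
data = data.drop(['Cabin'], axis=1)
# 3rd step: fill the missing Embarked values with the most frequent value
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])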



Now, no feature in the dataset has missing values.

STEP 4- After filling all null values, the data looks like this.

STEP 5- With the null values filled, it is time for feature encoding.

Q- Why are we performing feature encoding?

Ans- Because most of the features in this dataset are categorical, and machine learning algorithms generally work only with numeric data. So we convert the categorical data into numeric data.

There are many different types of encoding techniques:

1- One-hot encoding

2- One-hot encoding with many categories

3- Mean encoding

4- Label encoding

5- Target-guided ordinal encoding


So here we are going to use label encoding, because it is a good fit for our problem statement.
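To see how label encoding differs from one-hot encoding, here is a small comparison on a made-up column. The `toy` DataFrame is hypothetical; `pd.get_dummies` and `LabelEncoder` are the standard pandas and scikit-learn tools for these two encodings.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column (an assumption, not the real dataset).
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

# Label encoding: a single column of integer codes.
labels = LabelEncoder().fit_transform(toy["Embarked"])

print(one_hot.columns.tolist())
print(labels.tolist())
```

Label encoding keeps a single column, but the integer codes imply an order between categories that may not exist. The scikit-learn documentation describes `LabelEncoder` as intended for target labels and offers `OrdinalEncoder` for input features, so treat its use on features here as a simple convenience.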

STEP 6- In our dataset, only two categorical features, Sex and Embarked, are left to encode.

So now we perform label encoding on these two features.

from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
data['Embarked'] = label_encoder.fit_transform(data['Embarked'])
data['Sex'] = label_encoder.fit_transform(data['Sex'])
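Refitting one encoder on a second column replaces the mapping it learned for the first. If you want to look up or reverse the codes later, one option is to keep a separate encoder per column. The sketch below assumes a made-up `toy` DataFrame, and the `encoders` dictionary is a hypothetical helper, not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data (values are made up for illustration).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Keep one fitted encoder per column so each mapping stays available.
encoders = {}
for column in ["Sex", "Embarked"]:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

# classes_ is sorted, and the position of each label is its integer code.
sex_mapping = {str(label): code
               for code, label in enumerate(encoders["Sex"].classes_)}
print(sex_mapping)

# inverse_transform turns the codes back into the original labels.
embarked_labels = encoders["Embarked"].inverse_transform([0, 1]).tolist()
print(embarked_labels)
```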

STEP 7- After performing label encoding, our dataset looks like this.

STEP 8- Feature scaling.

Looking at the dataset above, the values in the Fare and Age columns are on a much larger scale than those in the Sex, Pclass, Parch, Embarked and SibSp columns.

To reduce this difference, we perform feature scaling on these two columns:

1- Age.

2- Fare.

There are many different feature scaling techniques, but here we use the min-max scaler, which by default rescales each column to the range 0 to 1.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data[['Age','Fare']] = scaler.fit_transform(data[['Age','Fare']].values)
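To show what the min-max scaler computes, the sketch below applies the formula x' = (x - min) / (max - min) by hand and compares the result with `MinMaxScaler`. The `toy` DataFrame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data (an assumption, not the real Age and Fare values).
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [0.0, 50.0, 200.0]})

# Manual min-max scaling, column by column: (x - min) / (max - min).
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same scaling done by scikit-learn with its default range of 0 to 1.
scaled = MinMaxScaler().fit_transform(toy[["Age", "Fare"]])

print(manual)
print(np.allclose(manual.values, scaled))
```

After scaling, both columns lie between 0 and 1, so neither dominates the other purely because of its units.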

STEP 9- After performing feature scaling, our data looks like this.

Feature engineering is now finished.

In the next part, we will see how to perform feature extraction on the dataset for developing machine learning models.




Data Science, Machine Learning, Blockchain Developer

Puneet Singh
