Data Science Project Life Cycle Part 2 (Feature Engineering)

Puneet Singh
4 min read · Sep 3, 2020
Photo by Dan Meyers on Unsplash

In the first part, we covered data analysis and visualization.

Now, in this second part, we will see how to perform feature engineering on the data.

So, in this tutorial, we will perform feature engineering on Part 1's dataset.

What we will perform in feature engineering:

1- Handle null values.

2- Perform label encoding.

3- Perform feature scaling.

So let's start.

STEP 1- Once again, we check how many columns (features) contain null values.

Before that, let's first discuss techniques for handling null values.

There are various ways to handle null values:

1. Delete the rows that contain null values.

2. Replace null values with the most frequently repeated value.

3. Predict null values with some ML algorithm.

4. Replace null values with a measure of central tendency (numeric data only).

etc.
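The first, second, and fourth strategies above can be sketched in a few lines of pandas. This is a minimal illustration on a hypothetical toy frame (the column names are made up, not from the Titanic dataset); strategy 3 needs a fitted model, so it is omitted here.

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame, just to illustrate each option
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan],
                   "Port": ["S", "C", None, "S"]})

# 1. Delete rows that contain any null value
dropped = df.dropna()

# 2. Replace nulls with the most frequent value (the mode)
df["Port"] = df["Port"].fillna(df["Port"].mode()[0])

# 4. Replace numeric nulls with a central tendency (here, the mean)
df["Age"] = df["Age"].fillna(df["Age"].mean())

print(df)
```

Note that `dropna()` returns a new frame, while `fillna` here modifies one column at a time; which strategy to use depends on how many values are missing, as we will see below.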

null_count = data.isnull().sum()
null_features = null_count[null_count > 0].sort_values(ascending=False)
print(null_features)
# equivalently, as a list of column names
null_features = [f for f in data.columns if data[f].isnull().sum() > 0]
print(null_features)

Cabin contains 687 null values.

Age: 177

Embarked: 2

Cabin contains many null values: 687 out of 891 total rows are missing, so filling them in with any of the techniques mentioned above would be meaningless.

So, we are going to drop this particular column.

Age has 177 null values and is a numeric column, so we will fill its missing values using a central tendency: the mean.

Embarked has 2 null values, so we will fill those with the most repeated value (the mode) of the Embarked column.

STEP 2- Fill the missing values and drop the Cabin column.

# 1st step: fill Age with the rounded mean
nn = data['Age'].mean(axis=0, skipna=True)
nn = np.round(nn)
data['Age'] = data['Age'].fillna(value=nn)
# ----------------------------------------------------
# 2nd step: drop the Cabin column
data = data.drop(['Cabin'], axis=1)
# ----------------------------------------------------
# 3rd step: fill Embarked with its mode
n1 = data['Embarked'].mode()
data['Embarked'] = data['Embarked'].fillna(value=n1[0])

STEP 3-

data.isnull().sum()

Now, there are no features with missing values left in the dataset.

STEP 4- After filling all null values, the data looks like this.

STEP 5- After filling all null values, it's time for feature encoding.

Q- Why are we performing feature encoding?

Ans- Because most of the features in this dataset are categorical columns, but a machine does not understand categorical data; it understands only numeric data. So we convert the categorical data into numeric data.

There are many different types of encoding techniques:

1- one-hot encoding

2- one-hot encoding with many variables

3- mean encoding

4- label encoding

5- target-guided ordinal encoding

etc.
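To make the difference concrete, here is a small sketch contrasting two of these techniques on a hypothetical toy column (the values mirror the Embarked feature, but this is not the project's dataset): label encoding produces one integer code per category, while one-hot encoding produces one binary column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with three categories
s = pd.Series(["S", "C", "Q", "S"])

# Label encoding: one integer per category
# (LabelEncoder assigns codes in sorted order: C=0, Q=1, S=2)
codes = LabelEncoder().fit_transform(s)
print(list(codes))  # [2, 0, 1, 2]

# One-hot encoding: one binary column per category
onehot = pd.get_dummies(s, prefix="Embarked")
print(onehot.columns.tolist())  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```

One-hot encoding avoids implying an order between categories but multiplies the number of columns, which is one reason a compact label encoding can be preferred for a small dataset like this one.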

So here, we are going to use label encoding, because it is the best fit for our problem statement.

STEP 6- In our dataset, only two categorical features are left to encode:

1- Embarked

2- Sex

So, now we perform label encoding on these two features.

from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
data['Embarked'] = label_encoder.fit_transform(data['Embarked'])
data['Sex'] = label_encoder.fit_transform(data['Sex'])

STEP 7- After performing label encoding, our dataset looks like this.

STEP 8- After performing label encoding:

Looking at the dataset above, there is a very large difference in scale between the Fare and Age columns on one side and the Sex, Pclass, Parch, Embarked, and SibSp columns on the other.

To minimize this difference, we perform feature scaling on two columns:

1- Age

2- Fare

There are many different techniques for feature scaling, but here we use the min-max scaler.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data[['Age', 'Fare']] = scaler.fit_transform(data[['Age', 'Fare']].values)
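Under the hood, min-max scaling maps each value x to (x − min) / (max − min), squeezing every column into the [0, 1] range. A quick sanity check on a hypothetical toy array (not the project's data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy ages to verify the formula (x - min) / (max - min)
ages = np.array([[20.0], [30.0], [40.0]])

scaled = MinMaxScaler().fit_transform(ages)
manual = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled.ravel())  # [0.  0.5 1. ]
```

Because the minimum maps to 0 and the maximum to 1, Age and Fare end up on the same scale as the already-small encoded columns.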

STEP 9- After performing feature scaling, our data looks like this.

Feature engineering is now finished.

In the next part, we will see how to perform feature extraction on the dataset for the development of machine learning models.


Puneet Singh

Data Science , Machine Learning , BlockChain Developer