How to Prepare Your Data for Learning with Scikit-Learn

If you want to implement a learning algorithm with scikit-learn, the first thing you need to do is prepare your data.

Preparing the data exposes the structure of the problem to the learning algorithm you decide to use.

Related course: Python Machine Learning Course

The main challenge is that each algorithm makes different assumptions about the data it processes, which may call for different transforms.

There are four proven steps in preparing data for learning with scikit-learn:

  1. rescale the data
  2. standardize the data
  3. normalize the data
  4. binarize the data

Data Preparation

Rescale Data

Rescaling the attributes of your data is useful when they come in different scales; many learning algorithms benefit when all attributes lie on the same scale.

This process is often called normalization, with attributes rescaled into the range 0 to 1. It is useful for optimization algorithms such as gradient descent, which forms the core of many learning algorithms.

import numpy
from sklearn.preprocessing import MinMaxScaler

# data values
X = [[110, 200], [120, 800], [310, 400], [140, 900], [510, 200], [653, 400], [310, 880]]

# transform data into the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize transformed data
numpy.set_printoptions(precision=2)
print(rescaledX[0:6, :])

The rescaled values will be between 0 and 1:

[[0.   0.  ]
 [0.02 0.86]
 [0.37 0.29]
 [0.06 1.  ]
 [0.74 0.  ]
 [1.   0.29]]

Rescaling is also valuable for algorithms that weight inputs, such as neural networks and regression, and for all algorithms that use a distance measure, such as K-Nearest Neighbors.
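
As a minimal sketch of where this pays off, the snippet below feeds rescaled data into K-Nearest Neighbors through a pipeline. The labels y are invented here purely for illustration:

# Sketch: rescaling before a distance-based model (labels are made up)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# same data values as above
X = [[110, 200], [120, 800], [310, 400], [140, 900], [510, 200], [653, 400], [310, 880]]
y = [0, 1, 0, 1, 0, 0, 1]  # invented labels, for illustration only

# rescale to [0, 1] before computing distances, so no single attribute dominates
model = make_pipeline(MinMaxScaler(feature_range=(0, 1)), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[300, 500]]))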

Standardize Data

This technique is effective for transforming attributes to a Gaussian distribution.

The standardized distribution has a mean of 0 and a standard deviation of 1. Algorithms that assume Gaussian input variables, such as logistic regression, linear regression, and linear discriminant analysis, make better use of standardized data.

# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import numpy

# data values
X = [[110, 200], [120, 800], [310, 400], [140, 900], [510, 200], [653, 400], [310, 880]]

# fit the scaler, then transform the data
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:6, :])

Scaled values:

[[-1.02  -1.178]
 [-0.968  0.901]
 [ 0.013 -0.485]
 [-0.865  1.247]
 [ 1.045 -1.178]
 [ 1.783 -0.485]]
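
To show how standardized data feeds such a model, here is a minimal sketch combining StandardScaler with logistic regression in a pipeline. The labels y are again invented for illustration:

# Sketch: standardizing features before logistic regression (labels are made up)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = [[110, 200], [120, 800], [310, 400], [140, 900], [510, 200], [653, 400], [310, 880]]
y = [0, 1, 0, 1, 0, 0, 1]  # invented labels, for illustration only

# standardizing puts both attributes on the same scale, which helps the solver
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
print(model.predict(X))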

Normalize Data

Normalizing data in scikit-learn involves rescaling each observation (row) to have a length of 1, called a unit norm in linear algebra.

The Normalizer class in scikit-learn is the tool for this.

# Normalize values
from sklearn.preprocessing import Normalizer
import numpy

# data values
X = [[110, 200], [120, 800], [310, 400], [140, 900], [510, 200], [653, 400], [310, 880]]

# rescale each row (observation) to unit length
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarize transformed data
numpy.set_printoptions(precision=2)
print(normalizedX[0:6, :])

Normalized values are then:

[[0.48 0.88]
 [0.15 0.99]
 [0.61 0.79]
 [0.15 0.99]
 [0.93 0.37]
 [0.85 0.52]]

Sparse datasets with attributes of varying scales benefit the most from this preprocessing, particularly in algorithms that use a distance measure, such as K-Nearest Neighbors, and in neural networks.
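
As a minimal sketch of the sparse case, Normalizer also accepts scipy sparse matrices directly:

# Sketch: normalizing a scipy sparse matrix row by row
from scipy.sparse import csr_matrix
from sklearn.preprocessing import Normalizer

# a sparse matrix with attributes on very different scales
X_sparse = csr_matrix([[110, 0], [0, 800], [310, 400]])

normalized = Normalizer().fit_transform(X_sparse)
print(normalized.toarray())  # each row now has unit length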

Binary Data Transformation

This transform uses a binary threshold: values above the threshold are marked 1, and values less than or equal to it are marked 0. It is useful when you want to turn probabilities into crisp values.

# Binarize values
from sklearn.preprocessing import Binarizer
import numpy

# data values
X = [[110, 200], [120, 800], [310, 400], [140, 900], [510, 200], [653, 400], [310, 880]]

# binarize data: values above the threshold become 1, the rest 0
binarizer = Binarizer(threshold=500).fit(X)
binaryX = binarizer.transform(X)

# summarize transformed data
numpy.set_printoptions(precision=1)
print(binaryX[0:6, :])

The threshold value is very important, as it will decide which values become zero or one.

[[0 0]
 [0 1]
 [0 0]
 [0 1]
 [1 0]
 [1 0]]
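
To see the effect of that choice, the sketch below, which reuses the same sample data, compares two thresholds in one pass:

# Sketch: the same data binarized with two different thresholds
from sklearn.preprocessing import Binarizer

X = [[110, 200], [120, 800], [310, 400], [140, 900], [510, 200], [653, 400], [310, 880]]

for threshold in (200, 500):
    binaryX = Binarizer(threshold=threshold).fit_transform(X)
    print("threshold", threshold)
    print(binaryX)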

Binarization is also significant in feature engineering, when you want to add new binary features.
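
A hypothetical sketch of that idea: derive a binary flag from one attribute and append it as an extra column (the threshold of 500 is an assumption carried over from above):

# Sketch: appending a binarized column as an engineered feature
import numpy
from sklearn.preprocessing import Binarizer

X = numpy.array([[110, 200], [120, 800], [310, 400], [140, 900], [510, 200], [653, 400], [310, 880]])

# flag rows whose second attribute exceeds 500, then append the flag as a new column
flag = Binarizer(threshold=500).fit_transform(X[:, 1:2])
X_extended = numpy.hstack([X, flag])
print(X_extended)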

By now, you should be familiar with the steps involved in preparing data for machine learning with scikit-learn. Remember, the four steps are:

  1. rescaling the data
  2. standardizing the data
  3. normalizing the data
  4. binarizing the data

If you are new to Machine Learning, then I highly recommend this book.

Download examples and exercises