Classification is one of the machine learning tasks. So what is classification?
It’s something you do all the time: sorting things into categories.

Look at any object and you will instantly know what class it belongs to: is it a mug, a table or a chair?
That is the task of classification, and computers can do this too (based on data).

This article is an introduction to machine learning for beginners. Let’s make our first machine learning program.


Supervised Machine Learning

Training data

The example in this article uses the machine learning module sklearn. A supervised machine learning algorithm learns from examples, also called training data. The training phase is the first step of a machine learning algorithm.

Training needs example data, so collect data first. For instance, take a set of images of apples and oranges and write down the features of each image.

Features are used to distinguish between the two classes. A feature is a property, like the color, shape or weight. It can be expressed as a numeric value.

One of the key tasks is to get good features from your training data. Write down the category of each image. The category is the class: you can take class 0 for apples and class 1 for oranges.

You can have as many classes as you want, but in this example we’ll use 2 classes (apples and oranges).

[Figure: machine learning training data for a classifier]

Write the features horizontally, so that each line represents one image. The first line represents the first image.

Such a line is called a feature vector. This set of numbers represents the image.
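
As a small illustration, the training data can be written down as plain Python lists, one feature vector per image. The two feature names used here (a color code and a weight) are assumptions made for this sketch; the article only says that a feature is a property such as color, shape or weight.

```python
# Hypothetical features for this sketch: [color_code, weight].
# The color codes 0 and 1 are made-up numeric codes for two colors.
feature_vectors = [
    [0, 50],  # first image
    [0, 60],
    [1, 35],
    [1, 36],
    [1, 40],
]

# One class per image, in the same order: 0 = apple, 1 = orange.
labels = [0, 0, 1, 1, 1]

# Hypothetical lookup table to print readable class names.
class_names = {0: "apple", 1: "orange"}

for vector, label in zip(feature_vectors, labels):
    print(vector, "->", class_names[label])
```

Every feature vector has the same length, and there is exactly one label per vector.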

Classifier

After the training phase, a classifier can make a prediction.
Given a new feature vector, is the image an apple or an orange?

There are different types of classification algorithms. One of them is a decision tree.

If you have new data, the algorithm can decide which class your new data belongs to.
The output will be [0] for apple and [1] for orange.

In the program below, we train the algorithm, define new data, and then simply let the algorithm make a prediction.

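from sklearn import tree

# Each row is one feature vector; each label is the class of that row
# (0 = apple, 1 = orange).
features = [[0, 50], [0, 60], [1, 35], [1, 36], [1, 40]]
labels = [0, 0, 1, 1, 1]

# Training phase: fit a decision tree to the examples.
algorithm = tree.DecisionTreeClassifier()
algorithm = algorithm.fit(features, labels)

# Prediction: classify a new feature vector.
new_data = [[0, 51]]
print(algorithm.predict(new_data))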
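
A possible follow-up, sketched with the same training data: predict several new feature vectors at once and translate the numeric classes back to names. The `class_names` dictionary and the second new feature vector are made up for this illustration and are not part of sklearn or of the original example.

```python
from sklearn import tree

features = [[0, 50], [0, 60], [1, 35], [1, 36], [1, 40]]
labels = [0, 0, 1, 1, 1]  # 0 = apple, 1 = orange

classifier = tree.DecisionTreeClassifier(random_state=0)
classifier.fit(features, labels)

# Hypothetical lookup table, not part of sklearn.
class_names = {0: "apple", 1: "orange"}

# predict() accepts a list of feature vectors and returns one class per vector.
new_data = [[0, 51], [1, 38]]
predictions = classifier.predict(new_data)

for vector, predicted in zip(new_data, predictions):
    print(vector, "->", class_names[int(predicted)])
```

The first vector resembles the apple examples and the second resembles the orange examples, so the tree assigns them to class 0 and class 1 respectively.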

Overfitting and underfitting

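In general, the more training data you have, the better the classifier becomes.
With very little training data, you won’t have good predictions. The same happens when the model is too simple to capture the pattern in the data, which is called underfitting.

So in general the classifier becomes more accurate with more data. But a good score on the training data is not enough: a classifier that fits its training examples too closely, including their noise, predicts poorly on new data. That’s called overfitting.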
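
One way to look at overfitting in practice is to compare the accuracy on the training data with the accuracy on data the classifier has never seen. The sketch below is only an illustration under stated assumptions: the toy data, the 20% label noise, and the helper `make_samples` are all invented here. It compares an unrestricted decision tree with one limited to a single question (`max_depth=1`).

```python
import random
from sklearn import tree

random.seed(0)

def make_samples(count):
    """Return (features, labels) for a noisy, made-up toy problem."""
    features, labels = [], []
    for _ in range(count):
        x = [random.random(), random.random()]
        label = 1 if x[0] > 0.5 else 0
        if random.random() < 0.2:  # flip roughly 20% of the labels (noise)
            label = 1 - label
        features.append(x)
        labels.append(label)
    return features, labels

train_x, train_y = make_samples(200)
test_x, test_y = make_samples(200)

# An unrestricted tree can memorize the training set, noise included.
deep = tree.DecisionTreeClassifier(random_state=0).fit(train_x, train_y)
# A depth-1 tree can only ask a single question about the features.
shallow = tree.DecisionTreeClassifier(max_depth=1, random_state=0).fit(train_x, train_y)

for name, model in [("deep", deep), ("shallow", shallow)]:
    print(name,
          "train:", model.score(train_x, train_y),
          "test:", model.score(test_x, test_y))
```

The unrestricted tree classifies its own training data perfectly, yet it does worse on the unseen test data. That gap between training and test accuracy is the usual sign of overfitting.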
