How to load Machine Learning Data in Python

In order to start your machine learning project in Python, you need to be able to load data properly. If you are a beginner in Python, this article will help you learn how to load machine learning data using three different techniques.

Load Machine Learning Data

Before we go deeper, you need to know that CSV (comma-separated values) is the most common format in which machine learning data is presented. A CSV file has several parts and features that you need to understand:

CSV File Header: The header in a CSV file is used to automatically assign names or labels to each column of your dataset. If your file doesn't have a header, you will have to name your attributes manually.
Comments: A comment in a CSV file is a line that starts with a hash sign (#). Depending on the method you choose to load your machine learning data, you may have to state whether to expect comments and which character marks them.
Delimiter: A delimiter separates the values in each row. The comma (,) is the standard delimiter. The tab (\t) is another delimiter that you can use, but you have to specify it explicitly.
Quotes: If field values in your file contain spaces, these values are often quoted, and the default quote character is the double quotation mark ("). If your file uses other characters, you need to specify this when you load the file.
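
To make these four parts concrete, here is a minimal sketch that parses a small in-memory CSV with the standard csv module. The sample text and its column names are made up for illustration and are not part of the dataset used later in this article. Note that csv.reader() has no option for comments, so the sketch filters out lines starting with a hash sign before parsing.

```python
import csv
import io

# Hypothetical sample: one comment line, a header row, and a quoted field
# that contains a space. None of this comes from the real dataset.
sample = (
    "# exported for illustration\n"
    "name,age,city\n"
    'Ann,34,"New York"\n'
    "Bob,29,Boston\n"
)

# csv.reader has no comment handling, so drop '#' lines ourselves.
lines = [line for line in io.StringIO(sample) if not line.startswith("#")]

# State the delimiter and quote character explicitly (these are also the defaults).
reader = csv.reader(lines, delimiter=",", quotechar='"')
header = next(reader)  # the first remaining row holds the column names
rows = list(reader)    # every value is still a string at this point

print(header)
print(rows)
```

For a tab-separated file, the same sketch should work with delimiter="\t".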

After identifying these critical parts of your data file, let's go ahead and learn the different methods for loading machine learning data in Python.

Load Data with Python Standard Library

With the Python standard library, you use the csv module and its reader() function to load your CSV files. Once loaded, the rows can be converted to a NumPy array that can be used for machine learning.

For example, the short script below loads the pima-indians-diabetes.data.csv dataset, which has no header and contains only numeric fields, and then converts it to a NumPy array.

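# Load CSV using the Python standard library
import csv
import numpy
filename = 'pima-indians-diabetes.data.csv'
# Open the file in text mode; the with block closes it when done
with open(filename, 'rt', newline='') as raw_data:
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
    x = list(reader)
# csv.reader yields strings, so convert the rows to floats explicitly
data = numpy.array(x).astype('float')
print(data.shape)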

Simply explained, csv.reader() returns an object that iterates over each row of the data, and the resulting list of rows can be converted easily into a NumPy array. Running the sample code prints the following shape of the array:

(768, 9)
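
The conversion step matters because the csv module returns every field as a string. The sketch below shows this on a tiny in-memory sample. The two rows are made-up stand-ins for the real file, used only so the example runs without any download.

```python
import csv
import io
import numpy

# Hypothetical two-row sample standing in for the real file.
sample = io.StringIO("6,148,72\n1,85,66\n")

reader = csv.reader(sample, delimiter=",", quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows[0])  # each row is a list of strings, not numbers

# Convert explicitly: build a string array first, then cast it to float.
data = numpy.array(rows).astype(float)
print(data.dtype, data.shape)
```

If any field is not numeric, the cast raises a ValueError, which is a useful early warning that the file is not as clean as assumed.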

Load Data File With NumPy

Another way to load machine learning data in Python is by using NumPy and the numpy.loadtxt() function.

In the sample code below, the function assumes that your file has no header row and all data use the same format. It also assumes that the file pima-indians-diabetes.data.csv is stored in your current directory.

# Load CSV using NumPy
import numpy
filename = 'pima-indians-diabetes.data.csv'
# The with block closes the file once loadtxt has read it
with open(filename, 'rt') as raw_data:
    data = numpy.loadtxt(raw_data, delimiter=",")
print(data.shape)

Running the sample code above loads the file as a numpy.ndarray and prints the following shape of the data:

(768, 9)
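
If your file does have a header row or comment lines, numpy.loadtxt() can skip them through its skiprows and comments parameters. Here is a minimal sketch on an in-memory sample. The header names and values are invented for illustration, and the real dataset used in this article needs neither option.

```python
import io
import numpy

# Hypothetical sample: a header row first, then a comment line among the data.
sample = io.StringIO(
    "preg,plas,pres\n"
    "# rows below are illustrative values only\n"
    "6,148,72\n"
    "1,85,66\n"
)

# skiprows=1 drops the header line; comments="#" ignores the hash-prefixed line.
data = numpy.loadtxt(sample, delimiter=",", comments="#", skiprows=1)
print(data.shape)
print(data)
```

Keep in mind that skiprows counts physical lines, comments included, so a comment placed above the header would need skiprows=2 instead.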

If your file can be retrieved from a URL, the code above can be modified as follows and yields the same dataset:

# Load CSV from URL using NumPy
from numpy import loadtxt
from urllib.request import urlopen
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
raw_data = urlopen(url)
dataset = loadtxt(raw_data, delimiter=",")
print(dataset.shape)

Running the code will produce the same resulting shape of the data:

(768, 9)

Load Data File With Pandas

The third way to load your machine learning data is using Pandas and the pandas.read_csv() function.

The pandas.read_csv() function is very flexible and is often the most convenient way to load machine learning data. It returns a pandas.DataFrame, which lets you start summarizing and plotting immediately.

The sample code below assumes that the pima-indians-diabetes.data.csv file is stored in your current directory.

# Load CSV using Pandas
import pandas
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(filename, names=names)
print(data.shape)

You will notice that we explicitly passed the name of each attribute to the DataFrame, because the file has no header row. Running the sample code above prints the following shape of the data:

(768, 9)
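
When a file does carry a header row, the names argument can be left out and pandas.read_csv() takes the column names from the first line. The sketch below illustrates this, along with a quick summary of the loaded DataFrame. The three-row sample is made up for illustration and only borrows a few column names from the list above.

```python
import io
import pandas

# Hypothetical sample with a header row; the values are illustrative only.
sample = io.StringIO(
    "preg,plas,class\n"
    "6,148,1\n"
    "1,85,0\n"
    "8,183,1\n"
)

# Without names=..., the first line is used as the column names.
data = pandas.read_csv(sample)
print(data.columns.tolist())

# Summarize right away: per-column statistics and the size of each class.
print(data.describe())
print(data.groupby("class").size())
```

Checking the class counts this early is a cheap way to spot an imbalanced dataset before any modeling starts.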

If your file can be retrieved from a URL, the code above can be modified as follows and yields the same dataset:

# Load CSV using Pandas from URL
import pandas
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
print(data.shape)

Running the sample code above will download a CSV file, parse it, and produce the following shape of the loaded DataFrame:

(768, 9)
