
Dealing with missing data
It is not uncommon in real-world applications for our samples to be missing one or more values for various reasons. There could have been an error in the data collection process, certain measurements are not applicable, or particular fields could have been simply left blank in a survey, for example. We typically see missing values as the blank spaces in our data table or as placeholder strings such as NaN, which stands for not a number, or NULL (a commonly used indicator of unknown values in relational databases).
Unfortunately, most computational tools are unable to handle such missing values, or produce unpredictable results if we simply ignore them. Therefore, it is crucial that we take care of those missing values before we proceed with further analyses. In this section, we will work through several practical techniques for dealing with missing values by removing entries from our dataset or imputing missing values from other samples and features.
Identifying missing values in tabular data
But before we discuss several techniques for dealing with missing values, let's create a simple example data frame from a comma-separated values (CSV) file to get a better grasp of the problem:
>>> import pandas as pd
>>> from io import StringIO
>>> csv_data = \
... '''A,B,C,D
... 1.0,2.0,3.0,4.0
... 5.0,6.0,,8.0
... 10.0,11.0,12.0,'''
>>> # If you are using Python 2.7, you need
>>> # to convert the string to unicode:
>>> # csv_data = unicode(csv_data)
>>> df = pd.read_csv(StringIO(csv_data))
>>> df
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
Using the preceding code, we read CSV-formatted data into a pandas DataFrame via the read_csv function and noticed that the two missing cells were replaced by NaN. The StringIO function in the preceding code example was simply used for the purposes of illustration. It allows us to read the string assigned to csv_data into a pandas DataFrame as if it was a regular CSV file on our hard drive.
For a larger DataFrame, it can be tedious to look for missing values manually; in this case, we can use the isnull method to return a DataFrame with Boolean values that indicate whether a cell contains a numeric value (False) or if data is missing (True). Using the sum method, we can then return the number of missing values per column as follows:
>>> df.isnull().sum()
A    0
B    0
C    1
D    1
dtype: int64
This way, we can count the number of missing values per column; in the following subsections, we will take a look at different strategies for how to deal with this missing data.
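Similarly, if we are interested in the number of missing values per row rather than per column, we can simply sum over the other axis:
>>> df.isnull().sum(axis=1)
0    0
1    1
2    1
dtype: int64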
Note
Although scikit-learn was developed for working with NumPy arrays, it can sometimes be more convenient to preprocess data using pandas' DataFrame. We can always access the underlying NumPy array of a DataFrame via the values attribute before we feed it into a scikit-learn estimator:
>>> df.values
array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [ 10.,  11.,  12.,  nan]])
Eliminating samples or features with missing values
One of the easiest ways to deal with missing data is to simply remove the corresponding features (columns) or samples (rows) from the dataset entirely; rows with missing values can be easily dropped via the dropna method:
>>> df.dropna(axis=0)
     A    B    C    D
0  1.0  2.0  3.0  4.0
Similarly, we can drop columns that have at least one NaN in any row by setting the axis argument to 1:
>>> df.dropna(axis=1)
      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0
The dropna method supports several additional parameters that can come in handy:
# only drop rows where all columns are NaN
# (returns the whole array here since we don't
# have a row where all values are NaN)
>>> df.dropna(how='all')
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN

# drop rows that have fewer than 4 real values
>>> df.dropna(thresh=4)
     A    B    C    D
0  1.0  2.0  3.0  4.0

# only drop rows where NaN appears in specific columns (here: 'C')
>>> df.dropna(subset=['C'])
      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN
Although the removal of missing data seems to be a convenient approach, it also comes with certain disadvantages; for example, we may end up removing too many samples, which will make a reliable analysis impossible. Or, if we remove too many feature columns, we will run the risk of losing valuable information that our classifier needs to discriminate between classes. In the next section, we will thus look at one of the most commonly used alternatives for dealing with missing values: interpolation techniques.
Imputing missing values
Often, the removal of samples or dropping of entire feature columns is simply not feasible, because we might lose too much valuable data. In this case, we can use different interpolation techniques to estimate the missing values from the other training samples in our dataset. One of the most common interpolation techniques is mean imputation, where we simply replace the missing value with the mean value of the entire feature column. A convenient way to achieve this is by using the Imputer class from scikit-learn, as shown in the following code:
>>> from sklearn.preprocessing import Imputer
>>> imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imr = imr.fit(df.values)
>>> imputed_data = imr.transform(df.values)
>>> imputed_data
array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [ 10. ,  11. ,  12. ,   6. ]])
Here, we replaced each NaN value with the corresponding mean, which is separately calculated for each feature column. If we changed the axis=0 setting to axis=1, we'd calculate the row means instead. Other options for the strategy parameter are median or most_frequent, where the latter replaces the missing values with the most frequent values. This is useful for imputing categorical feature values, for example, a feature column that stores an encoding of color names, such as red, green, and blue; we will encounter examples of such data later in this chapter.
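As a side note, in more recent scikit-learn releases (version 0.20 and later), the Imputer class has been deprecated in favor of the SimpleImputer class in the sklearn.impute module. Assuming such a newer version, the following sketch shows roughly equivalent mean imputation (SimpleImputer always imputes column-wise and has no axis parameter), as well as most-frequent imputation of a small, made-up categorical column:
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> # mean imputation, analogous to the Imputer example above
>>> imr = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imr.fit_transform(df.values)
array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
>>> # most-frequent imputation of a hypothetical categorical column
>>> colors = pd.DataFrame({'color': ['red', 'green', np.nan, 'red']})
>>> SimpleImputer(strategy='most_frequent').fit_transform(colors)
array([['red'],
       ['green'],
       ['red'],
       ['red']], dtype=object)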
Understanding the scikit-learn estimator API
In the previous section, we used the Imputer class from scikit-learn to impute missing values in our dataset. The Imputer class belongs to the so-called transformer classes in scikit-learn, which are used for data transformation. The two essential methods of those estimators are fit and transform. The fit method is used to learn the parameters from the training data, and the transform method uses those parameters to transform the data. Any data array that is to be transformed needs to have the same number of features as the data array that was used to fit the model. The following figure illustrates how a transformer, fitted on the training data, is used to transform a training dataset as well as a new test dataset:

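To make this fit and transform pattern more concrete, here is a minimal sketch that reuses the Imputer from the previous section on a small, made-up training and test array; the column means are learned from the training data only and then applied to both arrays:
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> X_train = np.array([[1., 2.],
...                     [np.nan, 4.],
...                     [5., 6.]])
>>> X_test = np.array([[7., np.nan]])
>>> imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imr = imr.fit(X_train)    # learn the column means from the training data
>>> imr.transform(X_train)    # fill in missing training values
array([[ 1.,  2.],
       [ 3.,  4.],
       [ 5.,  6.]])
>>> imr.transform(X_test)     # fill in missing test values using the training means
array([[ 7.,  4.]])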
The classifiers that we used in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, belong to the so-called estimators in scikit-learn, with an API that is conceptually very similar to the transformer class. Estimators have a predict method but can also have a transform method, as we will see later in this chapter. As you may recall, we also used the fit method to learn the parameters of a model when we trained those estimators for classification. However, in supervised learning tasks, we additionally provide the class labels for fitting the model, which can then be used to make predictions about new data samples via the predict method, as illustrated in the following figure:

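As a minimal sketch of this supervised workflow, the following example fits a logistic regression classifier on a small, made-up training set and then predicts the class labels of two new samples:
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> X_train = np.array([[1., 2.], [2., 1.], [8., 9.], [9., 8.]])
>>> y_train = np.array([0, 0, 1, 1])
>>> lr = LogisticRegression()
>>> lr = lr.fit(X_train, y_train)    # learn the model parameters from the labeled data
>>> lr.predict(np.array([[1.5, 1.5], [8.5, 8.5]]))
array([0, 1])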