How to Use SimpleImputer to Handle Missing Values in Python
What to Do When You Encounter Missing Values in Python
Missing data is a common issue in real-world datasets. If left untreated, it can break models or skew results. The good news? Python’s scikit-learn provides a straightforward way to deal with it using SimpleImputer. Here’s how to use it effectively.
What Is SimpleImputer?
SimpleImputer is a class in sklearn.impute that fills in missing values with a specific strategy, such as:
Mean
Median
Most frequent (mode)
A constant value
It works for both numerical and categorical data. Let’s walk through a basic example using a pandas DataFrame.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
data = {
    'age': [25, 27, np.nan, 35, 40],
    'income': [50000, np.nan, 60000, 58000, np.nan],
    'gender': ['male', 'female', np.nan, 'female', 'male']
}
df = pd.DataFrame(data)
print(df)
The output is as follows:
    age   income  gender
0  25.0  50000.0    male
1  27.0      NaN  female
2   NaN  60000.0     NaN
3  35.0  58000.0  female
4  40.0      NaN    male
We have missing values in all three columns.
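Before imputing anything, it helps to confirm where the gaps are. A quick check on the DataFrame above:
print(df.isna().sum())
This prints the count of missing entries per column: 1 for age, 2 for income, and 1 for gender.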
Handling Missing Data
These gaps often appear as soon as data is imported from external sources, and they can be frustrating to work with. A common remedy is to fill them with simple statistics, which limits the impact on downstream statistical modelling.
Use mean or median for numeric data:
num_imputer = SimpleImputer(strategy='mean')
df[['age', 'income']] = num_imputer.fit_transform(df[['age', 'income']])
This replaces NaNs with the column mean. You could also use 'median' if the data is skewed. Use 'most_frequent' for categorical columns like gender:
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['gender']] = cat_imputer.fit_transform(df[['gender']])
The result is as follows:
     age   income  gender
0  25.00  50000.0    male
1  27.00  56000.0  female
2  31.75  60000.0  female
3  35.00  58000.0  female
4  40.00  56000.0    male
The missing incomes are filled with the column mean (56,000) and the missing age with 31.75. For gender, 'male' and 'female' are tied at two occurrences each; SimpleImputer breaks ties by picking the smallest value, so the gap is filled with 'female'.
Now your dataset is clean and ready for modeling.
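As an alternative, the 'constant' strategy mentioned earlier fills gaps with a fixed value of your choosing rather than a statistic. A minimal sketch, applied to the original (unimputed) gender column, with 'unknown' as an arbitrary placeholder:
const_imputer = SimpleImputer(strategy='constant', fill_value='unknown')
df[['gender']] = const_imputer.fit_transform(df[['gender']])
This keeps the imputed rows explicitly marked instead of blending them in with the observed categories.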
Always fit the imputer on your training data only, then transform both train and test sets. If you’re working in a pipeline, use SimpleImputer as a preprocessing step in a Pipeline or ColumnTransformer; a sketch of that setup follows below. For more complex patterns of missingness, consider KNNImputer or using predictive models to impute.
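Here is a minimal sketch of that setup, assuming the original df with its missing values and a made-up binary target y used purely for illustration:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

y = [0, 1, 0, 1, 0]  # hypothetical target, for illustration only

preprocessor = ColumnTransformer(transformers=[
    # numeric columns: impute with the median
    ('num', SimpleImputer(strategy='median'), ['age', 'income']),
    # categorical columns: impute with the mode, then one-hot encode
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OneHotEncoder(handle_unknown='ignore'))
    ]), ['gender'])
])

model = Pipeline([
    ('preprocess', preprocessor),
    ('clf', LogisticRegression())
])

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
Because the imputers live inside the pipeline, their statistics are computed from the training rows during fit and reused unchanged when scoring the test rows.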
SimpleImputer is a quick and reliable tool for handling missing data.
Now, let’s quickly go back to an interesting topic. I previously mentioned that it’s better to use the median if the data is skewed. Why is that?
When data is skewed, the mean gets dragged in the direction of the skew, while the median stays closer to the center of the actual distribution. That’s why the median is usually a better choice for imputation or summarization in skewed datasets. The mean adds up all values and divides by the count. So even a single extreme value can shift the average significantly.
Example:
values = [10, 12, 11, 13, 200] # 200 is an outlier
mean = sum(values) / len(values) # 49.2
median = sorted(values)[len(values) // 2] # 12
Here, mean = 49.2, median = 12. The mean does not reflect the typical value—it’s skewed by the outlier.
The median is the middle value—literally. It doesn’t care how big or small the extremes are. It just splits the data in half.
In right-skewed data (like income, house prices), the tail pulls the mean rightward. But the median stays put. When filling in missing values, using the mean in skewed data can introduce artificial bias:
It creates values that don’t naturally exist.
It overestimates or underestimates what a "typical" value would be.
The median, by contrast, keeps the imputed values within the realistic center of the data.
Use median:
When the data is right-skewed (e.g. income, transaction amounts).
When there are outliers.
When preserving robustness is more important than mathematical precision.
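A quick way to see the difference is to impute the same skewed column with both strategies (made-up numbers, with one large outlier):
skewed = np.array([[30000.], [32000.], [31000.], [35000.], [250000.], [np.nan]])

mean_imp = SimpleImputer(strategy='mean')
median_imp = SimpleImputer(strategy='median')

print(mean_imp.fit_transform(skewed)[-1])    # [75600.] -- dragged up by the outlier
print(median_imp.fit_transform(skewed)[-1])  # [32000.] -- stays near the typical value
The mean-imputed value sits far above anything a typical row would contain, while the median-imputed value blends in with the bulk of the data.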