How to Use SimpleImputer to Handle Missing Values in Python
What to Do When You Encounter Missing Values in Python
Missing data is a common issue in real-world datasets. If left untreated, it can break models or skew results. The good news? Python’s scikit-learn provides a straightforward way to deal with it using SimpleImputer. Here’s how to use it effectively.
What Is SimpleImputer?
SimpleImputer is a class in sklearn.impute that fills in missing values with a specific strategy, such as:
Mean
Median
Most frequent (mode)
A constant value
It works for both numerical and categorical data. Let’s walk through a basic example using a pandas DataFrame.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
data = {
    'age': [25, 27, np.nan, 35, 40],
    'income': [50000, np.nan, 60000, 58000, np.nan],
    'gender': ['male', 'female', np.nan, 'female', 'male']
}
df = pd.DataFrame(data)
print(df)
The output is as follows:
    age   income  gender
0  25.0  50000.0    male
1  27.0      NaN  female
2   NaN  60000.0     NaN
3  35.0  58000.0  female
4  40.0      NaN    male
We have missing values in all three columns.
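Before imputing anything, it helps to confirm where the gaps are. A quick check on the DataFrame above:
print(df.isna().sum())
This prints the count of missing entries per column: 1 for age, 2 for income, and 1 for gender.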
Handling Missing Data
These gaps often appear as soon as data is imported from external sources, and they can be frustrating to work with. A common remedy is to fill them with simple statistics, which limits the impact on downstream statistical modelling.
Use mean or median for numeric data:
num_imputer = SimpleImputer(strategy='mean')
df[['age', 'income']] = num_imputer.fit_transform(df[['age', 'income']])
This replaces NaNs with the column mean. You could also use 'median' if the data is skewed. Use 'most_frequent' for categorical columns like gender:
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['gender']] = cat_imputer.fit_transform(df[['gender']])
The result is as follows:
     age   income  gender
0  25.00  50000.0    male
1  27.00  56000.0  female
2  31.75  60000.0  female
3  35.00  58000.0  female
4  40.00  56000.0    male
The missing incomes are filled with the column mean (56,000) and the missing age with 31.75. For gender, 'male' and 'female' are tied at two occurrences each; SimpleImputer breaks ties by picking the smallest value, so the gap is filled with 'female'.
Now your dataset is clean and ready for modeling.
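As an alternative, the 'constant' strategy mentioned earlier fills gaps with a fixed value of your choosing rather than a statistic. A minimal sketch, applied to the original (unimputed) gender column, with 'unknown' as an arbitrary placeholder:
const_imputer = SimpleImputer(strategy='constant', fill_value='unknown')
df[['gender']] = const_imputer.fit_transform(df[['gender']])
This keeps the imputed rows explicitly marked instead of blending them in with the observed categories.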
Always fit the imputer on your training data only, then transform both train and test sets. If you’re working in a pipeline, use SimpleImputer as a preprocessing step in a Pipeline or ColumnTransformer; a sketch of that setup follows below. For more complex patterns of missingness, consider KNNImputer or using predictive models to impute.
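Here is a minimal sketch of that setup, assuming the original df with its missing values and a made-up binary target y used purely for illustration:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

y = [0, 1, 0, 1, 0]  # hypothetical target, for illustration only

preprocessor = ColumnTransformer(transformers=[
    # numeric columns: impute with the median
    ('num', SimpleImputer(strategy='median'), ['age', 'income']),
    # categorical columns: impute with the mode, then one-hot encode
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OneHotEncoder(handle_unknown='ignore'))
    ]), ['gender'])
])

model = Pipeline([
    ('preprocess', preprocessor),
    ('clf', LogisticRegression())
])

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
Because the imputers live inside the pipeline, their statistics are computed from the training rows during fit and reused unchanged when scoring the test rows.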
SimpleImputer is a quick and reliable tool for handling missing data.
Now, let’s quickly go back to an interesting topic. I previously mentioned that it’s better to use the median if the data is skewed. Why is that?
When data is skewed, the mean gets dragged in the direction of the skew, while the median stays closer to the center of the actual distribution. That’s why the median is usually a better choice for imputation or summarization in skewed datasets. The mean adds up all values and divides by the count. So even a single extreme value can shift the average significantly.
Example:
values = [10, 12, 11, 13, 200] # 200 is an outlier
mean = sum(values) / len(values) # 49.2
median = sorted(values)[len(values) // 2] # 12
Here, mean = 49.2, median = 12. The mean does not reflect the typical value—it’s skewed by the outlier.
The median is the middle value—literally. It doesn’t care how big or small the extremes are. It just splits the data in half.
In right-skewed data (like income, house prices), the tail pulls the mean rightward. But the median stays put. When filling in missing values, using the mean in skewed data can introduce artificial bias:
It creates values that don’t naturally exist.
It overestimates or underestimates what a "typical" value would be.
The median, by contrast, keeps the imputed values within the realistic center of the data.
Use median:
When the data is right-skewed (e.g. income, transaction amounts).
When there are outliers.
When preserving robustness is more important than mathematical precision.
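A quick way to see the difference is to impute the same skewed column with both strategies (made-up numbers, with one large outlier):
skewed = np.array([[30000.], [32000.], [31000.], [35000.], [250000.], [np.nan]])

mean_imp = SimpleImputer(strategy='mean')
median_imp = SimpleImputer(strategy='median')

print(mean_imp.fit_transform(skewed)[-1])    # [75600.] -- dragged up by the outlier
print(median_imp.fit_transform(skewed)[-1])  # [32000.] -- stays near the typical value
The mean-imputed value sits far above anything a typical row would contain, while the median-imputed value blends in with the bulk of the data.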