Python NumPy: How to Interpolate NaN Values in an Array
Missing data, often represented as NaN (Not a Number) in NumPy arrays, can pose challenges for many numerical computations and analyses. Interpolation is a common technique to estimate these missing values based on the existing data points in the array. For 1D arrays, linear interpolation is a frequently used method where a missing value is estimated by fitting a straight line between its nearest known neighbors.
This guide demonstrates how to perform 1D linear interpolation of NaN values in a NumPy array using the numpy.interp() function. It also covers a convenient alternative: the Series.interpolate() method from the Pandas library.
Understanding Interpolation for Missing Values (NaNs)
Interpolation is the process of estimating unknown values that fall between known data points. When dealing with NaNs in a NumPy array, linear interpolation aims to fill these gaps by assuming a linear relationship between the valid (non-NaN) data points surrounding the NaN.
For example, if we have [1, NaN, 3], linear interpolation would estimate the NaN as 2. If we have [1, NaN, NaN, 4], the two NaNs would be estimated as 2 and 3 respectively, because they sit at evenly spaced positions on the line from 1 to 4.
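The arithmetic behind these two examples can be sketched by hand. The lerp helper below is a hypothetical name introduced only for illustration; it evaluates the straight line through two known points, treating array indices as x-coordinates.

```python
def lerp(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# [1, NaN, 3]: the gap at index 1 lies between index 0 (value 1) and index 2 (value 3)
single_gap = lerp(1, 0, 1.0, 2, 3.0)

# [1, NaN, NaN, 4]: the gaps at indices 1 and 2 lie between index 0 (value 1) and index 3 (value 4)
double_gap = [lerp(i, 0, 1.0, 3, 4.0) for i in (1, 2)]

print(single_gap)  # 2.0
print(double_gap)  # [2.0, 3.0]
```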
Method 1: Using numpy.interp() for 1D Linear Interpolation
The numpy.interp(x, xp, fp) function performs one-dimensional linear interpolation. It finds the values of a function (fp) at new points (x) given a set of known data points (xp, fp), where xp must be monotonically increasing.
How numpy.interp() Works
- x: The x-coordinates at which to evaluate the interpolated values (in our case, the indices of NaNs).
- xp: The x-coordinates of the known data points (indices of non-NaN values). Must be increasing.
- fp: The y-coordinates (values) of the known data points, corresponding to xp.
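As a minimal sketch of how the three arguments fit together, the call below evaluates a piecewise-linear function at two new points. The sample coordinates and values are made up for illustration.

```python
import numpy as np

xp = np.array([0, 2, 3])            # x-coordinates of the known points (increasing)
fp = np.array([10.0, 20.0, 50.0])   # values at those coordinates
x = np.array([1, 2.5])              # points at which to evaluate

result = np.interp(x, xp, fp)
print(result)  # [15. 35.]
```

The value at x=1 falls halfway between 10 and 20, and the value at x=2.5 falls halfway between 20 and 50.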
Step-by-Step Implementation for Interpolating NaNs
To use np.interp() for NaNs in a 1D array:
- Identify the indices of NaN values (these are our x points).
- Identify the indices of non-NaN values (these are our xp points).
- Get the actual non-NaN values (these are our fp points).
- Call np.interp() and assign the results back to the NaN positions in the array.
import numpy as np
array_with_nans = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])
print(f"Original array: {array_with_nans}")
# Make a copy to avoid modifying the original array if needed elsewhere
interpolated_array_np = array_with_nans.copy()
# 1. Find indices of NaNs (our 'x' for interpolation)
nan_indices = np.isnan(interpolated_array_np).nonzero()[0]
print(f"Indices of NaN values (x): {nan_indices}")
# 2. Find indices of non-NaNs (our 'xp')
not_nan_indices = (~np.isnan(interpolated_array_np)).nonzero()[0]
print(f"Indices of non-NaN values (xp): {not_nan_indices}")
# 3. Get the actual non-NaN values (our 'fp')
known_values = interpolated_array_np[~np.isnan(interpolated_array_np)]
# or: known_values = interpolated_array_np[not_nan_indices]
print(f"Known values (fp): {known_values}")
# 4. Perform interpolation and assign back
# Only interpolate if there are NaNs to fill and known points to interpolate from
# (two known points give at least one segment to interpolate along)
if len(known_values) > 1 and len(nan_indices) > 0:
    interpolated_values = np.interp(nan_indices, not_nan_indices, known_values)
    interpolated_array_np[nan_indices] = interpolated_values
print(f"Array after np.interp() interpolation: {interpolated_array_np}")
Output:
Original array: [10. 12. nan 18. nan nan 27. 30.]
Indices of NaN values (x): [2 4 5]
Indices of non-NaN values (xp): [0 1 3 6 7]
Known values (fp): [10. 12. 18. 27. 30.]
Array after np.interp() interpolation: [10. 12. 15. 18. 21. 24. 27. 30.]
A Reusable Function for np.interp()
import numpy as np
def interpolate_nans_with_numpy(array_like):
    """
    Interpolates NaN values in a 1D NumPy array using linear interpolation.
    Handles NaNs at the beginning or end by propagating the nearest valid value (extrapolation).
    """
    arr = np.array(array_like, dtype=float).copy()  # Ensure float for NaNs and copy
    nan_mask = np.isnan(arr)
    if not np.any(nan_mask):  # No NaNs to interpolate
        return arr
    x_coords_all = np.arange(len(arr))
    known_x_coords = x_coords_all[~nan_mask]
    known_y_values = arr[~nan_mask]
    if len(known_y_values) < 2:  # Cannot interpolate with fewer than 2 known points
        # Handle cases: all NaNs, or only one known point (fill all NaNs with it)
        if len(known_y_values) == 1:
            arr[nan_mask] = known_y_values[0]
        return arr  # Or raise error, or return original if all NaNs
    nan_x_coords = x_coords_all[nan_mask]
    interpolated_values = np.interp(nan_x_coords, known_x_coords, known_y_values)
    arr[nan_mask] = interpolated_values
    return arr
# Example usage:
test_array1 = np.array([1, 1, np.nan, 2, 2, np.nan, 3, 3, np.nan])  # np.NaN alias was removed in NumPy 2.0
print(f"Interpolated test_array1: {interpolate_nans_with_numpy(test_array1)}")
test_array2 = np.array([np.nan, 1, np.nan, 2, np.nan])
print(f"Interpolated test_array2: {interpolate_nans_with_numpy(test_array2)}")
Output:
Interpolated test_array1: [1.  1.  1.5 2.  2.  2.5 3.  3.  3. ]
Interpolated test_array2: [1.  1.  1.5 2.  2. ]
For points outside the range of xp, np.interp does not extend the line; by default it returns the first or last fp value (this can be overridden with its left and right parameters). This means NaNs at the very beginning or end of the array will be filled with the nearest valid data point.
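If filling the edges with the nearest value is not what you want, the left and right parameters of np.interp can be used to leave leading and trailing NaNs untouched. The following is a small sketch of that idea; the variable names are illustrative.

```python
import numpy as np

arr = np.array([np.nan, 1.0, np.nan, 2.0, np.nan])
nan_mask = np.isnan(arr)
idx = np.arange(arr.size)

# Default behaviour: edge NaNs take the nearest known value
clamped = arr.copy()
clamped[nan_mask] = np.interp(idx[nan_mask], idx[~nan_mask], arr[~nan_mask])

# left/right set the value returned outside the known range,
# so passing NaN keeps the edge gaps unfilled
inside_only = arr.copy()
inside_only[nan_mask] = np.interp(
    idx[nan_mask], idx[~nan_mask], arr[~nan_mask], left=np.nan, right=np.nan
)

print(clamped)      # edges filled with 1.0 and 2.0
print(inside_only)  # edges remain NaN, interior gap becomes 1.5
```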
Method 2: Using pandas.Series.interpolate() (Convenient Alternative)
If you have Pandas installed, using Series.interpolate() is often more straightforward as it's designed for this kind of task and handles edge cases well.
Converting NumPy Array to Pandas Series
First, convert your NumPy array to a Pandas Series.
import pandas as pd # Make sure pandas is installed: pip install pandas
import numpy as np
array_with_nans = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])
# Convert NumPy array to Pandas Series
pd_series = pd.Series(array_with_nans)
print("Pandas Series from NumPy array:")
print(pd_series)
Output:
Pandas Series from NumPy array:
0 10.0
1 12.0
2 NaN
3 18.0
4 NaN
5 NaN
6 27.0
7 30.0
dtype: float64
Applying Series.interpolate()
The interpolate() method on a Pandas Series fills NaN values. By default, it uses linear interpolation.
import pandas as pd
import numpy as np
# pd_series defined as above
array_with_nans = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])
pd_series = pd.Series(array_with_nans)
# Interpolate NaN values (default method is 'linear')
interpolated_pd_series = pd_series.interpolate(method='linear') # 'linear' is default
print("Pandas Series after .interpolate():")
print(interpolated_pd_series)
Output:
Pandas Series after .interpolate():
0 10.0
1 12.0
2 15.0
3 18.0
4 21.0
5 24.0
6 27.0
7 30.0
dtype: float64
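Pandas' interpolate() offers various methods beyond linear (e.g., 'polynomial', 'spline'; these require SciPy and an order argument), providing more advanced options if needed. It also gives control over NaNs at the beginning/end through the limit_direction parameter ('forward', 'backward', or 'both'). With the default linear method and limit_direction='forward', leading NaNs are left unfilled and trailing NaNs are filled with the last valid value; pass limit_direction='both' to fill both ends.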
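The effect of these options on leading and trailing NaNs can be seen in a short sketch. It assumes a reasonably recent pandas version; limit_area is an additional parameter of Series.interpolate() not discussed above, included here because it restricts filling to gaps surrounded by valid values.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 2.0, np.nan])

forward = s.interpolate()                     # default: limit_direction='forward'
both = s.interpolate(limit_direction="both")  # also fills the leading NaN
inside = s.interpolate(limit_area="inside")   # only fills NaNs between valid values

print(forward.to_numpy())
print(both.to_numpy())
print(inside.to_numpy())
```

In this sketch the default call leaves the leading NaN in place, limit_direction="both" fills both ends with the nearest valid value, and limit_area="inside" fills only the interior gap.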
Converting Back to NumPy Array or List
After interpolation, you can convert the Pandas Series back to a NumPy array or a Python list.
import pandas as pd
import numpy as np
# interpolated_pd_series defined as above
array_with_nans = np.array([10.0, 12.0, np.nan, 18.0, np.nan, np.nan, 27.0, 30.0])
pd_series = pd.Series(array_with_nans)
interpolated_pd_series = pd_series.interpolate(method='linear') # 'linear' is default
# Convert back to NumPy array
result_numpy_array_from_pandas = interpolated_pd_series.to_numpy() # Preferred over .values
print("Interpolated data as NumPy array (from Pandas):")
print(result_numpy_array_from_pandas)
print()
# Convert to Python list
result_list_from_pandas = interpolated_pd_series.tolist()
print("Interpolated data as Python list (from Pandas):")
print(result_list_from_pandas)
Output:
Interpolated data as NumPy array (from Pandas):
[10. 12. 15. 18. 21. 24. 27. 30.]
Interpolated data as Python list (from Pandas):
[10.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]
Limitations and Considerations
- 1D Only: Both numpy.interp() and pandas.Series.interpolate() (as shown) are primarily designed for 1D data. For 2D arrays, you would typically apply these methods row-wise or column-wise in a loop, or use more advanced 2D interpolation techniques (e.g., from scipy.interpolate).
- Monotonic xp for np.interp(): numpy.interp() requires the xp array (indices of known points) to be monotonically increasing. This is naturally satisfied when xp are indices.
- Extrapolation: np.interp() fills NaNs at the beginning/end with the nearest valid value (it clamps rather than extending the line). Pandas' interpolate() offers more control over this with its limit and limit_direction parameters.
- Sufficient Known Points: Linear interpolation needs at least two known data points to interpolate between them. With a single non-NaN value, np.interp() returns that value for every position (which is also what the interpolate_nans_with_numpy function does); with none, it raises a ValueError because xp is empty. Pandas' interpolate() leaves NaNs in place if it cannot find points to interpolate between based on its settings.
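For the 2D case mentioned in the first point, one possible approach is to apply a 1D fill along each row with np.apply_along_axis. This is a sketch rather than a recommended implementation: fill_nans_1d is a hypothetical helper, and rows that are entirely NaN are returned unchanged.

```python
import numpy as np

def fill_nans_1d(row):
    """Linearly interpolate NaNs in one 1D array (hypothetical helper)."""
    row = np.asarray(row, dtype=float).copy()
    mask = np.isnan(row)
    if mask.all() or not mask.any():
        return row  # nothing to fill from, or nothing to fill
    idx = np.arange(row.size)
    row[mask] = np.interp(idx[mask], idx[~mask], row[~mask])
    return row

grid = np.array([
    [1.0, np.nan, 3.0],
    [np.nan, 4.0, 6.0],
])

# axis=1 passes each row to the helper; use axis=0 to work column-wise instead
filled_rows = np.apply_along_axis(fill_nans_1d, 1, grid)
print(filled_rows)
```

Each row is treated independently, so the leading NaN in the second row is filled with 4.0 (the nearest value in that row) rather than anything from the row above.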
Conclusion
Interpolating NaN values is a valuable technique for handling missing data in NumPy arrays.
- For 1D linear interpolation using pure NumPy, the numpy.interp() function provides the core mechanism. You need to manually identify the NaN and non-NaN positions and values to feed into np.interp().
- If Pandas is available, converting your NumPy array to a pandas.Series and using its interpolate() method (e.g., pd.Series(arr).interpolate().to_numpy()) is often a more convenient and robust approach, offering more built-in options and handling of edge cases like leading/trailing NaNs.
Choose the method that best suits your project's dependencies and the complexity of your interpolation needs. For simple 1D linear cases, both are effective.