masked arrays and np.nan* functions
Published 2025-05-09
TIL about masked arrays in numpy (`np.ma`) and the `np.nan*` functions. Both deal cleanly with `np.nan` values in arrays.
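For a minimal illustration of what "deal cleanly" means here (this example isn't part of the solutions below): a plain mean over a row containing a NaN is itself NaN, while both approaches skip it.

```python
import numpy as np
import numpy.ma as ma

row = np.array([1.0, np.nan, 3.0])
print(row.mean())                     # nan -- the NaN poisons the plain mean
print(np.nanmean(row))                # 2.0 -- nan* functions ignore NaNs
print(ma.masked_invalid(row).mean())  # 2.0 -- masked arrays ignore masked entries
```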
Example problem: given a 2D array where each row is a signal that could contain NaNs, return the z-score of each element within its row.
My initial solution:
```python
import numpy as np
import numpy.ma as ma

def row_nan_zscore(matrix):
    masked = ma.masked_invalid(matrix)
    z = (masked - masked.mean(axis=1, keepdims=True)) / masked.std(axis=1, keepdims=True)
    return z
```

I used masked arrays at first because it seemed cleaner not to have to fill in NaNs after the operations, which would be necessary with the `np.nan*` functions.
Concerns/problems with this solution:
- Returns a `masked_array` rather than a plain `ndarray`, which could be problematic if any downstream code expects an `ndarray`. Solution: return `z.filled(np.nan)` or `z.data`.
- If a row is all NaN or has zero variance, `masked.std` returns 0, causing a divide by zero. The `masked_array` handles this silently by masking the invalid values: `z.filled(np.nan)` returns NaNs for the invalid rows, while `z.data` returns 0s for them (except the initial NaNs, which remain NaN). I assume it returns 0s because subtracting the mean leaves a row of 0s? But it's still unclear why `0 / 0` doesn't give `nan`. Anyway, this means I prefer filling with NaNs; never rely on `.data` when `.mask.any() == np.True_`. (See the short demo after this list.)
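A minimal sketch of the difference, assuming the initial `row_nan_zscore` above is in scope (the input array is made up for illustration):

```python
x = np.array([[1.0, 2.0, np.nan, 4.0],   # row with a NaN
              [5.0, 5.0, 5.0, 5.0]])     # zero-variance row
z = row_nan_zscore(x)
print(z.mask)            # True wherever the result is invalid: the NaN slot and the whole second row
print(z.filled(np.nan))  # masked entries come back as NaN
print(z.data)            # raw buffer -- values in masked slots aren't meaningful (here: 0s, plus the original NaN)
```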
So, a cleaned-up solution with np.ma (written with temporaries so it mirrors the np.nan* approach that follows):
```python
def row_nan_zscore_mask(matrix):
    masked = ma.masked_invalid(matrix)
    mean = masked.mean(axis=1, keepdims=True)
    std = masked.std(axis=1, keepdims=True)
    z = (masked - mean) / std
    return z.filled(np.nan)
```

Now using the `np.nan*` functions:
```python
def row_nan_zscore(matrix):
    mean = np.nanmean(matrix, axis=1, keepdims=True)
    std = np.nanstd(matrix, axis=1, keepdims=True)
    z = (matrix - mean) / std
    z[np.isnan(matrix)] = np.nan
    return z
```

The `z[np.isnan(matrix)] = np.nan` line isn't really necessary, since NaNs propagate through the previous operations, but I kept it for clarity.
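As a quick sanity check (assuming both functions above are defined), the two versions should agree element-wise when NaNs are treated as equal:

```python
x = np.array([[1.0, 2.0, np.nan],
              [4.0, 5.0, 6.0]])
print(np.allclose(row_nan_zscore_mask(x), row_nan_zscore(x), equal_nan=True))  # True
```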
The `np.nan*` approach is about 2.5x faster than `np.ma` for a `(10_000, 1_000)` array.
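One way to reproduce the comparison (a rough timeit sketch, assuming the two functions above; the 5% NaN density is my choice here, and the numbers will vary by machine):

```python
import timeit

rng = np.random.default_rng(0)
matrix = rng.standard_normal((10_000, 1_000))
matrix[rng.random(matrix.shape) < 0.05] = np.nan  # sprinkle in some NaNs

for fn in (row_nan_zscore_mask, row_nan_zscore):
    seconds = timeit.timeit(lambda: fn(matrix), number=10)
    print(f"{fn.__name__}: {seconds / 10:.3f} s per call")
```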