masked arrays and np.nan* functions

Published 2025-05-09

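TIL about masked arrays in NumPy (np.ma) and the np.nan* family of functions (np.nanmean, np.nanstd, and friends). Both deal cleanly with np.nan values in arrays: the first hides them behind a mask, the second skips them during reductions.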
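
To make the difference concrete, here is a minimal sketch on a toy 1D array (the `signal` values are made up for illustration). It compares the ordinary mean with the two NaN-aware routes:

```python
import numpy as np
import numpy.ma as ma

# Toy input, invented for this demo.
signal = np.array([1.0, np.nan, 3.0])

plain_mean = np.mean(signal)                    # a single NaN poisons the ordinary mean
nan_mean = np.nanmean(signal)                   # skips NaN entries while reducing
masked_mean = ma.masked_invalid(signal).mean()  # masks NaN entries, then averages the rest

print(plain_mean, nan_mean, masked_mean)
```

Both NaN-aware versions average only the two valid values, while np.mean returns NaN.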

Example problem: given a 2D array where each row is a signal that could contain NaNs, return the z-score of each element within its row.

My initial solution:

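import numpy as np
import numpy.ma as ma

def row_nan_zscore(matrix):
    # Mask NaN (and inf) entries so the row-wise reductions skip them.
    masked = ma.masked_invalid(matrix)
    z = (masked - masked.mean(axis=1, keepdims=True)) / masked.std(axis=1, keepdims=True)
    return z  # note: this is still a masked array, not a plain ndarray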

I used masked arrays at first because it seemed cleaner: I assumed the np.nan* route would force me to fill the NaNs back in after the operations.

Concerns/problems with this solution:

So a cleaned-up solution with np.ma, which fills the masked entries back in with NaN and returns a plain array (and uses temporaries so it lines up with the np.nan* approach that follows):

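def row_nan_zscore_mask(matrix):
    masked = ma.masked_invalid(matrix)
    # Reductions on a masked array ignore the masked (NaN) entries.
    mean = masked.mean(axis=1, keepdims=True)
    std = masked.std(axis=1, keepdims=True)
    z = (masked - mean) / std
    # Convert back to a plain ndarray, with NaN wherever a value was masked.
    return z.filled(np.nan)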
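
One visible difference between the two np.ma versions is the return type. A small sketch of that, where `zscore_masked`, its `fill` switch, and the `demo` matrix are hypothetical names and values I made up just for this comparison:

```python
import numpy as np
import numpy.ma as ma

def zscore_masked(matrix, fill=True):
    # Same computation as above; `fill` is a hypothetical switch added only for this demo.
    masked = ma.masked_invalid(matrix)
    z = (masked - masked.mean(axis=1, keepdims=True)) / masked.std(axis=1, keepdims=True)
    return z.filled(np.nan) if fill else z

# Toy input, invented for this demo: one NaN per row.
demo = np.array([[1.0, 2.0, np.nan, 3.0],
                 [4.0, np.nan, 6.0, 8.0]])

raw = zscore_masked(demo, fill=False)    # MaskedArray: NaN positions are masked out
filled = zscore_masked(demo, fill=True)  # plain ndarray: NaN positions hold np.nan

print(type(raw).__name__, type(filled).__name__)
print(np.isnan(filled))
```

Returning the filled array means callers that know nothing about np.ma get ordinary NaN semantics back.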

Now using np.nan* functions:

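def row_nan_zscore(matrix):
    # nanmean/nanstd skip NaNs when reducing along each row.
    mean = np.nanmean(matrix, axis=1, keepdims=True)
    std = np.nanstd(matrix, axis=1, keepdims=True)
    z = (matrix - mean) / std
    # Redundant (NaN already propagates through the arithmetic), kept to make the intent explicit.
    z[np.isnan(matrix)] = np.nan
    return z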

The z[np.isnan(matrix)] = np.nan line isn’t really necessary, since NaNs propagate through the previous operations, but I kept it for clarity.
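
A quick way to convince yourself of the propagation claim is to drop that line and compare NaN positions. This is only a sketch on a made-up `demo` matrix, and it assumes every row has some spread (a constant row would give 0/0 and add NaNs of its own):

```python
import numpy as np

# Toy input, invented for this demo: one NaN per row, no constant rows.
demo = np.array([[1.0, 2.0, np.nan, 3.0],
                 [4.0, np.nan, 6.0, 8.0]])

mean = np.nanmean(demo, axis=1, keepdims=True)
std = np.nanstd(demo, axis=1, keepdims=True)
z = (demo - mean) / std  # no explicit NaN reassignment

# NaNs in the output sit exactly where they were in the input.
same_positions = np.array_equal(np.isnan(z), np.isnan(demo))
print(same_positions)
```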

The np.nan* approach is about 2.5x faster than np.ma for a (10_000, 1_000) array.
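
If you want to try the comparison yourself, something like the following should work. It is a rough sketch, not the exact benchmark behind the number above: the array is smaller to keep the run short, and the seed, the 10% NaN rate, the repeat count, and the function names are all my own assumptions, so the ratio you see may differ.

```python
import timeit
import numpy as np
import numpy.ma as ma

def zscore_ma(matrix):
    # Masked-array route, as in the cleaned-up np.ma version.
    masked = ma.masked_invalid(matrix)
    z = (masked - masked.mean(axis=1, keepdims=True)) / masked.std(axis=1, keepdims=True)
    return z.filled(np.nan)

def zscore_nan(matrix):
    # np.nan* route, without the redundant NaN reassignment.
    mean = np.nanmean(matrix, axis=1, keepdims=True)
    std = np.nanstd(matrix, axis=1, keepdims=True)
    return (matrix - mean) / std

rng = np.random.default_rng(0)               # fixed seed, arbitrary choice
data = rng.normal(size=(1_000, 1_000))       # smaller than the post's array
data[rng.random(data.shape) < 0.1] = np.nan  # assumed 10% NaN rate

# Sanity check that both routes produce the same values before timing them.
agree = np.allclose(zscore_ma(data), zscore_nan(data), equal_nan=True)

t_ma = timeit.timeit(lambda: zscore_ma(data), number=5)
t_nan = timeit.timeit(lambda: zscore_nan(data), number=5)
print(f"agree={agree}  np.ma: {t_ma:.3f}s  np.nan*: {t_nan:.3f}s  ratio: {t_ma / t_nan:.2f}x")
```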