masked arrays and np.nan* functions
Published 2025-05-09
TIL about masked arrays in numpy (`np.ma`) and the `np.nan*` functions. Both deal cleanly with `np.nan` values in arrays.
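For a minimal illustration of what "deal cleanly" means here (this example isn't part of the solutions below): a plain mean over a row containing a NaN is itself NaN, while both approaches skip it.

```python
import numpy as np
import numpy.ma as ma

row = np.array([1.0, np.nan, 3.0])
print(row.mean())                     # nan -- the NaN poisons the plain mean
print(np.nanmean(row))                # 2.0 -- nan* functions ignore NaNs
print(ma.masked_invalid(row).mean())  # 2.0 -- masked arrays ignore masked entries
```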
Example problem: given a 2D array where each row is a signal that could contain NaNs, return the z-score of each element within its row.
My initial solution:
```python
import numpy as np
import numpy.ma as ma

def row_nan_zscore(matrix):
    masked = ma.masked_invalid(matrix)
    z = (masked - masked.mean(axis=1, keepdims=True)) / masked.std(axis=1, keepdims=True)
    return z
```

I used masked arrays at first because it seemed cleaner not to have to fill in NaNs after the operations, which would be necessary with the `np.nan*` functions.
Concerns/problems with this solution:
- Returns a `masked_array` rather than a plain `ndarray`, which could be problematic if any downstream code expects an `ndarray`. Solution: return `z.filled(np.nan)` or `z.data`.
- If a row is all NaN or has zero variance, `masked.std` returns 0, causing a divide by zero. The `masked_array` handles this silently by masking the invalid values: `z.filled(np.nan)` returns NaNs for the invalid rows, while `z.data` returns 0s for them (except the initial NaNs, which remain NaN). I assume it returns 0s because subtracting the mean leaves a row of 0s? But it's still unclear why `0 / 0` doesn't give `nan`. Anyway, this means I prefer filling with NaNs; never rely on `.data` when `.mask.any() == np.True_`. (See the short demo after this list.)
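A minimal sketch of the difference, assuming the initial `row_nan_zscore` above is in scope (the input array is made up for illustration):

```python
x = np.array([[1.0, 2.0, np.nan, 4.0],   # row with a NaN
              [5.0, 5.0, 5.0, 5.0]])     # zero-variance row
z = row_nan_zscore(x)
print(z.mask)            # True wherever the result is invalid: the NaN slot and the whole second row
print(z.filled(np.nan))  # masked entries come back as NaN
print(z.data)            # raw buffer -- values in masked slots aren't meaningful (here: 0s, plus the original NaN)
```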
So, a cleaned-up solution with np.ma (written with temporaries so it mirrors the np.nan* approach that follows):
```python
def row_nan_zscore_mask(matrix):
    masked = ma.masked_invalid(matrix)
    mean = masked.mean(axis=1, keepdims=True)
    std = masked.std(axis=1, keepdims=True)
    z = (masked - mean) / std
    return z.filled(np.nan)
```

Now using the `np.nan*` functions:
```python
def row_nan_zscore(matrix):
    mean = np.nanmean(matrix, axis=1, keepdims=True)
    std = np.nanstd(matrix, axis=1, keepdims=True)
    z = (matrix - mean) / std
    z[np.isnan(matrix)] = np.nan
    return z
```

The `z[np.isnan(matrix)] = np.nan` line isn't really necessary, since NaNs propagate through the previous operations, but I kept it for clarity.
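As a quick sanity check (assuming both functions above are defined), the two versions should agree element-wise when NaNs are treated as equal:

```python
x = np.array([[1.0, 2.0, np.nan],
              [4.0, 5.0, 6.0]])
print(np.allclose(row_nan_zscore_mask(x), row_nan_zscore(x), equal_nan=True))  # True
```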
The `np.nan*` approach is about 2.5x faster than `np.ma` for a `(10_000, 1_000)` array.
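One way to reproduce the comparison (a rough timeit sketch, assuming the two functions above; the 5% NaN density is my choice here, and the numbers will vary by machine):

```python
import timeit

rng = np.random.default_rng(0)
matrix = rng.standard_normal((10_000, 1_000))
matrix[rng.random(matrix.shape) < 0.05] = np.nan  # sprinkle in some NaNs

for fn in (row_nan_zscore_mask, row_nan_zscore):
    seconds = timeit.timeit(lambda: fn(matrix), number=10)
    print(f"{fn.__name__}: {seconds / 10:.3f} s per call")
```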