memory mapping with numpy

Published 2025-05-07 • Updated 2025-05-08

np.memmap is a great way to work with GB+ datasets: you keep a mapping to the file in memory rather than the file itself, and only the parts of the dataset you actually access get read from disk.
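For instance (the file name and shape here are made up; assume data.bin holds float32 values written earlier):

import numpy as np

# Open the file as an array without reading it; pages load on first access.
arr = np.memmap('data.bin', dtype='float32', mode='r',
                shape=(1_000_000, 128))

row = arr[42]        # a view; only these pages are read when the data is used
first = arr[:1000]   # same idea for a slice of rows

A fuller example, normalizing a whole array file to zero mean and unit variance: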

import numpy as np

def normalize_large_array(filename_in, filename_out, shape):
    # Map input read-only and output writable; neither is loaded into RAM up front.
    src = np.memmap(filename_in, dtype='float32', mode='r', shape=shape)
    dst = np.memmap(filename_out, dtype='float32', mode='w+', shape=shape)
    mean = np.mean(src)  # reductions stream through the file
    std = np.std(src)
    dst[:] = (src - mean) / std
    dst.flush()  # push written pages out to disk
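
One caveat: (src - mean) / std builds the whole result as an ordinary in-memory array before it's copied into dst, so for files bigger than RAM you'd want to normalize in chunks. A sketch (rows_per_chunk is an arbitrary choice):

import numpy as np

def normalize_large_array_chunked(filename_in, filename_out, shape,
                                  rows_per_chunk=4096):
    src = np.memmap(filename_in, dtype='float32', mode='r', shape=shape)
    dst = np.memmap(filename_out, dtype='float32', mode='w+', shape=shape)
    mean = src.mean(dtype=np.float64)  # float64 accumulator for precision
    std = src.std(dtype=np.float64)
    # One row-slice at a time keeps the temporaries small.
    for start in range(0, shape[0], rows_per_chunk):
        stop = start + rows_per_chunk
        dst[start:stop] = (src[start:stop] - mean) / std
    dst.flush()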

For most bioinformatics use cases, streaming files in chunks/lines is the way to go. Memory mapping is most useful when you need to do lots of random lookups (rather than a single pass through the file), especially with fixed-width numeric values rather than variable-length ASCII lines.
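
For example, pulling a handful of scattered rows out of a huge fixed-width matrix (file name, shape, and indices are made up):

import numpy as np

scores = np.memmap('scores.bin', dtype='float32', mode='r',
                   shape=(50_000_000, 4))

for i in (12, 9_000_000, 43_210_987):  # arbitrary row indices
    print(scores[i])  # each lookup reads a few pages, not the whole file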