memory mapping with numpy
Published 2025-05-07 • Updated 2025-05-08
Memory mapping is a great way to work with GB+ datasets: you keep the mapping in memory rather than the file itself, and the OS pages in only the parts of the dataset you actually access.
```python
import numpy as np

def normalize_large_array(filename_in, filename_out, shape):
    # Map the input read-only and the output read-write (w+ creates the file).
    src = np.memmap(filename_in, dtype='float32', mode='r', shape=shape)
    dst = np.memmap(filename_out, dtype='float32', mode='w+', shape=shape)

    mean = np.mean(src)
    std = np.std(src)
    dst[:] = (src - mean) / std
    dst.flush()  # make sure the result is written back to disk
```

For most bioinformatics use cases, streaming files in chunks or lines is still the way to go. Memory mapping is most useful when you need lots of random lookups rather than a single pass through the file, especially with fixed-width numeric values rather than variable-length ASCII lines.
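To illustrate the random-lookup case, here is a minimal sketch: with fixed-width records, a row's byte offset is just `row_index * row_bytes`, so a memmap can pull out one row without reading anything else. The filename and shape below are made up for the example.

```python
import numpy as np

# Hypothetical fixed-width matrix: 10,000 rows x 64 float32 columns (~2.5 MB).
rows, cols = 10_000, 64

# Create a sample file once; in practice this file already exists on disk.
np.memmap('scores.dat', dtype='float32', mode='w+', shape=(rows, cols)).flush()

# Open read-only: only the pages backing the rows we touch get loaded.
scores = np.memmap('scores.dat', dtype='float32', mode='r', shape=(rows, cols))

row = scores[7_421]           # one row: a 256-byte read, not the whole file
window = scores[500:510, :8]  # slices are views into the mapping, not copies
```

Contrast this with a text format like FASTA or VCF, where lines vary in length: there is no way to compute the offset of record *n*, so you end up scanning anyway and the mapping buys you little.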