n-gram encoding
Published 2025-05-21 • Updated 2025-10-11
A variant of target encoding; the idea is to add features which cross categorical values (such as a “PublicationDay_PublicationTime” feature in Kaggle’s Predict Podcast Listening Time dataset, which might contain “Sunday_Afternoon”) and encode the mean of that specific categorical value. This introduces an interaction that a model might not otherwise pick up on.
Danger: crossing high-cardinality features can quickly blow up feature count. Typical mitigation strategies include limiting to 2- or 3-gram, and frequency filtering.