Predict Podcast Listening Time
Published 2025-05-14 • Updated 2025-10-11
This playground competition has already ended, so I’m working through it to learn from the top solutions.
Leaderboard stats for reference:
count    3000.000000
mean       13.918805
std         3.613350
min        11.448330
25%        12.556115
50%        12.897135
75%        13.117038
max        35.372890

I’d like to get top 10% (around 11.74) after implementing concepts from the top solutions.
Data
Columns and datatypes:
id                             int64
Podcast_Name                   object
Episode_Title                  object
Episode_Length_minutes         float64
Genre                          object
Host_Popularity_percentage     float64
Publication_Day                object
Publication_Time               object
Guest_Popularity_percentage    float64
Number_of_Ads                  float64
Episode_Sentiment              object
Listening_Time_minutes         float64

Columns with missing values:
- Episode_Length_minutes: 87093 NaNs, or 11.6%
- Guest_Popularity_percentage: 146030 NaNs, or 19.5%
- Number_of_Ads: 1 NaN, negligible
The target is Listening_Time_minutes, but it’s not totally clear what this means. I’m guessing it’s the listening time per person per episode. Target stats:
count    750000.000000
mean         45.437406
std          27.138306
min           0.000000
25%          23.178350
50%          43.379460
75%          64.811580
max         119.970000

The correlation table between numerical columns indicates that Episode_Length_minutes is by far the greatest predictor of the target, with a correlation coefficient of 0.92.
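A minimal sketch of how a correlation-with-target ranking like that can be produced with pandas. The data here is a synthetic stand-in (the column names match the competition, but the values and the 0.7 slope are made up), so the printed numbers will not match the 0.92 above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv; values and the 0.7 slope are invented.
rng = np.random.default_rng(0)
n = 1_000
length = rng.uniform(5, 120, n)
demo = pd.DataFrame({
    "Episode_Length_minutes": length,
    "Host_Popularity_percentage": rng.uniform(0, 100, n),
    "Number_of_Ads": rng.integers(0, 4, n).astype(float),
    "Listening_Time_minutes": 0.7 * length + rng.normal(0, 8, n),
})

# Pearson correlation of every numeric column with the target,
# sorted by absolute value so the strongest predictor comes first.
target = "Listening_Time_minutes"
corr_with_target = (
    demo.select_dtypes("number")
    .corr()[target]
    .drop(target)
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```

On the real data, replacing `demo` with the loaded training frame should give the same kind of ranking.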
My approach
ideas:
- most of the object columns can just get one hot encoded. Podcast_Name has 48 unique values (presumably because most of this data was artificially augmented from a much smaller data set), so I think I’ll OHE that too
- Episode_Title entries all take the format Episode X, where X is a number between 1 and 100. I’ll convert this column into an int feature, since I suspect that there could be some kind of correlation between episode number and listening time (maybe higher episode numbers tend to be listened to more by the ‘hardcore’ listeners, who would tend to listen longer)
- This may be a bad idea but I’ll use a model to predict the missing Episode_Length_minutes and Guest_Popularity_percentage values
preprocessing:
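import numpy as np
import pandas as pd
from xgboost import XGBRegressor

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
df_all = pd.concat((train, test), ignore_index=True)

def process_episode_title(episode_title):
    # "Episode 42" -> "42"; the caller casts the result to an integer dtype
    return episode_title.split(" ")[1]
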
def preprocess_base(data):
    # one-hot encode the categorical columns
    data = pd.concat(
        (
            data,
            pd.get_dummies(data["Podcast_Name"]),
            pd.get_dummies(data["Genre"]),
            pd.get_dummies(data["Publication_Day"]),
            pd.get_dummies(data["Publication_Time"]),
            pd.get_dummies(data["Episode_Sentiment"]),
        ),
        axis=1,
    )
    data["Episode_Title"] = data["Episode_Title"].apply(process_episode_title).astype("int16")

    # dealing with NaNs
    data["Number_of_Ads"] = data["Number_of_Ads"].fillna(data["Number_of_Ads"].median())

    data = data.drop(columns=["id", "Podcast_Name", "Genre", "Publication_Day", "Publication_Time", "Episode_Sentiment"])
    return data

def add_nans_float(data, cols):
    # Impute each column in `cols` in order with an XGBoost model trained on the
    # rows where it is present. Columns later in `cols` are excluded as features
    # because they still contain NaNs at that point.
    for i, column in enumerate(cols):
        params = {
            "n_estimators": 1000,
            "random_state": 1,
            "tree_method": "hist",
            "device": "cuda",
        }
        model = XGBRegressor(**params)

        mask = data[column].isna()
        train_data = data[~mask].drop(columns=cols[i+1:])
        pred_data = data[mask].drop(columns=cols[i+1:])

        X = train_data.drop(column, axis=1)
        y = train_data[column]
        model.fit(X, y)

        y_pred = model.predict(pred_data.drop(column, axis=1))
        y_pred = np.clip(y_pred, 0, None)  # both columns are non-negative

        assert mask.sum() == y_pred.size, "size mismatch!"
        data.loc[mask, column] = y_pred

    return data

def preprocess(data):
    data = preprocess_base(data)
    data = add_nans_float(data, ["Episode_Length_minutes", "Guest_Popularity_percentage"])
    return data

train = preprocess(train)

I got a score of 13.19146 with this approach, which puts me at around the 80th percentile. I’m a bit surprised it’s so bad. I also tried adding squares of the numerical columns but that didn’t help so I guess there’s not much to be gained from polynomial features; however, it may be worth adding polynomial interactions between features.
After training on both train and test data for add_nans_float and adding NaN flag features, I got a score of 13.00851, ~65th percentile. I think the NaN flag is an important takeaway from this; the model needs to know where a value was predicted. I wonder if the NaN flag somehow tells the model “this is a low-confidence value”.
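A rough sketch of the NaN flag idea, assuming the flags are plain 0/1 indicator columns added before imputation. The `add_nan_flags` helper and the `_was_nan` suffix are names made up for this example, not necessarily what I used at the time.

```python
import numpy as np
import pandas as pd

def add_nan_flags(data, cols):
    """Add a 0/1 `<col>_was_nan` column for each column in `cols`.

    Has to run before imputation; afterwards the missingness is gone.
    The `_was_nan` suffix is an arbitrary naming choice for this sketch.
    """
    data = data.copy()
    for col in cols:
        data[f"{col}_was_nan"] = data[col].isna().astype("int8")
    return data

# Tiny made-up frame just to show the shape of the output.
demo = pd.DataFrame({
    "Episode_Length_minutes": [30.0, np.nan, 75.5],
    "Guest_Popularity_percentage": [np.nan, np.nan, 12.0],
})
flagged = add_nan_flags(demo, ["Episode_Length_minutes", "Guest_Popularity_percentage"])
print(flagged)
```

In a pipeline like the one above, this would sit between the base preprocessing and the model-based imputation, so the imputed columns and their flags both reach the final model.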
Looking at the top solutions
I started with the top solution, by Chris Deotte. It’s pretty complex.
Thoughts and ideas:
- Apparently the top notebooks are using really high max depths for some models
- What led him to a stack was training an XGB on all his previous experiments’ OOF (out of fold) and test preds, and his score jumped to the top. I think the idea behind that is that if you train a model on an ensemble of models’ outputs, you get something that mixes all of them, and for this dataset where there are a bunch of “if this then that” relationships in the data, that’s somehow better than each individual model
- I like the idea of combining a bunch of different models that approach the data in different ways
- Look into target encoding
- Stacking is so effective with this data set because there are so many missing values in the most important feature by far
- Stacks excel at extracting feature interaction
- The main problem to solve in this challenge is what do you do when the main predictor (ELM) is missing? When it’s missing all the other features matter a lot more
- “All in all, I trained around 100 models. From the beginning, I saved every single OOF and test preds as parquet files. My Hill Climbing script reads the OOFs and computes roughly optimal blending weights (both positive and negative). Every now and then, I ran the Hill Climber and removed those OOFs and test predictions that weren’t assigned any weights to speed up hill climbing. Unexpectedly, even some very early OOFs from before I added TE were still assigned some weight. The numbers given in the description above are what remained, i.e. what Hill Climbing still considered relevant.” from Johannes Heller
- Need to learn about hill climbing vs stacking
- Try making a model that trains on all columns except ELM and add it to the stack — it provides the meta-learner with a backup option for rows where ELM is missing
- Figure out AutoGluon and n-gram feature interactions from this
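To make the hill climbing idea from the quote above more concrete, here is a sketch of one common greedy variant: start from the best single model's OOF predictions, then repeatedly mix in whichever model and weight (positive or negative) lowers the blended OOF RMSE the most. This is an assumption about the general technique, not a reconstruction of Johannes Heller's script; the weight grid, tolerance, and synthetic data are all made up.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def hill_climb(oofs, y, weights=np.arange(-0.5, 0.51, 0.01), tol=1e-6):
    """Greedy blend search over OOF predictions.

    oofs: array of shape (n_models, n_rows); y: array of shape (n_rows,).
    Returns (per-model blend weights, blended OOF RMSE).
    """
    n_models = oofs.shape[0]
    scores = [rmse(y, p) for p in oofs]
    best = int(np.argmin(scores))
    blend_w = np.zeros(n_models)
    blend_w[best] = 1.0
    blend, best_score = oofs[best].copy(), scores[best]

    while True:
        best_step = None
        # Try mixing every model into the current blend at every grid weight.
        for k in range(n_models):
            for w in weights:
                s = rmse(y, (1 - w) * blend + w * oofs[k])
                if s < best_score - tol and (best_step is None or s < best_step[0]):
                    best_step = (s, k, w)
        if best_step is None:
            break  # no candidate improves the score by more than tol
        best_score, k, w = best_step
        blend = (1 - w) * blend + w * oofs[k]
        blend_w *= 1 - w   # existing weights shrink...
        blend_w[k] += w    # ...and model k absorbs the rest
    return blend_w, best_score

# Synthetic OOFs: three "models" that are the target plus independent noise.
rng = np.random.default_rng(0)
y = rng.uniform(0, 120, 2_000)
oofs = np.stack([y + rng.normal(0, s, y.size) for s in (12, 13, 20)])
blend_w, blend_rmse = hill_climb(oofs, y)
print(blend_w, blend_rmse)
```

The same weights would then be applied to the saved test predictions. The contrast with stacking is that hill climbing only learns one global weight per model, while a stacker can weight models differently depending on the row.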
Next up: training these models on GPU using cuDF and cuML. Training on CPU takes way too long and takes too much memory; using GPU compute + memory will allow me to use my laptop for other things while training.
Using cuDF and cuML
I installed the RAPIDS library (installation process) and made some minor edits to the pipeline to make it entirely GPU-based. Now I’m running much larger models (which kinda feels like cheating), so without adjusting the preprocessing — just running larger models — I’ve gone from a score of 13.01 (~65th percentile) to 12.69 (~33rd).
Roadmap from here
- implement stacking
- figure out some level 0s
- figure out how to keep OOFs, especially how to compute OOFs of every single row
- stack
- write about stacking, why it’s so effective here
- look at the 2nd place solution and figure out how it was so effective with just a single model
Level 0s
preprocessing
- predict ELM and GPP
- drop ELM
- drop ELM and GPP
models
- XGB
- Lasso
- SVR
- KNN Regressor
- Random Forest
- AutoGluon
- LightGBM
- CatBoost?
hyperparams
- high max depth
- low max depth
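One possible way to lay out that grid of level 0 configurations, as a sketch only. The variant names, the `drop_columns` helper, and the two `max_depth` values are placeholders for illustration, and the "keep_all" variant leaves out the model-based imputation step.

```python
from itertools import product

import pandas as pd

ELM = "Episode_Length_minutes"
GPP = "Guest_Popularity_percentage"

def drop_columns(cols):
    """Return a preprocessing step that drops `cols` if present."""
    def step(data):
        return data.drop(columns=[c for c in cols if c in data.columns])
    return step

# Hypothetical names; "keep_all" stands in for the predict-ELM-and-GPP variant.
proc_variants = {
    "keep_all": drop_columns([]),
    "drop_elm": drop_columns([ELM]),
    "drop_elm_gpp": drop_columns([ELM, GPP]),
}
# Depths are illustrative, not tuned values.
param_variants = {
    "deep": {"max_depth": 14},
    "shallow": {"max_depth": 4},
}

# Every (preprocessing, hyperparameter) pair becomes one level 0 model.
level0_configs = [
    {"proc": p, "params": h, "name": f"{p}__{h}"}
    for p, h in product(proc_variants, param_variants)
]

demo = pd.DataFrame({ELM: [30.0], GPP: [50.0], "Number_of_Ads": [1.0]})
for cfg in level0_configs:
    cols = list(proc_variants[cfg["proc"]](demo).columns)
    print(cfg["name"], cols)
```

Naming each configuration makes it easier to tell later which OOF file came from which combination.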
How to keep OOFs
Current implementation:
import cudf
import cupy as cp
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

proc_funcs = [preprocess0, preprocess1, preprocess2]
params_lst = [params0, params1, params2]
n_models = len(proc_funcs) * len(params_lst)

train = cudf.read_csv("train.csv")
test = cudf.read_csv("test.csv")

oof = {i: cp.zeros(len(train), dtype=cp.float32) for i in range(n_models)}
ptest = {i: cp.zeros(len(test), dtype=cp.float32) for i in range(n_models)}

K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=1)

for fold, (tr_idx, val_idx) in enumerate(kf.split(train)):
    print("Fold: ", fold)
    tr_idx = cp.asarray(tr_idx)
    val_idx = cp.asarray(val_idx)

    X_tr = train.drop("Listening_Time_minutes", axis=1).iloc[tr_idx]
    y_tr = train["Listening_Time_minutes"].iloc[tr_idx]
    X_val = train.drop("Listening_Time_minutes", axis=1).iloc[val_idx]

    for i, proc_func in enumerate(proc_funcs):
        X = proc_func(X_tr)
        X_val_local = proc_func(X_val)
        test_local = proc_func(test)
        for j, params in enumerate(params_lst):
            # index by the actual number of param sets rather than a hard-coded 3
            model_idx = len(params_lst) * i + j
            print(f" Evaluating model ID {model_idx} / {n_models - 1}")

            model = XGBRegressor(**params)
            model.fit(X, y_tr)

            # each row is predicted exactly once, by the fold that held it out
            oof[model_idx][val_idx] = model.predict(X_val_local)
            # accumulate (+=) so the test prediction is the mean over the K folds;
            # plain assignment would keep only the last fold's prediction / K
            ptest[model_idx] += cp.asarray(model.predict(test_local)) / K

It easily lets me add new processing functions and hyperparams.
End result
I trained my level 1 stacker on the preprocessing, model, and hyperparameters listed above and ended up with a score of 12.54 (top ~22%) before moving on to Predict Calorie Expenditure - Kaggle Playground Series when it was released.
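Since I didn't record the details, here is only a generic sketch of what a level 1 stacker does with the OOF and test predictions, not a reconstruction of my actual setup. It swaps in a closed-form ridge regression written in NumPy as the meta-learner (the top solution used XGB for this), and the data, `alpha`, and fold count are all made up.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return coef, y_mean - x_mean @ coef

def stack(oof_matrix, test_matrix, y, n_splits=5, seed=1):
    """Cross-validate a ridge meta-learner on level 0 OOFs, then refit on all rows.

    oof_matrix: (n_train, n_models); test_matrix: (n_test, n_models).
    Returns (level 1 OOF predictions, level 1 test predictions).
    """
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_splits)
    meta_oof = np.zeros(n)
    for val_idx in folds:
        tr_mask = np.ones(n, dtype=bool)
        tr_mask[val_idx] = False
        coef, intercept = fit_ridge(oof_matrix[tr_mask], y[tr_mask])
        meta_oof[val_idx] = oof_matrix[val_idx] @ coef + intercept
    coef, intercept = fit_ridge(oof_matrix, y)
    return meta_oof, test_matrix @ coef + intercept

# Synthetic level 0 outputs: the target plus independent noise per "model".
rng = np.random.default_rng(0)
y = rng.uniform(0, 120, 1_000)
y_test = rng.uniform(0, 120, 200)
noise_levels = (12, 13, 20)
oof_matrix = np.column_stack([y + rng.normal(0, s, y.size) for s in noise_levels])
test_matrix = np.column_stack([y_test + rng.normal(0, s, y_test.size) for s in noise_levels])

meta_oof, meta_test = stack(oof_matrix, test_matrix, y)
print(rmse(y, meta_oof), [rmse(y, col) for col in oof_matrix.T])
```

The key point is that the meta-learner only ever sees out-of-fold level 0 predictions for the training rows, so its own cross-validated score stays an honest estimate.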
I didn’t do a great job of recording where I was, so now (writing 3 months later) I’ve forgotten the details of what I did. I may revisit this competition someday, but it’s unlikely.