Predict Podcast Listening Time

Published 2025-05-14 • Updated 2025-10-11

This playground competition has already ended, so I'm working through it to learn from the top solutions.

Leaderboard stats for reference (the distribution of final leaderboard scores; lower is better):

count 3000.000000
mean 13.918805
std 3.613350
min 11.448330
25% 12.556115
50% 12.897135
75% 13.117038
max 35.372890

I’d like to get top 10% (around 11.74) after implementing concepts from the top solutions.

Data

Columns and datatypes:

id int64
Podcast_Name object
Episode_Title object
Episode_Length_minutes float64
Genre object
Host_Popularity_percentage float64
Publication_Day object
Publication_Time object
Guest_Popularity_percentage float64
Number_of_Ads float64
Episode_Sentiment object
Listening_Time_minutes float64

Columns with missing values: Episode_Length_minutes, Guest_Popularity_percentage, and Number_of_Ads (handled in the preprocessing below).

The target is Listening_Time_minutes, but it’s not totally clear what this means. I’m guessing it’s the listening time per person per episode. Target stats:

count 750000.000000
mean 45.437406
std 27.138306
min 0.000000
25% 23.178350
50% 43.379460
75% 64.811580
max 119.970000

The correlation table between numerical columns indicates that Episode_Length_minutes is by far the greatest predictor of the target, with a correlation coefficient of 0.92.
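
For reference, this is roughly how that check can be reproduced (a minimal sketch; it assumes the competition's train.csv is in the working directory):

import pandas as pd

train = pd.read_csv("train.csv")
# Correlation of each numeric column with the target; Episode_Length_minutes dominates.
print(train.select_dtypes("number").corr()["Listening_Time_minutes"].sort_values(ascending=False))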

My approach

ideas:

preprocessing:

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
df_all = pd.concat((train, test), ignore_index=True)  # combined train + test (I later fit the imputation models on this)

def process_episode_title(episode_title):
    # Titles look like "Episode 42"; keep just the episode number.
    return episode_title.split(" ")[1]

def preprocess_base(data):
    # One-hot encode the categorical columns.
    data = pd.concat((data,
                      pd.get_dummies(data["Podcast_Name"]),
                      pd.get_dummies(data["Genre"]),
                      pd.get_dummies(data["Publication_Day"]),
                      pd.get_dummies(data["Publication_Time"]),
                      pd.get_dummies(data["Episode_Sentiment"])), axis=1)
    data["Episode_Title"] = data["Episode_Title"].apply(process_episode_title).astype("int16")
    # Dealing with NaNs: a simple median fill for Number_of_Ads.
    data["Number_of_Ads"] = data["Number_of_Ads"].fillna(data["Number_of_Ads"].median())
    data = data.drop(columns=["id", "Podcast_Name", "Genre", "Publication_Day",
                              "Publication_Time", "Episode_Sentiment"])
    return data

def add_nans_float(data, cols):
    # Impute each column in `cols` with an XGBoost model trained on the rows where it is present.
    for i, column in enumerate(cols):
        params = {
            "n_estimators": 1000,
            "random_state": 1,
            "tree_method": "hist",
            "device": "cuda",
        }
        model = XGBRegressor(**params)
        # Columns later in `cols` still contain NaNs, so drop them from the feature set.
        known = data[column].notna()
        train_data = data.loc[known].drop(columns=cols[i+1:])
        pred_data = data.loc[~known].drop(columns=cols[i+1:])
        X = train_data.drop(column, axis=1)
        y = train_data[column]
        model.fit(X, y)
        y_pred = model.predict(pred_data.drop(column, axis=1))
        y_pred = np.clip(y_pred, 0, max(y_pred))  # lengths and percentages can't be negative
        mask = data[column].isna()
        assert mask.sum() == y_pred.size, "size mismatch!"
        data.loc[mask, column] = y_pred
    return data

def preprocess(data):
    data = preprocess_base(data)
    data = add_nans_float(data, ["Episode_Length_minutes", "Guest_Popularity_percentage"])
    return data

train = preprocess(train)

I got a score of 13.19146 with this approach, which puts me at around the 80th percentile. I'm a bit surprised it's that bad. I also tried adding squares of the numerical columns, but that didn't help, so I don't think there's much to be gained from pure polynomial features; it may still be worth adding interaction terms between features (a sketch of what that could look like is below).
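
I haven't tried this yet, but pairwise interactions might look something like the following (add_interactions is a hypothetical helper and the column choice is just an example):

from itertools import combinations

def add_interactions(data, cols):
    # Pairwise products between selected numerical columns.
    for a, b in combinations(cols, 2):
        data[f"{a}_x_{b}"] = data[a] * data[b]
    return data

# e.g. train = add_interactions(train, ["Episode_Length_minutes", "Host_Popularity_percentage", "Number_of_Ads"])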

After fitting the add_nans_float imputation models on both the train and test data and adding NaN flag features, I got a score of 13.00851 (~65th percentile). I think the NaN flags are the important takeaway here: the model needs to know which values were imputed. I wonder if the flag effectively tells the model "this is a low-confidence value".
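
The flags themselves are cheap to add; a minimal sketch (add_nan_flags is a hypothetical helper name, and it has to run before the imputation fills anything in):

def add_nan_flags(data, cols):
    # Record which rows were originally missing, before imputation overwrites them.
    for column in cols:
        data[f"{column}_was_nan"] = data[column].isna().astype("int8")
    return data

# Called before imputing, e.g.:
# train = add_nan_flags(train, ["Episode_Length_minutes", "Guest_Popularity_percentage", "Number_of_Ads"])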

Looking at the top solutions

I started by looking at the top solution, by Chris Deotte. It's pretty complex.

Thoughts and ideas:

Next up: training these models on the GPU using cuDF and cuML. Training on the CPU takes way too long and uses too much memory; moving to GPU compute and memory will also free up my laptop for other things while training.

Using cuDF and cuML

I installed the RAPIDS library (installation process) and made some minor edits to make the pipeline entirely GPU-based. Now I can run much larger models (which kinda feels like cheating): without adjusting the preprocessing at all, just by running larger models, I've gone from a score of 13.01 (~65th percentile) to 12.69 (~33rd).
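
The edits amounted to swapping pandas for cuDF and keeping XGBoost on the GPU; roughly this (a sketch, not the exact diff, and the n_estimators value is illustrative):

import cudf
from xgboost import XGBRegressor

# cudf.read_csv is a drop-in replacement for pd.read_csv; the DataFrame lives in GPU memory.
train = cudf.read_csv("train.csv")
test = cudf.read_csv("test.csv")

params = {
    "n_estimators": 5000,   # illustrative: "much larger" than the 1000 trees used earlier
    "tree_method": "hist",
    "device": "cuda",       # XGBoost trains and predicts on the GPU and accepts cuDF input directly
    "random_state": 1,
}
model = XGBRegressor(**params)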

Roadmap from here

  1. implement stacking
    1. figure out some level 0s
    2. figure out how to keep OOFs, especially how to get an out-of-fold prediction for every single row
    3. stack
  2. write about stacking, why it’s so effective here
  3. look at the 2nd place solution and figure out how it was so effective with just a single model

Level 0s

preprocessing

models

hyperparams

How to keep OOFs

Current implementation:

import cudf
import cupy as cp
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

# preprocess0/1/2 and params0/1/2 come from the level 0 preprocessing and hyperparameter lists above.
proc_funcs = [preprocess0, preprocess1, preprocess2]
params_lst = [params0, params1, params2]
train = cudf.read_csv("train.csv")
test = cudf.read_csv("test.csv")

# One OOF vector and one test-prediction vector per (preprocessing, hyperparameter) combination.
n_models = len(proc_funcs) * len(params_lst)
oof = {i: cp.zeros(len(train), dtype=cp.float32) for i in range(n_models)}
ptest = {i: cp.zeros(len(test), dtype=cp.float32) for i in range(n_models)}

K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=1)
for fold, (tr_idx, val_idx) in enumerate(kf.split(train)):
    print("Fold:", fold)
    tr_idx = cp.asarray(tr_idx)
    val_idx = cp.asarray(val_idx)
    X_tr = train.drop("Listening_Time_minutes", axis=1).iloc[tr_idx]
    y_tr = train["Listening_Time_minutes"].iloc[tr_idx]
    X_val = train.drop("Listening_Time_minutes", axis=1).iloc[val_idx]
    for i, proc_func in enumerate(proc_funcs):
        X = proc_func(X_tr)
        X_val_local = proc_func(X_val)
        test_local = proc_func(test)
        for j, params in enumerate(params_lst):
            model_idx = len(params_lst) * i + j
            print(f"  Evaluating model ID {model_idx} / {n_models - 1}")
            model = XGBRegressor(**params)
            model.fit(X, y_tr)
            # Out-of-fold predictions for this fold's validation rows.
            oof[model_idx][val_idx] = cp.asarray(model.predict(X_val_local))
            # Accumulate test predictions across folds so they average to a K-fold mean.
            ptest[model_idx] += cp.asarray(model.predict(test_local)) / K

This setup makes it easy to add new preprocessing functions and hyperparameter sets.

End result

I trained my level 1 stacker on the OOF predictions from the preprocessing variants, models, and hyperparameters listed above and ended up with a score of 12.54 (top ~22%) before moving on to Predict Calorie Expenditure - Kaggle Playground Series when it was released.
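
For completeness, the level 1 step is conceptually just another model fit on the stacked OOF columns; here is a minimal sketch using a cuML ridge regression (I don't remember the exact level 1 model, so treat this as illustrative):

import cupy as cp
from cuml.linear_model import Ridge

# Stack the level-0 OOF predictions column-wise: shape (n_train, n_level0_models).
X_level1 = cp.stack([oof[i] for i in sorted(oof)], axis=1)
X_test_level1 = cp.stack([ptest[i] for i in sorted(ptest)], axis=1)
y = train["Listening_Time_minutes"].values

stacker = Ridge(alpha=1.0)          # level 1 model; the real one may have differed
stacker.fit(X_level1, y)
final_pred = stacker.predict(X_test_level1)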

I didn’t do a great job of recording where I was, so now (writing 3 months later) I’ve forgotten the details of what I did. I may revisit this competition someday, but it’s unlikely.