Predict Podcast Listening Time

Published 2025-05-14 • Updated 2025-10-11

This playground competition has already ended, so I’m working through it to learn from the top solutions.

Leaderboard stats for reference:

count 3000.000000
mean  13.918805
std   3.613350
min   11.448330
25%   12.556115
50%   12.897135
75%   13.117038
max   35.372890

I’d like to reach the top 10% (a score of around 11.74) after implementing concepts from the top solutions.

Data

Columns and datatypes:

id                          int64
Podcast_Name                object
Episode_Title               object
Episode_Length_minutes      float64
Genre                       object
Host_Popularity_percentage  float64
Publication_Day             object
Publication_Time            object
Guest_Popularity_percentage float64
Number_of_Ads               float64
Episode_Sentiment           object
Listening_Time_minutes      float64

Columns with missing values:

Episode_Length_minutes
- 87093 NaNs, or 11.6%
Guest_Popularity_percentage
- 146030 NaNs, or 19.5%
Number_of_Ads
- 1 NaN, negligible

The target is Listening_Time_minutes, but it’s not totally clear what this means. I’m guessing it’s the listening time per person per episode. Target stats:

count    750000.000000
mean         45.437406
std          27.138306
min           0.000000
25%          23.178350
50%          43.379460
75%          64.811580
max         119.970000

The correlation table for the numerical columns indicates that Episode_Length_minutes is by far the strongest predictor of the target, with a correlation coefficient of 0.92.
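A table like this can be produced with pandas. Below is a minimal sketch on synthetic stand-in data: the column names match the dataset, but the values (and the 0.7 slope) are invented for illustration, so the printed numbers will not match the 0.92 reported above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Synthetic stand-in for train.csv; every value here is made up.
length = rng.uniform(5, 120, n)
demo = pd.DataFrame({
    "Episode_Length_minutes": length,
    "Number_of_Ads": rng.integers(0, 4, n).astype(float),
    "Listening_Time_minutes": 0.7 * length + rng.normal(0, 8, n),
})

# Pearson correlation of every numeric column with the target,
# strongest first. select_dtypes skips the object columns.
corr_with_target = (
    demo.select_dtypes("number")
    .corr()["Listening_Time_minutes"]
    .drop("Listening_Time_minutes")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the real data, replacing `demo` with the loaded training frame should give the same kind of ranking.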

My approach

ideas:

Most of the object columns can simply be one-hot encoded. Podcast_Name has only 48 unique values (presumably because most of this data was artificially augmented from a much smaller dataset), so I think I’ll one-hot encode that too
Episode_Title entries all take the format Episode X, where X is a number between 1 and 100. I’ll convert this column into an integer feature, since I suspect there could be some correlation between episode number and listening time (maybe higher episode numbers tend to be listened to more by the ‘hardcore’ listeners, who would tend to listen longer)
This may be a bad idea, but I’ll use a model to predict the missing Episode_Length_minutes and Guest_Popularity_percentage values

preprocessing:

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
df_all = pd.concat((train, test), ignore_index=True)

def process_episode_title(episode_title):
    return episode_title.split(" ")[1]

def preprocess_base(data):

    data = pd.concat((data,
                      pd.get_dummies(data["Podcast_Name"]),
                      pd.get_dummies(data["Genre"]),
                      pd.get_dummies(data["Publication_Day"]),
                      pd.get_dummies(data["Publication_Time"]),
                      pd.get_dummies(data["Episode_Sentiment"])), axis=1)

    data["Episode_Title"] = data["Episode_Title"].apply(process_episode_title).astype("int16")

    # dealing with NaNs
    data["Number_of_Ads"] = data["Number_of_Ads"].fillna(data["Number_of_Ads"].median())

    data = data.drop(columns=["id", "Podcast_Name", "Genre", "Publication_Day", "Publication_Time", "Episode_Sentiment"])

    return data

def add_nans_float(data, cols):
    """Fill the NaNs in each column of `cols` with predictions from an XGBoost model."""
    params = {
        "n_estimators" : 1000,
        "random_state" : 1,
        "tree_method"  : "hist",
        "device"       : "cuda",
    }
    for i, column in enumerate(cols):
        # columns later in `cols` have not been imputed yet, so leave them out of the features
        features = data.drop(columns=[column] + cols[i+1:])
        mask = data[column].isna()

        # fit on the rows where the column is known
        model = XGBRegressor(**params)
        model.fit(features[~mask], data.loc[~mask, column])

        # predict the missing rows; clip from below only, since these columns can't be negative
        y_pred = np.clip(model.predict(features[mask]), 0, None)

        assert mask.sum() == y_pred.size, "size mismatch!"
        data.loc[mask, column] = y_pred

    return data

def preprocess(data):
    data = preprocess_base(data)
    data = add_nans_float(data, ["Episode_Length_minutes", "Guest_Popularity_percentage"])
    return data

train = preprocess(train)

I got a score of 13.19146 with this approach, which puts me at around the 80th percentile of the leaderboard (lower is better, so that’s roughly the bottom 20%). I’m a bit surprised it’s so bad. I also tried adding squares of the numerical columns, but that didn’t help, so I guess there’s not much to be gained from polynomial features of a single column; however, it may still be worth adding interaction terms between features.

After fitting the add_nans_float models on train and test data combined and adding NaN flag features (indicator columns marking which values were originally missing), I got a score of 13.00851, ~65th percentile. I think the NaN flag is the important takeaway from this: the model needs to know where a value was predicted rather than observed. I wonder if the NaN flag somehow tells the model “this is a low-confidence value”.
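For reference, here is a minimal sketch of what NaN flag features can look like. The helper name `add_nan_flags` and the `_was_nan` suffix are hypothetical, not the names used in my pipeline. The flags have to be created before the imputation step, because afterwards the NaNs are gone.

```python
import numpy as np
import pandas as pd

def add_nan_flags(data, cols):
    """Return a copy with a 0/1 `<col>_was_nan` column for each column in `cols`."""
    data = data.copy()
    for col in cols:
        # 1 where the value was originally missing, 0 where it was observed
        data[f"{col}_was_nan"] = data[col].isna().astype("int8")
    return data

# Tiny made-up frame with the two columns that get imputed.
demo = pd.DataFrame({
    "Episode_Length_minutes": [30.0, np.nan, 62.5],
    "Guest_Popularity_percentage": [np.nan, 40.0, np.nan],
})
flagged = add_nan_flags(
    demo, ["Episode_Length_minutes", "Guest_Popularity_percentage"]
)
print(flagged)
```

In the pipeline above, this would be called between preprocess_base and add_nans_float.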

Looking at the top solutions

I started with the top solution, by Chris Deotte. It’s pretty complex.

Thoughts and ideas:

Apparently the top notebooks use really high max depths for some models
What led him to a stack was training an XGB on the OOF (out-of-fold) and test predictions from all his previous experiments, after which his score jumped to the top. I think the idea is that a model trained on the outputs of a group of models learns how to mix all of them, and for this dataset, where there are a bunch of “if this then that” relationships in the data, that mix is somehow better than each individual model
- I like the idea of combining a bunch of different models that approach the data in different ways
Look into target encoding
Stacking is so effective with this dataset because there are so many missing values in what is by far the most important feature
- Stacks excel at extracting feature interactions
- The main problem to solve in this challenge: what do you do when the main predictor, Episode_Length_minutes (ELM), is missing? When it’s missing, all the other features matter a lot more
“All in all, I trained around 100 models. From the beginning, I saved every single OOF and test preds as parquet files. My Hill Climbing script reads the OOFs and computes roughly optimal blending weights (both positive and negative). Every now and then, I ran the Hill Climber and removed those OOFs and test predictions that weren’t assigned any weights to speed up hill climbing. Unexpectedly, even some very early OOFs from before I added TE were still assigned some weight. The numbers given in the description above are what remained, i.e. what Hill Climbing still considered relevant.” from Johannes Heller
- Need to learn about hill climbing vs stacking
Try making a model that trains on all columns except ELM and add it to the stack. It gives the meta-learner a backup option for rows where ELM is missing
Figure out AutoGluon and n-gram feature interactions from this
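On hill climbing vs stacking: hill climbing searches for one fixed blending weight per model, while stacking trains a second model on the OOF predictions. Below is a minimal sketch of one common greedy variant of hill climbing, run on synthetic OOFs. It is not Johannes Heller's script; the function names, the weight grid, and the stopping rule are my own assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def hill_climb(oofs, y, weight_grid=None, tol=1e-6, max_rounds=50):
    """Greedy blend of the columns of `oofs` (n_rows, n_models).

    Returns (weights, score). Weights sum to 1 and may be negative.
    """
    if weight_grid is None:
        weight_grid = np.linspace(-0.5, 0.5, 101)
    n_models = oofs.shape[1]

    # Start from the single best model.
    scores = [rmse(y, oofs[:, m]) for m in range(n_models)]
    start = int(np.argmin(scores))
    weights = np.zeros(n_models)
    weights[start] = 1.0
    blend = oofs[:, start].copy()
    best_score = scores[start]

    for _ in range(max_rounds):
        best_step = None
        # Try mixing each model into the current blend at each weight.
        for m in range(n_models):
            for w in weight_grid:
                score = rmse(y, (1 - w) * blend + w * oofs[:, m])
                if score < best_score - tol:
                    best_score, best_step = score, (m, w)
        if best_step is None:
            break  # no candidate improved the blend
        m, w = best_step
        blend = (1 - w) * blend + w * oofs[:, m]
        weights *= 1 - w
        weights[m] += w
    return weights, best_score

# Synthetic OOFs: three "models" that are the target plus independent noise.
rng = np.random.default_rng(0)
y = rng.normal(size=2000)
oofs = np.column_stack([
    y + rng.normal(scale=0.5, size=2000),
    y + rng.normal(scale=0.6, size=2000),
    y + rng.normal(scale=1.0, size=2000),
])
weights, score = hill_climb(oofs, y)
print(weights.round(3), round(score, 4))
```

The same weights would then be applied to the matching test predictions. Models that end up with zero weight are the ones the quote above describes removing.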

Next up: training these models on GPU using cuDF and cuML. Training on CPU takes way too long and uses too much memory; using GPU compute + memory will allow me to use my laptop for other things while training.

Using cuDF and cuML

I installed the RAPIDS library (installation process) and made some minor edits to the pipeline to make it entirely GPU-based. Now I’m running much larger models (which kinda feels like cheating). Without adjusting the preprocessing at all, just by running larger models, I’ve gone from a score of 13.01 (~65th percentile) to 12.69 (~33rd percentile).

Roadmap from here

implement stacking
1. figure out some level 0s
2. figure out how to keep OOFs, especially how to compute OOFs of every single row
3. stack
write about stacking, why it’s so effective here
look at the 2nd place solution and figure out how it was so effective with just a single model
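The three stacking steps can be sketched end to end on a toy problem. This is a rough illustration under stated assumptions, not my actual pipeline: the data is synthetic, the level 0 models are scikit-learn stand-ins (Ridge and two decision trees) rather than XGB, and the level 1 stacker is a Ridge. The `fills` counter is only there to show that K-fold gives every training row exactly one out-of-fold prediction.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with an "if this then that" interaction.
rng = np.random.default_rng(0)
n_train, n_test = 600, 200
X = rng.normal(size=(n_train + n_test, 4))
y = (2 * X[:, 0]
     + np.where(X[:, 1] > 0, X[:, 2], -X[:, 2])
     + rng.normal(scale=0.1, size=n_train + n_test))
X_train, X_test = X[:n_train], X[n_train:]
y_train = y[:n_train]

# 1. Level 0s (stand-ins chosen to approach the data differently).
level0 = [
    Ridge(alpha=1.0),
    DecisionTreeRegressor(max_depth=3, random_state=0),
    DecisionTreeRegressor(max_depth=8, random_state=0),
]

# 2. OOFs: each row is predicted by models that never saw it in training.
K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=1)
oof = np.zeros((n_train, len(level0)))
ptest = np.zeros((n_test, len(level0)))
fills = np.zeros(n_train, dtype=int)

for tr_idx, val_idx in kf.split(X_train):
    fills[val_idx] += 1
    for m, model in enumerate(level0):
        model.fit(X_train[tr_idx], y_train[tr_idx])
        oof[val_idx, m] = model.predict(X_train[val_idx])
        # average the test predictions over the K fold models
        ptest[:, m] += model.predict(X_test) / K

# 3. Stack: the level 1 model is trained on the OOF matrix
#    and applied to the averaged test predictions.
stacker = Ridge(alpha=1.0)
stacker.fit(oof, y_train)
stacked_test = stacker.predict(ptest)
print(stacked_test[:5])
```

Because the stacker only ever sees predictions made on held-out rows, it learns how much to trust each level 0 model without the target leaking in.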

Level 0s

preprocessing

predict ELM and GPP (Guest_Popularity_percentage)
drop ELM
drop ELM and GPP

models

XGB
Lasso
SVR
KNN Regressor
Random Forest
AutoGluon
LightGBM
CatBoost?

hyperparams

high max depth
low max depth

How to keep OOFs

Current implementation:

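import cudf
import cupy as cp
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

# preprocess0..2 and params0..2 (not shown) are the level 0 preprocessing
# functions and XGB hyperparameter dicts
proc_funcs = [preprocess0, preprocess1, preprocess2]
params_lst = [params0, params1, params2]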

train = cudf.read_csv("train.csv")
test = cudf.read_csv("test.csv")

oof   = {i: cp.zeros(len(train),  dtype=cp.float32) for i in range(len(proc_funcs) * len(params_lst))}
ptest = {i: cp.zeros(len(test), dtype=cp.float32) for i in range(len(proc_funcs) * len(params_lst))}

K = 5
kf = KFold(n_splits=K, shuffle=True, random_state=1)

for fold, (tr_idx, val_idx) in enumerate(kf.split(train)):
    print("Fold: ", fold)
    tr_idx  = cp.asarray(tr_idx)
    val_idx = cp.asarray(val_idx)

    X_tr = train.drop("Listening_Time_minutes", axis=1).iloc[tr_idx]
    y_tr = train["Listening_Time_minutes"].iloc[tr_idx]
    X_val = train.drop("Listening_Time_minutes", axis=1).iloc[val_idx]

    for i, proc_func in enumerate(proc_funcs):
        X = proc_func(X_tr)
        X_val_local = proc_func(X_val)
        test_local = proc_func(test)
        for j, params in enumerate(params_lst):
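            # flat index over (preprocessing, params) pairs; stays correct if params_lst changes length
            model_idx = len(params_lst) * i + j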
            print(f"    Evaluating model ID {model_idx} / {len(proc_funcs) * len(params_lst) - 1}")

            model = XGBRegressor(**params)
            model.fit(X, y_tr)

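            oof[model_idx][val_idx] = cp.asarray(model.predict(X_val_local))
            # accumulate the average over folds; a plain `=` would keep only the last fold's predictions / K
            ptest[model_idx] += cp.asarray(model.predict(test_local)) / K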

This setup makes it easy to add new preprocessing functions and hyperparameter sets.

End result

I trained my level 1 stacker on the level 0 predictions from the preprocessing variants, models, and hyperparameters listed above and ended up with a score of 12.54 (top ~22%) before moving on to Predict Calorie Expenditure - Kaggle Playground Series when it was released.

I didn’t do a great job of recording where I was, so now (writing 3 months later) I’ve forgotten the details of what I did. I may revisit this competition someday, but it’s unlikely.