Predict Calorie Expenditure Challenge

Published 2025-05-12 • Updated 2025-10-11

This was my first live/‘real’ Kaggle competition. I enjoyed the discovery process as the community identified promising new features, discussed dataset quirks, and theorized about this competition’s odd CV-LB relationship. It was also interesting to identify the ways in which my thinking diverged from most people’s, particularly with regard to the CV-LB relationship, which ended up gaining me over 600 ranks on the private leaderboard.

I ended with a rank of 56 out of 4000+, which I’m proud of.

Here I’ll go through some key ideas as I worked through this competition, including what worked, what didn’t, and what I might have done with more time. It won’t be a thorough/exhaustive analysis.

Data analysis and initial preprocessing

The target for this competition is “Calories”: how many calories were burned during a workout, given information about the person and the workout.

Provided features:

id int64
Sex object
Age int64
Height float64
Weight float64
Duration float64
Heart_Rate float64
Body_Temp float64
Calories float64 (target)

Feature correlations with the target:

Duration 0.959908
Heart_Rate 0.908748
Body_Temp 0.828671
Age 0.145683
Weight 0.015863
id 0.001148
Height -0.004026

The workout-specific indicators—duration, heart rate, and body temperature—are by far the best correlated features. I was surprised that height and weight are so poorly correlated with the target.

The most basic preprocessing step simply label encodes the Sex feature and adds a BMI feature:

BMI = \frac{weight}{height^2}

The resulting feature has a correlation coefficient of 0.049226 with the target, over 3x that of either Height or Weight individually.
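
A minimal sketch of this preprocessing step (the file name, the male/female encoding, and Height being in centimetres are assumptions on my part; BMI’s scale doesn’t affect its correlation in any case):

import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["Sex"] = (out["Sex"] == "male").astype(int)           # label encode: male -> 1, female -> 0 (assumed values)
    out["BMI"] = out["Weight"] / (out["Height"] / 100) ** 2   # Height assumed to be in cm
    return out

train = preprocess(pd.read_csv("train.csv"))                  # hypothetical file path
print(train.corr(numeric_only=True)["Calories"].sort_values(ascending=False))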

The training dataset has a shape of (750_000, 9) while the test dataset has a shape of (250_000, 8) because it’s missing the target. Neither contains any missing values, which I was somewhat surprised by.

Hill climbing as the main approach

I was pretty sure I wanted to use a hill climber for the following reasons:

  1. It’s generally a high-scoring approach.
  2. On a laptop with a fairly weak GPU, heavily tuning a single model is more challenging than aggregating many lightly tuned models.
  3. I like the idea of using a diversity of models and features.
  4. I wanted to learn about hill climbing.

As I added models over the course of the competition, I was not surprised to find that top-scoring single models tended to be selected by the hill climber. I was surprised, however, to find that some models that scored in the bottom third were also included, presumably because they added diversity. For example, an XGBoost model with a depth of 4 contributed where most other models had depths of 12-14. This prompted me to attempt the AlphaEvolve-like feature engineering described below.

My final hill climber, which achieved my best score on both the public and private leaderboards, contained a total of 10 CatBoost, LightGBM, and XGBoost models, selected from the 105 models for which I had saved out-of-fold (OOF) predictions.
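
For illustration, a greedy hill climber over the saved OOF predictions can be sketched roughly like this, where oof_preds maps model names to OOF prediction arrays aligned with y_true. This version selects with replacement and blends with equal weights, which is a simplification rather than my exact implementation:

import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

def hill_climb(oof_preds: dict, y_true, max_models=30):
    """Greedily add (with replacement) the OOF vector that most improves the blend's RMSLE."""
    selected, blend, best_score = [], None, np.inf
    for _ in range(max_models):
        best_name, best_blend = None, None
        for name, pred in oof_preds.items():          # predictions assumed non-negative
            cand = pred if blend is None else (blend * len(selected) + pred) / (len(selected) + 1)
            score = rmsle(y_true, cand)
            if score < best_score:
                best_score, best_name, best_blend = score, name, cand
        if best_name is None:                          # no candidate improved the current blend
            break
        selected.append(best_name)
        blend = best_blend
    return selected, blend, best_score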

AlphaEvolve-like feature engineering

AlphaEvolve, described as “a coding agent for scientific and algorithmic discovery” by DeepMind, is an evolutionary search process that uses LLMs to generate mutations. It broadly follows these three steps:

  1. Population generation
  2. Fitness evaluation
  3. Survivor selection

The population is generated by seeding for generation 0, such as through hand-writing baselines. For subsequent generations, an LLM is prompted to ‘mutate’ a provided target—either one of the baselines for generation 1 or an already-mutated target for subsequent generations—and return the mutation. A specific target—or ‘parent’—may be mutated multiple times to generate multiple children. The fitness of the child generation is then evaluated using some evaluation function, and the top k survivors become the parents of the next generation.
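
As a rough sketch, the whole loop fits in a few lines; mutate_with_llm and evaluate_fitness are stand-ins for the LLM call and the scoring function, and the defaults mirror the settings described below (20 generations, 4 children per parent, 7 survivors):

def evolve(seed_candidates, mutate_with_llm, evaluate_fitness,
           n_generations=20, children_per_parent=4, top_k=7):
    """Minimal evolutionary loop: mutate the best candidates with an LLM, keep the best overall."""
    # population accumulates (fitness, candidate) pairs across all generations
    population = [(evaluate_fitness(c), c) for c in seed_candidates]
    for _ in range(n_generations):
        # lower fitness is better here (e.g. a validation RMSLE)
        parents = sorted(population, key=lambda pair: pair[0])[:top_k]
        children = [mutate_with_llm(parent)
                    for _, parent in parents
                    for _ in range(children_per_parent)]
        population += [(evaluate_fitness(c), c) for c in children]
    return min(population, key=lambda pair: pair[0])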

When I encountered this paper, I became interested in using this method for feature engineering, particularly because an LLM has a practically infinite search space, as opposed to AutoFE systems, which are bounded by predefined primitives. In addition, an LLM is theoretically able to reason about why some features might have worked better than others, although I have yet to implement this.

My goal with this process was to produce a single beneficial feature. The ‘target’ was therefore something like:

import numpy as np, pandas as pd

def build_feature(df: pd.DataFrame) -> pd.Series:
    feat = df["Duration"] * df["Heart_Rate"]  # illustrative placeholder; this body is what the LLM mutates
    return feat

Each LLM mutator received something like the above. The generation process looked something like this:

  1. Population generation
    1. Seed the population — I did this by returning an existing feature outright
    2. Subsequent generations — each parent produced 4 children, each generated from a prompt containing context and instructions
  2. Fitness evaluation
    • I trained a small CatBoost model on the preprocessed training data plus a mutated feature; fitness was the model’s validation score (see the sketch after this list)
  3. Survivor selection
    • The top 7 candidates from the entire process so far — not strictly the top 7 children of the current generation — were selected as parents of the following generation
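
A sketch of how that fitness step could look, assuming each candidate is the source of a build_feature function (as above) executed into a namespace. Training on log1p of the target is a common RMSLE trick, and the CatBoost parameters are illustrative; neither is necessarily what I actually ran.

import numpy as np
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

def evaluate_fitness(candidate_code: str, X, y) -> float:
    """Exec the LLM-generated build_feature, append its output, and score a small CatBoost model."""
    namespace = {}
    exec(candidate_code, namespace)                      # candidate_code comes from our own LLM prompt
    X_aug = X.copy()
    X_aug["generated_feature"] = namespace["build_feature"](X_aug)
    X_tr, X_val, y_tr, y_val = train_test_split(X_aug, y, test_size=0.2, random_state=0)
    model = CatBoostRegressor(iterations=300, depth=6, verbose=0, random_seed=0)
    model.fit(X_tr, np.log1p(y_tr))                      # log1p target to align with RMSLE
    preds = np.expm1(model.predict(X_val)).clip(0)
    return float(np.sqrt(mean_squared_log_error(y_val, preds)))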

Performing this process across 20 generations consistently yielded features that improved on what I came up with on my own. I found that one of the strengths of this approach is that the LLMs have some level of domain knowledge, so they’re able to generate intelligent and appropriate features on their own.

Because I was using a hill climber, I also used this approach to predict the residuals rather than just the target. The idea was that the hill climber thrives on a diversity of models, so adding a model trained on a feature that captures missing signal should improve the ensemble. I found that the first few iterations of the loop (generate OOFs -> generate a residual-predicting feature -> feed a new model to the hill climber -> repeat) did improve the hill climber’s score by a few basis points, but the marginal gains eventually zeroed out.
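
Concretely, the only change was the fitness target. A sketch, assuming residuals are taken in log1p space to match RMSLE (that detail is an assumption here):

import numpy as np

def residual_target(y_true: np.ndarray, oof_blend: np.ndarray) -> np.ndarray:
    """Residuals of the current ensemble blend, in log1p space to match RMSLE (assumed)."""
    return np.log1p(y_true) - np.log1p(oof_blend)

# The evolutionary search then reruns with this array as the fitness target (scored with plain
# RMSE, since residuals can be negative); the resulting feature feeds a new model whose OOF
# predictions are offered to the hill climber.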

One improvement would be to produce multiple features, as generated feature interactions may capture signal that a single generated feature cannot.

Final model selection

You’re allowed to pick 2 models (target predictions, technically, but it’s easier for me to think of them as models) as your final submissions. This quickly leads to a problem: how do you interpret your CV and public LB scores to maximize your private LB score?

This competition in particular sparked a lot of discussion because most people were reporting significantly lower public LB scores than CV scores, and there was rampant speculation as to why. Note that the evaluation metric for this competition was root mean squared logarithmic error (RMSLE), meaning that a lower score is better.
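
For reference, the metric is:

RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(1 + \hat{y}_i) - \log(1 + y_i)\right)^2}

where \hat{y}_i is the predicted calorie count and y_i the true one.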

In exploring this question, I noticed that my per-fold CV scores had pretty high variance, with the mean sitting above the public LB score. It seemed that less than 1% of the rows in the train set were significant outliers: folds that happened to contain few or none of them scored about as low as the public LB, while most folds did contain some and therefore scored higher. Since the test split is 20% public LB and 80% private LB, the small public split could plausibly be light on outliers while the private split should look much more like the full training distribution. I therefore chose to trust my CV scores and selected the submissions with the lowest CV scores rather than the lowest public LB scores.
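
One way to see this from saved OOF predictions (the helper and the idea of thresholding are illustrative, not code from my pipeline):

import numpy as np

def per_row_sq_log_error(y_true: np.ndarray, oof_pred: np.ndarray) -> np.ndarray:
    """Each row's squared-log-error contribution to RMSLE; large values flag likely outlier rows."""
    return (np.log1p(y_true) - np.log1p(oof_pred)) ** 2

# e.g. (per_row_sq_log_error(y, oof) > threshold).mean() estimates the share of outlier rows,
# and counting how many land in each fold helps explain the per-fold score variance.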

This approach worked well, and I ended up improving by more than 600 positions on the leaderboard when the private LB was revealed. I think that this was a primary reason I did relatively well in this competition.

If I had more time

I made the above realization with only about a day to go — if I had more time, I would have tried to predict the outliers with a different model, generating a feature that flags whether the data point is likely an outlier.

I also would have refit the models included in my best hill climber on the full data. As I understand it, this is standard practice, but I hadn’t saved the models (just the predictions), so I wasn’t able to do that. I wonder what difference that would have made — training on the full data rather than most of it (regardless of how many folds you use) tends to noticeably lower error, at least in my experience.