benchmarking and robust evaluators

Published 2025-05-19

From the AlphaEvolve whitepaper:

“However, with these improvements we envision that the value of setting up more environments (problems) with robust evaluation functions will become more widely recognized, which in turn will result in more high-value practical discoveries going forward.”

This seems especially applicable to the natural sciences, where there’s generally a greater barrier to formalizing evaluators than in the mathematical or computational sciences. I’m imagining a drug-design evaluator that combines a variety of metrics: binding affinity/energy, structure-prediction confidence, predicted off-target interactions with enzymes and other molecules, and so on. An AlphaEvolve-style agent could explore that design space far more efficiently than any human lab.
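To make that concrete, here’s a minimal sketch of what a composite evaluator could look like. Everything in it is hypothetical: the metric functions (`binding_energy`, `structure_confidence`, `off_target_penalty`) are stand-ins for real scoring pipelines (docking, a structure predictor, ADMET/off-target models), and the weights are arbitrary.

```python
from dataclasses import dataclass
from typing import Callable

# A metric maps a candidate (e.g. a SMILES string) to a normalized score.
Metric = Callable[[str], float]

@dataclass
class CompositeEvaluator:
    """Collapse several domain metrics into one scalar fitness."""
    metrics: dict[str, Metric]
    weights: dict[str, float]

    def __call__(self, candidate: str) -> float:
        # Weighted sum; an agent maximizes this scalar.
        return sum(
            self.weights[name] * metric(candidate)
            for name, metric in self.metrics.items()
        )

# Placeholder metrics for illustration only -- real versions would wrap
# docking software, a structure predictor, and off-target/ADMET models.
def binding_energy(smiles: str) -> float:
    return 0.0  # e.g. negated docking energy, normalized to [0, 1]

def structure_confidence(smiles: str) -> float:
    return 0.0  # e.g. a structure-prediction confidence score

def off_target_penalty(smiles: str) -> float:
    return 0.0  # e.g. negated predicted off-target interaction risk

evaluator = CompositeEvaluator(
    metrics={
        "binding": binding_energy,
        "structure": structure_confidence,
        "off_target": off_target_penalty,
    },
    weights={"binding": 0.5, "structure": 0.3, "off_target": 0.2},
)

score = evaluator("CCO")  # ethanol, as a stand-in candidate
```

The point is only the shape: many noisy domain metrics collapsed into one scalar that an evolutionary agent can optimize against.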

I’m considering implementing my own (much smaller) version of AlphaEvolve for feature engineering on Kaggle datasets.
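A first pass might look something like the sketch below. The assumptions are mine throughout: California housing stands in for a Kaggle dataset, the “mutation” is a random arithmetic combination of columns rather than the LLM-proposed code edits AlphaEvolve actually uses, and the evaluator is cross-validated R² from a small random forest.

```python
import random
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Toy stand-in for a Kaggle dataset.
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
base_cols = list(X.columns)

def random_feature() -> tuple[str, str, str]:
    """Propose a derived feature as (op, col_a, col_b)."""
    op = random.choice(["add", "sub", "mul", "div"])
    a, b = random.sample(base_cols, 2)
    return (op, a, b)

def apply_features(df: pd.DataFrame, feats: list) -> pd.DataFrame:
    """Materialize each proposed feature as a new column."""
    out = df.copy()
    for i, (op, a, b) in enumerate(feats):
        if op == "add":
            out[f"f{i}"] = out[a] + out[b]
        elif op == "sub":
            out[f"f{i}"] = out[a] - out[b]
        elif op == "mul":
            out[f"f{i}"] = out[a] * out[b]
        else:  # division, guarded against zeros
            out[f"f{i}"] = (out[a] / out[b].replace(0, np.nan)).fillna(0.0)
    return out

def fitness(feats: list) -> float:
    """Evaluator: cross-validated R^2 on the augmented feature set."""
    Xf = apply_features(X, feats)
    model = RandomForestRegressor(n_estimators=30, random_state=0, n_jobs=-1)
    return cross_val_score(model, Xf, y, cv=3).mean()

# Evolutionary loop: keep the best candidates, extend them with mutations.
population = [[random_feature()] for _ in range(6)]
for gen in range(5):
    scored = sorted(population, key=fitness, reverse=True)
    survivors = scored[:3]
    children = [s + [random_feature()] for s in survivors]
    population = survivors + children
    print(f"gen {gen}: best fitness = {fitness(survivors[0]):.4f}")
```

Swapping the random mutation for an LLM call that rewrites `apply_features` itself is the part that would actually make this AlphaEvolve-like.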