LLM-Evo
Published 2025-11-10
LLM-Evo is an LLM-powered genetic algorithm inspired by AlphaEvolve: seed a population, let the model mutate promising parents, score every candidate, and keep iterating. The explorer below pulls top candidates and generation-level stats from a recorded run, and lets you see the exact code the system produced along the way.
This particular run’s objective is to improve the performance of the following function on large NumPy arrays:

f(x) = exp(sin(x) + cos²(x))

The baseline implementation below is provided as the seed:
```python
def build_func(x):
    return np.exp(np.sin(x) + np.square(np.cos(x)))
```

Poke around and get a sense of how the top solutions evolved over time; I’ll then dig into what happened and how the system works.
Evaluation
The run starts with a set of seeds; here, those seeds are 20 copies of the `build_func(x)` function above. Each of these copies is evaluated independently, and the runtimes of each evaluation are aggregated to get a mean and standard deviation. Subsequent generations use those statistics to return a z-score as the evaluation.
Gene evaluation looks something like this:
- An `objective(code)` function wraps around `run_code(code, func_name)` to route exceptions
- `run_code` performs several essential operations:
  - Performs a basic syntax check of the provided code with `ast`
  - Writes a helper script to a temporary directory; this script includes array generation, timing of the run, and a correctness check
  - Runs the helper script in a subprocess, returning only the median time (negated to preserve higher-is-better scoring)
- The core `evolve` function (not accessible to the user) converts the output of `objective`, here a negative time, to a z-score (a minimal sketch of that conversion follows below)
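To make that last step concrete, here’s a minimal sketch of the z-score conversion, assuming the baseline statistics come from the seed evaluations described above; the function and variable names here are mine, not the actual internals of `evolve`.

```python
import numpy as np

def to_z_score(objective_value, baseline_mean, baseline_std):
    # objective() returns a negated runtime, so higher is better;
    # a candidate faster than the seed therefore gets a positive z-score.
    return (objective_value - baseline_mean) / baseline_std

# Illustrative baseline: negated runtimes of the seed copies (made-up numbers)
seed_scores = np.array([-0.052, -0.050, -0.051, -0.049])
baseline_mean, baseline_std = seed_scores.mean(), seed_scores.std()

print(to_z_score(-0.030, baseline_mean, baseline_std))  # a faster candidate scores well above 0
```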
Note that the LLMs are prompted to maintain the `build_func(x)` signature; deviations are rejected.

Here’s a snippet from that `run_code` function; note that `func` is the LLM-provided `build_func`:
```python
def f(x):
    return np.exp(np.sin(x) + np.square(np.cos(x)))

WARMUP_RUNS = 3
TIMED_RUNS = 10

size = 5_000_000
x = np.random.random(size).astype(np.float64)

times = []
for i in range(WARMUP_RUNS + TIMED_RUNS):
    x_in = x.copy()

    # Don't time the warm-up runs
    if i < WARMUP_RUNS:
        result = func(x_in)
    # Start timing after warm-up
    else:
        start = time.perf_counter()
        result = func(x_in)
        times.append(time.perf_counter() - start)

    if not isinstance(result, np.ndarray) or result.shape != x.shape or not np.allclose(result, f(x)):
        result = np.nan
        break

# Convert numpy arrays to lists for JSON transport
if isinstance(result, np.ndarray):
    result = result.tolist()

out = result if np.isnan(result).any() else -np.median(times)

print(json.dumps(out))
```

Highlights:
- Array `size` is large enough to noticeably reduce noise while not requiring an excessively long time to operate on
- Warm-up runs let NumPy, the OS, and the CPU settle into steady-state behavior; otherwise the first call includes one-time costs (page faults, initialization, cache fills) that artificially inflate runtime
- `subprocess.run` captures `stdout` and `stderr`, so `objective(code)` receives the output of that last print statement (a rough sketch of that wiring follows below)
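Concretely, the `objective`/`run_code` pair might be wired together something like the sketch below. This is my own reconstruction from the description above: the simplified harness string, the timeout, and the failure score are assumptions, not the system’s actual code.

```python
import ast
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Deliberately simplified stand-in for the full timing harness shown above
# (the real one does warm-ups, a correctness check, and reports the median).
TIMING_HARNESS = """
import json, time
import numpy as np
x = np.random.random(1_000_000)
start = time.perf_counter()
result = {func_name}(x)
print(json.dumps(-(time.perf_counter() - start)))
"""

def run_code(code: str, func_name: str) -> float:
    # Cheap syntax check before paying for a subprocess
    ast.parse(code)

    # Write candidate code + harness to a temporary script and run it
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "harness.py"
        script.write_text(code + "\n" + TIMING_HARNESS.format(func_name=func_name))
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True, timeout=120,
        )

    # The harness prints exactly one JSON value: the negated runtime
    return json.loads(proc.stdout.strip().splitlines()[-1])

def objective(code: str) -> float:
    # Route any failure into a worst-case score instead of crashing the run
    try:
        return run_code(code, "build_func")
    except Exception:
        return float("-inf")
```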
This run’s highlights
Generation 1: removing cos with a trig identity
z-score: 27.4
The first big jump came right at the start, when some of the LLM mutators took advantage of the identity cos²(x) = 1 − sin²(x) to rewrite the exponent of the target function as:

sin(x) + cos²(x) = 1 + sin(x) − sin²(x)
That cuts a heavy cos call, and the arithmetic that replaces it is comparatively cheap. The top scorer from that first generation is therefore:
```python
def build_func(x):
    s = np.sin(x)
    return np.exp(1.0 + s - s * s)
```

Generation 2: in-place ufuncs and fewer temporaries
z-score: 33.2
Here the system leaned into `out=` and in-place arithmetic so the data only makes one trip through memory per transform. Fewer allocations and fewer passes matter when arrays are 5M elements long: at that size the workload is limited less by CPU than by memory bandwidth and cache behavior, and eliminating temporaries helps with both.
Here’s the second generation’s top performer, which allocates just one temporary:
```python
def build_func(x):
    np.sin(x, out=x)
    t = x * x
    x -= t
    x += 1.0
    np.exp(x, out=x)
    return x
```

Generation 5: completing the square
z-score: 36.7
After a few generations without significant breakthroughs, a creative mutator (using the `explore` role) rewrote the expression inside the exponent by completing the square:

1 + sin(x) − sin²(x) = 1.25 − (sin(x) − 0.5)²

It took me a minute to understand why this is any faster: the `sin` call is still there, and we still need three arithmetic operations (two subtractions and a multiplication). Hmmm.
Let’s take a look at the generated function:
```python
def build_func(x):
    np.sin(x, out=x)
    np.subtract(x, 0.5, out=x)   # x := sin(x) - 0.5
    np.multiply(x, x, out=x)     # x := (sin(x) - 0.5)^2
    np.subtract(1.25, x, out=x)  # x := 1.25 - (sin(x) - 0.5)^2
    np.exp(x, out=x)
    return x
```

No temporaries! Every operation is an in-place ufunc on x. With the previous solution, there was no way around the `t = x * x` step: we still needed the original sin values after squaring them, so a temporary had to be allocated. With this new approach, every operation can be performed in-place, and we never need to go “back” to an earlier value of x at any point.
Okay, now we’re performing every operation in-place with NumPy ufuncs. There’s no more room for optimization, right? Right?
Generation 20: block-wise processing (streaming the array)
z-score: 41.2
This one was really interesting. Let’s just go straight to the code:
```python
def build_func(x):
    n = x.shape[0]
    block = 1 << 20  # 1,048,576 elements
    if block > n:
        block = n

    for i in range(0, n, block):
        j = min(i + block, n)
        y = x[i:j]
        np.sin(y, out=y)
        y -= 0.5
        y *= y
        y *= -1
        y += 1.25
        np.exp(y, out=y)

    return x
```

Instead of operating on the 5M-element array (40 MB as float64) in one shot, this approach splits the array into blocks and operates on each view of x independently. Why might this be faster? I think it’s because of better caching behavior. Each block is 8 MB, which fits in L3 cache (16 MB on my CPU); working on 8 MB chunks improves the chance that most operations stay inside L3 rather than repeatedly pulling the entire 40 MB array from DRAM.
And, naturally, after splitting a process into independent chunks…
Generation 25: parallelism over chunks
z-score: 75.0
The winning variant kept block-wise processing and wrapped the per-block kernel in a `ThreadPoolExecutor`. NumPy’s ufuncs release the GIL, so threads can execute those C loops concurrently on separate chunks. Important details that made this safe and fast: disjoint slices (no overlap), in-place ops on each slice (fewer writes), and a reasonable chunk size so each thread’s slice fits in cache.
```python
def build_func(x):
    n = x.shape[0]
    block = 1 << 20  # 1,048,576 elements
    if block > n:
        block = n

    def process_block(start, end):
        y = x[start:end]
        # Compute s = sin(x_block) in-place
        np.sin(y, out=y)
        # Allocate per-block scratch buffer for s^2
        scratch = np.empty(end - start, dtype=np.float64)
        # scratch = s^2
        np.multiply(y, y, out=scratch)
        # y = 1 + s
        y += 1.0
        # y = 1 + s - s^2
        y -= scratch
        # exp in-place
        np.exp(y, out=y)

    with ThreadPoolExecutor() as executor:
        futures = []
        for i in range(0, n, block):
            j = min(i + block, n)
            futures.append(executor.submit(process_block, i, j))
        for f in futures:
            f.result()

    return x
```

How the system actually found these ideas
Something that’s easy to miss when looking at the final solution is that no single “AI” discovered these tricks. What’s really going on is a small ecosystem of operators, each with its own job:
- `mutate` makes small, targeted edits to a promising parent
- `cross` merges two strong ideas into a hybrid
- `explore` deliberately tries something structurally different, even if it’s probably bad
- `insights` looks at recent generations and produces abstract recommendations for the next wave
Each generation, the system keeps a memory of high-scoring candidates, and the operator prompts use that memory in different ways. For example, `mutate` is local and incremental; `explore` is intentionally orthogonal; `cross` tries to splice design patterns or dataflows; and `insights` biases the others toward structural ideas that seem to correlate with performance.
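Purely as an illustration of “using the memory in different ways,” here’s a toy sketch; the `Memory` class, the `context_for` helper, and the specific selection rules are hypothetical, not LLM-Evo’s internals.

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    code: str
    z_score: float

class Memory:
    """Keeps the top-k candidates seen so far (illustrative only)."""
    def __init__(self, k: int = 16):
        self.k = k
        self.best: list[Candidate] = []

    def add(self, cand: Candidate) -> None:
        self.best.append(cand)
        self.best.sort(key=lambda c: c.z_score, reverse=True)
        del self.best[self.k:]

def context_for(operator: str, memory: Memory) -> list[Candidate]:
    # mutate: stay local -- build on the single best parent
    if operator == "mutate":
        return memory.best[:1]
    # cross: hand the prompt two strong parents to splice together
    if operator == "cross":
        return memory.best[:2]
    # explore: show one elite so the model knows what NOT to repeat
    if operator == "explore":
        return random.sample(memory.best, k=min(1, len(memory.best)))
    # insights: summarize the whole elite pool across recent generations
    return list(memory.best)
```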
Which operator gets to “speak” next is handled by a small multi-armed bandit. The bandit tracks an average reward for each operator and samples from a softmax over those averages plus a prior. The interesting thing is that this reward signal, because I’m using z-scores now, tends to be quite stable, so the bandit can actually learn meaningful preferences.
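Based only on that description, a softmax bandit of this kind might look like the sketch below; the prior, temperature, and reward-update rule are my assumptions, not the actual router.

```python
import math
import random

class SoftmaxBandit:
    """Routes the next 'speaker' among operators via a softmax over mean rewards."""

    def __init__(self, operators, prior=1.0, temperature=0.5):
        self.prior = prior            # optimistic prior added to each average
        self.temperature = temperature
        self.totals = {op: 0.0 for op in operators}
        self.counts = {op: 0 for op in operators}

    def _preference(self, op):
        mean = self.totals[op] / self.counts[op] if self.counts[op] else 0.0
        return mean + self.prior

    def choose(self):
        ops = list(self.totals)
        prefs = [self._preference(op) / self.temperature for op in ops]
        top = max(prefs)
        weights = [math.exp(p - top) for p in prefs]  # subtract max for numerical stability
        return random.choices(ops, weights=weights, k=1)[0]

    def update(self, op, reward):
        # reward: e.g., the z-score the operator's child achieved (an assumption)
        self.totals[op] += reward
        self.counts[op] += 1

# Hypothetical usage
bandit = SoftmaxBandit(["mutate", "cross", "explore", "insights"])
op = bandit.choose()
bandit.update(op, reward=1.8)  # illustrative reward
```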
What I learned from this run
A few things surprised me.
First, algebraic identities matter a lot in a system like this. I probably would have tried the cos² -> 1 − sin² identity, but I’m not sure I would have seen the need for the completed-square form. The fact that a structurally different algebraic representation can reduce temporary allocations makes sense in hindsight, but it still doesn’t feel like something that’s easy to find.
Second, block-wise processing was not something I anticipated showing up. That came out of the explore/cross interplay once the system hit a plateau: the operators started producing solutions that explicitly looped over chunks of the array. This was a really cool mutation that felt close to a “discovery” for me: it was not obvious, and no LLM I’ve asked (gpt-5, gemini-2.5-pro, claude-4.5-sonnet) found it when I asked them to optimize the original function as much as possible. This step unlocked the final trick: once you reframe the computation as a set of operations over independent blocks, parallelism becomes conceptually trivial.
Third, the whole system works better when the reward landscape is “smooth,” and switching to z-scores helped a lot here. Different problems have wildly different evaluator scales, and z-scores normalize that so the router can make meaningful decisions about operator utility.
Where I want to take this next
This run makes me think there’s a lot more room for improvement here. A few directions I’m considering:
- Adaptive chunk sizes: Right now the best block size is basically hand-tuned. It should really depend on L3 size, array dimensions, and maybe even the specific sequence of ufuncs. A tiny autotuner could figure this out much more cheaply than an LLM (a rough sketch follows below this list).
- Static analysis hints: Simple checks—like “you allocated a 40MB temporary” or “you passed through the array 3 times”—could feed into the bandit reward. Not as a hard constraint, but as a soft penalty.
- More granular operators: `mutate` is currently doing several conceptual things. Splitting it into smaller, more targeted micro-operators might give the bandit a clearer set of levers.
- Better insight propagation: The `insights` operator is still pretty young in this setup. There’s room for it to learn to generalize across tasks or detect structural patterns at a higher level.
- Managing LLM bias: It’s clear that using LLMs as mutators introduces a bias that isn’t present in “random” mutations. I want to be aware of how heavily the LLM is influencing the system; for example, I think it would be dangerous for the `insights` operator to be doing a lot of heavy lifting, since it’s an LLM-determined mutation using LLM-provided guidance. I wonder if using different model families might help mitigate some of this bias.
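For the adaptive-chunk-size idea, the kind of tiny autotuner I’m imagining would just time the block-wise kernel at a few candidate block sizes on a sample array and keep the fastest. The sketch below is hypothetical and not part of LLM-Evo today.

```python
import time
import numpy as np

def tune_block_size(kernel, n_sample=2_000_000,
                    candidates=(1 << 18, 1 << 19, 1 << 20, 1 << 21)):
    """Pick the fastest block size for `kernel(x, block)` on a small sample array."""
    best_block, best_time = candidates[0], float("inf")
    for block in candidates:
        x = np.random.random(n_sample)
        start = time.perf_counter()
        kernel(x, block)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_block, best_time = block, elapsed
    return best_block

def blockwise_kernel(x, block):
    # The generation-20 kernel, parameterized by block size
    n = x.shape[0]
    for i in range(0, n, block):
        y = x[i:min(i + block, n)]
        np.sin(y, out=y)
        y -= 0.5
        y *= y
        y *= -1
        y += 1.25
        np.exp(y, out=y)
    return x

print(tune_block_size(blockwise_kernel))
```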
I don’t think this system is close to “done.” If anything, this run makes me feel like I’ve just scratched the surface of what an LLM-driven evolutionary pipeline can do once it starts discovering its own engineering patterns.