EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Abstract

Learning prompts in the messy, mixed-task setting.

EEVEE is a multi-dataset test-time prompt learning framework for LLM agents. Instead of adapting to one stationary benchmark, it handles incoming task streams drawn from different datasets, domains, formats, and evaluation rules.

The core idea is to reduce cross-dataset interference with a learned router. Each input is assigned to one specialized prompt slot, so feedback from code generation, financial formulas, theorem QA, and closed-book science QA no longer has to be compressed into one shared instruction.

Router and prompt learning are coupled: routing decides which examples each prompt sees, while prompt behavior determines whether a route is useful. EEVEE addresses this with router-prompt co-evolution and a three-stage training process.

Why EEVEE

One learned prompt breaks down as task mixtures grow.

Prior test-time prompt learning methods are strongest when the feedback stream comes from one benchmark. Real agents are different: they see knowledge QA, symbolic reasoning, financial formulas, and code generation in the same operating loop.

In the paper's incremental setting, GEPA and ACE accumulate negative retention as more benchmarks enter the stream. EEVEE stays positive because the router separates incompatible feedback before each prompt specializes.

Figure: Incremental multi-benchmark retention for EEVEE, ACE, and GEPA as GPQA Diamond, Formula, TheoremQA, and HumanEval are added to the mixed stream.

Method

Route first, then specialize.

At inference time, the router selects a prompt slot for each input and the frozen target model answers with that prompt. At learning time, EEVEE alternates router evolution and prompt evolution so the assignment policy and slot behavior improve together.
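The inference path described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name `PromptRouter`, the keyword-based `keyword_router`, and the `echo_model` stand-in for the frozen target model are all hypothetical, and the real router is learned rather than hand-written.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class PromptRouter:
    """Hypothetical wrapper: one router, several prompt slots, one frozen model."""
    slots: Sequence[str]                      # specialized prompts, one per slot
    choose_slot: Callable[[str], int]         # routing policy (learned in EEVEE, stubbed here)
    target_model: Callable[[str, str], str]   # frozen model: (prompt, input) -> answer

    def answer(self, task_input: str) -> str:
        # Route first, then answer with the selected prompt only.
        slot = self.choose_slot(task_input)
        return self.target_model(self.slots[slot], task_input)


def keyword_router(task_input: str) -> int:
    # Toy stand-in for the learned router: send code-like inputs to slot 1.
    return 1 if "def " in task_input else 0


def echo_model(prompt: str, task_input: str) -> str:
    # Toy stand-in for the frozen target model; it just shows which prompt was used.
    return f"[{prompt}] {task_input}"


agent = PromptRouter(
    slots=["Answer with a single number.", "Return only an indented function body."],
    choose_slot=keyword_router,
    target_model=echo_model,
)

print(agent.answer("Compute free cash flow."))
print(agent.answer("def add(lst):"))
```

Because every input reaches exactly one slot, feedback gathered for that input would only ever update the prompt in that slot, which is the separation the router is meant to provide.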

Figure: The EEVEE framework, with router, prompt set, target model, and co-evolution loop. The router chooses a specialized prompt slot; learning co-evolves router prompts and model prompts through mutation, analysis, reflection, scoring, and regrouping.

01. Initialize

Build a Pareto-front prompt pool and greedily retain prompts with complementary validation coverage.

02. Explore

Alternate lightweight router and prompt updates while the assignment policy is still moving.

03. Converge

Fix the stable router and spend a larger prompt-learning budget inside each routed group.
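The greedy retention step in the Initialize stage can be sketched as a coverage-maximizing selection. This is an assumption about how "complementary validation coverage" might be operationalized, not the paper's exact procedure: the function name, the prompt names, and the toy coverage sets are hypothetical, and the Pareto-front construction that produces the candidate pool is not shown.

```python
def greedy_complementary_subset(coverage: dict[str, set[int]], k: int) -> list[str]:
    """Keep up to k prompts, each time taking the one that solves the most
    validation examples not already solved by the prompts kept so far."""
    chosen: list[str] = []
    covered: set[int] = set()
    remaining = dict(coverage)
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no candidate adds new coverage, so stop early
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical pool: each prompt maps to the validation example ids it solves.
pool = {
    "p_math": {0, 1, 2, 3},
    "p_math_v2": {0, 1, 2},   # redundant with p_math
    "p_code": {4, 5},
    "p_qa": {3, 6},
}

print(greedy_complementary_subset(pool, k=3))
```

In this toy pool the near-duplicate `p_math_v2` is never kept, because it adds nothing once `p_math` is selected; that is the sense in which the retained prompts are complementary.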

Results

Gains are largest when benchmarks are learned together.

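The four-benchmark suite mixes GPQA Diamond, Formula, TheoremQA, and HumanEval. EEVEE raises the four-benchmark average for both target models (+10.38 points for Qwen3-4B-Instruct, +24.32 for DeepSeek-V3.2) and keeps the GPQA Diamond drop under two points, avoiding the severe multi-task retention loss seen in single-prompt baselines.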

Main four-benchmark results. Scores are percentages averaged over three runs; values in parentheses denote differences from the corresponding target-model baseline.
Target model	Method	GPQA Diamond	Formula	TheoremQA	HumanEval	Avg.
Qwen3-4B-Instruct	Baseline	56.00	45.22	14.79	49.46	41.37
	ACE	48.93 (-7.07)	39.67 (-5.55)	15.84 (+1.05)	35.23 (-14.23)	34.92 (-6.45)
	GEPA	50.84 (-5.16)	49.83 (+4.61)	19.62 (+4.83)	30.62 (-18.84)	37.73 (-3.64)
	EEVEE	54.55 (-1.45)	54.55 (+9.33)	25.27 (+10.48)	72.63 (+23.17)	51.75 (+10.38)
DeepSeek-V3.2	Baseline	64.98	30.00	21.21	42.82	39.75
	ACE	55.89 (-9.09)	37.78 (+7.78)	27.05 (+5.84)	78.59 (+35.77)	49.83 (+10.08)
	GEPA	41.75 (-23.23)	60.56 (+30.56)	31.72 (+10.51)	89.29 (+46.47)	55.83 (+16.08)
	EEVEE	63.08 (-1.90)	60.55 (+30.55)	39.84 (+18.63)	92.82 (+50.00)	64.07 (+24.32)
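As a quick consistency check, the Avg. column and its differences can be recomputed from the per-benchmark scores. The sketch below copies the Baseline and EEVEE rows from the table; the helper names `average` and `delta_vs_baseline` are illustrative, and it assumes Avg. is the unweighted mean of the four benchmark scores.

```python
# Per-benchmark scores copied from the table, in the order
# GPQA Diamond, Formula, TheoremQA, HumanEval.
ROWS = {
    ("Qwen3-4B-Instruct", "Baseline"): [56.00, 45.22, 14.79, 49.46],
    ("Qwen3-4B-Instruct", "EEVEE"): [54.55, 54.55, 25.27, 72.63],
    ("DeepSeek-V3.2", "Baseline"): [64.98, 30.00, 21.21, 42.82],
    ("DeepSeek-V3.2", "EEVEE"): [63.08, 60.55, 39.84, 92.82],
}


def average(scores: list[float]) -> float:
    # Unweighted mean over the four benchmarks (assumed definition of Avg.).
    return sum(scores) / len(scores)


def delta_vs_baseline(model: str, method: str) -> float:
    # Difference between a method's average and the same model's baseline average.
    diff = average(ROWS[(model, method)]) - average(ROWS[(model, "Baseline")])
    return round(diff, 2)


for model in ("Qwen3-4B-Instruct", "DeepSeek-V3.2"):
    avg = round(average(ROWS[(model, "EEVEE")]), 2)
    print(model, avg, delta_vs_baseline(model, "EEVEE"))
```

The recomputed values match the reported averages of 51.75 and 64.07 and the reported differences of +10.38 and +24.32.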

Single benchmark check

Specialization does not sacrifice the simple setting.

EEVEE remains competitive when prompt learning is run one benchmark at a time, with especially strong Formula and HumanEval scores.

Figure: Single-benchmark results after independent prompt learning on GPQA Diamond, Formula, TheoremQA, HumanEval, FiNER, and IFBench.

Efficiency

The router adds modest overhead.

EEVEE's final-test token use averages 4.32k tokens per example, close to GEPA's and far below that of ACE, whose playbook context keeps expanding.

Figure: Average final-test token usage per test example for EEVEE, GEPA, and ACE.

Case study

Prompt learning captures reusable procedures.

Diagnostic retests compare empty prompts against final learned router and prompt sets. The strongest gains come when feedback can become a durable execution rule or output contract.
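Each case reports a "W→R / R→W" pair. A plausible reading, assumed here rather than stated on the page, is that W→R counts examples that flip from wrong to right after prompt learning and R→W counts flips from right to wrong. Under that assumption, the tally could be computed as in this sketch; the function name and the toy correctness lists are hypothetical, while the three reported pairs are taken from the cases.

```python
def flip_counts(before: list[bool], after: list[bool]) -> tuple[int, int]:
    """Return (wrong_to_right, right_to_wrong) for paired per-example correctness."""
    if len(before) != len(after):
        raise ValueError("before and after must cover the same examples")
    wrong_to_right = sum((not b) and a for b, a in zip(before, after))
    right_to_wrong = sum(b and (not a) for b, a in zip(before, after))
    return wrong_to_right, right_to_wrong


# Toy example: correctness with the empty prompt vs. the learned prompt set.
empty_prompt = [False, False, True, True, False]
learned_prompt = [True, True, True, False, False]
print(flip_counts(empty_prompt, learned_prompt))

# Net flips implied by the pairs reported in the three cases.
reported = {"Formula": (268, 21), "HumanEval": (193, 16), "GPQA Diamond": (55, 89)}
net_flips = {name: w2r - r2w for name, (w2r, r2w) in reported.items()}
print(net_flips)
```

Read this way, Formula and HumanEval gain far more examples than they lose, while GPQA Diamond loses more than it gains, which matches the sign of the reported deltas.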

Formula

Formula: unit scale

Learned prompts enforce the supplied financial formula, keep the correct dollar scale, and emit strict numeric answers.

Qwen Δ: +9.5
DeepSeek Δ: +31.7
W→R / R→W: 268 / 21
Runs+: 6/6

Task. Compute free cash flow from operating cash flow and capital expenditure.

Baseline. The model reverses the subtraction and keeps the million-scale decimal, producing a negative value.

-0.40

Learned. The final prompt applies the provided formula at dollar scale and emits a parseable numeric answer.

Answer: 400000.00

Takeaway: formula feedback turns into reusable rules for units, formula application, and final-answer formatting.
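The two outputs in this case can be reproduced with a small worked example. The page does not give the task's inputs, so the values below (1.2 and 0.8, in millions of dollars) are hypothetical numbers chosen only to match the reported answers; the formula used is the standard definition of free cash flow as operating cash flow minus capital expenditure.

```python
from decimal import Decimal

# Hypothetical inputs in millions of dollars (not taken from the benchmark).
operating_cash_flow_m = Decimal("1.2")
capital_expenditure_m = Decimal("0.8")


def free_cash_flow_dollars(ocf_millions: Decimal, capex_millions: Decimal) -> Decimal:
    # Apply the formula in the right order, then convert millions to dollars.
    return (ocf_millions - capex_millions) * Decimal(1_000_000)


# Baseline-style error: subtraction reversed, value left at million scale.
baseline_style = capital_expenditure_m - operating_cash_flow_m

# Learned-style answer: correct order, dollar scale, strict numeric format.
learned_style = free_cash_flow_dollars(operating_cash_flow_m, capital_expenditure_m)

print(f"{baseline_style:.2f}")
print(f"Answer: {learned_style:.2f}")
```

With these inputs the first line prints -0.40 and the second prints Answer: 400000.00, showing that the baseline makes two separate mistakes (operand order and unit scale) that the learned rules each address.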

Code

HumanEval: executable body

The final prompt steers the model toward complete function bodies with indentation, accumulators, and returns.

Qwen Δ: +23.2
DeepSeek Δ: +48.8
W→R / R→W: 193 / 16
Runs+: 6/6

Task. Complete a function that sums even values appearing at odd indices.

Baseline. The model writes a bare expression without the required indented return statement.

sum(lst[i] for i in range(1, len(lst), 2) if lst[i] % 2 == 0)

Learned. The learned prompt produces an executable function body with an accumulator and return.

total = 0
for i in range(1, len(lst), 2):
    if lst[i] % 2 == 0:
        total += lst[i]
return total

Takeaway: code feedback teaches executable output contracts, not just the underlying algorithm.
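To see why the output contract matters, the sketch below assumes a completion-style harness that appends the model's output to a function stub and executes the result. That harness behavior, the stub, and the function name `sum_even_at_odd_indices` are assumptions for illustration; the actual benchmark prompt is not reproduced here. The two completions are the ones shown in this case, with the learned body indented to sit inside the function.

```python
import textwrap

# Hypothetical function stub standing in for the benchmark prompt.
PROMPT = '''def sum_even_at_odd_indices(lst):
    """Return the sum of the even values found at odd indices of lst."""
'''

# Baseline output: a bare, unindented expression with no return.
BASELINE = "sum(lst[i] for i in range(1, len(lst), 2) if lst[i] % 2 == 0)\n"

# Learned output: a full body, indented so it belongs to the function.
LEARNED = textwrap.indent(
    "total = 0\n"
    "for i in range(1, len(lst), 2):\n"
    "    if lst[i] % 2 == 0:\n"
    "        total += lst[i]\n"
    "return total\n",
    "    ",
)


def passes(completion: str) -> bool:
    """Execute stub + completion and check one example."""
    namespace: dict = {}
    try:
        exec(PROMPT + completion, namespace)
        return namespace["sum_even_at_odd_indices"]([4, 2, 6, 7]) == 2
    except Exception:
        return False


print("baseline passes:", passes(BASELINE))
print("learned passes:", passes(LEARNED))
```

The baseline expression encodes the right algorithm but lands outside the function, so it fails when executed, while the learned body passes. This is the distinction between knowing the algorithm and honoring the executable output contract.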

Domain QA

GPQA Diamond: knowledge underweighted

Generic reasoning can become stronger while missing domain priors, such as rocky-planet self-compression.

Qwen Δ: -3.7
DeepSeek Δ: -7.7
W→R / R→W: 55 / 89
Runs+: 1/6

Task. Select the densest Earth-like exoplanet from mass and composition cues.

Baseline. The model uses the rocky-planet prior that higher mass increases self-compression and density.

For rocky planets of similar composition, radius grows sublinearly with
mass because stronger gravity compresses the material. A five-Earth-mass
rocky planet is therefore denser than Earth, while a half-Earth-mass
rocky planet is less dense.

Answer: the higher-mass Earth-composition option.

Learned. The final prompt performs formulaic density reasoning, treats same composition as constant density, and selects the Earth baseline.

Density is mass divided by volume. For a spherical planet,
rho = M / (4/3 pi R^3). If composition is the same, density is
constant; the radius scales with mass so that all Earth-composition
options remain approximately Earth density. Since the Earth-mass,
Earth-radius option is exactly Earth-like, choose that option.

Takeaway: stronger generic reasoning can still fail when the missing ingredient is domain knowledge.
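The gap between the two answers can be made explicit with a short derivation. It assumes a power-law mass-radius relation with exponent \beta, a simplification introduced here for illustration and not something stated on the page.

```latex
\begin{align*}
\rho &= \frac{M}{\tfrac{4}{3}\pi R^{3}}, \qquad R \propto M^{\beta}
  \;\Longrightarrow\; \rho \propto M^{1-3\beta} \\
\beta = \tfrac{1}{3} &:\ \rho \text{ does not depend on } M
  \quad \text{(constant density, as the learned answer assumes)} \\
\beta < \tfrac{1}{3} &:\ \rho \text{ grows with } M
  \quad \text{(self-compression, as the baseline answer assumes)}
\end{align*}
```

For example, with an illustrative exponent of \beta = 0.27 (an assumed value, not taken from the source), a five-Earth-mass planet of Earth-like composition would be about 5^{0.19} \approx 1.36 times as dense as Earth. The learned answer's formula for density is correct; its error is the unstated premise that same composition implies \beta = 1/3.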

BibTeX

Citation

@misc{xu2026eevee,
  title = {{EEVEE}: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents},
  author = {Weixian Xu and Shilong Liu and Mengdi Wang},
  year = {2026},
  eprint = {2606.11182},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2606.11182}
}