Idea First, Code Later

Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming

Sama Hadhoud1   Alaa Elsetohy1   Frederikus Hudi2   Jan Christian Blaise Cruz1   Steven Halim3   Alham Fikri Aji1
1MBZUAI   2NAIST   3National University of Singapore
💡
Competitive programming primarily evaluates algorithmic problem solving rather than code generation.
Accordingly, we evaluate LLMs by separating algorithmic reasoning from implementation via natural-language editorials.

Abstract

Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial prior to code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck in specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations and validate an LLM-as-a-judge protocol for scalable evaluation. We introduce a dataset of 83 ICPC-style problems with gold editorials and full test suites, and evaluate 19 LLMs, arguing that future benchmarks should explicitly separate problem solving from implementation.

Pipeline Overview

Competitive programming is fundamentally a problem-solving task: deriving the right algorithm under strict constraints. Yet LLM evaluations often collapse everything into pass/fail, mixing up reasoning failures (wrong idea) and implementation failures (right idea, wrong code). We fix this by making the “idea” explicit: we use natural-language editorials as an intermediate step, allowing us to evaluate planning and coding separately.

Evaluation pipeline and editorial annotation scheme

Overview of our evaluation pipeline and editorial annotation scheme. Left: three settings, w/oEd (problem → code, baseline), w/GenEd (problem → generated editorial → code), and w/GoldEd (problem plus gold editorial → code). Right: the LLM-generated editorial annotation rubric used to diagnose reasoning quality, covering Problem Understanding (PU-W, PU-M, PU-X, PU-D), Algorithm Description (ALG-TAG vs. Golden-ALG-TAG), and Algorithm Correctness (ALG-COR, correctness type, error type, and severity).

Three Evaluation Settings

We test each model in three conditions to isolate different capabilities:

w/oEd Baseline

Problem → Code

Code-only baseline: the model is asked to solve the problem directly. This mirrors standard CP benchmarks and conflates algorithmic reasoning with implementation—so when a submission fails, the failure mode is ambiguous (incorrect planning, incorrect coding, or both).

w/GenEd Self-Planning

Problem → Editorial → Code

Generated editorial: we first ask the model to write an editorial and then generate code conditioned on it. The editorial is generated once and kept fixed, helping isolate model-generated reasoning from implementation: failures can stem from an incorrect or incomplete plan, or from failing to faithfully implement a correct plan.

w/GoldEd Oracle Plan

Problem + Editorial → Code

Gold editorial: we provide an expert-written gold editorial to estimate performance under correct reasoning. This approximates an upper bound given a correct plan and isolates implementation limitations—any remaining errors therefore arise from implementation.
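To make the three settings concrete, the sketch below shows one way such a pipeline could be wired together. It is illustrative only: the generate callable stands in for any LLM completion API, the prompt wording is not the exact wording used in our experiments, and C++17 is assumed as the target language.

# Minimal sketch of the three evaluation settings (w/oEd, w/GenEd, w/GoldEd).
# `generate` stands in for any LLM completion call; prompt wording and the
# C++17 target are illustrative assumptions.
from typing import Callable

Generate = Callable[[str], str]  # prompt -> model output

def solve_wo_ed(generate: Generate, statement: str) -> str:
    """w/oEd: problem -> code, the standard code-only baseline."""
    return generate(
        "Solve this competitive programming problem in C++17. Output only the code.\n\n"
        + statement
    )

def solve_w_gen_ed(generate: Generate, statement: str) -> tuple[str, str]:
    """w/GenEd: problem -> editorial -> code; the editorial is generated once and kept fixed."""
    editorial = generate(
        "Write a concise natural-language editorial (key observations, algorithm, "
        "complexity) for this problem. Do not write code.\n\n" + statement
    )
    code = generate(
        "Implement the following editorial in C++17. Output only the code.\n\n"
        f"Problem:\n{statement}\n\nEditorial:\n{editorial}"
    )
    return editorial, code

def solve_w_gold_ed(generate: Generate, statement: str, gold_editorial: str) -> str:
    """w/GoldEd: problem + expert editorial -> code; isolates implementation ability."""
    return generate(
        "Implement the following expert editorial in C++17. Output only the code.\n\n"
        f"Problem:\n{statement}\n\nEditorial:\n{gold_editorial}"
    )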

Dataset

We curate 83 ICPC-style problems from 7 contests spanning 2017–2025. Each comes with the original problem statement, a gold editorial from problem setters or testers, and the full official test suite.

Contest                          Years       Problems
CS3233 Midterm (NUS)             2023–2025   34
ICPC Asia Pacific Championship   2024        13
ICPC Asia Jakarta Regional       2017–2019   36
Total                                        83

Judging is ICPC-style: compile, run on all tests, accept only if every test passes within limits. No partial credit.
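For illustration, a bare-bones version of this judging loop could look like the sketch below. The file layout (*.in / *.ans test files), the fixed two-second limit, the g++ invocation, and the token-wise output comparison are simplifying assumptions; real ICPC judges also enforce memory limits and may use problem-specific checkers.

# Sketch of ICPC-style judging: compile once, run on every test, accept only if
# all tests pass within the limits. File layout and the exact-match comparison
# are simplifying assumptions.
import subprocess
from pathlib import Path

def judge(source: Path, tests_dir: Path, time_limit_s: float = 2.0) -> str:
    binary = source.resolve().with_suffix("")  # e.g., /path/sol.cpp -> /path/sol
    compile_run = subprocess.run(
        ["g++", "-O2", "-std=c++17", str(source), "-o", str(binary)],
        capture_output=True,
    )
    if compile_run.returncode != 0:
        return "CE"  # Compile Error

    for case_in in sorted(tests_dir.glob("*.in")):
        expected = case_in.with_suffix(".ans").read_text()
        try:
            with case_in.open() as fin:
                run = subprocess.run(
                    [str(binary)],
                    stdin=fin,
                    capture_output=True,
                    text=True,
                    timeout=time_limit_s,
                )
        except subprocess.TimeoutExpired:
            return "TLE"  # Time Limit Exceeded
        if run.returncode != 0:
            return "RTE"  # Runtime Error
        if run.stdout.split() != expected.split():
            return "WA"  # Wrong Answer; no partial credit
    return "PASS"  # accepted only when every test passes

The verdict labels mirror those in the failure analysis below (PASS, WA, TLE, RTE, CE); MLE is omitted because this sketch does not enforce a memory limit.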

Results

🏅 Gold editorials yield large and consistent gains

On average, providing a gold editorial yields a large absolute gain: pass@1 increases from 23.2% (w/oEd) to 37.7% (w/GoldEd), i.e., +14.5 points. Gold editorials substantially improve performance across all difficulties.

🧱 Gold editorials isolate an implementation gap

However, performance remains far from saturated even with gold guidance, especially on T3 (rising only from 8.0% to 16.5% on average). This isolates a substantial residual implementation gap: models often struggle to translate a high-level algorithm into correct and efficient code.

📝 Self-generated editorials yield smaller and less reliable changes

In contrast, self-generated editorials yield much smaller and less reliable changes, and can even degrade performance (overall 23.2% w/oEd vs 23.1% w/GenEd; on T3, 8.0% to 6.5%). When the generated editorial is incomplete, overly complex, or subtly wrong, conditioning on it can lock the model into a flawed plan.

🔄 Editorials transfer across models (writer–coder composition)

Editorial transfer improves coding: using editorials from stronger writers often improves weaker coders, and across coders, every cross-model configuration performs at least as well as the coder’s w/GenEd. Overall, these results show that reasoning and implementation can be modularized, with editorials serving as a simple, model-agnostic interface.
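One way to picture the writer–coder composition: generate each writer's editorial once, then hand it to every coder for implementation. The sketch below only shows the bookkeeping; model names, the generate callables, and the prompt text are placeholder assumptions.

# Sketch of writer–coder composition: a "writer" model's fixed editorial is handed
# to a (possibly different) "coder" model for implementation. The diagonal of the
# grid corresponds to w/GenEd; off-diagonal cells are cross-model transfer.
from typing import Callable, Dict, Tuple

Generate = Callable[[str], str]  # prompt -> model output

def implement(coder: Generate, statement: str, editorial: str) -> str:
    return coder(
        "Implement the following editorial in C++17. Output only the code.\n\n"
        f"Problem:\n{statement}\n\nEditorial:\n{editorial}"
    )

def transfer_grid(
    statement: str,
    editorials: Dict[str, str],   # writer model name -> its fixed generated editorial
    coders: Dict[str, Generate],  # coder model name -> completion callable
) -> Dict[Tuple[str, str], str]:
    """Produce code for every (writer, coder) pair on one problem."""
    return {
        (writer, coder_name): implement(coder_fn, statement, editorial)
        for writer, editorial in editorials.items()
        for coder_name, coder_fn in coders.items()
    }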

📊 Full Results Table

Pass@1 by difficulty tertile (T1 easiest–T3 hardest). Gold editorials substantially improve performance across all difficulties (up to ∼30%), but hard problems remain challenging, indicating a residual implementation bottleneck. Self-generated editorials yield smaller, model-dependent gains (up to ∼15%) and sometimes hurt performance, highlighting a persistent problem-solving gap in model-generated reasoning.

Model  Overall (83 problems)  T1 – Easy (26)  T2 – Medium (28)  T3 – Hard (29)
Columns in each group: w/oEd, w/GenEd (Δ vs. w/oEd), w/GoldEd (Δ vs. w/oEd)
Closed Source Models
GPT-5 67.5% 68.7% +1.2 83.1% +15.7 92.3% 88.5% −3.8 96.2% +3.8 75.0% 75.0% +0.0 92.9% +17.9 37.9% 44.8% +6.9 62.1% +24.1
O3 51.8% 45.8% −6.0 63.9% +12.0 69.2% 76.9% +7.7 80.8% +11.5 60.7% 46.4% −14.3 75.0% +14.3 27.6% 17.2% −10.3 37.9% +10.3
Gemini 2.5 Pro 43.4% 45.8% +2.4 72.3% +28.9 69.2% 73.1% +3.8 92.3% +23.1 42.9% 50.0% +7.1 78.6% +35.7 20.7% 17.2% −3.4 48.3% +27.6
Gemini 2.5 Flash 38.6% 37.3% −1.2 54.2% +15.7 65.4% 73.1% +7.7 80.8% +15.4 39.3% 32.1% −7.1 53.6% +14.3 13.8% 10.3% −3.4 31.0% +17.2
Claude Opus 4 21.7% 30.1% +8.4 47.0% +25.3 42.3% 57.7% +15.4 76.9% +34.6 21.4% 32.1% +10.7 57.1% +35.7 3.4% 3.4% +0.0 10.3% +6.9
Claude Sonnet 4 16.9% 19.3% +2.4 48.2% +31.3 34.6% 38.5% +3.8 73.1% +38.5 17.9% 17.9% +0.0 57.1% +39.3 0.0% 3.4% +3.4 17.2% +17.2
GPT-4.1 13.3% 17.1% +3.8 33.7% +20.5 26.9% 34.6% +7.7 61.5% +34.6 3.6% 14.8% +11.2 32.1% +28.6 10.3% 3.4% −6.9 10.3% +0.0
GPT-4o 7.2% 3.6% −3.6 13.3% +6.0 19.2% 7.7% −11.5 38.5% +19.2 0.0% 3.6% +3.6 3.6% +3.6 3.4% 0.0% −3.4 0.0% −3.4
Closed Source Avg 32.5% 33.5% +0.9 52.0% +19.4 52.4% 56.2% +3.8 75.0% +22.6 32.6% 34.0% +1.4 56.2% +23.7 14.7% 12.5% −2.2 27.2% +12.5
Open Source Models
GPT-OSS-120B 41.0% 31.3% −9.6 59.0% +18.1 61.5% 61.5% +0.0 76.9% +15.4 50.0% 32.1% −17.9 75.0% +25.0 13.8% 3.4% −10.3 27.6% +13.8
GPT-OSS-20B 33.7% 27.7% −6.0 47.0% +13.3 57.7% 53.8% −3.8 65.4% +7.7 39.3% 28.6% −10.7 53.6% +14.3 6.9% 3.4% −3.4 24.1% +17.2
DeepSeek-R1 28.9% 45.8% +16.9 43.4% +14.5 61.5% 88.5% +26.9 65.4% +3.8 21.4% 39.3% +17.9 53.6% +32.1 6.9% 13.8% +6.9 13.8% +6.9
Qwen3-8B 15.7% 13.3% −2.4 24.1% +8.4 34.6% 34.6% +0.0 42.3% +7.7 14.3% 7.1% −7.1 25.0% +10.7 0.0% 0.0% +0.0 6.9% +6.9
DeepSeek-V3 14.5% 9.6% −4.8 28.9% +14.5 34.6% 26.9% −7.7 61.5% +26.9 10.7% 3.6% −7.1 25.0% +14.3 0.0% 0.0% +0.0 3.4% +3.4
Qwen3-Coder-480B 13.3% 10.8% −2.4 28.9% +15.7 23.1% 19.2% −3.8 61.5% +38.5 14.3% 10.7% −3.6 17.9% +3.6 3.4% 3.4% +0.0 10.3% +6.9
Kimi-K2 13.3% 13.3% +0.0 26.5% +13.3 34.6% 34.6% +0.0 50.0% +15.4 3.6% 7.1% +3.6 28.6% +25.0 3.4% 0.0% −3.4 3.4% +0.0
OlympicCoder-7B 6.0% 8.4% +2.4 10.8% +4.8 19.2% 26.9% +7.7 26.9% +7.7 0.0% 0.0% +0.0 7.1% +7.1 0.0% 0.0% +0.0 0.0% +0.0
Llama-3.1-405B 6.0% 2.4% −3.6 15.7% +9.6 19.2% 7.7% −11.5 42.3% +23.1 0.0% 0.0% +0.0 3.6% +3.6 0.0% 0.0% +0.0 3.4% +3.4
Llama-3.3-70B 4.8% 6.0% +1.2 8.4% +3.6 11.5% 11.5% +0.0 23.1% +11.5 3.6% 7.1% +3.6 0.0% −3.6 0.0% 0.0% +0.0 3.4% +3.4
Gemma-3-27B 3.6% 2.4% −1.2 8.4% +4.8 7.7% 3.8% −3.8 23.1% +15.4 3.6% 3.6% +0.0 3.6% +0.0 0.0% 0.0% +0.0 0.0% +0.0
Open Source Avg 16.4% 15.6% −0.9 27.4% +11.0 33.2% 33.6% +0.3 49.0% +15.7 14.6% 12.7% −1.9 26.6% +12.0 3.1% 2.2% −0.9 8.8% +5.6
Overall Average 23.2% 23.1% −0.1 37.7% +14.5 41.3% 43.1% +1.8 59.9% +18.6 22.2% 21.6% −0.5 39.1% +16.9 8.0% 6.5% −1.5 16.5% +8.5

Legend: Δ values next to w/GenEd and w/GoldEd give the change in pass@1 relative to w/oEd for the same model and difficulty group.
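For reference, each cell in the table above is a simple aggregate over per-problem outcomes. A minimal sketch of that aggregation, assuming a single attempt per problem (so pass@1 reduces to the fraction of accepted problems) and with deltas taken against the w/oEd column:

# Sketch of the aggregation behind the table: pass@1 per cell is the fraction of
# accepted problems in a difficulty tertile, and the delta is taken against w/oEd.
# The single-attempt assumption and the verdict layout are illustrative.
from typing import Dict, Iterable, Tuple

Verdicts = Dict[str, str]  # problem id -> final verdict ("PASS", "WA", "TLE", ...)

def pass_at_1(verdicts: Verdicts, problem_ids: Iterable[str]) -> float:
    ids = list(problem_ids)
    return 100.0 * sum(verdicts[p] == "PASS" for p in ids) / len(ids)

def table_cell(wo_ed: Verdicts, setting: Verdicts,
               tertile_ids: Iterable[str]) -> Tuple[float, float]:
    """Return (pass@1 for the setting, delta vs. w/oEd) on one difficulty tertile."""
    ids = list(tertile_ids)
    score = pass_at_1(setting, ids)
    return score, score - pass_at_1(wo_ed, ids)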

Virtual rank percentile
Mean virtual rank percentile under w/oEd, w/GenEd, and w/GoldEd (higher is better). Gold editorials yield large and consistent improvements (up to ∼0.4), yet even under gold guidance only a handful of models exceed a rank percentile of ∼0.7, and few reach ∼0.8.
Failure distribution
Aggregate failure verdict distribution across all editorial settings: Wrong Answer (WA), Time Limit Exceeded (TLE), Runtime Error (RTE), Compile Error (CE), and Memory Limit Exceeded (MLE). Remaining failures are dominated by WA, while TLE becomes more salient for some stronger models (notably Claude).
Cross-model transfer
Cross-model editorial transfer. Each bar reports pass@1 when a fixed coder implements an editorial written by a different model. Using editorials from stronger writers often improves weaker coders and can occasionally yield performance competitive with or exceeding the writer’s own end-to-end results.

Why Do Generated Editorials Fail?

We adopt an LLM-as-a-judge to label model-generated editorials against gold editorials using the same rubric as the expert annotator. Across models, the dominant failure mode is wrong algorithm—i.e., incorrect problem solving—rather than purely implementation errors. Frontier models produce a larger share of judge-Correct plans, but wrong algorithm remains the most common error across many models.

Editorial correctness
Six-way editorial correctness breakdown labeled by the LLM-as-a-judge. Frontier models produce more judge-Correct plans, but wrong algorithm—i.e., incorrect problem solving—remains the dominant error across many models.
Verdicts by correctness
Downstream verdict distribution (PASS/WA/TLE/RTE/CE/MLE) conditioned on editorial correctness labels. Editorial correctness labels meaningfully stratify downstream outcomes.
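A minimal sketch of how such a judge call might look: the judge model receives the problem statement, the gold editorial, the generated editorial, and the rubric, and is asked to reply with JSON following the schema given in the annotation section below. The prompt wording and the judge callable are assumptions, not the exact protocol used in the paper.

# Sketch of the LLM-as-a-judge call: the judge sees statement, gold editorial,
# generated editorial, and rubric, and returns rubric labels as JSON.
# Prompt text and the `judge` callable are illustrative assumptions.
import json
from typing import Callable

Generate = Callable[[str], str]  # prompt -> judge model output

def judge_editorial(judge: Generate, statement: str, gold: str,
                    generated: str, rubric: str) -> dict:
    prompt = (
        "You are grading an LLM-generated editorial against the gold editorial "
        "using the rubric below. Respond with JSON only, following the given schema.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Problem statement:\n{statement}\n\n"
        f"Gold editorial:\n{gold}\n\n"
        f"Generated editorial:\n{generated}"
    )
    return json.loads(judge(prompt))  # downstream checks validate the labels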

Annotation Guidelines

We annotate each model-generated editorial against the problem statement and the gold editorial using a structured rubric with three parts: Problem Understanding (PU), Algorithm Description (ALG), and Algorithm Correctness (ALG-COR). This is the same layout used in our Excel annotation template.

Excel-style annotation sheet layout

Column groups mirror the spreadsheet: metadata → Problem Understanding → Algorithm Description → Algorithm Correctness.

Metadata: Problem ID · Statement · Gold editorial · Model ID · Model editorial
Problem Understanding (PU): PU-W · Type · PU-M · Type · PU-X · PU-D · Comments
Algorithm Description (ALG): ALG-TAG · Other · ALG-FREE · Golden-TAG · Other · Golden-FREE · Comments
Algorithm Correctness (ALG-COR): ALG-COR · Correct type · Why incorrect · Severity · Comments

Example row, grouped by section:
Metadata: arts_and_computing_students · statement.md · analysis.md · 0 · editorial.md
PU: No · No · Minor · 2 · Short notes…
ALG: Greedy DP · [] · ~40-word summary… · DP · [] · ~40-word summary… · Notes…
ALG-COR: Correct · Same as golden · Notes…


Full rubric (definitions)

1) Problem Understanding (PU)

Purpose: Verify that the editorial accurately captures every essential detail of the problem statement—no mis‑read constraints, missing subtleties, or misleading additions.

PU-W: Does the editorial state a wrong crucial detail that changes the problem's meaning? Allowed values: Yes / No. If Yes, mark explicit vs. implicit misinformation.
PU-M: Was a constraint or subtlety that affects the problem's meaning skipped? Allowed values: Yes / No. If Yes, mark whether the missing information is explicit vs. implicit in the statement.
PU-X: Extra statements that do not change correctness but muddy understanding. Allowed values: None / Minor / Major. Use Major if it strongly steers the reader toward a wrong model.
PU-D: How hard is the original problem statement to understand? Allowed values: 0–5, where 0 = very clear and 5 = extremely difficult.
2) Algorithm Description (ALG)

Purpose: Assess the high‑level idea/algorithm choice.

ALG-TAG: The high-level idea/algorithm described in the LLM-generated editorial. Allowed values: one or more paradigms, e.g., DP, Greedy, DFS/BFS, Dijkstra, Segment Tree, Binary Lifting, FFT, Flow, Geometry, Math/Number Theory, Other. If Other, fill ALG-TAG-OTHER with specific names.
ALG-FREE: One or two concise sentences (~40 words) summarising the core idea. Free text.
Golden-ALG-TAG: The high-level idea/algorithm described in the gold-standard editorial (for reference only). Allowed values: same paradigm list as ALG-TAG. If Other, fill Golden-ALG-TAG-OTHER with specific names.
Golden-ALG-FREE: One or two concise sentences (~40 words) summarising the gold editorial's core idea. Free text.
3) Algorithm Correctness (ALG-COR)

Purpose: Judge whether the editorial’s algorithm—as described—actually solves the problem correctly and efficiently within the stated constraints.

ALG-COR: Overall correctness of the model editorial's algorithm as described. Allowed values: Correct / Incorrect. "Correct" means the editorial's method always solves the problem under the stated constraints; otherwise mark "Incorrect".

Correct type (if Correct): Whether it matches the gold approach. Allowed values:
  Same as golden: matches the official solution exactly. Example: tags the approach as "DP" and implements the correct recurrence exactly as in the gold editorial.
  Different from golden: uses a different but equally valid approach. Example: uses a greedy approach instead of DP, but still works within the problem constraints.

Why incorrect (if Incorrect): Primary failure-mode diagnosis. Allowed values:
  Wrong algorithm (WA): the high-level idea/algorithm choice itself is unsuitable—there is no way this approach ever solves the full problem. Example: uses DP for a graph problem.
  Correct algorithm but incorrect approach (WA): the intended high-level idea/algorithm choice is sound, but an implementation mistake breaks correctness. Example: implements binary search but miscomputes mid so it never converges.
  Suboptimal (Likely TLE or MLE), but correct algorithm: the algorithm returns correct answers yet violates resource limits in worst-case scenarios; the choice of algorithm is correct, but it needs tweaking to fit the time or memory limits. Example: uses a higher-complexity method than required and will time out on large inputs, or uses excessive memory (e.g., stores a full N×N matrix) instead of a linear or in-place approach.
  Suboptimal (Likely TLE or MLE), and wrong algorithm: the algorithm returns correct answers yet violates resource limits in worst-case scenarios because of the wrong choice of algorithm. Example: uses brute force, or DP for a problem that needs a greedy solution.

Severity (if Incorrect): How much fixing is needed to make it correct. Allowed values:
  Completely wrong: no part of the solution can be salvaged; the high-level idea/algorithm choice and implementation bear no relation to a valid strategy.
  Major edits needed: the core structure almost works but requires re-architecting key parts.
  Minor edits needed: a small tweak—an off-by-one fix, a missing boundary check, or one initialization—will restore full correctness.
Optional: JSON schema for LLM-as-a-judge outputs
{
  "PU": { "PU-W": {...}, "PU-M": {...}, "PU-X": {...}, "PU-D": {...} },
  "ALG": { "ALG-TAG": [...], "ALG-TAG-OTHER": [...], "ALG-FREE": "...",
           "Golden-ALG-TAG": [...], "Golden-ALG-TAG-OTHER": [...], "Golden-ALG-FREE": "..." },
  "ALG-COR": {
    "overall": "Correct|Incorrect",
    "if_correct": { "correct_type": "Same as golden|Different from golden|null", "notes": "..." },
    "if_incorrect": { "why_incorrect": "...|null", "severity": "...|null", "diagnosis": "...|null" }
  }
}
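As a usage note, parsed judge outputs in this shape can be sanity-checked against the rubric's categorical values before aggregation. The sketch below covers only the ALG-COR fields and is illustrative; it is not the validation code used in the study.

# Sketch: sanity-check a parsed judge output against the rubric's categorical
# values. Only the ALG-COR fields are checked; free-text fields ("...") are not.
ALLOWED_OVERALL = {"Correct", "Incorrect"}
ALLOWED_CORRECT_TYPE = {"Same as golden", "Different from golden", None}
ALLOWED_SEVERITY = {"Completely wrong", "Major edits needed", "Minor edits needed", None}

def validate_alg_cor(judgment: dict) -> list:
    """Return a list of problems found in the ALG-COR part of a judge output."""
    errors = []
    alg_cor = judgment.get("ALG-COR", {})
    if alg_cor.get("overall") not in ALLOWED_OVERALL:
        errors.append("ALG-COR.overall outside allowed values")
    if alg_cor.get("if_correct", {}).get("correct_type") not in ALLOWED_CORRECT_TYPE:
        errors.append("ALG-COR.if_correct.correct_type outside allowed values")
    if alg_cor.get("if_incorrect", {}).get("severity") not in ALLOWED_SEVERITY:
        errors.append("ALG-COR.if_incorrect.severity outside allowed values")
    return errors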

Citation

@misc{hadhoud2026ideafirstcodelater,
  title={Idea First, Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming},
  author={Sama Hadhoud and Alaa Elsetohy and Frederikus Hudi and Jan Christian Blaise Cruz and Steven Halim and Alham Fikri Aji},
  year={2026},
  eprint={2601.11332},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.11332},
}