MathDuels
| Rank | Model | Composite Rating | Solve Rating | Author Rating |
|---|---|---|---|---|
Methodology
Each participating model plays two roles: author and solver. It writes problems for the others, and it attempts every problem written by someone else. A model is credited both for solving opponents' problems and for authoring problems that others cannot.
Every model authors the same number of problems in each of six mathematical domains. Authoring proceeds in three steps. First, the model writes a prompt describing how to construct a hard problem in the given domain. Second, it uses that prompt to produce a problem statement together with a single scalar answer. Third, it rewrites the problem into a harder variant.
Algebra
- Linear algebra
- Abstract algebra
- Representation theory
- Algebraic geometry
- Category theory
Geometry & Topology
- Differential geometry
- Smooth manifolds
- Point-set topology
- Algebraic topology
- Homotopy theory
Analysis
- Real analysis
- Measure & integration
- Functional analysis
- PDEs
- Complex analysis
Discrete Mathematics
- Combinatorics
- Graph theory
- Logic & foundations
- Algorithms
- Complexity
Probability & Statistics
- Probability theory
- Mathematical statistics
- Stochastic processes
- Stochastic calculus
- Markov chains
Applied & Computational
- Differential equations
- Optimization
- Numerical analysis
- Dynamical systems
- Control theory
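The three authoring steps described above can be sketched as a short pipeline. This is only an illustration under stated assumptions: `generate` is a hypothetical callable that sends a prompt to the authoring model and returns its text, the prompt wording and the `ANSWER:` suffix convention are invented for the example, and it assumes the harder variant carries its own scalar answer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Problem:
    domain: str
    statement: str
    answer: float  # the single scalar answer


def author_problem(generate: Callable[[str], str], domain: str) -> Problem:
    """Run the three authoring steps with a hypothetical `generate` function."""
    # Step 1: the model writes a prompt describing how to build a hard problem.
    recipe = generate(f"Describe how to construct a hard problem in {domain}.")

    # Step 2: it uses that prompt to produce a problem plus a scalar answer.
    # The 'ANSWER: <value>' suffix is an assumed output convention.
    draft = generate(
        f"{recipe}\n\nWrite one problem with a single scalar answer. "
        "End with 'ANSWER: <value>'."
    )

    # Step 3: it rewrites the problem into a harder variant.
    harder = generate(
        "Rewrite this problem into a harder variant, again ending with "
        f"'ANSWER: <value>'.\n\n{draft}"
    )

    # Split the final text into statement and answer at the last marker.
    statement, _, answer = harder.rpartition("ANSWER:")
    return Problem(domain=domain, statement=statement.strip(), answer=float(answer))
```

Passing the model as a plain callable keeps the sketch independent of any particular API client.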
Every non-author model then attempts every problem. If the solvers disagree, or if all of them fail, a separate verifier reads the statement along with the submitted reasoning traces, checks whether the problem has a unique well-posed answer, and, if it does, selects the correct one. Problems that fail this check are excluded from scoring.
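The rule for when a problem goes to the verifier can be written as a small predicate. The sketch below rests on assumptions not stated in the text: that "fail" means a solver's answer does not match the author's answer, that scalar answers are compared with a numeric tolerance, and that a missing answer (`None`) counts as a failure. The function name `needs_verifier` is hypothetical.

```python
import math
from typing import Optional, Sequence


def needs_verifier(
    author_answer: float,
    solver_answers: Sequence[Optional[float]],
    rel_tol: float = 1e-9,
) -> bool:
    """Return True if solvers disagree or none matches the author's answer."""

    def same(x: float, y: float) -> bool:
        return math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-12)

    # Drop missing answers; they cannot match anything (assumed convention).
    submitted = [a for a in solver_answers if a is not None]

    # "All of them fail": no submitted answer matches the author's answer.
    all_fail = not any(same(a, author_answer) for a in submitted)

    # "The solvers disagree": at least two submitted answers differ.
    disagree = any(not same(a, submitted[0]) for a in submitted[1:])

    return all_fail or disagree
```

A problem flagged by this predicate would then be passed to the verifier, which either selects the correct answer or excludes the problem from scoring.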
Solver ability and problem difficulty are fit jointly under a Rasch model, equivalent to Bradley–Terry on solver-problem pairs. A model's solve rating is its fitted ability on an Elo scale; its author rating is the mean fitted difficulty of the problems it authored. The composite rating is the equal-weight average of the two. Confidence intervals come from a stratified bootstrap over authors, refitting the Rasch model at each resample.
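Under a Rasch model, the probability that solver i solves problem j is sigmoid(theta_i - b_j), where theta_i is the solver's ability and b_j is the problem's difficulty. The following is a minimal sketch of such a fit, not the leaderboard's actual code. Several details are assumptions made for the example: the function names, the small ridge penalty (added so that problems nobody solves keep a finite difficulty), the Elo mapping of 1500 + 400/ln(10) per logit, putting difficulties on that same scale, and reading "stratified bootstrap over authors" as resampling problems with replacement within each author's set.

```python
import numpy as np


def fit_rasch(outcomes, mask, ridge=0.01, lr=0.5, steps=2000):
    """Jointly fit abilities and difficulties by penalized gradient ascent.

    outcomes[i, j] is 1 if solver i solved problem j, else 0.
    mask[i, j] is True where solver i actually attempted problem j
    (False for a model's own problems).
    """
    theta = np.zeros(outcomes.shape[0])  # solver abilities, in logits
    b = np.zeros(outcomes.shape[1])      # problem difficulties, in logits
    n_theta = np.maximum(mask.sum(axis=1), 1)
    n_b = np.maximum(mask.sum(axis=0), 1)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = np.where(mask, outcomes - p, 0.0)  # observed minus expected
        # Log-likelihood gradients, scaled by attempt counts for stability.
        theta += lr * (resid.sum(axis=1) - ridge * theta) / n_theta
        b += lr * (-resid.sum(axis=0) - ridge * b) / n_b
    return theta, b


def to_elo(logit, anchor=1500.0):
    """Map logits to an Elo-style scale (anchor and scale are assumptions)."""
    return anchor + (400.0 / np.log(10.0)) * logit


def ratings(theta, b, author_of):
    """Solve, author, and composite ratings; author ids index the solvers."""
    solve = to_elo(theta)
    author = np.array([to_elo(b[author_of == i].mean()) for i in range(len(theta))])
    return solve, author, (solve + author) / 2.0


def bootstrap_abilities(outcomes, mask, author_of, n_boot=200, seed=0):
    """95% percentile intervals for abilities, resampling within each author."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_boot):
        cols = np.concatenate([
            rng.choice(idx, size=idx.size, replace=True)
            for idx in (np.flatnonzero(author_of == a) for a in np.unique(author_of))
        ])
        theta, _ = fit_rasch(outcomes[:, cols], mask[:, cols])  # refit each time
        draws.append(theta)
    return np.percentile(np.array(draws), [2.5, 97.5], axis=0)
```

Without some regularization or anchoring, the maximum-likelihood difficulty of a problem that no solver answers correctly diverges, which matters here because unsolved problems are exactly what the author rating rewards.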
Citation
@misc{mathduels2026,
title = {MathDuels: Evaluating LLMs as Problem Posers and Solvers},
author = {Anonymous},
year = {2026},
eprint = {2604.XXXXX},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
| Title | Author | Solved |
|---|---|---|