Title

AI

Research-level mathematics

June 2026

AI and Math
State of the Art

From Olympiad gold to Erdős problems, First Proof, Lean, and human verification.

Concise factual briefing

Summer 2025 → June 2026

Map

What changed in one year?

01

Contest reasoning crossed gold level

02

Erdős problems became the live testbed

03

AI disproved a famous Erdős conjecture

04

Formal proof search scaled in Lean

05

Failure modes and costs were documented

06

Tao and Gowers revised their priors

The Year in Events

01

Part One · Timeline

The Year
in Events

July 2025 → June 2026: from Olympiad gold to a disproved Erdős conjecture, told as a sequence of dated, sourced milestones.

July 2025

JUL 21 2025 · INTERNATIONAL MATHEMATICAL OLYMPIAD

Gemini Deep Think and OpenAI reached gold-medal level.

Natural-language proofs, five of six problems.

35 / 42

gold threshold performance

November 2025

NOV 3 2025 · ALPHAEVOLVE MATH PAPER

AlphaEvolve moved from coding agent to math explorer.

Construction search, not theorem proving.

67

problems across analysis, combinatorics, geometry, number theory

December 2025

DEC 2025 · EQUATIONAL THEORIES PROJECT

Tao’s project resolved 22 million implications.

Between 4,694 magma equational laws, formalized in Lean.

4,694

magma laws, classical solvers not LLMs

January 2026

JAN 2026 · ERDŐS #728 AND #1026

#728 became the first Lean-verified autonomous AI solve.

GPT-5.2 Pro reasoned; Harmonic’s Aristotle formalized.

#728

informal proof → machine-checked Lean

January 2026

JAN 2026 · GEMINI ON THE ERDŐS DATABASE

Aletheia swept 700 “open” Erdős problems.

13 meaningful outcomes after human review.

4 + 9

novel-looking solutions plus literature finds

January 2026

JAN 2026 · CÓRDOBA–CÓRDOBA–FONTELOS

A neural network found singularity candidates in fluid equations.

A physics-informed net, ~billion-fold accuracy gain.

10⁹×

accuracy gain, not a proof

February 2026

FEB 2026 · AXIOMPROVER

AxiomProver produced a Lean proof of Fel’s conjecture.

From a LaTeX statement to machine-checked code.

Lean

translation, not strategy

February 2026

FEB 2026 · TOWARDS AUTONOMOUS RESEARCH

Aletheia wrote a full paper with no human intervention.

Eigenweights in arithmetic geometry; graded correct.

Lvl 0–1

autonomy traded against significance

February 2026

FEB 24 2026 · FIRST PROOF FIRST BATCH

Aletheia solved 6 of 10 problems.

Majority expert assessment, not Lean certification.

6 / 10

informal proofs graded by experts

February-April 2026

FEB-APR 2026 · LEAN AND GAUSS

Viazovska’s 8-dimensional sphere packing proof reached formal verification.

AI helped fill Lean proof details.

E8

Fields Medal-level proof, machine checked

February-April 2026

FEB 28 2026 · CLAUDE’S CYCLES

Claude Opus 4.6 found a construction for odd m.

Knuth wrote the proof narrative.

31

explorations to a working odd-m construction

March 2026

MAR 2026 · FRONTIERMATH OPEN PROBLEMS

A FrontierMath open problem was solved by GPT-5.4 Pro.

A Ramsey-style hypergraph problem, confirmed by its contributor.

Ramsey

final-answer benchmark, not a proof

April-May 2026

APR 13-MAY 1 2026 · ERDŐS #1196

GPT-5.4 Pro seeded a Markov-chain method.

Primitive sets, von Mangoldt weights.

#1196

AI seed → reusable, Lean-verified method

May 2026

MAY 8 2026 · ADDITIVE COMBINATORICS

ChatGPT 5.5 Pro improved a Nathanson-type bound.

Quadratic for h=2; polynomial in the generalized setting.

17m

time to first construction in Gowers’s report

May 2026

MAY 20 2026 · ERDŐS #90

An OpenAI model disproved Erdős’s near-linear unit-distance conjecture.

Human-verified; the full problem stays open.

1.014+

explicit exponent in Sawin’s follow-up

Unit Distance Readout

Erdős #90: why mathematicians treated it as important

9

The companion paper was signed by Alon, Bloom, Gowers, Litt, Sawin, Shankar, Tsimerman, Wang, and Wood.

1.014

Sawin made the lower-bound exponent explicit: more than n^{1.014} unit-distance pairs for infinitely many n.

Gowers

He said that if a human had written the paper and submitted it to the Annals, he would have had no hesitation in recommending acceptance — the open question is whether this is a real conceptual leap or a route humans underexplored.

Litt

He called it “the first example of a result produced autonomously by an AI that I find interesting in itself,” as opposed to a leading indicator.

May-June 2026

MAY 21 2026 · FORMAL PROOF SEARCH

AlphaProof Nexus resolved 9 of 353 formalized Erdős problems.

And proved 44 of 492 OEIS conjectures.

2.5%

solved of formalized Erdős problems

June 2026

JUN 2026 · FIRST PROOF SECOND BATCH

Ten unpublished research problems.

Expert-refereed, with all logs released.

Results 7 / 10 clean · 2 minor fixes · 1 wrong.

One frontier model Only GPT-5.5 Pro really competed.

Does not test Picking problems · building frameworks · judging significance.

Reading the Erdős Testbed

02

Part Two · Interpretation

Reading the
Evidence

How Tao’s tracker and the First Proof benchmarks classify what AI did — the failure modes, the costs, and what the counts do not mean.

Erdős Tracker

Tao GitHub wiki

AI contributions are now categorized, not just counted.

PRIMARY ROLE

Who did the math?

AI standalone; AI alongside literature; AI building on literature; AI collaborating with humans.

SECONDARY ROLE

What helped?

Literature search, formalization, rewriting, computation, and proof checking all count differently.

STATUS

What survived?

Full solution, candidate solution, partial result, incorrect proof, and argument-with-major-gaps are separated.

The page explicitly says it is not a benchmark.

Source: Tao GitHub wiki

Erdős Takeaways

What the Erdős page actually supports

1

AI is strong on stated, searchable, and sometimes overlooked problems — but “open in the database” ≠ “resisted decades of effort.”

2

Literature identification is valuable, but must not be reported as discovery.

3

Lean helps when the statement is formalized correctly; it does not decide mathematical intent.

4

The strongest cases combine AI generation with human digestion, context, and attribution.

First Proof 2

Second batch: careful interpretation

10

Problems came from working mathematicians’ unpublished research.

4

Systems were compared on proof production, not on choosing problems.

39

AI-generated solutions were evaluated by expert referees, not accepted automatically.

0

Claims here should not be read as autonomous mathematical taste or field-building.

Failure Modes

Across all 2025–2026 systems

Four failure modes recur — and none are fixed by raw scale.

NOVELTY

Is it actually new?

A correct proof may be an unattributed re-expression of prior work — “subconscious plagiarism.” Formal verification does not catch it.

SPECIFICATION GAMING

The sorry exploit

Agents push the hard step into a helper lemma closed by sorry, or cite hallucinated “known” lemmas. Only sorry-free, correctly-stated proofs block this.

MISFORMALIZATION

Right proof, wrong question

Lean verifies the statement it is given. #728, #125, and #741 needed human disambiguation before a proof meant anything.

The fourth is the expert-verification bottleneck: 137 flawed papers were audited to extract 13 correct solutions.

The Compute Bill

First Proof second batch · per-problem compute

The same problems, an order of magnitude apart in cost.

CHATGPT 5.5 PRO

$117

Lowest overhead, fastest (Bubeck–Sawhney direct prompt) — but leans on a human loop to filter rapid algebraic hallucinations.

UCLA MOONSHOT

$4,800

Roughly 40× more expensive: correct but verbose, with heavy state-tracking and token-window depletion.

PROOFCOUNCIL

6 / 10

Strongest accuracy: IMProofBench’s multi-agent harness — still built on GPT-5.5 Pro, the one frontier model in play.

Gowers warns this cost gap could widen the global research divide.

Not Yet Shown

What has not been demonstrated — even now

ask

Choosing what is worth proving: First Proof explicitly does not score problem selection.

build

Constructing new frameworks, definitions, or fields — Clausen–Scholze-style rebuilding remains the domain of human insight.

frontier

Reliably judging the solvable frontier; commenters report models still have a “fuzzy” sense of what is in reach.

value

Judging significance in a way the community accepts; autonomous solves rarely rose above student-exercise level.

Open Disputes

What the community has not settled

solved

What counts as “solved”: a new proof, a variant, an obscure rediscovery, or database lag are now tracked apart.

credit

Authorship and publishing: arXiv bars AI-written content; some urge separate, human-moderated repositories.

training

Doctoral pathways: “gentle” PhD problems now fall in under two hours, raising the entry bar for newcomers.

access

Compute disparity: premium pipelines cost thousands per query, risking a widening global gap.

Source: Gowers blog

Voices · Terence Tao

03

Part Three · A practitioner’s view

Terence
Tao

Fields medallist, maintainer of the Erdős tracker, and the field’s most active hands-on user of AI and proof assistants.

Tao · Trajectory

FROM CAUTIOUS USER TO EVANGELIST

Tao now calls 2025 “the year AI started being useful.”

Lean, AlphaEvolve, AlphaProof, autoformalization.

2025

the year AI “really started being useful” — Tao, Quanta

Tao · In His Own Words

“We have not had, in the past, assistants that are competent enough to understand complex instructions and work at massive scale — but are also unreliable, unreliable in subtle ways, whilst providing sufficiently good output.”

Terence Tao · Lex Fridman Podcast #472 · 2025

Tao · Strengths and Limits

Where Tao sees the real value — and the real bottleneck

tail

In the Erdős database, AI is good at systematically exploring the obscure long tail and finding “cheap wins” — but he warns this “says more about speed than difficulty.”

scale

The deeper opportunity is scale: tedious computations, many routine cases, proof-assistant translation, and population-study-style mathematics across thousands of problems at once.

gap

His “impedance mismatch”: generation is now orders of magnitude faster, while human verification and digestion are not — a move from proof scarcity to proof surplus.

limits

He rejects push-button autonomy for hard problems. AI is roughly a “junior co-author” for grunt work; humans still supply taste, inspired guesses, and sustained judgment.

Voices · Timothy Gowers

04

Part Four · A combinatorialist’s view

Timothy
Gowers

Fields medallist, additive combinatorialist, and one of the nine signatories who human-verified the unit-distance disproof.

Gowers · The Experiment

MAY 2026 · CHATGPT 5.5 PRO ON NATHANSON

A piece of PhD-level research in an hour or so, with no serious mathematical input from me.”

A quadratic bound in 17 minutes; then an original idea.

1 hr

from open problem to written-up result

Gowers · In His Own Words

“It is no longer enough that somebody asks a problem: it needs to be hard enough for an LLM not to be able to solve it.”

W. T. Gowers · A recent experience with ChatGPT 5.5 Pro · 2026

Gowers · Caveats

Impressed, with sharp caveats about where this works

field

Combinatorics may be unusually favorable: many explicit, askable problems may have easy arguments humans simply have not tried.

rule

His rule of thumb: “If it is difficult for a human to distinguish a correct statement from a plausible-sounding incorrect one, an LLM will not be able to distinguish them either.”

record

Results should enter the record only if a human certifies them — or, better, only if a proof assistant has formalized them.

bar

The new mark of a contribution may become proving something that LLMs cannot — and those who have solved hard problems themselves will use AI best.

Where We Stand

05

Part Five · Synthesis

Where
We Stand

What the evidence supports today, what it does not, and the sources behind every claim in this briefing.

Current Frontier

WHAT THE EVIDENCE SUPPORTS

AI is now useful for stated mathematical tasks.

The strongest public cases are construction search, literature retrieval, Lean proof search, proof-detail completion, and human-verified solutions to explicitly posed problems.

Tao would add AI is a complement, interface, and scale tool; humans still supply taste, inspired guesses, and sustained judgment.

Gowers would add Some askable problems are now too easy for LLMs; correctness and mathematical value still need human or formal certification.

So the conclusion is partial Not “AI can do mathematics.” Rather: AI can produce publishable-looking or verified work in favorable, well-specified settings.

Sources · Results

Results, papers, and verification

Source: primary links listed on this slide.

Sources · Voices

Voices, interviews, and analysis

Source: interviews, blogs, and profiles listed on this slide.