Research-level mathematics
June 2026
AI and Math
State of the Art
From Olympiad gold to Erdős problems, First Proof, Lean, and human verification.
Concise factual briefing
Summer 2025 → June 2026
AI
Research-level mathematics
June 2026
From Olympiad gold to Erdős problems, First Proof, Lean, and human verification.
Concise factual briefing
Summer 2025 → June 2026
01
Contest reasoning crossed gold level
02
Erdős problems became the live testbed
03
AI disproved a famous Erdős conjecture
04
Formal proof search scaled in Lean
05
Failure modes and costs were documented
06
Tao and Gowers revised their priors
01
Part One · Timeline
The Year
in Events
July 2025 → June 2026: from Olympiad gold to a disproved Erdős conjecture, told as a sequence of dated, sourced milestones.
JUL 21 2025 · INTERNATIONAL MATHEMATICAL OLYMPIAD
Gemini Deep Think and OpenAI reached gold-medal level.
Natural-language proofs, five of six problems.
35 / 42
gold threshold performance
Source: Google DeepMind IMO 2025, Axios summary
OCT 2025 · THE CAUTIONARY CASE
GPT-5 “solved ten open problems” — by finding ten papers.
Literature search, not new mathematics.
10
existing papers found — zero new proofs
Source: erdosproblems.com (Bloom’s database), Tao — AI contributions to Erdős problems
NOV 3 2025 · ALPHAEVOLVE MATH PAPER
AlphaEvolve moved from coding agent to math explorer.
Construction search, not theorem proving.
67
problems across analysis, combinatorics, geometry, number theory
DEC 2025 · EQUATIONAL THEORIES PROJECT
Tao’s project resolved 22 million implications.
Between 4,694 magma equational laws, formalized in Lean.
4,694
magma laws, classical solvers not LLMs
JAN 2026 · ERDŐS #728 AND #1026
#728 became the first Lean-verified autonomous AI solve.
GPT-5.2 Pro reasoned; Harmonic’s Aristotle formalized.
#728
informal proof → machine-checked Lean
JAN 2026 · GEMINI ON THE ERDŐS DATABASE
Aletheia swept 700 “open” Erdős problems.
13 meaningful outcomes after human review.
4 + 9
novel-looking solutions plus literature finds
JAN 2026 · CÓRDOBA–CÓRDOBA–FONTELOS
A neural network found singularity candidates in fluid equations.
A physics-informed net, ~billion-fold accuracy gain.
10⁹×
accuracy gain, not a proof
FEB 2026 · AXIOMPROVER
AxiomProver produced a Lean proof of Fel’s conjecture.
From a LaTeX statement to machine-checked code.
Lean
translation, not strategy
FEB 2026 · TOWARDS AUTONOMOUS RESEARCH
Aletheia wrote a full paper with no human intervention.
Eigenweights in arithmetic geometry; graded correct.
Lvl 0–1
autonomy traded against significance
FEB 24 2026 · FIRST PROOF FIRST BATCH
Aletheia solved 6 of 10 problems.
Majority expert assessment, not Lean certification.
6 / 10
informal proofs graded by experts
FEB-APR 2026 · LEAN AND GAUSS
Viazovska’s 8-dimensional sphere packing proof reached formal verification.
AI helped fill Lean proof details.
E8
Fields Medal-level proof, machine checked
FEB 28 2026 · CLAUDE’S CYCLES
Claude Opus 4.6 found a construction for odd m.
Knuth wrote the proof narrative.
31
explorations to a working odd-m construction
Source: Knuth, Claude’s Cycles
MAR 2026 · FRONTIERMATH OPEN PROBLEMS
A FrontierMath open problem was solved by GPT-5.4 Pro.
A Ramsey-style hypergraph problem, confirmed by its contributor.
Ramsey
final-answer benchmark, not a proof
APR 13-MAY 1 2026 · ERDŐS #1196
GPT-5.4 Pro seeded a Markov-chain method.
Primitive sets, von Mangoldt weights.
#1196
AI seed → reusable, Lean-verified method
MAY 8 2026 · ADDITIVE COMBINATORICS
ChatGPT 5.5 Pro improved a Nathanson-type bound.
Quadratic for h=2; polynomial in the generalized setting.
17m
time to first construction in Gowers’s report
MAY 20 2026 · ERDŐS #90
An OpenAI model disproved Erdős’s near-linear unit-distance conjecture.
Human-verified; the full problem stays open.
1.014+
explicit exponent in Sawin’s follow-up
9
The companion paper was signed by Alon, Bloom, Gowers, Litt, Sawin, Shankar, Tsimerman, Wang, and Wood.
1.014
Sawin made the lower-bound exponent explicit: more than n^{1.014} unit-distance pairs for infinitely many n.
Gowers
He said that if a human had written the paper and submitted it to the Annals, he would have had no hesitation in recommending acceptance — the open question is whether this is a real conceptual leap or a route humans underexplored.
Litt
He called it “the first example of a result produced autonomously by an AI that I find interesting in itself,” as opposed to a leading indicator.
MAY 21 2026 · FORMAL PROOF SEARCH
AlphaProof Nexus resolved 9 of 353 formalized Erdős problems.
And proved 44 of 492 OEIS conjectures.
2.5%
solved of formalized Erdős problems
JUN 2026 · FIRST PROOF SECOND BATCH
Ten unpublished research problems.
Expert-refereed, with all logs released.
Results 7 / 10 clean · 2 minor fixes · 1 wrong.
One frontier model Only GPT-5.5 Pro really competed.
Does not test Picking problems · building frameworks · judging significance.
Source: First Proof Second Batch, Harvard FAS
02
Part Two · Interpretation
Reading the
Evidence
How Tao’s tracker and the First Proof benchmarks classify what AI did — the failure modes, the costs, and what the counts do not mean.
Tao GitHub wiki
PRIMARY ROLE
Who did the math?
AI standalone; AI alongside literature; AI building on literature; AI collaborating with humans.
SECONDARY ROLE
What helped?
Literature search, formalization, rewriting, computation, and proof checking all count differently.
STATUS
What survived?
Full solution, candidate solution, partial result, incorrect proof, and argument-with-major-gaps are separated.
The page explicitly says it is not a benchmark.
Source: Tao GitHub wiki
1
AI is strong on stated, searchable, and sometimes overlooked problems — but “open in the database” ≠ “resisted decades of effort.”
2
Literature identification is valuable, but must not be reported as discovery.
3
Lean helps when the statement is formalized correctly; it does not decide mathematical intent.
4
The strongest cases combine AI generation with human digestion, context, and attribution.
Source: Tao GitHub wiki disclaimers
10
Problems came from working mathematicians’ unpublished research.
4
Systems were compared on proof production, not on choosing problems.
39
AI-generated solutions were evaluated by expert referees, not accepted automatically.
0
Claims here should not be read as autonomous mathematical taste or field-building.
Source: First Proof Second Batch
Across all 2025–2026 systems
NOVELTY
Is it actually new?
A correct proof may be an unattributed re-expression of prior work — “subconscious plagiarism.” Formal verification does not catch it.
SPECIFICATION GAMING
The sorry exploit
Agents push the hard step into a helper lemma closed by sorry, or cite hallucinated “known” lemmas. Only sorry-free, correctly-stated proofs block this.
MISFORMALIZATION
Right proof, wrong question
Lean verifies the statement it is given. #728, #125, and #741 needed human disambiguation before a proof meant anything.
The fourth is the expert-verification bottleneck: 137 flawed papers were audited to extract 13 correct solutions.
First Proof second batch · per-problem compute
CHATGPT 5.5 PRO
$117
Lowest overhead, fastest (Bubeck–Sawhney direct prompt) — but leans on a human loop to filter rapid algebraic hallucinations.
UCLA MOONSHOT
$4,800
Roughly 40× more expensive: correct but verbose, with heavy state-tracking and token-window depletion.
PROOFCOUNCIL
6 / 10
Strongest accuracy: IMProofBench’s multi-agent harness — still built on GPT-5.5 Pro, the one frontier model in play.
Gowers warns this cost gap could widen the global research divide.
ask
Choosing what is worth proving: First Proof explicitly does not score problem selection.
build
Constructing new frameworks, definitions, or fields — Clausen–Scholze-style rebuilding remains the domain of human insight.
frontier
Reliably judging the solvable frontier; commenters report models still have a “fuzzy” sense of what is in reach.
value
Judging significance in a way the community accepts; autonomous solves rarely rose above student-exercise level.
solved
What counts as “solved”: a new proof, a variant, an obscure rediscovery, or database lag are now tracked apart.
credit
Authorship and publishing: arXiv bars AI-written content; some urge separate, human-moderated repositories.
training
Doctoral pathways: “gentle” PhD problems now fall in under two hours, raising the entry bar for newcomers.
access
Compute disparity: premium pipelines cost thousands per query, risking a widening global gap.
Source: Gowers blog
03
Part Three · A practitioner’s view
Terence
Tao
Fields medallist, maintainer of the Erdős tracker, and the field’s most active hands-on user of AI and proof assistants.
FROM CAUTIOUS USER TO EVANGELIST
Tao now calls 2025 “the year AI started being useful.”
Lean, AlphaEvolve, AlphaProof, autoformalization.
2025
the year AI “really started being useful” — Tao, Quanta
“
“We have not had, in the past, assistants that are competent enough to understand complex instructions and work at massive scale — but are also unreliable, unreliable in subtle ways, whilst providing sufficiently good output.”
Terence Tao · Lex Fridman Podcast #472 · 2025
tail
In the Erdős database, AI is good at systematically exploring the obscure long tail and finding “cheap wins” — but he warns this “says more about speed than difficulty.”
scale
The deeper opportunity is scale: tedious computations, many routine cases, proof-assistant translation, and population-study-style mathematics across thousands of problems at once.
gap
His “impedance mismatch”: generation is now orders of magnitude faster, while human verification and digestion are not — a move from proof scarcity to proof surplus.
limits
He rejects push-button autonomy for hard problems. AI is roughly a “junior co-author” for grunt work; humans still supply taste, inspired guesses, and sustained judgment.
04
Part Four · A combinatorialist’s view
Timothy
Gowers
Fields medallist, additive combinatorialist, and one of the nine signatories who human-verified the unit-distance disproof.
MAY 2026 · CHATGPT 5.5 PRO ON NATHANSON
“A piece of PhD-level research in an hour or so, with no serious mathematical input from me.”
A quadratic bound in 17 minutes; then an original idea.
1 hr
from open problem to written-up result
“
“It is no longer enough that somebody asks a problem: it needs to be hard enough for an LLM not to be able to solve it.”
W. T. Gowers · A recent experience with ChatGPT 5.5 Pro · 2026
field
Combinatorics may be unusually favorable: many explicit, askable problems may have easy arguments humans simply have not tried.
rule
His rule of thumb: “If it is difficult for a human to distinguish a correct statement from a plausible-sounding incorrect one, an LLM will not be able to distinguish them either.”
record
Results should enter the record only if a human certifies them — or, better, only if a proof assistant has formalized them.
bar
The new mark of a contribution may become proving something that LLMs cannot — and those who have solved hard problems themselves will use AI best.
05
Part Five · Synthesis
Where
We Stand
What the evidence supports today, what it does not, and the sources behind every claim in this briefing.
WHAT THE EVIDENCE SUPPORTS
AI is now useful for stated mathematical tasks.
The strongest public cases are construction search, literature retrieval, Lean proof search, proof-detail completion, and human-verified solutions to explicitly posed problems.
Tao would add AI is a complement, interface, and scale tool; humans still supply taste, inspired guesses, and sustained judgment.
Gowers would add Some askable problems are now too easy for LLMs; correctness and mathematical value still need human or formal certification.
So the conclusion is partial Not “AI can do mathematics.” Rather: AI can produce publishable-looking or verified work in favorable, well-specified settings.
Source: Tao GitHub wiki, Gowers, Tao and Klowden
Events & milestones
Papers & verification
Source: primary links listed on this slide.
Terence Tao
Timothy Gowers
Press & analysis
Critique & community
Source: interviews, blogs, and profiles listed on this slide.