Title

Research-level mathematics

June 2026

AI and Math
State of the Art

From Olympiad gold to Erdős problems, First Proof, Lean, and human verification.

Concise factual briefing

Summer 2025 → June 2026

Map

What changed in one year?

Contest reasoning crossed gold level

Erdős problems became the live testbed

AI disproved a famous Erdős conjecture

Formal proof search scaled in Lean

Failure modes and costs were documented

Tao and Gowers revised their priors

Source: Tao GitHub wiki, OpenAI unit-distance disproof, First Proof Second Batch

The Year in Events

Part One · Timeline

The Year
in Events

July 2025 → June 2026: from Olympiad gold to a disproved Erdős conjecture, told as a sequence of dated, sourced milestones.

July 2025

JUL 21 2025 · INTERNATIONAL MATHEMATICAL OLYMPIAD

Gemini Deep Think and an OpenAI model reached gold-medal level.

Natural-language proofs, five of six problems.

35 / 42

gold threshold performance

In July 2025 both Google DeepMind and OpenAI reported gold-medal-level performance at the International Mathematical Olympiad, producing natural-language proofs and solving five of the six problems. With each problem worth 7 points, that is 35/42, which met the gold threshold.

There is a verification asymmetry worth flagging: DeepMind’s run with an advanced Gemini Deep Think was officially graded by the IMO coordinators, whereas OpenAI reported a gold-level score from its own internal evaluation, which the official coordinators did not grade.

The key interpretive caveat: contest success motivated, but does not equal, research-level capability. Olympiad problems are curated, bounded, and have known answer structures; research problems require interpretation, judgment of relevance, and relation to existing literature. Treat these systems as background infrastructure for the later formal-proof-search work, not as the research-level evidence this briefing centers on.

Source: Google DeepMind IMO 2025, Axios summary

October 2025

OCT 2025 · THE CAUTIONARY CASE

GPT-5 “solved ten open problems” — by finding ten papers.

Literature search, not new mathematics.

10

existing papers found, zero new proofs

The sequence: in October 2025 OpenAI’s Kevin Weil posted that GPT-5 had “found solutions to 10 (!) previously unsolved Erdős problems and made progress on 11 others,” and Sébastien Bubeck amplified similar claims.

Thomas Bloom, who maintains erdosproblems.com, called it “a dramatic misrepresentation.” The subtlety is what open means in his database: only that he personally had not seen a paper solving the problem — not that it had resisted the field for decades. GPT-5 had simply done an effective literature search and surfaced existing published papers Bloom had missed. Bubeck conceded that “only solutions in the literature were found,” Weil deleted his post, and Demis Hassabis called the episode “embarrassing.”

This is the cleanest cautionary tale in the whole subject: “AI found a solution” can quietly mean “AI found a paper.” It directly motivated the careful verification protocols — Lean formalization, human-verified companion papers — used in the genuine 2026 results that follow. Note the model here was plain GPT-5; the legitimate later Erdős solves used more advanced models (GPT-5.2 Pro, GPT-5.4 Pro).

Source: erdosproblems.com (Bloom’s database), Tao — AI contributions to Erdős problems

November 2025

NOV 3 2025 · ALPHAEVOLVE MATH PAPER

AlphaEvolve moved from coding agent to math explorer.

Construction search, not theorem proving.

67

problems across analysis, combinatorics, geometry, number theory

What AlphaEvolve actually is: a Gemini-powered evolutionary coding agent. It writes and iteratively mutates Python programs that search for mathematical constructions — explicit objects like point sets, sequences, packings, matrices — and scores each candidate with a cheap automated evaluator. The loop keeps the high-scoring programs and mutates them further. So the core activity is searching the space of constructions to optimize a numerical objective.
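The keep-and-mutate loop described above can be sketched in a few lines. This is a toy under loud assumptions: it mutates candidate objects directly rather than Python programs, it uses no language model, and the objective (a large subset of {0, ..., n-1} with no three-term arithmetic progression) is chosen here purely for illustration and is not taken from the paper. The function names are hypothetical.

```python
import random
from itertools import combinations

def has_3ap(s):
    """True if the set contains a, b, c with a < b < c and a + c = 2b."""
    elems = set(s)
    return any((a + c) % 2 == 0 and (a + c) // 2 in elems
               for a, c in combinations(sorted(elems), 2))

def score(s):
    """Cheap automated evaluator: set size, or -1 if the constraint is violated."""
    return -1 if has_3ap(s) else len(s)

def mutate(s, n, rng):
    """Toggle membership of one random element of {0, ..., n-1}."""
    child = set(s)
    child ^= {rng.randrange(n)}
    return frozenset(child)

def evolve(n=20, population=20, generations=200, seed=0):
    rng = random.Random(seed)
    pool = [frozenset()]  # start from the empty (valid) construction
    for _ in range(generations):
        children = [mutate(rng.choice(pool), n, rng) for _ in range(population)]
        # Keep the highest-scoring candidates and mutate them further next round.
        pool = sorted(set(pool) | set(children), key=score, reverse=True)[:population]
    return max(pool, key=score)

best = evolve()
print(sorted(best), score(best))
```

The design point carries over from the toy: everything the search "knows" about mathematics lives in `score`, so a weak evaluator is exactly where exploits would appear.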

A better construction often translates directly into a better bound: a denser packing raises a lower bound, a smaller configuration lowers an upper bound. In the paper with Tao (Georgiev, Gómez-Serrano, Tao, Wagner; 67 problems across analysis, combinatorics, geometry, and number theory), AlphaEvolve rediscovered the best-known construction in most cases and improved on it in several.

Crucial caveat: this is construction search, not theorem proving. It produces objects, not proofs. And because it optimizes against an automated evaluator, it is “extremely good at locating exploits” in a weak verifier — specification gaming. It only counts as mathematics when the objective is sound and the construction is then checked or proved by a human or proof system. The 4×4 matrix-multiplication headline (a 48-multiplication algorithm) should be stated carefully — it is a construction in a specific algebraic setting, not an unqualified first-ever improvement over Strassen.

And “coding agent → math explorer” is really about a new application: it is still fundamentally a coding agent; what is new is pointing it at mathematical construction search.

Source: Mathematical exploration and discovery at scale, Tao’s blog

December 2025

DEC 2025 · EQUATIONAL THEORIES PROJECT

Tao’s project resolved 22 million implications.

Between 4,694 magma equational laws, formalized in Lean.

4,694

magma laws, classical solvers not LLMs

Tao’s Equational Theories Project worked with 4,694 simple equational laws of magmas. A magma is just a set with one binary operation and no other axioms; a law is an identity such as commutativity (x · y = y · x) or idempotence (x · x = x). For every ordered pair of laws it asked: does a magma satisfying law A automatically satisfy law B? That is about 4,694² ≈ 22 million implication questions, each settled either by a proof or by exhibiting a counterexample magma. The full set of answers is the “implication poset,” the partial order of which laws force which. The collaboration resolved the implications informally in roughly two months and fully formalized the result in Lean in about five.
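To make "exhibiting a counterexample magma" concrete, here is a minimal brute-force sketch over two-element magmas. It is not the project's tooling (which relied on dedicated provers and model finders); the function names are hypothetical, and associativity is used only as a familiar third law for illustration.

```python
from itertools import product

def magmas(n):
    """Yield every binary operation table on {0, ..., n-1} as a dict (x, y) -> x*y."""
    pairs = list(product(range(n), repeat=2))
    for values in product(range(n), repeat=len(pairs)):
        yield dict(zip(pairs, values))

def commutative(op, n):
    return all(op[x, y] == op[y, x] for x, y in product(range(n), repeat=2))

def idempotent(op, n):
    return all(op[x, x] == x for x in range(n))

def associative(op, n):
    return all(op[op[x, y], z] == op[x, op[y, z]]
               for x, y, z in product(range(n), repeat=3))

def counterexample(law_a, law_b, n):
    """Return a magma of size n satisfying law_a but not law_b, or None."""
    for op in magmas(n):
        if law_a(op, n) and not law_b(op, n):
            return op
    return None

print(4694 ** 2)  # ordered pairs of laws: 22033636
print(counterexample(commutative, associative, 2))
print(counterexample(commutative, idempotent, 2))
```

A returned table refutes the implication outright. A `None` result only says no counterexample exists at that size; establishing the implication still needs a proof.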

The striking lesson is about tool choice. Generative LLMs played almost no role in the core mathematics; the heavy lifting fell to “good old-fashioned” automated theorem provers and symbolic solvers — Vampire, Prover9, Mace4 — which were cheaper, faster, and more reliable than LLMs for this systematic, large-scale task.

It is a useful counterweight to the assumption that “AI in math” always means large language models. For the right problem — exhaustive, well-specified, machine-checkable — classical symbolic methods still win, and a large human crowd-sourced collaboration plus Lean did the rest.

Source: Tao, The Equational Theories Project

January 2026

JAN 2026 · ERDŐS #728 AND #1026

#728 became the first Lean-verified autonomous AI solve.

GPT-5.2 Pro reasoned; Harmonic’s Aristotle formalized.

#728

informal proof → machine-checked Lean

Erdős #728 (a factorial-divisibility problem) is widely described as the first Lean-verified, more-or-less autonomous AI solve of an Erdős problem: GPT-5.2 Pro produced the reasoning and Harmonic’s Aristotle formalized it into machine-checked Lean (arXiv:2512.01827).

The important caveat is misformalization. The original problem wording was vague, and a revised statement was formalized only after forum discussion. “Lean proved it” and “the intended problem is settled” are not the same claim — Lean only certifies the formal statement it is handed, so pinning down the right statement is human work that matters as much as the proof itself. Tao also stressed that the win “says more about speed than difficulty.”
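A small Lean 4 sketch shows how a formal statement can be accepted while saying nothing about the intended problem. Both examples are hypothetical illustrations written for this briefing and are not taken from the #728 formalization.

```lean
-- A slip in the hypothesis: `n < 0` is never true for natural numbers,
-- so the statement is vacuous and Lean accepts a "proof" of a false-looking claim.
theorem vacuous_claim (n : Nat) (h : n < 0) : n * n = 7 :=
  absurd h (Nat.not_lt_zero n)

-- A slip in the types: subtraction on Nat truncates at zero,
-- so this holds even though 2 - 3 = -1 over the integers.
example : (2 : Nat) - 3 = 0 := rfl
```

In both cases the kernel is doing its job correctly. The gap is between the formal statement and the mathematics a human meant to ask about.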

Erdős #1026 followed days later, but as a hybrid of literature search, online collaboration, and AI tools rather than a clean autonomous solve — a reminder that headline “AI solved it” entries often have mixed provenance.

Source: Resolution of Erdős #728 (Aristotle / Lean), Tao on #1026

January 2026

JAN 2026 · GEMINI ON THE ERDŐS DATABASE

Aletheia swept 700 “open” Erdős problems.

13 meaningful outcomes after human review.

4 + 9

novel-looking solutions plus literature finds

DeepMind’s Aletheia (built on Gemini Deep Think) was run over the ~700 conjectures marked open in Bloom’s Erdős database (arXiv:2601.22401). The workflow was hybrid: an AI natural-language verifier narrowed a large pool of candidate solutions, then human experts evaluated correctness and novelty. The revised paper reports 13 meaningful outcomes — four seemingly novel autonomous solutions plus nine literature identifications.

Several caveats matter. The novel-solution count fell from an initial nine to four after expert review; none of the four rose to the level of a research paper; and the paper itself flags the risk of “subconscious plagiarism” — AI reproducing existing work without knowing or citing it.

The team’s dominant reported failure mode was specification gaming: a large fraction of the 700 responses were wrong, and a notable subset were “mathematically empty,” reinterpreting hard questions into trivially answerable ones (e.g., confusing additive vs. Dirichlet convolution) because the system was not told Bloom’s definitional conventions. This is emphatically not a calibrated success rate over the Erdős database.
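The additive versus Dirichlet confusion is easy to see numerically. The sketch below is an illustration only, with hypothetical function names and an assumed indexing convention (additive sums run over i from 0 to n); it is not drawn from the Aletheia paper.

```python
def additive_convolution(f, g, n):
    """Sum of f(i) * g(j) over ordered pairs with i + j = n and i, j >= 0."""
    return sum(f(i) * g(n - i) for i in range(n + 1))

def dirichlet_convolution(f, g, n):
    """Sum of f(d) * g(n / d) over the positive divisors d of n."""
    return sum(f(d) * g(n // d) for d in range(1, n + 1) if n % d == 0)

def one(k):
    return 1

print(additive_convolution(one, one, 6))   # 7: the pairs (0,6), (1,5), ..., (6,0)
print(dirichlet_convolution(one, one, 6))  # 4: the divisors 1, 2, 3, 6
```

The same symbol applied to the same functions gives 7 under one convention and 4 under the other, which is how a hard question can silently become a different, easier one.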

Source: Semi-Autonomous Mathematics Discovery with Gemini

January 2026

JAN 2026 · CÓRDOBA–CÓRDOBA–FONTELOS

A neural network found singularity candidates in fluid equations.

A physics-informed net, ~billion-fold accuracy gain.

10⁹×

accuracy gain, not a proof

A physics-informed neural network was used to find candidate singularity (blow-up) solutions in fluid equations — the Córdoba–Córdoba–Fontelos setting — achieving roughly a billion-fold improvement in numerical accuracy over previous approaches. The work was evaluated by Javier Gómez-Serrano and collaborators.

The crucial framing: none of these candidates are proven singularities. They are extremely precise numerical approximations that can guide and constrain a future rigorous proof, not substitutes for one. Diego Córdoba cautions that results obtained on 1D model equations may fail entirely when carried to the genuinely hard, boundary-free 3D Euler equations.

This is AI as a high-precision numerical-exploration instrument — sharpening conjectures and pointing to where a singularity might live — rather than AI proving a theorem. The mathematical work of turning a numerical candidate into a verified blow-up remains open and human.

Source: Quanta: Using AI, mathematicians find hidden glitches in fluid equations

February 2026

FEB 2026 · AXIOMPROVER

AxiomProver produced a Lean proof of Fel’s conjecture.

From a LaTeX statement to machine-checked code.

Lean

translation, not strategy

AxiomProver produced a Lean/Mathlib formal proof of Fel’s conjecture — a formula for normalized alternating syzygy power sums of numerical semigroups — automatically from a natural-language statement (arXiv:2602.03716, with Evan Chen as submitter and some twenty coauthors).

The interpretive dispute is about credit. The headline frames it as AI proving a conjecture, but critics counter that humans designed the core mathematical strategy — the generating-function approach and the connection to Ramanujan-type identities — and the AI’s contribution was largely the LaTeX-to-Lean translation and formalization.

So this is best read as automated formal proof generation/translation inside a proof-assistant environment, not autonomous selection of a research direction or independent mathematical insight. (One circulating report wrongly attached Dawei Chen to this paper; he is not among the authors.)

Source: Fel’s Conjecture on Syzygies of Numerical Semigroups

February 2026

FEB 2026 · TOWARDS AUTONOMOUS RESEARCH

Aletheia wrote a full paper with no human intervention.

Eigenweights in arithmetic geometry; graded correct.

Lvl 0–1

autonomy traded against significance

DeepMind reported that Aletheia generated a full research paper — on eigenweights in arithmetic geometry — with no human intervention (arXiv:2602.10177, “Towards Autonomous Mathematics Research”). Human experts judged the paper’s calculations correct.

But the headline of full autonomy comes with a sharp trade-off documented in independent reviews: an inverse correlation between autonomy and significance. Fully autonomous output was graded at Level 0–1 — trivial or negligible novelty — while reaching Level 2+ (genuinely interesting) work required continuous human prompting and direction.

In other words, the more the AI was left alone, the less the result mattered. This is the cleanest single illustration of the field’s current ceiling: autonomy and importance are, for now, in tension. “Wrote a paper unaided” and “wrote a paper worth reading” are different achievements.

Source: Towards Autonomous Mathematics Research

February 2026

FEB 24 2026 · FIRST PROOF FIRST BATCH

Aletheia solved 6 of 10 problems.

Majority expert assessment, not Lean certification.

6 / 10

informal proofs graded by experts

In the First Proof first batch, Aletheia (powered by Gemini 3 Deep Think) solved 6 of 10 problems — specifically problems 2, 5, 7, 8, 9, and 10 — under majority expert assessment and within the challenge timeframe (arXiv:2602.21201).

Two caveats. First, these were expert-assessed informal proofs, not Lean-certified ones, and the experts were not unanimous on Problem 8 — so it is a majority-assessment result, not a mechanically certified one. Second, on benchmark design: First Proof drew ten short questions from working mathematicians’ own research and released the solutions in encrypted form, decrypted only later, specifically to guard against training-data contamination. Raw prompts and outputs were made available for scrutiny.

This is the first of two First Proof batches in the deck; the June 2026 second batch (later slide) used unpublished research problems and multiple systems, and is the more demanding test.

Source: Aletheia tackles FirstProof autonomously

February-April 2026

FEB-APR 2026 · LEAN AND GAUSS

Viazovska’s 8-dimensional sphere packing proof reached formal verification.

AI helped fill Lean proof details.

Fields Medal-level proof, machine checked

The formalization of Maryna Viazovska’s celebrated proof that the E8 lattice gives the optimal sphere packing in dimension 8 — a Fields-Medal-level result — was brought to completion in Lean with AI assistance (arXiv:2604.23468). Gauss, Math Inc.’s autoformalization model, handled the labor-intensive work of filling in Lean proof details in the final verification stages; the human team included Viazovska herself along with Hariharan, Birkbeck, Lee, Ma, Mehta, and Poiroux.

The significance here is not AI discovering new mathematics but AI compressing the enormous, tedious effort of formal verification — historically a multi-year human undertaking for a proof of this depth. This is the “proof-detail completion” and autoformalization role: the human mathematics already existed; the AI made machine-checking it tractable.

It is a concrete instance of Tao’s theme that autoformalization “crossed a critical threshold,” turning verification tasks that took volunteers weeks into ones that take hours.

Source: A Milestone in Formalization: The Sphere Packing Problem in Dimension 8

February 2026

FEB 28 2026 · CLAUDE’S CYCLES

Claude Opus 4.6 found a construction for odd m.

Knuth wrote the proof narrative.

31

explorations to a working odd-m construction

Donald Knuth documented a hands-on collaboration with Claude (Opus 4.6) on a graph-theory construction problem: a directed graph on m^3 vertices with three outgoing arcs per vertex, where the target was to decompose the arcs into three directed Hamiltonian cycles. After thirty-one separate explorations, Claude found a working construction for odd m, and Knuth wrote up the proof narrative.
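Checking a proposed decomposition is mechanical even when finding one is not. The sketch below assumes the arcs are "increment one of the three coordinates modulo m", which is how the digraph is recalled here rather than something stated above, and it does not reproduce Claude's construction. `choice` is a hypothetical function assigning, at each vertex, one outgoing direction to each of the three cycles.

```python
from itertools import product

def step(v, direction, m):
    """Follow the arc that increments coordinate `direction` modulo m."""
    w = list(v)
    w[direction] = (w[direction] + 1) % m
    return tuple(w)

def is_single_cycle(successor, vertices):
    """True if following `successor` visits every vertex once and returns to the start."""
    start = vertices[0]
    seen, v = set(), start
    while v not in seen:
        seen.add(v)
        v = successor[v]
    return v == start and len(seen) == len(vertices)

def is_hamiltonian_decomposition(choice, m):
    """Check that choice(v) splits the three arcs at each vertex into three Hamiltonian cycles."""
    vertices = list(product(range(m), repeat=3))
    # Each vertex must hand exactly one of its three arcs to each cycle.
    if any(sorted(choice(v)) != [0, 1, 2] for v in vertices):
        return False
    for cycle in range(3):
        successor = {v: step(v, choice(v)[cycle], m) for v in vertices}
        if not is_single_cycle(successor, vertices):
            return False
    return True

def naive(v):
    """Cycle c always increments coordinate c."""
    return (0, 1, 2)

print(is_hamiltonian_decomposition(naive, 3))  # False: each "cycle" has length 3, not 27
```

The naive assignment fails immediately, which gives a sense of why a working rule for all odd m took many explorations to find.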

This is a characteristic “AI generation + human digestion” case: the model supplied an explicit construction through guided search, while a human mathematician supplied the framing, the write-up, and the verification.

Knuth’s careful documentation — including the many attempts before success, not just the final answer — is itself valuable: an honest, fine-grained record of what working with these models is actually like, rather than a polished success-only headline.

Source: Knuth, Claude’s Cycles

March 2026

MAR 2026 · FRONTIERMATH OPEN PROBLEMS

A FrontierMath open problem was solved by GPT-5.4 Pro.

A Ramsey-style hypergraph problem, confirmed by its contributor.

Ramsey

final-answer benchmark, not a proof

A FrontierMath open problem — a Ramsey-style hypergraph problem — was solved by GPT-5.4 Pro and confirmed by the problem’s own contributor (Epoch AI). This marked FrontierMath shifting from purely a benchmark into a venue hosting documented resolutions of open problems.

The central caveat is what FrontierMath actually tests: automatically verifiable final answers, not proofs. A correct final number is machine-checkable, but it does not certify a valid line of reasoning. In one telling case a “small Diophantine” problem was “solved” merely by direct substitution that the contributors themselves called “uninteresting” — the right answer for an unenlightening reason.

So a FrontierMath solve is strong evidence of answer-finding capability and weaker evidence of genuine proof or insight. Background worth knowing: FrontierMath was also at the center of an earlier contamination controversy over whether the promised airtight holdout verification of OpenAI’s o3 score was ever actually published.

Source: Epoch AI · FrontierMath open problems

April-May 2026

APR 13-MAY 1 2026 · ERDŐS #1196

GPT-5.4 Pro seeded a Markov-chain method.

Primitive sets, von Mangoldt weights.

#1196

AI seed → reusable, Lean-verified method

This is arguably the strongest case of AI output seeding a genuine, reusable human method. Liam Price — a 23-year-old amateur with no advanced training — fed Erdős #1196 (a 1966 Erdős–Sárközy–Szemerédi conjecture on primitive sets) to GPT-5.4 Pro, which after about 80 minutes produced a proof of the asymptotic bound. The key idea was a Markov chain with von Mangoldt weights, replacing the traditional Mertens chain.

Tao and collaborators (Alexeev, Barreto, Li, Lichtman, Price, Shah, Tang) then digested and generalized the AI output into a reusable method that resolves #1196 and #1217, gives a short new proof of the Erdős Primitive Set Conjecture (#164), and settles a revised Banks–Martin conjecture (arXiv:2605.00301). Math Inc.’s Gauss compiled it to a ~7,200-line Lean proof, later trimmed to ~4,000.

Hold two framings together. Jared Lichtman — the prior record-holder, seven years on the problem — reportedly called it the first AI result “at the level of Erdős’s Book.” Yet the method “seems to have been overlooked since Erdős’s 1935 paper” rather than being a deep conceptual leap, and the raw AI output needed substantial human reorganization to become clean theory. Best classified as AI-seeded method discovery plus human mathematical development.
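One reason von Mangoldt weights are natural for a chain on the integers is the standard identity that Λ(d), summed over the divisors d of n, equals log n, so Λ(d) / log n is a probability distribution over divisors. The sketch below checks that identity numerically. Whether the paper's chain uses exactly these transition probabilities is an assumption here; this is not a reconstruction of the proof, and the function names are hypothetical.

```python
import math

def von_mangoldt(n):
    """Λ(n) = log p if n is a power p^k of a prime p with k >= 1, else 0."""
    if n < 2:
        return 0.0
    p = next(d for d in range(2, n + 1) if n % d == 0)  # smallest prime factor
    while n % p == 0:
        n //= p
    return math.log(p) if n == 1 else 0.0

def divisor_weights(n):
    """Weights Λ(d) / log n over the prime-power divisors d of n (requires n >= 2)."""
    return {d: von_mangoldt(d) / math.log(n)
            for d in range(2, n + 1)
            if n % d == 0 and von_mangoldt(d) > 0}

weights = divisor_weights(12)
print(weights)                # supported on the prime-power divisors 2, 3, 4
print(sum(weights.values()))  # approximately 1, since 2 log 2 + log 3 = log 12
```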

Source: Primitive sets and von Mangoldt chains, Tao’s blog

May 2026

MAY 8 2026 · ADDITIVE COMBINATORICS

ChatGPT 5.5 Pro improved a Nathanson-type bound.

Quadratic for h=2; polynomial in the generalized setting.

17m

time to first construction in Gowers’s report

In a May 8, 2026 blog post, Timothy Gowers reported that ChatGPT 5.5 Pro improved a Nathanson-type additive-combinatorics bound. For the two-fold (h=2) sumset problem it produced, in about 17 minutes, a construction giving a quadratic upper bound — improving an exponential one — which Gowers judged best possible after checking. On the generalized h-fold setting it then introduced an idea Isaac Rajagopal assessed as original and “quite impressive.”

The caveats are as important as the result. The first improvement was a fairly routine modification of Rajagopal’s existing framework; only the later polynomial improvement involved a genuinely original idea. This was human-assessed informal mathematics, not Lean formalization. Practically, the model needed close human direction and consumed tokens at a rate Gowers called unsustainable — a point he ties to cost and the global research divide.

His broader thesis: combinatorics may be unusually favorable to current LLMs because it is full of explicitly stated problems whose arguments humans simply have not tried. The deck returns to Gowers in Part Four with his own words and full caveats.

Source: Gowers, A recent experience with ChatGPT 5.5 Pro

May 2026

MAY 20 2026 · ERDŐS #90

An OpenAI model disproved Erdős’s near-linear unit-distance conjecture.

Human-verified; the full problem stays open.

1.014+

explicit exponent in Sawin’s follow-up

On May 20, 2026, OpenAI announced that an internal general-purpose reasoning model produced a counterexample to Erdős’s 1946 planar unit-distance conjecture — the expectation that the maximum number of unit distances among n points is n^{1+o(1)} (near-linear). The construction draws on algebraic number theory: high-degree number fields with small discriminant.

Verification was human, not (initially) formal. A “short, digested, human-verified version” appeared as a companion paper signed by nine mathematicians (arXiv:2605.20695), and Will Sawin’s follow-up (arXiv:2605.20579) made the lower-bound exponent explicit — more than n^{1.014} unit-distance pairs for arbitrarily large n.

Keep the scope precise: this disproves the near-linear conjecture but does not solve the full unit-distance problem; a wide gap to the O(n^{4/3}) Spencer–Szemerédi–Trotter upper bound remains. No external reviewer saw the model’s raw output — verification rests on a digested argument and an edited chain-of-thought. The next slide covers why mathematicians treated it as a landmark. (Anthropic separately reported a system independently reproducing a disproof days later.)
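The remaining gap can be written out explicitly. The notation u(n) for the maximum number of unit distances among n points in the plane is introduced here for convenience, and the first line is Erdős's classical lattice lower bound, included as background with an unspecified constant c > 0.

```latex
% u(n): maximum number of unit distances determined by n points in the plane.
\begin{align*}
u(n) &\ge n^{1 + c/\log\log n}
  && \text{(Erdős, 1946 lower bound)} \\
u(n) &\le n^{1 + o(1)}
  && \text{(the conjecture, now disproved)} \\
u(n) &> n^{1.014} \ \text{for arbitrarily large } n
  && \text{(Sawin's explicit exponent)} \\
u(n) &= O\!\left(n^{4/3}\right)
  && \text{(Spencer–Szemerédi–Trotter upper bound)}
\end{align*}
```

The true growth exponent is therefore only known to lie somewhere between 1.014 and 4/3 ≈ 1.333.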

Source: OpenAI announcement, companion paper, Sawin follow-up

Unit Distance Readout

Erdős #90: why mathematicians treated it as important

The companion paper was signed by Alon, Bloom, Gowers, Litt, Sawin, Shankar, Tsimerman, Wang, and Wood.

1.014

Sawin made the lower-bound exponent explicit: more than n^{1.014} unit-distance pairs for infinitely many n.

Gowers

He said that if a human had written the paper and submitted it to the Annals, he would have had no hesitation in recommending acceptance — the open question is whether this is a real conceptual leap or a route humans underexplored.

Litt

He called it “the first example of a result produced autonomously by an AI that I find interesting in itself,” as opposed to a leading indicator.

Source: Remarks on the disproof of the unit distance conjecture, Sawin follow-up

May-June 2026

MAY 21 2026 · FORMAL PROOF SEARCH

AlphaProof Nexus resolved 9 of 353 formalized Erdős problems.

And proved 44 of 492 OEIS conjectures.

2.5%

solved of formalized Erdős problems

DeepMind’s AlphaProof Nexus is a formal proof-search system that combines Gemini with Lean and AlphaProof. It autonomously resolved 9 of 353 formalized open Erdős problems (about 2.5%) and proved 44 of 492 OEIS conjectures (about 8.9%), at a cost on the order of a few hundred dollars per problem; two solved problems had been open 56 years (arXiv:2605.22763). All Lean proofs and selected natural-language versions were released.

The strongest evidence here is end-to-end Lean checking: a sorry-free proof of a correctly formalized statement leaves no room for specification gaming. The caveats: the 353 problems are the already-formalized subset, not the whole Erdős database or a random sample, so there is real selection bias; the failure analysis shows agents often offloading the hard step into a sorry helper lemma or citing hallucinated “known” lemmas; and the authors themselves warn that most Erdős problems remain out of reach.
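The sorry pattern is short enough to show in full. The Lean 4 sketch below is a hypothetical illustration with a deliberately trivial placeholder statement; it is not taken from the paper's failure analysis.

```lean
-- The "hard step" is pushed into a helper lemma that is never actually proved.
theorem hard_step (n : Nat) : n + 0 = n := by
  sorry

-- The main theorem now typechecks, but only modulo the unproved helper.
theorem main_result (n : Nat) : n + 0 = n :=
  hard_step n

-- Lean reports the dependency: the listed axioms include `sorryAx`.
#print axioms main_result
```

This is why "compiles in Lean" is not the bar. The bar is a proof whose axiom list contains no `sorryAx`, attached to a statement that matches the intended problem.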

Telling detail: a simpler LLM-plus-Lean baseline replicated the nine Erdős solves at higher cost — suggesting the elaborate scaffolding buys efficiency, not raw capability.

Source: Advancing Mathematics Research with AI-Driven Formal Proof Search

June 2026

JUN 2026 · FIRST PROOF SECOND BATCH

Ten unpublished research problems.

Expert-refereed, with all logs released.

Results: 7 / 10 clean · 2 minor fixes · 1 wrong.

One frontier model: only GPT-5.5 Pro really competed.

Does not test: picking problems · building frameworks · judging significance.

The First Proof second batch tested public / API-accessible systems on ten unpublished research-level problems drawn from working mathematicians’ own research, with expert referees and all logs released (arXiv:2606.18119; four editors — Abouzaid at Stanford, Srivastava at UC Berkeley, Ward at UT Austin, Williams at Harvard).

Results: of the ten problems, 7 were solved flawlessly, 2 needed only minor corrections, and 1 came back wrong. One stochastic-PDE solution used a novel approach the referees found genuinely impressive. Across the benchmark, 39 AI solutions were generated, each graded by at least two expert referees.

On “systems”: although four entries were compared (IMProofBench ProofCouncil, UCLA Moonshot Harness, OpenAI ChatGPT 5.5 Pro, Princeton Momus), in practice the only frontier model in play was OpenAI’s GPT-5.5 Pro — Google DeepMind and Anthropic did not participate, and the other entries are academic / third-party harnesses built around that same model.

What it deliberately does NOT test: choosing which problems are worth proving, developing new frameworks or definitions, or judging mathematical significance. The editors explicitly separate proving a given statement from the broader research cycle of asking questions and building theory.

Source: First Proof Second Batch, Harvard FAS

Reading the Erdős Testbed

Part Two · Interpretation

Reading the
Evidence

How Tao’s tracker and the First Proof benchmarks classify what AI did — the failure modes, the costs, and what the counts do not mean.

Erdős Tracker

Tao GitHub wiki

AI contributions are now categorized, not just counted.

PRIMARY ROLE

Who did the math?

AI standalone; AI alongside literature; AI building on literature; AI collaborating with humans.

SECONDARY ROLE

What helped?

Literature search, formalization, rewriting, computation, and proof checking all count differently.

STATUS

What survived?

Full solution, candidate solution, partial result, incorrect proof, and argument-with-major-gaps are separated.

The page explicitly says it is not a benchmark.

Source: Tao GitHub wiki

Erdős Takeaways

What the Erdős page actually supports

AI is strong on stated, searchable, and sometimes overlooked problems — but “open in the database” ≠ “resisted decades of effort.”

Literature identification is valuable, but must not be reported as discovery.

Lean helps when the statement is formalized correctly; it does not decide mathematical intent.

The strongest cases combine AI generation with human digestion, context, and attribution.

Source: Tao GitHub wiki disclaimers

First Proof 2

Second batch: careful interpretation

Problems came from working mathematicians’ unpublished research.

Systems were compared on proof production, not on choosing problems.

AI-generated solutions were evaluated by expert referees, not accepted automatically.

Claims here should not be read as autonomous mathematical taste or field-building.

Source: First Proof Second Batch

Failure Modes

Across all 2025–2026 systems

Four failure modes recur — and none are fixed by raw scale.

NOVELTY

Is it actually new?

A correct proof may be an unattributed re-expression of prior work — “subconscious plagiarism.” Formal verification does not catch it.

SPECIFICATION GAMING

The sorry exploit

Agents push the hard step into a helper lemma closed by sorry, or cite hallucinated “known” lemmas. Only sorry-free, correctly-stated proofs block this.

MISFORMALIZATION

Right proof, wrong question

Lean verifies the statement it is given. #728, #125, and #741 needed human disambiguation before a proof meant anything.

The fourth is the expert-verification bottleneck: 137 flawed candidate solutions had to be audited to extract 13 correct ones.

Source: Aletheia / Gemini Erdős case study, AlphaProof Nexus

The Compute Bill

First Proof second batch · per-problem compute

The same problems, an order of magnitude apart in cost.

CHATGPT 5.5 PRO

$117

Lowest overhead, fastest (Bubeck–Sawhney direct prompt) — but leans on a human loop to filter rapid algebraic hallucinations.

UCLA MOONSHOT

$4,800

Roughly 40× more expensive: correct but verbose, with heavy state-tracking and token-window depletion.

PROOFCOUNCIL

6 / 10

Strongest accuracy: IMProofBench’s multi-agent harness — still built on GPT-5.5 Pro, the one frontier model in play.

Gowers warns this cost gap could widen the global research divide.

Source: Harvard FAS on First Proof second batch

Not Yet Shown

What has not been demonstrated — even now

ask

Choosing what is worth proving: First Proof explicitly does not score problem selection.

build

Constructing new frameworks, definitions, or fields — Clausen–Scholze-style rebuilding remains the domain of human insight.

frontier

Reliably judging the solvable frontier; commenters report models still have a “fuzzy” sense of what is in reach.

value

Judging significance in a way the community accepts; autonomous solves rarely rose above student-exercise level.

Source: First Proof Second Batch, Towards Autonomous Mathematics Research

Open Disputes

What the community has not settled

solved

What counts as “solved”: a new proof, a variant, an obscure rediscovery, or database lag are now tracked apart.

credit

Authorship and publishing: arXiv bars AI-written content; some urge separate, human-moderated repositories.

training

Doctoral pathways: “gentle” PhD problems now fall in under two hours, raising the entry bar for newcomers.

access

Compute disparity: premium pipelines cost thousands per query, risking a widening global gap.

Source: Gowers blog

Voices · Terence Tao

Part Three · A practitioner’s view

Terence
Tao

Fields medallist, maintainer of the Erdős tracker, and the field’s most active hands-on user of AI and proof assistants.

Tao · Trajectory

FROM CAUTIOUS USER TO EVANGELIST

Tao now calls 2025 “the year AI really started being useful.”

Lean, AlphaEvolve, AlphaProof, autoformalization.

2025

the year AI “really started being useful” — Tao, Quanta

Tao’s own trajectory is part of the story: from cautious user to active evangelist. He recalls early models behaving “like overconfident undergraduates” — fluent and strong where there was abundant training data, but faltering at the genuine research frontier and confidently wrong in subtle ways.

By 2026 he reports a shift: autoformalization “crossed a critical threshold,” with tasks that once took volunteers weeks now taking hours. He calls 2025 “the year AI really started being useful” (Quanta).

His characteristic framing is about scale rather than individual brilliance: “with these tools you can solve thousands of problems at once and start doing statistical studies” — population-style mathematics across whole families of problems. The following slides give his exact words and his strengths/limits breakdown; the throughline is that his enthusiasm is specifically about AI as a scale-and-interface tool, not push-button autonomy.

Source: Quanta: How Terry Tao became an evangelist for AI in math, Quanta: The AI revolution in math has arrived

Tao · In His Own Words

“We have not had, in the past, assistants that are competent enough to understand complex instructions and work at massive scale — but are also unreliable, unreliable in subtle ways, whilst providing sufficiently good output.”

Terence Tao · Lex Fridman Podcast #472 · 2025

Source: Lex Fridman Podcast #472 (transcript), video

Tao · Strengths and Limits

Where Tao sees the real value — and the real bottleneck

tail

In the Erdős database, AI is good at systematically exploring the obscure long tail and finding “cheap wins,” but Tao warns this “says more about speed than difficulty.”

scale

The deeper opportunity is scale: tedious computations, many routine cases, proof-assistant translation, and population-study-style mathematics across thousands of problems at once.

gap

His “impedance mismatch”: generation is now orders of magnitude faster, while human verification and digestion are not — a move from proof scarcity to proof surplus.

limits

He rejects push-button autonomy for hard problems. AI is roughly a “junior co-author” for grunt work; humans still supply taste, inspired guesses, and sustained judgment.

Source: Nature: ‘The job description is changing’, The Atlantic, The Edge of Mathematics, Tao and Klowden, 2026

Voices · Timothy Gowers

Part Four · A combinatorialist’s view

Timothy
Gowers

Fields medallist, additive combinatorialist, and one of the nine signatories who human-verified the unit-distance disproof.

Gowers · The Experiment

MAY 2026 · CHATGPT 5.5 PRO ON NATHANSON

“A piece of PhD-level research in an hour or so, with no serious mathematical input from me.”

A quadratic bound in 17 minutes; then an original idea.

1 hr

from open problem to written-up result

This slide reframes the same May 2026 Nathanson episode from Gowers’s experiential angle — what it felt like to watch. For the h=2 case the model produced a best-possible quadratic construction with no clever prompting from him. On the generalized h-fold problem it introduced an h^2-dissociated-set idea that Isaac Rajagopal judged original and “quite impressive,” and then drafted a LaTeX write-up of the result in 2 minutes 23 seconds.

Gowers’s striking summary was that this amounted to “a piece of PhD-level research in an hour or so, with no serious mathematical input from me.”

Hold it alongside his caveats (next slide): he stresses this may reflect combinatorics being unusually favorable, that correctness still needs human or formal certification, and that the speed of write-up raises real questions about how the mathematical record and doctoral training should adapt.

Source: Gowers, A recent experience with ChatGPT 5.5 Pro

Gowers · In His Own Words

“It is no longer enough that somebody asks a problem: it needs to be hard enough for an LLM not to be able to solve it.”

W. T. Gowers · A recent experience with ChatGPT 5.5 Pro · 2026

Source: Gowers, A recent experience with ChatGPT 5.5 Pro

Gowers · Caveats

Impressed, with sharp caveats about where this works

field

Combinatorics may be unusually favorable: many explicit, askable problems may have easy arguments humans simply have not tried.

rule

His rule of thumb: “If it is difficult for a human to distinguish a correct statement from a plausible-sounding incorrect one, an LLM will not be able to distinguish them either.”

record

Results should enter the record only if a human certifies them — or, better, only if a proof assistant has formalized them.

bar

The new mark of a contribution may become proving something that LLMs cannot — and those who have solved hard problems themselves will use AI best.

Source: Gowers, A recent experience with ChatGPT 5.5 Pro

Where We Stand

Part Five · Synthesis

Where
We Stand

What the evidence supports today, what it does not, and the sources behind every claim in this briefing.

Current Frontier

WHAT THE EVIDENCE SUPPORTS

AI is now useful for stated mathematical tasks.

The strongest public cases are construction search, literature retrieval, Lean proof search, proof-detail completion, and human-verified solutions to explicitly posed problems.

Tao would add: AI is a complement, interface, and scale tool; humans still supply taste, inspired guesses, and sustained judgment.

Gowers would add: some askable problems are now too easy for LLMs; correctness and mathematical value still need human or formal certification.

So the conclusion is partial: not “AI can do mathematics.” Rather, AI can produce publishable-looking or verified work in favorable, well-specified settings.

Source: Tao GitHub wiki, Gowers, Tao and Klowden

Sources · Results

Results, papers, and verification

Source: primary links listed on this slide.

Sources · Voices

Voices, interviews, and analysis

Terence Tao

Timothy Gowers

Press & analysis

Critique & community

Source: interviews, blogs, and profiles listed on this slide.
