# [AINews] OpenAI GPT-next disproves 80 year old Erdős planar unit distance problem for under $1000

> Source: <https://www.latent.space/p/ainews-openai-gpt-next-disproves>
> Published: 2026-05-21 07:28:36+00:00

### a quiet day but a nice result in AI x mathematics

We will leave coverage of the [SpaceXAI IPO filing](https://x.com/eliebakouch/status/2057222864332320999?s=12) for the actual day of IPO. Today we celebrate OpenAI’s result, speculated to be [GPT 5.6 running for <32 hours or <$1000](https://x.com/willdepue/status/2057213893857165701), on [the planar unit distance problem](https://openai.com/index/model-disproves-discrete-geometry-conjecture/). Similar to the 2025 [IMO Gold](https://news.smol.ai/issues/25-08-11-ioi-gold) result, this is a general purpose LLM, [not an AlphaProof/Lean style dedicated model](https://x.com/polynoamial/status/2057179104315670826), which lends hope that this extended reasoning will generalize beyond math.

Among the 125 pages of output, there exists a “[page 39 moment](https://x.com/voooooogel/status/2057198687307362642)” that is getting some attention.

As the authors of [the opinion letter](https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit-distance-remarks.pdf) note, this is a disproof rather than a proof (a proof would have been more impressive), but it nevertheless points toward things to come.

AI News for 5/4/2026-5/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

**AI Twitter Recap**

**OpenAI’s Math Breakthrough on the Erdős Unit Distance Problem**

**A general-purpose reasoning model produced a new research result in discrete geometry**: OpenAI announced that an internal model disproved a long-standing belief around the planar **unit distance problem**, a famous Erdős problem from 1946, discovering a new family of constructions that improves on square-grid-style solutions [@OpenAI](https://x.com/OpenAI/status/2057176201782075690). OpenAI emphasized this was a **general-purpose model**, not a domain-specific math system or scaffolded solver [@OpenAI](https://x.com/OpenAI/status/2057176203166171317), and said the result points to stronger long-horizon reasoning for science broadly [@OpenAI](https://x.com/OpenAI/status/2057176204541866087).

The result drew unusually strong validation from mathematicians and adjacent researchers. **Timothy Gowers** called it the first really clear example of AI solving a **well-known** open math problem [@wtgowers](https://x.com/wtgowers/status/2057175729008153069), while OpenAI researcher **Hongxun Wu** described it as an internal reasoning-LLM milestone on “the hardest problems” [@HongxunWu](https://x.com/HongxunWu/status/2057176383106027567). Additional reactions from [@thomasfbloom](https://x.com/thomasfbloom/status/2057177152894771631), [@gdb](https://x.com/gdb/status/2057182650784452925), [@alexwei_](https://x.com/alexwei_/status/2057182873208369485), and [@polynoamial](https://x.com/polynoamial/status/2057178198228586824) converged on the same point: this appears qualitatively beyond prior “AI does olympiad math” milestones.

**Notable technical subtext**: OpenAI says the model was not pushed to the limit and is intended for eventual public use [@polynoamial](https://x.com/polynoamial/status/2057179104315670826). The published reasoning summary itself is reportedly massive, around **125 pages** per [@voooooogel](https://x.com/voooooogel/status/2057198687307362642), which helped fuel discussion about the practical role of **test-time compute** in frontier reasoning. Some observers explicitly framed this as further evidence that inference-time scaling is the paradigm carrying current progress [@_arohan_](https://x.com/_arohan_/status/2057188616099725525), with others extrapolating to faster future gains in formal science and mathematics [@scaling01](https://x.com/scaling01/status/2057246143881609510), [@sama](https://x.com/sama/status/2057203171198636251).
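For readers unfamiliar with the problem: the planar unit distance problem asks how many pairs among n points in the plane can be exactly distance 1 apart. The sketch below only illustrates the quantity being counted, using a unit-spaced square grid and a small triangular configuration. It does not reproduce OpenAI's construction, and the helper names (`count_unit_distances`, `square_grid`) are hypothetical.

```python
from itertools import combinations
from math import dist, isclose, sqrt


def count_unit_distances(points, tol=1e-9):
    """Count unordered pairs of points whose Euclidean distance is 1 (within tol)."""
    return sum(
        1
        for p, q in combinations(points, 2)
        if isclose(dist(p, q), 1.0, abs_tol=tol)
    )


def square_grid(k):
    """Return the k*k points of a square grid with unit spacing."""
    return [(float(x), float(y)) for x in range(k) for y in range(k)]


if __name__ == "__main__":
    # A k x k unit grid has 2*k*(k-1) unit-distance pairs (horizontal + vertical neighbours).
    for k in (2, 3, 4):
        print(k * k, "grid points:", count_unit_distances(square_grid(k)), "unit pairs")

    # Four points forming two equilateral triangles beat the 2 x 2 grid (5 pairs vs 4).
    h = sqrt(3) / 2
    rhombus = [(0.0, 0.0), (1.0, 0.0), (0.5, h), (1.5, h)]
    print("rhombus:", count_unit_distances(rhombus), "unit pairs")
```

The brute-force pair check is quadratic in the number of points, which is fine for toy configurations like these but not for exploring large constructions.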

**Cohere Command A+ Open Release and Architecture Discussion**

**Cohere released Command A+ as Apache 2.0 open weights**, positioning it as its most powerful model yet and explicitly optimized for low hardware requirements [@cohere](https://x.com/cohere/status/2057120818551734589), with the licensing clarified in a follow-up [@cohere](https://x.com/cohere/status/2057122131410813016). The release is significant partly because it is Cohere’s **first fully open Apache 2 model** per [@aidangomez](https://x.com/aidangomez/status/2057142232860258527). Community reaction focused on this as a meaningful shift toward more permissive, deployable enterprise-grade open models [@nickfrosst](https://x.com/nickfrosst/status/2057132425310851104), [@ClementDelangue](https://x.com/ClementDelangue/status/2057180057756467671).

The model details repeated across multiple posts: roughly **218B MoE / 25B active**, **multimodal**, **48 languages**, and runnable on relatively modest setups [@JayAlammar](https://x.com/JayAlammar/status/2057145838011564126), [@mervenoyann](https://x.com/mervenoyann/status/2057128432190787643). **vLLM day-0 support** landed quickly, including a note that it can run on as little as **2× H100s at W4A4** [@vllm_project](https://x.com/vllm_project/status/2057206049665622070).

**Benchmarks painted a mixed but credible picture**: Artificial Analysis placed Command A+ at **37 on its Intelligence Index**, around Claude 4.5 Haiku territory, with especially strong **non-hallucination** behavior and decent speed, but weaker scientific reasoning and coding than top peer models [@ArtificialAnlys](https://x.com/ArtificialAnlys/status/2057123594162077837).

The community also dug into the architecture: unusual choices called out include a **parallel transformer block**, large **shared expert** usage, **LayerNorm over RMSNorm**, relatively low **32-layer** depth, and atypical head/expert configurations [@eliebakouch](https://x.com/eliebakouch/status/2057198733759008989), [@rasbt](https://x.com/rasbt/status/2057241574161932339), [@stochasticchasm](https://x.com/stochasticchasm/status/2057150551696261607). This made the release notable not just as a model drop but as an architectural data point.
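The "2× H100s at W4A4" claim can be sanity-checked with back-of-the-envelope arithmetic on weight storage. The sketch below assumes 80 GB per H100 and counts only the weights (no KV cache, activations, or quantization scale overhead), so it is a rough lower bound rather than a deployment guide. It also assumes all 218B parameters stay resident on the GPUs even though only about 25B are active per token, which is the usual case for MoE serving without offloading.

```python
H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8


def weights_fit(params_billion, bits_per_weight, num_gpus, gpu_memory_gb=H100_MEMORY_GB):
    """True if the raw weights alone are smaller than the combined GPU memory."""
    return weight_memory_gb(params_billion, bits_per_weight) < num_gpus * gpu_memory_gb


if __name__ == "__main__":
    total_params_billion = 218  # total MoE parameters reported in the posts
    for bits in (16, 8, 4):
        size = weight_memory_gb(total_params_billion, bits)
        verdict = "fits" if weights_fit(total_params_billion, bits, num_gpus=2) else "does not fit"
        print(f"{bits:>2}-bit weights: {size:6.1f} GB -> {verdict} in 2 x {H100_MEMORY_GB} GB")
```

Under these assumptions only the 4-bit case leaves headroom on two GPUs (about 109 GB of weights against 160 GB of memory), which is consistent with the W4A4 note.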

**Benchmarks for Agents, Memory, and Scientific Workflows**

**InferenceBench** is one of the day’s most technically substantive releases. It targets **AI R&D automation** through open-ended inference optimization tasks, and the headline is negative for current frontier agents: they struggle with **system-level engineering**, dependency management, and broad exploration, underperforming a simple baseline of **vLLM/SGLang hyperparameter tuning** [@maksym_andr](https://x.com/maksym_andr/status/2057106398228439148). The thread also reports an apparent **inverse scaling** effect, where models like **Claude Sonnet 4.6** and **GLM-5** rank well because they preserve robust final states, while larger models often produce brittle end configurations.

**Terminal-Bench Science** extends agent evaluation from coding into **real scientific workflows**, with task contributions now open [@StevenDillmann](https://x.com/StevenDillmann/status/2057144415513420049). In parallel, **MINTEval** targets long-context memory systems under frequent updates and interference: average instance length is **138.8k tokens** with up to **1.8M**, yet across 7 systems the average accuracy is only **27.9%**, with the best at **33.4%** [@hyunji_amy_lee](https://x.com/hyunji_amy_lee/status/2057141349166768233). This complements a growing line of work arguing that memory should be a dedicated learned subsystem rather than just RAG/context stuffing [@dair_ai](https://x.com/dair_ai/status/2057182105671750047).

On the human side of interaction research, **ThoughtTrace** introduced a large-scale dataset of users’ **self-reported thoughts during real LLM conversations**: **10,174 thought annotations**, **2,155 multi-turn conversations**, **1,058 users**, **20 models**. Reported gains include **+41.7%** for user behavior prediction and **+25.6%** for alignment [@chuanyang_jin](https://x.com/chuanyang_jin/status/2057111965101670842). This is one of the more concrete attempts to instrument the “latent user state” that conversation logs alone miss.

**Google I/O Follow-Through: Gemini 3.5 Flash, Omni, AI Studio, and Antigravity**

**Gemini 3.5 Flash** began broader rollout in the Gemini app, including free access globally [@GeminiApp](https://x.com/GeminiApp/status/2057140474192994356), [@GeminiApp](https://x.com/GeminiApp/status/2057237126526517727). Google framed it as its strongest **agentic and coding** model yet, claiming frontier performance at **4× the speed** of comparable models and under half the cost [@Google](https://x.com/Google/status/2057257773868388448). However, external discussion was much more mixed, with multiple posts questioning **real-world cost/performance** and token efficiency despite favorable launch-stage benchmark positioning [@ArtificialAnlys](https://x.com/ArtificialAnlys/status/2057181290412261557), [@scaling01](https://x.com/scaling01/status/2057177354582020362), [@giffmana](https://x.com/giffmana/status/2057155343390494949).

**Gemini Omni** appears to have made a bigger qualitative impression than 3.5 Flash. Google positioned it as a conversational multimodal creation/editing model for video and mixed-input workflows [@Google](https://x.com/Google/status/2057180052979409172), with Gemini app demos showing conversational video editing [@GeminiApp](https://x.com/GeminiApp/status/2057159933934907825). Early reactions generally treated Omni as a more differentiated product than the core LLM refresh [@scaling01](https://x.com/scaling01/status/2057143531622334678).

On tooling, **AI Studio** pushed harder toward end-to-end developer workflow and mobile access [@GoogleAIStudio](https://x.com/GoogleAIStudio/status/2057122673558434205), while several posts tried to decode the relation between **Gemini Spark**, **Antigravity**, and Google’s internal/external agent harnesses [@simonw](https://x.com/simonw/status/2057115921551098211), [@_philschmid](https://x.com/_philschmid/status/2057136375988912176). A more concrete Antigravity-adjacent update was the launch of **Science Skills** for Google’s agent stack, integrating 30+ life-science sources such as **UniProt** and **AlphaFold DB** [@GoogleDeepMind](https://x.com/GoogleDeepMind/status/2057256257153884161).

**Agent Infrastructure, Retrieval, and Dev Tooling**

Several posts converged on the same operational lesson: **agents fail on infra reality before they fail on demos**. That theme shows up in the qualitative thread on research agents fighting dependency conflicts and configs [@jehyeoky248](https://x.com/jehyeoky248/status/2057103859927941153), in LangChain’s push for **LangSmith Sandboxes GA** [@LangChain](https://x.com/LangChain/status/2057152025058558072), and in newer lighter-weight **code interpreter** support for deepagents as a middle ground between pure tool execution and full sandboxes [@sydneyrunkle](https://x.com/sydneyrunkle/status/2057179305948647775), [@hwchase17](https://x.com/hwchase17/status/2057214077114679386).

In retrieval/search infra, **Perplexity** described a productionized **query-aware, citation-preserving context compression** system that cuts context tokens by up to **70%** while improving answer quality, and claims **50× compression** on SimpleQA at frontier-level performance [@perplexity_ai](https://x.com/perplexity_ai/status/2057151002105753950). **Weaviate 1.37** added **MMR reranking** to improve diversity in vector retrieval for RAG/agents [@weaviate_io](https://x.com/weaviate_io/status/2057117923416629676), while **SID-1** was presented as an RL-trained agentic search model with **1.9× recall over RAG+rerank**, **24× faster**, and **99% cheaper** than GPT-5.1 in the cited setup [@turbopuffer](https://x.com/turbopuffer/status/2057166836031193523).

**Cursor**, **VS Code**, and **Codex** all shipped notable workflow updates. Cursor added **automations** in the agents workspace [@cursor_ai](https://x.com/cursor_ai/status/2057167359593603471), and VS Code shipped better markdown/HTML previews, remote session continuity, and utility-model configurability [@code](https://x.com/code/status/2057195516123808070), [@pierceboggan](https://x.com/pierceboggan/status/2057204489661407365). On the model side, **Composer 2.5** posted a strong coding-agent showing: **62** on the Artificial Analysis Coding Agent Index at much lower cost than top Opus/GPT-5.5 variants [@ArtificialAnlys](https://x.com/ArtificialAnlys/status/2057277363789197561). OpenAI also shipped **Codex on mobile** [@OpenAIDevs](https://x.com/OpenAIDevs/status/2057142816497906045).
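For context on the MMR reranking mentioned above: Maximal Marginal Relevance greedily picks results that are relevant to the query but not redundant with results already chosen. The sketch below is the generic textbook formulation over plain Python vectors, not Weaviate's API or implementation. The function names and the `lam` trade-off parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query, docs, k, lam=0.5):
    """Return indices of up to k docs chosen by Maximal Marginal Relevance.

    lam=1.0 ranks purely by relevance; lower values penalise similarity
    to documents that were already selected.
    """
    relevance = [cosine(query, d) for d in docs]
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    query = (1.0, 0.0)
    docs = [(1.0, 0.1), (1.0, 0.12), (0.6, -0.8)]  # docs 0 and 1 are near-duplicates
    print("relevance only:", mmr(query, docs, k=2, lam=1.0))
    print("with diversity:", mmr(query, docs, k=2, lam=0.5))
```

With `lam=1.0` the two near-duplicate documents are returned together, while `lam=0.5` swaps the second one for the less similar document, which is the diversity effect the reranker is meant to provide.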

**Top Tweets (by engagement)**

**OpenAI math milestone**: OpenAI’s announcement of the unit-distance breakthrough was the most consequential technical post in the set, both for scientific novelty and for what it implies about long-horizon reasoning [@OpenAI](https://x.com/OpenAI/status/2057176201782075690).

**Cohere Command A+ open release**: One of the largest model-release stories of the day, mainly because of the **Apache 2.0** license and unusual architecture [@cohere](https://x.com/cohere/status/2057120818551734589).

**Anthropic compute expansion with SpaceX/Colossus**: Anthropic is reportedly scaling up on **Colossus 2** capacity [@nottombrown](https://x.com/nottombrown/status/2057194829986300375), with follow-on posts citing a filing that values the SpaceX compute agreement at **$1.25B/month through May 2029** [@SemiAnalysis_](https://x.com/SemiAnalysis_/status/2057218890288030110).

**Exa funding**: Exa raised **$250M Series C at a $2.2B valuation**, explicitly framing itself as a search lab organizing web data for agents [@ExaAILabs](https://x.com/ExaAILabs/status/2057132080317042697).

**AI Reddit Recap**

**/r/LocalLlama + /r/localLLM Recap**

**1. Qwen3.7 Preview and 27B Roadmap**

[Qwen is cooking hard](https://www.reddit.com/r/LocalLLaMA/comments/1theffd/qwen_is_cooking_hard/) (Activity: 1292): The [image](https://i.redd.it/cefjio15g12h1.png) is a screenshot of Chujie Zheng teasing that Qwen is “cooking hard”, quoting an announcement that Qwen3.7 Preview is now on Arena with Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview; the post claims Alibaba ranks `#6` in Text and `#5` in Vision. In context, the Reddit title/selftext indicate users are anticipating larger and refreshed open-weight models, especially 122B and a new 27B, though the screenshot itself is mainly a teaser rather than a technical benchmark breakdown. Commenters are split between excitement for high-end models and practical interest in smaller local models: some want **9B/4B** variants for low-end hardware, while others hope for **122B**, a better **35B**, or joke that Qwen may soon be “cooking” their GPU.

Several commenters focused on **model-size coverage** rather than the current `27B` release, saying they cannot practically run it and are hoping for smaller **Qwen** `4B`/`9B` variants for low-end or laptop GPUs. There was also interest in larger `122B` and improved `35B` checkpoints, though one commenter noted prior `122B` mentions around Qwen 3.6 never materialized, raising uncertainty about whether a Qwen 3.7 `122B` will actually ship.

[Qwen3.7 Max scored by Artificial Analysis, 27B/35B waiting room](https://www.reddit.com/r/LocalLLaMA/comments/1tie6gy/qwen37_max_scored_by_artificial_analysis_27b35b/) (Activity: 553): A Reddit post highlights an [Artificial Analysis leaderboard screenshot](https://preview.redd.it/42ak5qmus82h1.png?width=1133&format=png&auto=webp&s=744ea3dfc06c83d0c4d8aa128c39b3238b17d7be) where Qwen3.7 Max ranks `5th`, roughly level with GPT 5.4 (xhigh) and slightly ahead of Gemini 3.5 Flash. The author notes Qwen3.6 27B trails its Max counterpart by exactly `6` points and hopes upcoming Qwen3.7 27B/35B variants land close to the Max model’s performance. Commenters are mainly *“waiting eagerly for the open weight models”* and view the score as evidence that the **Qwen** team is now competitive with major labs, despite concerns that the Max model is not open-source. One technical concern raised is whether Qwen has fixed its prior tendency toward *“overthinking.”*

Commenters focused on whether **Qwen3.7 Max** represents a genuine architectural update versus another finetune/iteration of the **Qwen3.5/Qwen3.6** architecture; one noted that extracting more performance from the same base architecture would still be technically notable.

Several users are waiting for potential **open-weight 27B/35B variants**, but one commenter speculated there may be no **Qwen 3.7 27B** at all, arguing that “Qwen 3.7” could simply be a private large model similar to **Qwen 3.6 390B A30B** rather than a full public model family.

A technical concern raised was whether the Qwen team has addressed the model’s reported **“overthinking”** behavior, implying interest in improvements to reasoning-token efficiency, response latency, and controllability rather than just benchmark gains.

[Qwen will release another 27B with high probability](https://www.reddit.com/r/LocalLLaMA/comments/1tiwnpc/qwen_will_release_another_27b_with_high/) (Activity: 1162): The [image](https://i.redd.it/g5uabdvdic2h1.jpeg) is a screenshot of an X/Twitter exchange where xiong-hui (barry) chen says Qwen is “waiting for the exact roadmap” but believes there is a high probability of another `27B` release, framed by the post title as a likely follow-up to the highly regarded Qwen 3.6 27B. The technical significance is speculation around Qwen continuing to optimize parameter efficiency / “intelligence density” in the mid-size dense-model range rather than only scaling to much larger MoE models. Commenters mostly discuss local-inference practicality: some want a larger `122B-A10B` **MoE** model, while others argue that `27B` is too heavy for `16GB` VRAM users and prefer a `35B`/`A3B`-style MoE that can run on consumer gaming laptops or hybrid CPU/GPU setups.

Several commenters discussed the **local-inference gap around 27B models**: users with `16GB VRAM` argued that a `27B` model is difficult to run at a usable quantization level, while a hypothetical **Qwen 35B MoE / A3B-style model** could be more practical via hybrid CPU/GPU inference and would remain accessible on gaming laptops.

There was interest in larger **dense Qwen variants**, especially `50B`–`80B`, with one commenter noting that **Qwen 27B is already very fast with MTP** and they would trade some generation speed for higher parameter count and potentially better quality.

Model-size requests clustered around both **MoE and dense scaling paths**: proposed targets included **Qwen 3.7 122B-A10B**, `50B`–`80B` MoE, and dense `10B`, `20B`, `30B`, `50B`, or `80B` releases, reflecting demand for both high-end quality and locally runnable tiers.
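The complaint that `27B` is awkward on `16GB` of VRAM can be made concrete by asking the question in reverse: given a VRAM budget and a bits-per-weight level, roughly how many parameters fit? The sketch below is a rough estimate only. The `reserve_gb` headroom for KV cache, activations, and runtime overhead is an assumed placeholder value, and real quantization formats carry extra metadata that this ignores.

```python
def max_params_billion(vram_gb: float, bits_per_weight: float, reserve_gb: float = 2.0) -> float:
    """Largest parameter count (in billions) whose weights fit in VRAM after a reserve.

    reserve_gb is an assumed allowance for KV cache, activations, and runtime overhead.
    """
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return usable_gb * 8 / bits_per_weight


if __name__ == "__main__":
    vram_gb = 16
    model_params_billion = 27
    for bits in (8, 5, 4, 3):
        limit = max_params_billion(vram_gb, bits)
        verdict = "fits" if model_params_billion <= limit else "does not fit"
        print(f"{bits}-bit: up to ~{limit:.1f}B params -> a 27B dense model {verdict}")
```

Under these assumptions a 27B dense model only squeezes in at around 4 bits per weight or lower, with very little memory left for context, which matches the commenters' preference for smaller dense models or MoE models that can split work between CPU and GPU.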

