{"slug": "ainews-openai-gpt-next-disproves-80-year-old-erdos-planar-unit-distance-problem", "title": "[AINews] OpenAI GPT-next disproves 80 year old Erdős planar unit distance problem for under $1000", "summary": "OpenAI's GPT-next model disproved the 80-year-old Erdős planar unit distance problem in under 32 hours at a cost of less than $1,000, marking the first instance of a general-purpose AI solving a well-known open mathematics problem. The result, published as a 125-page reasoning summary, was validated by prominent mathematicians including Timothy Gowers, who called it a clear breakthrough beyond prior AI math milestones. OpenAI emphasized the model is a general-purpose reasoning system, not a domain-specific solver, suggesting the extended reasoning capabilities demonstrated could generalize to other scientific fields.", "body_md": "# [AINews] OpenAI GPT-next disproves 80 year old Erdős planar unit distance problem for under $1000\n\n### a quiet day but a nice result in AI x mathematics\n\nWe will leave coverage of the [SpaceXAI IPO filing](https://x.com/eliebakouch/status/2057222864332320999?s=12) for the actual day of IPO. Today we celebrate OpenAI’s result, speculated to be [GPT 5.6 running for <32 hours or <$1000](https://x.com/willdepue/status/2057213893857165701), on [the planar unit distance problem](https://openai.com/index/model-disproves-discrete-geometry-conjecture/). 
Similar to the 2025 [IMO Gold](https://news.smol.ai/issues/25-08-11-ioi-gold) result, this is a general-purpose LLM, [not an AlphaProof/Lean style dedicated model](https://x.com/polynoamial/status/2057179104315670826), which lends hope that this extended reasoning will generalize beyond math.\n\nAmong the 125 pages of output, there exists a “[page 39 moment](https://x.com/voooooogel/status/2057198687307362642)” that is getting some attention.\n\nAs the authors of [the opinion letter](https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit-distance-remarks.pdf) note, this is a disproof rather than a proof (which would have been more impressive), but it nevertheless points towards the way of things to come.\n\nAI News for 5/4/2026-5/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!\n\n**AI Twitter Recap**\n\n**OpenAI’s Math Breakthrough on the Erdős Unit Distance Problem**\n\n**A general-purpose reasoning model produced a new research result in discrete geometry**: OpenAI announced that an internal model disproved a long-standing belief around the planar **unit distance problem**, a famous Erdős problem from 1946, discovering a new family of constructions that improves on square-grid-style solutions [@OpenAI](https://x.com/OpenAI/status/2057176201782075690). 
OpenAI emphasized this was a **general-purpose model**, not a domain-specific math system or scaffolded solver [@OpenAI](https://x.com/OpenAI/status/2057176203166171317), and said the result points to stronger long-horizon reasoning for science broadly [@OpenAI](https://x.com/OpenAI/status/2057176204541866087).\n\nThe result drew unusually strong validation from mathematicians and adjacent researchers. **Timothy Gowers** called it the first really clear example of AI solving a **well-known** open math problem [@wtgowers](https://x.com/wtgowers/status/2057175729008153069), while OpenAI researcher **Hongxun Wu** described it as an internal reasoning-LLM milestone on “the hardest problems” [@HongxunWu](https://x.com/HongxunWu/status/2057176383106027567). Additional reactions from [@thomasfbloom](https://x.com/thomasfbloom/status/2057177152894771631), [@gdb](https://x.com/gdb/status/2057182650784452925), [@alexwei_](https://x.com/alexwei_/status/2057182873208369485), and [@polynoamial](https://x.com/polynoamial/status/2057178198228586824) converged on the same point: this appears qualitatively beyond prior “AI does olympiad math” milestones.\n\n**Notable technical subtext**: OpenAI says the model was not pushed to the limit and is intended for eventual public use [@polynoamial](https://x.com/polynoamial/status/2057179104315670826). The published reasoning summary itself is reportedly massive, around **125 pages** per [@voooooogel](https://x.com/voooooogel/status/2057198687307362642), which helped fuel discussion about the practical role of **test-time compute** in frontier reasoning. 
Some observers explicitly framed this as further evidence that inference-time scaling is the paradigm carrying current progress [@_arohan_](https://x.com/_arohan_/status/2057188616099725525), with others extrapolating to faster future gains in formal science and mathematics [@scaling01](https://x.com/scaling01/status/2057246143881609510), [@sama](https://x.com/sama/status/2057203171198636251).\n\n**Cohere Command A+ Open Release and Architecture Discussion**\n\n**Cohere released Command A+ as Apache 2.0 open weights**, positioning it as its most powerful model yet and explicitly optimized for low hardware requirements [@cohere](https://x.com/cohere/status/2057120818551734589), with the licensing clarified in a follow-up [@cohere](https://x.com/cohere/status/2057122131410813016). The release is significant partly because it is Cohere’s **first fully open Apache 2 model** per [@aidangomez](https://x.com/aidangomez/status/2057142232860258527). Community reaction focused on this as a meaningful shift toward more permissive, deployable enterprise-grade open models [@nickfrosst](https://x.com/nickfrosst/status/2057132425310851104), [@ClementDelangue](https://x.com/ClementDelangue/status/2057180057756467671).\n\nThe model details repeated across multiple posts: roughly **218B MoE / 25B active**, **multimodal**, **48 languages**, and runnable on relatively modest setups [@JayAlammar](https://x.com/JayAlammar/status/2057145838011564126), [@mervenoyann](https://x.com/mervenoyann/status/2057128432190787643). **vLLM day-0 support** landed quickly, including a note that it can run on as little as **2× H100s at W4A4** [@vllm_project](https://x.com/vllm_project/status/2057206049665622070).\n\n**Benchmarks painted a mixed but credible picture**: Artificial Analysis placed Command A+ at **37 on its Intelligence Index**, around Claude 4.5 Haiku territory, with especially strong **non-hallucination** behavior and decent speed, but weaker scientific 
reasoning and coding than top peer models [@ArtificialAnlys](https://x.com/ArtificialAnlys/status/2057123594162077837). The community also dug into the architecture: unusual choices called out include a **parallel transformer block**, large **shared expert** usage, **LayerNorm over RMSNorm**, relatively low **32-layer** depth, and atypical head/expert configurations [@eliebakouch](https://x.com/eliebakouch/status/2057198733759008989), [@rasbt](https://x.com/rasbt/status/2057241574161932339), [@stochasticchasm](https://x.com/stochasticchasm/status/2057150551696261607). This made the release notable not just as a model drop but as an architectural data point.\n\n**Benchmarks for Agents, Memory, and Scientific Workflows**\n\n**InferenceBench** is one of the day’s most technically substantive releases. It targets **AI R&D automation** through open-ended inference optimization tasks, and the headline is negative for current frontier agents: they struggle with **system-level engineering**, dependency management, and broad exploration, underperforming a simple baseline of **vLLM/SGLang hyperparameter tuning** [@maksym_andr](https://x.com/maksym_andr/status/2057106398228439148). The thread also reports an apparent **inverse scaling** effect, where models like **Claude Sonnet 4.6** and **GLM-5** rank well because they preserve robust final states, while larger models often produce brittle end configurations.\n\n**Terminal-Bench Science** extends agent evaluation from coding into **real scientific workflows**, with task contributions now open [@StevenDillmann](https://x.com/StevenDillmann/status/2057144415513420049). In parallel, **MINTEval** targets long-context memory systems under frequent updates and interference: average instance length is **138.8k tokens** with up to **1.8M**, yet across 7 systems the average accuracy is only **27.9%**, with the best at **33.4%** [@hyunji_amy_lee](https://x.com/hyunji_amy_lee/status/2057141349166768233). 
This complements a growing line of work arguing that memory should be a dedicated learned subsystem rather than just RAG/context stuffing [@dair_ai](https://x.com/dair_ai/status/2057182105671750047).\n\nOn the human side of interaction research, **ThoughtTrace** introduced a large-scale dataset of users’ **self-reported thoughts during real LLM conversations**: **10,174 thought annotations**, **2,155 multi-turn conversations**, **1,058 users**, **20 models**. Reported gains include **+41.7%** for user behavior prediction and **+25.6%** for alignment [@chuanyang_jin](https://x.com/chuanyang_jin/status/2057111965101670842). This is one of the more concrete attempts to instrument the “latent user state” that conversation logs alone miss.\n\n**Google I/O Follow-Through: Gemini 3.5 Flash, Omni, AI Studio, and Antigravity**\n\n**Gemini 3.5 Flash** began broader rollout in the Gemini app, including free access globally [@GeminiApp](https://x.com/GeminiApp/status/2057140474192994356), [@GeminiApp](https://x.com/GeminiApp/status/2057237126526517727). Google framed it as its strongest **agentic and coding** model yet, claiming frontier performance at **4× the speed** of comparable models and under half the cost [@Google](https://x.com/Google/status/2057257773868388448). However, external discussion was much more mixed, with multiple posts questioning **real-world cost/performance** and token efficiency despite favorable launch-stage benchmark positioning [@ArtificialAnlys](https://x.com/ArtificialAnlys/status/2057181290412261557), [@scaling01](https://x.com/scaling01/status/2057177354582020362), [@giffmana](https://x.com/giffmana/status/2057155343390494949).\n\n**Gemini Omni** appears to have made a bigger qualitative impression than 3.5 Flash. 
Google positioned it as a conversational multimodal creation/editing model for video and mixed-input workflows [@Google](https://x.com/Google/status/2057180052979409172), with Gemini app demos showing conversational video editing [@GeminiApp](https://x.com/GeminiApp/status/2057159933934907825). Early reactions generally treated Omni as a more differentiated product than the core LLM refresh [@scaling01](https://x.com/scaling01/status/2057143531622334678).\n\nOn tooling, **AI Studio** pushed harder toward end-to-end developer workflow and mobile access [@GoogleAIStudio](https://x.com/GoogleAIStudio/status/2057122673558434205), while several posts tried to decode the relation between **Gemini Spark**, **Antigravity**, and Google’s internal/external agent harnesses [@simonw](https://x.com/simonw/status/2057115921551098211), [@_philschmid](https://x.com/_philschmid/status/2057136375988912176). A more concrete Antigravity-adjacent update was the launch of **Science Skills** for Google’s agent stack, integrating 30+ life-science sources such as **UniProt** and **AlphaFold DB** [@GoogleDeepMind](https://x.com/GoogleDeepMind/status/2057256257153884161).\n\n**Agent Infrastructure, Retrieval, and Dev Tooling**\n\nSeveral posts converged on the same operational lesson: **agents fail on infra reality before they fail on demos**. 
That theme shows up in the qualitative thread on research agents fighting dependency conflicts and configs [@jehyeoky248](https://x.com/jehyeoky248/status/2057103859927941153), in LangChain’s push for **LangSmith Sandboxes GA** [@LangChain](https://x.com/LangChain/status/2057152025058558072), and in newer lighter-weight **code interpreter** support for deepagents as a middle ground between pure tool execution and full sandboxes [@sydneyrunkle](https://x.com/sydneyrunkle/status/2057179305948647775), [@hwchase17](https://x.com/hwchase17/status/2057214077114679386).\n\nIn retrieval/search infra, **Perplexity** described a productionized **query-aware, citation-preserving context compression** system that cuts context tokens by up to **70%** while improving answer quality, and claims **50× compression** on SimpleQA at frontier-level performance [@perplexity_ai](https://x.com/perplexity_ai/status/2057151002105753950). **Weaviate 1.37** added **MMR reranking** to improve diversity in vector retrieval for RAG/agents [@weaviate_io](https://x.com/weaviate_io/status/2057117923416629676), while **SID-1** was presented as an RL-trained agentic search model with **1.9× recall over RAG+rerank**, **24× faster**, and **99% cheaper** than GPT-5.1 in the cited setup [@turbopuffer](https://x.com/turbopuffer/status/2057166836031193523).\n\n**Cursor**, **VS Code**, and **Codex** all shipped notable workflow updates. Cursor added **automations** in the agents workspace [@cursor_ai](https://x.com/cursor_ai/status/2057167359593603471), and VS Code shipped better markdown/HTML previews, remote session continuity, and utility-model configurability [@code](https://x.com/code/status/2057195516123808070), [@pierceboggan](https://x.com/pierceboggan/status/2057204489661407365). On the model side, **Composer 2.5** posted a strong coding-agent showing: **62** on the Artificial Analysis Coding Agent Index at much lower cost than top Opus/GPT-5.5 variants [@ArtificialAnlys](https://x.com/ArtificialAnlys/status/2057277363789197561). 
OpenAI also shipped **Codex on mobile** [@OpenAIDevs](https://x.com/OpenAIDevs/status/2057142816497906045).\n\n**Top Tweets (by engagement)**\n\n**OpenAI math milestone**: OpenAI’s announcement of the unit-distance breakthrough was the most consequential technical post in the set, both for scientific novelty and for what it implies about long-horizon reasoning [@OpenAI](https://x.com/OpenAI/status/2057176201782075690).\n\n**Cohere Command A+ open release**: One of the largest model-release stories of the day, mainly because of the **Apache 2.0** license and unusual architecture [@cohere](https://x.com/cohere/status/2057120818551734589).\n\n**Anthropic compute expansion with SpaceX/Colossus**: Anthropic is reportedly scaling up on **Colossus 2** capacity [@nottombrown](https://x.com/nottombrown/status/2057194829986300375), with follow-on posts citing a filing that values the SpaceX compute agreement at **$1.25B/month through May 2029** [@SemiAnalysis_](https://x.com/SemiAnalysis_/status/2057218890288030110).\n\n**Exa funding**: Exa raised **$250M Series C at a $2.2B valuation**, explicitly framing itself as a search lab organizing web data for agents [@ExaAILabs](https://x.com/ExaAILabs/status/2057132080317042697).\n\n**AI Reddit Recap**\n\n**/r/LocalLlama + /r/localLLM Recap**\n\n**1. Qwen3.7 Preview and 27B Roadmap**\n\n[Qwen is cooking hard](https://www.reddit.com/r/LocalLLaMA/comments/1theffd/qwen_is_cooking_hard/) (Activity: 1292): **The image is a screenshot of Chujie Zheng teasing that Qwen is “cooking hard”, quoting an announcement that Qwen3.7 Preview is now on Arena with Qwen3.7-Max-Preview and Qwen3.7-Plus-Preview; the post claims Alibaba ranks** `#6` **in Text and** `#5` **in Vision. 
In context, the Reddit title/selftext indicate users are anticipating larger and refreshed open-weight models, especially 122B and a new 27B, though the screenshot itself is mainly a teaser rather than a technical benchmark breakdown.** ([Image](https://i.redd.it/cefjio15g12h1.png)) Commenters are split between excitement for high-end models and practical interest in smaller local models: some want **9B/4B** variants for low-end hardware, while others hope for **122B**, a better **35B**, or joke that Qwen may soon be “cooking” their GPU.\n\nSeveral commenters focused on **model-size coverage** rather than the current `27B` release, saying they cannot practically run it and are hoping for smaller **Qwen** `4B`/`9B` variants for low-end or laptop GPUs. There was also interest in larger `122B` and improved `35B` checkpoints, though one commenter noted prior `122B` mentions around Qwen 3.6 never materialized, raising uncertainty about whether a Qwen 3.7 `122B` will actually ship.\n\n[Qwen3.7 Max scored by Artificial Analysis, 27B/35B waiting room](https://www.reddit.com/r/LocalLLaMA/comments/1tie6gy/qwen37_max_scored_by_artificial_analysis_27b35b/) (Activity: 553): **A Reddit post highlights an** [Artificial Analysis leaderboard screenshot](https://preview.redd.it/42ak5qmus82h1.png?width=1133&format=png&auto=webp&s=744ea3dfc06c83d0c4d8aa128c39b3238b17d7be) where Qwen3.7 Max ranks `5th`**, roughly level with GPT 5.4 (xhigh) and slightly ahead of Gemini 3.5 Flash. The author notes Qwen3.6 27B trails its Max counterpart by exactly** `6` **points and hopes upcoming Qwen3.7 27B/35B variants land close to the Max model’s performance.** Commenters are mainly *“waiting eagerly for the open weight models”* and view the score as evidence that the **Qwen** team is now competitive with major labs, despite concerns that the Max model is not open-source. 
One technical concern raised is whether Qwen has fixed its prior tendency toward *“overthinking.”*\n\nCommenters focused on whether **Qwen3.7 Max** represents a genuine architectural update versus another finetune/iteration of the **Qwen3.5/Qwen3.6** architecture; one noted that extracting more performance from the same base architecture would still be technically notable.\n\nSeveral users are waiting for potential **open-weight 27B/35B variants**, but one commenter speculated there may be no **Qwen 3.7 27B** at all, arguing that “Qwen 3.7” could simply be a private large model similar to **Qwen 3.6 390B A30B** rather than a full public model family.\n\nA technical concern raised was whether the Qwen team has addressed the model’s reported **“overthinking”** behavior, implying interest in improvements to reasoning-token efficiency, response latency, and controllability rather than just benchmark gains.\n\n[Qwen will release another 27B with high probability](https://www.reddit.com/r/LocalLLaMA/comments/1tiwnpc/qwen_will_release_another_27b_with_high/) (Activity: 1162): **The** [image](https://i.redd.it/g5uabdvdic2h1.jpeg) is a screenshot of an X/Twitter exchange where xiong-hui (barry) chen says Qwen is **“waiting for the exact roadmap”** **but believes there is a high probability of another** `27B` **release, framed by the post title as a likely follow-up to the highly regarded Qwen 3.6 27B. 
The technical significance is speculation around Qwen continuing to optimize parameter efficiency / “intelligence density” in the mid-size dense-model range rather than only scaling to much larger MoE models.** Commenters mostly discuss local-inference practicality: some want a larger `122B-A10B` **MoE** model, while others argue that `27B` is too heavy for `16GB` VRAM users and prefer a `35B`/`A3B`-style MoE that can run on consumer gaming laptops or hybrid CPU/GPU setups.\n\nSeveral commenters discussed the **local-inference gap around 27B models**: users with `16GB VRAM` argued that a `27B` model is difficult to run at a usable quantization level, while a hypothetical **Qwen 35B MoE / A3B-style model** could be more practical via hybrid CPU/GPU inference and would remain accessible on gaming laptops.\n\nThere was interest in larger **dense Qwen variants**, especially `50B`–`80B`, with one commenter noting that **Qwen 27B is already very fast with MTP** and they would trade some generation speed for higher parameter count and potentially better quality.\n\nModel-size requests clustered around both **MoE and dense scaling paths**: proposed targets included **Qwen 3.7 122B-A10B**, `50B`–`80B` MoE, and dense `10B`, `20B`, `30B`, `50B`, or `80B` releases, reflecting demand for both high-end quality and locally runnable tiers.", "url": "https://wpnews.pro/news/ainews-openai-gpt-next-disproves-80-year-old-erdos-planar-unit-distance-problem", "canonical_source": "https://www.latent.space/p/ainews-openai-gpt-next-disproves", "published_at": "2026-05-21 07:28:36+00:00", "updated_at": "2026-05-25 00:18:52.169309+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-research", "large-language-models", "generative-ai"], "entities": ["OpenAI", "GPT-next", "GPT 5.6", "Will 
Depue", "AlphaProof", "Lean", "IMO Gold", "Latent Space"], "alternates": {"html": "https://wpnews.pro/news/ainews-openai-gpt-next-disproves-80-year-old-erdos-planar-unit-distance-problem", "markdown": "https://wpnews.pro/news/ainews-openai-gpt-next-disproves-80-year-old-erdos-planar-unit-distance-problem.md", "text": "https://wpnews.pro/news/ainews-openai-gpt-next-disproves-80-year-old-erdos-planar-unit-distance-problem.txt", "jsonld": "https://wpnews.pro/news/ainews-openai-gpt-next-disproves-80-year-old-erdos-planar-unit-distance-problem.jsonld"}}