grep -l @mlx /news/*.json | wc -l → 49

MLX

mentions: 49, type: Organization

// recent coverage 49 mentions

23:48

2026-07-23

promptcube3.com

artificial-intelligence

MoE Model Layout: A Deep Dive into Weight Reordering

A new weight reordering technique for Mixture-of-Experts (MoE) models optimizes physical file layout to minimize NVMe seek overhead, achieving massive gains in explicit-read inference engines like MLX…

18:26

2026-07-23

developer.apple.com

machine-learning

Explore distributed inference and training with MLX [video]

Apple's MLX team introduced distributed inference and training capabilities for machine learning workloads across multiple Macs at WWDC26, demonstrating how to scale large language models using RDMA o…

13:49

2026-07-23

dev.to

machine-learning

Take your benchmark to the people who can kill it

A developer's MoE model file reordering technique, mbolt, achieved +32.3% decode throughput and −26.3% time-to-first-token on a 235B-parameter model running on a 48 GB MacBook, as independently verifi…

04:00

2026-07-23

machinebrief.com

large-language-models

BaseRT: Advancing Best-in-Class LLM Inference with Apple M5 Neural Accelerators

Apple's M5 generation introduces a dedicated Neural Accelerator on every GPU core, and BaseRT, a native Metal inference runtime, exploits these units to deliver up to 6.4× higher prompt-processing thr…

21:10

2026-07-22

byteiota.com

artificial-intelligence

BaseRT: Run Local LLMs on Apple Silicon 6x Faster

A new LLM inference runtime called BaseRT achieves up to 6.4x faster local inference on Apple Silicon than llama.cpp by writing directly to Apple's Metal GPU API, skipping intermediate frameworks like…

17:56

2026-07-22

github.com

artificial-intelligence

Show HN: Cactus Hybrid: We taught Gemma 4 to know when it's wrong

Cactus released Gemma 4 E2B Hybrid, a small on-device model that outputs a confidence score (0-1) for each answer, allowing developers to route low-confidence queries to a larger model. The model matc…

03:57

2026-07-22

dev.to

machine-learning

Running Qwen3 Through the ExecuTorch MLX Delegate: Up to 4.52x Faster on M1 Max

ExecuTorch's experimental MLX delegate, released in May 2026, enables PyTorch models to run on Apple Silicon GPUs. Testing Qwen3-0.6B on an M1 Max, MLX INT4 achieved 188.9 tokens/s, 4.52x faster than …

03:02

2026-07-22

github.com

machine-learning

Running Laguna S 2.1 locally on Apple Silicon: 52 tok/s with 38.5 GB peak memory

The mlx-community/Laguna-S-2.1-oQ2e quantized model running in-process through mlx-vlm on a 128 GB Apple M5 Max achieved a perfect overall score of 1.000 across six tasks, with 40.85 generation tok/s …

14:22

2026-07-21

simonwillison.net

artificial-intelligence

Nativ: Run AI models locally on your Mac

Prince Canuma's Nativ is a new macOS desktop application that wraps Apple's MLX framework to run AI models locally, offering both a chat interface and a localhost API server. The app automatically det…

07:07

2026-07-17

github.com

artificial-intelligence

Show HN: Velora – On-device macOS dictation (Whisper and a local LLM, no cloud)

Velora, an open-source macOS dictation tool that runs entirely on-device using Apple Silicon and MLX, has been released under an MIT license. The tool uses Whisper for speech-to-text and a local LLM f…

15:01

2026-07-16

pub.towardsai.net

large-language-models

What Is Inference Engineering? The Layer Doing 80% of Your LLM Bill.

A five-engineer team on Sonnet 4.6 saw a $4,800 monthly bill for Claude Code sessions, with only $960 attributed to the model itself and $3,840 coming from inference engineering — the layer between th…

22:59

2026-07-15

eaon.dev

ai-tools

Eaon (Preview) - Private all-in-one AI super app

Eaon, a free and open-source native macOS app, lets users switch between 49 AI models from a single interface, supporting both cloud keys and fully offline local models via Ollama, llama.cpp, or MLX. …

19:04

2026-07-14

sourcefeed.dev

artificial-intelligence

Bonsai 27B Puts Real Agents on Phones

PrismML shipped Bonsai 27B, a 27B-parameter model based on Qwen3.6 27B, with a 1-bit variant packing to 3.9 GB that fits on an iPhone 17 Pro, clearing the memory gate that blocked prior builds of this…

18:02

2026-07-14

9to5mac.com

artificial-intelligence

PrismML releases Bonsai 27B, claiming first major AI model of its size fit for iPhone

PrismML released Bonsai 27B, claiming it is the first major AI model of its size that can run on an iPhone, iPad, or Mac, with a 1-bit version fitting in about 4 GB of memory. PrismML CEO Babak Hassib…

18:13

2026-07-12

dev.to

artificial-intelligence

My app was broken for four weeks and I didn't notice. So I got rid of the API.

A solo developer running brevio.news discovered that a hosted LLM API silently broke the product's weekly brief feature for four weeks after the provider retired a model ID. The developer fixed the is…

07:53

2026-07-10

dev.to

artificial-intelligence

When Your AI Coding Tool Reads Your Code, Where Does It Actually Go

Recent events, including Claude Code restricting access in China and a Chinese government security bulletin on AI coding tools, have renewed concerns about where code data goes when processed by AI as…

01:32

2026-07-10

digital-foundry-eight.vercel.app

large-language-models

I benchmarked every model that fits on an iPhone

An independent benchmark of on-device LLMs on iPhone A17 Pro found Apple's system model achieves ~149 tok/s with only 12MB peak app memory, while 4B-class open models like Qwen3 4B and Llama 3.2 3B tr…

03:30

2026-07-08

byteiota.com

large-language-models

Ollama 0.31: Gemma 4 Runs 90% Faster on Apple Silicon

Ollama 0.31 introduces multi-token prediction for Gemma 4 on Apple Silicon, cutting token generation time by 90% on Aider's coding benchmark. The update automatically accelerates local coding agents w…

19:15

2026-07-07

github.com

large-language-models

Local AI is re-reading its own prompt

A local AI assistant called Kira discovered that nearly half of its model time was spent re-reading boilerplate prompt text—a 'prefill tax'—after migrating from Ollama to Apple's MLX framework, which …

06:59

2026-07-04

maloyan.xyz

large-language-models

Running Qwen 3.6 Locally on a Mac Mini M4 with 16GB RAM

Qwen open-sourced the 35-billion parameter Mixture of Experts model Qwen 3.6-35B-A3B, which activates only 3 billion parameters per token and runs on a $599 Mac Mini M4 with 16GB RAM at 17 tok/s with …


// co-occurs with top 8 entities

llama.cpp (15), Apple (14), Apple Silicon (13), Hugging Face (11), Ollama (10), vLLM (7), Claude Code (6), Qwen (5)

// topics top 6 topics

large language models (36), ai infrastructure (34), ai tools (30), artificial intelligence (21), ai products (20), developer tools (19)