VRAM

mentions: 5 · type: Organization · feed: RSS

// recent coverage 5 mentions

03:34

2026-07-08

dev.to

large-language-models

CPU vs GPU: Why Large Language Models Need GPUs — What Really Happens After You Press Enter?

A developer explains the technical journey from pressing Enter to receiving a response from a large language model (LLM) like ChatGPT, Gemini, or Claude. The process involves tokenization, embedding, …

04:34

2026-06-27

news.ycombinator.com

artificial-intelligence

Ask HN: How much memory is useable by GPU in MacBook?

MacBook GPU memory allocation depends on total system memory: machines with less than 36GB let the GPU use up to 66% of it, while machines with 36GB or more allow 75%. Users can raise the limit via Terminal to fit larger AI models. Shared memory el…
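As a rough illustration of the split described in this summary, the arithmetic can be sketched as below. The function name `default_gpu_memory_gb` is hypothetical, and the 66%/75% fractions and the 36GB threshold are taken from the summary only; they are not checked against Apple documentation, and real limits may differ by OS version.

```python
def default_gpu_memory_gb(total_gb: float) -> float:
    """Approximate GPU-usable share of unified memory, per the summary's figures."""
    # Assumption from the summary: 75% at 36GB or more, 66% below that.
    fraction = 0.75 if total_gb >= 36 else 0.66
    return total_gb * fraction


if __name__ == "__main__":
    for total in (16, 32, 36, 64):
        usable = default_gpu_memory_gb(total)
        print(f"{total} GB total -> ~{usable:.1f} GB usable by GPU")
```

Under these assumed figures, a 64GB machine would expose about 48GB to the GPU by default, which is the number to compare against a model's weight and KV-cache footprint.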

00:18

2026-06-19

dev.to

ai-agents

How I Run a 50-Agent AI Workforce on a Single 6GB GPU

A developer describes running ~50 local AI agents on a single 6GB GPU by using a lock-based queue, an eviction monitor, a resource governor, and a model router. The system serializes GPU access so onl…
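The one idea the summary states outright, serializing GPU access behind a lock, might look roughly like the sketch below. It is not the article's implementation: `run_agents` and its internals are hypothetical, the "inference" step is a stand-in, and the eviction monitor, resource governor, and model router are not modelled.

```python
import threading


def run_agents(n_agents: int = 50):
    """Run n_agents threads that all contend for one simulated GPU."""
    gpu_lock = threading.Lock()          # hypothetical: guards the single GPU
    state = {"active": 0, "peak": 0}     # tracks concurrent GPU holders
    completed = []

    def agent(agent_id: int) -> None:
        # Each agent waits here until the GPU is free.
        with gpu_lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
            completed.append(agent_id)   # stand-in for model inference
            state["active"] -= 1

    threads = [threading.Thread(target=agent, args=(i,)) for i in range(n_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return completed, state["peak"]


if __name__ == "__main__":
    done, peak = run_agents(50)
    print(f"{len(done)} agents finished, peak concurrent GPU holders: {peak}")
```

Because every agent does its GPU work inside the lock, the peak number of simultaneous holders stays at one regardless of how many agents are queued.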

18:57

2026-06-16

injuly.in

large-language-models

Inference cost at scale with napkin math

A technical analysis calculates the dollar cost per user for serving large language models at scale using napkin math, breaking down GPU resources, matrix multiplication costs, and attention mechanism…

08:34

2026-05-23

dev.to

large-language-models

We Replaced Our RAG Pipeline With Persistent KV Cache. Here's What We Found.

The authors replaced their RAG pipeline with a persistent KV cache system, which stores the full document's attention state after a single prefill and reuses it for every query. …

// co-occurs with top 8 entities

GPU (4), LLM (3), CPU (2), FP-8 (1), FP-16 (1), SRAM (1), KV-Cache (1), RoPE (1)