{"slug": "active-page-tackling-local-ai-for-transforming-passive-reading-into-active", "title": "Active Page: Tackling Local AI for Transforming Passive Reading into Active Recall", "summary": "Active Page is a local-first application that uses the Gemma 4 E2B model to combat the \"forgetting curve\" by automatically generating contextual quizzes from reading material. It runs entirely on the user's machine for zero operational costs and privacy, featuring a streak system to encourage daily learning habits. The app optimizes performance through techniques like prefix caching and an asynchronous pre-fetching pipeline to minimize latency and maintain reading immersion.", "body_md": "This is a submission for the Gemma 4 Challenge: Build with Gemma 4\nMost readers suffer from the \"forgetting curve.\" By the time we finish the later chapters of a dense book, the foundational concepts from the introduction have already begun to blur.\nAs a middle school student trying to learn new things from books and scientific journal articles, I wanted a better way to retain knowledge. My inspiration came from observing National Science Olympiad winners, including a friend of mine and other figures, who maintain peak retention not through passive rereading but by consistently answering a large number of questions every day.\nActive Page is a local-first application that transforms passive reading into an interactive learning experience. It automatically generates high-quality, analytical, and contextual quizzes directly from your reading material for immediate memory reinforcement. To help users build a sustainable learning habit, Active Page also features a built-in streak system to keep readers motivated daily. 🔥🔥\nBecause Active Page runs locally, its operational costs are zero (aside from the use of the device), with the side benefit that you can read books without an internet connection. 
While local compute constraints often drive developers toward over-engineering, Active Page takes a more elegant path.\nActive Page is a privacy-first, local-LLM-powered reading companion designed to solve the \"forgetting curve.\" By leveraging the cutting-edge Gemma 4 E2B model, it transforms passive reading into an interactive learning session through real-time, contextual active recall, running entirely on your machine.\nThe init.sh script automates the heavy lifting: it manages dependencies via uv, compiles llama.cpp for your specific hardware, and pulls the optimized Gemma 4 E2B weights.\nbash init.sh\nNote for Apple Silicon/AMD: If using Apple M-Series or AMD GPUs, edit init.sh to enable GGML_METAL=ON or GGML_HIPBLAS=ON respectively for hardware acceleration.\nLaunch the inference engine and the interactive web interface simultaneously:\nbash run.sh\nAccess the application at: http://localhost:8000\nSystem Crashing / Out of Memory in init.sh: If your RAM or CPU is limited, adjust the parallelism of the build…\nI selected the Gemma-4-E2B model because it balances performance and efficiency well for local deployment. It leverages Per-Layer Embeddings (PLE) and a hybrid attention mechanism combining Sliding Window Attention (SWA) with Grouped Query Attention (GQA). This architecture gives it a 128K context window and output quality that rivals much larger models, while remaining lightweight and fast enough for edge devices.\nBeyond simply powering the app, Gemma-4-E2B's design unlocked sophisticated long-context capabilities on-device. Its compact size leaves room for aggressive KV cache manipulation, which is essential for maintaining a seamless, responsive reading experience with active recall across extended contexts.\nThe \"memory\" of an AI (KV Cache) is usually treated as a linear path. 
In most apps, the book data is treated as a fresh prompt every time, which is slow and memory-intensive.\nI inverted this structure to maximize Prefix Caching:\nTo tackle memory constraints and decode speed, we use this technique, which also comes from Google.\nEven with an optimized KV cache, generating a multiple-choice question (MCQ) quiz requires a short processing window. Forcing a reader to wait on a loading spinner when a quiz triggers would break their reading immersion.\nActive Page hides local execution latency from the reader by decoupling the generation engine from the UI through an Asynchronous Pre-Fetching Pipeline:", "url": "https://wpnews.pro/news/active-page-tackling-local-ai-for-transforming-passive-reading-into-active", "canonical_source": "https://dev.to/muhammad_dafi_5eebbcb5d63/active-page-tackling-local-ai-for-transforming-passive-reading-into-active-recall-4hoj", "published_at": "2026-05-24 06:35:06+00:00", "updated_at": "2026-05-24 07:18:46.457869+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "products", "research", "open-source"], "entities": ["Gemma 4", "Active Page", "National Science Olympiad"], "alternates": {"html": "https://wpnews.pro/news/active-page-tackling-local-ai-for-transforming-passive-reading-into-active", "markdown": "https://wpnews.pro/news/active-page-tackling-local-ai-for-transforming-passive-reading-into-active.md", "text": "https://wpnews.pro/news/active-page-tackling-local-ai-for-transforming-passive-reading-into-active.txt", "jsonld": "https://wpnews.pro/news/active-page-tackling-local-ai-for-transforming-passive-reading-into-active.jsonld"}}