# Active Page: Tackling Local AI for Transforming Passive Reading into Active Recall

> Source: <https://dev.to/muhammad_dafi_5eebbcb5d63/active-page-tackling-local-ai-for-transforming-passive-reading-into-active-recall-4hoj>
> Published: 2026-05-24 06:35:06+00:00

This is a submission for the Gemma 4 Challenge: Build with Gemma 4
Most readers suffer from the "forgetting curve." By the time we finish the later chapters of a dense book, the foundational concepts from the introduction have already begun to blur.
As a middle school student learning new topics from books and scientific journal articles, I wanted a better way to retain knowledge. My inspiration came from observing National Science Olympiad winners, as well as my friend and other figures, who maintain peak retention not through passive rereading but by consistently answering many questions every day.
Active Page is a local-first application that transforms passive reading into an interactive learning experience. It automatically generates high-quality, analytical, and contextual quizzes directly from your reading material for immediate memory reinforcement. To help users build a sustainable learning habit, Active Page also features a built-in streak mechanics system to keep readers motivated daily. 🔥🔥
Because Active Page runs locally, its operating cost is zero (aside from the use of the device), with the side benefit of reading books without an internet connection. While local compute constraints often drive developers toward over-engineering, Active Page takes a more elegant path.
Active Page is a privacy-first, local-LLM-powered reading companion designed to solve the "forgetting curve." By leveraging the cutting-edge Gemma 4 E2B model, it transforms passive reading into an interactive learning session through real-time, contextual active recall—running entirely on your machine.
The init.sh script automates the heavy lifting: it manages dependencies via uv, compiles llama.cpp for your specific hardware, and pulls the optimized Gemma 4 E2B weights.
bash init.sh
Note for Apple Silicon/AMD: If you are using an Apple M-series chip or an AMD GPU, edit init.sh to enable GGML_METAL=ON or GGML_HIPBLAS=ON, respectively, for hardware acceleration.
Launch the inference engine and the interactive web interface simultaneously:
bash run.sh
Access the application at: http://localhost:8000
System Crashing / Out of Memory in init.sh: If your RAM or CPU is limited, adjust the parallelism of the build…
I selected the Gemma-4-E2B model because it balances performance and efficiency well for local deployment. It leverages Per-Layer Embeddings (PLE) and a hybrid attention mechanism combining Sliding Window Attention (SWA) with Grouped Query Attention (GQA). This architecture gives it a 128K context window and output quality that rivals much larger models, while remaining lightweight and fast enough for edge devices.
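To make the sliding-window idea concrete, here is a minimal sketch of a causal sliding-window attention mask compared with a full causal mask. It only illustrates the general technique: the sequence length and window size are tiny, made-up values, and the function names are hypothetical rather than anything from Gemma or llama.cpp.

```python
def full_causal_mask(seq_len: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal mask where each query only sees the most recent `window` keys."""
    return [
        [0 <= q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


def keys_attended(mask: list[list[bool]]) -> list[int]:
    """Number of key positions each query attends to."""
    return [sum(row) for row in mask]


# Illustrative sizes only (not the model's real configuration).
full = keys_attended(full_causal_mask(6))
local = keys_attended(sliding_window_mask(6, window=3))
print(full)   # grows with position
print(local)  # capped at the window size
```

In a sliding-window layer the number of keys each token needs is capped by the window, so the cache for that layer stops growing once the window is full, whereas a full-attention layer keeps every earlier position.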
Beyond simply powering the app, Gemma-4-E2B's design unlocked sophisticated long-context capabilities on-device. Its compact size makes aggressive KV cache reuse and manipulation practical, which is essential for maintaining a seamless, responsive reading experience with active recall across extended contexts.
The "memory" of an AI (KV Cache) is usually treated as a linear path. In most apps, the book data is treated as a fresh prompt every time, which is slow and memory-intensive.
I inverted this structure to maximize Prefix Caching:
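The details of the inverted layout are not reproduced here, so the following is only a rough illustration of why prompt ordering matters for prefix caching. It assumes the stable book text is placed first and the changing instruction last; the whitespace "tokenizer", the sample strings, and all function names are hypothetical stand-ins.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count leading tokens that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def instruction_first(book: str, instruction: str) -> list[str]:
    """Changing instruction up front, stable book text after it."""
    return instruction.split() + book.split()


def book_first(book: str, instruction: str) -> list[str]:
    """Stable book text up front, changing instruction at the end."""
    return book.split() + instruction.split()


# Toy data: whitespace splitting stands in for a real tokenizer.
BOOK = "chapter one text " * 50
QUIZ = "Write a quiz about page 3"
SUMMARY = "Summarize page 3"

reuse_instruction_first = shared_prefix_len(
    instruction_first(BOOK, QUIZ), instruction_first(BOOK, SUMMARY)
)
reuse_book_first = shared_prefix_len(
    book_first(BOOK, QUIZ), book_first(BOOK, SUMMARY)
)
print(reuse_instruction_first, reuse_book_first)
```

A prefix cache can only reuse work for the tokens two requests share from the very start, so under these assumptions the book-first layout lets every request skip reprocessing the whole book, while the instruction-first layout shares nothing.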
To tackle memory constraints and decode speed, we use this technique, which also comes from Google.
Even with an optimized KV cache, generating a multiple-choice question (MCQ) quiz takes a short processing window. Forcing a reader to wait at a loading spinner when a quiz triggers would break their reading immersion.
Active Page hides local execution latency from the reader by decoupling the generation engine from the UI through an Asynchronous Pre-Fetching Pipeline:
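As a minimal sketch of what such a pre-fetching pipeline could look like, the example below starts quiz generation in the background as soon as a page is opened, so the quiz is usually ready before the reader needs it. This is not the project's actual code: `fake_generate_quiz`, `QuizPrefetcher`, the lookahead of two pages, and the sleep durations are all hypothetical placeholders for the real model call and UI events.

```python
import asyncio


async def fake_generate_quiz(page: int) -> str:
    """Hypothetical stand-in for the slow local model call."""
    await asyncio.sleep(0.01)
    return f"quiz for page {page}"


class QuizPrefetcher:
    """Starts quiz generation ahead of the reader and caches the tasks."""

    def __init__(self, generate, lookahead: int = 2):
        self._generate = generate
        self._lookahead = lookahead
        self._tasks: dict[int, asyncio.Task] = {}

    def on_page_opened(self, page: int) -> None:
        """Schedule generation for this page and the next few, without blocking."""
        for p in range(page, page + self._lookahead + 1):
            if p not in self._tasks:
                self._tasks[p] = asyncio.create_task(self._generate(p))

    def is_ready(self, page: int) -> bool:
        """True when the quiz for this page has already been generated."""
        task = self._tasks.get(page)
        return task is not None and task.done()

    async def get_quiz(self, page: int) -> str:
        """Return the quiz, waiting only if generation has not finished yet."""
        self.on_page_opened(page)
        return await self._tasks[page]


async def demo() -> tuple[bool, str]:
    prefetcher = QuizPrefetcher(fake_generate_quiz)
    prefetcher.on_page_opened(1)   # reader opens page 1; pages 1 to 3 start generating
    await asyncio.sleep(0.2)       # simulated reading time
    ready = prefetcher.is_ready(2)
    quiz = await prefetcher.get_quiz(2)
    return ready, quiz


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With a single local model, the background tasks would in practice be queued rather than truly parallel, but the UI-facing idea is the same: the reader's request awaits a task that was started earlier instead of triggering generation on demand.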