{"slug": "llm-insights-local-demo-for-people-comments-and-ideas", "title": "LLM-Insights, local demo for people comments and ideas", "summary": "LLM-Insights released a local-first testing and optimization tool for iterative content creation that runs multi-model A/B tests, refines prompts using rubric-based grading, and generates scored synthetic data on user hardware. The system sends prompts to competing LLM models, grades responses against configurable rubrics, and automatically rewrites prompts using grader feedback to improve results across cycles. The tool operates locally by default via Ollama with optional cloud API support, ensuring no data leaves the machine unless users choose a cloud provider.", "body_md": "A local-first testing and optimization harness for iterative content creation — run multi-model A/B tests, refine prompts automatically with rubric-based grading, and generate scored synthetic data. Built for brand content workflows, prompt engineering, and LLM evaluation on your own hardware.\n\nYou write a prompt — a piece of brand copy, a product description, a creative brief, or any content task. The tool sends it to two competing LLM models, grades both answers against a configurable rubric, optionally rewrites the prompt using grader feedback, and repeats the cycle — keeping the best answer each round. Every variable is controlled from the UI: which models compete, what the rubric measures, how categories are weighted, and when the loop stops.\n\nThe pipeline runs locally by default using Ollama, with optional cloud API support (Mistral, Google Gemini) for hybrid setups. 
No data leaves your machine unless you choose a cloud provider.\n\nEach run produces a structured record of prompts, answers, scores, token counts, and model metadata — useful as refined synthetic data, prompt optimization logs, or content quality benchmarks.\n\n- **Custom Grading Rubrics** — Define up to 8 grading categories, each with its own free-text rubric description, dedicated grader model, and weight. The default rubric covers accuracy, clarity, conciseness, creativity, and structure. Save named configurations and switch between them at any time.\n- **Automatic Prompt Optimization** — The system rewrites your prompt after each iteration using grader feedback, category weights, and best answers as context. While preserving the original intent, it automatically applies techniques including Zero-Shot Prompting, Few-Shot Prompting, Chain-of-Thought (CoT), Self-Consistency, Least-to-Most Prompting, Tree of Thoughts (ToT), Directional Stimulus Prompting, Role Prompting, Generated Knowledge Prompting, Chain-of-Verification (CoVe), and Skeleton-of-Thought.\n- **Multi-Model A/B Testing** — Assign different models to each answering slot and compare their outputs head-to-head. The Advanced panel supports per-iteration model assignments for systematic cross-model comparisons.\n- **Parallel Multi-Category Grading** — Layer 3 grades each category in parallel using a thread pool, grouped by grader model. Failed graders fall back to a default score without stopping the pipeline. Retries with backoff are built in.\n- **Synthetic Data Generation** — Every run produces structured (prompt, answer, multi-dimensional scores) tuples and (original prompt, improved prompt) pairs. The JSONL ledger records prompts, replies, models, scores, and token counts. Multi-prompt sessions chain the best answer from the previous prompt as context into the next.\n- **Token Tracking** — Input, output, and total token counts are recorded per model per layer per iteration and aggregated by provider. 
Token usage is visible in the deeper analysis charts.\n- **Tie Detection** — When multiple iterations produce the same best score, the system identifies tied answers, deduplicates by text similarity, and reports alternatives.\n- **Session Review and Analysis** — Browse, load, and analyze past runs with per-prompt iteration stats, score grids, and an in-depth analysis modal featuring average grade bar charts, radar overlays, per-category score breakdowns, token usage charts, runtime comparisons, and adjustable weight sliders for live what-if recalculation.\n- **REST API** — All UI actions are backed by JSON endpoints (`/iteration`, `/is-processing`, `/get_backup_data`, `/update_weights`, `/save_advanced_models`, `/grader_settings`, `/grader_setting/<name>`, and more). Programmatic access to model selection, weight management, grader configuration, and session backup is available out of the box.\n\n```mermaid\nflowchart LR\n    L0[\"Layer 0\\n(brainstorm, optional)\"] --> loop\n    subgraph loop [\"Repeat × N iterations\"]\n        direction LR\n        L1A[\"Layer 1A\\n(answer)\"] --> G1[\"Layer 3\\n(grade)\"]\n        G1 --> L2[\"Layer 2\\n(rewrite prompt)\"]\n        L2 --> L1B[\"Layer 1B\\n(answer)\"]\n        L1B --> G2[\"Layer 3\\n(grade)\"]\n        G2 --> W[\"Pick winner\"]\n    end\n```\n\n| Layer | Role |\n|---|---|\n| Layer 0 | Optional brainstorming step. Generates concise alternative ideas or directions before the loop begins. |\n| Layer 1A / 1B | Two competing answer models. Each produces a full response to the prompt (or improved prompt). |\n| Layer 2 | Prompt improver. Rewrites the prompt using grader feedback, best answers, and micro-replies as context. |\n| Layer 3 | Multi-category grader. Each category is evaluated independently by its own small LLM in parallel; scores are then weighted and combined. Failed categories receive a default score so the pipeline continues. 
|\n\nThe loop ends when the first of these is met:\n\n- The best score reaches the target grade (default: 100).\n- Degradation break is enabled and the score drops from the previous iteration.\n- The maximum number of iterations is reached (default: 5).\n\n| Page | Path | Purpose |\n|---|---|---|\n| Login | `/login` | Simple authentication with an animated background |\n| Main Analysis | `/` | Run experiments, configure models and toggles, view live results and charts |\n| Config Graders | `/config_graders` | Create and edit grading rubrics — categories, rubric text, grader models, weights |\n| Review History | `/review_chats` | Browse saved runs, load or delete past sessions, open the deeper analysis modal |\n\n| Control | Purpose |\n|---|---|\n| Layer 0 Model (Ideas) | Selects the brainstorming model that runs before the loop |\n| Answer Model 1 (Layer 1A) | First answer model in each iteration |\n| Answer Model 2 (Layer 1B) | Second answer model in each iteration |\n| Prompt Improver (Layer 2) | Model that rewrites prompts using grader feedback |\n| Advanced Panel | Per-iteration model assignment for Layers 1A, 1B, and 2. 
Locks main selectors when saved |\n| System Profile | Filters model dropdowns by hardware tier (Simple / Medium / Powerful) — browser-side only, groups models by parameter size |\n\n| Control | Purpose |\n|---|---|\n| Advise Models by Domain | Filters model dropdowns to show only models suited to a task domain: Coding, Creative, Science, Experimental, or Balanced |\n| Weight Preset | Applies a predefined weight profile across grading categories (Balanced, Accuracy, Creativity, Conciseness) |\n| Break Target Grade | Stop the loop when this score is reached (1–100) |\n| Iterations | Maximum refinement rounds per prompt (1–5) |\n| Degradation Break | Stop if the score drops from the previous iteration |\n| Change Prompt | Enable or disable Layer 2 prompt rewriting |\n| Give Ideas | Enable or disable Layer 0 brainstorming |\n| Last Best Answer Retention | Feed the best answer from the previous iteration as context into the next |\n| Grade vs. Current / First Prompt | Choose whether graders judge the answer against the current or the first prompt in the session |\n\n| Control | Purpose |\n|---|---|\n| Weight Inputs | Adjust category weights (auto-normalized). 
Apply and Reset buttons |\n| Grader Setting Selector | Switch between saved grading rubrics |\n| Config Graders Link | Opens the rubric editor page |\n\n| Button | Purpose |\n|---|---|\n| START ANALYSIS | Runs the iterative analysis loop |\n| Clear Chat | Backs up and resets all runtime state |\n| Upload Chat | Imports a previously exported JSON backup |\n| Download Chat | Exports the session as a human-readable text log or a full restorable JSON backup |\n| Review History | Opens the Review page |\n\n| Control | Purpose |\n|---|---|\n| Load Setting | Select and load an existing grading rubric |\n| Edit / Cancel | Toggle edit mode for the grading keys table |\n| Key Name | Category name (auto-lowercased, spaces converted to underscores) |\n| Rubric | Free-text description of scoring criteria |\n| Grader Model | Select which small LLM evaluates this category |\n| Weight % | How much this category counts toward the overall score |\n| Add / Remove Keys | Add a row (max 8) or remove an existing one |\n| Weight Total Indicator | Live sum — green at 100%, red otherwise |\n| Save Setting | Persist the configuration (blocked if incomplete or named `default`) |\n\n| Control | Purpose |\n|---|---|\n| Chat List | Browse all saved backups, newest first |\n| Prompt Summary | Scores, categories, models, and iterations for each prompt |\n| Iteration Cards | Layer 1A vs. 
1B detail with winner, model, and runtime |\n| Analyze Deeper | Modal with average grade bar chart, radar overlay, per-category score breakdowns, token usage breakdown by provider, runtime comparison chart, adjustable weight sliders with live score recalculation for what-if analysis, weight reset, and the grader setting name from the original run |\n| Load This Chat | Restore a backup into the active session |\n| Delete Chat | Remove a backup file permanently |\n| Upload | Import and restore a JSON backup |\n\nCalls are routed automatically based on the model name:\n\n| Provider | Models | Transport |\n|---|---|---|\n| Ollama | All models not listed below (local inference) | `ollama.chat()`, threaded with timeout |\n| Mistral API | `mistral-small-2506`, `voxtral-mini-2507`, `open-mistral-nemo-2407` | REST with retry, exponential backoff, and rate-limit handling |\n| Google Gemini API | `gemini-2.5-flash`, `gemini-2.5-pro` | REST with retry, constant backoff, and rate-limit handling |\n| GLM-4 (HuggingFace) | `glm-4-9b`, `glm-4-9b-chat` | Local `transformers`, cached, preloaded at startup |\n\n28 preconfigured models are available across layers, including gemma, granite, llama, qwen, deepseek-r1, deepseek-coder-v2, falcon3, phi4, devstral, solar, codellama, dolphin3, olmo2, starcoder2, and gpt-oss. 
All API calls are routed automatically based on the model name and include timeout handling; failures in any layer are caught gracefully so the pipeline continues.\n\n- **Python 3.10+**\n- [Ollama](https://ollama.com/) installed and running (required for local model inference unless you configure cloud-only providers)\n- A `.env` file with your credentials (see below)\n\n```\ngit clone https://github.com/yuvhaim-gif/LLM_InSight.git\ncd LLM_InSight\npython -m venv venv\nsource venv/bin/activate   # Linux / macOS\nvenv\\Scripts\\activate      # Windows\npip install -r requirements.txt\ncp .env.example .env       # then edit .env with your credentials\n```\n\nCopy `.env.example` to `.env` and fill in your values.\n\n| Variable | Required | Purpose |\n|---|---|---|\n| `APP_USER` | Yes | Login username |\n| `APP_PASS` | Yes | Login password |\n| `FLASK_SECRET` | Yes | Flask session secret (any random string) |\n| `MISTRAL_API_KEY` | No | Enables Mistral models. If omitted, those models return errors when called |\n| `GOOGLE_API_KEY` | No | Enables Google Gemini models. If omitted, those models return errors when called |\n| `LANGCHAIN_API_KEY` | No | Enables LangSmith tracing |\n| `LANGCHAIN_PROJECT` | No | LangSmith project name (default: `llminsight`) |\n| `PORT` | No | Server port (default: `5000`) |\n| `SSL_CERT_PATH` / `SSL_KEY_PATH` | No | SSL certificate and key file paths |\n\n**Minimal .env** (Ollama-only, no cloud APIs):\n\n```\nAPP_USER=admin\nAPP_PASS=changeme\nFLASK_SECRET=changeme\n```\n\nThe app runs with just these three variables. Missing optional keys are noted at startup; models routed to a provider without a key return error responses, but the app itself continues to work normally.\n\nImportant: The default Layer 2 (prompt improver) model is `open-mistral-nemo-2407`, which requires `MISTRAL_API_KEY`. 
If you are running Ollama-only without a Mistral key, either disable the Change Prompt toggle in the UI or change `DEFAULT_LAYER2_MODEL` in `config.py` to an Ollama model (e.g., `gemma2:9b`).\n\nPull the default models used by each layer (skip any you don't plan to use):\n\n```\nollama pull gemma:7b-instruct-q4_K_M   # Layer 1A default\nollama pull granite4:latest              # Layer 1B default\nollama pull gemma2:9b                    # Layer 0 default\nollama pull phi3:mini                    # Layer 3 grader (accuracy)\nollama pull gemma2:2b                    # Layer 3 grader (clarity)\nollama pull qwen2.5:1.5b                # Layer 3 grader (conciseness, structure)\nollama pull llama3.2:3b                  # Layer 3 grader (creativity)\n```\n\nThe full list of preconfigured models is in `config.py`.\n\n```\npython main.py\n```\n\nOpen `http://localhost:5000` and sign in with the credentials from your `.env` file.\n\nIf you only want to use a subset of providers, leave the corresponding API key out of `.env`:\n\n- **No Mistral**: omit `MISTRAL_API_KEY`. Avoid selecting Mistral models in the UI and update `DEFAULT_LAYER2_MODEL` in `config.py` to an Ollama or Gemini model.\n- **No Google Gemini**: omit `GOOGLE_API_KEY`. Avoid selecting Gemini models in the UI.\n- **No LangSmith**: omit `LANGCHAIN_API_KEY`. Tracing fails silently; the app works normally.\n- **No GLM-4**: remove `glm-4-9b` and `glm-4-9b-chat` from the model lists in `config.py`. 
Optionally remove `transformers` and `torch` from `requirements.txt` to save disk space.\n- **No Ollama**: remove Ollama-only models from the model lists in `config.py`, remove `ollama` from `requirements.txt`, and update the default model constants (`DEFAULT_LAYER1A_MODEL`, `DEFAULT_LAYER1B_MODEL`, `DEFAULT_LAYER0_MODEL`, `LAYER3_GRADER_MODELS`).\n\nTo add an Ollama model:\n\n- Pull it: `ollama pull your-model-name`\n- Add the model name to the appropriate list(s) in `config.py`\n- It appears in the UI dropdowns immediately\n\nTo add a cloud API provider:\n\n- Add your API key to `.env` and load it in `secrets_config.py`\n- Add a routing check and call function in `ai/api_calls.py` (follow the existing Gemini/Mistral pattern)\n- Add the model names to the lists in `config.py`\n\nTo add a grader model, add the model name to `AVAILABLE_GRADER_MODELS` in `config.py`. The model must be available via Ollama. It will appear in the grader model dropdown on the Config Graders page.\n\nTo change the default models, edit the `DEFAULT_*` constants in `config.py` (`DEFAULT_LAYER1A_MODEL`, `DEFAULT_LAYER1B_MODEL`, `DEFAULT_LAYER0_MODEL`, `DEFAULT_LAYER2_MODEL`, `LAYER3_GRADER_MODELS`).\n\n```\npip install -r requirements-dev.txt\npytest tests/ -v --tb=short\n```\n\nThe contract tests validate backup schema, restore behavior, advanced model map compatibility, auth matrix, and provider routing. 
Tests use monkeypatched temp directories and an isolated SQLite database — no production files are touched, no AI models are called, and no `.env` file is required.\n\n- **Session state**: authentication, selected models, weights, toggles, prompt history, advanced model maps, and the active grader setting name are stored in the server-side session and a SQLite database.\n- **Runtime files**: `ledger.jsonl` (append-only event log), `iteration_history.json`, `best_best_layer1.json`, `console_output.txt`, and `graderdata/` (JSONL grader settings).\n- **Browser storage**: `localStorage` (domain filter, weight preset, system type) and `sessionStorage` (review-to-main handoff).\n- **Lifecycle**: startup, login, clear-chat, logout, exit, window close, and process signals each trigger backups of runtime files before clearing them.\n- **JSON export** (version 2.0): captures console output, prompt history, iteration history, best-best cache, ledger entries, and full session state. Restorable via upload or the Review page.\n\nLangSmith/LangChain tracing is available on the orchestrating iterative loop and every individual AI layer (Layer 0, Layer 1, Layer 2, Layer 3) via `@traceable` decorators. Set `LANGCHAIN_API_KEY` in `.env` to enable it. 
If the key is missing or invalid, tracing is disabled and the app continues to function normally.\n\n| Path | Purpose |\n|---|---|\n| `main.py` | Application entry point |\n| `config.py` | Models, paths, default weights |\n| `secrets_config.py` | Credentials loaded from `.env` |\n| `graderdata/` | JSONL grader setting files |\n| `routes/` | `web_routes.py`, `api_routes.py`, `review_routes.py` |\n| `ai/` | `iterative_loop.py`, `iteration_summary.py`, `layer0.py`, `layer1.py`, `layer2.py`, `layer3.py`, `api_calls.py` |\n| `models.py` | Pydantic schemas (`Layer2Response`, `Layer2Critique`) |\n| `utils/` | `session.py`, `session_keys.py`, `file_io.py`, `common.py`, `text_processing.py`, `validation.py`, `grader_settings.py` |\n| `state.py`, `db.py` | Hybrid state management (SQLite + in-memory) |\n| `templates/` | Jinja2 templates (login, main, review, config_graders) with shared partials |\n| `static/` | CSS, JavaScript, and assets |\n| `tests/` | Pytest contract tests |\n\n| Package | Purpose |\n|---|---|\n| [Pydantic](https://docs.pydantic.dev/) | Response schemas (`models.py`) |\n| [ollama](https://github.com/ollama/ollama-python) | Local model inference via `ollama.chat()` |\n| [requests](https://requests.readthedocs.io/) | REST calls to the Mistral and Gemini APIs |\n| [python-dotenv](https://github.com/theskumar/python-dotenv) | Loads `.env` into environment variables |\n| [transformers](https://huggingface.co/docs/transformers/) | Local GLM-4 inference |\n| [torch](https://pytorch.org/) | Required by `transformers` for GLM-4 |\n| [Chart.js](https://www.chartjs.org/) + [chartjs-plugin-datalabels](https://chartjs-plugin-datalabels.netlify.app/) | Analysis charts |\n| [langsmith](https://docs.smith.langchain.com/) | Optional tracing via `@traceable` |\n| [pytest](https://docs.pytest.org/) | Contract tests |\n\n- [Architecture](/yuvhaim-gif/LLM_InSight/blob/main/docs/ARCHITECTURE.md) — system design and component layout\n- [Implementation](/yuvhaim-gif/LLM_InSight/blob/main/docs/IMPLEMENTATION.md) — route contracts, JSON schemas, layer behavior\n- [Refactoring Notes](/yuvhaim-gif/LLM_InSight/blob/main/docs/REFACTORING.md) — maintenance guidance and implementation notes\n- [User Guide](/yuvhaim-gif/LLM_InSight/blob/main/docs/user%20guide.md) — end-user walkthrough\n\nContributions are 
welcome. If you'd like to help improve LLM InSights, please open an issue to discuss your idea before submitting a pull request. Bug reports, feature suggestions, and documentation improvements are all appreciated.\n\nThis project is released under the [MIT License](/yuvhaim-gif/LLM_InSight/blob/main/LICENSE).", "url": "https://wpnews.pro/news/llm-insights-local-demo-for-people-comments-and-ideas", "canonical_source": "https://github.com/yuvhaim-gif/LLM_InSight", "published_at": "2026-05-25 14:58:18+00:00", "updated_at": "2026-05-25 15:15:10.031318+00:00", "lang": "en", "topics": ["large-language-models", "generative-ai", "ai-tools", "natural-language-processing", "ai-products"], "entities": ["Ollama", "Mistral", "Google Gemini"], "alternates": {"html": "https://wpnews.pro/news/llm-insights-local-demo-for-people-comments-and-ideas", "markdown": "https://wpnews.pro/news/llm-insights-local-demo-for-people-comments-and-ideas.md", "text": "https://wpnews.pro/news/llm-insights-local-demo-for-people-comments-and-ideas.txt", "jsonld": "https://wpnews.pro/news/llm-insights-local-demo-for-people-comments-and-ideas.jsonld"}}