Security Checks with Local LLMs

An experiment using local Large Language Models (LLMs) to perform security checks and code quality reviews, motivated by rising costs and new limits for cloud-based LLM APIs. The author selected the `qwen2.5-coder:14b-instruct-q5_K_M` model via Ollama, running on a MacBook Air M5 with 24GB RAM, and created a custom bash script to automate file scanning with configurable prompts and cooldown delays. The article concludes that a 32k context window provides a good balance between execution speed and hardware temperature, with plans for further experimentation.

Continuing the articles AI-Powered Repository Security Check with Antigravity Workflow (https://dev.to/gdg/ai-powered-repository-security-check-with-antigravity-workflow-5hee) and https://dev.to/gdg/how-to-build-a-custom-ai-quality-gate-on-cloud-run-from-zero-to-production-1odp, I've decided to try outsourcing some checks to a local LLM. This article describes my experiment and its outcomes. I will be glad to read your questions, proposals, opinions or advice 🙌

You can also listen to a podcast generated from this publication with NotebookLM.

Intro

Recent changes in limit management for popular LLM APIs made me think about FinOps. Why should I spend expensive cloud tokens on simple tasks? I also had a lot of talks at recent security and AI events, which led me to start experimenting with local LLMs for code generation and code quality checks.

Hardware

The hardware for the experiments is a MacBook Air M5 with 24GB RAM. I bought it specifically for diving into ML topics, but it has been underused until now.

Pains

The first pain was the introduction of new limits for the Antigravity IDE. Together with changes to the list of available models, it made me think about optimizing my development and security flows, which were designed to use cheaper Antigravity tokens before falling back to more expensive Vertex AI tokens.
The second pain was FOMO about Machine Learning and MLOps itself.

Solution Track

After some iterations with Ollama and local models, I selected `qwen2.5-coder:14b-instruct-q5_K_M` as the base model, with an optimized context window:

    % cat Modelfile-qwen-32k
    FROM qwen2.5-coder:14b-instruct-q5_K_M
    PARAMETER num_ctx 32000

    % ollama create qwen-coder-32k -f ./Modelfile-qwen-32k
    ...
    % ollama list
    NAME                                 ID              SIZE      MODIFIED
    qwen-coder-32k:latest                dc3c4762d967    10 GB     2 hours ago
    qwen-coder-64k:latest                42f060e717dd    10 GB     2 hours ago
    qwen2.5-coder:14b-instruct-q5_K_M    05d16c5ac1c1    10 GB     2 hours ago
    gemma4:e4b                           c6eb396dbd59    9.6 GB    25 hours ago
    gemma4:e2b                           7fbdbf8f5e45    7.2 GB    25 hours ago

The 32k window gave me reasonably quick execution and an acceptable trade-off between speed and the temperature of my laptop. I expect to keep experimenting with this configuration in the near future.

Then I realized that I have to decompose the tasks and give my hardware some rest time between requests. So the unified script was born:

    #!/bin/bash

    # Default values
    OUTPUT_DIR="."
    MODEL_NAME="qwen-coder-32k"
    COEFF=2
    PROMPT_FILE=""

    show_help() {
        echo "Usage: $0 -d