Writing post-mortem root-cause summaries is time-consuming and inconsistent. Junior SREs miss contributing factors. Senior SREs write summaries that vary in depth and structure. Zero-shot LLMs produce verbose, generic output that does not follow SRE conventions.
Fine-tuning a small model on real incident data produces structured, concise summaries that follow your organisation's format at a fraction of the cost of a large model.
Different approaches and what you get:
Manual SRE writing: inconsistent, time-consuming, expertise-dependent
Zero-shot large model: generic format, verbose, high cost per call
Fine-tuned Qwen2.5-0.5B: SRE-format outputs, fast, cheap, runs on CPU or consumer GPU
The key advantages of this approach:
Evaluated against zero-shot qwen3.6-plus:free and gpt-5.4-nano baselines.
The fine-tuned adapter is published at: daksh-neo/postmortem-qwen2.5-0.5b-lora
After training, the LoRA weights are saved to models/postmortem-lora/hf_export/ and pushed to HuggingFace.
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # fill in OPENROUTER_API_KEY
export $(cat .env | xargs)
Environment Variables
cp .env.example .env
OPENROUTER_API_KEY=your_openrouter_api_key_here
OPENROUTER_API_KEY is required only for running baseline evaluations against zero-shot models via OpenRouter. The fine-tuning and local evaluation steps run without it.
The full pipeline runs in four steps: scrape, baseline, fine-tune, and evaluate.
Each step is independent. You can run baseline evaluation before fine-tuning to establish the gap the fine-tuned model closes, then run evaluation again afterwards to measure the improvement.
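As a rough sketch of how the four steps chain together, the helper below maps each step to its script and runs them in order. The script names come from the repository layout; invoking them with no arguments is an assumption, and `PIPELINE_STEPS`, `build_commands` and `run_pipeline` are hypothetical names used only for illustration.

```python
import subprocess
import sys

# Script names are taken from the repository layout. Running them without
# arguments is an assumption: the scripts' CLI options are not documented here.
PIPELINE_STEPS = [
    ("scrape", "scrape_postmortems.py"),
    ("baseline", "baseline.py"),
    ("finetune", "finetune.py"),
    ("evaluate", "eval.py"),
]


def build_commands(steps=None, python=sys.executable):
    """Return one argv list per requested step, always in pipeline order."""
    known = {name for name, _ in PIPELINE_STEPS}
    wanted = set(steps) if steps is not None else known
    unknown = wanted - known
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    return [[python, script] for name, script in PIPELINE_STEPS if name in wanted]


def run_pipeline(steps=None):
    """Run the selected steps one after another, stopping at the first failure."""
    for cmd in build_commands(steps):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(["baseline"])` before training and `run_pipeline(["finetune", "evaluate"])` afterwards would mirror the before-and-after workflow described above.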
Evaluation Rubric
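Every generated summary is scored against a four-criterion rubric: timeline reference, contributing factors, specific component identification, and concrete prevention actions. Each criterion carries equal weight.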
Pass threshold: 0.60 weighted score or above.
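To make the arithmetic concrete, here is a minimal sketch of equal-weight scoring with the 0.60 pass threshold. The criterion keys, the assumption that each criterion is scored in [0, 1], and the function names are illustrative only; they are not the schema of data/rubric.json or the logic of eval.py.

```python
# Criterion keys paraphrase the rubric's four criteria. The keys and the
# per-criterion scores in [0, 1] are assumptions, not the real rubric schema.
CRITERIA = (
    "timeline_reference",
    "contributing_factors",
    "specific_component",
    "prevention_actions",
)
WEIGHT = 1.0 / len(CRITERIA)  # equal weight: 0.25 per criterion
PASS_THRESHOLD = 0.60


def weighted_score(scores):
    """Combine per-criterion scores into a single weighted score."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise KeyError(f"missing criteria: {missing}")
    return sum(WEIGHT * scores[c] for c in CRITERIA)


def passes(scores):
    """A summary passes when its weighted score reaches the threshold."""
    return weighted_score(scores) >= PASS_THRESHOLD


def pass_rate(all_scores):
    """Fraction of summaries that pass; 0.0 for an empty list."""
    if not all_scores:
        return 0.0
    return sum(passes(s) for s in all_scores) / len(all_scores)
```

Under these assumptions, a summary that fully satisfies three of the four criteria scores 0.75 and passes, while one that satisfies only two scores 0.50 and fails.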
qwen/qwen3.6-plus:free (zero-shot) - 20–35%
openai/gpt-5.4-nano (zero-shot) - 35–50%
Qwen2.5-0.5B (fine-tuned, 3 epochs) - > 60%
The fine-tuned 0.5B model outperforms both zero-shot baselines on rubric compliance because it has been trained specifically on the output format the rubric measures, not on general-purpose tasks.
ml_project_0901/
├── scrape_postmortems.py # Data collection
├── baseline.py # Zero-shot baseline via OpenRouter
├── finetune.py # LoRA fine-tuning
├── eval.py # Evaluation + comparison
├── requirements.txt
├── .env.example
├── .gitignore
├── LICENSE
├── CONTRIBUTING.md
├── architecture.excalidraw
├── infographic.svg
├── data/
│   ├── train.jsonl # 700 training examples
│   ├── test_100.jsonl # 100 held-out test examples
│   ├── rubric.json # Scoring rubric
│   └── baseline_results.jsonl
└── models/
    └── postmortem-lora/
        └── hf_export/ # Push to HuggingFace after training
This project was built using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.
The requirement was a complete fine-tuning pipeline for a small model on SRE post-mortem data, with data scraping, zero-shot baseline comparison, 4-bit LoRA fine-tuning, and structured rubric-based evaluation. NEO planned, wrote, tested, and verified every file in the repository without human intervention: the data scraper producing 700 training examples and 100 held-out test examples, the baseline evaluator running zero-shot prompts against OpenRouter models, the LoRA fine-tuning script with the full model configuration, the rubric-based evaluator producing the comparison table, and the HuggingFace export pipeline pushing the trained adapter to daksh-neo/postmortem-qwen2.5-0.5b-lora.
Use it to replace inconsistent manual post-mortem writing in your team.
Train on your own organisation's incident data by replacing data/train.jsonl with your own pairs of incident timelines and root-cause summaries. The rubric in data/rubric.json can be adapted to match your org's specific post-mortem format: the evaluation pipeline measures compliance against whatever criteria you define.
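For illustration, a replacement training file could be assembled along these lines. The field names `timeline` and `summary` are hypothetical, as is the sample incident; check the existing data/train.jsonl for the keys that finetune.py actually expects before swapping in your own data.

```python
import json


def make_example(timeline, summary):
    """Build one training pair. Field names are assumed, not the repo's schema."""
    return {"timeline": timeline.strip(), "summary": summary.strip()}


def write_jsonl(path, examples):
    """Write one JSON object per line, as the JSON Lines format expects."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A made-up incident, shown only to illustrate the shape of one record.
SAMPLE = make_example(
    timeline="14:02 deploy of api-gateway\n14:05 5xx rate rises\n14:20 rollback",
    summary="Root cause: a config change in api-gateway dropped the upstream timeout.",
)
```

Writing your records with `write_jsonl("data/train.jsonl", examples)` would then replace the training set, assuming the keys match what the fine-tuning script reads.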
Use the baseline comparison to justify the fine-tuning investment.
Run python baseline.py before fine-tuning to measure what zero-shot models produce on your data. Run python eval.py after fine-tuning to see the improvement. The comparison table gives you a concrete before-and-after that makes the case for domain-specific fine-tuning over general-purpose models.
Use the published adapter directly without retraining.
The fine-tuned LoRA adapter is available at daksh-neo/postmortem-qwen2.5-0.5b-lora on HuggingFace. You can load it directly without running the training pipeline - useful for teams that want to evaluate the output before committing to their own fine-tuning run.
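Loading the adapter might look like the sketch below, which uses the `transformers` and `peft` libraries. The base checkpoint name and the prompt wording are assumptions (the adapter's model card is the authority on both), and `load_adapter`, `build_prompt` and `summarise` are hypothetical helper names.

```python
ADAPTER_ID = "daksh-neo/postmortem-qwen2.5-0.5b-lora"
# Assumption: the adapter was trained on this base checkpoint. Verify it
# against the adapter's model card before relying on the output.
BASE_MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"


def load_adapter(adapter_id=ADAPTER_ID, base_model_id=BASE_MODEL_ID):
    """Load the base model and attach the LoRA adapter on top of it."""
    # Imported lazily so this module can be imported without the ML stack.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tokenizer, model


def build_prompt(timeline):
    """Assumed prompt wording; the format used in training may differ."""
    return (
        "Write a root-cause summary for this incident timeline:\n"
        f"{timeline.strip()}\n\nSummary:"
    )


def summarise(tokenizer, model, timeline, max_new_tokens=256):
    """Generate a summary and return only the newly generated text."""
    inputs = tokenizer(build_prompt(timeline), return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Typical use would be `tokenizer, model = load_adapter()` followed by `summarise(tokenizer, model, timeline_text)`, which downloads both the base model and the adapter on first run.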
Extend it to other structured generation tasks.
The four-step pipeline (scrape, baseline, fine-tune, evaluate) is domain-agnostic. Any task where structured output format matters more than general knowledge is a candidate: alert triage summaries, change request descriptions, deployment notes. Swap the training data and rubric criteria, and the rest of the pipeline runs unchanged.
Zero-shot large models produce verbose, generic post-mortem summaries that do not follow SRE conventions. A fine-tuned 0.5B model trained on 700 domain-specific examples outperforms them on every criterion of the rubric (timeline reference, contributing factors, specific component identification, and concrete prevention actions) while running on a consumer GPU at a fraction of the cost per call.
The code is at https://github.com/dakshjain-1616/postmortem-finetune
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code