# 🧪 NVIDIA GenAI LLM Certification Lab
## Production-Ready LLM Fine-Tuning & Optimization Pipeline
Estimated Time: 3-4 hours

Exam Topics Covered:
- Fine-Tuning (13%)
- Model Optimization (17%)
- Deployment (9%)
- GPU Acceleration (14%)
- Data Preparation (9%)
- Prompt Engineering (13%)

---
## Lab Objectives
Build an end-to-end pipeline that:
1. Prepares and cleans domain-specific training data
2. Fine-tunes a base model using LoRA with 4-bit quantization
3. Applies post-training quantization (GPTQ)
4. Evaluates for hallucinations using self-consistency
5. Packages for production deployment with Docker/Kubernetes

---
## Phase 1: Environment Setup (15 min)
### Requirements
- Python 3.10+
- CUDA-capable GPU (8GB+ VRAM recommended)
- Git
### Dependencies
```bash
pip install torch transformers peft bitsandbytes datasets accelerate evaluate
pip install fastapi uvicorn pydantic
pip install auto-gptq optimum  # For GPTQ quantization
# Optional: pip install wandb  # For experiment tracking
```
### Verification Script
Create src/verify_gpu.py:
```python
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
```
Success Criteria: GPU detected, CUDA working.

---
## Phase 2: Dataset Preparation (Exam: 9% - Data Preparation)
### Task
Create a domain-specific instruction-following dataset with proper cleaning and validation.
### Requirements
1. Source or create 500+ examples in a specific domain:
   - Technical support Q&A
   - Medical information
   - Legal document summarization
   - Code explanation
2. Data cleaning pipeline must:
   - Remove exact duplicates
   - Filter by token length (min: 20 tokens, max: 2048 tokens)
   - Validate JSON structure
   - Split: Train 80% / Val 10% / Test 10%
3. Format as Alpaca-style JSON:
   ```json
   {
     "instruction": "Explain the concept...",
     "input": "Additional context...",
     "output": "The answer..."
   }
   ```
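
The cleaning requirements above can be sketched in plain Python. This is a minimal illustration, not the lab's reference solution: `clean_records` and its `stats` keys are hypothetical names, the input is assumed to be one JSON object per line, and whitespace-separated words stand in for tokenizer tokens (swap in a real tokenizer's length for the actual deliverable).

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")

def clean_records(raw_lines, min_tokens=20, max_tokens=2048):
    """Validate structure, drop exact duplicates, and filter by length."""
    seen, kept = set(), []
    stats = {"total": 0, "invalid": 0, "duplicates": 0, "bad_length": 0}
    for line in raw_lines:
        stats["total"] += 1
        # Validate JSON structure: must parse and carry all three string fields.
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            stats["invalid"] += 1
            continue
        if not isinstance(rec, dict) or not all(
            isinstance(rec.get(k), str) for k in REQUIRED_KEYS
        ):
            stats["invalid"] += 1
            continue
        # Exact-duplicate check on the full (instruction, input, output) triple.
        key = tuple(rec[k] for k in REQUIRED_KEYS)
        if key in seen:
            stats["duplicates"] += 1
            continue
        seen.add(key)
        # Assumption: whitespace word count approximates the token count.
        n_tokens = len(" ".join(key).split())
        if not (min_tokens <= n_tokens <= max_tokens):
            stats["bad_length"] += 1
            continue
        kept.append(rec)
    return kept, stats
```

The `stats` dictionary gives the before/after counts that the grading criteria ask you to report.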
### Deliverable
File: src/data/prepare_dataset.py

Required Functions:
```python
import pandas as pd

def load_and_clean_data(raw_path: str) -> pd.DataFrame:
    """Load raw data, remove duplicates, filter length."""
    pass

def tokenize_dataset(examples, tokenizer, max_length: int = 512):
    """Tokenize with proper padding/truncation."""
    pass

def save_splits(train, val, test, output_dir: str):
    """Save to disk in HuggingFace datasets format."""
    pass
```
Output Structure:
```
data/
├── raw/                # Original data files
└── processed/
    ├── train/          # HuggingFace dataset format
    ├── validation/
    └── test/
```
### Grading Criteria
- [ ] Dataset loads without errors
- [ ] Duplicates removed (report count before/after)
- [ ] Token length filtering implemented
- [ ] Splits are stratified (balanced distribution)
- [ ] Tokenization handles padding correctly
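
One way to satisfy the stratified-split criterion is to split each category separately and then concatenate. The sketch below assumes every record carries a `category` field; that field is hypothetical (plain Alpaca records do not have one), and `stratified_split` is an illustrative name rather than a required function.

```python
import random
from collections import defaultdict

def stratified_split(records, key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split into train/val/test so each category keeps roughly the same share."""
    by_category = defaultdict(list)
    for rec in records:
        by_category[rec[key]].append(rec)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for category in sorted(by_category):
        items = by_category[category]
        rng.shuffle(items)
        n_train = round(len(items) * ratios[0])
        n_val = round(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to the test split
    return train, val, test
```

Very small categories may end up with no validation or test examples, so it is worth printing the per-category counts of each split.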

---
## Phase 3: LoRA Fine-Tuning (Exam: 13% - Fine-Tuning)
### Task
Fine-tune a base model with QLoRA: LoRA adapters trained on top of a 4-bit quantized base model.
### Model Selection
Choose one:
- microsoft/phi-2 (2.7B, fast)
- TinyLlama/TinyLlama-1.1B-Chat-v1.0 (1.1B, fastest)
- meta-llama/Llama-2-7b-hf (requires HuggingFace token)
### LoRA Configuration
```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,            # Rank
    lora_alpha=32,   # Scaling
    # Attention projection names for Llama-style models. Other architectures
    # may name these layers differently, so check model.named_modules() first.
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
```
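
To get a feel for what `r=16` and `lora_alpha=32` mean, the arithmetic can be worked through directly. LoRA adds two matrices of shape `(d_out, r)` and `(r, d_in)` per adapted layer, so each layer contributes `r * (d_in + d_out)` trainable parameters, and the update is scaled by `lora_alpha / r`. The layer shapes below are hypothetical (four square 2048x2048 projections per block, 22 blocks); real models, especially those with grouped-query attention, will differ.

```python
def lora_param_count(layer_shapes, r=16):
    """Trainable LoRA parameters for a list of (d_in, d_out) linear layers."""
    return sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)

def lora_scaling(lora_alpha=32, r=16):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r

# Hypothetical model: 22 blocks, each with four 2048x2048 attention projections.
per_block = [(2048, 2048)] * 4
total = lora_param_count(per_block * 22, r=16)
print(f"trainable LoRA params: {total:,}")   # 5,767,168
print(f"scaling factor: {lora_scaling()}")   # 2.0
```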
### Quantization Configuration
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
```
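
A rough estimate of weight memory helps when checking whether a model fits in VRAM. The sketch below follows the accounting described in the QLoRA paper (64-weight blocks with one 32-bit constant each; with double quantization, 8-bit constants plus one 32-bit constant per 256 blocks). The function names are illustrative, and the result covers quantized weights only: activations, the KV cache, LoRA parameters, optimizer state, and any layers kept in higher precision are not included.

```python
def nf4_bits_per_param(double_quant=True, block=64, block2=256):
    """Approximate storage cost per weight, including quantization constants."""
    if double_quant:
        return 4 + 8 / block + 32 / (block * block2)
    return 4 + 32 / block

def weight_memory_gb(n_params, bits_per_param):
    """Weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 2.7e9  # e.g. a 2.7B-parameter model
print(f"FP16: {weight_memory_gb(n_params, 16):.2f} GB")
print(f"NF4 + double quant: {weight_memory_gb(n_params, nf4_bits_per_param()):.2f} GB")
```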
### Training Configuration
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch = 16
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
    gradient_checkpointing=True,     # Memory efficiency
    optim="paged_adamw_8bit",        # QLoRA optimizer
)
```
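
The "Effective batch = 16" comment follows from multiplying the per-device batch size by the accumulation steps (and by the number of GPUs). A small helper makes the resulting number of optimizer steps explicit, which is useful when choosing `logging_steps`. This is an approximation under stated assumptions: `training_steps` is a hypothetical helper, and the exact count reported by the Trainer can differ slightly depending on how it handles a final partial accumulation window.

```python
import math

def training_steps(n_examples, per_device_batch=4, grad_accum=4, epochs=3, n_devices=1):
    """Return (effective_batch, steps_per_epoch, total_optimizer_steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

# 500 examples with an 80% train split gives 400 training examples.
print(training_steps(400))  # (16, 25, 75)
```

With `logging_steps=10`, a run of this size would log the loss only a handful of times per epoch.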
### Deliverable
File: src/training/train_lora.py

Requirements:
- Load base model with quantization
- Apply LoRA adapters
- Train for 3 epochs
- Save adapter weights to models/lora-adapter/
- Export LoRA config to configs/lora_config.json
- Log training metrics (loss per epoch)

Expected Outputs:
```
models/
└── lora-adapter/
    ├── adapter_config.json
    ├── adapter_model.bin     # may be adapter_model.safetensors in recent PEFT versions
    └── README.md
```
### Grading Criteria
- [ ] Model loads in 4-bit mode
- [ ] LoRA applied to correct modules
- [ ] Training completes without OOM
- [ ] Loss decreases over epochs
- [ ] Adapter files saved correctly

---
## Phase 4: Model Optimization (Exam: 17% - Model Optimization)
### Task
Apply post-training quantization and benchmark performance.
### Quantization Method
Implement GPTQ 4-bit quantization on the merged model (base + LoRA).
### Steps
1. Merge LoRA weights into base model
2. Quantize to 4-bit using GPTQ
3. Compare three variants:

| Variant | Memory | Speed | Notes |
|---------|--------|-------|-------|
| Base FP16 | Baseline | Baseline | Original model |
| LoRA (4-bit base) | ~6GB | Fast | QLoRA during training |
| GPTQ 4-bit | ~4GB | Fastest | Merged + quantized |
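
Step 1 (merging) amounts to folding the low-rank update into each adapted weight matrix: `W_merged = W + (lora_alpha / r) * B @ A`. The toy example below shows that arithmetic on 2x2 lists in pure Python; `matmul` and `merge_lora` are illustrative helpers, not library functions. In a real pipeline PEFT's `merge_and_unload()` performs the merge, typically on a half-precision copy of the base model rather than the 4-bit one.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A).

    W: (d_out, d_in), B: (d_out, r), A: (r, d_in).
    """
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]

W = [[1, 0], [0, 1]]   # frozen base weight
A = [[1, 2]]           # rank-1 adapter, shape (1, 2)
B = [[1], [3]]         # shape (2, 1)
print(merge_lora(W, A, B, lora_alpha=2, r=1))  # [[3.0, 4.0], [6.0, 13.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be handed to GPTQ as an ordinary checkpoint.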
### Deliverable
File: src/optimization/quantize.py

Required Functions:
```python
import pandas as pd

def merge_and_quantize_gptq(
    base_model_path: str,
    adapter_path: str,
    output_path: str,
    bits: int = 4
):
    """Merge LoRA and apply GPTQ quantization."""
    pass

def benchmark_inference(model, tokenizer, test_prompts: list) -> dict:
    """Return memory usage and tokens/sec."""
    pass

def compare_models(models_dict: dict, test_prompts: list) -> pd.DataFrame:
    """Compare all variants, return results table."""
    pass
```
Output: results/optimization_report.json (example format; the values are illustrative)
```json
{
  "base_fp16": {"memory_gb": 5.4, "tokens_per_sec": 45.2},
  "lora_4bit": {"memory_gb": 3.8, "tokens_per_sec": 52.1},
  "gptq_4bit": {"memory_gb": 2.1, "tokens_per_sec": 67.3}
}
```
### Grading Criteria
- [ ] GPTQ quantization executes without error
- [ ] Memory usage measured correctly
- [ ] Speed benchmark implemented (warm-up + timed)
- [ ] Comparison table generated
- [ ] Trade-offs documented
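
The "warm-up + timed" criterion can be met with a small harness that discards the first few calls and then times the rest. The sketch below is model-agnostic: `generate_fn` is a hypothetical callable that takes a prompt and returns the number of tokens it generated, and `prompts` is assumed to be non-empty. On a GPU you would also want to call `torch.cuda.synchronize()` before reading the clock and use `torch.cuda.max_memory_allocated()` for peak memory; both are left out here to keep the example dependency-free.

```python
import time

def benchmark(generate_fn, prompts, warmup=2, runs=3):
    """Time generate_fn over prompts after a few untimed warm-up calls."""
    # Warm-up: the first calls often include one-off costs (kernel compilation,
    # cache allocation) that would distort the measurement.
    for i in range(warmup):
        generate_fn(prompts[i % len(prompts)])

    total_tokens, total_seconds = 0, 0.0
    for _ in range(runs):
        for prompt in prompts:
            start = time.perf_counter()
            total_tokens += generate_fn(prompt)
            total_seconds += time.perf_counter() - start

    rate = total_tokens / total_seconds if total_seconds > 0 else float("inf")
    return {"tokens": total_tokens, "seconds": total_seconds, "tokens_per_sec": rate}
```

Running the same harness over all three variants with identical prompts and generation settings keeps the comparison table fair.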

---
## Phase 5: Evaluation & Hallucination Detection (Exam: 13% - Prompt Engineering)
### Task
Build an evaluation pipeline that detects hallucinations using self-consistency.
### Hallucination Detection Strategy
1. Self-Consistency: Generate 3 responses per prompt, measure agreement
2. NLI Verification: Check if output contradicts source context
3. Perplexity Scoring: Treat unusually high perplexity as a sign of potential hallucination (a heuristic signal, not proof)
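
For the self-consistency score, the grading criteria accept either embeddings or n-gram overlap. The sketch below uses the simpler n-gram route: the mean pairwise Jaccard similarity of word bigrams across the sampled responses. It is an agreement heuristic chosen for illustration (the function names are hypothetical), and it rewards similar wording rather than similar meaning, so an embedding-based score is likely to be more robust to paraphrase.

```python
from itertools import combinations

def ngrams(text, n=2):
    """Set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def self_consistency(responses, n=2):
    """Mean pairwise n-gram Jaccard similarity, in [0, 1]."""
    sets = [ngrams(r, n) for r in responses]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0  # a single response trivially agrees with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

print(self_consistency(["the cat sat", "the cat sat", "a dog ran"]))  # one pair of three agrees
```

A prompt can then be flagged when its score falls below a threshold that you tune on your own test prompts.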
### Test Prompts
Create 50 prompts across categories:
- Factual questions (likely to hallucinate dates/names)
- Mathematical reasoning
- Recent events (post-training knowledge cutoff)
- Counterfactuals
### Deliverable
File: src/evaluation/evaluate.py

Required Functions:
```python
def generate_multiple_responses(
    model, tokenizer, prompt: str,
    n: int = 3, temperature: float = 0.7
) -> list:
    """Generate n diverse responses."""
    pass

def calculate_self_consistency(responses: list) -> float:
    """Return agreement score 0-1 using embeddings."""
    pass

def detect_hallucination_nli(
    context: str, generated: str, nli_model
) -> bool:
    """Use NLI model to detect contradictions."""
    pass

def calculate_perplexity(model, tokenizer, text: str) -> float:
    """Calculate perplexity on text."""
    pass

def run_evaluation(model_path: str, test_data_path: str) -> dict:
    """Main evaluation loop, returns full report."""
    pass
```
Output: results/evaluation_report.json (example format; the values are illustrative)
```json
{
  "overall_metrics": {
    "avg_perplexity": 12.4,
    "avg_self_consistency": 0.78,
    "hallucination_rate": 0.15
  },
  "per_category": {
    "factual": {"hallucination_rate": 0.22},
    "math": {"hallucination_rate": 0.08},
    "recent_events": {"hallucination_rate": 0.35}
  },
  "examples": [
    {
      "prompt": "Who won the 2024 US Presidential election?",
      "responses": [...],
      "consistency": 0.33,
      "flagged": true
    }
  ]
}
```
### Grading Criteria
- [ ] 50 test prompts created
- [ ] Self-consistency implemented (embeddings or n-gram overlap)
- [ ] Perplexity calculation correct
- [ ] Hallucination rate computed per category
- [ ] Specific failure cases documented
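
Two of the criteria above reduce to short formulas. Perplexity is the exponential of the mean negative log-likelihood per token, and the per-category hallucination rate is the flagged fraction within each category. The sketch below assumes you already have per-token natural-log probabilities and a list of `(category, flagged)` pairs; both helper names are hypothetical. With a Hugging Face causal LM, the `loss` returned when `labels` are passed is a mean negative log-likelihood, so `exp(loss)` gives the same quantity.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rate_per_category(results):
    """results: iterable of (category, flagged) pairs -> {category: flagged fraction}."""
    totals, flagged = {}, {}
    for category, is_flagged in results:
        totals[category] = totals.get(category, 0) + 1
        flagged[category] = flagged.get(category, 0) + int(is_flagged)
    return {category: flagged[category] / totals[category] for category in totals}

# A model that assigns probability 0.5 to every token has perplexity of about 2.
print(perplexity([math.log(0.5)] * 4))
print(rate_per_category([("math", False), ("math", True), ("factual", True)]))
```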

---
## Phase 6: Deployment Packaging (Exam: 9% - Model Deployment)
### Task
Containerize the inference API for production deployment.
### API Requirements
Create FastAPI server with:
- POST /generate - Text generation endpoint
- GET /health - Health check
- Request validation with Pydantic
- Dynamic batching for concurrent requests
- Proper error handling
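
Dynamic batching is the least obvious requirement in the list above, so here is one possible shape for it: requests are queued, and a background task groups whatever arrives within a short window into a single model call. This is a sketch under assumptions, not a production design. `MicroBatcher`, `batch_fn`, and the default limits are all hypothetical, and `batch_fn` is called synchronously here, so a real model call should be moved off the event loop (for example with `run_in_executor`).

```python
import asyncio

class MicroBatcher:
    """Group concurrent requests into batches for a single model call."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.01):
        self.batch_fn = batch_fn            # hypothetical: list[str] -> list[str]
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def submit(self, prompt):
        """Enqueue one prompt and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def run(self):
        """Background loop: collect a batch, run it, resolve the futures."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]          # block until work arrives
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                outputs = self.batch_fn([prompt for prompt, _ in batch])
                for (_, future), output in zip(batch, outputs):
                    if not future.done():
                        future.set_result(output)
            except Exception as exc:                  # propagate failures to callers
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)

async def demo():
    calls = []

    def fake_model(prompts):                          # stand-in for real inference
        calls.append(len(prompts))
        return [p.upper() for p in prompts]

    batcher = MicroBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(f"p{i}") for i in range(6)))
    worker.cancel()
    return results, calls

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the FastAPI server, the worker task would be started at startup and the /generate handler would simply `await batcher.submit(request.prompt)`. The two knobs trade latency for throughput: a longer `max_wait_s` yields fuller batches but adds delay to every request.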
### API Schema
```python
from pydantic import BaseModel

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7
    top_p: float = 0.9

class GenerateResponse(BaseModel):
    generated_text: str
    tokens_generated: int
    inference_time_ms: float
```
### Deliverables
File: src/api/inference_server.py
```python
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GenerateRequest and GenerateResponse are the classes from the API Schema
# section; define them in this module before the routes below.

app = FastAPI()

# Load model on startup
# (newer FastAPI versions prefer a lifespan handler over on_event)
@app.on_event("startup")
async def load_model():
    # Load your quantized model
    pass

@app.post("/generate")
async def generate(request: GenerateRequest):
    # Generate response
    pass

@app.get("/health")
async def health():
    return {"status": "healthy", "gpu": torch.cuda.is_available()}
```
File: Dockerfile
```dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
WORKDIR /app
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY src/ ./src/
COPY models/ ./models/
EXPOSE 8000
CMD ["uvicorn", "src.api.inference_server:app", "--host", "0.0.0.0", "--port", "8000"]
```
File: docker-compose.yml
```yaml
version: '3.8'
services:
  llm-api:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0
```
File: k8s/deployment.yaml
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: llm-api
          image: llm-production-lab:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
```
### Testing Commands
```bash
# Build and run
docker-compose up --build

# Test health
curl http://localhost:8000/health

# Test generation
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is LoRA?", "max_tokens": 100}'
```
### Grading Criteria
- [ ] FastAPI server starts
- [ ] /health endpoint responds
- [ ] /generate returns valid JSON
- [ ] Docker image builds successfully
- [ ] Container runs with GPU access
- [ ] Kubernetes manifest valid (passes kubectl apply --dry-run=client -f k8s/deployment.yaml)

---
## Phase 7: Documentation
### Required: README.md
Must include:
1. Architecture Overview (ASCII diagram acceptable)
   ```
   Raw Data → Clean → Tokenize → Train (LoRA) → Quantize → Deploy
      ↓         ↓        ↓            ↓            ↓         ↓
   Phase 2   Phase 2  Phase 2      Phase 3      Phase 4   Phase 6
   ```
2. Setup Instructions
   - Prerequisites
   - Installation steps
   - Environment variables needed
3. Performance Benchmarks Table (example layout; report your own measurements)

   | Model Variant | VRAM (GB) | Tokens/sec | Perplexity |
   |---------------|-----------|------------|------------|
   | Base FP16     | 5.4       | 45         | 8.2        |
   | LoRA 4-bit    | 3.8       | 52         | 8.4        |
   | GPTQ 4-bit    | 2.1       | 67         | 9.1        |

4. Trade-offs Documented
   - Why LoRA vs full fine-tuning?
   - Why GPTQ vs AWQ?
   - Batching strategy chosen
5. Known Limitations
   - Model size constraints
   - Inference latency under load
   - Hallucination detection false positive rate

---
## Final Submission Checklist
Your repository should contain:
```
llm-production-lab/
├── README.md                     # Required documentation
├── requirements.txt              # All dependencies
├── Dockerfile                    # Multi-stage build
├── docker-compose.yml            # GPU-enabled compose
├── configs/
│   └── lora_config.json          # Exported LoRA config
├── src/
│   ├── verify_gpu.py             # Phase 1 verification
│   ├── data/
│   │   └── prepare_dataset.py    # Phase 2
│   ├── training/
│   │   └── train_lora.py         # Phase 3
│   ├── optimization/
│   │   └── quantize.py           # Phase 4
│   ├── evaluation/
│   │   └── evaluate.py           # Phase 5
│   └── api/
│       └── inference_server.py   # Phase 6
├── data/
│   ├── processed/                # Gitignore large files
│   └── sample/                   # Include 5 sample records
├── models/
│   └── lora-adapter/             # Gitignore, add README
├── k8s/
│   └── deployment.yaml           # Phase 6
└── results/
    ├── optimization_report.json  # Phase 4
    └── evaluation_report.json    # Phase 5
```
### .gitignore Recommendations
```
data/processed/*
!data/processed/sample/
models/lora-adapter/*
!models/lora-adapter/README.md
# results/*.json is intentionally not ignored: both reports are required deliverables
__pycache__/
*.pyc
.ipynb_checkpoints/
```

---
## 🎯 Grading Rubric

| Phase | Weight | Criteria |
|-------|--------|----------|
| Data Prep | 10% | Clean dataset, proper splits |
| LoRA Training | 20% | Model trains, loss decreases, adapters saved |
| Quantization | 25% | GPTQ working, benchmarks reported |
| Evaluation | 20% | Hallucination detection implemented |
| Deployment | 15% | Container builds, API responds |
| Documentation | 10% | README complete, trade-offs explained |

Total: 100 points
- 90-100: Exam-ready (strong understanding of all topics)
- 80-89: Good (minor gaps, solid foundation)
- 70-79: Adequate (some areas need review)
- Below 70: Needs revision

---
## References
- LoRA: Hu et al. (2021) - "LoRA: Low-Rank Adaptation of Large Language Models"
- QLoRA: Dettmers et al. (2023) - "QLoRA: Efficient Finetuning of Quantized LLMs"
- GPTQ: Frantar et al. (2022) - "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"
- Self-Consistency: Wang et al. (2022) - "Self-Consistency Improves Chain of Thought Reasoning in Language Models"

---
## Bonus Challenges (Optional)
1. DeepSpeed Integration: Add ZeRO-3 for distributed training support
2. Speculative Decoding: Implement a draft model to speed up inference (target: roughly 2x)
3. vLLM Integration: Replace transformers with vLLM for production serving
4. RLHF/DPO: Add a preference tuning step after SFT
5. Multi-GPU: Scale training across multiple GPUs with FSDP

---
Good luck! Push your repo when ready and I'll provide detailed feedback.