# NVIDIA GenAI LLM Certification Lab

> Source: <https://gist.github.com/cmcintosh/78aa2ab81cc413154c978ec15c8a78ab>
> Published: 2026-06-24 13:01:09+00:00

| # 🧪 NVIDIA GenAI LLM Certification Lab | |
| ## Production-Ready LLM Fine-Tuning & Optimization Pipeline | |
| **Estimated Time:** 3-4 hours | |
| **Exam Topics Covered:** | |
| - Fine-Tuning (13%) | |
| - Model Optimization (17%) | |
| - Deployment (9%) | |
| - GPU Acceleration (14%) | |
| - Data Preparation (9%) | |
| - Prompt Engineering (13%) | |
| --- | |
| ## Lab Objectives | |
| Build an end-to-end pipeline that: | |
| 1. Prepares and cleans domain-specific training data | |
| 2. Fine-tunes a base model using LoRA with 4-bit quantization | |
| 3. Applies post-training quantization (GPTQ) | |
| 4. Evaluates for hallucinations using self-consistency | |
| 5. Packages for production deployment with Docker/Kubernetes | |
| --- | |
| ## Phase 1: Environment Setup (15 min) | |
| ### Requirements | |
| - Python 3.10+ | |
| - CUDA-capable GPU (8GB+ VRAM recommended) | |
| - Git | |
| ### Dependencies | |
| ``` bash | |
| pip install torch transformers peft bitsandbytes datasets accelerate evaluate | |
| pip install fastapi uvicorn pydantic | |
| pip install auto-gptq optimum # For GPTQ quantization | |
| # Optional: pip install wandb # For experiment tracking | |
| ``` | |
| ### Verification Script | |
| Create `src/verify_gpu.py`: | |
| ``` python | |
| import torch | |
| print(f"PyTorch version: {torch.__version__}") | |
| print(f"CUDA available: {torch.cuda.is_available()}") | |
| if torch.cuda.is_available(): | |
|     print(f"GPU: {torch.cuda.get_device_name(0)}") | |
|     print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB") | |
| ``` | |
| **Success Criteria:** GPU detected, CUDA working. | |
| --- | |
| ## Phase 2: Dataset Preparation (Exam: 9% - Data Preparation) | |
| ### Task | |
| Create a domain-specific instruction-following dataset with proper cleaning and validation. | |
| ### Requirements | |
| 1. Source or create 500+ examples in a specific domain: | |
| - Technical support Q&A | |
| - Medical information | |
| - Legal document summarization | |
| - Code explanation | |
| 2. Data cleaning pipeline must: | |
| - Remove exact duplicates | |
| - Filter by token length (min: 20 tokens, max: 2048 tokens) | |
| - Validate JSON structure | |
| - Split: Train 80% / Val 10% / Test 10% | |
| 3. Format as Alpaca-style JSON: | |
| ``` json | |
| { | |
|   "instruction": "Explain the concept...", | |
|   "input": "Additional context...", | |
|   "output": "The answer..." | |
| } | |
| ``` | |
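The cleaning requirements above can be sketched with the standard library alone. This is a minimal illustration, not the lab's reference solution: `is_valid_record`, `count_tokens`, and `clean_records` are hypothetical names, and whitespace splitting stands in for the model tokenizer, so swap in real token counts before relying on the length limits.

```python
REQUIRED_KEYS = ("instruction", "input", "output")


def is_valid_record(record) -> bool:
    """True if the record is a dict with string values for all Alpaca keys."""
    return isinstance(record, dict) and all(
        isinstance(record.get(key), str) for key in REQUIRED_KEYS
    )


def count_tokens(record: dict) -> int:
    """Approximate token count by whitespace splitting (assumption, not a real tokenizer)."""
    text = " ".join(record[key] for key in REQUIRED_KEYS)
    return len(text.split())


def clean_records(records, min_tokens: int = 20, max_tokens: int = 2048):
    """Validate structure, drop exact duplicates, then filter by length."""
    seen = set()
    kept = []
    stats = {"input": 0, "invalid": 0, "duplicate": 0, "length": 0}
    for record in records:
        stats["input"] += 1
        if not is_valid_record(record):
            stats["invalid"] += 1
            continue
        key = tuple(record[k] for k in REQUIRED_KEYS)  # exact-match fingerprint
        if key in seen:
            stats["duplicate"] += 1
            continue
        seen.add(key)
        if not min_tokens <= count_tokens(record) <= max_tokens:
            stats["length"] += 1
            continue
        kept.append(record)
    stats["kept"] = len(kept)
    return kept, stats
```

The returned `stats` dictionary gives the before/after counts that the grading criteria ask you to report.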
| ### Deliverable | |
| **File:** `src/data/prepare_dataset.py` | |
| **Required Functions:** | |
| ``` python | |
| import pandas as pd | |
| def load_and_clean_data(raw_path: str) -> pd.DataFrame: | |
|     """Load raw data, remove duplicates, filter length.""" | |
|     pass | |
| def tokenize_dataset(examples, tokenizer, max_length: int = 512): | |
|     """Tokenize with proper padding/truncation.""" | |
|     pass | |
| def save_splits(train, val, test, output_dir: str): | |
|     """Save to disk in HuggingFace datasets format.""" | |
|     pass | |
| ``` | |
| **Output Structure:** | |
| ``` | |
| data/ | |
| ├── raw/ # Original data files | |
| └── processed/ | |
|     ├── train/ # HuggingFace dataset format | |
|     ├── validation/ | |
|     └── test/ | |
| ``` | |
| ### Grading Criteria | |
| - [ ] Dataset loads without errors | |
| - [ ] Duplicates removed (report count before/after) | |
| - [ ] Token length filtering implemented | |
| - [ ] Splits are stratified (balanced distribution) | |
| - [ ] Tokenization handles padding correctly | |
| --- | |
| ## Phase 3: LoRA Fine-Tuning (Exam: 13% - Fine-Tuning) | |
| ### Task | |
| Fine-tune a base model with QLoRA: LoRA adapters trained on top of a 4-bit quantized base model. | |
| ### Model Selection | |
| Choose one: | |
| - `microsoft/phi-2` (2.7B, fast) | |
| - `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (1.1B, fastest) | |
| - `meta-llama/Llama-2-7b-hf` (requires HuggingFace token) | |
| ### LoRA Configuration | |
| ``` python | |
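| from peft import LoraConfig | |
| peft_config = LoraConfig( | |
|     r=16,  # Rank | |
|     lora_alpha=32,  # Scaling | |
|     # Llama-style module names; other architectures may differ, so check model.named_modules() | |
|     target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], | |
|     lora_dropout=0.05, | |
|     bias="none", | |
|     task_type="CAUSAL_LM" | |
| ) | |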
| ``` | |
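To get a feel for what `r=16` and `lora_alpha=32` mean, the arithmetic can be worked through directly. The sketch below counts the parameters a LoRA adapter adds to one linear layer and the scaling applied to its update. The layer shape (2048 x 2048) and the layer count (22) are hypothetical round numbers for illustration, not the dimensions of any model listed above.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return lora_alpha / r


# Hypothetical model: 22 layers, four square 2048 x 2048 projections per layer.
per_module = lora_trainable_params(d_in=2048, d_out=2048, r=16)
total = per_module * 4 * 22
print(f"per module: {per_module:,}  total: {total:,}  scaling: {lora_scaling(32, 16)}")
```

Doubling `r` doubles the adapter size, while keeping `lora_alpha / r` fixed keeps the magnitude of the update roughly comparable between runs.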
| ### Quantization Configuration | |
| ``` python | |
| import torch | |
| from transformers import BitsAndBytesConfig | |
| bnb_config = BitsAndBytesConfig( | |
|     load_in_4bit=True, | |
|     bnb_4bit_quant_type="nf4", | |
|     bnb_4bit_compute_dtype=torch.float16, | |
|     bnb_4bit_use_double_quant=True, | |
| ) | |
| ``` | |
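A rough estimate of weight memory helps explain why 4-bit loading and double quantization matter. The sketch below is back-of-the-envelope arithmetic under stated assumptions: a quantization block size of 64 with FP32 constants (the figures given in the QLoRA paper), weights only, and no activations, KV cache, or optimizer state. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float, overhead_bits: float = 0.0) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * (bits_per_weight + overhead_bits) / 8 / 1e9


# Overhead of the quantization constants per parameter, assuming block size 64:
# one FP32 absmax per block costs 32 / 64 = 0.5 bits per parameter.
OVERHEAD_PLAIN = 32 / 64
# Double quantization stores those constants in 8 bits, plus one FP32
# second-level constant per 256 blocks: 8 / 64 + 32 / (64 * 256) bits.
OVERHEAD_DOUBLE_QUANT = 8 / 64 + 32 / (64 * 256)

n_params = 2.7e9  # roughly the size of microsoft/phi-2
print(f"FP16:               {weight_memory_gb(n_params, 16):.2f} GB")
print(f"NF4:                {weight_memory_gb(n_params, 4, OVERHEAD_PLAIN):.2f} GB")
print(f"NF4 + double quant: {weight_memory_gb(n_params, 4, OVERHEAD_DOUBLE_QUANT):.2f} GB")
```

Measured VRAM will be higher than these figures because the estimate ignores everything except the weights.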
| ### Training Configuration | |
| ``` python | |
| from transformers import TrainingArguments | |
| training_args = TrainingArguments( | |
|     output_dir="./results", | |
|     num_train_epochs=3, | |
|     per_device_train_batch_size=4, | |
|     gradient_accumulation_steps=4,  # Effective batch = 16 | |
|     learning_rate=2e-4, | |
|     logging_steps=10, | |
|     save_strategy="epoch", | |
|     fp16=True, | |
|     gradient_checkpointing=True,  # Memory efficiency | |
|     optim="paged_adamw_8bit",  # QLoRA optimizer | |
| ) | |
| ``` | |
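The "Effective batch = 16" comment follows from multiplying the per-device batch size by the accumulation steps. The sketch below works that out, along with an approximate optimizer step count, assuming a single GPU and a 400-example training split (80% of the 500-example minimum). `optimizer_steps` is a hypothetical helper, and the exact count reported by the trainer can differ slightly when the dataset does not divide evenly.

```python
import math


def optimizer_steps(n_examples: int, per_device_batch: int, grad_accum: int,
                    epochs: int, n_devices: int = 1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective, total_steps = optimizer_steps(
    n_examples=400, per_device_batch=4, grad_accum=4, epochs=3
)
print(f"effective batch: {effective}, optimizer steps: {total_steps}")
```

With `logging_steps=10`, a run of this size produces only a handful of loss readings, which is worth keeping in mind when checking that loss decreases.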
| ### Deliverable | |
| **File:** `src/training/train_lora.py` | |
| **Requirements:** | |
| - Load base model with quantization | |
| - Apply LoRA adapters | |
| - Train for 3 epochs | |
| - Save adapter weights to `models/lora-adapter/` | |
| - Export LoRA config to `configs/lora_config.json` | |
| - Log training metrics (loss per epoch) | |
| **Expected Outputs:** | |
| ``` | |
| models/ | |
| └── lora-adapter/ | |
|     ├── adapter_config.json | |
|     ├── adapter_model.safetensors # adapter_model.bin with older PEFT versions | |
|     └── README.md | |
| ``` | |
| ### Grading Criteria | |
| - [ ] Model loads in 4-bit mode | |
| - [ ] LoRA applied to correct modules | |
| - [ ] Training completes without OOM | |
| - [ ] Loss decreases over epochs | |
| - [ ] Adapter files saved correctly | |
| --- | |
| ## Phase 4: Model Optimization (Exam: 17% - Model Optimization) | |
| ### Task | |
| Apply post-training quantization and benchmark performance. | |
| ### Quantization Method | |
| Implement **GPTQ 4-bit** quantization on the merged model (base + LoRA). | |
| ### Steps | |
| 1. Merge LoRA weights into the base model (reload the base in FP16 for the merge rather than reusing the 4-bit training copy) | |
| 2. Quantize to 4-bit using GPTQ | |
| 3. Compare three variants: | |
| | Variant | Memory | Speed | Notes | | |
| |---------|--------|-------|-------| | |
| | Base FP16 | Baseline | Baseline | Original model | | |
| | LoRA (4-bit base) | ~6GB | Fast | QLoRA during training | | |
| | GPTQ 4-bit | ~4GB | Fastest | Merged + quantized | | |
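Merging in step 1 means folding the low-rank update into the original weight so the adapter is no longer needed at inference time: W' = W + (lora_alpha / r) · B · A. The sketch below shows that arithmetic on plain Python lists with toy matrices. In practice PEFT's `merge_and_unload()` does this on the real tensors; `matmul` and `merge_lora` here are illustrative helpers only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha: float, r: int):
    """Return W + (lora_alpha / r) * (B @ A) without modifying the inputs."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: 2 x 2 weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # r x d_in
B = [[1.0], [3.0]]    # d_out x r
merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

After merging, the model has the same shape as the base model, which is what a post-training quantizer such as GPTQ expects as input.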
| ### Deliverable | |
| **File:** `src/optimization/quantize.py` | |
| **Required Functions:** | |
| ``` python | |
| import pandas as pd | |
| def merge_and_quantize_gptq( | |
|     base_model_path: str, | |
|     adapter_path: str, | |
|     output_path: str, | |
|     bits: int = 4 | |
| ): | |
|     """Merge LoRA and apply GPTQ quantization.""" | |
|     pass | |
| def benchmark_inference(model, tokenizer, test_prompts: list) -> dict: | |
|     """Return memory usage and tokens/sec.""" | |
|     pass | |
| def compare_models(models_dict: dict, test_prompts: list) -> pd.DataFrame: | |
|     """Compare all variants, return results table.""" | |
|     pass | |
| ``` | |
| **Output:** `results/optimization_report.json` (example values; your measurements will differ) | |
| ``` json | |
| { | |
|   "base_fp16": {"memory_gb": 5.4, "tokens_per_sec": 45.2}, | |
|   "lora_4bit": {"memory_gb": 3.8, "tokens_per_sec": 52.1}, | |
|   "gptq_4bit": {"memory_gb": 2.1, "tokens_per_sec": 67.3} | |
| } | |
| ``` | |
| ### Grading Criteria | |
| - [ ] GPTQ quantization executes without error | |
| - [ ] Memory usage measured correctly | |
| - [ ] Speed benchmark implemented (warm-up + timed) | |
| - [ ] Comparison table generated | |
| - [ ] Trade-offs documented | |
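The "warm-up + timed" criterion can be met with a small harness like the one below. It is a sketch under assumptions: `generate_fn` is a hypothetical callable that runs one generation and returns the number of new tokens produced, so the harness itself stays independent of any model library.

```python
import time


def benchmark(generate_fn, prompts, warmup_runs: int = 2, timed_runs: int = 5) -> dict:
    """Time generate_fn over the prompts after discarding warm-up passes."""
    # Warm-up: the first calls pay for kernel compilation and cache setup.
    for _ in range(warmup_runs):
        for prompt in prompts:
            generate_fn(prompt)

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(timed_runs):
        for prompt in prompts:
            total_tokens += generate_fn(prompt)
    elapsed = time.perf_counter() - start

    return {
        "total_tokens": total_tokens,
        "seconds": elapsed,
        "tokens_per_sec": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }
```

On a GPU, calling `torch.cuda.synchronize()` inside `generate_fn` before it returns keeps asynchronous kernels from skewing the timing, and `torch.cuda.max_memory_allocated()` is one way to read peak memory for the report.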
| --- | |
| ## Phase 5: Evaluation & Hallucination Detection (Exam: 13% - Prompt Engineering) | |
| ### Task | |
| Build an evaluation pipeline that detects hallucinations using self-consistency. | |
| ### Hallucination Detection Strategy | |
| 1. **Self-Consistency:** Generate 3 responses per prompt, measure agreement | |
| 2. **NLI Verification:** Check if output contradicts source context | |
| 3. **Perplexity Scoring:** High perplexity can indicate a potential hallucination; treat it as a weak signal alongside the other two checks | |
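As a starting point for the self-consistency check, agreement between responses can be measured with n-gram overlap, which needs no embedding model. The sketch below averages pairwise Jaccard similarity over word bigrams. The function names are hypothetical, and bigram overlap is only one possible agreement measure; it penalizes paraphrases that an embedding-based score would treat as consistent.

```python
from itertools import combinations


def ngrams(text: str, n: int = 2) -> set:
    """Set of lowercase word n-grams; short texts fall back to one tuple."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Intersection over union; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def self_consistency(responses: list, n: int = 2) -> float:
    """Mean pairwise n-gram Jaccard similarity, between 0 and 1."""
    if len(responses) < 2:
        return 1.0
    grams = [ngrams(response, n) for response in responses]
    scores = [jaccard(x, y) for x, y in combinations(grams, 2)]
    return sum(scores) / len(scores)
```

A low score means the sampled answers disagree with each other, which is the signal used to flag a prompt for review.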
| ### Test Prompts | |
| Create 50 prompts across categories: | |
| - Factual questions (likely to hallucinate dates/names) | |
| - Mathematical reasoning | |
| - Recent events (after the model's training knowledge cutoff) | |
| - Counterfactuals | |
| ### Deliverable | |
| **File:** `src/evaluation/evaluate.py` | |
| **Required Functions:** | |
| ``` python | |
| def generate_multiple_responses( | |
|     model, tokenizer, prompt: str, | |
|     n: int = 3, temperature: float = 0.7 | |
| ) -> list: | |
|     """Generate n diverse responses.""" | |
|     pass | |
| def calculate_self_consistency(responses: list) -> float: | |
|     """Return agreement score 0-1 using embeddings.""" | |
|     pass | |
| def detect_hallucination_nli( | |
|     context: str, generated: str, nli_model | |
| ) -> bool: | |
|     """Use NLI model to detect contradictions.""" | |
|     pass | |
| def calculate_perplexity(model, tokenizer, text: str) -> float: | |
|     """Calculate perplexity on text.""" | |
|     pass | |
| def run_evaluation(model_path: str, test_data_path: str) -> dict: | |
|     """Main evaluation loop, returns full report.""" | |
|     pass | |
| ``` | |
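Perplexity is the exponential of the mean negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities (natural log), which keeps the formula visible without loading a model. `perplexity_from_logprobs` is a hypothetical helper; with a causal language model the same quantity is commonly obtained by exponentiating the mean cross-entropy loss.

```python
import math


def perplexity_from_logprobs(token_logprobs: list) -> float:
    """exp(-mean(log p)) over the tokens of one text."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among four options.
uniform_four = [math.log(0.25)] * 10
print(perplexity_from_logprobs(uniform_four))
```

Perplexity values are only comparable between models that share a tokenizer, since the per-token average depends on how the text is split.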
| **Output:** `results/evaluation_report.json` (example values; your measurements will differ) | |
| ``` json | |
| { | |
|   "overall_metrics": { | |
|     "avg_perplexity": 12.4, | |
|     "avg_self_consistency": 0.78, | |
|     "hallucination_rate": 0.15 | |
|   }, | |
|   "per_category": { | |
|     "factual": {"hallucination_rate": 0.22}, | |
|     "math": {"hallucination_rate": 0.08}, | |
|     "recent_events": {"hallucination_rate": 0.35} | |
|   }, | |
|   "examples": [ | |
|     { | |
|       "prompt": "Who won the 2024 US Presidential election?", | |
|       "responses": [...], | |
|       "consistency": 0.33, | |
|       "flagged": true | |
|     } | |
|   ] | |
| } | |
| ``` | |
| ### Grading Criteria | |
| - [ ] 50 test prompts created | |
| - [ ] Self-consistency implemented (embeddings or n-gram overlap) | |
| - [ ] Perplexity calculation correct | |
| - [ ] Hallucination rate computed per category | |
| - [ ] Specific failure cases documented | |
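The per-category hallucination rate can be derived from the per-prompt consistency scores by flagging anything below a threshold and averaging within each category. In the sketch below, the 0.5 threshold and the `category` and `consistency` field names are assumptions chosen to mirror the example report, not values prescribed by the lab.

```python
from collections import defaultdict


def hallucination_rates(results: list, threshold: float = 0.5) -> dict:
    """Flag prompts whose consistency is below threshold and aggregate by category."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for item in results:
        total[item["category"]] += 1
        if item["consistency"] < threshold:
            flagged[item["category"]] += 1

    per_category = {
        category: {"hallucination_rate": flagged[category] / total[category]}
        for category in total
    }
    n_prompts = sum(total.values())
    overall = sum(flagged.values()) / n_prompts if n_prompts else 0.0
    return {
        "overall_metrics": {"hallucination_rate": overall},
        "per_category": per_category,
    }
```

Because the threshold determines the reported rate, it is worth stating it in the README next to the known false-positive limitations.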
| --- | |
| ## Phase 6: Deployment Packaging (Exam: 9% - Model Deployment) | |
| ### Task | |
| Containerize the inference API for production deployment. | |
| ### API Requirements | |
| Create a FastAPI server with: | |
| - `POST /generate` - Text generation endpoint | |
| - `GET /health` - Health check | |
| - Request validation with Pydantic | |
| - Dynamic batching for concurrent requests | |
| - Proper error handling | |
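Dynamic batching is the least obvious requirement in this list, so here is one possible shape for it using only `asyncio`. Request handlers call `submit()`, and a background task groups whatever arrives within a short window into one batch. The class name `DynamicBatcher`, the `batch_fn` callable (list of prompts in, list of outputs out), and the default limits are all assumptions for illustration. A real server would also run `batch_fn` off the event loop, for example in an executor, because model inference blocks.

```python
import asyncio


class DynamicBatcher:
    """Collect concurrent requests and process them together as one batch."""

    def __init__(self, batch_fn, max_batch_size: int = 8, max_wait_s: float = 0.01):
        self.batch_fn = batch_fn              # list of prompts -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def submit(self, prompt: str):
        """Enqueue one prompt and wait until its batch has been processed."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def run(self):
        """Background task: drain the queue into batches until cancelled."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            try:
                outputs = self.batch_fn([prompt for prompt, _ in batch])
                for (_, future), output in zip(batch, outputs):
                    future.set_result(output)
            except Exception as exc:  # report the failure to every waiting caller
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
```

The two limits trade latency for throughput: a larger `max_wait_s` fills batches more often but adds delay to every request that arrives alone.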
| ### API Schema | |
| ``` python | |
| from pydantic import BaseModel | |
| class GenerateRequest(BaseModel): | |
|     prompt: str | |
|     max_tokens: int = 512 | |
|     temperature: float = 0.7 | |
|     top_p: float = 0.9 | |
| class GenerateResponse(BaseModel): | |
|     generated_text: str | |
|     tokens_generated: int | |
|     inference_time_ms: float | |
| ``` | |
| ### Deliverables | |
| **File:** `src/api/inference_server.py` | |
| ``` python | |
| from fastapi import FastAPI | |
| from pydantic import BaseModel | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| # GenerateRequest and GenerateResponse are the classes from the API Schema section | |
| app = FastAPI() | |
| # Load model on startup | |
| @app.on_event("startup") | |
| async def load_model(): | |
|     # Load your quantized model | |
|     pass | |
| @app.post("/generate") | |
| async def generate(request: GenerateRequest): | |
|     # Generate response | |
|     pass | |
| @app.get("/health") | |
| async def health(): | |
|     return {"status": "healthy", "gpu": torch.cuda.is_available()} | |
| ``` | |
| **File:** `Dockerfile` | |
| ``` dockerfile | |
| FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 | |
| WORKDIR /app | |
| RUN apt-get update && apt-get install -y python3 python3-pip | |
| COPY requirements.txt . | |
| RUN pip3 install -r requirements.txt | |
| COPY src/ ./src/ | |
| COPY models/ ./models/ | |
| EXPOSE 8000 | |
| CMD ["uvicorn", "src.api.inference_server:app", "--host", "0.0.0.0", "--port", "8000"] | |
| ``` | |
| **File:** `docker-compose.yml` | |
| ``` yaml | |
| version: '3.8' | |
| services: | |
|   llm-api: | |
|     build: . | |
|     ports: | |
|       - "8000:8000" | |
|     deploy: | |
|       resources: | |
|         reservations: | |
|           devices: | |
|             - driver: nvidia | |
|               count: 1 | |
|               capabilities: [gpu] | |
|     environment: | |
|       - CUDA_VISIBLE_DEVICES=0 | |
| ``` | |
| **File:** `k8s/deployment.yaml` | |
| ``` yaml | |
| apiVersion: apps/v1 | |
| kind: Deployment | |
| metadata: | |
|   name: llm-inference | |
| spec: | |
|   replicas: 1 | |
|   selector: | |
|     matchLabels: | |
|       app: llm-inference | |
|   template: | |
|     metadata: | |
|       labels: | |
|         app: llm-inference | |
|     spec: | |
|       containers: | |
|         - name: llm-api | |
|           image: llm-production-lab:latest | |
|           resources: | |
|             limits: | |
|               nvidia.com/gpu: 1 | |
|           ports: | |
|             - containerPort: 8000 | |
| ``` | |
| ### Testing Commands | |
| ``` bash | |
| # Build and run | |
| docker-compose up --build | |
| # Test health | |
| curl http://localhost:8000/health | |
| # Test generation | |
| curl -X POST http://localhost:8000/generate \ | |
|   -H "Content-Type: application/json" \ | |
|   -d '{"prompt": "What is LoRA?", "max_tokens": 100}' | |
| ``` | |
| ### Grading Criteria | |
| - [ ] FastAPI server starts | |
| - [ ] `/health` endpoint responds | |
| - [ ] `/generate` returns valid JSON | |
| - [ ] Docker image builds successfully | |
| - [ ] Container runs with GPU access | |
| - [ ] Kubernetes manifest valid (passes `kubectl apply --dry-run=client -f k8s/deployment.yaml`) | |
| --- | |
| ## Phase 7: Documentation | |
| ### Required: `README.md` | |
| Must include: | |
| 1. **Architecture Overview** (ASCII diagram acceptable) | |
| ``` | |
| Raw Data → Clean → Tokenize → Train (LoRA) → Quantize → Deploy | |
|    ↓         ↓        ↓            ↓            ↓         ↓ | |
| Phase 2   Phase 2  Phase 2      Phase 3      Phase 4   Phase 6 | |
| ``` | |
| 2. **Setup Instructions** | |
| - Prerequisites | |
| - Installation steps | |
| - Environment variables needed | |
| 3. **Performance Benchmarks Table** | |
| | Model Variant | VRAM (GB) | Tokens/sec | Perplexity | | |
| |--------------|-----------|------------|------------| | |
| | Base FP16 | 5.4 | 45 | 8.2 | | |
| | LoRA 4-bit | 3.8 | 52 | 8.4 | | |
| | GPTQ 4-bit | 2.1 | 67 | 9.1 | | |
| 4. **Trade-offs Documented** | |
| - Why LoRA vs full fine-tuning? | |
| - Why GPTQ vs AWQ? | |
| - Batching strategy chosen | |
| 5. **Known Limitations** | |
| - Model size constraints | |
| - Inference latency under load | |
| - Hallucination detection false positive rate | |
| --- | |
| ## 📋 Final Submission Checklist | |
| Your repository should contain: | |
| ``` | |
| llm-production-lab/ | |
| ├── README.md                     # Required documentation | |
| ├── requirements.txt              # All dependencies | |
| ├── Dockerfile                    # Multi-stage build | |
| ├── docker-compose.yml            # GPU-enabled compose | |
| ├── configs/ | |
| │   └── lora_config.json          # Exported LoRA config | |
| ├── src/ | |
| │   ├── verify_gpu.py             # Phase 1 verification | |
| │   ├── data/ | |
| │   │   └── prepare_dataset.py    # Phase 2 | |
| │   ├── training/ | |
| │   │   └── train_lora.py         # Phase 3 | |
| │   ├── optimization/ | |
| │   │   └── quantize.py           # Phase 4 | |
| │   ├── evaluation/ | |
| │   │   └── evaluate.py           # Phase 5 | |
| │   └── api/ | |
| │       └── inference_server.py   # Phase 6 | |
| ├── data/ | |
| │   └── processed/                # Gitignore large files | |
| │       └── sample/               # Include 5 sample records | |
| ├── models/ | |
| │   └── lora-adapter/             # Gitignore, add README | |
| ├── k8s/ | |
| │   └── deployment.yaml           # Phase 6 | |
| └── results/ | |
|     ├── optimization_report.json  # Phase 4 | |
|     └── evaluation_report.json    # Phase 5 | |
| ``` | |
| ### .gitignore Recommendations | |
| ``` | |
| data/processed/* | |
| !data/processed/sample/ | |
| models/lora-adapter/* | |
| !models/lora-adapter/README.md | |
| # Keep results/*.json tracked: both reports are required deliverables | |
| __pycache__/ | |
| *.pyc | |
| .ipynb_checkpoints/ | |
| ``` | |
| --- | |
| ## 🎯 Grading Rubric | |
| | Phase | Weight | Criteria | | |
| |-------|--------|----------| | |
| | Data Prep | 10% | Clean dataset, proper splits | | |
| | LoRA Training | 20% | Model trains, loss decreases, adapters saved | | |
| | Quantization | 25% | GPTQ working, benchmarks reported | | |
| | Evaluation | 20% | Hallucination detection implemented | | |
| | Deployment | 15% | Container builds, API responds | | |
| | Documentation | 10% | README complete, trade-offs explained | | |
| **Total: 100 points** | |
| - 90-100: Exam-ready (strong understanding of all topics) | |
| - 80-89: Good (minor gaps, solid foundation) | |
| - 70-79: Adequate (some areas need review) | |
| - Below 70: Needs revision | |
| --- | |
| ## 📚 References | |
| - **LoRA:** Hu et al. (2021) - "LoRA: Low-Rank Adaptation of Large Language Models" | |
| - **QLoRA:** Dettmers et al. (2023) - "QLoRA: Efficient Finetuning of Quantized LLMs" | |
| - **GPTQ:** Frantar et al. (2022) - "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" | |
| - **Self-Consistency:** Wang et al. (2022) - "Self-Consistency Improves Chain of Thought Reasoning in Language Models" | |
| --- | |
| ## 🚀 Bonus Challenges (Optional) | |
| 1. **DeepSpeed Integration:** Add ZeRO-3 for distributed training support | |
| 2. **Speculative Decoding:** Implement a draft model for faster inference (up to ~2x, depending on how often draft tokens are accepted) | |
| 3. **vLLM Integration:** Replace transformers with vLLM for production serving | |
| 4. **RLHF/DPO:** Add preference tuning step after SFT | |
| 5. **Multi-GPU:** Scale training across multiple GPUs with FSDP | |
| --- | |
| **Good luck! Push your repo when ready and I'll provide detailed feedback.** |