NVIDIA GenAI LLM Certification Lab

NVIDIA has released a GenAI LLM Certification Lab that guides developers through building a production-ready fine-tuning and optimization pipeline. The lab covers data preparation, LoRA fine-tuning with 4-bit quantization, GPTQ quantization, hallucination evaluation, and deployment with Docker/Kubernetes. It is designed to take 3-4 hours and includes exam topics such as fine-tuning, model optimization, and GPU acceleration.

**Production-Ready LLM Fine-Tuning & Optimization Pipeline**

Estimated Time: 3-4 hours

Exam Topics Covered:

- Fine-Tuning (13%)
- Model Optimization (17%)
- Deployment (9%)
- GPU Acceleration (14%)
- Data Preparation (9%)
- Prompt Engineering (13%)

---

## Lab Objectives

Build an end-to-end pipeline that:

1. Prepares and cleans domain-specific training data
2. Fine-tunes a base model using LoRA with 4-bit quantization
3. Applies post-training quantization (GPTQ)
4. Evaluates for hallucinations using self-consistency
5. Packages for production deployment with Docker/Kubernetes

---

## Phase 1: Environment Setup (15 min)

### Requirements

- Python 3.10+
- CUDA-capable GPU (8GB+ VRAM recommended)
- Git

### Dependencies

```bash
pip install torch transformers peft bitsandbytes datasets accelerate evaluate
pip install fastapi uvicorn pydantic
pip install auto-gptq optimum  # For GPTQ quantization
# Optional: pip install wandb  # For experiment tracking
```

### Verification Script

Create `src/verify_gpu.py`:

```python
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
```

**Success Criteria:** GPU detected, CUDA working.

---

## Phase 2: Dataset Preparation (Exam: 9% - Data Preparation)

### Task

Create a domain-specific instruction-following dataset with proper cleaning and validation.

### Requirements

1. Source or create 500+ examples in a specific domain:
   - Technical support Q&A
   - Medical information
   - Legal document summarization
   - Code explanation
2. Data cleaning pipeline must:
   - Remove exact duplicates
   - Filter by token length (min: 20 tokens, max: 2048 tokens)
   - Validate JSON structure
   - Split: Train 80% / Val 10% / Test 10%
3. Format as Alpaca-style JSON:

```json
{
  "instruction": "Explain the concept...",
  "input": "Additional context...",
  "output": "The answer..."
}
```

### Deliverable

File: `src/data/prepare_dataset.py`

Required Functions:

```python
import pandas as pd


def load_and_clean_data(raw_path: str) -> pd.DataFrame:
    """Load raw data, remove duplicates, filter length."""
    pass


def tokenize_dataset(examples, tokenizer, max_length: int = 512):
    """Tokenize with proper padding/truncation."""
    pass


def save_splits(train, val, test, output_dir: str):
    """Save to disk in HuggingFace datasets format."""
    pass
```

Output Structure:

```
data/
├── raw/              # Original data files
└── processed/
    ├── train/        # HuggingFace dataset format
    ├── validation/
    └── test/
```

### Grading Criteria

- Dataset loads without errors
- Duplicates removed (report count before/after)
- Token length filtering implemented
- Splits are stratified (balanced distribution)
- Tokenization handles padding correctly

---

## Phase 3: LoRA Fine-Tuning (Exam: 13% - Fine-Tuning)

### Task

Fine-tune a base model using LoRA with QLoRA (4-bit quantization).
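As a quick illustration of why LoRA is cheap to train: it freezes a weight matrix `W` of shape `d_out × d_in` and learns two small matrices, `B` (`d_out × r`) and `A` (`r × d_in`), so that the effective weight becomes `W + (lora_alpha / r) · B A`. The sketch below only does the arithmetic for the rank and scaling values used in this lab (`r=16`, `lora_alpha=32`). The 4096×4096 projection size and the helper names are illustrative assumptions, not part of the lab spec.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: B is (d_out x r), A is (r x d_in)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return lora_alpha / r


# Assumed example: a single square attention projection of width 4096.
d_in = d_out = 4096
full_params = d_in * d_out
adapter_params = lora_trainable_params(d_in, d_out, r=16)
fraction = adapter_params / full_params

if __name__ == "__main__":
    print(f"Full projection parameters: {full_params}")
    print(f"LoRA adapter parameters:    {adapter_params}")
    print(f"Trainable fraction:         {fraction:.4%}")
    print(f"Update scaling (alpha / r): {lora_scaling(32, 16)}")
```

Under these assumptions the adapter trains well under 1% of the parameters of the projection it wraps, which is what makes the 4-bit base plus LoRA combination fit on a small GPU.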
### Model Selection

Choose one:

- `microsoft/phi-2` (2.7B, fast)
- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (1.1B, fastest)
- `meta-llama/Llama-2-7b-hf` (requires HuggingFace token)

### LoRA Configuration

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,            # Rank
    lora_alpha=32,   # Scaling
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```

### Quantization Configuration

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
```

### Training Configuration

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch = 16
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
    gradient_checkpointing=True,     # Memory efficiency
    optim="paged_adamw_8bit",        # QLoRA optimizer
)
```

### Deliverable

File: `src/training/train_lora.py`

Requirements:

- Load base model with quantization
- Apply LoRA adapters
- Train for 3 epochs
- Save adapter weights to `models/lora-adapter/`
- Export LoRA config to `configs/lora_config.json`
- Log training metrics (loss per epoch)

Expected Outputs:

```
models/
└── lora-adapter/
    ├── adapter_config.json
    ├── adapter_model.bin
    └── README.md
```

### Grading Criteria

- Model loads in 4-bit mode
- LoRA applied to correct modules
- Training completes without OOM
- Loss decreases over epochs
- Adapter files saved correctly

---

## Phase 4: Model Optimization (Exam: 17% - Model Optimization)

### Task

Apply post-training quantization and benchmark performance.
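For the benchmarking half of this task, one possible shape for a warm-up-then-time loop is sketched below. It is not the lab's reference implementation: `benchmark_generation` and `generate_fn` are hypothetical names, and `generate_fn` is assumed to run generation for one prompt and return the number of new tokens it produced, so the same loop can wrap any of the model variants.

```python
import time
from typing import Callable, Dict, List


def benchmark_generation(
    generate_fn: Callable[[str], int],
    prompts: List[str],
    warmup_runs: int = 1,
) -> Dict[str, float]:
    """Measure throughput of generate_fn over prompts.

    generate_fn is a hypothetical callable: it takes a prompt and returns the
    number of tokens generated. On a GPU it should synchronize (for example
    with torch.cuda.synchronize()) before returning, so that asynchronous
    kernels are included in the measured time.
    """
    if not prompts:
        raise ValueError("prompts must not be empty")

    # Warm-up calls are not timed: they absorb one-off costs such as
    # lazy initialization and cache population.
    for _ in range(warmup_runs):
        generate_fn(prompts[0])

    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += generate_fn(prompt)
    elapsed = time.perf_counter() - start

    tokens_per_sec = total_tokens / elapsed if elapsed > 0 else float("inf")
    return {
        "total_tokens": float(total_tokens),
        "seconds": elapsed,
        "tokens_per_sec": tokens_per_sec,
    }
```

Peak GPU memory is a separate measurement; with PyTorch it could be read from `torch.cuda.max_memory_allocated()` after the timed loop.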
### Quantization Method

Implement GPTQ 4-bit quantization on the merged model (base + LoRA).

### Steps

1. Merge LoRA weights into base model
2. Quantize to 4-bit using GPTQ
3. Compare three variants:

| Variant | Memory | Speed | Notes |
|---------|--------|-------|-------|
| Base FP16 | Baseline | Baseline | Original model |
| LoRA 4-bit base | ~6GB | Fast | QLoRA during training |
| GPTQ 4-bit | ~4GB | Fastest | Merged + quantized |

### Deliverable

File: `src/optimization/quantize.py`

Required Functions:

```python
import pandas as pd


def merge_and_quantize_gptq(
    base_model_path: str,
    adapter_path: str,
    output_path: str,
    bits: int = 4,
):
    """Merge LoRA and apply GPTQ quantization."""
    pass


def benchmark_inference(model, tokenizer, test_prompts: list) -> dict:
    """Return memory usage and tokens/sec."""
    pass


def compare_models(models_dict: dict, test_prompts: list) -> pd.DataFrame:
    """Compare all variants, return results table."""
    pass
```

Output: `results/optimization_report.json`

```json
{
  "base_fp16": {"memory_gb": 5.4, "tokens_per_sec": 45.2},
  "lora_4bit": {"memory_gb": 3.8, "tokens_per_sec": 52.1},
  "gptq_4bit": {"memory_gb": 2.1, "tokens_per_sec": 67.3}
}
```

### Grading Criteria

- GPTQ quantization executes without error
- Memory usage measured correctly
- Speed benchmark implemented (warm-up + timed)
- Comparison table generated
- Trade-offs documented

---

## Phase 5: Evaluation & Hallucination Detection (Exam: 13% - Prompt Engineering)

### Task

Build an evaluation pipeline that detects hallucinations using self-consistency.

### Hallucination Detection Strategy

1. Self-Consistency: Generate 3 responses per prompt, measure agreement
2. NLI Verification: Check if output contradicts source context
3. Perplexity Scoring: High perplexity = potential hallucination

### Test Prompts

Create 50 prompts across categories:

- Factual questions (likely to hallucinate dates/names)
- Mathematical reasoning
- Recent events (post-training knowledge cutoff)
- Counterfactuals

### Deliverable

File: `src/evaluation/evaluate.py`

Required Functions:

```python
def generate_multiple_responses(
    model, tokenizer, prompt: str,
    n: int = 3, temperature: float = 0.7
) -> list:
    """Generate n diverse responses."""
    pass


def calculate_self_consistency(responses: list) -> float:
    """Return agreement score 0-1 using embeddings."""
    pass


def detect_hallucination_nli(
    context: str, generated: str, nli_model
) -> bool:
    """Use NLI model to detect contradictions."""
    pass


def calculate_perplexity(model, tokenizer, text: str) -> float:
    """Calculate perplexity on text."""
    pass


def run_evaluation(model_path: str, test_data_path: str) -> dict:
    """Main evaluation loop, returns full report."""
    pass
```

Output: `results/evaluation_report.json`

```json
{
  "overall_metrics": {
    "avg_perplexity": 12.4,
    "avg_self_consistency": 0.78,
    "hallucination_rate": 0.15
  },
  "per_category": {
    "factual": {"hallucination_rate": 0.22},
    "math": {"hallucination_rate": 0.08},
    "recent_events": {"hallucination_rate": 0.35}
  },
  "examples": [
    {
      "prompt": "Who won the 2024 US Presidential election?",
      "responses": [...],
      "consistency": 0.33,
      "flagged": true
    }
  ]
}
```

### Grading Criteria

- 50 test prompts created
- Self-consistency implemented (embeddings or n-gram overlap)
- Perplexity calculation correct
- Hallucination rate computed per category
- Specific failure cases documented

---

## Phase 6: Deployment Packaging (Exam: 9% - Model Deployment)

### Task

Containerize the inference API for production deployment.

### API Requirements

Create FastAPI server with:

- `POST /generate` - Text generation endpoint
- `GET /health` - Health check
- Request validation with Pydantic
- Dynamic batching for concurrent requests
- Proper error handling

### API Schema

```python
from pydantic import BaseModel


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7
    top_p: float = 0.9


class GenerateResponse(BaseModel):
    generated_text: str
    tokens_generated: int
    inference_time_ms: float
```

### Deliverables

File: `src/api/inference_server.py`

```python
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()


# Load model on startup
@app.on_event("startup")
async def load_model():
    # Load your quantized model
    pass


@app.post("/generate")
async def generate(request: GenerateRequest):
    # Generate response
    pass


@app.get("/health")
async def health():
    return {"status": "healthy", "gpu": torch.cuda.is_available()}
```

File: `Dockerfile`

```dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

WORKDIR /app

RUN apt-get update && apt-get install -y python3 python3-pip

COPY requirements.txt .
RUN pip3 install -r requirements.txt

COPY src/ ./src/
COPY models/ ./models/

EXPOSE 8000

CMD ["uvicorn", "src.api.inference_server:app", "--host", "0.0.0.0", "--port", "8000"]
```

File: `docker-compose.yml`

```yaml
version: '3.8'
services:
  llm-api:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0
```

File: `k8s/deployment.yaml`

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: llm-api
          image: llm-production-lab:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
```

### Testing Commands

```bash
# Build and run
docker-compose up --build

# Test health
curl http://localhost:8000/health

# Test generation
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is LoRA?", "max_tokens": 100}'
```

### Grading Criteria

- FastAPI server starts
- `/health` endpoint responds
- `/generate` returns valid JSON
- Docker image builds successfully
- Container runs with GPU access
- Kubernetes manifest valid (can `kubectl apply --dry-run`)

---

## Phase 7: Documentation

### Required: README.md

Must include:

1. Architecture Overview (ASCII diagram acceptable)

   ```
   Raw Data → Clean → Tokenize → Train LoRA → Quantize → Deploy
      ↓         ↓        ↓           ↓           ↓         ↓
   Phase 2   Phase 2  Phase 2     Phase 3     Phase 4   Phase 6
   ```

2. Setup Instructions
   - Prerequisites
   - Installation steps
   - Environment variables needed
3. Performance Benchmarks Table

   | Model Variant | VRAM (GB) | Tokens/sec | Perplexity |
   |---------------|-----------|------------|------------|
   | Base FP16 | 5.4 | 45 | 8.2 |
   | LoRA 4-bit | 3.8 | 52 | 8.4 |
   | GPTQ 4-bit | 2.1 | 67 | 9.1 |

4. Trade-offs Documented
   - Why LoRA vs full fine-tuning?
   - Why GPTQ vs AWQ?
   - Batching strategy chosen
5. Known Limitations
   - Model size constraints
   - Inference latency under load
   - Hallucination detection false positive rate

---

## 📋 Final Submission Checklist

Your repository should contain:

```
llm-production-lab/
├── README.md                       # Required documentation
├── requirements.txt                # All dependencies
├── Dockerfile                      # Multi-stage build
├── docker-compose.yml              # GPU-enabled compose
├── configs/
│   └── lora_config.json            # Exported LoRA config
├── src/
│   ├── verify_gpu.py               # Phase 1 verification
│   ├── data/
│   │   └── prepare_dataset.py      # Phase 2
│   ├── training/
│   │   └── train_lora.py           # Phase 3
│   ├── optimization/
│   │   └── quantize.py             # Phase 4
│   ├── evaluation/
│   │   └── evaluate.py             # Phase 5
│   └── api/
│       └── inference_server.py     # Phase 6
├── data/
│   └── processed/                  # Gitignore large files
│       └── sample/                 # Include 5 sample records
├── models/
│   └── lora-adapter/               # Gitignore, add README
├── k8s/
│   └── deployment.yaml             # Phase 6
└── results/
    ├── optimization_report.json    # Phase 4
    └── evaluation_report.json      # Phase 5
```

### .gitignore Recommendations

```
data/processed/*
!data/processed/sample/
models/lora-adapter/*
!models/lora-adapter/README.md
results/*.json
__pycache__/
*.pyc
.ipynb_checkpoints/
```

---

## 🎯 Grading Rubric

| Phase | Weight | Criteria |
|-------|--------|----------|
| Data Prep | 10% | Clean dataset, proper splits |
| LoRA Training | 20% | Model trains, loss decreases, adapters saved |
| Quantization | 25% | GPTQ working, benchmarks reported |
| Evaluation | 20% | Hallucination detection implemented |
| Deployment | 15% | Container builds, API responds |
| Documentation | 10% | README complete, trade-offs explained |

Total: 100 points

- 90-100: Exam-ready (strong understanding of all topics)
- 80-89: Good (minor gaps, solid foundation)
- 70-79: Adequate (some areas need review)
- Below 70: Needs revision

---

## 📚 References

- LoRA: Hu et al. (2021), "LoRA: Low-Rank Adaptation of Large Language Models"
- QLoRA: Dettmers et al. (2023), "QLoRA: Efficient Finetuning of Quantized LLMs"
- GPTQ: Frantar et al. (2022), "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"
- Self-Consistency: Wang et al. (2022), "Self-Consistency Improves Chain of Thought Reasoning in Language Models"

---

## 🚀 Bonus Challenges (Optional)

1. DeepSpeed Integration: Add ZeRO-3 for distributed training support
2. Speculative Decoding: Implement draft model for 2x inference speed
3. vLLM Integration: Replace transformers with vLLM for production serving
4. RLHF/DPO: Add preference tuning step after SFT
5. Multi-GPU: Scale training across multiple GPUs with FSDP

---

Good luck! Push your repo when ready and I'll provide detailed feedback.
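As a starting point for the Phase 5 self-consistency score, agreement between sampled responses can be approximated without an embedding model by averaging pairwise n-gram overlap. The sketch below uses Jaccard similarity over lowercase word bigrams. The function names and the whitespace tokenization are assumptions made for illustration, and an embedding-based score would be a drop-in alternative.

```python
from itertools import combinations
from typing import List, Set, Tuple


def _ngrams(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Too short for a full n-gram: fall back to the whole token tuple.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def self_consistency_ngram(responses: List[str], n: int = 2) -> float:
    """Mean pairwise Jaccard overlap of n-gram sets, in the range 0-1."""
    if len(responses) < 2:
        return 1.0  # A single response trivially agrees with itself.

    scores = []
    for a, b in combinations(responses, 2):
        grams_a, grams_b = _ngrams(a, n), _ngrams(b, n)
        union = grams_a | grams_b
        # Two empty responses are treated as identical.
        scores.append(len(grams_a & grams_b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```

A prompt could then be flagged when the score falls below a threshold chosen on the validation prompts. Surface overlap penalizes paraphrases, so this variant tends to over-flag compared with embeddings.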