NVIDIA GenAI LLM Certification Lab


# 🧪 NVIDIA GenAI LLM Certification Lab

## Production-Ready LLM Fine-Tuning & Optimization Pipeline

Estimated Time: 3-4 hours

Exam Topics Covered:

- Fine-Tuning (13%)
- Model Optimization (17%)
- Deployment (9%)
- GPU Acceleration (14%)
- Data Preparation (9%)
- Prompt Engineering (13%)

---

## Lab Objectives

Build an end-to-end pipeline that:

1. Prepares and cleans domain-specific training data
2. Fine-tunes a base model using LoRA with 4-bit quantization
3. Applies post-training quantization (GPTQ)
4. Evaluates for hallucinations using self-consistency
5. Packages for production deployment with Docker/Kubernetes

---

## Phase 1: Environment Setup (15 min)

### Requirements

- Python 3.10+
- CUDA-capable GPU (8GB+ VRAM recommended)
- Git

### Dependencies

```bash
pip install torch transformers peft bitsandbytes datasets accelerate evaluate
pip install fastapi uvicorn pydantic
pip install auto-gptq optimum  # For GPTQ quantization
# Optional: pip install wandb  # For experiment tracking
```

### Verification Script

Create src/verify_gpu.py:

```python
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
```

Success Criteria: GPU detected, CUDA working.

---

## Phase 2: Dataset Preparation (Exam: 9% - Data Preparation)

### Task

Create a domain-specific instruction-following dataset with proper cleaning and validation.

### Requirements

1. Source or create 500+ examples in a specific domain:
   - Technical support Q&A
   - Medical information
   - Legal document summarization
   - Code explanation
2. Data cleaning pipeline must:
   - Remove exact duplicates
   - Filter by token length (min: 20 tokens, max: 2048 tokens)
   - Validate JSON structure
   - Split: Train 80% / Val 10% / Test 10%
3. Format as Alpaca-style JSON:

```json
{
  "instruction": "Explain the concept...",
  "input": "Additional context...",
  "output": "The answer..."
}
```

### Deliverable

File: src/data/prepare_dataset.py

Required Functions:

```python
import pandas as pd


def load_and_clean_data(raw_path: str) -> pd.DataFrame:
    """Load raw data, remove duplicates, filter length."""
    pass


def tokenize_dataset(examples, tokenizer, max_length: int = 512):
    """Tokenize with proper padding/truncation."""
    pass


def save_splits(train, val, test, output_dir: str):
    """Save to disk in HuggingFace datasets format."""
    pass
```

Output Structure:

```
data/
├── raw/              # Original data files
└── processed/
    ├── train/        # HuggingFace dataset format
    ├── validation/
    └── test/
```

### Grading Criteria

- [ ] Dataset loads without errors
- [ ] Duplicates removed (report count before/after)
- [ ] Token length filtering implemented
- [ ] Splits are stratified (balanced distribution)
- [ ] Tokenization handles padding correctly

---

## Phase 3: LoRA Fine-Tuning (Exam: 13% - Fine-Tuning)

### Task

Fine-tune a base model with QLoRA: LoRA adapters trained on top of a 4-bit quantized base model.
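
To get a feel for why LoRA keeps training cheap, the arithmetic can be sketched in a few lines. For a frozen weight of shape `d_out × d_in`, an adapter of rank `r` adds two small matrices with `r * (d_in + d_out)` trainable parameters in total. The hidden size, layer count, and the assumption of square projections here are hypothetical values chosen for illustration, not the dimensions of any model listed in this lab; the rank and alpha match the lab's LoRA configuration.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight the adapter is attached to."""
    return d_in * d_out


# Hypothetical dimensions, for illustration only.
HIDDEN = 2048        # assumed hidden size, with square projections
NUM_LAYERS = 22      # assumed number of transformer blocks
TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj"]

RANK = 16            # r=16, as in the lab's LoRA configuration
ALPHA = 32           # lora_alpha=32, as in the lab's LoRA configuration

per_module = lora_trainable_params(HIDDEN, HIDDEN, RANK)
total_lora = per_module * len(TARGETS) * NUM_LAYERS
total_full = full_params(HIDDEN, HIDDEN) * len(TARGETS) * NUM_LAYERS
fraction = total_lora / total_full
scaling = ALPHA / RANK  # the adapter update is multiplied by lora_alpha / r

print(f"LoRA params per module: {per_module:,}")
print(f"LoRA params total:      {total_lora:,}")
print(f"Fraction of targeted weights trained: {fraction:.4%}")
print(f"Adapter scaling factor: {scaling}")
```

Under these assumptions only about 1.6% of the targeted attention weights are trained, which is what lets the optimizer state fit alongside a 4-bit base model on a small GPU. Real models may use non-square projections (for example with grouped-query attention), so the exact counts will differ.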

### Model Selection

Choose one:

- microsoft/phi-2 (2.7B, fast)
- TinyLlama/TinyLlama-1.1B-Chat-v1.0 (1.1B, fastest)
- meta-llama/Llama-2-7b-hf (requires HuggingFace token)

### LoRA Configuration

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                # Rank
    lora_alpha=32,       # Scaling
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
```

### Quantization Configuration

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
```

### Training Configuration

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch = 16
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,
    gradient_checkpointing=True,     # Memory efficiency
    optim="paged_adamw_8bit",        # Paged 8-bit AdamW, as used with QLoRA
)
```

### Deliverable

File: src/training/train_lora.py

Requirements:

- Load base model with quantization
- Apply LoRA adapters
- Train for 3 epochs
- Save adapter weights to models/lora-adapter/
- Export LoRA config to configs/lora_config.json
- Log training metrics (loss per epoch)

Expected Outputs:

```
models/
└── lora-adapter/
    ├── adapter_config.json
    ├── adapter_model.safetensors   # adapter_model.bin on older PEFT versions
    └── README.md
```

### Grading Criteria

- [ ] Model loads in 4-bit mode
- [ ] LoRA applied to correct modules
- [ ] Training completes without OOM
- [ ] Loss decreases over epochs
- [ ] Adapter files saved correctly

---

## Phase 4: Model Optimization (Exam: 17% - Model Optimization)

### Task

Apply post-training quantization and benchmark performance.

### Quantization Method

Implement GPTQ 4-bit quantization on the merged model (base + LoRA).

### Steps

1. Merge LoRA weights into base model
2. Quantize to 4-bit using GPTQ
3. Compare three variants:

| Variant | Memory | Speed | Notes |
|---------|--------|-------|-------|
| Base FP16 | Baseline | Baseline | Original model |
| LoRA (4-bit base) | ~6GB | Fast | QLoRA during training |
| GPTQ 4-bit | ~4GB | Fastest | Merged + quantized |

### Deliverable

File: src/optimization/quantize.py

Required Functions:

```python
import pandas as pd


def merge_and_quantize_gptq(
    base_model_path: str,
    adapter_path: str,
    output_path: str,
    bits: int = 4
):
    """Merge LoRA and apply GPTQ quantization."""
    pass


def benchmark_inference(model, tokenizer, test_prompts: list) -> dict:
    """Return memory usage and tokens/sec."""
    pass


def compare_models(models_dict: dict, test_prompts: list) -> pd.DataFrame:
    """Compare all variants, return results table."""
    pass
```

Output: results/optimization_report.json

```json
{
  "base_fp16": {"memory_gb": 5.4, "tokens_per_sec": 45.2},
  "lora_4bit": {"memory_gb": 3.8, "tokens_per_sec": 52.1},
  "gptq_4bit": {"memory_gb": 2.1, "tokens_per_sec": 67.3}
}
```

### Grading Criteria

- [ ] GPTQ quantization executes without error
- [ ] Memory usage measured correctly
- [ ] Speed benchmark implemented (warm-up + timed)
- [ ] Comparison table generated
- [ ] Trade-offs documented

---

## Phase 5: Evaluation & Hallucination Detection (Exam: 13% - Prompt Engineering)

### Task

Build an evaluation pipeline that detects hallucinations using self-consistency.

### Hallucination Detection Strategy

1. Self-Consistency: Generate 3 responses per prompt, measure agreement
2. NLI Verification: Check if output contradicts source context
3. Perplexity Scoring: High perplexity = potential hallucination

### Test Prompts

Create 50 prompts across categories:

- Factual questions (likely to hallucinate dates/names)
- Mathematical reasoning
- Recent events (post-training knowledge cutoff)
- Counterfactuals

### Deliverable

File: src/evaluation/evaluate.py

Required Functions:

```python
def generate_multiple_responses(
    model, tokenizer, prompt: str,
    n: int = 3, temperature: float = 0.7
) -> list:
    """Generate n diverse responses."""
    pass


def calculate_self_consistency(responses: list) -> float:
    """Return agreement score 0-1 using embeddings."""
    pass


def detect_hallucination_nli(
    context: str, generated: str, nli_model
) -> bool:
    """Use NLI model to detect contradictions."""
    pass


def calculate_perplexity(model, tokenizer, text: str) -> float:
    """Calculate perplexity on text."""
    pass


def run_evaluation(model_path: str, test_data_path: str) -> dict:
    """Main evaluation loop, returns full report."""
    pass
```

Output: results/evaluation_report.json

```json
{
  "overall_metrics": {
    "avg_perplexity": 12.4,
    "avg_self_consistency": 0.78,
    "hallucination_rate": 0.15
  },
  "per_category": {
    "factual": {"hallucination_rate": 0.22},
    "math": {"hallucination_rate": 0.08},
    "recent_events": {"hallucination_rate": 0.35}
  },
  "examples": [
    {
      "prompt": "Who won the 2024 US Presidential election?",
      "responses": [...],
      "consistency": 0.33,
      "flagged": true
    }
  ]
}
```

### Grading Criteria

- [ ] 50 test prompts created
- [ ] Self-consistency implemented (embeddings or n-gram overlap)
- [ ] Perplexity calculation correct
- [ ] Hallucination rate computed per category
- [ ] Specific failure cases documented

---

## Phase 6: Deployment Packaging (Exam: 9% - Model Deployment)

### Task

Containerize the inference API for production deployment.

### API Requirements

Create FastAPI server with:

- POST /generate - Text generation endpoint
- GET /health - Health check
- Request validation with Pydantic
- Dynamic batching for concurrent requests
- Proper error handling

### API Schema

```python
from pydantic import BaseModel


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7
    top_p: float = 0.9


class GenerateResponse(BaseModel):
    generated_text: str
    tokens_generated: int
    inference_time_ms: float
```

### Deliverables

File: src/api/inference_server.py

```python
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

# GenerateRequest / GenerateResponse are the schema classes from the API Schema section.


# Load model on startup
@app.on_event("startup")
async def load_model():
    # Load your quantized model
    pass


@app.post("/generate")
async def generate(request: GenerateRequest):
    # Generate response
    pass


@app.get("/health")
async def health():
    return {"status": "healthy", "gpu": torch.cuda.is_available()}
```

File: Dockerfile

```dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
WORKDIR /app
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY src/ ./src/
COPY models/ ./models/
EXPOSE 8000
CMD ["uvicorn", "src.api.inference_server:app", "--host", "0.0.0.0", "--port", "8000"]
```

File: docker-compose.yml

```yaml
version: '3.8'
services:
  llm-api:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0
```

File: k8s/deployment.yaml

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: llm-api
          image: llm-production-lab:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
```

### Testing Commands

```bash
# Build and run
docker-compose up --build

# Test health
curl http://localhost:8000/health

# Test generation
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is LoRA?", "max_tokens": 100}'
```

### Grading Criteria

- [ ] FastAPI server starts
- [ ] /health endpoint responds
- [ ] /generate returns valid JSON
- [ ] Docker image builds successfully
- [ ] Container runs with GPU access
- [ ] Kubernetes manifest valid (passes `kubectl apply --dry-run=client`)

---

## Phase 7: Documentation

### Required: README.md

Must include:

1. Architecture Overview (ASCII diagram acceptable)

```
Raw Data → Clean → Tokenize → Train (LoRA) → Quantize → Deploy
   ↓         ↓        ↓           ↓             ↓          ↓
Phase 2   Phase 2  Phase 2     Phase 3       Phase 4    Phase 6
```

2. Setup Instructions
   - Prerequisites
   - Installation steps
   - Environment variables needed
3. Performance Benchmarks Table

| Model Variant | VRAM (GB) | Tokens/sec | Perplexity |
|--------------|-----------|------------|------------|
| Base FP16 | 5.4 | 45 | 8.2 |
| LoRA 4-bit | 3.8 | 52 | 8.4 |
| GPTQ 4-bit | 2.1 | 67 | 9.1 |

4. Trade-offs Documented
   - Why LoRA vs full fine-tuning?
   - Why GPTQ vs AWQ?
   - Batching strategy chosen
5. Known Limitations
   - Model size constraints
   - Inference latency under load
   - Hallucination detection false positive rate

---

## 📋 Final Submission Checklist

Your repository should contain:

```
llm-production-lab/
├── README.md                        # Required documentation
├── requirements.txt                 # All dependencies
├── Dockerfile                       # Multi-stage build
├── docker-compose.yml               # GPU-enabled compose
├── configs/
│   └── lora_config.json             # Exported LoRA config
├── src/
│   ├── verify_gpu.py                # Phase 1 verification
│   ├── data/
│   │   └── prepare_dataset.py       # Phase 2
│   ├── training/
│   │   └── train_lora.py            # Phase 3
│   ├── optimization/
│   │   └── quantize.py              # Phase 4
│   ├── evaluation/
│   │   └── evaluate.py              # Phase 5
│   └── api/
│       └── inference_server.py      # Phase 6
├── data/
│   └── processed/                   # Gitignore large files
│       └── sample/                  # Include 5 sample records
├── models/
│   └── lora-adapter/                # Gitignore, add README
├── k8s/
│   └── deployment.yaml              # Phase 6
└── results/
    ├── optimization_report.json     # Phase 4
    └── evaluation_report.json       # Phase 5
```

### .gitignore Recommendations

```
data/processed/*
!data/processed/sample/
models/lora-adapter/*
!models/lora-adapter/README.md
results/*.json
__pycache__/
*.pyc
.ipynb_checkpoints/
```

---

## 🎯 Grading Rubric

| Phase | Weight | Criteria |
|-------|--------|----------|
| Data Prep | 10% | Clean dataset, proper splits |
| LoRA Training | 20% | Model trains, loss decreases, adapters saved |
| Quantization | 25% | GPTQ working, benchmarks reported |
| Evaluation | 20% | Hallucination detection implemented |
| Deployment | 15% | Container builds, API responds |
| Documentation | 10% | README complete, trade-offs explained |

Total: 100 points

- 90-100: Exam-ready (strong understanding of all topics)
- 80-89: Good (minor gaps, solid foundation)
- 70-79: Adequate (some areas need review)
- Below 70: Needs revision

---

## 📚 References

- LoRA Paper: Hu et al. (2021) - "LoRA: Low-Rank Adaptation of Large Language Models"
- QLoRA: Dettmers et al. (2023) - "QLoRA: Efficient Finetuning of Quantized LLMs"
- GPTQ: Frantar et al. (2022) - "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"
- Self-Consistency: Wang et al. (2022) - "Self-Consistency Improves Chain of Thought Reasoning in Language Models"

---

## 🚀 Bonus Challenges (Optional)

1. DeepSpeed Integration: Add ZeRO-3 for distributed training support
2. Speculative Decoding: Implement draft model for 2x inference speed
3. vLLM Integration: Replace transformers with vLLM for production serving
4. RLHF/DPO: Add preference tuning step after SFT
5. Multi-GPU: Scale training across multiple GPUs with FSDP

---

Good luck! Push your repo when ready and I'll provide detailed feedback.
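
As a possible starting point for the Phase 5 self-consistency check, here is a minimal sketch of the n-gram overlap variant that the grading criteria allow. It scores agreement as the mean pairwise Jaccard similarity of word bigrams. The whitespace tokenizer, the bigram size, the helper names `ngrams` and `jaccard`, and the `FLAG_THRESHOLD` value are all assumptions made for illustration; an embedding-based score would follow the same pairwise-average structure.

```python
from itertools import combinations


def ngrams(text: str, n: int = 2) -> set:
    """Lowercased word n-grams; short texts fall back to a single tuple."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def calculate_self_consistency(responses: list, n: int = 2) -> float:
    """Mean pairwise n-gram Jaccard similarity across responses, in [0, 1]."""
    if len(responses) < 2:
        return 1.0  # a single response cannot disagree with itself
    grams = [ngrams(r, n) for r in responses]
    pairs = list(combinations(grams, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


FLAG_THRESHOLD = 0.5  # hypothetical cut-off; tune on your own test prompts

responses = [
    "the capital is paris",
    "the capital is paris",
    "it might be lyon",
]
score = calculate_self_consistency(responses)
print(f"consistency={score:.2f} flagged={score < FLAG_THRESHOLD}")
```

Surface overlap penalizes paraphrases that agree in meaning, so this version tends to over-flag; that trade-off is worth recording under the hallucination detection false positive rate in the README.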

source & further reading

gist.github.com (original article)
