Salesforce CodeGen Tutorial: Generate, Validate, and Rerank Python Functions With Unit Tests and Safety Checks

Salesforce released a tutorial demonstrating an end-to-end workflow for its CodeGen model, covering loading from Hugging Face, generating Python functions from natural-language prompts, and adding validation steps including syntax checking, static safety checks, unit-test-based validation, best-of-N candidate reranking, and artifact export. The tutorial shows how CodeGen can be used as part of a structured code-generation pipeline that evaluates and filters generated solutions.
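
The validate-then-rerank idea summarized above can be sketched without any model at all. The snippet below is a minimal illustration only: the `CANDIDATES` strings are hypothetical stand-ins for model outputs, and `parses`, `pass_rate`, and `rerank` are illustrative names rather than functions from the tutorial. It also skips the sandboxing that untrusted generated code would need.

```python
import ast

# Hypothetical candidates standing in for model outputs (not produced by CodeGen).
CANDIDATES = [
    "def add(a, b):\n    return a - b\n",   # parses, but fails most tests
    "def add(a, b) return a + b\n",        # syntax error
    "def add(a, b):\n    return a + b\n",   # parses and passes
]
TESTS = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]


def parses(source):
    """Return True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def pass_rate(source, name, tests):
    """Run the tests against the named function and return the fraction passed."""
    namespace = {}
    exec(source, namespace)  # toy input only; real model output needs sandboxing
    fn = namespace[name]
    passed = 0
    for args, expected in tests:
        try:
            passed += int(fn(*args) == expected)
        except Exception:
            pass  # a crashing test simply counts as a failure
    return passed / len(tests)


def rerank(candidates, name, tests):
    """Score each candidate and return (score, source) pairs, best first."""
    scored = []
    for source in candidates:
        score = pass_rate(source, name, tests) if parses(source) else -1.0
        scored.append((score, source))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = rerank(CANDIDATES, "add", TESTS)
best_score, best_source = ranked[0]
print(best_score)
print(best_source)
```

Candidates that do not parse are pushed to the bottom, and the rest are ordered by the fraction of tests they pass.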

In this tutorial, we implement an end-to-end workflow for Salesforce CodeGen (https://github.com/salesforce/CodeGen). We load a CodeGen model from Hugging Face, prepare it for code generation, and use it to generate Python functions from natural-language prompts. We then move beyond basic inference by adding function extraction, syntax checking, static safety checks, unit-test-based validation, best-of-N candidate reranking, multi-step program synthesis, prompt-style experimentation, benchmark visualization, and artifact export. Through this workflow, we learn how CodeGen can be used not only as a code completion model but also as part of a structured code-generation pipeline that evaluates, filters, and organizes generated solutions.

Loading the Salesforce CodeGen Model from Hugging Face

```python
import os, sys, subprocess, textwrap, json, re, time, math, ast, tempfile, multiprocessing as mp
from pathlib import Path

def sh(cmd):
    print(f"\n$ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

sh(f"{sys.executable} -m pip install -q -U transformers accelerate safetensors einops datasets evaluate pandas matplotlib tqdm rich radon tiktoken")

import torch
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from rich import print
from rich.panel import Panel
from rich.syntax import Syntax
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
from radon.complexity import cc_visit

OUT_DIR = Path("/content/codegen_advanced_tutorial")
OUT_DIR.mkdir(parents=True, exist_ok=True)
set_seed(42)

print(Panel.fit("Salesforce CodeGen Advanced Tutorial", style="bold green"))
print("\nRuntime information")
print("Python:", sys.version.split()[0])
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA memory GB:", round(torch.cuda.get_device_properties(0).total_memory / 1e9, 2))

MODEL_ID = os.environ.get("CODEGEN_MODEL_ID", "Salesforce/codegen-350M-mono")
MODEL_OPTIONS = {
    "easy_colab_default": "Salesforce/codegen-350M-mono",
    "larger_codegen1": "Salesforce/codegen-2B-mono",
    "codegen2_1b": "Salesforce/codegen2-1B_P",
    "codegen25_7b_mono": "Salesforce/codegen25-7b-mono_P",
}
print("\nSelected model:", MODEL_ID)
print("Available model examples:", MODEL_OPTIONS)

# CodeGen2 and CodeGen2.5 checkpoints ship custom code on the Hub.
trust_remote_code = any(x in MODEL_ID.lower() for x in ["codegen2", "codegen25"])
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

print("\nLoading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=trust_remote_code)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Loading model...")
load_kwargs = {
    "trust_remote_code": trust_remote_code,
    "low_cpu_mem_usage": True,
}
if torch.cuda.is_available():
    load_kwargs["torch_dtype"] = dtype
    load_kwargs["device_map"] = "auto"
else:
    load_kwargs["torch_dtype"] = torch.float32

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **load_kwargs)
if not torch.cuda.is_available():
    model.to(device)
model.eval()

def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

print(f"Loaded {MODEL_ID}")
print(f"Parameter count: {count_parameters(model)/1e6:.1f}M")

def generate_text(
    prompt,
    max_new_tokens=180,
    temperature=0.35,
    top_p=0.92,
    top_k=50,
    do_sample=True,
    num_return_sequences=1,
    repetition_penalty=1.05,
):
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=do_sample,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            num_return_sequences=num_return_sequences,
            repetition_penalty=repetition_penalty,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return decoded

def print_code(title, code):
    print(Panel.fit(title, style="bold cyan"))
    print(Syntax(code, "python", theme="monokai", line_numbers=True))
```

We install all required libraries and prepare the environment for running Salesforce CodeGen. We check the runtime, detect GPU availability, select the CodeGen model, and load both the tokenizer and model from Hugging Face. We also define helper functions for text generation and for displaying formatted code so that the rest of the tutorial is easier to follow.

Building Extraction, Safety, and Unit-Test Validation Utilities

```python
def extract_function_source(full_text, function_name):
    text = full_text.replace("\r\n", "\n")
    # If the model wrapped its answer in a Markdown fence, keep only the fenced body.
    fence = re.search(r"```(?:python)?\n(.*?)```", text, flags=re.S | re.I)
    if fence:
        text = fence.group(1)
    pattern = rf"^def\s+{re.escape(function_name)}\s*\("
    match = re.search(pattern, text, flags=re.M)
    if not match:
        return ""
    chunk = text[match.start():]
    lines = chunk.splitlines()
    collected = []
    for i, line in enumerate(lines):
        if i > 0:
            # Stop at the next top-level definition, main guard, or assignment.
            if line.startswith("def ") or line.startswith("class "):
                break
            if line.startswith("if __name__"):
                break
            if line and not line.startswith((" ", "\t", "#")) and re.match(r"^[A-Za-z_][A-Za-z0-9_]*\s*=", line):
                break
        collected.append(line)
    source = "\n".join(collected).rstrip()
    try:
        ast.parse(source)
        return source
    except SyntaxError:
        # Fall back to the longest prefix of the collected lines that parses.
        fixed_lines = []
        for line in collected:
            fixed_lines.append(line)
            candidate = "\n".join(fixed_lines).rstrip()
            try:
                ast.parse(candidate)
                source = candidate
            except SyntaxError:
                pass
    return source if source.strip().startswith("def ") else ""

def syntax_ok(source):
    try:
        ast.parse(source)
        return True, ""
    except SyntaxError as e:
        return False, str(e)

FORBIDDEN_NAMES = {
    "eval", "exec", "compile", "open", "input", "__import__", "globals", "locals",
    "vars", "dir", "getattr", "setattr", "delattr", "help", "breakpoint", "exit", "quit"
}

FORBIDDEN_NODES = (
    ast.Import, ast.ImportFrom, ast.Global, ast.Nonlocal, ast.With, ast.AsyncWith,
    ast.AsyncFunctionDef, ast.ClassDef, ast.Delete, ast.Raise,
)

ALLOWED_BUILTINS = {
    "abs": abs, "all": all, "any": any, "bool": bool, "dict": dict, "enumerate": enumerate,
    "float": float, "int": int, "isinstance": isinstance, "len": len, "list": list,
    "map": map, "max": max, "min": min, "pow": pow, "range": range, "reversed": reversed,
    "round": round, "set": set, "sorted": sorted, "str": str, "sum": sum, "tuple": tuple,
    "zip": zip,
}

def static_safety_check(source):
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return False, f"SyntaxError: {e}"
    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN_NODES):
            return False, f"Forbidden AST node: {type(node).__name__}"
        if isinstance(node, ast.Name):
            if node.id in FORBIDDEN_NAMES or node.id.startswith("__"):
                return False, f"Forbidden name: {node.id}"
        if isinstance(node, ast.Attribute):
            if node.attr.startswith("__"):
                return False, f"Forbidden attribute: {node.attr}"
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in FORBIDDEN_NAMES:
                return False, f"Forbidden call: {node.func.id}"
    return True, "passed"

def _worker_run_tests(source, function_name, tests, queue):
    try:
        # Use one shared namespace. With separate globals and locals dicts,
        # functions defined by exec() cannot see each other or recurse by name,
        # which would break recursive solutions and the composed pipeline.
        namespace = {"__builtins__": ALLOWED_BUILTINS}
        compiled = compile(source, "<generated_code>", "exec")
        exec(compiled, namespace)
        fn = namespace.get(function_name)
        if fn is None:
            queue.put({"ok": False, "error": f"{function_name} not found", "passed": 0, "total": len(tests)})
            return
        passed = 0
        details = []
        for test in tests:
            args = test.get("args", [])
            kwargs = test.get("kwargs", {})
            expected = test["expected"]
            result = fn(*args, **kwargs)
            ok = result == expected
            passed += int(ok)
            details.append({
                "args": args,
                "kwargs": kwargs,
                "expected": expected,
                "result": result,
                "ok": ok,
            })
        queue.put({"ok": passed == len(tests), "error": "", "passed": passed, "total": len(tests), "details": details})
    except Exception as e:
        queue.put({"ok": False, "error": repr(e), "passed": 0, "total": len(tests)})

def run_unit_tests_safely(source, function_name, tests, timeout_seconds=3):
    safe, reason = static_safety_check(source)
    if not safe:
        return {"ok": False, "error": reason, "passed": 0, "total": len(tests), "details": []}
    # The "fork" start method is available on Linux (for example, Colab).
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    process = ctx.Process(target=_worker_run_tests, args=(source, function_name, tests, queue))
    process.start()
    process.join(timeout_seconds)
    if process.is_alive():
        process.terminate()
        process.join()
        return {"ok": False, "error": "timeout", "passed": 0, "total": len(tests), "details": []}
    if queue.empty():
        return {"ok": False, "error": "no result returned", "passed": 0, "total": len(tests), "details": []}
    return queue.get()

def code_complexity(source):
    try:
        blocks = cc_visit(source)
        if not blocks:
            return 1
        return max(block.complexity for block in blocks)
    except Exception:
        return None

def score_candidate(source, test_result):
    syntax_score = 1 if syntax_ok(source)[0] else 0
    safety_score = 1 if static_safety_check(source)[0] else 0
    passed = test_result.get("passed", 0)
    total = max(test_result.get("total", 1), 1)
    test_score = passed / total
    complexity = code_complexity(source)
    complexity_penalty = 0 if complexity is None else min(complexity / 20, 0.25)
    return syntax_score + safety_score + 3 * test_score - complexity_penalty
```

We build the utility layer that extracts generated Python functions from raw model outputs. We add syntax validation, static safety checks, restricted execution, unit-test execution, and timeout handling to make generated code easier to evaluate. We also calculate code complexity and create a scoring function to rank generated candidates by correctness, safety, and simplicity.

Generating Code and Defining Benchmark Tasks

```python
print("\n" + "=" * 90)
print("Demo 1: Basic natural-language-to-code completion")
print("=" * 90)

basic_prompt = """# Write a Python function that returns the area of a circle.
# The function should be named circle_area and should accept radius as input.
# Do not print anything. Return the numeric result.

def circle_area(radius):
"""

basic_output = generate_text(
    basic_prompt,
    max_new_tokens=120,
    temperature=0.25,
    do_sample=True,
    num_return_sequences=1,
)[0]
print_code("Raw CodeGen output", basic_output)

circle_source = extract_function_source(basic_output, "circle_area")
print_code("Extracted function", circle_source if circle_source else "# No function extracted")

circle_tests = [
    {"args": [1], "expected": math.pi},
    {"args": [2], "expected": 4 * math.pi},
]

if circle_source:
    print("Syntax:", syntax_ok(circle_source))
    print("Safety:", static_safety_check(circle_source))
    print("Complexity:", code_complexity(circle_source))

print("\n" + "=" * 90)
print("Demo 2: Best-of-N generation with test-based reranking")
print("=" * 90)

TASKS = [
    {
        "name": "factorial",
        "signature": "def factorial(n):",
        "instruction": "Return n factorial for a non-negative integer n. Use 1 for factorial(0).",
        "tests": [
            {"args": [0], "expected": 1},
            {"args": [1], "expected": 1},
            {"args": [5], "expected": 120},
            {"args": [7], "expected": 5040},
        ],
    },
    {
        "name": "is_palindrome",
        "signature": "def is_palindrome(text):",
        "instruction": "Return True if text is a palindrome after removing spaces and ignoring case, otherwise return False.",
        "tests": [
            {"args": ["Race car"], "expected": True},
            {"args": ["hello"], "expected": False},
            {"args": ["Never odd or even"], "expected": True},
        ],
    },
    {
        "name": "fibonacci",
        "signature": "def fibonacci(n):",
        "instruction": "Return the nth Fibonacci number where fibonacci(0)=0 and fibonacci(1)=1.",
        "tests": [
            {"args": [0], "expected": 0},
            {"args": [1], "expected": 1},
            {"args": [8], "expected": 21},
            {"args": [10], "expected": 55},
        ],
    },
    {
        "name": "dedupe_keep_order",
        "signature": "def dedupe_keep_order(items):",
        "instruction": "Return a list with duplicate values removed while preserving the first occurrence order.",
        "tests": [
            {"args": [[1, 2, 1, 3, 2]], "expected": [1, 2, 3]},
            {"args": [["a", "b", "a", "c"]], "expected": ["a", "b", "c"]},
            {"args": [[]], "expected": []},
        ],
    },
]
```

We start with a simple natural-language-to-code generation
example using a circle_area function. We generate raw CodeGen output, extract the function, and inspect its syntax, safety, and complexity. We then define multiple programming tasks that later help us benchmark CodeGen across different function-generation problems.

Best-of-N Candidate Generation and Test-Based Reranking

```python
def build_prompt(task):
    examples = []
    for t in task["tests"][:2]:
        examples.append(f"# Example: {task['name']}({t['args']}) -> {repr(t['expected'])}")
    example_block = "\n".join(examples)
    return f'''# You are writing clean Python 3 code.
# Task: {task["instruction"]}
# Rules:
# - Do not import packages.
# - Do not print anything.
# - Return the answer from the function.
# - Keep the implementation compact and readable.
{example_block}

{task["signature"]}
'''

def generate_candidates_for_task(task, n=3, max_new_tokens=160):
    prompt = build_prompt(task)
    outputs = generate_text(
        prompt,
        max_new_tokens=max_new_tokens,
        temperature=0.45,
        top_p=0.92,
        do_sample=True,
        num_return_sequences=n,
        repetition_penalty=1.07,
    )
    candidates = []
    for i, out in enumerate(outputs):
        source = extract_function_source(out, task["name"])
        syntax_pass, syntax_error = syntax_ok(source) if source else (False, "no source extracted")
        test_result = run_unit_tests_safely(source, task["name"], task["tests"]) if source else {
            "ok": False,
            "error": "no source extracted",
            "passed": 0,
            "total": len(task["tests"]),
            "details": [],
        }
        candidates.append({
            "task": task["name"],
            "candidate_id": i,
            "prompt": prompt,
            "raw_output": out,
            "source": source,
            "syntax_ok": syntax_pass,
            "syntax_error": syntax_error,
            "safety": static_safety_check(source)[0] if source else False,
            "tests_passed": test_result.get("passed", 0),
            "tests_total": test_result.get("total", len(task["tests"])),
            "test_ok": test_result.get("ok", False),
            "test_error": test_result.get("error", ""),
            "complexity": code_complexity(source) if source else None,
            "score": score_candidate(source, test_result) if source else -999,
        })
    candidates = sorted(candidates, key=lambda x: x["score"], reverse=True)
    return candidates

all_candidates = []
best_solutions = {}
CANDIDATES_PER_TASK = 2

for task in tqdm(TASKS, desc="Generating and evaluating"):
    candidates = generate_candidates_for_task(task, n=CANDIDATES_PER_TASK)
    all_candidates.extend(candidates)
    best_solutions[task["name"]] = candidates[0]

results_df = pd.DataFrame([
    {
        "task": c["task"],
        "candidate_id": c["candidate_id"],
        "syntax_ok": c["syntax_ok"],
        "safety": c["safety"],
        "tests_passed": c["tests_passed"],
        "tests_total": c["tests_total"],
        "test_ok": c["test_ok"],
        "complexity": c["complexity"],
        "score": round(c["score"], 3),
        "test_error": c["test_error"],
    }
    for c in all_candidates
]).sort_values(["task", "score"], ascending=[True, False])

print("\nCandidate summary")
display(results_df)

for task_name, best in best_solutions.items():
    print_code(f"Best solution for {task_name}", best["source"] if best["source"] else "# No valid source")
    print({
        "task": task_name,
        "tests_passed": f'{best["tests_passed"]}/{best["tests_total"]}',
        "score": best["score"],
        "test_error": best["test_error"],
    })
```

We create structured prompts for each task and generate multiple candidate solutions using CodeGen. We evaluate each candidate with unit tests, syntax checks, safety checks, complexity analysis, and a scoring system. We then summarize the results in a DataFrame and display the best-generated solution for each task.

Multi-Turn Program Synthesis and Prompt-Style Experiments

```python
print("\n" + "=" * 90)
print("Demo 3: Multi-turn program synthesis")
print("=" * 90)

multi_turn_prompts = [
    {
        "name": "normalize_words",
        "prompt": """# Step 1. Write a Python function normalize_words(text).
# It should lowercase text, remove punctuation characters .,!?:;, and split into words.
# Do not import packages.

def normalize_words(text):
""",
        "tests": [
            {"args": ["Hello, HELLO world!"], "expected": ["hello", "hello", "world"]},
            {"args": ["A test: yes."], "expected": ["a", "test", "yes"]},
        ],
    },
    {
        "name": "word_counts",
        "prompt": """# Step 2. Write a Python function word_counts(words).
# It receives a list of words and returns a dictionary mapping each word to its frequency.
# Do not import packages.

def word_counts(words):
""",
        "tests": [
            {"args": [["a", "b", "a"]], "expected": {"a": 2, "b": 1}},
            {"args": [[]], "expected": {}},
        ],
    },
    {
        "name": "top_word",
        "prompt": """# Step 3. Write a Python function top_word(counts).
# It receives a dictionary of word frequencies.
# Return the word with the highest count.
# If counts is empty, return None.
# If there is a tie, return the alphabetically smallest word.
# Do not import packages.

def top_word(counts):
""",
        "tests": [
            {"args": [{"a": 2, "b": 1}], "expected": "a"},
            {"args": [{"b": 2, "a": 2}], "expected": "a"},
            {"args": [{}], "expected": None},
        ],
    },
]

multi_turn_sources = []
for spec in multi_turn_prompts:
    out = generate_text(
        spec["prompt"],
        max_new_tokens=150,
        temperature=0.35,
        top_p=0.92,
        do_sample=True,
        num_return_sequences=1,
    )[0]
    src = extract_function_source(out, spec["name"])
    res = run_unit_tests_safely(src, spec["name"], spec["tests"]) if src else {"ok": False, "error": "no extraction"}
    multi_turn_sources.append(src)
    print_code(f"Generated {spec['name']}", src if src else "# No source extracted")
    print("Test result:", res)

pipeline_code = "\n\n".join(s for s in multi_turn_sources if s)
pipeline_code += """

def most_common_word(text):
    words = normalize_words(text)
    counts = word_counts(words)
    return top_word(counts)
"""

pipeline_tests = [
    {"args": ["Hello hello, world!"], "expected": "hello"},
    {"args": ["B b a a"], "expected": "a"},
]
pipeline_result = run_unit_tests_safely(pipeline_code, "most_common_word", pipeline_tests)
print_code("Composed multi-turn pipeline", pipeline_code)
print("Pipeline result:", pipeline_result)

print("\n" + "=" * 90)
print("Demo 4: Prompt styles for different CodeGen workflows")
print("=" * 90)

PROMPT_LIBRARY = {
    "docstring_to_code": '''def group_by_first_letter(words):
    """
    Given a list of strings, return a dictionary where keys are first letters
    and values are lists of words beginning with that letter.
    Preserve input order.
    """
''',
    "partial_code_completion": '''def moving_average(values, window):
    result = []
    for i in range(len(values)):
''',
    "test_generation": '''# Write pytest-style tests for this function.

def clamp(x, low, high):
    return max(low, min(x, high))

def test_clamp():
''',
    "refactor_request": '''# Refactor the following code into a clean function called count_positive.

x = [1, -2, 5, 0]
c = 0
for i in x:
    if i > 0:
        c = c + 1
print(c)

def count_positive(values):
''',
}

for name, prompt in PROMPT_LIBRARY.items():
    print("\nWorkflow:", name)
    out = generate_text(
        prompt,
        max_new_tokens=120,
        temperature=0.35,
        top_p=0.92,
        do_sample=True,
        num_return_sequences=1,
    )[0]
    print_code(name, out)
```

We demonstrate multi-turn program synthesis by generating smaller functions that work together as a pipeline. We create functions for word normalization, word counting, and top-word selection, then compose them into a complete most-common-word workflow. We also test different prompt styles such as docstring-to-code, partial completion, test generation, and refactoring.
```python
print("Demo 5: Mini benchmark aggregation and visualization")
print("=" * 90)

benchmark_rows = []
for task in TASKS:
    task_candidates = [c for c in all_candidates if c["task"] == task["name"]]
    best = max(task_candidates, key=lambda x: x["score"])
    pass_at_n = any(c["test_ok"] for c in task_candidates)
    benchmark_rows.append({
        "task": task["name"],
        "best_tests_passed": best["tests_passed"],
        "tests_total": best["tests_total"],
        "best_pass_rate": best["tests_passed"] / max(best["tests_total"], 1),
        "pass_at_n": pass_at_n,
        "best_complexity": best["complexity"],
        "best_score": best["score"],
    })

benchmark_df = pd.DataFrame(benchmark_rows)
display(benchmark_df)

plt.figure(figsize=(9, 4))
plt.bar(benchmark_df["task"], benchmark_df["best_pass_rate"])
plt.ylim(0, 1.05)
plt.ylabel("Best candidate pass rate")
plt.xlabel("Task")
plt.title("CodeGen mini benchmark: best-of-N unit-test pass rate")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()

print("\n" + "=" * 90)
print("Exporting artifacts")
print("=" * 90)

candidates_path = OUT_DIR / "codegen_candidates.jsonl"
summary_path = OUT_DIR / "benchmark_summary.csv"
solutions_path = OUT_DIR / "best_solutions.py"
pipeline_path = OUT_DIR / "multi_turn_pipeline.py"

with open(candidates_path, "w", encoding="utf-8") as f:
    for c in all_candidates:
        serializable = dict(c)
        f.write(json.dumps(serializable, ensure_ascii=False, default=str) + "\n")

benchmark_df.to_csv(summary_path, index=False)

with open(solutions_path, "w", encoding="utf-8") as f:
    f.write("# Best generated solutions from Salesforce CodeGen tutorial\n\n")
    for task_name, best in best_solutions.items():
        f.write(f"# ---- {task_name} ----\n")
        f.write(best["source"] if best["source"] else "# No source generated")
        f.write("\n\n")

with open(pipeline_path, "w", encoding="utf-8") as f:
    f.write(pipeline_code)

print("Saved files:")
print(candidates_path)
print(summary_path)
print(solutions_path)
print(pipeline_path)

print("\n" + "=" * 90)
print("Optional: interactive single-prompt helper")
print("=" * 90)

def codegen_assistant(user_task, function_signature, max_new_tokens=180, candidates=2):
    prompt = f'''# Write clean Python 3 code.
# Task: {user_task}
# Rules:
# - Do not import packages unless absolutely necessary.
# - Do not print anything.
# - Return values from the function.
# - Keep the function readable.

{function_signature}
'''
    outputs = generate_text(
        prompt,
        max_new_tokens=max_new_tokens,
        temperature=0.45,
        top_p=0.92,
        do_sample=True,
        num_return_sequences=candidates,
    )
    extracted = []
    fn_match = re.search(r"def\s+([A-Za-z_][A-Za-z0-9_]*)\s*\(", function_signature)
    fn_name = fn_match.group(1) if fn_match else None
    for i, out in enumerate(outputs):
        src = extract_function_source(out, fn_name) if fn_name else out
        extracted.append(src)
        print_code(f"Candidate {i+1}", src if src else out)
    return extracted

custom_candidates = codegen_assistant(
    user_task="Return the second largest unique number in a list. If fewer than two unique numbers exist, return None.",
    function_signature="def second_largest_unique(values):",
    max_new_tokens=160,
    candidates=2,
)

print("\nTutorial complete.")
print("Tip: change MODEL_ID near the top or set os.environ['CODEGEN_MODEL_ID'] before running to try larger CodeGen variants.")
```

We aggregate benchmark results and visualize the best candidate pass rates across all tasks. We export generated candidates, benchmark summaries, best solutions, and the composed pipeline as reusable files. We finish by adding an interactive helper function that lets us generate new CodeGen solutions from custom user-defined programming tasks.

Conclusion

In conclusion, we built a practical, advanced Salesforce CodeGen tutorial and demonstrated how to turn raw model outputs into more reliable code. We started with simple code completion, then strengthened the workflow with automated extraction, safety checks, unit tests, reranking, multi-turn composition, prompt templates, and benchmark reporting.
Finally, we have a complete mini-framework for experimenting with CodeGen, comparing generated candidates, validating their correctness, and exporting useful results for further analysis or integration into larger code-generation systems.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.