{"slug": "salesforce-codegen-tutorial-generate-validate-and-rerank-python-functions-with", "title": "Salesforce CodeGen Tutorial: Generate, Validate, and Rerank Python Functions With Unit Tests and Safety Checks", "summary": "This tutorial demonstrates an end-to-end workflow for the Salesforce CodeGen model, covering loading from Hugging Face, generating Python functions from natural-language prompts, and adding validation steps including syntax checking, static safety checks, unit-test-based validation, best-of-N candidate reranking, and artifact export. The tutorial shows how CodeGen can be used as part of a structured code-generation pipeline that evaluates and filters generated solutions.", "body_md": "In this tutorial, we implement an end-to-end workflow for [Salesforce CodeGen](https://github.com/salesforce/CodeGen). We load a CodeGen model from Hugging Face, prepare it for code generation, and use it to generate Python functions from natural-language prompts. We then move beyond basic inference by adding function extraction, syntax checking, static safety checks, unit-test-based validation, best-of-N candidate reranking, multi-step program synthesis, prompt-style experimentation, benchmark visualization, and artifact export. 
Through this workflow, we learn how CodeGen can be used not only as a code completion model but also as part of a structured code-generation pipeline that evaluates, filters, and organizes generated solutions.\n\n**Loading the Salesforce CodeGen Model from Hugging Face**\n\n``` python\nimport os, sys, subprocess, textwrap, json, re, time, math, ast, tempfile, multiprocessing as mp\nfrom pathlib import Path\ndef sh(cmd):\n   print(f\"\\n$ {cmd}\")\n   subprocess.run(cmd, shell=True, check=True)\nsh(f\"{sys.executable} -m pip install -q -U transformers accelerate safetensors einops datasets evaluate pandas matplotlib tqdm rich radon tiktoken\")\nimport torch\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom tqdm.auto import tqdm\nfrom rich import print\nfrom rich.panel import Panel\nfrom rich.syntax import Syntax\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, set_seed\nfrom radon.complexity import cc_visit\nOUT_DIR = Path(\"/content/codegen_advanced_tutorial\")\nOUT_DIR.mkdir(parents=True, exist_ok=True)\nset_seed(42)\nprint(Panel.fit(\"Salesforce CodeGen Advanced Tutorial\", style=\"bold green\"))\nprint(\"\\nRuntime information\")\nprint(\"Python:\", sys.version.split()[0])\nprint(\"Torch:\", torch.__version__)\nprint(\"CUDA available:\", torch.cuda.is_available())\nif torch.cuda.is_available():\n   print(\"GPU:\", torch.cuda.get_device_name(0))\n   print(\"CUDA memory GB:\", round(torch.cuda.get_device_properties(0).total_memory / 1e9, 2))\nMODEL_ID = os.environ.get(\"CODEGEN_MODEL_ID\", \"Salesforce/codegen-350M-mono\")\nMODEL_OPTIONS = {\n   \"easy_colab_default\": \"Salesforce/codegen-350M-mono\",\n   \"larger_codegen1\": \"Salesforce/codegen-2B-mono\",\n   \"codegen2_1b\": \"Salesforce/codegen2-1B_P\",\n   \"codegen25_7b_mono\": \"Salesforce/codegen25-7b-mono_P\",\n}\nprint(\"\\nSelected model:\", MODEL_ID)\nprint(\"Available model examples:\", MODEL_OPTIONS)\ntrust_remote_code = any(x in MODEL_ID.lower() for x in [\"codegen2\", 
\"codegen25\"])\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\ndtype = torch.float16 if torch.cuda.is_available() else torch.float32\nprint(\"\\nLoading tokenizer...\")\ntokenizer = AutoTokenizer.from_pretrained(\n   MODEL_ID,\n   trust_remote_code=trust_remote_code\n)\nif tokenizer.pad_token is None:\n   tokenizer.pad_token = tokenizer.eos_token\nprint(\"Loading model...\")\nload_kwargs = {\n   \"trust_remote_code\": trust_remote_code,\n   \"low_cpu_mem_usage\": True,\n}\nif torch.cuda.is_available():\n   load_kwargs[\"torch_dtype\"] = dtype\n   load_kwargs[\"device_map\"] = \"auto\"\nelse:\n   load_kwargs[\"torch_dtype\"] = torch.float32\nmodel = AutoModelForCausalLM.from_pretrained(MODEL_ID, **load_kwargs)\nif not torch.cuda.is_available():\n   model.to(device)\nmodel.eval()\ndef count_parameters(model):\n   return sum(p.numel() for p in model.parameters())\nprint(f\"Loaded {MODEL_ID}\")\nprint(f\"Parameter count: {count_parameters(model)/1e6:.1f}M\")\ndef generate_text(\n   prompt,\n   max_new_tokens=180,\n   temperature=0.35,\n   top_p=0.92,\n   top_k=50,\n   do_sample=True,\n   num_return_sequences=1,\n   repetition_penalty=1.05,\n):\n   inputs = tokenizer(prompt, return_tensors=\"pt\")\n   inputs = {k: v.to(model.device) for k, v in inputs.items()}\n   with torch.no_grad():\n       outputs = model.generate(\n           **inputs,\n           max_new_tokens=max_new_tokens,\n           do_sample=do_sample,\n           temperature=temperature,\n           top_p=top_p,\n           top_k=top_k,\n           num_return_sequences=num_return_sequences,\n           repetition_penalty=repetition_penalty,\n           pad_token_id=tokenizer.eos_token_id,\n           eos_token_id=tokenizer.eos_token_id,\n       )\n   decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)\n   return decoded\ndef print_code(title, code):\n   print(Panel.fit(title, style=\"bold cyan\"))\n   print(Syntax(code, \"python\", theme=\"monokai\", 
line_numbers=True))\n```\n\nWe install all required libraries and prepare the environment for running Salesforce CodeGen. We check the runtime, detect GPU availability, select the CodeGen model, and load both the tokenizer and model from Hugging Face. We also define helper functions for text generation and for displaying formatted code so that the rest of the tutorial is easier to follow.\n\n**Building Extraction, Safety, and Unit-Test Validation Utilities**\n\n``` python\ndef extract_function_source(full_text, function_name):\n   text = full_text.replace(\"\\r\\n\", \"\\n\")\n   fence = re.search(r\"```(?:python)?\\n(.*?)```\", text, flags=re.S | re.I)\n   if fence:\n       text = fence.group(1)\n   pattern = rf\"^def\\s+{re.escape(function_name)}\\s*\\(\"\n   match = re.search(pattern, text, flags=re.M)\n   if not match:\n       return \"\"\n   chunk = text[match.start():]\n   lines = chunk.splitlines()\n   collected = []\n   for i, line in enumerate(lines):\n       if i > 0:\n           if line.startswith(\"def \") or line.startswith(\"class \"):\n               break\n           if line.startswith(\"if __name__\"):\n               break\n           if line and not line.startswith((\" \", \"\\t\", \"#\")) and re.match(r\"^[A-Za-z_][A-Za-z0-9_]*\\s*=\", line):\n               break\n       collected.append(line)\n   source = \"\\n\".join(collected).rstrip()\n   try:\n       ast.parse(source)\n       return source\n   except SyntaxError:\n       fixed_lines = []\n       for line in collected:\n           fixed_lines.append(line)\n           candidate = \"\\n\".join(fixed_lines).rstrip()\n           try:\n               ast.parse(candidate)\n               source = candidate\n           except SyntaxError:\n               pass\n       return source if source.strip().startswith(\"def \") else \"\"\ndef syntax_ok(source):\n   try:\n       ast.parse(source)\n       return True, \"\"\n   except SyntaxError as e:\n       return False, str(e)\nFORBIDDEN_NAMES = {\n   
\"eval\", \"exec\", \"compile\", \"open\", \"input\", \"__import__\",\n   \"globals\", \"locals\", \"vars\", \"dir\", \"getattr\", \"setattr\", \"delattr\",\n   \"help\", \"breakpoint\", \"exit\", \"quit\"\n}\nFORBIDDEN_NODES = (\n   ast.Import,\n   ast.ImportFrom,\n   ast.Global,\n   ast.Nonlocal,\n   ast.With,\n   ast.AsyncWith,\n   ast.AsyncFunctionDef,\n   ast.ClassDef,\n   ast.Delete,\n   ast.Raise,\n)\nALLOWED_BUILTINS = {\n   \"abs\": abs,\n   \"all\": all,\n   \"any\": any,\n   \"bool\": bool,\n   \"dict\": dict,\n   \"enumerate\": enumerate,\n   \"float\": float,\n   \"int\": int,\n   \"isinstance\": isinstance,\n   \"len\": len,\n   \"list\": list,\n   \"map\": map,\n   \"max\": max,\n   \"min\": min,\n   \"pow\": pow,\n   \"range\": range,\n   \"reversed\": reversed,\n   \"round\": round,\n   \"set\": set,\n   \"sorted\": sorted,\n   \"str\": str,\n   \"sum\": sum,\n   \"tuple\": tuple,\n   \"zip\": zip,\n}\ndef static_safety_check(source):\n   try:\n       tree = ast.parse(source)\n   except SyntaxError as e:\n       return False, f\"SyntaxError: {e}\"\n   for node in ast.walk(tree):\n       if isinstance(node, FORBIDDEN_NODES):\n           return False, f\"Forbidden AST node: {type(node).__name__}\"\n       if isinstance(node, ast.Name):\n           if node.id in FORBIDDEN_NAMES or node.id.startswith(\"__\"):\n               return False, f\"Forbidden name: {node.id}\"\n       if isinstance(node, ast.Attribute):\n           if node.attr.startswith(\"__\"):\n               return False, f\"Forbidden attribute: {node.attr}\"\n       if isinstance(node, ast.Call):\n           if isinstance(node.func, ast.Name) and node.func.id in FORBIDDEN_NAMES:\n               return False, f\"Forbidden call: {node.func.id}\"\n   return True, \"passed\"\ndef _worker_run_tests(source, function_name, tests, queue):\n   try:\n       safe_globals = {\"__builtins__\": ALLOWED_BUILTINS}\n       safe_locals = {}\n       compiled = compile(source, \"<generated_code>\", 
\"exec\")\n       exec(compiled, safe_globals, safe_locals)\n       fn = safe_locals.get(function_name) or safe_globals.get(function_name)\n       if fn is None:\n           queue.put({\"ok\": False, \"error\": f\"{function_name} not found\", \"passed\": 0, \"total\": len(tests)})\n           return\n       passed = 0\n       details = []\n       for test in tests:\n           args = test.get(\"args\", [])\n           kwargs = test.get(\"kwargs\", {})\n           expected = test[\"expected\"]\n           result = fn(*args, **kwargs)\n           ok = result == expected\n           passed += int(ok)\n           details.append({\n               \"args\": args,\n               \"kwargs\": kwargs,\n               \"expected\": expected,\n               \"result\": result,\n               \"ok\": ok,\n           })\n       queue.put({\"ok\": passed == len(tests), \"error\": \"\", \"passed\": passed, \"total\": len(tests), \"details\": details})\n   except Exception as e:\n       queue.put({\"ok\": False, \"error\": repr(e), \"passed\": 0, \"total\": len(tests)})\ndef run_unit_tests_safely(source, function_name, tests, timeout_seconds=3):\n   safe, reason = static_safety_check(source)\n   if not safe:\n       return {\"ok\": False, \"error\": reason, \"passed\": 0, \"total\": len(tests), \"details\": []}\n   ctx = mp.get_context(\"fork\")\n   queue = ctx.Queue()\n   process = ctx.Process(target=_worker_run_tests, args=(source, function_name, tests, queue))\n   process.start()\n   process.join(timeout_seconds)\n   if process.is_alive():\n       process.terminate()\n       process.join()\n       return {\"ok\": False, \"error\": \"timeout\", \"passed\": 0, \"total\": len(tests), \"details\": []}\n   if queue.empty():\n       return {\"ok\": False, \"error\": \"no result returned\", \"passed\": 0, \"total\": len(tests), \"details\": []}\n   return queue.get()\ndef code_complexity(source):\n   try:\n       blocks = cc_visit(source)\n       if not blocks:\n           return 
1\n       return max(block.complexity for block in blocks)\n   except Exception:\n       return None\ndef score_candidate(source, test_result):\n   syntax_score = 1 if syntax_ok(source)[0] else 0\n   safety_score = 1 if static_safety_check(source)[0] else 0\n   passed = test_result.get(\"passed\", 0)\n   total = max(test_result.get(\"total\", 1), 1)\n   test_score = passed / total\n   complexity = code_complexity(source)\n   complexity_penalty = 0 if complexity is None else min(complexity / 20, 0.25)\n   return syntax_score + safety_score + 3 * test_score - complexity_penalty\n```\n\nWe build the utility layer that extracts generated Python functions from raw model outputs. We add syntax validation, static safety checks, restricted execution, unit-test execution, and timeout handling to make generated code easier to evaluate. We also calculate code complexity and create a scoring function to rank generated candidates by correctness, safety, and simplicity.\n\n**Generating Code and Defining Benchmark Tasks**\n\n``` python\nprint(\"\\n\" + \"=\" * 90)\nprint(\"Demo 1: Basic natural-language-to-code completion\")\nprint(\"=\" * 90)\nbasic_prompt = \"\"\"# Write a Python function that returns the area of a circle.\n# The function should be named circle_area and should accept radius as input.\n# Do not print anything. 
Return the numeric result.\ndef circle_area(radius):\n\"\"\"\nbasic_output = generate_text(\n   basic_prompt,\n   max_new_tokens=120,\n   temperature=0.25,\n   do_sample=True,\n   num_return_sequences=1,\n)[0]\nprint_code(\"Raw CodeGen output\", basic_output)\ncircle_source = extract_function_source(basic_output, \"circle_area\")\nprint_code(\"Extracted function\", circle_source if circle_source else \"# No function extracted\")\ncircle_tests = [\n   {\"args\": [1], \"expected\": math.pi},\n   {\"args\": [2], \"expected\": 4 * math.pi},\n]\nif circle_source:\n   print(\"Syntax:\", syntax_ok(circle_source))\n   print(\"Safety:\", static_safety_check(circle_source))\n   print(\"Complexity:\", code_complexity(circle_source))\nprint(\"\\n\" + \"=\" * 90)\nprint(\"Demo 2: Best-of-N generation with test-based reranking\")\nprint(\"=\" * 90)\nTASKS = [\n   {\n       \"name\": \"factorial\",\n       \"signature\": \"def factorial(n):\",\n       \"instruction\": \"Return n factorial for a non-negative integer n. 
Use 1 for factorial(0).\",\n       \"tests\": [\n           {\"args\": [0], \"expected\": 1},\n           {\"args\": [1], \"expected\": 1},\n           {\"args\": [5], \"expected\": 120},\n           {\"args\": [7], \"expected\": 5040},\n       ],\n   },\n   {\n       \"name\": \"is_palindrome\",\n       \"signature\": \"def is_palindrome(text):\",\n       \"instruction\": \"Return True if text is a palindrome after removing spaces and ignoring case, otherwise return False.\",\n       \"tests\": [\n           {\"args\": [\"Race car\"], \"expected\": True},\n           {\"args\": [\"hello\"], \"expected\": False},\n           {\"args\": [\"Never odd or even\"], \"expected\": True},\n       ],\n   },\n   {\n       \"name\": \"fibonacci\",\n       \"signature\": \"def fibonacci(n):\",\n       \"instruction\": \"Return the nth Fibonacci number where fibonacci(0)=0 and fibonacci(1)=1.\",\n       \"tests\": [\n           {\"args\": [0], \"expected\": 0},\n           {\"args\": [1], \"expected\": 1},\n           {\"args\": [8], \"expected\": 21},\n           {\"args\": [10], \"expected\": 55},\n       ],\n   },\n   {\n       \"name\": \"dedupe_keep_order\",\n       \"signature\": \"def dedupe_keep_order(items):\",\n       \"instruction\": \"Return a list with duplicate values removed while preserving the first occurrence order.\",\n       \"tests\": [\n           {\"args\": [[1, 2, 1, 3, 2]], \"expected\": [1, 2, 3]},\n           {\"args\": [[\"a\", \"b\", \"a\", \"c\"]], \"expected\": [\"a\", \"b\", \"c\"]},\n           {\"args\": [[]], \"expected\": []},\n       ],\n   },\n]\n```\n\nWe start with a simple natural-language-to-code generation example using a circle area function. We generate raw CodeGen output, extract the function, and inspect its syntax, safety, and complexity. 
We then define multiple programming tasks that later help us benchmark CodeGen across different function-generation problems.\n\n**Best-of-N Candidate Generation and Test-Based Reranking**\n\n``` python\ndef build_prompt(task):\n   examples = []\n   for t in task[\"tests\"][:2]:\n       examples.append(f\"# Example: {task['name']}(*{t['args']}) -> {repr(t['expected'])}\")\n   example_block = \"\\n\".join(examples)\n   return f'''# You are writing clean Python 3 code.\n# Task: {task[\"instruction\"]}\n# Rules:\n# - Do not import packages.\n# - Do not print anything.\n# - Return the answer from the function.\n# - Keep the implementation compact and readable.\n{example_block}\n{task[\"signature\"]}\n'''\ndef generate_candidates_for_task(task, n=3, max_new_tokens=160):\n   prompt = build_prompt(task)\n   outputs = generate_text(\n       prompt,\n       max_new_tokens=max_new_tokens,\n       temperature=0.45,\n       top_p=0.92,\n       do_sample=True,\n       num_return_sequences=n,\n       repetition_penalty=1.07,\n   )\n   candidates = []\n   for i, out in enumerate(outputs):\n       source = extract_function_source(out, task[\"name\"])\n       syntax_pass, syntax_error = syntax_ok(source) if source else (False, \"no source extracted\")\n       test_result = run_unit_tests_safely(source, task[\"name\"], task[\"tests\"]) if source else {\n           \"ok\": False,\n           \"error\": \"no source extracted\",\n           \"passed\": 0,\n           \"total\": len(task[\"tests\"]),\n           \"details\": [],\n       }\n       candidates.append({\n           \"task\": task[\"name\"],\n           \"candidate_id\": i,\n           \"prompt\": prompt,\n           \"raw_output\": out,\n           \"source\": source,\n           \"syntax_ok\": syntax_pass,\n           \"syntax_error\": syntax_error,\n           \"safety\": static_safety_check(source)[0] if source else False,\n           \"tests_passed\": test_result.get(\"passed\", 0),\n           \"tests_total\": 
test_result.get(\"total\", len(task[\"tests\"])),\n           \"test_ok\": test_result.get(\"ok\", False),\n           \"test_error\": test_result.get(\"error\", \"\"),\n           \"complexity\": code_complexity(source) if source else None,\n           \"score\": score_candidate(source, test_result) if source else -999,\n       })\n   candidates = sorted(candidates, key=lambda x: x[\"score\"], reverse=True)\n   return candidates\nall_candidates = []\nbest_solutions = {}\nCANDIDATES_PER_TASK = 2\nfor task in tqdm(TASKS, desc=\"Generating and evaluating\"):\n   candidates = generate_candidates_for_task(task, n=CANDIDATES_PER_TASK)\n   all_candidates.extend(candidates)\n   best_solutions[task[\"name\"]] = candidates[0]\nresults_df = pd.DataFrame([\n   {\n       \"task\": c[\"task\"],\n       \"candidate_id\": c[\"candidate_id\"],\n       \"syntax_ok\": c[\"syntax_ok\"],\n       \"safety\": c[\"safety\"],\n       \"tests_passed\": c[\"tests_passed\"],\n       \"tests_total\": c[\"tests_total\"],\n       \"test_ok\": c[\"test_ok\"],\n       \"complexity\": c[\"complexity\"],\n       \"score\": round(c[\"score\"], 3),\n       \"test_error\": c[\"test_error\"],\n   }\n   for c in all_candidates\n]).sort_values([\"task\", \"score\"], ascending=[True, False])\nprint(\"\\nCandidate summary\")\ndisplay(results_df)\nfor task_name, best in best_solutions.items():\n   print_code(f\"Best solution for {task_name}\", best[\"source\"] if best[\"source\"] else \"# No valid source\")\n   print({\n       \"task\": task_name,\n       \"tests_passed\": f'{best[\"tests_passed\"]}/{best[\"tests_total\"]}',\n       \"score\": best[\"score\"],\n       \"test_error\": best[\"test_error\"],\n   })\n```\n\nWe create structured prompts for each task and generate multiple candidate solutions using CodeGen. We evaluate each candidate with unit tests, syntax checks, safety checks, complexity analysis, and a scoring system. 
We then summarize the results in a DataFrame and display the best generated solution for each task.\n\n**Multi-Turn Program Synthesis and Prompt-Style Experiments**\n\n``` python\nprint(\"\\n\" + \"=\" * 90)\nprint(\"Demo 3: Multi-turn program synthesis\")\nprint(\"=\" * 90)\nmulti_turn_prompts = [\n   {\n       \"name\": \"normalize_words\",\n       \"prompt\": \"\"\"# Step 1.\n# Write a Python function normalize_words(text).\n# It should lowercase text, remove punctuation characters .,!?:;, and split into words.\n# Do not import packages.\ndef normalize_words(text):\n\"\"\",\n       \"tests\": [\n           {\"args\": [\"Hello, HELLO world!\"], \"expected\": [\"hello\", \"hello\", \"world\"]},\n           {\"args\": [\"A test: yes.\"], \"expected\": [\"a\", \"test\", \"yes\"]},\n       ],\n   },\n   {\n       \"name\": \"word_counts\",\n       \"prompt\": \"\"\"# Step 2.\n# Write a Python function word_counts(words).\n# It receives a list of words and returns a dictionary mapping each word to its frequency.\n# Do not import packages.\ndef word_counts(words):\n\"\"\",\n       \"tests\": [\n           {\"args\": [[\"a\", \"b\", \"a\"]], \"expected\": {\"a\": 2, \"b\": 1}},\n           {\"args\": [[]], \"expected\": {}},\n       ],\n   },\n   {\n       \"name\": \"top_word\",\n       \"prompt\": \"\"\"# Step 3.\n# Write a Python function top_word(counts).\n# It receives a dictionary of word frequencies.\n# Return the word with the highest count.\n# If counts is empty, return None.\n# If there is a tie, return the alphabetically smallest word.\n# Do not import packages.\ndef top_word(counts):\n\"\"\",\n       \"tests\": [\n           {\"args\": [{\"a\": 2, \"b\": 1}], \"expected\": \"a\"},\n           {\"args\": [{\"b\": 2, \"a\": 2}], \"expected\": \"a\"},\n           {\"args\": [{}], \"expected\": None},\n       ],\n   },\n]\nmulti_turn_sources = []\nfor spec in multi_turn_prompts:\n   out = generate_text(\n       spec[\"prompt\"],\n       
max_new_tokens=150,\n       temperature=0.35,\n       top_p=0.92,\n       do_sample=True,\n       num_return_sequences=1,\n   )[0]\n   src = extract_function_source(out, spec[\"name\"])\n   res = run_unit_tests_safely(src, spec[\"name\"], spec[\"tests\"]) if src else {\"ok\": False, \"error\": \"no extraction\"}\n   multi_turn_sources.append(src)\n   print_code(f\"Generated {spec['name']}\", src if src else \"# No source extracted\")\n   print(\"Test result:\", res)\npipeline_code = \"\\n\\n\".join([s for s in multi_turn_sources if s])\npipeline_code += \"\"\"\ndef most_common_word(text):\n   words = normalize_words(text)\n   counts = word_counts(words)\n   return top_word(counts)\n\"\"\"\npipeline_tests = [\n   {\"args\": [\"Hello hello, world!\"], \"expected\": \"hello\"},\n   {\"args\": [\"B b a a\"], \"expected\": \"a\"},\n]\npipeline_result = run_unit_tests_safely(pipeline_code, \"most_common_word\", pipeline_tests)\nprint_code(\"Composed multi-turn pipeline\", pipeline_code)\nprint(\"Pipeline result:\", pipeline_result)\nprint(\"\\n\" + \"=\" * 90)\nprint(\"Demo 4: Prompt styles for different CodeGen workflows\")\nprint(\"=\" * 90)\nPROMPT_LIBRARY = {\n   \"docstring_to_code\": '''def group_by_first_letter(words):\n   \"\"\"\n   Given a list of strings, return a dictionary where keys are first letters\n   and values are lists of words beginning with that letter.\n   Preserve input order.\n   \"\"\"\n''',\n   \"partial_code_completion\": '''def moving_average(values, window):\n   result = []\n   for i in range(len(values)):\n''',\n   \"test_generation\": '''# Write pytest-style tests for this function.\ndef clamp(x, low, high):\n   return max(low, min(x, high))\ndef test_clamp():\n''',\n   \"refactor_request\": '''# Refactor the following code into a clean function called count_positive.\n# x = [1, -2, 5, 0]\n# c = 0\n# for i in x:\n#     if i > 0:\n#         c = c + 1\n# print(c)\ndef count_positive(values):\n''',\n}\nfor name, prompt in 
PROMPT_LIBRARY.items():\n   print(\"\\nWorkflow:\", name)\n   out = generate_text(\n       prompt,\n       max_new_tokens=120,\n       temperature=0.35,\n       top_p=0.92,\n       do_sample=True,\n       num_return_sequences=1,\n   )[0]\n   print_code(name, out)\n```\n\nWe demonstrate multi-turn program synthesis by generating smaller functions that work together as a pipeline. We create functions for word normalization, word counting, and top-word selection, then compose them into a complete most-common-word workflow. We also test different prompt styles such as docstring-to-code, partial completion, test generation, and refactoring.\n\n``` python\nprint(\"Demo 5: Mini benchmark aggregation and visualization\")\nprint(\"=\" * 90)\nbenchmark_rows = []\nfor task in TASKS:\n   task_candidates = [c for c in all_candidates if c[\"task\"] == task[\"name\"]]\n   best = max(task_candidates, key=lambda x: x[\"score\"])\n   pass_at_n = any(c[\"test_ok\"] for c in task_candidates)\n   benchmark_rows.append({\n       \"task\": task[\"name\"],\n       \"best_tests_passed\": best[\"tests_passed\"],\n       \"tests_total\": best[\"tests_total\"],\n       \"best_pass_rate\": best[\"tests_passed\"] / max(best[\"tests_total\"], 1),\n       \"pass_at_n\": pass_at_n,\n       \"best_complexity\": best[\"complexity\"],\n       \"best_score\": best[\"score\"],\n   })\nbenchmark_df = pd.DataFrame(benchmark_rows)\ndisplay(benchmark_df)\nplt.figure(figsize=(9, 4))\nplt.bar(benchmark_df[\"task\"], benchmark_df[\"best_pass_rate\"])\nplt.ylim(0, 1.05)\nplt.ylabel(\"Best candidate pass rate\")\nplt.xlabel(\"Task\")\nplt.title(\"CodeGen mini benchmark: best-of-N unit-test pass rate\")\nplt.xticks(rotation=30, ha=\"right\")\nplt.tight_layout()\nplt.show()\nprint(\"\\n\" + \"=\" * 90)\nprint(\"Exporting artifacts\")\nprint(\"=\" * 90)\ncandidates_path = OUT_DIR / \"codegen_candidates.jsonl\"\nsummary_path = OUT_DIR / \"benchmark_summary.csv\"\nsolutions_path = OUT_DIR / 
\"best_solutions.py\"\npipeline_path = OUT_DIR / \"multi_turn_pipeline.py\"\nwith open(candidates_path, \"w\", encoding=\"utf-8\") as f:\n   for c in all_candidates:\n       serializable = dict(c)\n       f.write(json.dumps(serializable, ensure_ascii=False, default=str) + \"\\n\")\nbenchmark_df.to_csv(summary_path, index=False)\nwith open(solutions_path, \"w\", encoding=\"utf-8\") as f:\n   f.write(\"# Best generated solutions from Salesforce CodeGen tutorial\\n\\n\")\n   for task_name, best in best_solutions.items():\n       f.write(f\"# ---- {task_name} ----\\n\")\n       f.write(best[\"source\"] if best[\"source\"] else \"# No source generated\")\n       f.write(\"\\n\\n\")\nwith open(pipeline_path, \"w\", encoding=\"utf-8\") as f:\n   f.write(pipeline_code)\nprint(\"Saved files:\")\nprint(candidates_path)\nprint(summary_path)\nprint(solutions_path)\nprint(pipeline_path)\nprint(\"\\n\" + \"=\" * 90)\nprint(\"Optional: interactive single-prompt helper\")\nprint(\"=\" * 90)\ndef codegen_assistant(user_task, function_signature, max_new_tokens=180, candidates=2):\n   prompt = f'''# Write clean Python 3 code.\n# Task: {user_task}\n# Rules:\n# - Do not import packages unless absolutely necessary.\n# - Do not print anything.\n# - Return values from the function.\n# - Keep the function readable.\n{function_signature}\n'''\n   outputs = generate_text(\n       prompt,\n       max_new_tokens=max_new_tokens,\n       temperature=0.45,\n       top_p=0.92,\n       do_sample=True,\n       num_return_sequences=candidates,\n   )\n   extracted = []\n   fn_match = re.search(r\"def\\s+([A-Za-z_][A-Za-z0-9_]*)\\s*\\(\", function_signature)\n   fn_name = fn_match.group(1) if fn_match else None\n   for i, out in enumerate(outputs):\n       src = extract_function_source(out, fn_name) if fn_name else out\n       extracted.append(src)\n       print_code(f\"Candidate {i+1}\", src if src else out)\n   return extracted\ncustom_candidates = codegen_assistant(\n   user_task=\"Return the second 
largest unique number in a list. If fewer than two unique numbers exist, return None.\",\n   function_signature=\"def second_largest_unique(values):\",\n   max_new_tokens=160,\n   candidates=2,\n)\nprint(\"\\nTutorial complete.\")\nprint(\"Tip: change MODEL_ID near the top or set os.environ['CODEGEN_MODEL_ID'] before running to try larger CodeGen variants.\")\n```\n\nWe aggregate benchmark results and visualize the best candidate pass rates across all tasks. We export generated candidates, benchmark summaries, best solutions, and the composed pipeline as reusable files. We finish by adding an interactive helper function that lets us generate new CodeGen solutions from custom user-defined programming tasks.\n\n**Conclusion**\n\nIn conclusion, we built a practical, advanced Salesforce CodeGen tutorial and demonstrated how to turn raw model outputs into more reliable code. We started with simple code completion, then strengthened the workflow with automated extraction, safety checks, unit tests, reranking, multi-turn composition, prompt templates, and benchmark reporting. Finally, we have a complete mini-framework for experimenting with CodeGen, comparing generated candidates, validating their correctness, and exporting useful results for further analysis or integration into larger code-generation systems.\n\nSana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.", "url": "https://wpnews.pro/news/salesforce-codegen-tutorial-generate-validate-and-rerank-python-functions-with", "canonical_source": "https://www.marktechpost.com/2026/06/18/salesforce-codegen-tutorial-generate-validate-and-rerank-python-functions-with-unit-tests-and-safety-checks/", "published_at": "2026-06-19 02:44:12+00:00", "updated_at": "2026-06-19 03:07:41.498264+00:00", "lang": "en", "topics": ["large-language-models", "generative-ai", "ai-tools", "developer-tools"], "entities": ["Salesforce", "CodeGen", "Hugging Face", "Python"], "alternates": {"html": "https://wpnews.pro/news/salesforce-codegen-tutorial-generate-validate-and-rerank-python-functions-with", "markdown": "https://wpnews.pro/news/salesforce-codegen-tutorial-generate-validate-and-rerank-python-functions-with.md", "text": "https://wpnews.pro/news/salesforce-codegen-tutorial-generate-validate-and-rerank-python-functions-with.txt", "jsonld": "https://wpnews.pro/news/salesforce-codegen-tutorial-generate-validate-and-rerank-python-functions-with.jsonld"}}