{"slug": "open-reproduction-of-deepseek-r1", "title": "Open Reproduction of DeepSeek-R1", "summary": "Hugging Face has released Open R1, an open-source reproduction of DeepSeek's R1 reasoning model, completing the first of three planned steps. The project includes a 350k-trace reasoning dataset distilled from DeepSeek-R1 and the OpenR1-Distill-7B model that matches the performance of DeepSeek's distilled 7B model. The initiative aims to make the full R1 pipeline reproducible and accessible to the broader AI community.", "body_md": "*A fully open reproduction of DeepSeek-R1. This repo is a work in progress, let's build it together!*\n\n**Table of Contents**\n\n- [Overview](#overview)\n- [Plan of attack](#plan-of-attack)\n- [Installation](#installation)\n- [Training models](#training-models)\n- [Evaluating models](#evaluating-models)\n- [Reproducing DeepSeek's evaluation results](#reproducing-deepseeks-evaluation-results)\n- [Data generation](#data-generation)\n- [Contributing](#contributing)\n\nThe goal of this repo is to build the missing pieces of the R1 pipeline such that everybody can reproduce and build on top of it. The project is simple by design and mostly consists of:\n\n- `src/open_r1`: contains the scripts to train models as well as generate synthetic data:\n    - `grpo.py`: trains a model with GRPO on a given dataset.\n    - `sft.py`: performs a simple SFT of a model on a dataset.\n    - `generate.py`: generates synthetic data from a model using [Distilabel](https://github.com/argilla-io/distilabel).\n- `Makefile`: contains easy-to-run commands for each step in the R1 pipeline leveraging the scripts above.\n\nWe will use the DeepSeek-R1 [tech report](https://github.com/deepseek-ai/DeepSeek-R1) as a guide, which can roughly be broken down into three main steps:\n\n- Step 1: replicate the R1-Distill models by distilling a high-quality corpus from DeepSeek-R1.\n- Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. 
This will likely involve curating new, large-scale datasets for math, reasoning, and code.\n- Step 3: show we can go from base model to RL-tuned via multi-stage training.\n\n**🧑🍳 [2025/05/26] (Step 1 completed!)** We release **Mixture-of-Thoughts**, a curated reasoning dataset of 350k verified traces distilled from R1. The dataset spans tasks in mathematics, coding, and science, and is designed to teach language models to reason step-by-step. We also provide a recipe to train [OpenR1-Distill-7B](https://huggingface.co/open-r1/OpenR1-Distill-7B), which replicates the reasoning capabilities of [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) and marks the completion of step 1 in the Open R1 project.\n\n**⚡️ [2025/03/11] [(update #3)](https://huggingface.co/blog/open-r1/update-3):** We release the **CodeForces-CoTs** dataset of 10k competitive programming problems and 100k solutions distilled from R1. We also release IOI24: a new benchmark of *very* hard problems from international olympiads. A 7B Qwen model trained on CodeForces-CoTs can outperform Claude 3.7 Sonnet on IOI24, while a 32B model can outperform R1 itself.\n\n**∞ [2025/02/10] [(update #2)](https://huggingface.co/blog/open-r1/update-2):** We release the **OpenR1-Math-220k** dataset of 220k traces distilled from R1 on a new version of NuminaMath. Models trained on this dataset match the performance of DeepSeek's distilled ones.\n\n**🔥 [2025/02/02] [(update #1)](https://huggingface.co/blog/open-r1/update-1):** We implement the first parts of the [training](https://github.com/huggingface/open-r1?tab=readme-ov-file#training-models), [inference](https://github.com/huggingface/open-r1?tab=readme-ov-file#data-generation), and [evaluation](https://github.com/huggingface/open-r1?tab=readme-ov-file#reproducing-deepseeks-evaluation-results) pipelines. Let's go!\n\nCaution\n\nLibraries rely on CUDA 12.4. 
If you see errors related to segmentation faults, double-check the version your system is running with `nvcc --version`.\n\nTo run the code in this project, first create a Python virtual environment using e.g. `uv`.\nTo install `uv`, follow the [UV Installation Guide](https://docs.astral.sh/uv/getting-started/installation/).\n\nNote\n\nAs a shortcut, run `make install` to set up the development libraries (spelled out below). Afterwards, if everything is set up correctly, you can try out the Open-R1 models.\n\n```\nuv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip\n```\n\nTip\n\nFor Hugging Face cluster users, add `export UV_LINK_MODE=copy` to your `.bashrc` to suppress cache warnings from `uv`.\n\nNext, install vLLM and FlashAttention:\n\n```\nuv pip install vllm==0.8.5.post1\nuv pip install setuptools && uv pip install flash-attn --no-build-isolation\n```\n\nThis will also install PyTorch `v2.6.0`, and it is **very important** to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:\n\n```\nGIT_LFS_SKIP_SMUDGE=1 uv pip install -e \".[dev]\"\n```\n\nNext, log into your Hugging Face and Weights and Biases accounts as follows:\n\n```\nhuggingface-cli login\nwandb login\n```\n\nFinally, check whether your system has Git LFS installed so that you can load and push models/datasets to the Hugging Face Hub:\n\n```\ngit-lfs --version\n```\n\nIf it isn't installed, run:\n\n```\nsudo apt-get install git-lfs\n```\n\nNote\n\nThe training commands below are configured for a node of 8 x H100s (80GB). For different hardware and topologies, you may need to tune the batch size and number of gradient accumulation steps.\n\nWe support training models with either DDP or DeepSpeed (ZeRO-2 and ZeRO-3). 
For example, to perform SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as [open-r1/Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts), run:\n\n```\n# Train via command line\naccelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \\\n    --model_name_or_path open-r1/Qwen2.5-Math-7B-RoPE-300k \\\n    --dataset_name open-r1/Mixture-of-Thoughts \\\n    --dataset_config all \\\n    --eos_token '<|im_end|>' \\\n    --learning_rate 4.0e-5 \\\n    --num_train_epochs 5 \\\n    --max_seq_length 32768 \\\n    --per_device_train_batch_size 2 \\\n    --gradient_checkpointing \\\n    --bf16 \\\n    --use_liger_kernel \\\n    --output_dir data/OpenR1-Distill-7B\n\n# Train via YAML config\naccelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \\\n    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml\n```\n\nCurrently, the following tasks are supported:\n\n- Supervised Fine-Tuning\n`sft`\n\n- Group Relative Policy Optimization\n`grpo`\n\nTip\n\nIf you scale up/down the number of GPUs, we recommend also scaling up the per-device batch size or number of gradient accumulation steps to keep the global batch size constant.\n\nBy default, these scripts will push each model to your Hugging Face Hub username, i.e. `{username}/{model_name}-{task}`\n\n. 
You can override the parameters in each YAML config by appending them to the command as follows:\n\n```\n# Change the base model to a smaller variant\naccelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \\\n    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml \\\n    --model_name_or_path Qwen/Qwen3-0.6B-Base \\\n    --hub_model_id OpenR1-Distill-0.6B \\\n    --output_dir data/OpenR1-Distill-0.6B\n```\n\nIf you also wish to override the Weights and Biases default settings, you can do so as follows:\n\n```\naccelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \\\n    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml \\\n    --wandb_entity huggingface --wandb_project open-r1 --run_name Qwen2.5-1.5B-GRPO\n```\n\n**🚨 WARNING 🚨**\n\nMost base models like `meta-llama/Llama-3.2-1B` do not have a chat template, so we set ChatML as the default during training. However, for Qwen base models like `Qwen/Qwen2.5-1.5B`, a chat template is pre-defined in the tokenizer, so the EOS token must be set accordingly, for example:\n\n```\n# Align EOS token with chat template for Qwen base models\naccelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \\\n    --model_name_or_path Qwen/Qwen2.5-1.5B \\\n+   --eos_token '<|im_end|>' \\\n    --dataset_name open-r1/Mixture-of-Thoughts \\\n    --dataset_config all \\\n    --learning_rate 4.0e-5 \\\n    --num_train_epochs 1 \\\n    --max_seq_length 32768 \\\n    --per_device_train_batch_size 16 \\\n    --gradient_checkpointing \\\n    --bf16 \\\n    --use_liger_kernel \\\n    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill\n```\n\nIf you wish to use a custom chat template (e.g. 
Llama or Gemma), then the chat template and associated EOS token must be provided:\n\n```\n# Align EOS token with custom chat template\naccelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \\\n    --model_name_or_path meta-llama/Llama-3.2-1B \\\n+   --chat_template \"$(cat llama_chat_template.jinja)\" \\\n+   --eos_token '<|eot_id|>' \\\n    --dataset_name open-r1/Mixture-of-Thoughts \\\n    --dataset_config all \\\n    --learning_rate 4.0e-5 \\\n    --num_train_epochs 1 \\\n    --max_seq_length 32768 \\\n    --per_device_train_batch_size 16 \\\n    --gradient_checkpointing \\\n    --bf16 \\\n    --use_liger_kernel \\\n    --output_dir data/Llama-3.2-1B-Open-R1-Distill\n```\n\nWe provide a recipe to reproduce the reasoning capabilities of [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B), starting from the same base model. To do so, run:\n\n```\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \\\n    src/open_r1/sft.py \\\n    --config recipes/OpenR1-Distill-7B/sft/config_distill.yaml\n```\n\nThe result will be a model like [open-r1/OpenR1-Distill-7B](https://huggingface.co/open-r1/OpenR1-Distill-7B), with the following downstream performance:\n\n| Model | AIME 2024 | MATH-500 | GPQA Diamond | LiveCodeBench v5 |\n|---|---|---|---|---|\n| OpenR1-Distill-7B | 52.7 | 89.0 | 52.8 | 39.4 |\n| DeepSeek-R1-Distill-Qwen-7B | 51.3 | 93.5 | 52.4 | 37.4 |\n\nYou can adjust the YAML config to train on a different base model or dataset.\n\nWe use TRL's [vLLM backend](https://huggingface.co/docs/trl/speeding_up_training?vllm+examples=GRPO#vllm-for-fast-generation-in-online-methods) to scale training to large models across multiple nodes. 
For single-node training of smol models across 8 GPUs, use `vllm_mode=\"colocate\"` to run vLLM in the same process as the training script:\n\n```\nACCELERATE_LOG_LEVEL=info \\\n    accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \\\n    src/open_r1/grpo.py --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml \\\n    --vllm_mode colocate\n```\n\nWarning\n\nThe chat template used in the distilled DeepSeek models omits the contents of the reasoning block within the `<think>` and `</think>` tags. It also prefills the assistant response with `<think>`, which interferes with the format reward function. To handle that, it is important to override the chat template as done in e.g. [recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml](/huggingface/open-r1/blob/main/recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml).\n\nFor multi-node training on N+1 nodes, with 1 node running the vLLM server and N nodes running training, we provide an example Slurm script. For example, to run the above example on 1+1 nodes with data parallelism, run:\n\n```\nsbatch --nodes=2 slurm/train.slurm --model Qwen2.5-1.5B-Instruct --task grpo --config demo --accelerator zero2 --dp 8 --tp 1\n```\n\nSee the [Launching jobs on a Slurm cluster](#launching-jobs-on-a-slurm-cluster) section for more details.\n\nWe support filtering datasets by generating completions and computing their pass rate on verifiable tasks; see this [README](/huggingface/open-r1/blob/main/scripts/pass_rate_filtering/README.md).\n\nWe provide a `code` reward function for executing code generated by the policy during training. Currently, this reward function targets code contests like [Codeforces](https://codeforces.com), where solutions are executed against a set of test cases and the overall success rate is returned as the final reward. 
To ensure safe execution, we support multiple sandbox providers:\n\n[E2B](https://e2b.dev)- Fast, cloud-based sandboxes with focus on Python execution[Morph](https://cloud.morph.so/web/)- Cloud-based sandboxes with broader language support - Python/JS/C++/Rust\n\nTo use the code reward function, first install the necessary dependencies:\n\n```\nuv pip install -e '.[code]'\n```\n\nTo use E2B sandboxes, create a `.env`\n\nfile and add your E2B API token:\n\n```\nE2B_API_KEY=\"e2b_xxx\"\n```\n\nTo use Morph, first install the morphcloud package:\n\n```\npip install morphcloud\n```\n\nThen add your Morph API token to the `.env`\n\nfile:\n\n```\nMORPH_API_KEY=\"YOUR_MORPH_API_KEY\"\n```\n\nTo specify which provider to use, add the `provider_type`\n\nparameter in your configuration:\n\n```\n# For E2B\nprovider_type: e2b\n\n# For Morph\nprovider_type: morph\n```\n\nMake sure your dataset contains a `verification_info`\n\ncolumn with the following schema (adopted from PrimeIntellect's excellent [datasets](https://huggingface.co/collections/PrimeIntellect/synthetic-1-67a2c399cfdd6c9f7fae0c37) of verifiable problems):\n\n```\n{\n    \"language\": \"python\",  # Morph supports more languages including C++, Java, etc.\n    \"test_cases\": [\n        {\n            \"input\": \"4\\n4\\n0001\\n1000\\n0011\\n0111\\n3\\n010\\n101\\n0\\n2\\n00000\\n00001\\n4\\n01\\n001\\n0001\\n00001\\n\",\n            \"output\": \"1\\n3 \\n-1\\n0\\n\\n2\\n1 2 \\n\",\n            \"type\": \"stdin_stdout\",\n        }\n    ],\n}\n```\n\nFor example, to train a smol model on Python problems, start the vLLM server:\n\n```\nCUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-1.5B-Instruct\n```\n\nThen run training with:\n\n```\nCUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 ACCELERATE_LOG_LEVEL=info \\\n    accelerate launch --config_file recipes/accelerate_configs/zero2.yaml --num_processes=7 \\\n    src/open_r1/grpo.py --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code.yaml\n```\n\nIt is 
possible to be rate limited when too many scripts are executed on sandbox services. For both providers, we offer router scripts that can be launched on a CPU node:\n\nFor E2B:\n\n```\nsbatch slurm/e2b_router.slurm\n```\n\nFor Morph:\n\n```\nsbatch slurm/morph_router.slurm\n```\n\nThen add the router URL in your training YAML config:\n\n```\n# For E2B\ne2b_router_url: 1.2.3.4:8000\n\n# For Morph\nmorph_router_url: 1.2.3.4:8000\n```\n\nThe port should match the one used when launching the router. All training jobs can share the same router IP which will ensure parallel executions are properly managed.\n\nWe provide `ioi_code_reward` and `cf_code_reward` reward functions for executing problems from [IOI](https://hf.co/datasets/open-r1/ioi) and [CodeForces](https://huggingface.co/datasets/open-r1/codeforces), respectively. You can use either [piston](https://github.com/engineer-man/piston) or Morph (currently IOI only) as your execution provider.\n\nTo use Piston:\n\n- Get piston workers running, see [slurm/piston/README.md](/huggingface/open-r1/blob/main/slurm/piston/README.md)\n- Set your environment variable `PISTON_ENDPOINTS` to `slurm` or to a list of piston worker endpoints\n\nFor IOI:\n\n- In your configuration, use `ioi_provider: \"piston\"`\n\nFor CodeForces:\n\n- Download the generated (hard) test cases:\n\n```\n# change PATH_TO_SAVE_TESTCASES. Increase --max-workers according to your machine's capacity\nhuggingface-cli download open-r1/codeforces --repo-type=dataset --include='generated_tests/*.parquet' --max-workers=8 --local-dir PATH_TO_SAVE_TESTCASES\n```\n\n- Save the path in .env:\n\n```\nCF_TESTS_FOLDER=PATH_TO_SAVE_TESTCASES\n```\n\nMorph is a cloud-based solution that provides sandboxed environments for running code. 
To use it:\n\n- Install the Morph client: `pip install morphcloud`\n- Add your Morph API key to the `.env` file: `MORPH_API_KEY=\"your_key_here\"`\n- In your configuration, use `ioi_provider: \"morph\"`\n\nFor IOI:\n\nSee the [example recipe](/huggingface/open-r1/blob/main/recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code_ioi.yaml) for how to use the IOI reward function:\n\n```\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \\\n    --num_processes=7 src/open_r1/grpo.py \\\n    --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo_code_ioi.yaml\n```\n\nFor CodeForces:\n\n```\nsbatch --job-name=cf-grpo --nodes=2 slurm/train.slurm --model Qwen2.5-Coder-7B-Instruct --task grpo --config codeforces --accelerator zero3 --dp 8 --tp 1\n```\n\nIf you have access to a Slurm cluster, we provide a `slurm/train.slurm` script that will automatically queue training jobs for you. Here's how you can use it:\n\n```\nsbatch --job-name=open_r1 --nodes=1 slurm/train.slurm --model {model_name} --task {task} --config {config_suffix} --accelerator {accelerator}\n```\n\nHere `{model_name}` and `{task}` are defined as above, while `{config_suffix}` refers to the specific config and `{accelerator}` refers to the choice of 🤗 Accelerate config in `recipes/accelerate_configs`. If you wish to override the default config parameters, you can provide them by appending a space-separated string like `'--arg1=value1 --arg2=value2'`. Here's a concrete example to run SFT on 1 node of 8 GPUs:\n\n```\nsbatch --job-name=open_r1 --nodes=1 slurm/train.slurm --model OpenR1-Distill-7B --task sft --config distill --accelerator zero3\n```\n\nYou can scale the number of nodes by increasing the `--nodes` flag.\n\nFor GRPO, we use 1 node for the vLLM server and N nodes for training. 
For example, to run GRPO on 1+1 nodes with mixed data and tensor parallelism, run:\n\n```\nsbatch --job-name=open_r1 --nodes=2 slurm/train.slurm --model Qwen2.5-1.5B-Instruct --task grpo --config demo --accelerator zero2 --dp 4 --tp 2\n```\n\nNote\n\nThe configuration in `slurm/train.slurm`\n\nis optimised for the Hugging Face Compute Cluster and may require tweaking to be adapted to your own compute nodes.\n\nTo combine multiple datasets as a single training mixture, you can specify the `dataset_mixture`\n\nparameter in the YAML config file. Here's a template for how to do this:\n\n```\ndataset_mixture:\n  datasets:                     # List of datasets to include in the mixture\n    - id: dataset_1             # Hub dataset ID\n      config: config_name_1     # Name of the dataset config\n      split: split_1            # Split to use from the dataset\n      columns:                  # Columns to keep\n        - column_1              \n        - column_2    \n      weight: 0.25              # Fraction of dataset to use\n    - id: dataset_2\n      config: config_name_2\n      split: split_2\n      columns:                  \n        - column_1              \n        - column_2   \n      weight: 0.5\n  seed: 42                      # Seed for shuffling the combined dataset\n  test_split_size: 0.1          # Fraction of mixture to use for a test split\n```\n\nWe use `lighteval`\n\nto evaluate models. 
For models which fit on a single GPU, run:\n\n```\nexport VLLM_WORKER_MULTIPROC_METHOD=spawn # Required for vLLM\nMODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\nMODEL_ARGS=\"model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}\"\nOUTPUT_DIR=data/evals/$MODEL\n\n# AIME 2024\nTASK=aime24\nlighteval vllm $MODEL_ARGS \"lighteval|$TASK|0|0\" \\\n    --use-chat-template \\\n    --output-dir $OUTPUT_DIR\n\n# MATH-500\nTASK=math_500\nlighteval vllm $MODEL_ARGS \"lighteval|$TASK|0|0\" \\\n    --use-chat-template \\\n    --output-dir $OUTPUT_DIR\n\n# GPQA Diamond\nTASK=gpqa:diamond\nlighteval vllm $MODEL_ARGS \"lighteval|$TASK|0|0\" \\\n    --use-chat-template \\\n    --output-dir $OUTPUT_DIR\n\n# LiveCodeBench\nlighteval vllm $MODEL_ARGS \"extended|lcb:codegeneration|0|0\" \\\n    --use-chat-template \\\n    --output-dir $OUTPUT_DIR\n```\n\nTo increase throughput across multiple GPUs, use *data parallel* as follows:\n\n```\nNUM_GPUS=8\nMODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\nMODEL_ARGS=\"model_name=$MODEL,dtype=bfloat16,data_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}\"\nTASK=aime24\nOUTPUT_DIR=data/evals/$MODEL\n\nlighteval vllm $MODEL_ARGS \"lighteval|$TASK|0|0\" \\\n    --use-chat-template \\\n    --output-dir $OUTPUT_DIR\n```\n\nFor large models which require sharding across GPUs, use *tensor parallel* and run:\n\n```\nNUM_GPUS=8\nMODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B\nMODEL_ARGS=\"model_name=$MODEL,dtype=bfloat16,tensor_parallel_size=$NUM_GPUS,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}\"\nTASK=aime24\nOUTPUT_DIR=data/evals/$MODEL\n\nexport VLLM_WORKER_MULTIPROC_METHOD=spawn\nlighteval vllm $MODEL_ARGS \"lighteval|$TASK|0|0\" \\\n    --use-chat-template \\\n    
--output-dir $OUTPUT_DIR\n```\n\nYou can also launch an evaluation with `make evaluate`, specifying the model, task, and optionally the parallelism technique and number of GPUs.\n\nTo evaluate on a single GPU:\n\n```\nmake evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24\n```\n\nTo use Data Parallelism:\n\n```\nmake evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=data NUM_GPUS=8\n```\n\nTo use Tensor Parallelism:\n\n```\nmake evaluate MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B TASK=aime24 PARALLEL=tensor NUM_GPUS=8\n```\n\nThe DeepSeek-R1 paper uses sampling with 4-64 responses per query to estimate `pass@1` accuracy, but does not state the exact number of responses used for each benchmark. In the tables below, we estimate `pass@1` accuracy with the following number of responses per query:\n\n| Benchmark | Number of responses per query |\n|---|---|\n| AIME 2024 | 64 |\n| MATH-500 | 4 |\n| GPQA Diamond | 8 |\n| LiveCodeBench | 16 |\n\nNote that for benchmarks like AIME24, it is important to sample many responses, as there are only 30 problems and this can introduce high variance across repeated runs. 
The choice of how many responses to sample per prompt likely explains the small differences between our evaluation results and those reported by DeepSeek.\n\nWe are able to reproduce Deepseek's reported results on the AIME 2024 benchmark within ~1-3 standard deviations:\n\n| Model | AIME 2024 (🤗 LightEval) | AIME 2024 (DeepSeek Reported) |\n|---|---|---|\n| DeepSeek-R1-Distill-Qwen-1.5B | 30.7 | 28.9 |\n| DeepSeek-R1-Distill-Qwen-7B | 50.8 | 55.5 |\n| DeepSeek-R1-Distill-Qwen-14B | 65.9 | 69.7 |\n| DeepSeek-R1-Distill-Qwen-32B | 69.7 | 72.6 |\n| DeepSeek-R1-Distill-Llama-8B | 43.9 | 41.7 |\n| DeepSeek-R1-Distill-Llama-70B | 63.0 | 70.0 |\n\nTo reproduce these results use the following command:\n\n```\nNUM_GPUS=1 # Set to 8 for 32B and 70B models\nMODEL=deepseek-ai/{model_name}\nMODEL_ARGS=\"model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}\"\nOUTPUT_DIR=data/evals/$MODEL\n\nlighteval vllm $MODEL_ARGS \"lighteval|aime24|0|0\" \\\n    --use-chat-template \\\n    --output-dir $OUTPUT_DIR\n```\n\nAlternatively, you can launch Slurm jobs as follows:\n\n```\npython scripts/run_benchmarks.py --model-id {model_id}  --benchmarks aime24\n```\n\nWe are able to reproduce Deepseek's reported results on the MATH-500 benchmark within ~1-3 standard deviations:\n\n| Model | MATH-500 (🤗 LightEval) | MATH-500 (DeepSeek Reported) |\n|---|---|---|\n| DeepSeek-R1-Distill-Qwen-1.5B | 83.1 | 83.9 |\n| DeepSeek-R1-Distill-Qwen-7B | 94.5 | 92.8 |\n| DeepSeek-R1-Distill-Qwen-14B | 94.1 | 93.9 |\n| DeepSeek-R1-Distill-Qwen-32B | 95.6 | 94.3 |\n| DeepSeek-R1-Distill-Llama-8B | 88.6 | 89.1 |\n| DeepSeek-R1-Distill-Llama-70B | 95.1 | 94.5 |\n\nTo reproduce these results use the following command:\n\n```\nexport VLLM_WORKER_MULTIPROC_METHOD=spawn\nNUM_GPUS=1 # Set to 8 for 32B and 70B 
models\nMODEL=deepseek-ai/{model_name}\nMODEL_ARGS=\"model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}\"\nOUTPUT_DIR=data/evals/$MODEL\n\nlighteval vllm $MODEL_ARGS \"lighteval|math_500|0|0\" \\\n    --use-chat-template \\\n    --output-dir $OUTPUT_DIR\n```\n\nAlternatively, you can launch Slurm jobs as follows:\n\n```\npython scripts/run_benchmarks.py --model-id {model_id}  --benchmarks math_500\n```\n\nWe are able to reproduce Deepseek's reported results on the GPQA Diamond benchmark within ~1-3 standard deviations:\n\n| Model | GPQA Diamond (🤗 LightEval) | GPQA Diamond (DeepSeek Reported) |\n|---|---|---|\n| DeepSeek-R1-Distill-Qwen-1.5B | 35.8 | 33.8 |\n| DeepSeek-R1-Distill-Qwen-7B | 50.5 | 49.1 |\n| DeepSeek-R1-Distill-Qwen-14B | 61.5 | 59.1 |\n| DeepSeek-R1-Distill-Qwen-32B | 63.1 | 62.1 |\n| DeepSeek-R1-Distill-Llama-8B | 46.7 | 49.0 |\n| DeepSeek-R1-Distill-Llama-70B | 67.4 | 65.2 |\n\nTo reproduce these results use the following command:\n\n```\nexport VLLM_WORKER_MULTIPROC_METHOD=spawn\nNUM_GPUS=1 # Set to 8 for 32B and 70B models\nMODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B\nMODEL_ARGS=\"model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}\"\nOUTPUT_DIR=data/evals/$MODEL\n\nlighteval vllm $MODEL_ARGS \"lighteval|gpqa:diamond|0|0\" \\\n    --use-chat-template \\\n    --output-dir $OUTPUT_DIR\npython scripts/run_benchmarks.py --model-id {model_id}  --benchmarks gpqa\n```\n\nWe are able to reproduce Deepseek's reported results on the LiveCodeBench code generation benchmark within ~1-3 standard deviations:\n\n| Model | LiveCodeBench (🤗 LightEval) | LiveCodeBench (DeepSeek Reported) |\n|---|---|---|\n| DeepSeek-R1-Distill-Qwen-1.5B | 16.1 | 16.9 |\n| DeepSeek-R1-Distill-Qwen-7B | 37.4 | 37.6 |\n| 
DeepSeek-R1-Distill-Qwen-14B | 51.3 | 53.1 |\n| DeepSeek-R1-Distill-Qwen-32B | 56.0 | 57.2 |\n| DeepSeek-R1-Distill-Llama-8B | 37.4 | 39.6 |\n| DeepSeek-R1-Distill-Llama-70B | 55.9 | 57.5 |\n\nTo reproduce these results use the following command:\n\n```\nNUM_GPUS=1 # Set to 8 for 32B and 70B models, or data_parallel_size=8 with the smaller models for speed\nMODEL=deepseek-ai/{model_name}\nMODEL_ARGS=\"model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,data_parallel_size=$NUM_GPUS,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}\"\nOUTPUT_DIR=data/evals/$MODEL\n\nlighteval vllm $MODEL_ARGS \"extended|lcb:codegeneration|0|0\" \\\n    --use-chat-template \\\n    --output-dir $OUTPUT_DIR\npython scripts/run_benchmarks.py --model-id {model_id}  --benchmarks lcb\n```\n\nThe following example can be run in 1xH100. First install the following dependencies:\n\n```\nuv pip install \"distilabel[vllm]>=1.5.2\"\n```\n\nNow save the following snippet into a file named `pipeline.py`\n\nand run it with `python pipeline.py`\n\n. It will generate 4 outputs for each of the 10 examples (change the username for the repository to your org/user name):\n\n``` python\nfrom datasets import load_dataset\nfrom distilabel.models import vLLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps.tasks import TextGeneration\n\nprompt_template = \"\"\"\\\nYou will be given a problem. 
Please reason step by step, and put your final answer within \\boxed{}:\n{{ instruction }}\"\"\"\n\ndataset = load_dataset(\"AI-MO/NuminaMath-TIR\", split=\"train\").select(range(10))\n\nmodel_id = \"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\"  # Exchange with another smol distilled r1\n\nwith Pipeline(\n    name=\"distill-qwen-7b-r1\",\n    description=\"A pipeline to generate data from a distilled r1 model\",\n) as pipeline:\n\n    llm = vLLM(\n        model=model_id,\n        tokenizer=model_id,\n        extra_kwargs={\n            \"tensor_parallel_size\": 1,\n            \"max_model_len\": 8192,\n        },\n        generation_kwargs={\n            \"temperature\": 0.6,\n            \"max_new_tokens\": 8192,\n        },\n    )\n    prompt_column = \"problem\"\n    text_generation = TextGeneration(\n        llm=llm, \n        template=prompt_template,\n        num_generations=4,\n        input_mappings={\"instruction\": prompt_column} if prompt_column is not None else {}\n    )\n\nif __name__ == \"__main__\":\n    distiset = pipeline.run(dataset=dataset)\n    distiset.push_to_hub(repo_id=\"username/numina-deepseek-r1-qwen-7b\")\n```\n\nTake a look at the sample dataset at [HuggingFaceH4/numina-deepseek-r1-qwen-7b](https://huggingface.co/datasets/HuggingFaceH4/numina-deepseek-r1-qwen-7b).\n\nTo run the bigger DeepSeek-R1, we used 2 nodes, each with 8×H100 GPUs using the slurm file present in this repo at `slurm/generate.slurm`\n\n. 
First, install the dependencies:\n\n(For now, we need to install the vLLM dev wheel that [fixes the R1 CUDA graph capture](https://github.com/vllm-project/vllm/commits/221d388cc5a836fa189305785ed7e887cea8b510/csrc/moe/moe_align_sum_kernels.cu).)\n\n```\npip install https://wheels.vllm.ai/221d388cc5a836fa189305785ed7e887cea8b510/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu121\n\nuv pip install \"distilabel[vllm,ray,openai]>=1.5.2\"\n```\n\nAnd then run the following command:\n\n```\nsbatch slurm/generate.slurm \\\n    --hf-dataset AI-MO/NuminaMath-TIR \\\n    --temperature 0.6 \\\n    --prompt-column problem \\\n    --model deepseek-ai/DeepSeek-R1 \\\n    --hf-output-dataset username/r1-dataset\n```\n\nNote\n\nWhile the job is running, you can set up an SSH tunnel through the cluster login node to access the Ray dashboard from your computer by running `ssh -L 8265:ray_ip_head_node:8265 <login_node>`, then browsing to `http://localhost:8265`.\n\nFollowing [s1: Simple test-time scaling](https://huggingface.co/papers/2501.19393), the data can be decontaminated using the script at [scripts/decontaminate.py](/huggingface/open-r1/blob/main/scripts/decontaminate.py), which decontaminates a dataset using 8-grams and deduplicates the data. Sample run:\n\n```\npython scripts/decontaminate.py \\\n    --dataset \"open-r1/verifiable-coding-problems-python\" \\\n    --problem_column problem \\\n    --cleanup\n```\n\nIt will decontaminate against the benchmark datasets and remove the contaminated samples afterwards. If no `--new_dataset_name` argument is provided, the same dataset name will be reused with a `_decontaminated` suffix added. 
It runs against the prompt, which for this dataset is the column `problem`\n\n, but a different one can be provided.\n\nArguments for the script:\n\n```\nusage: decontaminate.py [-h] --dataset DATASET [--split SPLIT] [--ngram_size NGRAM_SIZE] [--problem_column PROBLEM_COLUMN] [--cleanup] [--new_dataset_name NEW_DATASET_NAME]\n\noptions:\n  -h, --help            show this help message and exit\n  --dataset DATASET     Name of the dataset to check for contamination.\n  --split SPLIT         Split to check for contamination, defaults to `train`.\n  --ngram_size NGRAM_SIZE\n                        Size of n-grams to build, defaults to 8.\n  --problem_column PROBLEM_COLUMN\n                        Name of the column containing the problem (prompt).\n  --cleanup           Whether to remove the contaminated rows before pushing the dataset.\n  --new_dataset_name NEW_DATASET_NAME\n                        New name for the dataset. If not provided, will reuse the name and add a `_decontaminated` to the name.\n```\n\nContributions are welcome. Please refer to [#23](https://github.com/huggingface/open-r1/issues/23).\n\nThis project is built with the collective efforts of many groups and individuals in the open AI community. We are especially grateful to the vLLM and SGLang teams for creating high-performance tooling to scale the rollouts of GRPO. 
We also thank the teams at [OpenThoughts](https://www.open-thoughts.ai), [Prime Intellect](https://www.primeintellect.ai), and [General Reasoning](https://gr.inc) for creating and sharing high-quality datasets for reasoning.\n\nIf you find this project useful in your own work, please consider citing it as follows:\n\n```\n@misc{openr1,\n    title = {Open R1: A fully open reproduction of DeepSeek-R1},\n    url = {https://github.com/huggingface/open-r1},\n    author = {{Hugging Face}},\n    month = {January},\n    year = {2025}\n}\n```\n\n", "url": "https://wpnews.pro/news/open-reproduction-of-deepseek-r1", "canonical_source": "https://github.com/huggingface/open-r1", "published_at": "2026-06-11 13:14:31+00:00", "updated_at": "2026-06-11 17:14:35.678485+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "ai-research"], "entities": ["DeepSeek-R1", "DeepSeek", "Distilabel"], "alternates": {"html": "https://wpnews.pro/news/open-reproduction-of-deepseek-r1", "markdown": "https://wpnews.pro/news/open-reproduction-of-deepseek-r1.md", "text": "https://wpnews.pro/news/open-reproduction-of-deepseek-r1.txt", "jsonld": "https://wpnews.pro/news/open-reproduction-of-deepseek-r1.jsonld"}}