Improved clarity in 3 sections #43

Open · wants to merge 2 commits into main
README.md (23 changes: 11 additions & 12 deletions)
@@ -48,13 +48,12 @@

## 1. Introduction

-We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
-DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning.
-With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors.
-However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance,
-we introduce DeepSeek-R1, which incorporates cold-start data before RL.
-DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.
-To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.
+We introduce our first-generation reasoning models: DeepSeek-R1-Zero and DeepSeek-R1.
+DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without preliminary supervised fine-tuning (SFT), demonstrates remarkable reasoning performance.
+Through RL training, it naturally developed numerous powerful and intriguing reasoning behaviors.
+However, DeepSeek-R1-Zero faces challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning capabilities, we developed DeepSeek-R1, which incorporates cold-start data prior to RL training.
+DeepSeek-R1 achieves performance comparable to OpenAI-o1 in math, coding, and reasoning tasks.
+To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (based on Llama and Qwen architectures) distilled from DeepSeek-R1. Notably, DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across benchmarks, achieving new state-of-the-art results for dense models.

<p align="center">
<img width="80%" src="figures/benchmark.jpg">
@@ -66,17 +65,17 @@

**Post-Training: Large-Scale Reinforcement Learning on the Base Model**

-- We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
+- We directly apply reinforcement learning (RL) to the base model without supervised fine-tuning (SFT) as a preliminary step. This approach enables the model to explore chain-of-thought (CoT) reasoning for solving complex problems, leading to the development of DeepSeek-R1-Zero. The model demonstrates capabilities such as self-verification, reflection, and the generation of long CoTs, marking a significant milestone for the research community. Notably, this is the first open research initiative to validate that large language models (LLMs) can develop reasoning capabilities purely through RL, eliminating the need for SFT. This breakthrough paves the way for future advancements in the field.

-- We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities.
-We believe the pipeline will benefit the industry by creating better models.
+- We introduce our pipeline for developing DeepSeek-R1, which incorporates two RL stages (aimed at discovering improved reasoning patterns and aligning with human preferences) and two SFT stages (serving as the foundation for the model’s reasoning and non-reasoning capabilities). We believe this pipeline will benefit the industry by enabling the creation of more advanced models.

---

**Distillation: Smaller Models Can Be Powerful Too**

-- We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future.
-- Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.
+- We demonstrate that reasoning patterns from larger models can be distilled into smaller ones, achieving superior performance compared to reasoning patterns discovered through RL on small models. The open-source DeepSeek-R1 and its API will empower the research community to distill more capable smaller models in the future.
+
+- Using reasoning data generated by DeepSeek-R1, we fine-tuned several dense models widely adopted in the research community. Evaluations show that these smaller distilled dense models excel on benchmarks. We have open-sourced distilled checkpoints (1.5B, 7B, 8B, 14B, 32B, and 70B) based on the Qwen2.5 and Llama3 architectures for community use.

## 3. Model Downloads

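The post-training bullets above describe rewarding a base model directly with RL on problems whose answers can be checked. As a rough illustration only (the reward rule, sampling setup, and group normalization below are assumptions for this sketch, not code or algorithm details taken from this README), the core signal can be as simple as scoring several sampled completions per prompt against a reference answer and normalizing the rewards within that group:

```python
# Illustrative sketch only: a rule-based reward plus group-normalized advantages,
# the kind of verifiable signal one might use when applying RL directly to a base
# model on math-style tasks. Not DeepSeek's released training code.
from typing import List
import statistics


def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy verifiable reward: 1.0 if the completion ends with the reference answer."""
    return 1.0 if completion.strip().endswith(reference_answer) else 0.0


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one prompt's group of sampled completions."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]


# Four sampled completions for one prompt whose reference answer is "408".
completions = [
    "... so the answer is 408",
    "... answer: 480",
    "... therefore the result is 408",
    "... I think it is 400",
]
rewards = [rule_based_reward(c, "408") for c in completions]
advantages = group_advantages(rewards)
print(rewards)      # [1.0, 0.0, 1.0, 0.0]
print(advantages)   # correct samples get positive advantage, incorrect negative
```

Completions whose final answer matches the reference receive a positive advantage within their group, which is the direction a policy-gradient update would reinforce.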
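The distillation bullets describe fine-tuning small dense models on reasoning data generated by DeepSeek-R1. Below is a minimal sketch of that kind of supervised fine-tuning with Hugging Face `transformers`; the base model name, toy trace, and hyperparameters are placeholders, not the released recipe:

```python
# Minimal sketch of distillation-by-SFT: fine-tune a small dense model on
# reasoning traces sampled from a stronger teacher. Model name, data, and
# hyperparameters are placeholders, not the official DeepSeek setup.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "Qwen/Qwen2.5-1.5B"  # assumed small dense base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Toy stand-in for prompt/completion pairs generated by the teacher model.
traces = [
    {
        "prompt": "What is 17 * 24?",
        "completion": "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408</think> 408",
    },
]

def tokenize(example):
    text = example["prompt"] + "\n" + example["completion"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

dataset = Dataset.from_list(traces).map(tokenize, remove_columns=["prompt", "completion"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilled-qwen-sketch",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        learning_rate=1e-5,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```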