If you’ve been following AI developments in early 2025, you’ve likely been swept up by the tidal wave of news regarding DeepSeek AI. The conversation around DeepSeek paints a picture of a disruptive contender that came out of nowhere, threatening the dominance of existing AI providers and making too-good-to-be-true promises of higher performance and dramatically lower costs.
I find the narrative surrounding DeepSeek’s rise surprising, considering that the company’s family of models is open, with an extensive backlog of published papers detailing its approach and philosophy. These open contributions are undoubtedly great for open-source development, and the fact that we can read exactly what the team did to achieve these results is remarkable.
After reading the DeepSeek model papers, I decided to summarize their process and outline some crucial question marks.
What is DeepSeek?
DeepSeek is a Chinese AI research lab formed by the hedge fund High-Flyer in 2023, with connections to a number of universities in China. It has developed a series of increasingly capable open-source models. These models evolved rapidly from late 2023 through the end of 2024, iteratively incorporating new techniques aimed at making AI training and deployment more efficient and cost-effective.
Here’s a rough overview of the progression of DeepSeek’s core models. While DeepSeek has published other models specifically for tasks like coding, math, chat, or image generation, these are the company’s core models and their differences.
DeepSeek-LLM 7B & 67B:
These models were published to Hugging Face between October and November 2023. They are named after their respective number of parameters (7 billion and 67 billion) and trained on 2 trillion English and Chinese tokens, the smallest units of text a model can process. A token is not a fixed number of characters but roughly corresponds to one Chinese character or three to four letters in English. These models generally follow Meta’s Llama-2 model architecture with minor changes. The 67B model is very competitive in performance with Llama-2.
DeepSeek-V2:
Published in April 2024, this model has 236 billion parameters and was trained on 8.1 trillion tokens. It introduced significant performance improvements for both training and execution by using Multi-Head Latent Attention (MLA) and DeepSeekMoE, which we will cover later.
DeepSeek’s current offerings
DeepSeek-V3:
Published in late 2024, this model scaled training up to 14.8 trillion tokens and refined the techniques developed for V2, keeping the reported training cost at a theoretical $5.57 million. To hit that target, the DeepSeek team would have had to train the final model correctly on the first attempt, which seems unlikely. Regardless, this is significantly lower than cost estimates for closed models like Anthropic’s Claude and OpenAI’s GPT-4o.
DeepSeek-R1:
Published in January 2025, this is a reasoning model similar to OpenAI’s o1. DeepSeek used unsupervised, reinforcement-learning-based fine-tuning to develop the model’s reasoning behavior. This approach allowed them to achieve results comparable to closed reasoning models without the use of expensive labeled training datasets. We’ll take a closer look at what this means later as well.
What are open vs. closed models?
One of DeepSeek’s core features is its commitment to open-source models. Open-source models are defined by public accessibility, allowing anyone to view their code, find errors, recommend changes, and customize the model for different tasks. Companies that develop closed models restrict a model’s code and data from external access.
Under an open-source approach, DeepSeek has published several papers on their models. While there are more, these are the ones I read:
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek isn’t alone in developing models in the open. Meta’s Llama, the model family from which DeepSeek drew inspiration, is open. Mistral and Microsoft also publish fantastic open models with performance comparable to the “mini” closed models; I’m particularly impressed with Mistral NeMo lately. Because of the community-based approach promoted by open models, techniques often cross-pollinate between them. You can catch new open models popping up all the time on the Hugging Face Open LLM Leaderboard.
| Closed Models | Open Models |
| --- | --- |
| OpenAI (GPT-4o) | Meta (Llama-3) |
| Anthropic (Claude) | Mistral (NeMo, Mistral 7B, Mixtral) |
| Google (Gemini) | DeepSeek (V3, R1) |
| | Cohere (Command R/R+) |
| | Microsoft (Phi-4) |
| | Alibaba (Qwen) |
Be aware that high-performance open models comparable to closed models are incredibly memory- and compute-intensive. Even on a very high-end PC, performance will likely be poor. If you still want to try, Ollama is a helpful tool.
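For example, if you have the Ollama server running locally, you can query a pulled model through its HTTP API. The model tag below is just an example of a smaller distilled model; substitute whatever `ollama list` shows on your machine.

```python
# Minimal sketch: query a locally running Ollama server over its HTTP API.
# Assumes Ollama is running on its default port and the model has been pulled.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:7b",   # example tag; any locally pulled model works
        "prompt": "Explain Mixture-of-Experts in one sentence.",
        "stream": False,             # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(response.json()["response"])
```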
DeepSeek’s cost-saving measures
According to DeepSeek’s papers, here’s how the company has been able to drastically improve model efficiency while maintaining good performance.
Multi-Head Latent Attention: boosting inference efficiency
Multi-Head Latent Attention (MLA) is a breakthrough most likely unique to DeepSeek-V2/V3 that reduces storage requirements and improves efficiency. According to DeepSeek’s papers, MLA results in a 92% reduction in key-value storage compared to standard transformer models.
PySpur has a great blog post about MLA that goes into more detail than I will. Put simply, LLMs can only generate meaningful outputs with “attention,” an awareness of the text that has come before. Computational demands grow with the number of tokens the model must attend to. To improve efficiency, the attention data for each earlier token (its keys and values) can be cached (stored) so that, when looking back at the chain of text, the model doesn’t need to re-compute it. Modern LLMs have many “attention heads,” meaning they can track numerous chains at once. You can think of this as the LLM having multiple trains of thought.
If every attention head maintains its own cache (an approach called multi-head attention, or MHA), keeping these trains of thought in memory uses a tremendous amount of computer memory, potentially upwards of 20GB per request.
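To get a feel for that number, here is a back-of-the-envelope calculation of the key-value cache for a plain multi-head-attention model. The layer count, head sizes, and context length below are hypothetical, chosen only to illustrate the scale:

```python
# Back-of-the-envelope KV-cache size for standard multi-head attention (MHA).
# All model dimensions below are hypothetical, chosen only to show the scale involved.
layers = 30          # transformer layers
heads = 32           # attention heads per layer
head_dim = 128       # dimension of each head
bytes_per_value = 2  # fp16
context_tokens = 40_000

# Each token stores a key and a value vector per head, per layer.
per_token = layers * heads * head_dim * 2 * bytes_per_value
total_gb = per_token * context_tokens / 1e9

print(f"{per_token / 1e3:.0f} KB per token, {total_gb:.1f} GB for the full context")
# -> roughly 492 KB per token, ~19.7 GB for the full context
```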
One solution is to share caches between attention heads. The problem is that values in a shared cache can get muddled when different heads cache against the same key, degrading accuracy. MLA is a middle ground: it creates virtual caches for each attention head, all derived from a single, smaller real cache. The real cache uses different techniques to optimize value-sharing, reducing memory requirements while maintaining performance. For example, cached values may be shared if they’re similar, or less precise values may be used where the loss of precision doesn’t affect the model’s performance.
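Here is a heavily simplified sketch of the core idea, assuming nothing about DeepSeek’s actual dimensions or projection matrices: cache one small latent vector per token (the “real” cache), and reconstruct per-head keys and values (the “virtual” caches) from it only when attention needs them.

```python
# A highly simplified sketch of the idea behind Multi-Head Latent Attention (MLA):
# instead of caching full per-head keys and values, cache one small latent vector
# per token and rebuild per-head keys/values from it on demand.
# Dimensions are illustrative, not DeepSeek's actual configuration.
import numpy as np

d_model, n_heads, head_dim, d_latent = 1024, 8, 128, 64

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02             # compress to latent
W_up_k = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02  # expand to keys
W_up_v = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02  # expand to values

def cache_token(hidden_state):
    """Store only the small latent vector for this token (the 'real' cache)."""
    return hidden_state @ W_down                       # shape: (d_latent,)

def expand_for_attention(latent_cache):
    """Rebuild per-head keys and values (the 'virtual' caches) when needed."""
    latents = np.stack(latent_cache)                   # (seq_len, d_latent)
    keys = (latents @ W_up_k).reshape(len(latent_cache), n_heads, head_dim)
    values = (latents @ W_up_v).reshape(len(latent_cache), n_heads, head_dim)
    return keys, values

# Caching 1,000 tokens this way stores 64 floats per token instead of
# 8 heads * 128 dims * 2 (keys and values) = 2,048 floats per token.
cache = [cache_token(rng.standard_normal(d_model)) for _ in range(1000)]
keys, values = expand_for_attention(cache)
print(keys.shape, values.shape)  # (1000, 8, 128) (1000, 8, 128)
```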
DeepSeekMoE: using sparse computation to reduce execution costs
DeepSeek uses a Mixture-of-Experts (MoE) architecture. This means that rather than activating all of its parameters for every input, it only activates the parameters trained for the task at hand. Put simply, the model behaves like a collection of smaller, specialized models: one is good at math, one is good at science, one is good at programming, and so on. With this approach, about 37 billion of the 671 billion model parameters are activated per token, significantly reducing computational costs. It also allows the model to scale efficiently without requiring large, high-performance computers and GPUs.
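A minimal sketch of how this kind of routing works, with toy sizes rather than DeepSeek’s actual configuration: a small router scores every expert for a token, only the top-k experts run, and their outputs are blended using the router’s weights.

```python
# Minimal Mixture-of-Experts routing sketch: score experts, run only the top-k,
# and combine their outputs. Sizes here are toy values, not DeepSeek's.
import numpy as np

d_model, n_experts, top_k = 64, 8, 2
rng = np.random.default_rng(0)

router = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]

def moe_forward(token):
    scores = token @ router                       # one score per expert
    chosen = np.argsort(scores)[-top_k:]          # activate only the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()                      # normalize the routing weights
    # Only the chosen experts do any work; the rest stay idle for this token.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

out = moe_forward(rng.standard_normal(d_model))
print(out.shape)  # (64,)
```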
It’s worth noting that DeepSeek is not pioneering MoE. Mistral’s open Mixtral model is also based on MoE, and GPT-4 is widely reported (though not officially confirmed) to use 16 experts with roughly 111 billion parameters each.
While MoE isn’t groundbreaking, DeepSeekMoE does introduce several cost-saving techniques. For instance, Auxiliary-Loss-Free Load Balancing dynamically adjusts the computational workload across experts to improve efficiency. Device-limited routing can also evenly distribute the model across multiple machines, allowing it to run effectively on less powerful compute resources. With DeepSeekMoE, DeepSeek-V3 activates substantially fewer parameters (37 billion per token vs. the 111 billion per expert reported for GPT-4) while maintaining reasonably similar performance. This is a huge cost saving compared to the training and execution demands of GPT-4.
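The balancing idea can be sketched in a few lines. The update rule and step size below are my own illustrative choices, not DeepSeek’s exact recipe: each expert carries a bias that only affects which experts get selected, and after each batch the bias is nudged to steer future tokens toward under-used experts, without adding a balancing term to the training loss.

```python
# Sketch of auxiliary-loss-free load balancing for MoE routing.
# Illustrative update rule and step size, not DeepSeek's exact method.
import numpy as np

n_experts, top_k, step = 8, 2, 0.01
bias = np.zeros(n_experts)

def select_experts(scores):
    # The bias only influences which experts are picked, not how outputs are weighted.
    return np.argsort(scores + bias)[-top_k:]

def update_bias(expert_counts):
    global bias
    target = expert_counts.mean()
    bias += step * np.sign(target - expert_counts)  # push load toward the average

# Simulate a batch of routing decisions, then rebalance afterwards.
rng = np.random.default_rng(0)
counts = np.zeros(n_experts)
for _ in range(1000):
    chosen = select_experts(rng.standard_normal(n_experts))
    counts[chosen] += 1
update_bias(counts)
print(bias)  # over-used experts drift negative, under-used experts drift positive
```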
DeepSeek-R1: unsupervised reinforcement learning for reasoning
DeepSeek-R1 is a reasoning model. Reasoning models are fine-tuned to perform logical inference, problem-solving, and structured thinking. They excel at answering questions that involve multiple steps, like math and programming. Building this type of model typically involves supervised learning on large labeled datasets in which formal reasoning has been modeled across different scenarios. Such an approach essentially provides example inputs and outputs so the model can learn a more specialized task. While this requires less compute than training the original model from scratch, generating labeled datasets has a steep labor cost. DeepSeek aimed to address this cost by using unsupervised reinforcement learning.
Reinforcement learning is a type of machine learning where a computer program learns by interacting with an environment to achieve a goal rather than having a behavior modeled for it. The program takes actions, gets rewards (positive or negative), and adjusts its behavior to maximize future rewards. In this case, the rewards came from code compilers, math solvers, and other formal tools, which measured the correctness of the model’s output and flagged issues to fix.
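As an illustration of what a rule-based reward can look like, here are two toy checks in the same spirit. These are my own simplifications, not DeepSeek’s actual reward functions.

```python
# Illustrative rule-based rewards: score the model's output with formal checks
# (does the code compile? does the boxed answer match?) instead of human labels.
import re

def math_reward(model_output: str, expected_answer: str) -> float:
    """Reward 1.0 if the final answer inside \\boxed{...} matches, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    return 1.0 if match and match.group(1).strip() == expected_answer else 0.0

def code_reward(model_code: str) -> float:
    """Reward 1.0 if the generated Python at least compiles, else 0.0."""
    try:
        compile(model_code, "<model_output>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0

print(math_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(code_reward("def add(a, b):\n    return a + b"))        # 1.0
```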
DeepSeek’s training pipeline also uses “Group Relative Policy Optimization” (GRPO). This approach samples a group of outputs for the same prompt, scores each one relative to the group’s average reward, and adjusts parameters to favor the outputs that outperform their peers. In other words, it extracts a useful training signal from the group itself, without needing a separately trained critic model to estimate rewards. By iteratively incorporating rewards and corrections, the model converges on “aha” moments where it starts to get the reasoning path correct on its own.
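Here is a sketch of just the group-relative step, leaving out the full loss and clipping machinery of the algorithm:

```python
# Sketch of the group-relative idea in GRPO: sample several outputs for the same
# prompt, score each one, and turn the scores into advantages relative to the
# group average. Outputs that beat their siblings get reinforced; outputs that
# lag get penalized.
import numpy as np

def group_relative_advantages(rewards):
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Eight sampled answers to one prompt, scored 1.0 for correct and 0.0 for incorrect.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(group_relative_advantages(rewards))
# Correct answers get positive advantages, incorrect ones negative.
```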
The first attempt at this approach, called DeepSeek-R1-Zero, was encouraging but exhibited some pretty bizarre behavior. Interestingly, it flip-flopped between reasoning in English and Chinese within a single response. To address this, DeepSeek set up a slightly more supervised approach. Instead of starting completely fresh, they provided a small set of a few thousand curated examples, called a “cold start.” This helped with the language issue.
Once the model was stable enough, they built a training pipeline where inputs and outputs were run through DeepSeek-V3 to clean up the fine-tuning data. This hybrid approach of unsupervised, reward-based training and LLM-curated fine-tuning data results in a model that competes with OpenAI’s o1 despite being trained relatively quickly and with lower labor costs.
Remaining thoughts and questions
Despite DeepSeek’s transparency, their approach still raises a few key questions:
Are proprietary models adopting similar cost-saving techniques?
Some of DeepSeek’s papers have been publicly available since early 2024. This means OpenAI, Anthropic, and others may have already integrated some of the company’s optimizations. If they have, can we expect to see cost savings and performance improvements in next-generation closed models?
OpenAI just launched o3-mini, which is substantially cheaper than o1. The o3-mini release and DeepSeek’s paper use similar language regarding small model performance on reasoning tasks. This could indicate that some of DeepSeek’s innovations are already available in other commercial models.
How will DeepSeek-R2 compare if trained with a larger budget?
If DeepSeek had the same resources as OpenAI, how much further could it push performance? The reality is that DeepSeek-R1 is only about as good as OpenAI’s o1. Did the DeepSeek team stop there because they were focused on quantifying cost savings? Or is there an impending limit on the performance of this class of model?
Why aren’t DeepSeek’s data-sourcing practices more transparent?
While DeepSeek’s training methods and parameter weights are open-source, their training dataset and fine-tuning rewards are not. DeepSeek’s papers mention “14.8 trillion high-quality and diverse tokens” from “internet sources,” but I can’t find any more information about what this dataset contains. OpenAI has openly accused DeepSeek of using the US company’s data and model outputs in training. Can DeepSeek’s model be truly open without more information about what it’s trained on?
How does DeepSeek’s Chinese-English training set affect bias?
DeepSeek’s model is trained on over 60% Chinese-language data. This differs from other models, which are much more aligned with North American content. This means there's a very real chance that, intentionally or not, DeepSeek’s outputs are biased in unexpected ways. DeepSeek researchers note that a special effort was made to “filter out the contentious content from our pre-training corpus to mitigate the data bias introduced from specific regional cultures.” However, the exclusion of certain content is, in itself, a bias.
Other AI models are absolutely biased. There is a standard “buyer beware” with any LLM: bias in the training data can produce unexpected and negative outputs. The difference in DeepSeek’s training data just means bias-mitigation strategies, for both prompts and outputs, may need to differ when using these models.
On the other hand, DeepSeek’s more diverse training could represent an opportunity to provide different perspectives. In the near future, it might be commonplace to run problems across multiple LLMs with different cultural or linguistic backgrounds to extract additional nuance.
Should companies use DeepSeek’s hosted services?
As with other LLMs, businesses should consider data residency and compliance issues when using hosted AI services. DeepSeek is a Chinese company and its hosted solution must comply with Chinese data policies. TikTok has been under massive scrutiny lately due to ByteDance’s engineering operations in China. As a Chinese entity, DeepSeek’s hosted services should be under the same level of scrutiny.
Final thoughts
If you’ve read DeepSeek’s papers, I’m curious to know what you think. Did you find this analysis useful, and are there other aspects of DeepSeek’s approach that you think need further clarification before it can be considered a truly open model?
If you’ve read this far and see any errors that I need to correct, please reach out to <mike@intlabs.io>.