Redefining Language Models with Synthetic Data

The landscape of AI is evolving rapidly, and language models, particularly those designed for reasoning and problem-solving tasks, are at the heart of this revolution. One such breakthrough in AI is Phi-4, a 14-billion parameter model developed by Microsoft Research. What sets Phi-4 apart from its predecessors and other models is its innovative approach to training—especially its use of synthetic data. By prioritizing the quality of data over sheer quantity, Phi-4 demonstrates remarkable improvements in reasoning capabilities, STEM-focused question answering, and coding tasks.

In this blog, we will explore Phi-4 in detail, analyzing every component of its architecture, training process, and post-training innovations. We’ll break down its key strengths, discuss areas of improvement, and explain how it outperforms many other language models—even those much larger in size. By the end of this deep dive, you’ll understand why Phi-4 isn’t just another model, but a true leap forward in the field of natural language processing (NLP).

Learning Objectives

Learn why synthetic data is crucial for phi-4’s development and how it boosts performance in long-context tasks.
Learn how the team trains Phi-4 using diverse data sources, including synthetic and non-synthetic data, across three training stages.
Discover how phi-4’s context length increases from 4K to 16K tokens in midtraining and its impact on performance.
See how Phi-4 undergoes evaluation on real-world tasks like question answering, summarization, and retrieval-augmented generation, and compare its performance.
Get a guide on running phi-4 locally, covering technical setup, system requirements, and challenges like overfitting and data contamination.

This article was published as a part of the Data Science Blogathon.

Why Synthetic Data Matters?

At its core, Phi-4 is a 14-billion parameter language model developed by Microsoft Research. The model builds on the successes of previous iterations in the Phi family, such as Phi-3, but introduces several key innovations that significantly enhance its performance on reasoning-heavy tasks. Unlike many other large language models (LLMs) that rely primarily on massive amounts of organic data (like web content, books, and code repositories), Phi-4 strategically incorporates a large amount of synthetic data in its training pipeline. This focus on synthetic data, combined with other training innovations, allows Phi-4 to achieve better performance in key areas—particularly STEM-related question answering and complex problem-solving.

Why Synthetic Data is Key for Phi-4?

In the AI community, data is the lifeblood of training models. Typically, LLMs are trained using massive datasets scraped from the web or curated from books and papers. While this organic data is useful, it often contains inconsistencies, irrelevant information, or a lack of structured challenges that would push the model’s reasoning abilities. This is where synthetic data comes in.

Role of Synthetic Data in Phi-4

The team artificially generates synthetic data to meet specific training objectives, making it a highly effective tool for guiding the model’s learning process. For Phi-4, synthetic data helps build high-quality datasets that encourage strong reasoning and problem-solving abilities.

Structured Learning: Unlike organic data, which often requires models to decipher complex, indirect relationships between tokens, synthetic data allows Phi-4 to learn more systematically. For example, in math or coding tasks, the synthetic data provides clear step-by-step reasoning, making it easier for the model to follow logical progressions.
Diversity in Challenges: Synthetic data can be generated to cover a wide range of topics and skills, ensuring the model encounters various challenges. For example, Phi-4’s synthetic datasets include complex math problems, coding challenges, and scientific reasoning tasks—each designed to stretch the model’s cognitive abilities.
Alignment with Inference Contexts: One key advantage of synthetic data is that it can be generated in formats that align closely with the types of outputs the model is expected to produce during real-world interactions. This helps Phi-4 generate responses that are contextually appropriate and more aligned with user queries.

Synthetic Data Techniques in Phi-4

Phi-4’s synthetic data isn’t just randomly generated—it’s carefully crafted using a combination of advanced techniques:

Multi-agent prompting: Multiple agents (models) generate different solutions to the same problem, which are then filtered for quality and consistency. This generates diverse and nuanced examples that challenge the model’s problem-solving abilities.
Self-revision workflows: The model initially generates answers, and then critiques and refines them through iterative feedback loops. This helps improve the accuracy and reasoning in the generated responses.
Instruction reversal: For coding tasks, Phi-4 uses instruction reversal techniques. It transforms existing code snippets into problem descriptions, helping the model generate solutions effectively.

By prioritizing such techniques, Phi-4 learns to solve problems more intelligently, while also reducing biases that may arise from purely organic datasets.

How Phi-4 was Trained?

Phi-4’s impressive performance doesn’t come solely from the use of synthetic data. The model’s training curriculum is also crucial to its success. Phi-4’s creators designed a sophisticated training process that incorporates a balanced mixture of data types, including organic sources and synthetic data.

Pretraining with a Mixture of Data Sources

The phi-4 model utilizes a decoder-only transformer architecture with 14 billion parameters and initially operates with a context length of 4096 tokens. This context length is later increased to 16K tokens during a subsequent midtraining phase. The architecture shares many similarities with the phi-3-medium model but introduces several enhancements. Notably, phi-4 adopts the tiktoken tokenizer, which improves multilingual support, and has a vocabulary size of 100,352 tokens, including unused tokens. Additionally, phi-4 employs full attention across the 4K context length, a departure from the 2K sliding window approach used in phi-3-medium.

The team pretrained the model using approximately 10 trillion tokens, following a linear warm-up and decay schedule. They set the peak learning rate to 0.0003, applied a constant weight decay of 0.1, and used a global batch size of 5760. They fine-tuned hyperparameters by interpolating from shorter-duration runs and stress testing the learning rate warm-up phase to ensure model stability. After pretraining, the model underwent a brief midtraining stage to extend the original 4K context length to 16K tokens.

Since pre-trained models typically do not perform well on instruction-following tasks, the researchers chose not to rely on 0-shot evaluations, such as SIMPLE-EVALS, which require answers in a particular format. Instead, they developed a custom evaluation approach for pretraining, which combines log-likelihood assessments and few-shot prompts for various tasks. For instance, the team used log-likelihood evaluations for tasks like MMLU (5-shot), MMLU-pro, and ARCC (1-shot). Additionally, they trained the model using 1, 3, 4, and 8 few-shot examples for tasks such as TriviaQA (TQA), MBPP, MATH, and GSM8k, helping it follow the required answer formats and extract correct solutions.

Insights from the Mid-Training Phase

In the midtraining phase of phi-4, the context length is extended from the original 4K tokens to 16K tokens. During this stage, the researchers conduct a series of ablation studies to investigate how different types of data impact the model’s performance with long contexts. They compare data sources that naturally have longer contexts with synthetic data, where shorter sequences are padded to create longer ones. The results show that the model performs better when trained on data that inherently has long contexts.

The team refines their dataset by filtering out high-quality, non-synthetic data like academic papers, books, and code. They isolate samples longer than 8K tokens and give more weight to those 16K tokens or longer. New synthetic datasets are created with sequences longer than 4K tokens. The final dataset mixture contains 30% long-context data and 70% recall tokens from pretraining. To accommodate the increased context length, the team sets the rotary position encoding (RoPE) base frequency to 250K. They reduce the maximum learning rate by a factor of 10 and train the model with 250 billion tokens.

To evaluate phi-4’s ability to handle long contexts, the researchers emphasize a diverse set of real-world tasks, rather than relying solely on synthetic benchmarks like needle-in-a-haystack or RULER, which are simpler but less reflective of practical scenarios. The team selects these tasks from the HELMET [YGH+24] evaluation suite and averages the results across five runs for each category.

Evaluation Framework

The evaluation framework includes the following tasks:

Recall: The model retrieves a specific value from a randomly generated long JSON file based on a given key, measured using the SubEM metric.
RAG (Retrieval-Augmented Generation): The model answers questions based on multiple retrieved and shuffled Wikipedia documents, with datasets such as NaturalQuestions, HotpotQA, and PopQA. The final results are averaged across all datasets, evaluated with the SubEM metric.
Re-rank: In this task, the model re-ranks the top-10 documents retrieved for a given query, using the MSMARCO dataset. Performance is measured with nDCG@10.
ICL (In-Context Learning): This task tests the model’s ability to perform many-shot in-context learning on datasets like TREC coarse, TREC fine, Banking77, NLU, and CLINC150. The results are averaged across all datasets, with performance measured by the F1 score.
QA (Question Answering): The model answers questions based on lengthy documents from the NarrativeQAv2 dataset, with performance evaluated using GPT-4o scoring.
Summ (Summarization): The task involves summarizing long legal documents from the Multi-LexSum dataset, with results evaluated using GPT-4o scoring.

This comprehensive evaluation strategy thoroughly tests Phi-4’s long-context capabilities across various practical tasks. It reflects the model’s real-world applicability.

Outcomes and Reflections from Post-Training

Post-training is aimed at transforming the pretrained language model into an AI assistant that users can
safely interact with. Phi-4 align the pretrained model with one round of SFT, one round of DPO on data from our pivotal token search method and one round of DPO on full length preference pairs. The model undergoes chat fine-tuning using the standard ChatML format. An example usage template for two rounds of conversation is as follows:

Innovative Post-Training Techniques

Once pretraining is complete, Phi-4 enters a post-training phase where further fine-tuning takes place. This stage focuses on refining the model’s reasoning abilities and improving the quality of its outputs. Several post-training innovations contribute to Phi-4’s impressive performance:

Supervised Fine-Tuning: In this phase, researchers fine-tune the pretrained model with a learning rate of 10−6on a variety of data generated from high-quality data across diverse domains, including math, coding, reasoning, conversation, model identity, and safety. They also added multilingual data for 40 languages. They use around 8B tokens of data in this phase, all formatted in the chatml format.
Direct Preference Optimization: Researchers use DPO to align the model with human preferences, and also to steer the model away from unwanted behavior through pairs of desired and undesired outputs. DPO data covers chat format data, reasoning, and Responsible AI (RAI) data and improves the model in math, coding, reasoning, robustness, and safety. They did two rounds of DPO on the SFT model.
Pivotal Token Search (PTS): A novel technique developed for Phi-4, PTS identifies key tokens in a response that have a significant impact on the overall success of the model’s output. This allows the model to focus on improving specific, critical tokens in its responses, ensuring greater accuracy and robustness.

Performance on Key Benchmarks

To assess Phi-4’s capabilities, it’s essential to examine its performance on standard benchmarks. Phi-4 consistently outperforms its predecessors and many larger models across several critical tasks.

STEM and Reasoning Tasks

Phi-4 shines particularly in STEM-focused question answering (such as GPQA for graduate-level questions) and mathematics competitions (MATH). Despite being smaller than models like Llama-3, Phi-4 achieves comparable or superior results on these reasoning-heavy tasks. This is a testament to the model’s effective use of synthetic data and its focus on structured, logical problem-solving.

For example, Phi-4 outperforms its teacher model, GPT-4, on many reasoning benchmarks such as GPQA and MATH, despite being a smaller model. The incorporation of high-quality synthetic data and innovative training techniques has allowed Phi-4 to surpass the capabilities of much larger models in these areas.

Coding and Technical Tasks

In coding tasks, Phi-4 also excels, outperforming models such as GPT-4 mini and Qwen 2.5. Whether it’s solving algorithmic problems in HumanEval or tackling more complex programming challenges, Phi-4’s ability to reason and apply logic effectively makes it one of the top performers in the coding space.

Safety

Phi-4 demonstrates robust safeguards against generating harmful or biased content, ensuring ethical and responsible AI interactions during benchmarking.

How to Run Phi-4 Locally

Running Phi-4 locally allows you to interact with this advanced AI model directly from your system, offering convenience and flexibility for testing or application development. Follow the steps below to set it up:

Install Ollama

Ollama is a tool that facilitates running and interacting with AI models like Phi-4. Begin by installing Ollama on your system. You can find detailed installation instructions on Ollama’s official website.

Run Phi-4 in the Command Line

Once Ollama is installed, you can run the Phi-4 model with a single command in your terminal or PowerShell:

ollama run vanilj/Phi-4

This command initializes the Phi-4 model and allows you to interact with it directly in your CLI. You can start chatting or asking questions immediately.

Integrate Phi-4 with LangChain

For more advanced use cases, such as integrating Phi-4 into a workflow or application, you can use LangChain with Ollama. LangChain provides tools for working with language models programmatically.

Install the LangChain-Ollama library:

%pip install -U langchain-ollama

Use the following Python script to run Phi-4 via LangChain:

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
template = """Question: {question}
Answer: Let's think step by step."""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="vanilj/Phi-4")
chain = prompt | model
print(chain.invoke({"question": "Write a poem on AI?"}))

Challenges: Dealing with Overfitting and Data Contamination

No model is perfect, and Phi-4 has its own set of challenges. Overfitting is a common concern in AI development. It happens when a model becomes too specialized to training data, hurting generalization. Phi-4 tackles this by using a data decontamination process. This ensures no test data is included in training, reducing overfitting risk.

Overfitting Mitigation

By using fresh datasets, such as the November 2024 AMC-10 and AMC-12 math competitions, Phi-4 has shown that it can generalize well beyond its training set and perform excellently on new tasks. This is crucial for ensuring that Phi-4 remains a robust and reliable tool for real-world applications.

Weaknesses

Instruction Following: While Phi-4 performs well in reasoning tasks, it struggles with strict instruction-following. Tasks requiring specific formatting or complex stylistic instructions can sometimes cause the model to veer off course.
Factual Hallucinations: Phi-4 still struggles with factual accuracy in some cases, particularly in generating information about non-existent or hypothetical individuals.

Conclusion

Phi-4 is a game-changer in the world of language models. Its combination of innovative synthetic data generation, cutting-edge training techniques, and post-training refinements sets it apart from many other models. Phi-4 demonstrates that with the right approach to training, quality can trump quantity—achieving superior performance in reasoning-heavy tasks, STEM Q&A, and coding challenges, despite being smaller than many contemporary models.

Phi-4 is not without its challenges, particularly around instruction-following and factual accuracy. However, its remarkable abilities in logical reasoning and problem-solving make it a significant step forward in the AI space. As AI evolves, Phi-4’s use of synthetic data sets a model for future developments in the field. It helps push the boundaries of what’s possible with language models.

Key Takeaways

Phi-4 leverages synthetic data to prioritize quality over quantity, enhancing its reasoning, STEM question answering, and coding capabilities.
Synthetic data in Phi-4 introduces structured learning, diverse challenges, and better alignment with real-world inference contexts.
Phi-4’s training includes pretraining, midtraining with extended context lengths, and innovative post-training techniques for fine-tuning.
Midtraining expands Phi-4’s context length from 4K to 16K tokens, optimizing it for long-context tasks.
Evaluation of Phi-4 emphasizes real-world tasks like RAG, summarization, and in-context learning for practical insights.
Post-training innovations, including Supervised Fine-Tuning and Direct Preference Optimization, refine Phi-4’s reasoning and safety.
Phi-4’s architecture, coupled with advanced datasets and training techniques, sets a new benchmark in NLP for handling complex problem-solving tasks.

Frequently Asked Questions

Q1. What is phi-4 and how is it different from previous models?

A. Phi-4 is a large-scale, state-of-the-art AI model based on a decoder-only transformer architecture. Phi-4 builds on models like Phi-3-medium by increasing the context length to 16K tokens. It also introduces improved data preprocessing techniques, including tiktoken, for better multilingual support.

Q2. Why is synthetic data important for training phi-4?

A. Synthetic data plays a key role in training phi-4, as it helps the model handle long-context tasks more effectively. By combining real-world data with synthetically generated sequences, Phi-4 generalizes better across diverse scenarios. This improves its performance on tasks requiring reasoning across large datasets.

Q3. What are the key stages of phi-4’s training process?

A. Phi-4’s training involves three stages. Pretraining uses diverse data sources. Midtraining expands context length from 4K to 16K tokens. Posttraining includes fine-tuning techniques like SFT, reinforcement learning with DPO, and token sampling (PTS) from the pretraining stage.

Q4. How does phi-4 perform on real-world tasks?

A. Phi-4 excels on a wide range of real-world benchmarks, including question answering, summarization, and retrieval-augmented generation. Phi-4 excels in reasoning tasks over lengthy documents, evaluated using diverse datasets from the HELM evaluation suite.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

I’m a Data Scientist at Syngene International Limited. I have completed my Master’s in Data Science from VIT AP and I have a burning passion for Generative AI. My expertise lies in building robust machine learning and NLP models for innovative projects. Currently, I’m putting this knowledge to work in drug discovery research at Syngene, exploring the potential of LLMs. Always eager to learn and delve deeper into the ever-evolving world of data science and AI!

Source link

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31