
With the release of DeepSeek V3 and R1, U.S. tech giants are struggling to regain their competitive edge. Now, DeepSeek has introduced Janus Pro, a state-of-the-art multimodal AI that further solidifies its dominance in both understanding and generative AI tasks. Janus Pro outperforms many leading models in multimodal reasoning, text-to-image generation, and instruction-following benchmarks.
Janus Pro builds upon its predecessor, Janus, by introducing optimized training strategies, expanding its dataset, and scaling its model architecture. These enhancements enable Janus Pro to achieve notable improvements in multimodal understanding and text-to-image instruction following, setting a new benchmark in the field of AI. In this article, we will dissect the research paper to help you understand what's inside DeepSeek Janus Pro and how you can access DeepSeek Janus Pro 7B.
The DeepSeek Janus Pro 7B is an AI model designed to handle tasks across multiple modalities, namely text and images, all in one system. What makes it stand out is its design: it separates the processing of visual information into different pathways while using a single transformer framework to bring everything together. This setup makes the model more flexible and efficient, whether it is analyzing existing content or generating new content. Compared to older multimodal AI models, Janus Pro 7B takes a big step forward in both performance and versatility.
In a nutshell, Janus-Pro outperforms both unified multimodal models and specialized models, making it a top-performing AI for both understanding and generating visual content.
DeepSeek Janus Pro incorporates improvements in four primary areas: training strategies, data scaling, model architecture, and implementation efficiency.
Janus-Pro refines its training pipeline to address computational inefficiencies observed in Janus: it trains longer on ImageNet data in the first stage, drops the ImageNet detour in favor of direct training on text-to-image data in the second, and rebalances the fine-tuning data mixture in the third. Each stage is detailed in the training section later in this article.
To boost the multimodal understanding and visual generation capabilities, Janus-Pro significantly expands its dataset: roughly 90 million additional samples for multimodal understanding (including image captions and document, chart, and table data), plus around 72 million synthetic aesthetic samples for text-to-image generation, bringing the real-to-synthetic data ratio to about 1:1.
Janus-Pro scales the architecture of the original Janus from a 1.5B-parameter language model up to 7B parameters, releasing both Janus-Pro-1.5B and Janus-Pro-7B variants.
Janus-Pro adheres to an autoregressive framework with a decoupled visual encoding approach: a dedicated understanding encoder (based on SigLIP) extracts semantic features for image analysis, a separate generation encoder (a VQ tokenizer) converts images into discrete token IDs for synthesis, and a single autoregressive transformer processes both streams.
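To make the decoupled design concrete, here is a minimal, purely illustrative PyTorch sketch of the idea. The class, layer sizes, and method names are hypothetical simplifications for intuition, not DeepSeek's actual implementation:

```python
import torch
import torch.nn as nn

class DecoupledJanusSketch(nn.Module):
    """Illustrative only: two separate visual pathways, one shared transformer."""

    def __init__(self, d_model=512, vocab_size=1000):
        super().__init__()
        # Understanding pathway: maps raw pixels to semantic features
        # (Janus-Pro uses a SigLIP-style encoder for this role).
        self.und_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_model))
        # Generation pathway: maps discrete image-token IDs to embeddings
        # (Janus-Pro uses a VQ tokenizer's codebook for this role).
        self.gen_embed = nn.Embedding(vocab_size, d_model)
        # Single autoregressive transformer shared by both tasks.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Separate output heads for text tokens and image tokens.
        self.text_head = nn.Linear(d_model, vocab_size)
        self.image_head = nn.Linear(d_model, vocab_size)

    def understand(self, image, text_embeds):
        # Understanding: image features and text embeddings share one sequence.
        img_feat = self.und_encoder(image).unsqueeze(1)
        seq = torch.cat([img_feat, text_embeds], dim=1)
        return self.text_head(self.transformer(seq))

    def generate_step(self, image_token_ids):
        # Generation: predict the next discrete image token autoregressively.
        seq = self.gen_embed(image_token_ids)
        return self.image_head(self.transformer(seq)[:, -1])
```

The point of the decoupling is that analysis and synthesis place conflicting demands on a single visual encoder; giving each task its own pathway while sharing the transformer is what lets one model do both well.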
The understanding module enables the model to analyze and describe images based on an input query (the full inference code for this appears later in this article).
Example:
Input: An image of a cat sitting on a table + “Describe the image.”
Output: “A small white cat is sitting on a wooden table.”
The text-to-image generation module enables the model to create new images from textual descriptions.
Example:
Input: “A dragon flying over a castle at sunset.”
Output: AI-generated image of a dragon soaring above a medieval castle at sunset.
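Under the hood, generation works by autoregressively sampling discrete image tokens and then decoding them back into pixels. The sketch below is condensed from the text-to-image example in DeepSeek's official Janus GitHub repository; it assumes `vl_gpt` and `vl_chat_processor` are already loaded (as shown in the access section later in this article), skips the chat-template formatting the official example applies to the prompt, and uses method names such as `gen_head`, `prepare_gen_img_embeds`, and `gen_vision_model.decode_code` as they appear in that repo, which may change between releases:

```python
import numpy as np
import torch
from PIL import Image

@torch.inference_mode()
def generate_image(vl_gpt, vl_chat_processor, prompt, cfg_weight=5.0,
                   temperature=1.0, num_tokens=576, img_size=384, patch_size=16):
    # Two prompt rows: conditional (row 0) and unconditional (row 1),
    # used for classifier-free guidance (CFG).
    input_ids = torch.LongTensor(vl_chat_processor.tokenizer.encode(prompt))
    tokens = input_ids.unsqueeze(0).repeat(2, 1).cuda()
    tokens[1, 1:-1] = vl_chat_processor.pad_id  # blank out the unconditional row
    inputs_embeds = vl_gpt.language_model.get_input_embeddings()(tokens)

    generated = torch.zeros((1, num_tokens), dtype=torch.int).cuda()
    past = None
    for i in range(num_tokens):
        out = vl_gpt.language_model.model(
            inputs_embeds=inputs_embeds, use_cache=True, past_key_values=past)
        past = out.past_key_values
        logits = vl_gpt.gen_head(out.last_hidden_state[:, -1, :])
        # CFG: push the distribution toward the conditional prediction.
        logits = logits[1] + cfg_weight * (logits[0] - logits[1])
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        generated[0, i] = next_tok
        # Feed the sampled image token back in for both CFG rows.
        inputs_embeds = vl_gpt.prepare_gen_img_embeds(next_tok.repeat(2)).unsqueeze(1)

    # Decode the 24x24 grid of discrete tokens back into a 384x384 image.
    dec = vl_gpt.gen_vision_model.decode_code(
        generated, shape=[1, 8, img_size // patch_size, img_size // patch_size])
    arr = (dec.float().cpu().numpy().transpose(0, 2, 3, 1) + 1) / 2 * 255
    return Image.fromarray(np.clip(arr[0], 0, 255).astype(np.uint8))
```

Called with a prompt like the dragon example above, this returns a PIL image you can display or save.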
| Component | Function |
| --- | --- |
| Und. Encoder (understanding) | Extracts visual features from input images. |
| Text Tokenizer | Converts text input into tokens for processing. |
| Auto-Regressive Transformer | Central module that handles both text and image generation sequentially. |
| Gen. Encoder (generation) | Converts generated image tokens into structured representations. |
| Image Decoder | Produces an image from encoded representations. |
| Text De-Tokenizer | Converts generated text tokens into human-readable responses. |
The DeepSeek Janus-Pro model is a powerful vision-language AI system that enables both image comprehension and text-to-image generation. By leveraging autoregressive learning, it produces text and images in a structured and scalable manner.
Janus-Pro modifies the three-stage training pipeline used by Janus:
- Stage I: training on ImageNet runs longer, so the model learns pixel dependencies more thoroughly before moving on.
- Stage II: the ImageNet data is dropped entirely and the model trains directly on dense text-to-image data, improving both training efficiency and generation quality.
- Stage III: the supervised fine-tuning data mixture is rebalanced, shifting weight toward multimodal understanding data without sacrificing generation quality (see the sketch after this list).
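As a rough illustration of the Stage III change, the fine-tuning mixture ratio of multimodal, pure-text, and text-to-image data moves from 7:3:10 in Janus to 5:1:4 in Janus-Pro. The ratios come from the Janus-Pro technical report; the config structure below is hypothetical:

```python
# Hypothetical data-mixture configs; values are relative sampling weights for
# multimodal understanding, pure-text, and text-to-image data in Stage III.
janus_stage3_mix = {"multimodal": 7, "text": 3, "text_to_image": 10}
janus_pro_stage3_mix = {"multimodal": 5, "text": 1, "text_to_image": 4}

def sampling_probs(mix):
    # Convert relative weights into per-source sampling probabilities.
    total = sum(mix.values())
    return {k: v / total for k, v in mix.items()}

print(sampling_probs(janus_stage3_mix))      # multimodal 0.35, text 0.15, t2i 0.50
print(sampling_probs(janus_pro_stage3_mix))  # multimodal 0.50, text 0.10, t2i 0.40
```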
Janus-Pro utilizes the HAI-LLM framework, leveraging NVIDIA A100 GPUs for distributed training. The entire training process is streamlined, taking 7 days for the 1.5B model and 14 days for the 7B model across multiple nodes.
Janus-Pro demonstrates significant advancements over previous models: Janus-Pro-7B scores 79.2 on the MMBench multimodal understanding benchmark and 0.80 on the GenEval text-to-image instruction-following benchmark, ahead of DALL-E 3 (0.67) and Stable Diffusion 3 Medium (0.74), along with 84.19 on DPG-Bench for dense prompt following.
Models in the Janus series: Janus, JanusFlow, and Janus-Pro (available in 1.5B and 7B variants).
First, save the required Python libraries and dependencies as requirements.txt in Google Colab, then install them with the command below.
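The article's original dependency list is not reproduced here; a minimal, assumed requirements.txt that covers this walkthrough might look like the following (the janus package is installed straight from DeepSeek's official deepseek-ai/Janus GitHub repository, which provides MultiModalityCausalLM and VLChatProcessor):

```
# Assumed minimal requirements.txt for this walkthrough
git+https://github.com/deepseek-ai/Janus.git
torch
transformers
Pillow
```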
```bash
!pip install -r /content/requirements.txt
```
Next, import the required libraries and load the model and processor with the code below:
```python
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

# specify the path to the model
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# placeholder inputs: point these at your own image and question
image = "/content/image.png"
question = "Describe the image."

conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare the inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```
Refer to the deepseek-ai/Janus-Pro-7B model card for the full code with a Gradio interface.
Input image: a logo featuring the text "License to Call" (not reproduced here).
Output:
The image contains a logo with a stylized design that includes a circular pattern resembling a target or a camera aperture. Within this design, there is a cartoon character with sunglasses and a hand gesture, which appears to be a playful or humorous representation.

The text next to the logo reads "License to Call." This suggests that the image is likely related to a service or product that involves calling or communication, possibly with a focus on licensing or authorization.

The overall design and text imply that the service or product is related to communication, possibly involving a license or authorization process.
DeepSeek Janus-Pro produces an impressive and human-like description with excellent structure, vivid imagery, and strong coherence. Minor refinements could make it even more concise and precise.
The text recognition output is accurate, clear, and well-structured, effectively capturing the main heading. However, it misses smaller text details and could mention the stylized typography for a richer description. Overall, it’s a strong response but could be improved with more completeness and visual insights.
A strong and diverse text-to-image generation output with accurate visuals and descriptive clarity. A few refinements, such as fixing text cut-offs and adding finer details, could elevate the quality further.
Check out our detailed articles on how DeepSeek works and how it compares with similar models:
Despite its successes, Janus-Pro has certain limitations: the input resolution is capped at 384×384 pixels, which hurts fine-grained tasks such as OCR, and generated images at this relatively low resolution can lack fine detail, especially in small regions such as faces.
Future work could focus on scaling to higher input and output resolutions, which would sharpen fine-grained understanding and add detail to generated images.
Janus-Pro marks a transformative step in multimodal AI. By optimizing training strategies, scaling data, and expanding model size, it achieves state-of-the-art results in multimodal understanding and text-to-image generation. Despite some limitations, Janus-Pro lays a strong foundation for future research in scalable, efficient multimodal AI systems. Its advancements highlight the growing potential of AI to bridge the gap between vision and language, inspiring further innovation in the field.
Stay tuned to Analytics Vidhya Blog for more such awesome content!