
In the last two weeks, if anything besides the crashing stock markets or the new US President has made the headlines, it is DeepSeek. The Chinese AI firm has been launching one strong LLM after another, and it has now released its state-of-the-art image generation model, Janus Pro-7B. This latest release has already outperformed OpenAI’s DALL-E 3 and Stable Diffusion on benchmarks such as GenEval and DPG-Bench. The best part? It is open source! So in this blog, we will test DeepSeek’s Janus Pro-7B against DALL-E 3 to find out which image generation model is better.
Developed by DeepSeek AI, Janus-Pro is an advanced multimodal large language model (LLM). It is an enhanced version of the previous Janus model.
The Janus Pro model has been tested rigorously over 20 different benchmarks, and the results are as follows:
Text-to-Image Generation
Multimodal Understanding
Janus-Pro involves a three-stage training process using an architecture that integrates decoupled components for multimodal and text-to-image tasks:
I’ll be testing DeepSeek’s Janus Pro-7B against OpenAI’s DALL-E 3 model on various image generation and image analysis tasks.
Let’s get started!
Prompt: “Based on the score in the image, which team has a higher probability of winning this game?”
Janus Pro-7B
“Based on the score in the image, the team IND (India) has a higher probability of winning this game. IND has 157/4, which means they have a lead of 157 runs with 4 wickets remaining. The team from the other side, the one with the batting score, is currently on 123/6, meaning they are trailing by 34 runs with 6 wickets remaining”
DALL-E 3
“The score displayed in this image shows that India is batting at 137/1 in 12.5 overs against South Africa. This indicates a strong position for India due to:
Unless South Africa has already posted a very high target, India’s position is dominant. However, without knowing the target or the context of the game (e.g., first innings or chase), a definitive probability cannot be calculated. Based on this information alone, India has a higher probability of winning.”
Model | Performance |
---|---|
Janus Pro-7B | The model identified the teams accurately and picked the likelier winner, but it misread the scores shown in the image, so its overall analysis was flawed. |
DALL-E 3 | The model correctly identified both the teams and the score, and it assessed the winning chances correctly based on the information available in the image. |
Unlike Janus Pro-7B, DALL-E 3 read the score correctly and also noted that, without complete information, it can’t predict the exact win probability. Thus, DALL-E 3 hit the ball out of the park on this one!
Winner: DALL-E 3
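To make the scoreboard arithmetic concrete, here is a minimal sketch of how a score such as 137/1 in 12.5 overs can be read. It assumes the runs/wickets-lost convention and a ten-wicket innings; the `Score` class and `parse_score` helper are hypothetical names chosen for illustration, not part of either model. Under this convention, a score like 157/4 means four wickets have fallen, not that four remain.

```python
from dataclasses import dataclass

BALLS_PER_OVER = 6
WICKETS_PER_INNINGS = 10  # assumption: a standard ten-wicket innings


@dataclass
class Score:
    runs: int
    wickets_lost: int
    balls_bowled: int

    @property
    def wickets_in_hand(self) -> int:
        return WICKETS_PER_INNINGS - self.wickets_lost

    @property
    def run_rate(self) -> float:
        # Runs per over; return 0.0 if no ball has been bowled yet.
        if self.balls_bowled == 0:
            return 0.0
        return self.runs * BALLS_PER_OVER / self.balls_bowled


def parse_score(score: str, overs: str) -> Score:
    """Parse a scoreboard pair such as ("137/1", "12.5")."""
    runs, wickets_lost = (int(part) for part in score.split("/"))
    # In overs notation the digit after the point counts balls, not tenths:
    # "12.5" means 12 completed overs plus 5 balls.
    whole, _, balls = overs.partition(".")
    balls_bowled = int(whole) * BALLS_PER_OVER + int(balls or 0)
    return Score(runs, wickets_lost, balls_bowled)


india = parse_score("137/1", "12.5")
print(india.wickets_in_hand)     # 9
print(round(india.run_rate, 2))  # 10.68
```

A run rate above ten an over with nine wickets in hand is what makes the position look strong, although, as the response above notes, it says nothing definitive without the target.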
Prompt: “Explain the backstory behind this image”
Janus Pro-7B
“This image depicts a character who is wearing an advanced suit that includes a high-tech armored glove with a glowing element on the hand, reminiscent of technology often associated with superheroes in popular culture. The character is also wearing a chest piece with intricate and complex designs, suggesting advanced technology and possibly serving as a power source or a protective device.
The visual style and elements of the image are typical of science fiction and superhero films, where characters often possess superhuman abilities and wear specialized suits or armor to enhance their powers and protect themselves.
This particular image is likely a scene from a movie where the character is engaging in a battle or using their powers, given the intensity of the action and the dramatic lighting. The glowing element in the hand suggests the use of some form of energy manipulation, which is a common theme in superhero narratives.”
DALL-E 3
Janus Pro-7B | DALL-E 3 |
---|---|
The model gives a detailed description of the image but is not able to give the backstory behind it. | The model correctly identifies the image as a snippet from a Marvel movie and, based on that, gives a brief and accurate backstory. It also correctly identifies the main character and states the significance of the scene. |
Janus Pro-7B gave a detailed description of the image and listed its intricacies, which indicates strong image analysis capabilities, but it failed to get the backstory behind it. DALL-E 3 quickly identified the image, gave its backstory, and even explained the significance of the scene. Thus, for this task, DALL-E 3 is the clear winner.
Winner: DALL-E 3
Prompt: “Generate an image of a girl with deep blue eyes and blonde hair looking straight into a mirror, with her one hand under her face and the other one resting on her side, standing under the light of a flickering blub”
Janus Pro-7B
DALL-E 3
Janus Pro-7B | DALL-E 3 |
---|---|
The model generates the image as per the prompt. It captures all the details of the prompt and even shows a reflection of the bulb in the mirror. The image looks slightly animated. | The model generates a detailed, flawless image that captures all the nuances of the prompt. The image has the feel of a sci-fi movie. |
Both models generated impressive images in this task. At first glance, I couldn’t spot any errors in either image. On closer inspection, though, I can see slight blurriness around the bulb in the image generated by Janus Pro-7B, while the image generated by DALL-E 3 is immersive and intricately detailed.
Winner: DALL-E 3
Prompt: “Explain this meme”
Janus Pro-7B
“1. Left Side (Phones in 1990s):
2. Right Side (Phones in 2020):
The meme humorously contrasts the durability and robustness of phones from the 1990s with the fragility and frequent battery drain of modern smartphones. It highlights the significant changes in technology and user expectations over the years, with the 1990s phones being seen as more reliable and less demanding.
The meme is a playful critique of modern smartphones, poking fun at their perceived fragility and frequent need for charging, in contrast to the robust and long-lasting phones of the past.
Let me know if you’d like further clarification!”
DALL-E 3
Janus Pro-7B | DALL-E 3 |
---|---|
The model correctly explains both sides of the meme. It then provides context for the image and closes with a brief conclusion. | The model gives a slightly more detailed explanation of the meme, with a rather verbose message at the end. |
Both models explained the meme accurately, but the response from Janus Pro-7B is crisper and clearer.
Winner: Janus Pro-7B
Task | Winner |
---|---|
Guess the Winner | DALL-E 3 |
Tell me the “back” story | DALL-E 3 |
Generate an image | DALL-E 3 |
Meme talk | Janus Pro-7B |
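The scoreboard above is easy to reproduce. The sketch below simply counts the winners, with the task names and results copied from the table; the variable names are illustrative only.

```python
from collections import Counter

# Task -> winner, copied from the summary table above.
results = {
    "Guess the Winner": "DALL-E 3",
    'Tell me the "back" story': "DALL-E 3",
    "Generate an image": "DALL-E 3",
    "Meme talk": "Janus Pro-7B",
}

tally = Counter(results.values())
overall, wins = tally.most_common(1)[0]
print(f"{overall} wins {wins} of {len(results)} tasks")
```

Four tasks is a small sample, so the count is a summary of these tests rather than a general ranking.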
DeepSeek’s Janus Pro-7B is a great addition to the world of image generation and multimodal LLMs. However, DALL-E 3, with its advanced integration and richer dataset, maintains a lead in versatility and real-world application. As DeepSeek continues to improve Janus Pro-7B and integrate it into its broader ecosystem, it is set to become a challenger to industry leaders like DALL-E 3, Stable Diffusion, and Midjourney. The open-source nature of Janus Pro-7B makes it a go-to tool for researchers and developers who want to harness advanced image generation capabilities.
Stay tuned to Analytics Vidhya Blog for more such awesome content!
A. Janus Pro-7B is a multimodal large language model (LLM) developed by DeepSeek AI. It specializes in text-to-image generation and multimodal understanding.
A. Janus Pro-7B uses a decoupled architecture, making it highly efficient in handling both multimodal tasks and text-to-image generation, with enhanced accuracy and scalability.
A. Janus Pro-7B excels in image analysis and provides detailed responses but, compared to DALL-E 3, falls short in some real-world tasks, such as reading scores accurately or interpreting backstories.
A. Yes, Janus Pro-7B is open-source and available on platforms like Hugging Face, making it accessible for developers and researchers.
A. Janus Pro-7B has been tested on benchmarks like GenEval, DPG-Bench, MMMU, and MME, outperforming competitors like DALL-E 3 in certain aspects.
A. Yes, Janus Pro-7B generates amazing images following detailed prompts.
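Since the model is available on Hugging Face, here is a minimal sketch of how one might locate it programmatically using only the Python standard library. The repository ID `deepseek-ai/Janus-Pro-7B` is an assumption about where the checkpoint is published, and the helper names are hypothetical. `fetch_model_info` queries the Hub's public model-metadata endpoint and needs network access, so it is defined but not called here.

```python
import json
import urllib.request

# Assumption: the checkpoint is published under this repository ID.
REPO_ID = "deepseek-ai/Janus-Pro-7B"
HUB = "https://huggingface.co"


def model_page_url(repo_id: str = REPO_ID) -> str:
    """Return the human-readable model page for a repository."""
    return f"{HUB}/{repo_id}"


def model_api_url(repo_id: str = REPO_ID) -> str:
    """Return the JSON metadata endpoint for a repository."""
    return f"{HUB}/api/models/{repo_id}"


def fetch_model_info(repo_id: str = REPO_ID, timeout: float = 10.0) -> dict:
    """Download the public JSON metadata for a model (requires network)."""
    with urllib.request.urlopen(model_api_url(repo_id), timeout=timeout) as resp:
        return json.load(resp)


print(model_page_url())
```

To download the weights themselves, the `snapshot_download` function from the `huggingface_hub` package is the usual route. Keep in mind that a 7B-parameter checkpoint is a multi-gigabyte download.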