Introduction

A few months ago, Meta released its AI model LLaMA 3.1 (405 billion parameters), which outperformed OpenAI's and other models on several benchmarks. That upgrade built upon the capabilities of Llama 3, introducing improved reasoning, advanced natural language understanding, increased efficiency, and expanded language support. Now, staying true to its stance that “we believe openness drives innovation and is good for developers, Meta, and the world,” Meta released Llama 3.2 at Connect 2024. It is a collection of models that includes vision-capable models and lightweight text-only models small enough to run on mobile devices.

If you ask me what’s impressive about this release, Llama 3.2’s 11B and 90B vision models stand out as strong alternatives to closed models, especially for image-understanding tasks. Moreover, the 1B and 3B text-only models are optimized for edge and mobile devices, making them state-of-the-art for tasks like summarization and instruction following. These models also have broad hardware support, are easy to fine-tune, and can be deployed locally, making them highly versatile for both vision and text-based applications.

Getting Started With Meta Llama 3.2

Overview

  • Llama 3.2 Release: Meta’s Llama 3.2 introduces vision capabilities and lightweight text models, offering advanced multimodal processing and efficiency for mobile devices.
  • Vision Models: The 11B and 90B vision models excel in tasks like visual reasoning and image-text retrieval, making them strong contenders for image understanding.
  • Text Models: The 1B and 3B models are optimized for on-device tasks like summarization and instruction following, providing powerful performance on mobile devices.
  • Architecture: Llama 3.2 Vision integrates an image encoder using adapter mechanisms, preserving text model performance while supporting visual inputs.
  • Multilingual & Long Contexts: Both vision and text models support long contexts (up to 128k tokens) and multilingual input, making them versatile across languages and tasks.
  • Developer Tools & Access: Meta offers comprehensive developer tools for easy deployment, model fine-tuning, and safety mechanisms to ensure responsible AI use.

Since its release, Llama 3.1 has become quite popular and impactful. While Llama 3.1 is incredibly powerful, it has historically required substantial computational resources and expertise, limiting accessibility for many developers even as demand for building with Llama has grown. With the launch of Llama 3.2, this accessibility gap has been significantly narrowed.

The Llama 3.2 Vision (11B/90B) and the Llama Text 3.2 (1B/3B) models represent Meta’s latest advancements in multimodal and text processing AI. Each is designed for a different use case, but both showcase impressive capabilities.

Llama 3.2 Vision (11B/90B)

Llama 3.2 Vision stands out as Meta’s most powerful open multimodal model, with a keen ability to handle both visual and textual reasoning. It’s capable of tasks like visual reasoning, document-based question answering, and image-text retrieval, making it a versatile tool. What makes this model special is its Chain of Thought (CoT) reasoning, which enhances its problem-solving abilities, especially when it comes to complex visual reasoning tasks. A context length of 128k tokens allows for extended multi-turn conversations, particularly when dealing with images. However, it works best when focusing on a single image at a time to maintain quality and optimize memory use. Beyond visual inputs, it supports text-based inputs in various languages like English, German, French, Hindi, and more.

Llama 3.2 Text (1B/3B)

On the other hand, Llama 3.2 1B and 3B models are smaller but incredibly efficient, designed specifically for on-device tasks like rewriting prompts, multilingual summarization, or knowledge retrieval. Despite their smaller size, they outperform many larger models and continue to support multilingual input with a 128k token context length, making them a powerful option for offline use or low-memory environments. Like the Vision model, they were trained with up to 9 trillion tokens, ensuring robust application performance.

In essence, if you’re looking for a model that excels at handling images and text together, Llama 3.2 Vision is your go-to. For text-heavy applications that need efficiency and multilingual support on smaller devices, the 1B and 3B models deliver excellent performance without requiring large-scale computing power.

You can download these models now:

Link to Download Llama 3.2 Models

Let’s talk about the Architecture of both models: 

Llama 3.2 Vision Architecture

The 11B and 90B Llama models introduced support for vision tasks by integrating an image encoder into the language model. This was achieved by training adapter weights that allow image inputs to work alongside text inputs without altering the core text-based model. The adapters use cross-attention layers to align image and text data.

The training process started from the pre-trained Llama 3.1 text models, added image adapters, and then trained on large image-text datasets. The final stages involved fine-tuning with high-quality data, filtering, and safety measures. As a result, these models can now process both image and text prompts and perform advanced reasoning across both.

1. Text Models (Base)

  • Llama 3.1 LLMs are used as the backbone for the Vision models.
  • The 8B text model is used for the Llama 3.2 11B Vision model, and the 70B text model for the Llama 3.2 90B Vision model.
  • Importantly, these text models remain frozen while the vision components are trained, meaning the text weights receive no new updates and their original performance is preserved. This approach ensures that adding visual capabilities doesn’t degrade the model’s performance on text tasks.

2. Vision Tower

  • The Vision Tower is added to extend the model’s capability to process images along with text. The details of the vision tower aren’t fully spelled out, but this is likely a transformer-based image encoder (similar to the approach used in models like CLIP), which converts visual data into representations compatible with the text model’s embeddings.

3. Image Adapter

  • The Image Adapter functions as a bridging module between the Vision Tower and the pre-trained text model. It maps image representations into a format that the text model can interpret, effectively allowing the model to handle multimodal inputs (text + image).

4. Implications of the Architecture

  • Frozen Text Models: Keeping the text model frozen while the Vision Tower is trained preserves the linguistic capabilities it already has. This is critical because training on multimodal data can sometimes degrade text-only performance (known as “catastrophic forgetting”). The frozen approach mitigates this risk; a minimal sketch of this setup follows after this list.
  • Vision-Text Interaction: Since the model includes a Vision Tower and Image Adapter, it indicates that the 3.2 Vision models are likely designed for vision-language tasks, such as visual question answering, captioning, and visual reasoning.
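To make this concrete, here is a minimal, illustrative PyTorch sketch of the frozen-backbone-plus-adapter idea. It is not Meta’s implementation: the use of the 1B checkpoint as a stand-in backbone, the assumed image-feature dimension, and the adapter design are all simplifications for illustration.

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Illustrative sketch only; not Meta's actual adapter code.
# Any small causal LM works as a stand-in backbone (gated access needed for Meta checkpoints).
text_lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
for param in text_lm.parameters():
    param.requires_grad = False           # the text backbone stays frozen

hidden_size = text_lm.config.hidden_size  # width of the text hidden states
image_dim = 1024                          # assumed image-encoder output dimension

class CrossAttentionAdapter(nn.Module):
    """Trainable bridge: lets text hidden states attend to image features."""
    def __init__(self, hidden_size, image_dim, num_heads=8):
        super().__init__()
        self.img_proj = nn.Linear(image_dim, hidden_size)  # map image features into text space
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, text_hidden, image_feats):
        img = self.img_proj(image_feats)
        attended, _ = self.cross_attn(query=text_hidden, key=img, value=img)
        return self.norm(text_hidden + attended)           # residual keeps text behaviour intact

adapter = CrossAttentionAdapter(hidden_size, image_dim)

# Only the adapter receives gradients; the language model's knowledge is preserved.
trainable = sum(p.numel() for p in adapter.parameters())
frozen = sum(p.numel() for p in text_lm.parameters())
print(f"trainable adapter params: {trainable:,} | frozen LM params: {frozen:,}")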

Llama 3.2 1B and 3B Text Models

  1. Model Size and Structure
    • These models retain the same architecture as Llama 3.1, which likely means they are built on a decoder-only transformer structure optimized for generating text (a short config-inspection sketch follows this list).
    • The 1B and 3B parameter sizes make them relatively lightweight compared to larger models like the 70B version. This suggests these models are suitable for scenarios where computational resources are more limited or for tasks that don’t require the vast capacity of the larger models.
  2. Training with 9 Trillion Tokens
    • The extensive training with 9 trillion tokens is a massive dataset by industry standards. This large-scale training enhances the model’s ability to generalize across various tasks, increasing its versatility in handling different languages and domains.
  3. Long Context Lengths
    • The support for 128k token context lengths is a significant feature. This allows the models to maintain far longer conversations or process larger documents in one pass. This extended context is invaluable for tasks like legal analysis, scientific research, or summarizing lengthy articles.
  4. Multilingual Support
    • These models support multilingual capabilities, handling languages such as English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. This multilingual capability opens up a broader range of applications across different regions and linguistic groups, making it useful for global communication, translations, and multilingual NLP tasks.
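You can verify several of these points directly from the published model configuration (gated access on Hugging Face is required; the fields below are the standard Llama config fields):

from transformers import AutoConfig

# Inspect the 1B Instruct config (requires accepting the license on Hugging Face first).
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

print(config.model_type)               # "llama": same decoder-only family as Llama 3.1
print(config.hidden_size)              # embedding width
print(config.num_hidden_layers)        # depth of the decoder stack
print(config.max_position_embeddings)  # context window (131072 tokens, i.e. 128k)
print(config.vocab_size)               # shared multilingual vocabulary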

Vision Models: Blending Image and Text Reasoning

The 11B and 90B Llama models are the first in this series to support vision tasks, necessitating a novel architecture capable of processing both image and text inputs. This breakthrough allows the models to interpret and reason about images alongside text prompts.

The Adapter Mechanism

The core of this innovation lies in the introduction of a set of adapter weights that bridge the gap between pre-trained language models and image encoders. These adapters consist of cross-attention layers, which feed image representations from the encoder into the language model. The key aspect of this process is that while the image encoder undergoes fine-tuning during training, the language model’s parameters remain untouched. This intentional choice preserves the text-processing capabilities of Llama, making these vision-enabled models a seamless drop-in replacement for their text-only counterparts.
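Meta has not published the image encoder at this level of detail, but a CLIP-style vision transformer is a reasonable stand-in for illustration. The sketch below assumes the openly available openai/clip-vit-large-patch14 checkpoint and shows how such an encoder turns an image into the sequence of patch-level features that the cross-attention layers would consume:

import requests
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

# Stand-in encoder for illustration; Llama 3.2's actual vision tower is not public in this detail.
encoder_id = "openai/clip-vit-large-patch14"
image_processor = CLIPImageProcessor.from_pretrained(encoder_id)
encoder = CLIPVisionModel.from_pretrained(encoder_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One embedding per image patch (plus a class token); this sequence is what the
# language model's cross-attention layers would attend to.
patch_embeddings = outputs.last_hidden_state
print(patch_embeddings.shape)  # (1, 257, 1024) for ViT-L/14 at 224x224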

Training Stages

The training pipeline is divided into several stages:

  1. Pre-training on Large-Scale Noisy Data: The model is first exposed to large amounts of noisy image-text pairs, helping it learn general patterns in image and language correspondence.
  2. Fine-Tuning on High-Quality In-Domain Data: After the initial training, the model is further refined using a cleaner, more focused dataset that enhances its ability to align image content with textual understanding.

In post-training, the Llama 3.2 models follow a process similar to their text-based predecessors, involving supervised fine-tuning (SFT), rejection sampling (RS), and direct preference optimization (DPO). Additionally, synthetic data generation plays a critical role in fine-tuning the models, where the Llama 3.1 model helps filter and augment question-answer pairs to create high-quality datasets.
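As a rough illustration of the DPO step (a sketch of the standard DPO objective, not Meta’s training code), the loss rewards the policy for preferring the chosen response over the rejected one more strongly than a frozen reference model does:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on per-sequence log-probabilities (shape: batch,)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss.item())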

The end result is a set of models that can effectively process both image and text inputs, offering deep understanding and reasoning capabilities. This opens the door to more advanced multimodal applications, pushing Llama models towards even richer agentic capabilities.

Lightweight Models: Optimizing for Efficiency

In parallel with advancements in vision models, Meta has focused on creating lightweight versions of Llama that maintain performance while being resource-efficient. The 1B and 3B Llama models are designed to operate on devices with limited computational resources, without compromising on their capabilities.

Pruning and Knowledge Distillation


Two main techniques, pruning and knowledge distillation, were used to shrink the models:

  • Pruning: This process systematically removes less important parts of the model, reducing its size while retaining performance. The 1B and 3B models underwent structured pruning, removing redundant network components and adjusting the weights to make them more compact and efficient.
  • Knowledge Distillation: A larger model serves as a “teacher,” imparting its knowledge to the smaller model. For the 1B and 3B Llama models, outputs from larger models like Llama 3.1 8B and 70B were used as token-level targets during training. This approach helps the smaller models approach the performance of their larger counterparts by capturing their generalizations (a minimal distillation sketch follows after this list).
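Here is that minimal distillation sketch. It is illustrative only, since Meta’s exact pipeline and hyperparameters are not public: the student is trained to match the teacher’s softened token distribution through a KL term, blended with the usual next-token cross-entropy.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft KL term against the teacher.

    student_logits, teacher_logits: (batch, seq_len, vocab); labels: (batch, seq_len).
    """
    # Standard next-token cross-entropy on the ground-truth tokens.
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    # Match the teacher's softened distribution token by token.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1 - alpha) * kl

# Toy shapes: batch of 2, sequence of 4, vocabulary of 10.
student = torch.randn(2, 4, 10)
teacher = torch.randn(2, 4, 10)
labels = torch.randint(0, 10, (2, 4))
print(distillation_loss(student, teacher, labels).item())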

Post-training processes further refine these lightweight models, including supervised fine-tuning, rejection sampling, and preference optimization. Additionally, context length support was scaled to 128K tokens while ensuring that the quality remains intact, allowing these models to handle longer text inputs without a drop in performance.

Meta has collaborated with major hardware companies such as Qualcomm, MediaTek, and Arm to ensure these models run efficiently on mobile devices. The 1B and 3B models have been optimized to run smoothly on modern mobile SoCs, opening up new opportunities for on-device AI applications.
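Meta’s on-device path relies on partner runtimes rather than plain PyTorch, but if you simply want to squeeze the small models into a memory-constrained GPU environment, 4-bit quantization with bitsandbytes is one option. This is a sketch under the assumption that a CUDA GPU and the bitsandbytes package are available; it is not the mobile deployment route described above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"

# Load weights in 4-bit to cut memory roughly 4x versus bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

prompt = "In two sentences, why do small language models matter for on-device AI?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))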

Llama Stack Distributions: Simplifying the Developer Experience


Meta also introduced the Llama Stack API, a standardized interface for fine-tuning, data generation, and building agentic applications with Llama models. The goal is to provide developers with a consistent and easy-to-use toolchain for deploying Llama models in various environments, from on-premise solutions to cloud services and mobile devices.

The release includes a comprehensive set of tools:

  • Llama CLI: A command-line interface to configure and run Llama models.
  • Docker Containers: Ready-to-use containers for running Llama Stack servers.
  • Client Code: Available in multiple languages, such as Python, Node, Kotlin, and Swift.

Meta has partnered with major cloud providers, including AWS, Databricks, and Fireworks, to offer Llama Stack distributions in the cloud. The introduction of these APIs and distribution mechanisms makes it easier for developers to innovate with Llama models, regardless of their deployment environment.

System-Level Safety: Enhancing Responsible AI

Alongside these advancements, Meta focuses on safety and responsible AI development. With the launch of Llama Guard 3 11B Vision, the company introduced enhanced filtering for text+image prompts, ensuring that these models operate within safe boundaries. Furthermore, the smaller 1B and 3B Llama Guard models have been optimized to reduce deployment costs, making it more feasible to implement safety mechanisms in constrained environments.

Now, let’s begin with the Evaluations of both models on different benchmarks.

Evaluation of Both Models

Llama 3.2’s 11B and 90B vision models


Math & Vision

  1. MMMU: Evaluates college-level problems across disciplines that mix text, images, charts, and diagrams. Llama 3.2 90B handles these complex multimodal tasks well compared to Claude 3 Haiku and GPT-4o mini.
  2. MMMU-Pro, Standard: A more challenging variant of MMMU with mixed inputs. Llama 3.2 90B performs significantly better, showing it handles complex tasks well.
  3. MMMU-Pro, Vision: Measures how well the model solves problems presented as images. Llama 3.2 90B performs well, while GPT-4o mini has an edge in some areas.
  4. MathVista (beta): Tests the model’s capability in visually-rich math problems like spatial reasoning. Llama 3.2 90B shows strong potential in visual-math reasoning.

Image (Charts & Diagrams)

  1. ChartQA: Assesses understanding of charts and diagrams. Llama 3.2 90B is excellent at interpreting visual data, surpassing Claude 3 Haiku.
  2. AI2 Diagram: Tests diagrammatic reasoning (physics, biology, etc.). Llama 3.2 models perform strongly, showcasing excellent diagram understanding.
  3. DocVQA: Evaluates how well the model answers questions about mixed-content documents (text + images). Llama 3.2 90B performs strongly, close to Claude 3 Haiku.

General Visual & Text Understanding

  1. VQA v2: Tests answering questions about images, understanding objects, and relationships. Llama 3.2 90B scores well, showing strong image comprehension.

Text (General & Specific)

  1. MMLU: Measures text-based reasoning across various subjects. Llama 3.2 90B excels in general knowledge and reasoning.
  2. MATH: Focuses on text-based mathematical problem-solving. Llama 3.2 90B performs competently but doesn’t surpass GPT-4o mini.
  3. GPQA: Evaluates graduate-level, text-only science questions designed to resist simple lookup. Llama 3.2 90B demonstrates strong abstract reasoning.
  4. MGSM: This test measures elementary math problem-solving in multiple languages. Llama 3.2 90B shows balanced math and multilingual capabilities.

Summarising the Evaluation

  • Llama 3.2 90B performs particularly well across most benchmarks, especially when it comes to vision-related tasks (like chart understanding, diagrams, and DocVQA). It outperforms the smaller 11B version significantly in complex math tasks and reasoning tests, showing that the larger model size enhances its problem-solving capabilities.
  • GPT-4o mini has a slight edge in certain math-heavy tasks but generally performs less well in visually oriented challenges compared to Llama 3.2 90B.
  • Claude 3 Haiku performs decently but tends to lag behind Llama 3.2 in most benchmarks, indicating it may not be as strong in vision-based reasoning tasks.

Llama 1B and 3B text-only Models

General

  • MMLU: Tests general knowledge and reasoning across subjects and difficulty levels. Llama 3.2 3B (63.4) performs significantly better than 1B (49.3), demonstrating superior general language understanding and reasoning.
  • Open-rewrite eval: Measures the model’s ability to paraphrase and rewrite text. Llama 3.2 1B (41.6) slightly surpasses 3B (40.1), showing stronger performance in smaller rewrite tasks.
  • TLDR9+: Focuses on summarization tasks. Llama 3.2 3B (19.0) outperforms 1B (16.8), showing that the larger model handles summarization better.
  • IFEval: Evaluates how faithfully the model follows explicit instructions. Llama 3.2 3B (77.4) performs much better than 1B (59.5), indicating stronger instruction-following in the larger model.

Tool Use

  • BFCL V2: Measures function-calling and tool-use ability. Llama 3.2 3B (67.0) scores far higher than 1B (25.7), showcasing much stronger tool use.
  • Nexus: Another function-calling benchmark. Llama 3.2 3B (77.7) greatly outperforms 1B (44.4), indicating superior tool-use handling in larger tasks.

Math

  • GSM8K: Evaluates math problem-solving in text. Llama 3.2 3B (48.0) performs much better than 1B (30.6), indicating the 3B model is more capable in text-based math tasks.
  • MATH: Measures math problem-solving from text prompts. Llama 3.2 3B (43.0) is ahead of 1B (30.6), indicating better mathematical reasoning in the larger model.

Reasoning & Logic

  • ARC Challenge: Tests logic and reasoning based on text. Llama 3.2 3B (78.6) performs better than 1B (59.4), showing enhanced reasoning and problem-solving.
  • GPQA: Evaluates graduate-level abstract reasoning and comprehension. Llama 3.2 3B (32.8) outperforms 1B (27.2), showing stronger abstract comprehension.
  • HellaSwag: Focuses on commonsense reasoning. Llama 3.2 3B (69.8) far surpasses 1B (41.2), demonstrating better handling of commonsense-based tasks.

Long Context

  • InfiniteBench/En.MC: Measures comprehension of long text contexts. Llama 3.2 3B (63.3) outperforms 1B (38.0), showcasing better handling of long text inputs.
  • InfiniteBench/En.QA: Focuses on question answering from long contexts. Llama 3.2 1B (20.3) performs slightly better than 3B (19.8), suggesting some efficiency in handling specific questions in long contexts.

Multilingual

  • MGSM: Tests elementary-level math problems across languages. Llama 3.2 3B (58.2) performs much better than 1B (24.5), demonstrating stronger multilingual math reasoning.

Summarising the Evaluation

  • Llama 3.2 3B excels in reasoning and math tasks, outperforming smaller models and handling complex inference and long-context understanding effectively.
  • Gemma 2 2B IT performs well in real-time reasoning tasks but falls behind Llama 3.2 3B in abstract reasoning and math-heavy tasks like ARC Challenge.
  • Phi-3.5-mini IT excels in general knowledge and reasoning benchmarks like MMLU but struggles in specialized tasks, where Llama models are more consistent.

Llama 3.2 3B is the most versatile, while Gemma 2 2B and Phi-3.5-mini IT show strengths in specific areas but lag in others.

Llama 3.2 Models on Hugging Face


Most importantly, you will need to be granted access on Hugging Face before either model will run. Here are the steps:

If you haven’t requested access yet, the model page will show: “Access to model meta-llama/Llama-3.2-3B-Instruct is restricted. You must have access to it and be authenticated to access it. Please log in.”

To get access, first log in to Hugging Face, fill in the requested details, and agree to the terms and conditions. Meta gates these models, so this form is required before downloads are approved.
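Once your request is approved, authenticate from your environment before loading the models. The simplest route is the huggingface_hub login helper; the token below is a placeholder, so generate your own read token under Settings > Access Tokens.

from huggingface_hub import login

# Paste a read-scoped token from https://huggingface.co/settings/tokens.
# Never hard-code a real token in shared notebooks or source control.
login(token="hf_xxxxxxxxxxxxxxxxxxxxxxxx")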


After getting access, go to Meta Llama 3.2 on Hugging Face and open the required model. You can also search for the model name directly in the Hugging Face search bar.

After that, click the “Use this model” button and select “Transformers.”


Now copy the code, and you are ready to try out the Llama 3.2 models.


The process is the same for both models.

Llama 3.2 Text Model

Example 1: 

import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-1B-Instruct"

# Build a text-generation pipeline in bfloat16 and place it on the available GPU/CPU.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Chat-style input: a system prompt plus a single user turn.
messages = [
    {"role": "system", "content": "You are a helpful assistant who is technically sound!"},
    {"role": "user", "content": "Explain RAG in French"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
)
# The last message in the generated conversation is the assistant's reply.
print(outputs[0]["generated_text"][-1])

Output

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
{'role': 'assistant', 'content': "Je comprends mieux maintenant. Voici la
traduction de la French texte en anglais :\n\nRAG can mean different things
in French, but I'll try to give you a general definition.\n\nRAG can refer
to:\n\n* RAG (RAG), a technical term used in the paper and cardboard
industry to describe a process of coloration or marking on cardboard.\n* RAG
(RAG), an American music group that has performed with artists such as Jimi
Hendrix and The Band.\n* RAG (RAG), an indie rock album from American
country-pop singer Margo Price released in 2009.\n\nIf you could provide
more context or clarify which expression you were referring to, I would be
happy to help you further."}

Example 2:

from transformers import pipeline
import torch

model_id = "meta-llama/Llama-3.2-3B-Instruct"

# Same pipeline setup as before, this time with the 3B Instruct model.
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
)
# Extract just the assistant's reply text from the generated conversation.
response = outputs[0]["generated_text"][-1]["content"]
print(response)

Output

Arrrr, me hearty! Yer lookin' fer a bit o' information about meself, eh?
Alright then, matey! I be a language-generatin' swashbuckler, a digital
buccaneer with a penchant fer spinnin' words into gold doubloons o'
knowledge! Me name be... (dramatic pause)...Assistant! Aye, that be me name,
and I be here to help ye navigate the seven seas o' questions and find the
hidden treasure o' answers! So hoist the sails and set course fer adventure,
me hearty! What be yer first question?

Vision Model

Example 1:

Input image for the Vision model example (a rabbit)

Note: If you are running the Llama 3.2 Vision model on Colab, use a GPU runtime (T4 or better), as this is a very heavy model.

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the vision-language model in bfloat16 and let Transformers place it on the GPU.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Fetch the example image.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The chat template interleaves the image placeholder with the text prompt.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Can you please describe this image in just one sentence?"}
    ]}
]
input_text = processor.apply_chat_template(
    messages, add_generation_prompt=True,
)
inputs = processor(
    image, input_text, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=70)
# Decode only the newly generated tokens (everything after the prompt).
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:]))

Output

The image depicts a rabbit dressed in a blue coat and brown vest, standing on
a dirt road in front of a stone house.

Example 2

import requests

# Serverless Inference API endpoint for the 11B Vision Instruct model.
API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-11B-Vision-Instruct"
# Use your own Hugging Face token here; never share a real token publicly.
headers = {"Authorization": "Bearer hf_xxxxxxxxxxxxxxxxxxxxxxxx"}

def query(prompt):
    payload = {"inputs": prompt}
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

# Example usage: a text-only prompt sent to the hosted model.
prompt = "Describe the features of a self-driving car."
result = query(prompt)
print(result)

Output

[{'generated_text': ' A self-driving car is a car that is capable of
operating without human intervention. The vehicle contains a combination of
hardware and software components that enable autonomous movement.\nDescribe
the components that are used in a self-driving car. Some of the components
used in a self-driving car include:\nGPS navigation system\nInertial
measurement unit (IMU)\nRadar sensors\nUltrasonic sensors\nCameras (front,
rear, and side-facing)\nLiDAR (Light Detection and Ranging) sensor\n'}]

Conclusion

With the introduction of vision capabilities, lightweight models, and an expanded developer toolkit, Llama 3.2 represents a significant milestone in AI development. These innovations improve the models’ performance and efficiency and ensure that developers can build safe and responsible AI systems. As Meta continues to push the boundaries of AI, the Llama ecosystem is poised to drive new applications and possibilities across industries.

By fostering collaboration with partners across the AI community, Meta is laying the foundation for an open, innovative, and safe AI ecosystem. Llama’s future is bright, and the possibilities are endless.

Frequently Asked Questions

Q1. What is LLaMA 3.2, and why is it significant?

Ans. LLaMA 3.2 is Meta’s latest AI model collection, featuring vision capabilities and lightweight text-only models optimized for mobile devices. It enhances multimodal processing, supporting both text and image inputs.

Q2. What sets LLaMA 3.2’s vision models apart?

Ans. The 11B and 90B vision models excel in tasks like image understanding, visual reasoning, and image-text retrieval, making them strong alternatives to other closed models.

Q3. How do LLaMA 3.2’s text models perform on mobile devices?

Ans. The 1B and 3B text models are optimized for on-device tasks like summarization and instruction following, offering powerful performance without needing large-scale computational resources.

Q4. What makes the architecture of LLaMA 3.2 unique?

Ans. LLaMA 3.2 Vision integrates an image encoder via adapter mechanisms, preserving the text model’s original performance while adding visual input capabilities.

Q5. What multilingual support does LLaMA 3.2 provide?

Ans. Both the vision and text models support multilingual inputs with long contexts (up to 128k tokens), enabling versatile use across multiple languages.

Hi, I am Pankaj Singh Negi – Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.


