Introduction
In this article, we look into NVLM 1.0, the recently released family of multimodal large language models from NVIDIA. These models achieve state-of-the-art results on vision-language tasks, rivalling leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). NVLM 1.0 even shows improved text-only performance over its LLM backbone after multimodal training. NVLM is open: the model weights and code are available to the community.
NVIDIA conducts a thorough model-design comparison between cross-attention-based models (e.g., Flamingo) and decoder-only multimodal LLMs (e.g., LLaVA). Based on the merits and shortcomings of both approaches, they propose a novel architecture that boosts both training efficiency and multimodal reasoning capability.
Overview
- NVIDIA’s NVLM 1.0 is an open-source multimodal LLM family that excels in vision-language and text-only tasks.
- NVLM 1.0 offers three architectures: decoder-only (NVLM-D), cross-attention (NVLM-X), and a hybrid model (NVLM-H).
- The models demonstrate superior performance in tasks like OCR, multimodal reasoning, and high-resolution image processing.
- NVLM 1.0 maintains strong text-only performance, overcoming typical multimodal training issues seen in other models.
- NVIDIA emphasizes data quality and diversity in both pretraining and supervised fine-tuning for optimal model outcomes.
- NVLM 1.0 is open-source, with model weights and code accessible to the community for further research and development.
Qualitative Examples of NVLM-1.0-D 72B
Illustration of the strong scene-understanding capabilities of the NVLM-1.0-D 72B model: it has the common sense to identify potential risks or mishaps and accurately recommends what should be done right away.
Additional illustrations of the NVLM-1.0-D 72B model's capacity to comprehend memes, a difficult undertaking that requires a sense of humour and familiarity with important societal trends, context, or events.
Comparison of NVLM with Other LLMs
NVIDIA compares popular open-access and proprietary multimodal LLMs with NVLM 1.0. Note that the model weights for Llama 3-V had not been released as of the time of the report. The results show that NVLM 1.0 performs comparably to the top models on both vision-language and text-only tasks. In addition, each multimodal LLM is compared against its backbone LLM on text-only tasks.
After multimodal training, InternVL2-Llama3-76B’s text performance drastically declines. Llama 3-V 70B and 405B exhibit no degradation in text-only tasks because multimodal training freezes their LLM backbones. However, the NVLM-1.0-D 72B model shows notable improvements over its text backbone on text-only math and coding benchmarks, with average accuracy rising by 4.3 points following multimodal training.
Limitations of other Multimodal LLMs
The field has considerably advanced the capabilities of open-access multimodal LLMs. Prominent families of open models include LLaVA, Llama 3-V, InternVL, and BLIP. The two most popular architectures for building these multimodal LLMs are the cross-attention-based architecture (e.g., Flamingo and Llama 3-V), which handles image tokens through LLM cross-attention layers, and the decoder-only architecture (e.g., LLaVA and InternVL), which processes image tokens inside the LLM self-attention layers.
- Inconsistent architecture comparisons: Unlike text-based LLMs, multimodal LLM architectures (e.g., decoder-only vs. cross-attention models) haven’t been compared uniformly, due to differences in model backbones, vision encoders, and training data. This makes direct comparisons challenging. For instance, the open-access IDEFICS-80B (based on LLaMA-65B) is considered inferior to LLaVA-1.5-13B (based on Vicuna-13B) in visual question-answering tasks.
- Handling high-resolution image input: While models that use dynamic high-resolution images perform well on OCR tasks, they sometimes show reduced accuracy in reasoning tasks compared to low-resolution models.
- Degradation in text-only performance: Open-access multimodal LLMs show strong performance on vision-language tasks but suffer in text-only tasks, unlike proprietary models like GPT-4. Llama 3-V addresses this by freezing LLM parameters, but these models are not yet publicly available.
Addressing These Limitations
To address these limitations, NVIDIA introduced the NVLM 1.0 family, a family of multimodal LLMs:
- NVLM-D: A decoder-only architecture
- NVLM-X: A cross-attention-based architecture
- NVLM-H: A novel Hybrid architecture
All three models are trained on the same curated data blend. The architectures achieve state-of-the-art performance while offering practitioners flexible and feature-rich model options.
- Model architecture: A comparison between decoder-only and cross-attention models shows that cross-attention-based NVLM-X is more computationally efficient with high-resolution images, while the decoder-only NVLM-D performs better in OCR tasks and reasoning. Based on these insights, a hybrid model, NVLM-H, is proposed, which balances efficiency and reasoning ability.
- High-resolution image processing: A new tile-tagging design is introduced for handling high-resolution images, improving performance on OCR tasks and multimodal reasoning. Ablation studies reveal that adding text-based tags to image tokens enhances accuracy (see the sketch after this list).
- Training data: The study emphasizes the importance of data quality and diversity over scale in multimodal pretraining and supervised fine-tuning (SFT). Abundant, diverse pretraining data benefits both cross-attention and decoder-only models. Compared to previous works, the team curated a larger, task-oriented dataset for SFT.
- Production-grade multimodality: To ensure the NVLM models excel in both vision-language and text-only tasks, two strategies are employed: freezing LLM parameters in cross-attention models to maintain text performance, and integrating a high-quality text dataset into multimodal fine-tuning. This approach not only preserves text-only performance but also improves capabilities in math and coding tasks.
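To make the tile-tagging idea more concrete, here is a minimal, hypothetical sketch of how text tags could be interleaved with per-tile image tokens before they reach the LLM. The tag names (<tile_global_thumbnail>, <tile_1>, ...) follow the paper's description, but the exact tokens and wiring are defined in NVIDIA's released code.

# Hypothetical illustration of tile tagging, not the released implementation.
# Each tile's block of image tokens is preceded by a text tag identifying it,
# and the low-resolution global thumbnail gets its own tag.
def tag_tiles(thumbnail_tokens, tile_token_blocks):
    sequence = ['<tile_global_thumbnail>'] + list(thumbnail_tokens)
    for k, tile_tokens in enumerate(tile_token_blocks, start=1):
        sequence.append(f'<tile_{k}>')
        sequence.extend(tile_tokens)
    return sequence

# Toy example with placeholder "image tokens": one thumbnail plus two local tiles.
print(tag_tiles(['<img>'] * 2, [['<img>'] * 2, ['<img>'] * 2]))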
NVLM: Models and Training Methods
- Decoder-only (NVLM-D): This model handles multimodal inputs by processing image tokens directly within the language model’s self-attention layers, making it well-suited for unified multimodal reasoning tasks such as OCR and document understanding.
- Cross-attention-based (NVLM-X): It processes image tokens through cross-attention layers, which makes it computationally efficient, especially when dealing with high-resolution images. This model excels in handling image-heavy tasks and offers higher throughput during training compared to decoder-only models.
- Hybrid (NVLM-H): This model combines the advantages of both NVLM-D and NVLM-X by processing thumbnail images and text tokens jointly in the LLM's self-attention layers, while finer image details are handled through cross-attention. It improves both computational efficiency and reasoning capability for multimodal tasks (the sketch below illustrates how the three variants route image features).
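A highly simplified, hypothetical sketch of the routing difference between the three variants (stand-in functions only, not NVIDIA's implementation):

# Pseudocode-level stand-ins that just report what each variant would attend to.
def llm_self_attention(tokens):
    return f'self-attention over {len(tokens)} tokens'

def llm_with_cross_attention(tokens, image_features):
    return (f'self-attention over {len(tokens)} tokens, '
            f'cross-attention to {len(image_features)} image features')

text, thumbnail, tiles = ['t'] * 16, ['thumb'] * 256, ['tile'] * 1024

# NVLM-D: all image tokens join the text sequence in self-attention.
print(llm_self_attention(text + thumbnail + tiles))

# NVLM-X: text stays in self-attention; image features enter via cross-attention.
print(llm_with_cross_attention(text, image_features=thumbnail + tiles))

# NVLM-H: the thumbnail joins the text in self-attention, while the
# high-resolution tiles are consumed through cross-attention.
print(llm_with_cross_attention(text + thumbnail, image_features=tiles))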
All models share a vision encoder (InternViT-6B) and employ a dynamic high-resolution (DHR) approach, which divides high-resolution images into smaller tiles for processing. The models handle different tasks through a variety of text-based tags and modality-alignment modules. The training method is split into two phases:
- Pretraining, where the vision encoder and LLM are frozen.
- Supervised fine-tuning (SFT), which trains both the LLM and the modality-alignment modules (sketched below).
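As a rough illustration of this two-stage recipe, here is a minimal PyTorch-style sketch. The attribute names (vision_encoder, projector, llm) are stand-ins; the actual module names live in NVIDIA's code.

import torch.nn as nn

# Toy stand-in modules; the real NVLM components are vastly larger.
class ToyNVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)  # stands in for InternViT-6B
        self.projector = nn.Linear(8, 8)       # modality-alignment module
        self.llm = nn.Linear(8, 8)             # stands in for the LLM backbone

def set_stage(model, stage):
    # The vision encoder is frozen during pretraining (kept frozen here in SFT too).
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    # The modality-alignment module is trained in both stages.
    for p in model.projector.parameters():
        p.requires_grad = True
    # The LLM backbone is unfrozen only for supervised fine-tuning.
    for p in model.llm.parameters():
        p.requires_grad = (stage == 'sft')

model = ToyNVLM()
set_stage(model, 'pretrain')  # stage 1: only the projector is trainable
set_stage(model, 'sft')       # stage 2: the LLM is trained as well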
NVLM-1.0 offers three architectural options: the cross-attention-based NVLM-X (top), the hybrid NVLM-H (middle), and the decoder-only NVLM-D (bottom). The dynamic high-resolution vision pathway is shared by all three models. However, different architectures process the image features from thumbnails and regular local tiles in distinct ways.
Training Data
The authors provide a detailed breakdown of the curated datasets used for both pretraining and SFT.
- Pretraining datasets include captioning, visual question answering (VQA), document understanding, and OCR-related data. The study emphasizes the importance of data quality and diversity over sheer scale, noting that noisy datasets hinder the model’s ability to learn effectively.
- The multimodal pretraining datasets cover a wide range of tasks, from image captioning (COCO, LAION-115M) to document OCR (OCR-VQA, ReCTs) and math reasoning in visual contexts (CLEVR-Math). A notable finding is that diverse task-oriented datasets, such as VQA and OCR, significantly enhance cross-modal alignment and improve final results.
- During SFT, the model is fine-tuned on a high-quality blend of multimodal datasets to enhance vision-language understanding. The SFT stage incorporates datasets like TextVQA, ChartQA, DocVQA, and AI2D. Text-only fine-tuning datasets are also used to prevent degradation of text-only performance. A special effort is made to ensure that the fine-tuning data includes math and coding tasks, helping the model to improve performance in these areas.
Results
The NVLM-1.0 family is evaluated across multiple benchmarks, demonstrating competitive or superior performance compared to other leading multimodal and text-only models, both proprietary (e.g., GPT-4o, Claude 3.5) and open-access (e.g., LLaVA, InternVL). Key findings include:
- NVLM-D outperformed all open-access models on benchmarks such as OCRBench and VQAv2, highlighting its strength in vision-language tasks like scene-text reading and document understanding.
- NVLM-H showed the highest scores on multimodal reasoning tasks (e.g., MMMU, MathVista) and demonstrated superior computational efficiency. This hybrid model combines the strengths of both decoder-only and cross-attention approaches, achieving state-of-the-art results on vision-language tasks without sacrificing efficiency.
- NVLM-X demonstrated best-in-class performance among cross-attention-based models, particularly for tasks involving high-resolution images, and had the advantage of faster training and inference speeds.
NVLM models maintained or improved their performance on text-only tasks (math, coding, and general-knowledge benchmarks such as GSM8K, MATH, HumanEval, and MMLU) after multimodal training, which is a significant achievement, as other multimodal models typically experience degradation in these areas.
Accessing NVLM-D 72B
We can access the model from the Hugging Face Hub using the transformers library. Below is the code to run inference with the NVLM-D 72B model, taken straight from the model documentation. Note that this is a 150+ GB model.
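Because the checkpoint weighs in at well over 150 GB in bfloat16, it is worth confirming that enough GPU memory is available before loading. A quick sanity check in plain PyTorch (not part of the official example):

import torch

# List the available GPUs and their total memory; serving the full model
# requires several high-memory GPUs, since the weights alone exceed 150 GB.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f'GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB')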
1. Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModel
import math
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
2. Model Sharding
The split_model() function builds a device map that distributes the model's layers across the available GPUs.
def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    # Spread the transformer layers across the GPUs.
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Keep the vision encoder, projector, embeddings, norm, and output head on GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    # The final transformer layer stays on GPU 0 alongside the output head.
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map
This distribution ensures efficient use of multiple GPUs when handling such a large model: the first GPU also hosts the vision encoder, the projector (mlp1), and the embedding and output layers, so it is assigned roughly half as many transformer layers as the others.
3. Image Preprocessing
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform
4. Dynamic Image Tiling
These functions split a high-resolution image into smaller tiles based on its aspect ratio: find_closest_aspect_ratio() selects the tiling grid whose aspect ratio best matches the image, and dynamic_preprocess() resizes the image to that grid and crops out the individual tiles (optionally adding a global thumbnail).
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
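As a quick check of how the tiling behaves (a synthetic example worked out by hand, not taken from the official docs): an 800×400 image has an aspect ratio of 2.0, which matches the (2, 1) grid, so it is resized to 896×448 and split into two 448×448 tiles, plus a global thumbnail. Larger images with the same shape may be assigned a finer grid, since the area check upgrades the grid when enough pixels are available.

# Synthetic example: 2 local tiles + 1 global thumbnail = 3 images.
demo = Image.new('RGB', (800, 400))
tiles = dynamic_preprocess(demo, image_size=448, use_thumbnail=True, max_num=12)
print(len(tiles))      # 3
print(tiles[0].size)   # (448, 448)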
5. Loading and Preprocessing Images
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
6. Loading and Using the Model
path = "nvidia/NVLM-D-72B"
device_map = split_model()
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()
print(model)
7. Text and Image Conversations
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
generation_config = dict(max_new_tokens=1024, do_sample=False)
# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# single-image single-round conversation
pixel_values = load_image('path/to/your/example/image.jpg', max_num=6).to(torch.bfloat16)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')
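Since model.chat() returns the conversation history when return_history=True is passed (as in the pure-text example above), a follow-up question about the same image can reuse that history. A small sketch along those lines, reusing pixel_values from the previous call:

# single-image multi-round conversation (sketch reusing the API shown above)
question = '<image>\nWhat objects are visible in the image?'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What could be the risks in this scene?'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')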
Conclusion
We can highlight that the NVLM-1.0 family achieves state-of-the-art results across a wide range of vision-language and text-only tasks, maintaining production-grade multimodality. This means the models perform well in both multimodal and text-only settings, without significant degradation in text-only performance—a common issue in many other multimodal models. The authors also emphasize the importance of high-quality training data and diverse task-oriented datasets for boosting model performance.
The NVLM-1.0 family demonstrates that it is possible to create multimodal LLMs that excel in a wide variety of tasks, including reasoning, coding, and math. In their commitment to furthering research, the team has released the model weights and plans to open-source the code, inviting the community to build upon their work.
Frequently Asked Questions
Q1. What is NVLM 1.0?
Ans. NVLM 1.0 is a family of open-source, multimodal large language models by NVIDIA. It excels in both vision-language tasks and text-only tasks, rivaling leading proprietary and open-access models.
Q2. What model architectures does NVLM 1.0 include?
Ans. NVLM 1.0 includes three model architectures:
– NVLM-D: A decoder-only model for unified multimodal reasoning tasks like OCR and document understanding.
– NVLM-X: A cross-attention-based model for efficient high-resolution image processing.
– NVLM-H: A hybrid model that balances efficiency and reasoning by combining elements of both NVLM-D and NVLM-X.
Q3. How is NVLM 1.0 trained?
Ans. NVLM 1.0 is trained in two phases:
Pretraining: The vision encoder and LLM are frozen, and only modality-alignment layers are trained.
Supervised Fine-Tuning (SFT): Both the LLM and modality-alignment layers are fine-tuned on a curated set of multimodal tasks, ensuring strong performance on vision-language and text-only tasks.
Q4. What data is NVLM 1.0 trained on?
Ans. NVLM 1.0 uses high-quality, diverse datasets for pretraining and fine-tuning, including COCO, OCR-VQA, ChartQA, DocVQA, and CLEVR-Math. Special attention is given to maintaining data quality and diversity.