Introduction

We are going to look at NVLM 1.0, the recently released family of multimodal large language models from NVIDIA. These models achieve state-of-the-art results on vision-language tasks, rivalling leading proprietary and open-access models (e.g., Llama 3-V 405B and InternVL 2). NVLM 1.0 even shows improved text-only performance over its LLM backbone after multimodal training. NVLM is open: the model weights and code are available to the community.

NVIDIA conducts a thorough design comparison between cross-attention-based models (e.g., Flamingo) and decoder-only multimodal LLMs (e.g., LLaVA). Based on the merits and shortcomings of both approaches, they propose a novel architecture that improves both training efficiency and multimodal reasoning ability.

NVIDIA’s Approach to Multimodal LLMs

Overview

  • NVIDIA’s NVLM 1.0 is an open-source multimodal LLM family that excels in vision-language and text-only tasks.
  • NVLM 1.0 offers three architectures: decoder-only (NVLM-D), cross-attention (NVLM-X), and a hybrid model (NVLM-H).
  • The models demonstrate superior performance in tasks like OCR, multimodal reasoning, and high-resolution image processing.
  • NVLM 1.0 maintains strong text-only performance, overcoming typical multimodal training issues seen in other models.
  • NVIDIA emphasizes data quality and diversity in both pretraining and supervised fine-tuning for optimal model outcomes.
  • NVLM 1.0 is open-source, with model weights and code accessible to the community for further research and development.

Qualitative Examples of NVLM-1.0-D 72B

The NVLM-1.0-D 72B model shows powerful scene-understanding capabilities: it has the common sense to identify potential risks or mishaps in a scene and accurately recommends what should be done right away.

The model can also comprehend memes, a difficult task that requires a sense of humour and familiarity with significant societal trends, context, or events.

Comparison of NVLM with Other LLMs

NVLM 1.0 is compared against popular open-access and proprietary multimodal LLMs. Note that the model weights for Llama 3-V had not been released as of the time of the report. The results show that NVLM 1.0 performs comparably to the top models on both vision-language and text-only tasks. In addition, each multimodal LLM is compared against its backbone LLM on text-only tasks.

After multimodal training, InternVL2-Llama3-76B’s text performance drastically declines. Llama 3-V 70B and 405B exhibit no degradation in text-only tasks because multimodal training freezes their LLM backbones. However, the NVLM-1.0-D 72B model shows notable improvements over its text backbone on text-only math and coding benchmarks, with average accuracy rising by 4.3 points following multimodal training.


Limitations of Other Multimodal LLMs

The field has advanced the capabilities of open-access multimodal LLMs to a considerable degree. Prominent families of open models include LLaVA, Llama 3-V, InternVL, and BLIP. The two most popular architectures for building these multimodal LLMs are the cross-attention-based architecture (e.g., Flamingo and Llama 3-V), which handles image tokens through cross-attention layers in the LLM, and the decoder-only architecture (e.g., LLaVA and InternVL), which processes image tokens inside the LLM's self-attention layers.

  • Inconsistent architecture comparisons: Unlike text-based LLMs, multimodal LLM architectures (e.g., decoder-only vs. cross-attention models) haven’t been compared uniformly, due to differences in model backbones, vision encoders, and training data. This makes direct comparisons challenging. For instance, the open-access IDEFICS-80B (based on LLaMA-65B) is considered inferior to LLaVA-1.5-13B (based on Vicuna-13B) in visual question-answering tasks.
  • Handling high-resolution image input: While models that use dynamic high-resolution images perform well on OCR tasks, they sometimes show reduced accuracy in reasoning tasks compared to low-resolution models.
  • Degradation in text-only performance: Open-access multimodal LLMs show strong performance on vision-language tasks but suffer in text-only tasks, unlike proprietary models like GPT-4. Llama 3-V addresses this by freezing LLM parameters, but these models are not yet publicly available.

Addressing These Limitations

To address these limitations, NVIDIA introduced the NVLM 1.0 family of multimodal LLMs:

  1. NVLM-D: A decoder-only architecture
  2. NVLM-X: A cross-attention-based architecture
  3. NVLM-H: A novel hybrid architecture

All three models are trained on the same curated data blend. The architectures achieve state-of-the-art performance while offering practitioners flexible and feature-rich model options.

  • Model architecture: A comparison between decoder-only and cross-attention models shows that cross-attention-based NVLM-X is more computationally efficient with high-resolution images, while the decoder-only NVLM-D performs better in OCR tasks and reasoning. Based on these insights, a hybrid model, NVLM-H, is proposed, which balances efficiency and reasoning ability.
  • High-resolution image processing: A new tile-tagging design is introduced for handling high-resolution images, improving OCR tasks and multimodal reasoning performance. Ablation studies reveal that adding text-based tags to image tokens enhances accuracy (a sketch of this tagging scheme follows the list).
  • Training data: The study emphasizes the importance of data quality and diversity over scale in multimodal pretraining and supervised fine-tuning (SFT). Abundant, diverse pretraining data benefits both cross-attention and decoder-only models. Compared to previous works, the team curated a larger, task-oriented dataset for SFT.
  • Production-grade multimodality: To ensure the NVLM models excel in both vision-language and text-only tasks, two strategies are employed: freezing LLM parameters in cross-attention models to maintain text performance, and integrating a high-quality text dataset into multimodal fine-tuning. This approach not only preserves text-only performance but also improves capabilities in math and coding tasks.
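To make the tile-tagging idea concrete, here is a minimal sketch of how text tags could be interleaved with tile features in the input sequence. The tag strings and the embed_text / project_tile helpers are illustrative assumptions, not the released implementation.

def build_tagged_sequence(thumbnail_feats, tile_feats, embed_text, project_tile):
    # thumbnail_feats: features of the low-resolution global thumbnail
    # tile_feats: list of features, one per high-resolution local tile
    # embed_text: stand-in callable mapping a tag string to token embeddings
    # project_tile: stand-in callable mapping vision features to the LLM embedding space
    sequence = [embed_text('<tile_global_thumbnail>'), project_tile(thumbnail_feats)]
    for k, feats in enumerate(tile_feats, start=1):
        sequence.append(embed_text(f'<tile_{k}>'))  # the text tag tells the LLM which tile follows
        sequence.append(project_tile(feats))
    return sequence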


NVLM: Models and Training Methods

  • Decoder-only (NVLM-D): This model handles multimodal inputs by processing image tokens directly within the language model’s self-attention layers, making it well-suited for unified multimodal reasoning tasks such as OCR and document understanding.
  • Cross-attention-based (NVLM-X): It processes image tokens through cross-attention layers, which makes it computationally efficient, especially when dealing with high-resolution images. This model excels in handling image-heavy tasks and offers higher throughput during training compared to decoder-only models.
  • Hybrid (NVLM-H): This model combines the advantages of both NVLM-D and NVLM-X by processing thumbnail images and text tokens jointly in the LLM's self-attention layers, while finer image details are handled through cross-attention. It improves both computational efficiency and reasoning capability for multimodal tasks. A minimal sketch of how the three variants route image features follows this list.
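The sketch below is a loose illustration, not NVLM code: llm, llm_with_xattn, and projector are stand-in callables, and the vision_kv argument is a hypothetical hook for cross-attention keys/values.

import torch

def decoder_only_step(llm, projector, text_emb, tile_feats):
    # NVLM-D: project tile features into the LLM embedding space and let
    # self-attention see image and text tokens in a single sequence.
    return llm(torch.cat([projector(tile_feats), text_emb], dim=1))

def cross_attention_step(llm_with_xattn, text_emb, tile_feats):
    # NVLM-X: text tokens flow through self-attention as usual, while image
    # features are consumed only by gated cross-attention layers
    # (vision_kv is a hypothetical keyword for their keys/values).
    return llm_with_xattn(text_emb, vision_kv=tile_feats)

def hybrid_step(llm_with_xattn, projector, text_emb, thumbnail_feats, tile_feats):
    # NVLM-H: the global thumbnail is decoded jointly with the text in
    # self-attention, while high-resolution tiles enter via cross-attention.
    joint = torch.cat([projector(thumbnail_feats), text_emb], dim=1)
    return llm_with_xattn(joint, vision_kv=tile_feats)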

All models share a vision encoder (InternViT-6B) and employ a dynamic high-resolution (DHR) approach, which divides high-resolution images into smaller tiles for processing. The models handle different tasks through a variety of text-based tags and modality-alignment modules. The training method is split into two phases:

  • Pretraining, where the vision encoder and LLM are frozen.
  • Supervised fine-tuning (SFT), which trains both the LLM and the modality-alignment modules (a rough sketch of the two phases follows this list).
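The sketch below illustrates this two-phase recipe under stated assumptions: the modules are tiny stand-ins for InternViT-6B, the alignment projector, and the LLM backbone, and the learning rates are placeholders, not NVIDIA's configuration.

import torch
import torch.nn as nn

# Stand-in modules; the real ones are InternViT-6B, an alignment projector, and a 72B LLM.
vision_encoder = nn.Linear(1024, 1024)
projector = nn.Linear(1024, 4096)
llm = nn.Linear(4096, 4096)

# Phase 1 - multimodal pretraining: freeze the vision encoder and the LLM;
# only the modality-alignment module is trained.
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False
pretrain_optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)  # lr is illustrative

# Phase 2 - supervised fine-tuning: unfreeze the LLM and train it together with
# the alignment module; the vision encoder stays frozen.
for p in llm.parameters():
    p.requires_grad = True
sft_optimizer = torch.optim.AdamW(
    list(llm.parameters()) + list(projector.parameters()), lr=2e-5)  # lr is illustrative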

NVLM-1.0 offers three architectural options: the cross-attention-based NVLM-X, the hybrid NVLM-H, and the decoder-only NVLM-D. The dynamic high-resolution vision pathway is shared by all three models; however, each architecture processes the image features from the thumbnail and the regular local tiles in a distinct way.

Training Data

The authors provide a detailed breakdown of the curated datasets used for both pretraining and SFT.

  • Pretraining datasets include captioning, visual question answering (VQA), document understanding, and OCR-related data. The study emphasizes the importance of data quality and diversity over sheer scale, noting that noisy datasets hinder the model’s ability to learn effectively.
  • The multimodal pretraining datasets cover a wide range of tasks, from image captioning (COCO, LAION-115M) to document OCR (OCR-VQA, ReCTs) and math reasoning in visual contexts (CLEVR-Math). A notable finding is that diverse task-oriented datasets, such as VQA and OCR, significantly enhance cross-modal alignment and improve final results.
  • During SFT, the model is fine-tuned on a high-quality blend of multimodal datasets to enhance vision-language understanding. The SFT stage incorporates datasets like TextVQA, ChartQA, DocVQA, and AI2D. Text-only fine-tuning datasets are also used to prevent degradation of text-only performance. A special effort is made to ensure that the fine-tuning data includes math and coding tasks, helping the model to improve performance in these areas.


Results

The NVLM-1.0 family is evaluated across multiple benchmarks, demonstrating competitive or superior performance compared to other leading multimodal and text-only models, both proprietary (e.g., GPT-4o, Claude 3.5) and open-access (e.g., LLaVA, InternVL). Key findings include:

  • NVLM-D outperformed all open-access models on benchmarks such as OCRBench and VQAv2, highlighting its strength in vision-language tasks like scene-text reading and document understanding.
  • NVLM-H showed the highest scores on multimodal reasoning tasks (e.g., MMMU, MathVista) and demonstrated superior computational efficiency. This hybrid model combines the strengths of both decoder-only and cross-attention approaches, achieving state-of-the-art results on vision-language tasks without sacrificing efficiency.
  • NVLM-X demonstrated best-in-class performance among cross-attention-based models, particularly for tasks involving high-resolution images, and had the advantage of faster training and inference speeds.

NVLM models maintained or improved their performance on text-only tasks (like coding and math benchmarks such as MMLU, GSM8K, MATH, and HumanEval) after multimodal training, which is a significant achievement, as other multimodal models typically experience degradation in these areas.

Accessing NVLM-D 72B

We can access the model from the Hugging Face Hub using the transformers library. Below is the code to run inference with NVLM-D 72B, taken straight from the model card. Note that this is a 150+ GB model, so it must be sharded across multiple GPUs.

1. Import necessary libraries

import torch
from transformers import AutoTokenizer, AutoModel
import math
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

2. Model Sharding

The split_model() function builds a device map that distributes the model's 80 transformer layers across all available GPUs, counting the first GPU as only half a GPU because it also hosts the vision encoder and the embedding/output modules.

def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    # Assign the transformer layers to GPUs in contiguous chunks.
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    # Pin the vision encoder, projector, embeddings, output head, and the final layer to GPU 0.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map

This distribution ensures efficient use of multiple GPUs to handle large models.
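As a worked example, assume an 8-GPU node: ceil(80 / 7.5) = 11 layers are assigned per GPU, but GPU 0 keeps only ceil(11 * 0.5) = 6 of the early layers because it also hosts the vision encoder, the embeddings, the output head, and the final transformer layer.

# Illustrative inspection of the resulting map (requires visible CUDA devices).
device_map = split_model()
gpu0_modules = sorted(name for name, dev in device_map.items() if dev == 0)
print(gpu0_modules)  # vision_model, mlp1, embeddings, output head, and a small share of layers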

3. Image Preprocessing

The build_transform() function converts each tile to RGB, resizes it to the model's input size, and normalizes it with ImageNet statistics.

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

4. Dynamic Image Tiling

These functions split an image into smaller tiles based on its aspect ratio: find_closest_aspect_ratio() picks the best tiling grid for the image, and dynamic_preprocess() crops the resized image into that many tiles, optionally appending a global thumbnail.

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
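As a quick sanity check, the snippet below shows what the tiling produces for a hypothetical 1920x1080 input with the default 448-pixel tile size (the blank demo image is purely illustrative):

# A 16:9 image maps to the closest allowed grid, 4x2 (ratio 2.0),
# giving 8 local tiles plus 1 global thumbnail.
demo = Image.new('RGB', (1920, 1080))
tiles = dynamic_preprocess(demo, image_size=448, use_thumbnail=True, max_num=12)
print(len(tiles))  # 9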

5. Loading and Preprocessing Images

The load_image() function ties the previous steps together: it tiles the image, applies the transform to each tile, and stacks the results into a single tensor.

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

6. Loading and Using the Model

path = "nvidia/NVLM-D-72B"

device_map = split_model()

model = AutoModel.from_pretrained(

   path,

   torch_dtype=torch.bfloat16,

   low_cpu_mem_usage=True,

   use_flash_attn=False,

   trust_remote_code=True,

   device_map=device_map).eval()

print(model)

7. Text and Image Conversations

Finally, we create the tokenizer and run both a pure-text turn and a single-image turn with model.chat(). The image path below is a placeholder.

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
generation_config = dict(max_new_tokens=1024, do_sample=False)

# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation
pixel_values = load_image('path/to/your/example/image.jpg', max_num=6).to(torch.bfloat16)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')
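The chat() helper also returns the conversation history, which can be passed back in for multi-round exchanges. The follow-up below is illustrative and assumes image conversations accept history the same way the pure-text example above does.

# multi-round conversation about the same image (illustrative follow-up)
question = '<image>\nPlease describe the image shortly.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
follow_up = 'What stands out the most in this image?'
response, history = model.chat(tokenizer, pixel_values, follow_up, generation_config,
                               history=history, return_history=True)
print(f'User: {follow_up}\nAssistant: {response}')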

Conclusion

We can highlight that the NVLM-1.0 family achieves state-of-the-art results across a wide range of vision-language and text-only tasks, maintaining production-grade multimodality. This means the models perform well in both multimodal and text-only settings, without significant degradation in text-only performance—a common issue in many other multimodal models. The authors also emphasize the importance of high-quality training data and diverse task-oriented datasets for boosting model performance.

The NVLM-1.0 family demonstrates that it is possible to create multimodal LLMs that excel in a wide variety of tasks, including reasoning, coding, and math. In their commitment to furthering research, the team plans to release the model weights and open-source the code, inviting the community to build upon their work.
