How Does OpenAI’s Latest Model Stack Up?

Introduction

OpenAI launched GPT-4o mini yesterday (18th July 2024), taking the world by storm. There are several reasons for this. OpenAI has traditionally focused on large language models (LLMs), which take a lot of computing power and are costly to use. However, with this release, they are officially venturing into small language model (SLM) territory and competing against models like Llama 3, Gemma 2, and Mistral. While many official benchmark results and performance comparisons have been released, I decided to put this model to the test against its two predecessors, GPT-3.5 Turbo and OpenAI’s newest flagship model, GPT-4o, in a series of diverse tasks. So, let’s dive in and see more details about GPT-4o mini and its performance.

Overview

OpenAI launches GPT-4o mini, a small language model (SLM), competing with models like Llama 3 and Mistral.
GPT-4o mini offers low cost, low latency, and near-real-time responses with a large 128K token context window.
The model supports text and image inputs with future plans for audio and video support.
GPT-4o mini excels in reasoning, math, and coding benchmarks, outperforming GPT-3.5 Turbo and competitors such as Gemini Flash and Claude Haiku.
It is available in OpenAI’s API services at competitive pricing, making advanced AI more accessible.

Unboxing GPT-4o mini and its features

This section covers the details of OpenAI’s new GPT-4o mini model. According to their recent announcement, the model was released with a focus on making access to intelligent models more affordable. It has low cost (more on this shortly) and low latency. It enables users to build Generative AI applications faster: it can process large volumes of text thanks to its large context window, give near-real-time responses, and handle multiple API calls in parallel.

GPT-4o mini, just like its predecessor, GPT-4o, is a multimodal model designed to support text, images, audio, and video. Unfortunately, right now it only supports text and image inputs, with the other input options to be released sometime in the future. The model has been trained on data up to October 2023 and has a massive input context window of 128K tokens and an output limit of 16K tokens per request. It shares the same tokenizer as GPT-4o and hence has improved responses for prompts in non-English languages.

GPT-4o mini performance comparisons

OpenAI has extensively tested GPT-4o mini’s performance across a variety of standard benchmark datasets focusing on diverse tasks, comparing it with several other large language models (LLMs), including Gemini, Claude, and its predecessors, GPT-3.5 and GPT-4o.

OpenAI claims that GPT-4o mini performs significantly better than GPT-3.5 Turbo and other models in textual intelligence, multimodal reasoning, math, and coding proficiency benchmarks. As you can see in the above-mentioned visualization, GPT-4o mini has been evaluated across several key benchmarks, including:

Reasoning: GPT-4o mini is better at reasoning tasks involving both text and vision, scoring 82.0% on the Massive Multitask Language Understanding (MMLU) dataset, a textual intelligence and reasoning benchmark, compared to 77.9% for Gemini Flash and 73.8% for Claude Haiku.
Mathematical Proficiency: On the Multilingual Grade School Math Benchmark (MGSM), which measures math reasoning using grade-school math problems, GPT-4o mini scored 87.0%, compared to 75.5% for Gemini Flash and 71.7% for Claude Haiku.
Coding Proficiency: GPT-4o mini scored 87.2% on HumanEval, which measures coding proficiency by looking at functional correctness for synthesizing programs from docstrings, compared to 71.5% for Gemini Flash and 75.9% for Claude Haiku.
Multimodal reasoning: GPT-4o mini also shows strong performance on the Massive Multi-discipline Multimodal Understanding (MMMU) dataset, a multimodal reasoning benchmark, scoring 59.4% compared to 56.1% for Gemini Flash and 50.2% for Claude Haiku.

We also have detailed analysis and comparisons done by Artificial Analysis, an independent organization that provides benchmarking and related information for various LLMs and SLMs. The following visual clearly shows how GPT-4o mini focuses on providing quality responses at blazing-fast speeds as compared to most other models.

Quality vs. Output Speed — Image Source: Artificial Analysis

Besides the quality of a model’s results, there are a couple of other factors we usually consider when choosing an LLM or SLM, including response speed and cost. Considering these factors, we get a variety of comparisons, including the model’s output speed, which is the number of output tokens per second received while the model is generating tokens (i.e., after the first chunk has been received from the API). These numbers are based on the median speed across all providers, and according to their observations, GPT-4o mini seems to have the highest output speed, which is pretty interesting, as seen in the following visual.
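
To make the output speed definition concrete, here is a minimal sketch of how such a number could be computed from your own timing measurements. The function name and the sample timings are my own illustrative assumptions and are not taken from Artificial Analysis.

```python
def output_speed(output_tokens: int, first_chunk_time: float, last_chunk_time: float) -> float:
    """Output tokens per second, measured from the first received chunk onwards."""
    elapsed = last_chunk_time - first_chunk_time
    if elapsed <= 0:
        raise ValueError("last_chunk_time must be later than first_chunk_time")
    return output_tokens / elapsed


# Hypothetical timings in seconds since the request was sent (not measured data).
speed = output_speed(output_tokens=500, first_chunk_time=0.5, last_chunk_time=5.5)
print(f"{speed:.1f} tokens/s")
```

Note that the time spent waiting for the first chunk is deliberately excluded, so this metric is separate from the latency to the first token.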

Output Speed — Image Source: Artificial Analysis

We also get a detailed comparison from Artificial Analysis on the cost of using GPT-4o mini vs other popular models. Here, the pricing is shown in terms of both input prompts and output responses in USD per 1M (million) tokens. GPT-4o mini is quite cheap, considering you do not need to worry about hosting it, setting up your own GPU infrastructure, and maintaining it!

Input and output prices — Image Source: Artificial Analysis

OpenAI also mentions that GPT-4o mini demonstrates strong performance in function and tool calling, which means you can get better performance when using this model to build AI Agents and complex Agentic AI systems that can fetch live data from the web, reason, observe, and take actions with external systems and tools. GPT-4o mini also has improved long-context performance compared to GPT-3.5 Turbo and also performs well in tasks like extracting structured data from receipts or generating high-quality email responses when provided with the full conversation history.
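
To give a rough idea of what function and tool calling involves, here is a small sketch that runs without an API key. The `get_weather` tool, its schema, and the `run_tool_call` helper are hypothetical names made up for illustration. The dictionary follows the function-tool format that the Chat Completions API accepts through its `tools` parameter, and the model is expected to reply with a tool name plus a JSON string of arguments, which your own code then executes.

```python
import json

# Hypothetical tool definition: name, description, and parameters are illustrative only.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def get_weather(city: str) -> str:
    # Stand-in implementation; a real tool would query a weather service.
    return f"Weather lookup requested for {city}"


# Maps tool names to the local Python functions that implement them.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute a tool call given the tool name and JSON-encoded arguments."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"Unknown tool: {name}")
    kwargs = json.loads(arguments_json)
    return TOOL_REGISTRY[name](**kwargs)


# Simulated tool call, standing in for what a model might request.
print(run_tool_call("get_weather", '{"city": "Paris"}'))
```

In a real agent, the tool result would be sent back to the model in a follow-up message so it can compose the final answer.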

GPT-4o mini availability and pricing comparisons

OpenAI has made GPT-4o mini available as a text and vision model immediately in the Assistants API, Chat Completions API, and the Batch API. You only need to pay 15 cents per 1M (million) input prompt tokens and 60 cents per 1M output response tokens. For ease of understanding, 1M tokens is roughly the equivalent of a 2500-page book!

It is also the cheapest model from OpenAI yet in comparison to its previous models, as seen in the following table, where we have condensed all the pricing information.
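
As a quick illustration of what these prices mean in practice, the sketch below estimates the cost of a workload from its token counts. The prices are the ones quoted above; the function name and the example token counts are my own assumptions.

```python
# Prices in USD per 1M tokens, as quoted above for GPT-4o mini.
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for the given number of input and output tokens."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical workload: 2M prompt tokens and 500K response tokens.
cost = estimate_cost(input_tokens=2_000_000, output_tokens=500_000)
print(f"${cost:.2f}")  # prints $0.60
```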

GPT-4o mini availability and pricing comparisons

In ChatGPT, Free, Plus, and Team users will be able to access GPT-4o mini very soon, during this week (the third week of July 2024).

Putting GPT-4o mini to the test

We will now put GPT-4o mini to the test and compare it with its two predecessors, GPT-4o and GPT-3.5 Turbo, on various popular tasks based on real-world problems. The key tasks we will be focusing on include the following:

Task 1: Zero-shot Classification
Task 2: Few-shot Classification
Task 3: Coding Tasks – Python
Task 4: Coding Tasks – SQL
Task 5: Information Extraction
Task 6: Closed-Domain Question Answering
Task 7: Open-Domain Question Answering
Task 8: Document Summarization
Task 9: Transformation
Task 10: Translation

Please note that the intent of this exercise is not to run any models on benchmark datasets but to take an example in each problem and see how well GPT-4o mini responds to it compared to the other two OpenAI models. Let the show begin!

Install Dependencies

We start by installing the necessary dependency, the OpenAI library, which we need to access its APIs.

!pip install openai

Enter OpenAI API Key

We enter our OpenAI key using the getpass() function so we don’t accidentally expose our key in the code.

from getpass import getpass

OPENAI_KEY = getpass('Enter Open AI API Key: ')

Setup API Key

Next, we set up our API key to use with the openai library.

import openai
from IPython.display import HTML, Markdown, display

openai.api_key = OPENAI_KEY  # use the key captured by getpass() above

Create ChatGPT Completion Access Function

This function will use the Chat Completion API to access ChatGPT for us and return responses based on the model we want to use including GPT-3.5 Turbo, GPT-4o, and GPT-4o mini.

def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.0, # degree of randomness of the model's output
    )
    return response.choices[0].message.content
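
Since every task below sends the same prompt to three models, a small wrapper can cut down on repetition. The following is only a sketch: `compare_models` and `fake_completion` are hypothetical helpers that are not part of the original notebook, and the stub exists only so the example runs without an API key.

```python
from typing import Callable, Dict, Sequence

MODELS = ("gpt-3.5-turbo", "gpt-4o", "gpt-4o-mini")


def compare_models(prompt: str,
                   completion_fn: Callable[..., str],
                   models: Sequence[str] = MODELS) -> Dict[str, str]:
    """Send one prompt to each model and collect the responses by model name."""
    return {model: completion_fn(prompt, model=model) for model in models}


# Stub used only so this sketch runs offline;
# pass get_completion instead to call the real models.
def fake_completion(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    return f"[{model}] {prompt}"


results = compare_models("Say hello", fake_completion)
for model, answer in results.items():
    print(model, "->", answer)
```

Passing the completion function as an argument keeps the helper easy to test and lets you swap in a different client later.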

Let’s try out the ChatGPT API!

We can quickly test the above function to see if our code can access OpenAI’s servers and use their models.

response = get_completion(prompt="Explain Generative AI in 2 bullet points", 
                          model="gpt-4o-mini")
display(Markdown(response))

OUTPUT

Seems to be working as expected; we can now start with our experiments!

Task 1: Zero-shot Classification

This task tests an LLM’s text classification capabilities by prompting it to classify a text without providing examples. Here, we will do a zero-shot sentiment analysis on some customer product reviews. We have three customer reviews as follows:

reviews = [
    f"""
    Just received the Bluetooth speaker I ordered for beach outings, and it's  
    fantastic. The sound quality is impressively clear with just the right amount of 
    bass. It's also waterproof, which tested true during a recent splashing 
    incident. Though it's compact, the volume can really fill the space.
    The price was a bargain for such high-quality sound.
    Shipping was also on point, arriving two days early in secure packaging.
    """,
    f"""
    Needed a new kitchen blender, but this model has been a nightmare.
    It's supposed to handle various foods, but it struggles with anything tougher 
    than cooked vegetables. It's also incredibly noisy, and the 'easy-clean' feature 
    is a joke; food gets stuck under the blades constantly.
    I thought the brand meant quality, but this product has proven me wrong.
    Plus, it arrived three days late. Definitely not worth the expense.
    """,
    f"""
    I tried to like this book and while the plot was really good, the print quality 
    was so not good
    """
]

We now create a prompt to do zero-shot text classification and run it against the 3 reviews using each of the 3 OpenAI models separately.

responses = {
    'gpt-3.5-turbo' : [],
    'gpt-4o' : [],
    'gpt-4o-mini' : []
}

for review in reviews:
  prompt = f"""
              Act as a product review analyst.
              Given the following review,
              Display the overall sentiment for the review 
              as only one of the following:
              Positive, Negative OR Neutral

              ```{review}```
              """
  response = get_completion(prompt, model="gpt-3.5-turbo")
  responses['gpt-3.5-turbo'].append(response)
  response = get_completion(prompt, model="gpt-4o")
  responses['gpt-4o'].append(response)
  response = get_completion(prompt, model="gpt-4o-mini")
  responses['gpt-4o-mini'].append(response)

# Display the output
import pandas as pd
pd.set_option('display.max_colwidth', None)

pd.DataFrame(responses)

OUTPUT

The results are mostly consistent across the models, except that GPT-3.5 Turbo fails to return just the sentiment for the 2nd example.
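
When a model returns extra text along with the label, a light post-processing step can recover the label. This is a minimal sketch with a hypothetical helper, `extract_sentiment`, and a made-up sample response. It simply picks the first label word it finds, so it would be fooled by phrasing like "not positive".

```python
import re
from typing import Optional


def extract_sentiment(response: str) -> Optional[str]:
    """Return the first sentiment label mentioned in a response, or None if absent."""
    match = re.search(r"\b(positive|negative|neutral)\b", response, flags=re.IGNORECASE)
    return match.group(1).capitalize() if match else None


# Made-up verbose response for illustration; not actual model output.
print(extract_sentiment("Overall sentiment: Negative. The blender was noisy."))
```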

Task 2: Few-shot Classification

This task tests an LLM’s text classification capabilities by prompting it to classify a text by providing examples of inputs and outputs. Here, we will classify the same customer reviews as those given in the previous example using few-shot prompting.

responses = {
    'gpt-3.5-turbo' : [],
    'gpt-4o' : [],
    'gpt-4o-mini' : []
}
for review in reviews:
  prompt = f"""
              Act as a product review analyst.
              Given the following review,
              Display only the overall sentiment for the review:
              Try to classify it by using the following examples as a reference:

              Review: Just received the Laptop I ordered for work, and it's amazing.
              Sentiment: 😊

              Review: Needed a new mechanical keyboard, but this model has been 
                      totally disappointing.
              Sentiment: 😡

              Review: ```{review}```
              """
  response = get_completion(prompt, model="gpt-3.5-turbo")
  responses['gpt-3.5-turbo'].append(response)
  response = get_completion(prompt, model="gpt-4o")
  responses['gpt-4o'].append(response)
  response = get_completion(prompt, model="gpt-4o-mini")
  responses['gpt-4o-mini'].append(response)

# Display the output
pd.DataFrame(responses)

OUTPUT

We see very similar results across the models, although the 3rd review, which is actually kind of mixed, gets interesting emoji outputs: GPT-3.5 Turbo and GPT-4o give us a confused face emoji (😕), and GPT-4o mini gives us a neutral or mildly disappointed face emoji (😐).

Task 3: Coding Tasks – Python

This task tests an LLM’s capabilities for generating Python code based on certain prompts. Here we try to focus on a key task of scaling your data before applying certain machine learning models.

prompt = f"""
Act as an expert in generating python code

Your task is to generate python code
to explain how to scale data for a ML problem.
Focus on just scaling and nothing else.
Keep into account key operations we should do on the data
to prevent data leakage before scaling.
Keep the code and answer concise.
"""

response = get_completion(prompt, model="gpt-3.5-turbo")
display(Markdown(response))

OUTPUT

We will try next with GPT-4o

response = get_completion(prompt, model="gpt-4o")
display(Markdown(response))

OUTPUT

Finally, we try the same task with the GPT-4o mini

response = get_completion(prompt, model="gpt-4o-mini")
display(Markdown(response))

OUTPUT

Overall, all 3 models do pretty well, although personally, I like GPT-4o mini’s explanation better, especially point 3, where we talk about using the fitted scaler to transform the test data, which is explained better than the response from GPT-4o. We also see that the response styles of both GPT-4o and GPT-4o mini are quite similar!
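
For readers who cannot see the model outputs, here is a dependency-free sketch of the idea the prompt is asking about: split first, fit the scaling statistics on the training data only, then reuse them on the test data. This is my own illustration on made-up data and not the output of any of the three models; in practice, a library scaler such as scikit-learn’s StandardScaler would usually be used instead.

```python
import random
import statistics

random.seed(42)
# Hypothetical single-feature dataset.
data = [random.uniform(0, 100) for _ in range(100)]

# 1. Split first, so test values never influence the scaling statistics.
random.shuffle(data)
train, test = data[:80], data[80:]

# 2. Fit: compute mean and standard deviation on the training split only.
mean = statistics.fmean(train)
std = statistics.pstdev(train)


def scale(values):
    """Standardize values using the statistics learned from the training split."""
    return [(v - mean) / std for v in values]


# 3. Transform both splits with the training statistics.
train_scaled = scale(train)
test_scaled = scale(test)

print(round(statistics.fmean(train_scaled), 6), round(statistics.pstdev(train_scaled), 6))
```

The scaled training data has mean 0 and standard deviation 1 by construction, while the scaled test data generally will not, which is expected.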

Task 4: Coding Tasks – SQL

This task tests an LLM’s capabilities for generating SQL code based on certain prompts. Here we try to focus on a slightly more complex query involving multiple database tables.

prompt = f"""
Act as an expert in generating SQL code.

Understand the following schema of the database tables carefully:
Table departments, columns = [DepartmentId, DepartmentName]
Table employees, columns = [EmployeeId, EmployeeName, DepartmentId]
Table salaries, columns = [EmployeeId, Salary]

Create a MySQL query for the employee with max salary in the 'IT' Department.
"""

response = get_completion(prompt, model="gpt-3.5-turbo")
display(Markdown(response))

OUTPUT

We will try next with GPT-4o

response = get_completion(prompt, model="gpt-4o")
display(Markdown(response))

OUTPUT

Finally, we try the same task with the GPT-4o mini

response = get_completion(prompt, model="gpt-4o-mini")
display(Markdown(response))

OUTPUT

Overall, all three models do pretty well. We also see that the response styles of both GPT-4o and GPT-4o mini are quite similar. Both give the same query and some detailed explanation of what is happening in the query. GPT-4o gives the most detailed explanation of the query step by step.
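
To show what a correct answer to this prompt can look like, here is one possible query, checked against a tiny in-memory database. This is my own sketch and not the output of any of the three models. It uses SQLite through Python’s sqlite3 module instead of MySQL so that it runs anywhere, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema from the prompt above, populated with hypothetical rows.
conn.executescript("""
CREATE TABLE departments (DepartmentId INTEGER PRIMARY KEY, DepartmentName TEXT);
CREATE TABLE employees (EmployeeId INTEGER PRIMARY KEY, EmployeeName TEXT, DepartmentId INTEGER);
CREATE TABLE salaries (EmployeeId INTEGER, Salary REAL);
INSERT INTO departments VALUES (1, 'IT'), (2, 'HR');
INSERT INTO employees VALUES (1, 'Asha', 1), (2, 'Ben', 1), (3, 'Cara', 2);
INSERT INTO salaries VALUES (1, 90000), (2, 120000), (3, 150000);
""")

# Join the three tables, keep only the IT department, and take the top salary.
query = """
SELECT e.EmployeeName, s.Salary
FROM employees e
JOIN departments d ON e.DepartmentId = d.DepartmentId
JOIN salaries s ON e.EmployeeId = s.EmployeeId
WHERE d.DepartmentName = 'IT'
ORDER BY s.Salary DESC
LIMIT 1;
"""
top_it_employee = conn.execute(query).fetchone()
print(top_it_employee)
conn.close()
```

Note that `ORDER BY ... LIMIT 1` returns a single row even when several employees tie for the top salary; a subquery on `MAX(Salary)` would return all of them.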

Task 5: Information Extraction

This task tests an LLM’s capabilities for extracting and analyzing key entities from documents. Here we will extract and expand on important entities in a clinical note.

clinical_note = """
60-year-old man in NAD with a h/o CAD, DM2, asthma, pharyngitis, SBP,
and HTN on altace for 8 years awoke from sleep around 1:00 am this morning
with a sore throat and swelling of the tongue.
He came immediately to the ED because he was having difficulty swallowing and
some trouble breathing due to obstruction caused by the swelling.
He did not have any associated SOB, chest pain, itching, or nausea.
He has not noticed any rashes.
He says that he feels like it is swollen down in his esophagus as well.
He does not recall vomiting but says he might have retched a bit.
In the ED he was given 25mg benadryl IV, 125 mg solumedrol IV,
and pepcid 20 mg IV.
Family history of CHF and esophageal cancer (father).
"""

prompt = f"""
Act as an expert in analyzing and understanding clinical doctor notes in healthcare.
Extract all symptoms only from the clinical note below in triple backticks.

Differentiate between symptoms that are present vs. absent.
Give me the probability (high/ medium/ low) of how sure you are about the result.
Add a note on the probabilities and why you think so.

Output as a markdown table with the following columns,
all symptoms should be expanded and no acronyms unless you don't know:

Symptoms | Present/Denies | Probability.


Also expand the acronyms in the note including symptoms and other medical terms.
Do not leave out any acronym related to healthcare.

Output that also as a separate appendix table in Markdown with the following columns,

Acronym | Expanded Term

Clinical Note:
```{clinical_note}```
"""

response = get_completion(prompt, model="gpt-3.5-turbo")
display(Markdown(response))

OUTPUT

We will try next with GPT-4o

response = get_completion(prompt, model="gpt-4o")
display(Markdown(response))

OUTPUT

Finally, we try the same task with the GPT-4o mini

response = get_completion(prompt, model="gpt-4o-mini")
display(Markdown(response))

OUTPUT

Overall, GPT-3.5 Turbo fails to follow all the instructions and does not give reasoning on the probability scoring, whereas both GPT-4o and GPT-4o mini follow the instructions faithfully and give answers in a similar style. GPT-4o probably gives the best responses, although GPT-4o mini comes pretty close and actually gives more detailed reasoning on the probability scoring. The two models are neck and neck; the only shortcoming is that GPT-4o mini failed to expand SOB as shortness of breath in the 2nd table, although it did expand it in the symptoms table. Interestingly, the last two rows of GPT-4o mini’s appendix table are drug brand names, which it has expanded to the actual drug ingredient names!

Task 6: Closed-Domain Question Answering

Question Answering (QA) is a natural language processing task that generates the desired answer for the given question. Question Answering can be open-domain QA or closed-domain QA, depending on whether the LLM is provided with the relevant context or not.

In closed-domain QA, a question along with relevant context is given. Here, the context is nothing but the relevant text, which ideally should have the answer, just like a RAG workflow.

report = """
Three quarters (77%) of the population saw an increase in their regular outgoings over the past year,
according to findings from our recent consumer survey. In contrast, just over half (54%) of respondents
had an increase in their salary, which suggests that the burden of costs outweighing income remains for
most. In total, across the 2,500 people surveyed, the increase in outgoings was 18%, three times higher
than the 6% increase in income.
Despite this, the findings of our survey suggest we have reached a plateau. Looking at savings,
for example, the share of people who expect to make regular savings this year is just over 70%,
broadly similar to last year. Over half of those saving plan to use some of the funds for residential
property. A third are saving for a deposit, and a further 20% for an investment property or second home.
But for some, their plans are being pushed back. 9% of respondents stated they had planned to purchase
a new home this year but have now changed their mind. While for many the deposit may be an issue,
the other driving factor remains the cost of the mortgage, which has been steadily rising the last
few years. For those that currently own a property, the survey showed that in the last year,
the average mortgage payment has increased from £668.51 to £748.94, or 12%."""
question = """
How much has the average mortgage payment increased in the last year?
"""

prompt = f"""
Using the following context information below please answer the following question
to the best of your ability
Context:
{report}
Question:
{question}
Answer:
"""

response = get_completion(prompt, model="gpt-3.5-turbo")
display(Markdown(response))

OUTPUT

We will try next with GPT-4o

response = get_completion(prompt, model="gpt-4o")
display(Markdown(response))

OUTPUT

Finally, we try the same task with the GPT-4o mini

response = get_completion(prompt, model="gpt-4o-mini")
display(Markdown(response))

OUTPUT

Pretty standard answers across all three models here; nothing significantly different.
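
Since the answer is stated in the context itself, it is easy to check by hand. The short sketch below recomputes the increase from the two payment figures given in the report; the variable names are my own.

```python
# Figures taken from the survey report above.
old_payment = 668.51
new_payment = 748.94

absolute_increase = new_payment - old_payment
percent_increase = absolute_increase / old_payment * 100

print(f"Increase: £{absolute_increase:.2f} ({percent_increase:.1f}%)")
```

This confirms the roughly 12% increase quoted in the report, which is the figure a correct answer should contain.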

Task 7: Open-Domain Question Answering

Question Answering (QA) is a natural language processing task that generates the desired answer for the given question.

In the case of open-domain QA, only the question is asked without providing any context or information. Here, the LLM answers the question using the knowledge gained from large volumes of text data during its training. This is basically Zero-Shot QA. This is where the model’s training knowledge cutoff becomes very important, especially for questions on recent events!

prompt = f"""
Please answer the following question to the best of your ability
Question:
What is LangChain?

Answer:
"""

response = get_completion(prompt, model="gpt-3.5-turbo")
display(Markdown(response))

OUTPUT

We will try next with GPT-4o

response = get_completion(prompt, model="gpt-4o")
display(Markdown(response))

OUTPUT

Finally, we try the same task with the GPT-4o mini

response = get_completion(prompt, model="gpt-4o-mini")
display(Markdown(response))

OUTPUT

Now, LangChain is a fairly new framework for building Generative AI applications, and that is why GPT-3.5 Turbo gives a totally wrong answer: the data it was trained on never had any mentions of the LangChain library. While this can be called a hallucination, factually it isn’t, because a while back there actually used to be a blockchain framework called LangChain, before Web 3.0, NFTs, and Blockchain went into slumber mode. GPT-4o and GPT-4o mini give the right answer here, with GPT-4o mini giving a slightly more detailed answer, but this can be controlled by putting constraints on the output format, even for GPT-4o.

Task 8: Document Summarization

Document summarization is a natural language processing task that involves creating a concise summary of the given text while still capturing all the important information.

doc = """
Coronaviruses are a large family of viruses which may cause illness in animals or humans.
In humans, several coronaviruses are known to cause respiratory infections ranging from the
common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS).
The most recently discovered coronavirus causes coronavirus disease COVID-19.
COVID-19 is the infectious disease caused by the most recently discovered coronavirus.
This new virus and disease were unknown before the outbreak began in Wuhan, China, in December 2019.
COVID-19 is now a pandemic affecting many countries globally.
The most common symptoms of COVID-19 are fever, dry cough, and tiredness.
Other symptoms that are less common and may affect some patients include aches
and pains, nasal congestion, headache, conjunctivitis, sore throat, diarrhea,
loss of taste or smell or a rash on skin or discoloration of fingers or toes.
These symptoms are usually mild and begin gradually.
Some people become infected but only have very mild symptoms.
Most people (about 80%) recover from the disease without needing hospital treatment.
Around 1 out of every 5 people who gets COVID-19 becomes seriously ill and develops difficulty breathing.
Older people, and those with underlying medical problems like high blood pressure, heart and lung problems,
diabetes, or cancer, are at higher risk of developing serious illness.
However, anyone can catch COVID-19 and become seriously ill.
People of all ages who experience fever and/or  cough associated with difficulty breathing/shortness of breath,
chest pain/pressure, or loss of speech or movement should seek medical attention immediately.
If possible, it is recommended to call the health care provider or facility first,
so the patient can be directed to the right clinic.
People can catch COVID-19 from others who have the virus.
The disease spreads primarily from person to person through small droplets from the nose or mouth,
which are expelled when a person with COVID-19 coughs, sneezes, or speaks.
These droplets are relatively heavy, do not travel far and quickly sink to the ground.
People can catch COVID-19 if they breathe in these droplets from a person infected with the virus.
This is why it is important to stay at least 1 meter away from others.
These droplets can land on objects and surfaces around the person such as tables, doorknobs and handrails.
People can become infected by touching these objects or surfaces, then touching their eyes, nose or mouth.
This is why it is important to wash your hands regularly with soap and water or clean with alcohol-based hand rub.
Practicing hand and respiratory hygiene is important at ALL times and is the best way to protect others and yourself.
When possible maintain at least a 1 meter distance between yourself and others.
This is especially important if you are standing by someone who is coughing or sneezing.
Since some infected persons may not yet be exhibiting symptoms or their symptoms may be mild,
maintaining a physical distance with everyone is a good idea if you are in an area where COVID-19 is circulating."""

prompt = f"""
You are an expert in generating accurate document summaries.
Generate a summary of the given document.

Document:
{doc}

Constraints: Please start the summary with the delimiter 'Summary'
and limit the summary to 5 lines

Summary:
"""

response = get_completion(prompt, model="gpt-3.5-turbo")
display(Markdown(response))

OUTPUT

We will try next with GPT-4o

response = get_completion(prompt, model="gpt-4o")
display(Markdown(response))

OUTPUT

Finally, we try the same task with the GPT-4o mini

response = get_completion(prompt, model="gpt-4o-mini")
display(Markdown(response))

OUTPUT

These are pretty good summaries all around, although personally, I like the summaries generated by GPT-4o and GPT-4o mini, as they give some minor but important details, like the time when this disease emerged.

Task 9: Transformation

You can use LLMs to take an existing document and transform it into other content formats, and even to generate training data for fine-tuning or training models.

fact_sheet_mobile = """
PRODUCT NAME
Samsung Galaxy Z Fold4 5G Black
PRODUCT OVERVIEW
Stands out. Stands up. Unfolds.
The Galaxy Z Fold4 does a lot in one hand with its 15.73 cm(6.2-inch) Cover Screen.
Unfolded, the 19.21 cm(7.6-inch) Main Screen lets you really get into the zone.
Pushed-back bezels and the Under Display Camera means there's more screen
and no black dot getting between you and the breathtaking Infinity Flex Display.
Do more than more with Multi View. Whether toggling between texts or catching up
on emails, take full advantage of the expansive Main Screen with Multi View.
PC-like power thanks to Qualcomm Snapdragon 8+ Gen 1 processor in your pocket,
transforms apps optimized with One UI to give you menus and more in a glance
New Taskbar for PC-like multitasking. Wipe out tasks in fewer taps. Add
apps to the Taskbar for quick navigation and bouncing between windows when
you're in the groove.4 And with App Pair, one tap launches up to three apps,
all sharing one super-productive screen
Our toughest Samsung Galaxy foldables ever. From the inside out,
Galaxy Z Fold4 is made with materials that are not only stunning,
but stand up to life's bumps and fumbles. The front and rear panels,
made with exclusive Corning Gorilla Glass Victus+, are ready to resist
sneaky scrapes and scratches. With our toughest aluminum frame made with
Armor Aluminum, this is one durable smartphone.
World’s first water resistant foldable smartphones. Be adventurous, rain
or shine. You don't have to sweat the forecast when you've got one of the
world's first water-resistant foldable smartphones.

PRODUCT SPECS
OS - Android 12.0
RAM - 12 GB
Product Dimensions - 15.5 x 13 x 0.6 cm; 263 Grams
Batteries - 2 Lithium Ion batteries required. (included)
Item model number - SM-F936BZKDINU_5
Wireless communication technologies - Cellular
Connectivity technologies - Bluetooth, Wi-Fi, USB, NFC
GPS - True
Special features - Fast Charging Support, Dual SIM, Wireless Charging, Built-In GPS, Water Resistant
Other display features - Wireless
Device interface - primary - Touchscreen
Resolution - 2176x1812
Other camera features - Rear, Front
Form factor - Foldable Screen
Colour - Phantom Black
Battery Power Rating - 4400
Whats in the box - SIM Tray Ejector, USB Cable
Manufacturer - Samsung India pvt Ltd
Country of Origin - China
Item Weight - 263 g
"""

prompt = f"""Turn the following product description
into a list of frequently asked questions (FAQ).
Show both the question and its corresponding answer
Generate at the max 5 but diverse and useful FAQs

Product description:
```{fact_sheet_mobile}```
"""

response = get_completion(prompt, model="gpt-3.5-turbo")
display(Markdown(response))

OUTPUT

We will try next with GPT-4o

response = get_completion(prompt, model="gpt-4o")
display(Markdown(response))

OUTPUT

Finally, we try the same task with the GPT-4o mini

response = get_completion(prompt, model="gpt-4o-mini")
display(Markdown(response))

OUTPUT

All three models perform the task successfully; however, it is quite clear that the quality of answers generated by GPT-4o and GPT-4o mini is richer and more detailed than the responses from GPT-3.5 Turbo.

Task 10: Translation

You can use LLMs to translate an existing document from a source to a target language and to multiple languages simultaneously. Here, we will try to translate a piece of text into multiple languages and force the LLM to output a valid JSON response.

prompt = """You are an expert translator.
Translate the given text from English to German and Spanish.
Show the output as key value pairs in JSON.
Output should have all 3 languages.

Text: 'Hello, how are you today?'
Translation:
"""

response = get_completion(prompt, model="gpt-3.5-turbo")
display(Markdown(response))

OUTPUT

We will try next with GPT-4o

response = get_completion(prompt, model="gpt-4o")
display(Markdown(response))

OUTPUT

Finally, we try the same task with the GPT-4o mini

response = get_completion(prompt, model="gpt-4o-mini")
display(Markdown(response))

OUTPUT

All three models perform the task successfully; however, GPT-4o and GPT-4o mini generate a formatted JSON string, unlike GPT-3.5 Turbo.
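
If you want to use such a response in code, it helps to parse it defensively, since a model may wrap its JSON in a Markdown code fence. The sketch below is a hypothetical helper, `parse_json_response`, run on a hand-written sample string and not on actual model output.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to embed


def parse_json_response(response: str) -> dict:
    """Parse JSON from a model response, tolerating a surrounding Markdown code fence."""
    text = response.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag) and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)


# Hand-written sample response for illustration.
sample = (
    FENCE + "json\n"
    '{"English": "Hello, how are you today?", '
    '"German": "Hallo, wie geht es dir heute?", '
    '"Spanish": "Hola, ¿cómo estás hoy?"}\n'
    + FENCE
)
translations = parse_json_response(sample)
print(sorted(translations))
```

The helper raises `json.JSONDecodeError` when the remaining text is not valid JSON, which is a useful signal to retry the request or tighten the prompt.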

The Verdict

While it is very difficult to say which LLM is better just by looking at a few tasks, considering factors like pricing, latency, multimodality, and quality of results across diverse tasks, you should definitely consider GPT-4o mini over GPT-3.5 Turbo. However, GPT-4o is probably still the model with the highest quality of results. Once again, do not take these results at face value; try the models yourself on your own use cases and make a final decision. We did not consider other open SLMs like Llama 3, Gemma 2, and so on, so I would also encourage you to compare GPT-4o mini to its SLM counterparts!

Conclusion

In this guide, we gained an in-depth understanding of the features and performance of OpenAI’s newly launched GPT-4o mini. We also did a detailed comparative analysis of how GPT-4o mini fares against its predecessors, GPT-4o and GPT-3.5 Turbo, across a total of ten different tasks! Do check out this Colab notebook for easy access to the code, and do try out GPT-4o mini; it is one of the most promising small language models so far!
