AI can be a game-changer for any company, but training large language models demands enormous computational power. That requirement is a daunting barrier to adoption, especially for organizations that need the technology to deliver significant impact without spending a great deal of money.
The Mixture of Experts (MoE) technique offers an efficient solution to this problem: a large model is split into several specialized sub-models, and only a few of them are used for any given input. This way of building AI systems not only makes better use of resources but also lets businesses tailor high-performance AI tools to their needs, making complex AI more affordable.
Learning Objectives
- Understand the concept and significance of Mixture of Experts (MoE) models in optimizing computational resources for AI applications.
- Explore the architecture and components of MoE models, including experts and router networks, and their practical implementations.
- Learn about the OLMoE model, its unique features, training techniques, and performance benchmarks.
- Gain hands-on experience in running OLMoE on Google Colab using Ollama and testing its capabilities with real-world tasks.
- Examine the practical use cases and efficiency of sparse model architectures like OLMoE in diverse AI applications.
This article was published as a part of the Data Science Blogathon.
Need for Mixture of Experts Models
Modern deep learning models use artificial neural networks composed of layers of “neurons” or nodes. Each neuron takes input, applies a simple math operation (called an activation function), and sends the result to the next layer. More advanced models, like transformers, have extra features like self-attention, which help them understand more complex patterns in data.
However, using the entire network for every input, as in dense models, can be very resource-heavy. Mixture of Experts (MoE) models solve this by leveraging a sparse architecture by activating only the most relevant parts of the network (called “experts”) for each input. This makes MoE models efficient, as they can handle more complex tasks like natural language processing without needing as much computational power.
How do Mixture of Experts Models Work?
When working on a group project, the team often includes members who are each really good at a different, specific task. A Mixture of Experts (MoE) model works in a similar way: it divides a complicated problem among smaller parts, called "experts," each of which specializes in solving one piece of the puzzle.
For example, if you were building a robot to help around the house, one expert might handle cleaning, another might be great at organizing, and a third might cook. Each expert focuses on what they’re best at, making the entire process faster and more accurate.
This way, the group works together efficiently, allowing them to get the job done better and faster instead of one person doing everything.
Main Components of MoE
In a Mixture of Experts (MoE) model, there are two important parts that make it work:
- Experts – Think of experts as special workers in a factory. Each worker is really good at one specific task. In the case of an MoE model, these “experts” are actually smaller neural networks (like FFNNs) that focus on specific parts of the problem. Only a few of these experts are needed to work on each task, depending on what’s required.
- Router or Gate Network – The router is like a manager who decides which experts should work on which task. It looks at the input data (like a piece of text or an image) and decides which experts are best suited to handle it. The router activates only the necessary experts, instead of using the whole team for everything, making the process more efficient. A minimal code sketch of both components follows this list.
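To make these two components concrete, below is a minimal PyTorch-style sketch of an expert (a small feed-forward network) and a router (a linear gate that scores the experts for each token). The class names, sizes, and activation choice are illustrative assumptions made for this article, not OLMoE's actual implementation.
import torch
import torch.nn as nn

class Expert(nn.Module):
    """A small feed-forward network - one specialized 'worker'."""
    def __init__(self, hidden_dim, ff_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, hidden_dim),
        )

    def forward(self, x):
        return self.net(x)

class Router(nn.Module):
    """The 'manager': produces a suitability score for every expert."""
    def __init__(self, hidden_dim, num_experts):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)

    def forward(self, x):
        return torch.softmax(self.gate(x), dim=-1)  # one probability per expert, per token

# Quick check with toy sizes (purely illustrative)
tokens = torch.randn(4, 16)                    # 4 tokens, hidden size 16
router = Router(hidden_dim=16, num_experts=8)
print(router(tokens).shape)                    # torch.Size([4, 8]): a score per expert for each token
The next two subsections look at each of these pieces in more detail.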
Experts
In a Mixture of Experts (MoE) model, the “experts” are like mini neural networks, each trained to handle different tasks or types of data.
Few Active Experts at a Time:
- However, in MoE models, these specialists don’t all work at the same time. The model is designed to be “sparse,” which means only a few experts are active at any given moment, depending on the task at hand.
- This helps the system stay focused and efficient, using just the right specialists for the job, rather than overloading it with too many tasks or experts working unnecessarily. This approach keeps the model from being overwhelmed and makes it faster and more efficient.
In the context of processing text inputs, the experts might specialize as follows (purely for illustration):
- One expert in a layer (e.g. Expert 1) might handle the punctuation around words,
- Another expert (e.g. Expert 2) might handle adjectives (like good, bad, ugly),
- Another expert (e.g. Expert 3) might handle conjunctions (and, but, if).
Given an input text, the system chooses the expert best suited for the task, as shown below. Since most LLMs have several decoder blocks, the text passes through multiple experts in different layers before generation.
Router or Gate Network
In a Mixture of Experts (MoE) model, the “gating network” helps the model decide which experts (mini neural networks) should handle a specific task. Think of it like a smart guide that looks at the input (like a sentence to be translated) and chooses the best experts to work on it.
There are different ways the gating network can choose the experts, which we call “routing algorithms.” Here are a few simple ones:
- Top-k routing: The gating network picks the top ‘k’ experts with the highest scores to handle the task.
- Expert choice routing: Instead of the data picking the experts, the experts decide which tasks they’re best suited for. This helps keep everything balanced.
Once the experts finish their tasks, the model combines their results to make a final decision. Sometimes, more than one expert is needed for complex problems, but the gating network makes sure the right ones are used at the right time.
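As a rough illustration of top-k routing and this final combination step, the sketch below scores a few toy tokens, keeps only the two highest-scoring experts per token, and mixes their outputs using the normalized router weights. The tensor shapes and the loop-based combination are simplifications chosen for clarity, not how production MoE layers are implemented.
import torch
import torch.nn.functional as F

num_tokens, hidden_dim, num_experts, k = 4, 8, 6, 2   # toy sizes, for illustration only

router_logits = torch.randn(num_tokens, num_experts)               # router score per (token, expert)
expert_outputs = torch.randn(num_experts, num_tokens, hidden_dim)  # pretend each expert already processed every token

# Top-k routing: keep only the k highest-scoring experts for each token
topk_scores, topk_idx = router_logits.topk(k, dim=-1)
weights = F.softmax(topk_scores, dim=-1)                           # normalize over the chosen experts only

# Combine: weighted sum of the selected experts' outputs for each token
combined = torch.zeros(num_tokens, hidden_dim)
for token in range(num_tokens):
    for slot in range(k):
        expert = topk_idx[token, slot]
        combined[token] += weights[token, slot] * expert_outputs[expert, token]

print(combined.shape)  # torch.Size([4, 8]): one mixed output vector per token
OLMoE, described next, uses this top-k style of routing, selecting 8 of its 64 experts per layer.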
Details of OLMoE model
OLMoE is a new, fully open-source Mixture-of-Experts (MoE) language model developed by researchers from the Allen Institute for AI, Contextual AI, the University of Washington, and Princeton University.
It leverages a sparse architecture, meaning only a small number of “experts” are activated for each input, which helps save computational resources compared to traditional models that use all parameters for every token.
The OLMoE model comes in two versions:
- OLMoE-1B-7B, which has 7 billion total parameters but activates 1 billion parameters per token, and
- OLMoE-1B-7B-INSTRUCT, which is fine-tuned for better task-specific performance.
Architecture of OLMoE
- OLMoE uses a Mixture of Experts design for efficiency, placing a small group of experts in each layer.
- In this model, there are 64 experts in each layer, but only eight are activated at a time, which saves processing power. This lets OLMoE handle diverse tasks without excessive computational cost, compared to models that activate all parameters for every input (see the quick arithmetic after this list).
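As a rough sanity check on how sparse this is, here is a tiny calculation based on the figures quoted above (64 experts with 8 active, roughly 7 billion total and 1 billion active parameters). Treat the results as approximations.
# Rough arithmetic based on the figures quoted in this article
total_experts, active_experts = 64, 8
total_params, active_params = 7e9, 1e9

print(f"Experts active per token: {active_experts / total_experts:.1%}")    # 12.5%
print(f"Parameters active per token: {active_params / total_params:.1%}")   # ~14.3%
So only about an eighth of the experts, and roughly a seventh of the parameters, do any work for a single token.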
How was OLMoE Trained?
OLMoE was trained on a massive dataset of 5 trillion tokens, helping it perform well across many language tasks. During training, special techniques were used, like auxiliary losses and load balancing, to make sure the model uses its resources efficiently and remains stable. This ensures that only the best-performing parts of the model are activated depending on the task, allowing OLMoE to handle different tasks effectively without overloading the system. The use of router z-losses further improves its ability to manage which parts of the model should be used at any time.
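To give a feel for what these training losses look like, here is a hedged sketch of a load-balancing auxiliary loss and a router z-loss in the style commonly used in MoE work (e.g. Switch-Transformer-style formulations). It illustrates the general idea only and is not OLMoE's exact training code.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, topk_idx, num_experts):
    # Encourages tokens to spread evenly across experts (a common MoE formulation)
    probs = F.softmax(router_logits, dim=-1)                           # (tokens, experts)
    assignments = F.one_hot(topk_idx, num_experts).float().sum(dim=1)  # which experts each token was routed to
    tokens_per_expert = assignments.mean(dim=0)                        # fraction of tokens handled by each expert
    prob_per_expert = probs.mean(dim=0)                                # average router probability per expert
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)

def router_z_loss(router_logits):
    # Penalizes very large router logits, which helps keep routing numerically stable
    return torch.mean(torch.logsumexp(router_logits, dim=-1) ** 2)

# Toy example: 4 tokens, 8 experts, top-2 routing
logits = torch.randn(4, 8)
_, topk_idx = logits.topk(2, dim=-1)
print(load_balancing_loss(logits, topk_idx, num_experts=8).item(), router_z_loss(logits).item())
Such terms are typically added, with small weighting coefficients, to the main language-modeling loss during training.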
Performance of OLMoE-1B-7B
The OLMoE-1B-7B model has been tested against several top-performing models, like Llama2-13B and DeepSeekMoE-16B, as shown in the Figure below, and has shown notable improvements in both efficiency and performance. It excelled in key NLP tests, such as MMLU, GSM8k, and HumanEval, which evaluate a model’s skills in areas like logic, math, and language understanding. These benchmarks are important because they measure how well a model can perform various tasks, proving that OLMoE can compete with larger models while being more efficient.
Running OLMoE on Google Colab using Ollama
Ollama is an advanced AI tool that lets users easily set up and run large language models locally (in CPU or GPU mode). In the following steps, we will run the OLMoE model on Google Colab using Ollama.
Step1: Installing the Required Libraries
!sudo apt update
!sudo apt install -y pciutils
!pip install langchain-ollama
!curl -fsSL https://ollama.com/install.sh | sh
- !sudo apt update: Updates the package lists so that we install the latest available versions.
- !sudo apt install -y pciutils: Installs the pciutils package, which Ollama needs to detect the GPU type.
- !pip install langchain-ollama: Installs the langchain-ollama Python package, which integrates the LangChain framework with the Ollama model server.
- !curl -fsSL https://ollama.com/install.sh | sh: Uses curl to download and run the Ollama install script.
Step2: Importing the Required Libraries
import threading
import subprocess
import time
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from IPython.display import Markdown
Step3: Running Ollama in Background on Colab
def run_ollama_serve():
    # Launch the Ollama server as a separate background process
    subprocess.Popen(["ollama", "serve"])

# Run the server in its own thread so the notebook stays responsive
thread = threading.Thread(target=run_ollama_serve)
thread.start()

# Give the server a few seconds to start before sending any requests
time.sleep(5)
The run_ollama_serve() function launches an external process (ollama serve) using subprocess.Popen().
A new thread is created with the threading package to run run_ollama_serve(), so the Ollama service runs in the background. The main thread then sleeps for 5 seconds (time.sleep(5)), giving the server time to start up before we proceed with any further actions.
Step4: Pulling olmoe-1b-7b from Ollama
!ollama pull sam860/olmoe-1b-7b-0924
Running !ollama pull sam860/olmoe-1b-7b-0924 downloads the olmoe-1b-7b language model and prepares it for use.
Step5: Prompting the olmoe-1b-7b model
template = """Question: {question}
Answer: Let's think step by step."""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="sam860/olmoe-1b-7b-0924")
chain = prompt | model
question = (
    'Summarize the following into one sentence: "Bob was a boy. Bob had a dog. '
    "Bob and his dog went for a walk. Bob and his dog walked to the park. "
    "At the park, Bob threw a stick and his dog brought it back to him. "
    "The dog chased a squirrel, and Bob ran after him. "
    'Bob got his dog back and they walked home together."'
)

display(Markdown(chain.invoke({"question": question})))
The above code creates a prompt template to format a question, feeds the question to the model, and outputs the response.
Testing OLMoE with Different Questions
Summarization Question
Question
"Summarize the following into one sentence: \"Bob was a boy. Bob had a dog.
And then Bob and his dog went for a walk. Then his dog and Bob walked to the park.
At the park, Bob threw a stick and his dog brought it back to him. The dog chased a
squirrel, and Bob ran after him. Bob got his dog back and they walked home
together.\""
Output from Model:
As we can see, the output has a fairly accurate summarized version of the paragraph.
Logical Reasoning Question
Question
“Give me a list of 13 words that have 9 letters.”
Output from Model
As we can see, the output has 13 words but not all words contain 9 letters. So, it is not completely accurate.
Common Sense Question
Question
“Create a birthday planning checklist.”
Output from Model
As we can see, the model has created a good list for birthday planning.
Coding Question
Question
"Write a Python program to Merge two sorted arrays into a single sorted array.”
Output from Model
The model accurately generated code to merge two sorted arrays into one sorted array.
Conclusion
The Mixture of Experts (MoE) technique breaks complex problems into smaller tasks. Specialized sub-networks, called “experts,” handle these tasks. A router assigns tasks to the most suitable experts based on the input. MoE models are efficient, activating only the required experts to save computational resources. They can tackle diverse challenges effectively. However, MoE models face challenges like complex training, overfitting, and the need for diverse datasets. Coordinating experts efficiently can also be difficult.
OLMoE, an open-source MoE model, optimizes resource usage with a sparse architecture, activating only eight out of 64 experts at a time. It comes in two versions: OLMoE-1B-7B, with 7 billion total parameters (1 billion active per token), and OLMoE-1B-7B-INSTRUCT, fine-tuned for task-specific applications. These innovations make OLMoE powerful yet computationally efficient.
Key Takeaways
- Mixture of Experts (MoE) models break down large tasks into smaller, manageable parts handled by specialized sub-networks called “experts.”
- By activating only the necessary experts for each task, MoE models save computational resources and effectively handle diverse challenges.
- A router (or gate network) ensures efficiency by dynamically assigning tasks to the most relevant experts based on input.
- MoE models face hurdles like complex training, potential overfitting, the need for diverse datasets, and managing expert coordination.
- The open-source OLMoE model uses sparse architecture, activating 8 out of 64 experts at a time, and offers two versions—OLMoE-1B-7B and OLMoE-1B-7B-INSTRUCT—delivering both efficiency and task-specific performance.
Frequently Asked Questions
Q. What are "experts" in a Mixture of Experts model?
A. In an MoE model, experts are small neural networks trained to specialize in specific tasks or data types. For example, they may focus on processing punctuation, adjectives, or conjunctions in text.
Q. How do MoE models save computational resources?
A. MoE models use a "sparse" design, activating only a few relevant experts at a time based on the task. This approach reduces unnecessary computation, keeps the system focused, and improves speed and efficiency.
Q. Which versions of OLMoE are available?
A. OLMoE is available in two versions: OLMoE-1B-7B, with 7 billion total parameters and 1 billion activated per token, and OLMoE-1B-7B-INSTRUCT. The latter is fine-tuned for improved task-specific performance.
Q. Why is OLMoE's sparse architecture more efficient?
A. The sparse architecture of OLMoE activates only the necessary experts for each input, minimizing computational costs. This design makes the model more efficient than traditional models that engage all parameters for every input.
Q. How does the gating (router) network choose experts?
A. The gating network selects the best experts for each task using methods like top-k or expert choice routing. This approach enables the model to handle complex tasks efficiently while conserving computational resources.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.