Author: Fernando Sigüenza

Published on: October 15, 2024

Rethinking LLMs: A Modular, Distributed Approach to AI

Large Language Models (LLMs) have emerged as powerful tools for understanding and generating human text across various applications. However, as these models grow, they become more complex and consume a lot of resources, similar to a monolithic application. This raises a key question: Is this the most efficient path forward, or is it time to challenge this paradigm?

This article proposes an alternative approach to language models, embracing modularity, specialisation, and distributed computing. Instead of relying on single, massive models, we imagine a future where multiple smaller, highly specialised LLMs work together, communicating through event-driven architectures. Complex queries would no longer be processed by one enormous neural network but orchestrated across a network of focused, efficient models, with some small enough to run on edge devices.

In this article, we will explore and develop an event-driven distributed system, train small models, and compare the results.

Too Long; Didn’t Read (TL;DR): Take me to the code: https://github.com/fersiguenza/distributed_llms


A Little Context

Training an LLM can be costly in terms of time and resources, depending on the model size, data quality, and computational resources used.

For instance, training a model like OpenAI’s GPT-3, which has 175 billion parameters, is estimated to have cost millions of dollars in computing hardware and cloud resources, plus several months of training on a large amount of high-quality data.

In contrast, training smaller LLMs focused on a single task would likely be more cost-effective, requiring less data, fewer computational resources, and less time.

Such models are often referred to as ‘task-specific’ or ‘domain-specific’ language models, and there are already several related concepts and initiatives in this area, such as:

  • Distillation and Compression: Techniques to create smaller, more efficient models from larger ones (a minimal distillation-loss sketch follows this list).
  • TinyML: Aims to run machine learning models on small, low-power devices such as microcontrollers or edge devices.
  • Edge AI: Runs AI models on local devices rather than in the cloud, which often requires smaller, more efficient models.
  • Sparse Models: Models that use only a subset of their parameters for specific tasks, allowing for more efficient computation.
  • Domain-Specific Models: Models trained on specific types of data for particular applications.
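
To make the distillation idea more concrete, here is a minimal sketch of a standard knowledge-distillation loss in PyTorch (the same library used by the training examples later in this article). The student is trained against both the ground-truth labels and the teacher’s softened output distribution; the temperature and alpha values are purely illustrative.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend the usual cross-entropy loss with a soft-target loss that
    pulls the student towards the teacher's output distribution."""
    # Hard-label loss against the ground-truth classes
    ce_loss = F.cross_entropy(student_logits, labels)

    # Soft-label loss: KL divergence between the softened distributions
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (temperature ** 2)

    return alpha * ce_loss + (1 - alpha) * kd_loss
```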

Introducing Small, Specific Language Models

Small Language Models (SLMs) are compact, efficient versions of Natural Language Processing (NLP) systems, designed to perform specific tasks with reduced computational requirements. These models are suitable for deployment on resource-constrained devices or for rapid, low-latency applications, often referred to as ‘efficient NLP models’ or ‘efficient language models’.
There are a few different terms and categories used to describe smaller language models, including:

  • Small Language Models (SLMs): This term is sometimes used to explicitly differentiate from LLMs.
  • Compact Language Models: Emphasises the reduced size of these models.
  • Efficient Language Models: Focuses on their ability to run with fewer computational resources.
  • Task-Specific Models: When fine-tuned for particular tasks, they might be referred to by the task (e.g., ‘sentiment analysis model’).
  • Distilled Models: If created through knowledge distillation from larger models (e.g., DistilBERT).
  • Compressed Models: When derived from larger models through various compression techniques.
  • Lightweight Models: Emphasising their ability to run on less powerful hardware.
  • Foundation Models: A broader term that can include both large and small pre-trained models.

There are already some lightweight model approaches, such as DistilBERT, ALBERT, GPT-2 Small, and Sentence-BERT, as well as efforts like PiGPT, which runs GPT-2 on a Raspberry Pi, and TensorFlow Lite, which deploys models on mobile and IoT devices.

Combining these smaller models with an orchestrator aligns with current research and development efforts in AI, where concepts such as ‘mixture of experts’ or ‘ensemble of models’ architectures are gaining relevance. Here are some examples (a toy gating sketch follows this list):

  • Mixture of Experts (MoE): This is a machine learning technique where multiple expert networks are used together, with a gating network that decides which expert(s) to use for a given input.
  • Routing Networks: These are neural networks that learn to route inputs to the most appropriate expert sub-networks.
  • Federated Learning: While not exactly the same, it shares some conceptual similarities by coordinating multiple decentralised models.
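
To illustrate the Mixture of Experts idea, here is a toy PyTorch sketch of an MoE layer: a small gating network produces a weighting over a handful of expert networks, and the layer output is the weighted combination of their outputs. Production MoE systems add sparse top-k routing and load balancing, which are omitted here for clarity.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer with dense (soft) routing."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        # Each expert is a small feed-forward network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        # The gate scores how relevant each expert is for a given input
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)                         # (batch, num_experts)
        expert_out = torch.stack([expert(x) for expert in self.experts], 1)   # (batch, num_experts, dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)                # (batch, dim)

# Quick smoke test
layer = TinyMoE(dim=16)
print(layer(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```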

Some companies have also started exploring similar approaches:

  • Google’s Pathways AI Architecture: This aims to create a single model that can generalise across thousands of tasks, using only relevant parts of the network for each task.
  • Microsoft’s Z-Code: A multilingual neural network that uses experts for different languages and tasks.
  • OpenAI’s GPT-3 API: While not using multiple small models, it does use different “engines” optimised for different tasks.

Hugging Face Transformers Library

The Hugging Face Transformers Library is a popular open-source library for Natural Language Processing (NLP) tasks, providing a unified interface to work with different pre-trained language models and to fine-tune these models for specific tasks.
The library provides a high-level API that allows users to quickly load pre-trained models and use them for inference or fine-tuning.

It includes pipelines for common NLP tasks, making it easy to perform complex operations with just a few lines of code.

```python
from transformers import pipeline

# Load a pre-trained sentiment analysis model
classifier = pipeline('sentiment-analysis')

# Use the model
result = classifier('I love using the Transformers library!')
print(result)
```

Event-Driven Distributed Architecture with Small Language Models

Training smaller LLMs can bring several benefits in terms of cost optimisation, as well as a more flexible system in which individual models can be optimised or deployed without affecting the rest, providing a more reliable environment. However, managing several LLMs in a distributed system presents its own challenges and often increases infrastructure and maintenance costs.

At a high level, the architecture consists of an API gateway that receives requests, an orchestrator that decides which specialised models to engage, an event bus that distributes the work, the specialised model services themselves, and a result aggregator that combines their outputs (these components are detailed in the list further below).

Some challenges we will need to consider:

  • Latency: Ensuring quick responses despite multiple service calls.
  • Consistency: Maintaining consistent responses across distributed LLMs.
  • Resource Management: Efficiently allocating computational resources.
  • Complex State Management: Handling multi-step reasoning across services.

Given tools like the Hugging Face Transformers Library, which already address some of these needs, does it still make sense to implement something like this?

It’s worth weighing the pros and cons and possibly testing with some examples before jumping to conclusions.


Pros

  • Scalability: Each LLM could be scaled independently based on demand.
  • Flexibility: New LLMs could be added or removed easily without disrupting the entire system.
  • Specialisation: Each LLM could be highly optimised for specific tasks or domains.
  • Efficiency: Resources could be allocated more effectively, with only relevant LLMs being engaged for each task.
  • Fault Tolerance: If one LLM fails, others could still operate, improving system reliability.
  • Continuous Improvement: Individual LLMs could be updated or replaced without taking down the entire system.

Cons

The main drawback is the additional infrastructure and moving parts this architecture requires; at a minimum, you would need the following (a minimal event-dispatch sketch follows this list):

  • Event Bus: A robust message broker (like Apache Kafka or RabbitMQ) to handle event distribution.
  • Orchestrator Service: To receive initial queries and determine which LLMs to engage.
  • LLM Microservices: Each running a specialised LLM, listening for relevant events.
  • Result Aggregator: To combine outputs from multiple LLMs when necessary.
  • API Gateway: To handle external requests and responses.
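
As a rough illustration of how the orchestrator could publish work onto the event bus, the snippet below uses the kafka-python client. The topic names, broker address, and message schema are assumptions made for this sketch, not the repository’s actual configuration.

```python
import json
import uuid

from kafka import KafkaProducer

# Hypothetical mapping from task type to Kafka topic
TOPIC_BY_TASK = {
    "sentiment": "sentiment-requests",
    "question_answering": "qa-requests",
}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda value: json.dumps(value).encode("utf-8"),
)

def dispatch(query: str, task: str) -> str:
    """Publish the query to the topic of the specialised model that should handle it."""
    correlation_id = str(uuid.uuid4())
    producer.send(TOPIC_BY_TASK[task], {"id": correlation_id, "text": query})
    producer.flush()
    return correlation_id  # the aggregator uses this id to match results to requests

dispatch("Would you mind telling me the time the next train will arrive?", "question_answering")
```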

Hands-on Experiment

Let’s test the hypothesis by following these steps:

  • Choose a base model: Let’s start with a smaller pre-trained model like DistilBERT, ALBERT, or TinyBERT.
  • Prepare the dataset: Use data specific to the task for the model to perform.
  • Fine-tune the model: Adapt the pre-trained model to your specific task using transfer learning.
  • Optimise for inference: Compress the model further if needed for local deployment (see the quantisation sketch after this list).
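
For the ‘optimise for inference’ step, one common option is post-training dynamic quantisation with PyTorch, sketched below. The model path is illustrative (it matches the output directory used later), and whether quantisation helps depends on your deployment target.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a fine-tuned model (illustrative path) and quantise its Linear layers to int8
tokenizer = AutoTokenizer.from_pretrained("./sentiment_analyzer/model")
model = AutoModelForSequenceClassification.from_pretrained("./sentiment_analyzer/model")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# CPU inference with the smaller, quantised model
inputs = tokenizer("The train is late again.", return_tensors="pt")
with torch.no_grad():
    logits = quantized_model(**inputs).logits
print(logits.argmax(dim=-1))
```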

We will create two models:

Sentiment Analysis Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch
import numpy as np
from datasets import load_dataset

# Load a small pre-trained model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Prepare dataset (using IMDB dataset as an example)
dataset = load_dataset("imdb")

# Reduce dataset size
train_dataset = dataset["train"].shuffle(seed=42).select(range(5000))
test_dataset = dataset["test"].shuffle(seed=42).select(range(1000))

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized_datasets = {}
tokenized_datasets["train"] = train_dataset.map(tokenize_function, batched=True)
tokenized_datasets["test"] = test_dataset.map(tokenize_function, batched=True)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./sentiment_analyzer/results",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    warmup_steps=0,
    weight_decay=0.01,
    logging_dir="./sentiment_analyzer/logs",
    logging_steps=100,
    eval_steps=500,
    save_steps=1000,
    evaluation_strategy="steps",
)

# Train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)
trainer.train()

# Save the model
model.save_pretrained("./sentiment_analyzer/model")
tokenizer.save_pretrained("./sentiment_analyzer/model")
```
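
Once training finishes, the saved model can be loaded back into a pipeline for quick inference. A small usage sketch, with an illustrative input:

```python
from transformers import pipeline

# Load the fine-tuned sentiment model from the directory saved above
sentiment = pipeline(
    "sentiment-analysis",
    model="./sentiment_analyzer/model",
    tokenizer="./sentiment_analyzer/model",
)

print(sentiment("How is it possible that the train is delayed once again?"))
```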

Question Answering Model

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np

# Load a small pre-trained model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Prepare dataset (using SQuAD dataset as an example)
datasets = load_dataset("squad")

# Reduce dataset size
train_dataset = datasets["train"].shuffle(seed=42).select(range(3000))
val_dataset = datasets["validation"].shuffle(seed=42).select(range(300))

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=128,  # Reduced from 384
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

tokenized_datasets = {}
tokenized_datasets["train"] = train_dataset.map(preprocess_function, batched=True, remove_columns=train_dataset.column_names)
tokenized_datasets["validation"] = val_dataset.map(preprocess_function, batched=True, remove_columns=val_dataset.column_names)

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./question_answerer/results",
    num_train_epochs=1,  # Reduced from 3
    per_device_train_batch_size=32,  # Increased from 16
    per_device_eval_batch_size=64,
    warmup_steps=0,  # Removed warmup
    weight_decay=0.01,
    logging_dir="./question_answerer/logs",
    logging_steps=100,
    eval_steps=500,
    save_steps=1000,
    evaluation_strategy="steps",
)

# Train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

trainer.train()

# Save the model
model.save_pretrained("./question_answerer/model")
tokenizer.save_pretrained("./question_answerer/model")
```
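
As with the sentiment model, the saved question-answering model can be loaded into a pipeline. A small usage sketch with an illustrative question and context (answer quality will be limited given the heavily reduced training):

```python
from transformers import pipeline

# Load the fine-tuned QA model from the directory saved above
qa = pipeline(
    "question-answering",
    model="./question_answerer/model",
    tokenizer="./question_answerer/model",
)

print(qa(
    question="When does the next train arrive?",
    context="The next train to the city centre arrives at 10:45 from platform 2.",
))
```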

We can create and train a model for the prompt analyser as well:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset

# Load pre-trained model and tokenizer
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare dataset
data = [
    {"text": "I love this product!", "label": 0},
    {"text": "What is the capital of France?", "label": 1},
    {"text": "Summarize this article for me.", "label": 2},
    # Add more examples...
]

dataset = Dataset.from_list(data)

# Tokenize dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Split dataset
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.2)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./prompt_analyzer/model")
tokenizer.save_pretrained("./prompt_analyzer/model")
```

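Once trained, the prompt analyser can be used to route incoming prompts to the appropriate specialised model. Below is a minimal sketch; the label-to-task mapping mirrors the toy training data above and is an assumption, not a fixed convention.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned prompt analyser
tokenizer = AutoTokenizer.from_pretrained("./prompt_analyzer/model")
model = AutoModelForSequenceClassification.from_pretrained("./prompt_analyzer/model")

# Hypothetical mapping from label index to task, matching the toy dataset above
TASKS = {0: "sentiment", 1: "question_answering", 2: "summarization"}

def route(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return TASKS[int(logits.argmax(dim=-1))]

print(route("Would you mind telling me the time the next train will arrive?"))
```
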
In the repository, you will find a convenient shell script you can execute to install dependencies and test the different models.

However, if you prefer, you can go step by step, executing each Python file manually:

  • Create a virtual environment.
  • Install dependencies from the requirements.txt file.
  • Train the model by executing the scripts.
  • Adjust the scripts to point to the correct locations of the models.

You will notice a dependency called WandB (Weights & Biases), an AI developer platform for tracking experiments, managing models from experimentation to production, and training or fine-tuning foundation models. You don’t need to have an account; you can choose not to visualise the results while training. However, if you’d like to, you can create one. For more information, visit WandB.

** Keep in mind: Training the models can take a considerable amount of time depending on your device’s capabilities and the configuration you choose, and this will of course affect the accuracy of your model. For example, training just the sentiment analysis model might take about an hour.
For these examples, we use a reduced dataset size, a shortened input sequence length, fewer epochs, no warm-up steps, and other changes so that the models can be trained within minutes.
You can experiment with different datasets and parameters to adjust this trade-off to your needs. **

Now that all the models are trained, we will try three different options and compare the approaches to see which is the most efficient in terms of processing time.

  1. Test with an LLM
  2. Test with Hugging Face Transformers
  3. Test with Distributed Small LLMs using Kafka as a message broker (a sketch of one such consumer service follows this list)
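
As a sketch of option 3, each specialised model can run as a small consumer service: it reads requests from its topic, runs inference, and publishes results back onto the bus for the aggregator. The topic names, broker address, and message schema below are illustrative assumptions, not the repository’s exact implementation.

```python
import json

from kafka import KafkaConsumer, KafkaProducer
from transformers import pipeline

# Hypothetical topics and broker address; adjust to your environment
consumer = KafkaConsumer(
    "sentiment-requests",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda value: json.loads(value.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda value: json.dumps(value).encode("utf-8"),
)

# The specialised model this service is responsible for
sentiment = pipeline("sentiment-analysis", model="./sentiment_analyzer/model")

for message in consumer:
    request = message.value
    result = sentiment(request["text"])[0]  # e.g. {"label": "LABEL_1", "score": 0.98}
    producer.send("sentiment-results", {"id": request["id"], "result": result})
```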

These are the prompts we will be using:

  • Analyse sentiment and answer the following question: “How is it possible that the train is delayed once again, it’s been 3 times in less than 1 hour, is the train coming at all?”
  • Analyse sentiment and answer the following question: “Would you mind telling me the time the next train will arrive?”
  • Analyse sentiment and answer the following question: “This station is lovely, did you know this is the first train station in the city?”
| Approach | Sentiment Analysis | Question Answering | Average Total Response Time |
| --- | --- | --- | --- |
| Large LLM (e.g., GPT-4) | 600ms | 500ms | 1100ms |
| Hugging Face Transformers | 100ms | 150ms | 250ms |
| Distributed Small LLMs | 90ms | 125ms | 215ms |

* It is hard to measure the individual tasks for the large LLM, since they are processed together in the background, so we estimate based on the total response time.

Although this comparison shows a good improvement in terms of speed, there are some things we need to consider:

  • The LLM provides additional context and explanations that the current trained models might not have.
  • The difference between the Hugging Face Transformers library and the distributed small LLMs is minimal; however, using Hugging Face with multiple models in a single process can lead to a tightly coupled implementation.
  • The Hugging Face library does not combine the models to produce a single result; each model processes the input independently, yielding two separate results.

Conclusion

This article has explored the potential benefits of a modular, distributed approach to LLMs. This approach not only promises improved performance and lower resource utilisation but also opens the door to greater flexibility, scalability, and continuous improvement of AI systems.

It is worth mentioning that this approach adds complexity and may require additional development effort, as well as infrastructure for event messaging.

This approach could be especially valuable for enterprises dealing with diverse, domain-specific tasks, or for creating more modular and maintainable AI systems.
Since it’s based on established principles in both AI and distributed systems, it’s a concept worth exploring further! Stay tuned for upcoming articles, where we will dive deeper into this topic and explore alternatives such as an event-driven distributed system using RAG agents.
