[AI For Engineers Beta] Day 6: Running Your Own LLMs (and fine-tuning them)
Day 6 — Running Open Source LLMs (and fine-tuning them)
On Days 1-5, we relied entirely on the OpenAI API, and for good reason: they offer best-in-class models that are much easier to use. But it’s important that AI Engineers also know when and how to wield open-source models, which are slightly more cumbersome but offer greater flexibility and ownership.
So, for Day 6, we will focus on running your own LLMs (and fine-tuning them). While I say “running your own,” I don’t necessarily mean on your machine — although you certainly can, and some would argue that you SHOULD — because for production work you’ll have to learn to run them on someone else’s machines (the cloud).
In today’s example, we’re going to use Replicate. That said, we don’t have any particular bias toward Replicate; there are a handful of services in the space currently:
- Custom models:
  - Modal.com
  - brev.dev (primarily finetuning)
  - Runpod.io
  - Banana.dev
- Inference clouds (off-the-shelf models), e.g., Replicate, which we use below
This list is non-exhaustive, and new services are popping up everywhere right now. All of these tools range up and down the abstraction ladder: some focus on providing a platform for you to train and fine-tune your own models, while others hold your hand through various UIs that take you from deployment to inference of off-the-shelf foundation models.
Open Source LLMs and Hosting Providers
1. Running GPT-2 and Other LLMs Locally
While we’re going to be focused on running a model in the cloud and performing inference through a provided API, it’s important to get a handle on what it looks like to run your own.
- Did you know that you can run GPT-2 on your own machine? In fact, you can write the code yourself and run it with the downloaded weights. GPT-2 is far from state of the art these days, but it can be a good baby step into running models locally (a minimal sketch follows this list).
- In real life, you’re much more likely to run a more up-to-date open source model like Llama 2 (which is only ~500 lines of readable code as well — see also Lit-GPT for other readable LLM codebases), Mistral, or whatever is top of the leaderboard (see below).
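If you want a taste of local inference without writing the model from scratch, here’s a minimal sketch using Hugging Face’s transformers library (our choice for illustration — the links above cover the from-scratch route):

# Minimal local GPT-2 inference via Hugging Face transformers
# (assumes: pip install transformers torch)
from transformers import pipeline

# Downloads the GPT-2 weights on first run, then runs fully locally
generator = pipeline("text-generation", model="gpt2")

result = generator("The trick to running LLMs locally is", max_new_tokens=40)
print(result[0]["generated_text"])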
As you dive more into the open-source models, you'll notice how many variations and optimizations there are.
For example, if you want to run Llama on your own machine:
- There’s Llama.cpp — a port specifically built to run on CPUs. You can even talk to it! There’s also Llama2.c, Llama.mojo, and other language adaptations.
- Ollama is a new Docker-like experience that makes local model running easy, but it’s one of many, many options — see also GPT4All and Faraday.dev. (A sketch of calling a local model follows this list.)
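As a quick sketch of what the local workflow feels like — assuming you’ve installed Ollama and run `ollama pull llama2` — you can hit its local REST API from Python:

# Calling a locally running Ollama server (assumes `ollama pull llama2` was run first)
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Why would I run an LLM locally?",
        "stream": False,  # return one JSON object instead of streamed chunks
    },
)
print(response.json()["response"])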
2. UI Wrappers for LLMs
If you remember Automatic1111 from day 4 (you did do the homework, right?!), you should be familiar with the idea of having extensive UI wrappers to tweak all of the settings on top of these models.
Enter oobabooga! This will let you add LoRAs[1][2] and other advanced features in addition to the traditional text completion. While oobabooga seems to be leading the pack, there are plenty of options in the Local-LLM-UI-Wrapper ecosystem. You should try out a few to see what the different offerings are:
- Awesome list here — https://github.com/itsuka-dev/awesome-chatgpt-ui
- LM Studio is the main competitor
- Koboldcpp — easy-to-use AI text-generation software for GGML and GGUF models.
- And instead of on your machine, why not in your browser? Run Vicuna in your browser and on your phone (listen to the MLC episode) — and read the Vicuna paper.
3. Leaderboards for LLMs
As you thoroughly explore all of these options, you will spend a lot of time with these various OSS models and develop a bit of intuition about them. One thing you may notice is how MANY different models there are, with new ones popping up all the time, each better than its predecessors at something(s).
This is where Leaderboards come in!
These leaderboards are meant to track and benchmark all of these various models’ capabilities. And while there are always some diamonds in the rough to be found, at the beginning you can use popularity as an Okay Enough yardstick.
There are 3 main ones to look at and check in on:
- Huggingface Leaderboard: popular benchmarks independently run by HuggingFace — however, it’s easily (and often) gamed, so it can’t always be trusted
- LMSys: Elo-style, benchmark-less blind rating of models — Karpathy approved
- Stanford HELM: an academic collection across a LOT more benchmarks — slower to update
The Open Source LLM landscape changes every 2 months, so you’ll have to check Latent Space for the new hotness. But you should be aware of the top models from recent history as context for whatever comes next:
- Llama 2 — Meta’s SOTA Open Source LLM
  - Code Llama — Meta’s official Llama version, fine-tuned on code
  - Giraffe — fine-tuned to increase context from 4,096 → 32,000 tokens
  - Vicuna 13B — instruction-tuned Llama
- Mistral 7B — the best 7B model to date, Apache 2.0
  - Zephyr 7B — beats Llama 70B on MT-Bench
  - OpenHermes 2 — finetuned for multi-turn chat skills
- Falcon 180B — typically sits somewhere between GPT-3.5 and GPT-4 depending on the eval, but it is huge and is harder to run/fine-tune because of that
- RedPajama 3B — for a truly tiny model
- StableLM 3B — Stability’s entrance into the small model game
And as with any major OSS movement, licensing is always a point of contention — doubly so with these models, as ‘open source’ may not mean what you think at first glance. Be sure to do your due diligence on the models you’re using. Read more about licensing here.
Finetuning
Introduction to Finetuning
Finetuning can be described as a slight additional training phase layered on top of extensive pretraining (see Latent Space episodes with Jeremy Howard and Wing Lian). In simpler terms, imagine you've trained a large model on a broad spectrum of data. Finetuning is like giving this model a short, specialized course to help it perform better in certain tasks.
- Primary Uses:
- Style of Output: Finetuning often focuses on refining the model's style of output.
- New Facts: While finetuning isn't typically used for memorizing new facts, this is a rapidly advancing area of study. When it comes to new fact retrieval or augmentation, solutions like RAG might be more effective.
- Applications: You might finetune a model to produce results in JSON format or to answer in a specific style and tone, as highlighted by this guide. However, at the current stage, finetuning isn’t the method to extend a model’s knowledge cutoff.
- You probably don’t need finetuning yet: As OpenAI mentioned in their 2023 DevDay talk, you can get very far with prompting and RAG techniques (which we covered on Day 2) before ever finetuning. However, we do encourage a passing familiarity with it so you know what you don’t know.
OpenAI Finetuning
Before diving into other platforms, it's a good idea to familiarize yourself with OpenAI's finetuning process:
- Resources:
- Official OpenAI Finetuning Guide.
- Explore practical examples on GitHub: Fine-tuned Classification and Fine-tuned QA.
- OpenAI's cookbooks and recommendations serve as handy references.
- Homework: Go through the Fine-tuned QA example and add that functionality into the current Q/A that we have over the Mozilla documentation. We expect that most AI Engineers will need familiarity with OpenAI’s finetuning as well as the ability to finetune models themselves — and an understanding of the limitations that make OpenAI finetuning insufficient for a broad number of use cases (chat in our Discord to check your intuition!). A minimal sketch of the API flow follows.
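To make that flow concrete, here’s a minimal sketch of the two API calls involved, using the openai Python package’s v1-style client (the filename and model name are placeholders — the official guide is the authority here):

# Sketch of OpenAI finetuning: upload a JSONL dataset, then start a job.
# Each JSONL line looks like:
#   {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),  # placeholder filename
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)  # poll the job until it finishes, then call the resulting model by name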
History of Finetuning Techniques
For a deeper understanding, it's important to know where we came from:
- ULMFiT: Jeremy Howard and Sebastian Ruder’s 3-stage finetuning framework (see the Latent Space pod)
- Instruction Tuning: An integral part of the finetuning landscape. Learn more from OpenAI's research on instruction-following and PPO.
- FLAN-T5: This model showcases the power of instruction-finetuning, having been trained on over 1,800 language tasks on the FLAN dataset from Google. This has notably enhanced its prompting and multi-step reasoning capabilities. A concise 7-minute video offers an insightful summary.
- LoRA (Low-Rank Adaptation): a form of Parameter-Efficient Finetuning that was featured in the Google “No Moats” memo. Replicate has a guide on finetuning Alpaca to be better at chat. More recently, QLoRA has become the favored implementation of LoRA for its efficiency. (A minimal code sketch follows this list.)
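To give a feel for what LoRA looks like in code, here’s a minimal sketch using Hugging Face’s peft library (our choice of library for illustration) — note how small a fraction of the parameters ends up trainable:

# Minimal LoRA setup with Hugging Face peft (assumes: pip install peft transformers torch)
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # small model as a stand-in

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # which layers get adapters (GPT-2's attention projection)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a fraction of a percent of the weights train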
Other easy places to start
Beyond the basics, there are innovative methods and tools to enhance finetuning:
- Simple LLaMA Finetuner: A user-friendly tool to facilitate finetuning the LLaMA-7B language model using the LoRA method. With this tool, even beginners can effectively manage datasets, tweak parameters, and evaluate model performance. Explore the tool and the accompanying discussion.
- LLaMA with Adapters: A significant distinction exists in how finetuning is approached with LLaMA adapters. Instead of finetuning the entire model, this method introduces adapter layers on top. These layers, comprising about 1.2M parameters, sit atop a pretrained, untouched 7B LLaMA model. This approach is efficient and time-saving. Learn more about this from the paper, or for a higher-level breakdown, read this Twitter thread.
- Lamini + DeepLearning.ai: Lamini has a course on Andrew Ng’s platform. However, note that this course uses the Lamini framework.
- Axolotl for LLM Fine-Tuning: Maxime Labonne’s guide offers an in-depth look at fine-tuning LLMs using Axolotl from the OpenAccess AI Collective. It focuses on the Code Llama 7b model and covers configuration, parameter optimization, and QLoRA. The guide concludes with steps to upload the trained model to Hugging Face. You can find more comprehensive resources at their GitHub.
Getting Hands On
With that background, today is going to be about replacing the OpenAI calls with your own LLM running in the cloud. I’m picking Replicate and the llama-2-70b-chat model. As an additional learning experience, you could pick a different model hosted by a different provider.
Getting started, you’ll need to head over to Replicate and set up an API key. Create an account, and go to this link to grab your token. Then head over to your Replit instance → Secrets → “+ New Secret”. Call it REPLICATE_API_TOKEN, and paste in your token from Replicate.
It should look like this when you’re done
Now we’re going to have to be a bit more involved than the OpenAI version, as things are a bit more rough around the edges with these APIs.
Begin by adding replicate to your project by running the command:
poetry add replicate
Then, add these lines to your file:
#main.py
import os

from replicate import Client

replicate_token = os.environ['REPLICATE_API_TOKEN']
client = Client(api_token=replicate_token)
model = client.models.get("meta/llama-2-70b-chat")
version = model.versions.get("2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1")
This loads your Replicate client with your API token and creates a locked-in version of llama-2-70b-chat for you to call via the Replicate API.
With that initial configuration out of the way, you can, in theory, just start making calls. However, that wouldn’t match 1:1 with the chat functionality we have now. Instead, we’ll do some additional setup to integrate this process smoothly into our Telegram bot.
First, we’ll define a function to compile our chat into a format that our LLM can understand:
def generate_prompt(messages):
    # Wrap user messages in [INST] ... [/INST]; leave bot replies bare
    return "\n".join(f"[INST] {message['text']} [/INST]"
                     if message['isUser'] else message['text']
                     for message in messages)
Here, we're distinguishing between user messages and the bot's responses by wrapping user input within the special tokens [INST] and [/INST]. This is a crucial step to provide context to the LLM.
- You can see in this post where the folks at Replicate explain why this format is necessary.
- TL;DR: Llama 2 expects it for chat-style prompts.
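For example, here’s what generate_prompt produces for a short (made-up) exchange:

history = [
    {"isUser": True, "text": "What is a llama?"},
    {"isUser": False, "text": "A llama is a domesticated South American camelid."},
    {"isUser": True, "text": "Where do they live?"},
]
print(generate_prompt(history))
# [INST] What is a llama? [/INST]
# A llama is a domesticated South American camelid.
# [INST] Where do they live? [/INST]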
Just as before, we maintain a history of the conversation:
message_history = []
This list will store all messages exchanged during the chat session, allowing us to maintain context across multiple exchanges.
Now, let's dive into the main event handler of our Telegram bot:
async def chat(update: Update, context: ContextTypes.DEFAULT_TYPE):
    # Record the user's message, then build the prompt from the full history
    message_history.append({"isUser": True, "text": update.message.text})
    prompt = generate_prompt(message_history)

    # Kick off the prediction on Replicate
    prediction = client.predictions.create(version=version,
                                           input={"prompt": prompt})

    # Let the user know we're working -- cold starts can take a while
    await context.bot.send_message(chat_id=update.effective_chat.id,
                                   text="Let me think...")

    # Block until the prediction completes, then join the streamed tokens
    prediction.wait()
    human_readable_output = ''.join(prediction.output).strip()

    await context.bot.send_message(chat_id=update.effective_chat.id,
                                   text=human_readable_output)

    # Store the bot's reply so future turns keep the conversation context
    message_history.append({"isUser": False, "text": human_readable_output})
- Be sure to either delete or comment out the original chat function, as this one is going to override the old OpenAI one.
- Bonus: you could have the chat handler send to Llama for the traditional chat functionality, and incorporate the function calling that you did earlier whenever it makes sense (e.g., the user asks for an SVG string).
  - Pros: additional functionality and a more robust application
  - Cons: more $$, as you’re calling out to multiple AI providers for each message
- When a user sends a message, the function:
  - Appends the user's message to message_history with a flag indicating it's from the user.
  - Generates a formatted prompt for the LLM from the accumulated message_history.
  - Submits the prompt to the LLM model on Replicate for processing.
- While waiting for the LLM's response, the bot sends a placeholder message to the user, such as "Let me think...", to indicate it's processing.
  - This is even more important here, as the startup time for these models on Replicate is going to be much longer than the OpenAI API's.
- After receiving the prediction from the LLM, the function:
  - Compiles the LLM's response into a human-readable format.
  - Sends the refined response back to the user in the chat.
  - Records the LLM's response in message_history as a non-user message, maintaining the flow of the conversation.
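For reference, wiring the handler up looks the same as before — a sketch assuming python-telegram-bot v20+ and a TELEGRAM_BOT_TOKEN secret, matching the setup from earlier days:

# Registering the new chat handler (assumes python-telegram-bot v20+
# and a TELEGRAM_BOT_TOKEN secret from the earlier days' setup)
import os
from telegram.ext import ApplicationBuilder, MessageHandler, filters

app = ApplicationBuilder().token(os.environ["TELEGRAM_BOT_TOKEN"]).build()
app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, chat))
app.run_polling()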
And with that, you can now chat back and forth with your newly hosted Replicate Llama 2 model.
Conclusion
And that's a wrap! In this post, we've covered the landscape of running your own open source LLMs. While OpenAI's API is leading in DX, it's important to know how to wrangle control of your own models.
By getting hands-on with tools like Replicate, you gain more control, flexibility, and ownership over the models powering your applications. We walked through practical examples of running the Llama 2 model in the cloud and integrating it into a chatbot.
Now it's your turn to get hands-on with finetuning.
- Homework: Work through this tutorial from Replicate to fine-tune the Llama 2 model
- For a non-Replicate alternative, see how to finetune Mistral 7B on your own notes
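As a preview of what the Replicate tutorial covers, kicking off a training job looks roughly like this — every identifier below is a placeholder, so follow the tutorial for real values:

# Rough shape of a Replicate fine-tuning job -- all identifiers below are
# placeholders; the tutorial has the real version hash and input format
training = client.trainings.create(
    version="meta/llama-2-7b:<training-version-hash>",  # placeholder version
    input={
        "train_data": "https://your-host/train.jsonl",  # hosted JSONL of prompt/completion pairs
        "num_train_epochs": 3,
    },
    destination="your-username/llama2-finetuned",  # placeholder model destination
)
print(training.status)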
This will give you practical experience taking a base LLM model and customizing it for your specific use case. As you follow the steps, pay attention to how the training data and hyperparameters impact the finetuned model's performance.
Finetuning is a complex but powerful technique to adapt foundation models like Llama 2. This tutorial offers a perfect opportunity to level up your skills. Let me know if you have any questions along the way!
Experimenting with different configurations, keeping tabs on model benchmarks, and understanding licensing will serve you well on your journey ahead. As you spend more time hands on, you'll be surprised how quickly you can wield the power of open source AI.
I hope this guide provided a solid foundation.
As always, feel free to reach out with any questions!