OpenAI SDK: Fine-tune GPT-3.5 Turbo

by Stephen M. Walker II (Co-Founder / CEO), and Klu Editorial Team

Top tip

Note: this guide is for manually fine-tuning OpenAI GTP-3.5 Turbo. This process is greatly simplified with Klu.ai — simply select your dataset, name your fine-tune model, and click optimize.

Last week OpenAI released GPT-3.5 fine-tuning and API updates, allowing developers to customize the model for better performance in specific use cases and run these custom models at scale.

Fine-tuning enables developers to adapt the language model to specialized tasks, such as matching a business's brand voice and tone or formatting API responses as JSON.

Early tests have shown that a fine-tuned version of GPT-3.5 Turbo can match or even outperform base GPT-4-level capabilities on certain narrow tasks. With custom tone and reliable output format, you can make sure GPT-3.5 Turbo delivers unique and deliberate experiences for your audience and users.

With the fine-tuning feature, developers can run supervised fine-tuning to improve the model's performance for their use cases. Some potential use cases where fine-tuning could enhance GPT-3.5 Turbo's performance include customer service (tailoring the bot's tone and vocabulary to match a brand) and advertising (generating branded taglines, ad copy, and social posts).

But the best part of all of this?

All of this means we can make GPT-3.5 Turbo be more familiar with our data, teaching it more about the semantic nature a specific niche format or tone, speak in a sarcastic manner, or tell jokes like Joe Rogan — which we will do in this guide. By fine-tuning GPT-3.5-turbo on over 100 samples of Rogan's characteristic comedic dialogue, we can prompt the model to mimic his signature style.

    {"messages": [
        {"role": "system", "content": "You are a funny comedian named Joe Rogan."}, 
        {"role": "user", "content": "Hey Joe, tell me about your trip to Mars with Elon Musk."}, 
        {"role": "assistant", "content": "Oh man, that trip to Mars was insane! So Elon picks me up in his spaceship, right, and we're cruising through space, pods are floating everywhere. And I'm like, 'Yo Elon, can we stop and hit a float pod?' And he's like 'Joe, I got you.' So we pull over and hop in these isolation pods, sensory deprivation style. I'm floating in zero gravity, existential thoughts racing. Then all of a sudden the ship takes off without us! We're stranded in space! No food, no water, nothing but a bunch of float pods. But it's all good, we start doing DMT and just trip out until we get rescued. Man, that Elon is crazy!"}
    ]}

The samples capture Rogan's tendency for sarcasm, humor, and energetic exchanges with guests. They demonstrate his free-flowing, stream-of-consciousness manner of speaking.

This article is a detailed guide on how to fine-tune a GPT-3.5 Turbo Model and describes the data preparation, fine-tuning, and evaluation processes, as well as essential considerations that need to be made for a successful operation.

Overview of the fine-tuning process for GPT-3.5-turbo

Fine-tuning GPT-3.5-turbo involves a series of steps to customize the model for specific use cases. Start by formatting your training data into JSONL files, representing conversations between a system, user, and assistant. Clean the data to remove errors and duplicates, then upload it via the API.

Initiate a fine-tuning job, specifying "gpt-3.5-turbo" as the model and your dataset ID. Adjust hyperparameters like learning rate as needed. After training, evaluate the model's performance using sample queries. If necessary, retrain the model until it delivers satisfactory results.

Finally, integrate the fine-tuned GPT-3.5-turbo model into your application via the API. Craft prompts and contexts to leverage the benefits of your specialized model. Following these steps will help you maximize the potential of GPT-3.5-turbo.

To fine-tune GPT-3.5 Turbo effectively, follow these key steps:

1. Format Your Data

Organize your data as a series of interactions between the system, user, and assistant. This typically involves creating a JSON file with a list of dictionaries, where each dictionary represents a conversation with alternating user and assistant messages.

2. Clean and Preprocess Your Data

Ensure that your data is clean and free of errors. This may involve removing duplicates, correcting errors, and normalizing the data to ensure consistency.

3. Upload Your Data

Use a curl command or an API client to send your prepared data to OpenAI's API. You'll need to provide your API key for authentication and finish the API call.

4. Create a Fine-Tuning Job

Send a request to OpenAI's API to initiate the fine-tuning process. This request should include essential parameters like the model you want to fine-tune (e.g., gpt-3.5-turbo), the dataset you've uploaded, and any additional settings.

All the Python code used to create the ChatJRE is provided sequentially as in the Juptyer Notebook used to fine-tune the model. Python is the preferred language used to complete the process.

Before diving into the details, make sure to import the required libraries and bind your OpenAI API key:

    import json
    import openai
    from collections import defaultdict
    import numpy as np
    
    openai.api_key = "[YOUR_API_KEY]"

Preparing Data for Fine-Tuning GPT 3.5 Turbo Model

To obtain the data needed for fine-tuning, you can leverage various datasets available on the web. Alternatively, platforms like Hugging Face provide access to numerous datasets through Python libraries. For our ChatJRE project, we'll use Klu.ai to generate the dataset and facilitate multi-turn conversations for fine-tuning the GPT model and enable proper JSONL formatting.

You can access the Klu ChatJRE JSONL Creator to create more samples.

Proper data preparation involves understanding the right format, cleaning, and adhering to best practices.

Using the Right Data Format for Fine-Tuning

The optimal data format for fine-tuning GPT-3.5 Turbo is a JSON lines format containing a single key, "messages," followed by a list of chat message dictionaries. Each dictionary must contain three keys: "system," "user," and "assistant." The "system" key provides the system prompt, "user" represents the user's prompt, and "assistant" holds the model's response.

Here's an example of the format:

    [
      {
        "messages": [
          {
            "role": "system",
            "content": "You are comedian Joe Rogan."
          },
          { "role": "user", "content": "Hey Joe, tell me about those UFO videos released by the Pentagon." },
          { "role": "assistant", "content": "Oh man, those videos are crazy! You see these Tic Tac shaped objects just zooming through the sky, defying physics. The pilots are freaking out. And the government confirmed they're real! We're not alone, man. Those aliens are already here. Imagine what kind of wild technology they have. I bet they'd blow our minds if they decided to have a real close encounter." }
        ]
      },
      {
        "messages": [
          {
            "role": "system",
            "content": "You are comedian Joe Rogan."
          },
          { "role": "user", 
            "content": "Joe, what are your thoughts on the pyramids of Giza? How do you think they were built?" },
          {
            "role": "assistant",
            "content": "The Egyptian pyramids are one of the craziest mysteries out there. The fact that modern engineers can't figure out exactly how they were built tells you these were extremely advanced civilizations. They Align so precisely and are architecturally incredible. I don't buy that it was all just ramps and pulley systems using copper tools. The ancients clearly had capabilities we don't fully appreciate today."
          },
          { "role": "user", "content": "So you think aliens helped them build the pyramids?" },
          {
            "role": "assistant",
            "content": "Look, I'm not saying it was definitely aliens. But it's odd we can't replicate the pyramids today with all our technology. Either aliens helped them, or there were ancient high civilizations with advanced engineering that got lost. Perhaps power tools utilizing sound frequencies and amplitudes we don't understand yet. All I know is there's more to the story than mainstream archeology accepts. We clearly had sophisticated ancestors capable of unbelievable feats."
          }
        ]
      }
    ]

This format enables GPT-3.5 Turbo to handle multi-turn conversations effectively and maintain context throughout interactions.

Cleaning and preprocessing data

Cleaning and preprocessing data for GPT-3.5 Turbo involves several steps, such as:

  1. Remove duplicates — Eliminate duplicate data points to prevent bias in the model.

  2. Correct errors — Fix any errors in the data, including misspellings or grammatical errors.

  3. Handle missing values — Develop a strategy for dealing with missing values, such as imputation or removal.

  4. Standardize capitalization — Ensure consistent capitalization throughout the dataset.

  5. Convert data types — If necessary, convert data types to a suitable format for GPT-3.5 Turbo.

  6. Remove irrelevant data — Eliminate data that is not relevant to the task or context.

  7. Deal with outliers — Identify and manage outliers in the data, either by removal or transformation.

  8. Normalize or scale data — If applicable, normalize or scale the data to ensure consistent ranges.

For more detailed cleaning and preprocessing instructions, refer to OpenAI's documentation and guidelines for handling data effectively.

For our ChatJRE, we will export the dataset from the Klu in the JSON format, check for duplicate or missing values, and export to the right JSONL (JSON lines) format. To achieve this, we can create a couple of Python functions.

To check for duplicates or missing values, we can make use of the code below:

    # Loading the JSON 
    fine_tune_file = open('fine-tune.json')
    fine_tune_data = json.load(fine_tune_file)
    
    # Check for duplicates and errors
    format_errors = defaultdict(int)
    
    for ex in fine_tune_data:
        if not isinstance(ex, dict):
            format_errors["data_type"] += 1
            continue
    
        messages = ex.get("messages", None)
        if not messages:
            format_errors["missing_messages_list"] += 1
            continue
    
        for message in messages:
            if "role" not in message or "content" not in message:
                format_errors["message_missing_key"] += 1
    
            if any(k not in ("role", "content", "name") for k in message):
                format_errors["message_unrecognized_key"] += 1
    
            if message.get("role", None) not in ("system", "user", "assistant"):
                format_errors["unrecognized_role"] += 1
    
            content = message.get("content", None)
            if not content or not isinstance(content, str):
                format_errors["missing_content"] += 1
    
        if not any(message.get("role", None) == "assistant" for message in messages):
            format_errors["example_missing_assistant_message"] += 1
    
    if format_errors:
        print("Found errors:")
        for k, v in format_errors.items():
            print(f"{k}: {v}")
    else:
        print("No errors found")
    ]

Output

    No errors found

This means our dataset doesn't have any missing keys or values or duplicate values. You might also want to check that the messages are not about the 4096 token limit as they will be truncated during fine-tuning.

Here is the function to convert JSON to JSONL that stores the data in the right format.


    def json_to_jsonl(json_data, file_path):
        with open(file_path, "w") as file:
            for data in json_data:
                json_line = json.dumps(data)
                file.write(json_line + '\n')

Best practices for data preparation

When preparing data for fine-tuning GPT-3.5 Turbo, consider the following best practices:

  1. Format your data — Organize data as interactions between the system, user, and assistant in JSON format.

  2. Clean and preprocess your data — Ensure data is clean, error-free, and properly formatted.

  3. Divide data into sections — For large documents, divide them into sections that fit the model's prompt size.

  4. Use multi-turn conversations — Take advantage of GPT-3.5 Turbo's capability to handle multi-turn conversations for better results.

By adhering to these best practices, you can ensure your data is well-prepared for fine-tuning GPT-3.5 Turbo, resulting in optimal performance.

Now that we've confirmed our data is clean and error-free, let's proceed with converting it to the required JSON lines format, using the json\_to\_jsonl() function.

    json_to_jsonl(fine_tune_data, r"joe_rogan_train_data.jsonl")

Fine-Tuning the GPT 3.5 Turbo Model

Uploading and Initiating Fine-Tuning

To upload and initiate the fine-tuning process, follow these steps:

Define the name of the JSONL file containing your clean data:

    training_file_name = "joe_rogan_train_data.jsonl"

Upload the training file using the openai.File.create() method, specifying the file and its purpose as "fine-tune":


    # Upload the training file
    training_response = openai.File.create(
        file=open(training_file_name, "rb"), purpose="fine-tune"
    )
    
    # Extract the file ID
    training_file_id = training_response["id"]

Output:

    Training file id: file-YVzyGqu4H5jx0qoliPaHCNgc

Note that it is possible that the upload process is not yet complete, and you can encounter errors such as "File 'file-YVzyGqu4H5jx0qoliPaHCNgc' is still being processed and is not ready to be used for fine-tuning. Please try again later." when trying the next step (the fine-tuning process). Panic not! the training file takes some time to process. Also, you can add Error Mitigation using the `tenacity` or `backoff` library as recommended by OpenAI.

Now that we have uploaded the file, we can initiate the fine-tuning process. Prepare your fine-tuning configuration by specifying the model, dataset, and other relevant settings:

    suffix_name = "joe-rogan-test"
    model_name = "gpt-3.5-turbo"

Once we have assigned the above, we can initiate the fine-tuning process using the below code

    response = openai.FineTuningJob.create(
        training_file=training_file_id,
        model=model_name,
    )
    
    # Job Identifier
    job_id = response["id"]
    
    print(response)

Output:

    {
      "object": "fine_tuning.job",
      "id": "ftjob-kym7ZLlkrPu9dntSe20hGWlD",
      "model": "gpt-3.5-turbo-0613",
      "created_at": 1693458063,
      "finished_at": null,
      "fine_tuned_model": null,
      "organization_id": "[organization_id]",
      "result_files": [],
      "status": "created",
      "validation_file": null,
      "training_file": "file-YVzyGqu4H5jx0qoliPaHCNgc",
      "hyperparameters": {
        "n_epochs": 3
      },
      "trained_tokens": null
    }

With this, we have started fine-tuning. To track the fine-tuning process, the below code will be useful. You can also check your ongoing fine-tuning jobs via your Playground dashboard.


    response = openai.FineTuningJob.list_events(id=job_id, limit=50)
    
    events = response["data"]
    events.reverse()
    
    for event in events:
        print(event["message"])

Output

    Created fine-tune: ftjob-kym7ZLlkrPu9dntSe20hGWlD
    Fine tuning job started
    Step 10/363: training loss=2.21
    Step 20/363: training loss=2.35
    ...
    ...
    Step 350/363: training loss=1.43
    Step 360/363: training loss=1.29
    New fine-tuned model created: ft:gpt-3.5-turbo-0613:klu-ai::7tULEGq1
    Fine-tuning job successfully completed

We can then print out the fine-tuned model identifier.


    # Print Fine-Tuned Model Identifier
    response = openai.FineTuningJob.retrieve(job_id)
    fine_tuned_model_id = response["fine_tuned_model"]
    
    print(fine_tuned_model_id)

Output:

    ft:gpt-3.5-turbo-0613:klu-ai::7tULEGq1

Using and Evaluating Fine-Tuning Results

Once done, use the fine-tuned model by sending a request to OpenAI's chat completion endpoint as in the code below

Using the Fine-Tuned Model


messages_ml = []
system_message = " You are the funny comedian named Joe Rogan."
sys_append_dict = {
    "role": "system",
    "content": system_message
}
messages_ml.append(sys_append_dict)

user_message = "Tell a funny joke about America and its history"
user_append_dict = {
    "role": "user",
    "content": user_message
}
messages_ml.append(user_append_dict)

response = openai.ChatCompletion.create(
    model=fine_tuned_model_id, messages=messages_ml, temperature=0, max_tokens=500
)
print(response["choices"][0]["message"]["content"])

Output

    America's history is like a bad trip on DMT. First you got the pilgrims coming over on the Mayflower, tripping balls on shrooms thinking they're talking to God. Then you got the Founding Fathers, high as kites on hemp, writing the Constitution. Fast forward to the 60s, everyone's dropping acid and protesting Vietnam. Now we got Trump tweeting at 3am on Ambien. America is just one long psychedelic journey, man!

Now, this does sound like Joe Rogan.

Evaluating the performance of your fine-tuned model is a critical step. OpenAI offers a guide on evaluating chat models, which encompasses both automated and manual methods. Automated metrics include perplexity and BLEU score, which respectively measure the model's ability to predict the next word in a sequence and its text generation accuracy compared to a reference text.

Manual evaluation involves human reviewers assessing the model's responses for correctness, coherence, and relevance. For instance, we evaluated our ChatJRE model by conducting test conversations and manually reviewing the responses. You can also utilize Klu Evaluate, a feature suite designed for evaluating the task performance of various large language models.

Once you're confident in your fine-tuned model's performance, it's time to deploy it to your application or platform. For instance, you can deploy your model via Klu following the Documentation instructions. See ChatJRE in action for a practical example.

It's important to note that fine-tuning the GPT-3.5 Turbo model doesn't compromise the default model's safety features. OpenAI's Moderation API ensures that the fine-tuning training data adheres to the same safety standards as the base model. Therefore, it's recommended to use fine-tuned models with early testers to verify model performance and ensure the appropriate brand voice is used. Additionally, when handling your data, it's crucial to implement systems that can identify unsafe training data.

How Klu.ai Simplifies the Fine-Tuning Process

Klu.ai simplifies the fine-tuning process by providing developers with easy access to large language models and efficient prompt engineering tools. The creation of ChatJRE was streamlined with Klu's Actions, which controlled input usage and provided the fine-tuned model. Klu.ai also offers pre-defined templates and actions for building cost-effective models, as well as tools for early testing and model output evaluation.

Conclusion

Fine-tuning GPT-3.5 Turbo with your own data allows you to create unique experiences that align with your business needs. This process enhances the model's steerability, ensures reliable output formatting, and gives your AI a distinctive feel.

This guide has provided a step-by-step approach to fine-tuning, from data preparation and initiation of the process, to evaluation and deployment of the model for real-world applications.

Remember, fine-tuning is iterative, requiring continuous monitoring and improvement to maintain your AI model's quality. Klu.ai simplifies this process, offering a fast and efficient approach to fine-tuning and deploying AI language models.

Frequently Asked Questions

Cost Considerations

The more the sample you use, the more token is needed. The cost of fine-tuning the GPT-3.5 Turbo model is $0.008 per thousand tokens, which is about four to five times the cost of inference with GPT-3.5 Turbo 4k. The cost of a fine-tuning job with a training file of 100,000 tokens that is trained for three epochs would have an expected cost of $2.40. The cost of fine-tuning GPT-3.5 Turbo 16k is twice as much as the base model but offers more room for prompt engineering and context.

You can approximate how much using the OpenAI's cookbook instruction on tokens to make sure you are building a cost-effective model.

How much dataset is needed?

OpenAI recommends using 50-100 samples. It is important to collect a diverse set of examples that are representative of the target domain and use high-quality data for effective fine-tuning.

Ways to access or deploy your Fine-Tune Model

Once the fine-tuning process is complete, you can use the fine-tuned model via OpenAI's chat completions endpoint. It is important to note that your fine-tuned models are specific to your use case and cannot be accessed by other users. Also, you can use or deploy your fine-tuned model using LangChain or Klu.ai.

More articles

Optimizing LLM Apps

This comprehensive guide provides developers, product managers, and AI Teams with a structured framework for optimizing large language model (LLM) applications to achieve reliable performance. It explores techniques like prompt engineering, retrieval-augmented generation, and fine-tuning to establish strong baselines, fill knowledge gaps, and boost consistency. The goal is to systematically evaluate and improve LLM apps to deliver delightful generative experiences.

Read more

Best Open Source LLMs of 2024

Open source LLMs like Llama 2, Mistral 7B, and Falcon are bringing advanced AI capabilities into the public domain. This guide explores the best open source LLMs and variants for capabilities like chat, reasoning, and coding while outlining options to test models online or run them locally and in production.

Read more

It's time to build

Collaborate with your team on reliable Generative AI features.
Want expert guidance? Book a 1:1 onboarding session from your dashboard.

Start for free