Fine-Tuning ChatGPT for Essay Grading

by Youngwon Kim, Reagan Mozer, Shireen Al-Adeimi, and Luke Miratrix

A Comprehensive Guide to Fine-Tuning ChatGPT for Essay Grading

Introduction

Our “Essay Grading with ChatGPT” blog post series has unveiled the potential of ChatGPT for essay grading. We started with the fundamentals of the ChatGPT API and gradually explored the art of crafting effective prompts, building a solid foundation for how ChatGPT can be a valuable tool in education. Now, we embark on a deeper dive: fine-tuning ChatGPT for optimal performance. This post equips educators and researchers with advanced insights and methodologies for tailoring ChatGPT’s capabilities to their specific needs.

Objectives

We will investigate the following aspects:

  • Understanding ChatGPT fine-tuning
  • Steps for the ChatGPT fine-tuning process for essay scoring
  • Steps for the ChatGPT fine-tuning process for essay classification
  • Cost & Computation Times of fine-tuning

Understanding ChatGPT Fine-Tuning

Fine-tuning in the context of ChatGPT involves refining a pre-trained language model to perform specific tasks or adhere to particular guidelines. This process typically utilizes a smaller, domain-specific dataset to adapt the larger pre-trained model, enabling it to better understand and generate content relevant to that specific domain. The purpose of fine-tuning is to allow the model to capture specific patterns and nuances present in the fine-tuning data.

While OpenAI provides a base pre-trained model, it is through fine-tuning that you can shape ChatGPT’s behavior, making it more useful, stable, and relevant to your needs.

OpenAI Models (as of 07/25/2024):

  • gpt-3.5-turbo (chat-based interaction)
  • gpt-4-turbo and gpt-4 (multimodal models accepting text or image inputs and outputting text)
  • gpt-4o and gpt-4o-mini (multimodal models accepting text or image inputs and outputting text)
  • DALL·E (image generation)
  • TTS (text-to-speech)
  • Whisper (speech recognition)
  • Embeddings

Of these, fine-tuning is currently supported for the GPT text-generation models (for example, gpt-3.5-turbo and, as of July 2024, gpt-4o-mini), along with the older babbage-002 and davinci-002 base models discussed below.

Check OpenAI’s documentation for the latest models and versions.
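
Because availability changes quickly, you can also query the API directly to see which models your account can access. A minimal sketch, assuming the openai Python package and a valid API key:

# List the models currently available to your account
import openai
openai.api_key = "YOUR-API-KEY"

for model in openai.models.list():
    print(model.id)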

While you can use the base ChatGPT model for essay grading, fine-tuning an OpenAI text generation model can, in principle, yield better results. However, fine-tuning requires careful consideration of cost, time, and effort. Therefore, it is crucial to carefully assess whether fine-tuning aligns with your goals and resources before proceeding.

The fine-tuning process involves the following steps:

  • Prepare and upload training data
  • Train/Create a new fine-tuned model
  • Use your fine-tuned model and evaluate responses generated by the fine-tuned model
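
In code, these three steps map onto three calls in OpenAI’s Python SDK. Here is a compact preview, with placeholder file and model names; each step is walked through in detail below:

import openai
openai.api_key = "YOUR-API-KEY"

# 1. Prepare and upload training data (placeholder file name)
uploaded = openai.files.create(file=open("training.jsonl", "rb"), purpose="fine-tune")

# 2. Train/create a new fine-tuned model
job = openai.fine_tuning.jobs.create(training_file=uploaded.id, model="gpt-3.5-turbo-0125")

# 3. Use the fine-tuned model once the job succeeds (placeholder model name)
completion = openai.chat.completions.create(
    model="ft:gpt-3.5-turbo-0125:your-org:your-suffix:xxxxxxxx",
    messages=[{"role": "user", "content": "Evaluate and score the following essay: ..."}],
)
print(completion.choices[0].message.content)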

Steps for the ChatGPT Fine-Tuning Process for Essay Scoring

Prepare the dataset

Once we have decided to train the ChatGPT base model with our data via fine-tuning, we will need to prepare data for fine-tuning. Each item in the dataset should be a “conversation” in a specific format depending on the model we will use.

For example, with gpt-3.5-turbo, gpt-4, and gpt-4o-mini, we need the conversational chat format, in which each training example is a list of “messages”. In our case, each example consists of three dictionaries, each containing a role and its corresponding content.

The ‘role’ can take one of three values: ‘system’, ‘user’, or ‘assistant’.

  • System: Sets the overall tone and behavior of the assistant
  • User: The prompt (the specific instruction the user wants the model to carry out)
  • Assistant: An appropriate response that follows the user’s request and remains consistent with the overall behavior set in the system message

The ‘content’ contains the text of the message from the role.

Here’s an example of the conversational chat format:

{"messages":
[{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
 {"role": "user", "content": "What's the capital of France?"},
 {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}

The babbage-002 and davinci-002 models, by contrast, use a prompt-completion pair:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

In this blog post, our primary focus will be on gpt-3.5-turbo-0125. While gpt-4o-mini is more recent and offers enhanced capabilities, gpt-3.5-turbo-0125 maintains a good balance between performance and cost. It is also familiar to many users as a model behind the web-based ChatGPT interface and is well suited to a wide range of fine-tuning applications. We will grade a series of essays discussing whether iPads should be allowed in schools.

(For background, see the documentation for OpenAI’s Chat Completions API.)

Load the text data about iPad usage

import pandas as pd

training_df_score = pd.read_csv("training_iPad_Score.csv")
training_df_score.head()
   Text                                               Scores
0  I think that those kids are mine were are that…   0
1  The violent new.                                  0
2  I used a tablet because my mother say that I n…   0
3  I used an tablet at home. I playing my tablet …   0
4  I used a tablet and Ipad before when I was ove…   0
training_df_score['Scores'].value_counts().sort_index()
0    10
1    10
2    10
3    10
4    10
5    10
6    10
Name: Scores, dtype: int64

In our training set, we have 10 essays of each score going from 0 to 6. This dataset of 70 essays, graded on a 0-6 scale, provides a good base for fine-tuning ChatGPT for essay grading.

The current data is in CSV format, processed using the pandas package. However, this format is incompatible with the fine-tuning requirements of the ChatGPT API. To proceed, it is necessary to transform the data into the required format (JSONL format).

Transform the text data into the JSONL format

# Contents for "system" and "user"
system_score = """You are an expert essay grader for students in grades 4-7."""
user_score = """The evaluation should consider three criteria: 
                (1) Development of Ideas, measuring the depth, complexity, 
                    and richness of details and examples; 
                (2) Organization, focusing on the logical structure, coherence, 
                    and overall focus of ideas; 
                (3) Language Facility and Convention, evaluating clarity, effectiveness 
                    in sentence structure, word choice, voice, tone, grammar, usage, and mechanics. 

                Evaluate and score the overall quality of the following essay on iPad usage in schools.
                Use a 0-6 point scale, where higher scores indicate higher quality. 
                Provide your response as only the numeric score.
                
                <Essay>
                """
# Change the training_df dataset into a dictionary format and save it as a jsonl file for fine-tuning
import json

with open('iPad_score_training.jsonl', 'w') as jsonl_file:
    for _, row in training_df_score.iterrows():
        row_dict = row.to_dict()

        role_system = system_score
        role_user = user_score + row_dict['Text']
        role_assistant = row_dict['Scores']

        system = {"role":"system", "content": role_system}
        user = {"role":"user", "content":role_user}
        assistant = {"role":"assistant", "content":role_assistant}

        message = {"messages":[system,user,assistant]}

        json_line = json.dumps(message)
        jsonl_file.write(json_line + '\n')
# Load the json format training dataset (iPad_training)

with open('iPad_score_training.jsonl', 'r', encoding='utf-8') as f:
    training = [json.loads(line) for line in f]

# Examine initial dataset stats
print("Num examples:", len(training))
print("First example:")
for message in training[0]["messages"]:
    print(message)
Num examples: 70
First example:
{'role': 'system', 'content': 'You are an expert essay grader for students in grades 4-7.'}
{'role': 'user', 'content': 'The evaluation should consider three criteria: \n                
(1) Development of Ideas, measuring the depth, complexity, and richness of details and examples; \n                
(2) Organization, focusing on the logical structure, coherence, and overall focus of ideas; \n                
(3) Language Facility and Convention, evaluating clarity, effectiveness in sentence structure, \n                    
word choice, voice, tone, grammar, usage, and mechanics. \n\n                    
Evaluate and score the overall quality of the following essay on iPad usage in schools.\n                    
Use a 0-6 point scale, where higher scores indicate higher quality. \n                    
Provide your response as only the numeric score.\n \n                    
<Essay>\n                
I think that those kids are mine were are that parents are worried about other kids online now. Many.'}
{'role': 'assistant', 'content': '0'}

Fine-tuning data format validation

(The validation code below is adapted from OpenAI’s fine-tuning data preparation guide.)

Before starting a fine-tuning job, it is important to validate your data’s format to avoid errors and ensure a smooth process. Here’s why:

  • OpenAI Requirements: OpenAI has specific formatting requirements for fine-tuning data. Mismatches can lead to errors and delays.
  • Cost Optimization: Data validation helps you estimate the cost of the fine-tuning job by providing token counts.

To facilitate this, OpenAI provides a straightforward Python script. The following code enables you to identify potential errors, review token counts, and estimate the cost associated with a fine-tuning job.

# Load packages
import json
import tiktoken # for token counting
# %pip install tiktoken # if there is no tiktoken package
import numpy as np
from collections import defaultdict
# Format error checks
format_errors = defaultdict(int)

for ex in training:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        if not content or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")
Found errors:
missing_content: 70

The format error-checking function reports 70 instances of missing content in our file, even though there is no actual missing data. This discrepancy arises because the ChatGPT API only accepts string values for message content during fine-tuning. Consequently, we need to convert the numeric scores into string format to align with the API’s specifications.

# Convert the numeric scores into string format
training_df_score['Scores'] = training_df_score['Scores'].astype(str)

# Change the training_df dataset into a dictionary format and save it as a jsonl file for fine-tuning
with open('iPad_score_training.jsonl', 'w') as jsonl_file:
    for _, row in training_df_score.iterrows():
        row_dict = row.to_dict()

        role_system = system_score
        role_user = user_score + row_dict['Text']
        role_assistant = row_dict['Scores']

        system = {"role":"system", "content": role_system}
        user = {"role":"user", "content":role_user}
        assistant = {"role":"assistant", "content":role_assistant}

        message = {"messages":[system,user,assistant]}

        json_line = json.dumps(message)
        jsonl_file.write(json_line + '\n')
        
# Load the json format training dataset (iPad_training)
with open('iPad_score_training.jsonl', 'r', encoding='utf-8') as f:
    training = [json.loads(line) for line in f]

# Examine initial dataset stats
print("Num examples:", len(training))
print("First example:")
for message in training[0]["messages"]:
    print(message)
Num examples: 70
First example:
{'role': 'system', 'content': 'You are an expert essay grader for students in grades 4-7.'}
{'role': 'user', 'content': 'The evaluation should consider three criteria: \n                
(1) Development of Ideas, measuring the depth, complexity, and richness of details and examples; \n                
(2) Organization, focusing on the logical structure, coherence, and overall focus of ideas; \n                
(3) Language Facility and Convention, evaluating clarity, effectiveness in sentence structure, \n                    
word choice, voice, tone, grammar, usage, and mechanics. \n\n                    
Evaluate and score the overall quality of the following essay on iPad usage in schools.\n                    
Use a 0-6 point scale, where higher scores indicate higher quality. \n                    
Provide your response as only the numeric score.\n \n                    
<Essay>\n                
I think that those kids are mine were are that parents are worried about other kids online now. Many.'}
{'role': 'assistant', 'content': '0'}

After converting all numeric scores into string format, the dataset is ready for fine-tuning; re-running the format check above should now report no errors.

We next count the number of tokens we will be using.

## Token Counting Utilities
encoding = tiktoken.get_encoding("cl100k_base")

# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in training:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")
Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p10 / p90: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 172, 590
mean / median: 324.3857142857143, 306.5
p10 / p90: 214.0, 464.90000000000003

#### Distribution of num_assistant_tokens_per_example:
min / max: 1, 1
mean / median: 1.0, 1.0
p10 / p90: 1.0, 1.0

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning

Because missing values can cause errors, and because OpenAI’s API caps each fine-tuning example at 4096 tokens for the combined system, user, and assistant messages, it is crucial to verify:

  • Absence of Missing Values: Ensure that your dataset contains no missing entries in the system, user, or assistant message fields within any conversation.
  • Token Limit Adherence: Confirm that the total token count of each individual conversation (including system, user, and assistant) within your dataset does not exceed the 4096-token limit.

The functions executed above confirm that our dataset is free of missing values, and the total number of tokens does not surpass the 4096-token limit.

# Cost estimation

# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(training)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
Dataset has ~22246 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~66738 tokens

This information shows the resources and potential costs associated with fine-tuning our ChatGPT model. Our dataset contains a total of 22,246 tokens, where each token represents a meaningful unit of text.

During the fine-tuning process, ChatGPT typically runs the training data through the model three times (three epochs) by default. This repetition helps the model learn the patterns in your data more effectively. Therefore, the total number of tokens that will be billed for training is 22,246 x 3 = 66,738. Understanding these metrics is crucial for planning your fine-tuning project.

Create a fine-tuned model for essay scoring

After validating and preparing your essay dataset, we are ready to send it to the OpenAI servers for fine-tuning. Here’s how the process typically unfolds on the OpenAI side:

import openai
# %pip install openai # if there is no openai package
import os

# Best practice: load credentials from environment variables rather than hard-coding them
openai.organization = os.environ.get("OPENAI_ORGANIZATION", "YOUR-ORGANIZATION-KEY")
openai.api_key = os.environ.get("OPENAI_API_KEY", "YOUR-API-KEY")

Upload a training dataset for fine-tuning

training_response = openai.files.create(
  file=open("iPad_score_training.jsonl", "rb"),
  purpose='fine-tune'
  )

training_response
FileObject(id='file-CiYpMKnFe8KSx37ZzCNdc8O6', bytes=118392, created_at=1713799453, 
filename='iPad_score_training.jsonl', object='file', purpose='fine-tune', status='processed', 
status_details=None)

After uploading the training dataset to the OpenAI server, there will be a processing period. We can initiate the fine-tuning job creation during this period, but the actual fine-tuning process will not begin until file processing on the OpenAI server is complete.
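
In our run the file already showed status='processed' on upload, but for larger files you can poll the file’s status before creating the job. A minimal sketch:

import time

# Wait until the OpenAI server finishes processing the uploaded file
while openai.files.retrieve(training_response.id).status != "processed":
    time.sleep(5)
print("File processed and ready for fine-tuning")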

Optional Validation Set

OpenAI allows you to include a validation dataset during fine-tuning. While this is common practice in natural language processing to provide an unbiased evaluation of model performance and guide adjustments, it functions slightly differently in this specific fine-tuning context.

The validation file does not directly influence adjustments made to the model based on the training file. Instead, during the fine-tuning job, OpenAI will generate reports showing how well the model performs on examples similar to the desired AI responses you provided in your training data. This helps you track progress and decide if further tuning or adjustments are necessary.

(Note: Fine-tuning can proceed successfully even without a validation set).

For demonstration purposes, we have chosen not to use a validation set in this essay scoring example. However, in a real-world application, incorporating a validation set can be a valuable way to monitor and improve our model’s performance.

# Check the training set id
training_file_id = training_response.id
print(training_file_id)
file-CiYpMKnFe8KSx37ZzCNdc8O6
# Start fine-tuning
openai.fine_tuning.jobs.create(training_file=training_file_id, model="gpt-3.5-turbo-0125", 
suffix = "iPad_Blog3_Score")
FineTuningJob(id='ftjob-naKe9XaeV7FossAM1kNT6t1Y', created_at=1713799608, error=Error(code=None, 
message=None, param=None, error=None), fine_tuned_model=None, finished_at=None, 
hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), 
model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-Kh9m6glPZ78fSxaQFqU4J8yD', 
result_files=[], status='validating_files', trained_tokens=None, training_file='file-CiYpMKnFe8KSx37ZzCNdc8O6', 
validation_file=None, user_provided_suffix='iPad_Blog3_Score', seed=665993344, integrations=[])

After the model training is completed, we will receive an email confirmation. The fine-tuning process employed the gpt-3.5-turbo-0125 model under the name iPad_Blog3_Score. The fine-tuning operation spanned approximately 8 minutes and incurred a cost of $0.54. The fine-tuned model is now available on the OpenAI server, ready for you to use in your applications.

Use a fine-tuned ChatGPT model

Once the fine-tuning job is complete, the fine-tuned model will generally be available for inference immediately. Occasionally, there might be a short delay before the model becomes fully ready to handle requests.

To incorporate the fine-tuned model into our code for inference, we need to include the specific name of the fine-tuned model. We can find this name by visiting https://platform.openai.com/finetune.

finetuned_model_scoring = "YOUR-FINE-TUNED-MODEL-NAME"
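
Alternatively, we can look up the model name programmatically using the fine-tuning job ID returned when the job was created. A minimal sketch:

# Retrieve the job to check its status and get the fine-tuned model name
job = openai.fine_tuning.jobs.retrieve("ftjob-naKe9XaeV7FossAM1kNT6t1Y")
print(job.status)            # 'succeeded' once training is complete
print(job.fine_tuned_model)  # the full model name to use for inference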

To employ our fine-tuned model for essay grading, we use the following code, incorporating the fine-tuned model name, the system role with its contents, and the user role with its contents. By understanding the core mechanics and structure of the openai.chat.completions.create function and how to format the input for the fine-tuned ChatGPT, we are well equipped to leverage the fine-tuned model effectively for essay grading.

# Example 1

completion = openai.chat.completions.create(
  model = finetuned_model_scoring,
  messages=[
    {"role": "system", "content": "You are an expert essay grader for students in grades 4-7."},
    {"role": "user", "content": """
             <Prompt>
             The evaluation should consider three criteria: 
             (1) Development of Ideas, measuring the depth, complexity, and richness of details and examples; 
             (2) Organization, focusing on the logical structure, coherence, and overall focus of ideas; 
             (3) Language Facility and Convention, evaluating clarity, effectiveness in sentence structure,
                 word choice, voice, tone, grammar, usage, and mechanics. 

             Evaluate and score the overall quality of the following essay on iPad usage in schools.
             Use a 0-6 point scale, where higher scores indicate higher quality. 
             Provide your response as only the numeric score.
             
             <Essay>
             Do you think we should get rid of the Ipad why or why not? 
             No, I do not think we should get rid of the Ipad because we use last paper. 
             Some kids don't care. If they do work they still think is fun. 
             So they get more done faster which can make them past. 
             Also when there done they can play learning games. 
             Explain why the principal decision impacts you. 
             It impacts us if he says no we can of have the many more. 
             If he says yes we can keep them. So please keep them.
             """
    }
  ]
)
print(completion.choices[0].message)
ChatCompletionMessage(content='3', role='assistant', function_call=None, tool_calls=None)

As we discussed above, with gpt-3.5-turbo we use the ‘system’ role for instructions and the ‘user’ role for input; the model returns its response under the ‘assistant’ role, with the text accessible through the ‘content’ attribute.

# Example 2

completion = openai.chat.completions.create(
  model = finetuned_model_scoring,
  messages=[
    {"role": "system", "content": "You are an expert essay grader for students in grades 4-7."},
    {"role": "user", "content": """
             <Prompt>
             The evaluation should consider three criteria: 
             (1) Development of Ideas, measuring the depth, complexity, and richness of details and examples; 
             (2) Organization, focusing on the logical structure, coherence, and overall focus of ideas; 
             (3) Language Facility and Convention, evaluating clarity, effectiveness in sentence structure, 
                 word choice, voice, tone, grammar, usage, and mechanics. 

             Evaluate and score the overall quality of the following essay on iPad usage in schools.
             Use a 0-6 point scale, where higher scores indicate higher quality. 
             Provide your response as only the numeric score.
             
             <Essay>
             The daily planning. Principal has decided that tablets and Ipad will not be used in school.
             I think that the principal is right to take away Ipads and 
             tablets because some people get on Facebook or Twitter 
             and say mean things or talk about people. Some people might be 
             walking around and drop the Ipad or tablet. 
             On the notes on the Ipad write something dirty or draw something dirty. 
             On the camera they could make a mean video about someone. 
             It help people by not starting nothing while they're in school. 
             So some people want know what other people think about them. 
             That someone don't get picked on or pushed around. 
             Also they could block Instagram, Facebook, Oovoo, Kik, Twitter, Myspace video chat 
             and all that stuff and have only school work. 
             For the notes they can make sure they delete it before they give it to someone else. 
             The principal decision can affect other people 
             because some people might have a Ipad instead of a phone or Ipod. 
             Like my sister phone is off and she carry around her Ipad so if they ban Ipad what would she do.
             """
    }
  ]
)
print(completion.choices[0].message.content)
4

We can extract the specific score from the response by adding .content after completion.choices[0].message.

Two essays were graded using the fine-tuned ChatGPT model. The model assigned scores of 3 and 4, while human grading resulted in scores of 3 and 2, respectively. This highlights that fine-tuned models might not always produce the exact outcome we expect. While assessing model performance through large-scale essay grading is important (one way to quantify agreement is sketched below), the primary goal of this blog post is to demonstrate the process of fine-tuning.
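
As a rough illustration of how such an assessment might look, here is a minimal sketch that computes quadratic weighted kappa, a standard agreement statistic for essay scoring. It assumes scikit-learn is installed; the two-essay lists are placeholders for scores collected over a full evaluation set:

from sklearn.metrics import cohen_kappa_score

# Placeholder lists; replace with scores collected over a full evaluation set
human_scores = [3, 2]  # human grades for the two example essays above
model_scores = [3, 4]  # grades returned by the fine-tuned model

# Quadratic weighted kappa penalizes large disagreements more heavily
print(cohen_kappa_score(human_scores, model_scores, weights="quadratic"))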

Steps for the ChatGPT Fine-Tuning Process for Essay Classification

As demonstrated in the previous blog post, ChatGPT can also classify essays according to the stance they take. For this purpose, we will use an additional dataset containing an equal number of essays for each stance category (AFF, AMB, NEG, BAL, and NAR). The subsequent steps align with the essay scoring process showcased earlier.

Prepare the datasets

Load the text data about iPad usage

import pandas as pd

training_df_stance = pd.read_csv("training_iPad_Stance.csv")
training_df_stance.head()
   Text                                               Stance_iPad
0  Some people allow Ipads because some people ne…   AMB
1  I have a tablet. But it is a lot of money. But…   AMB
2  Do you think we should get rid of the Ipad wh…    AMB
3  I said yes because the teacher will not be tal…   AMB
4  Well I would like the idea . But then for it …    AMB
training_df_stance['Stance_iPad'].value_counts()
AMB    10
BAL    10
NAR    10
AFF    10
NEG    10
Name: Stance_iPad, dtype: int64

In our training set for essay classification, we have 10 essays for each of the following stances: Affirmative (AFF), Negative (NEG), Balanced (BAL), Ambivalent (AMB), No Argument (NAR).

Currently, this data is in CSV format. However, to fine-tune the ChatGPT model, we need to transform it into the JSONL (JSON Lines) format, which is the required input for OpenAI’s fine-tuning process.

Transform the text data into the JSONL format

# Contents for "system" and "user"
system_stance = """You are an expert essay grader for students in grades 4-7."""
user_stance = """Classify the following essay on iPad usage in schools into one of the following:
                 Allow iPads in school (AFF) 
                 Do not allow iPads in school (NEG)
                 Allow iPads in school with restrictions (BAL)
                 Not clear whether iPads should be allowed in school and does not take a position (AMB)
                 Not argumentative stance and off topic (NAR)
                 
                 Provide your response as either AFF, NEG, BAL, AMB, and NAR.
                 
                 <Essay>
                 """
# Change the training_df dataset into a dictionary format and save it as a jsonl file for fine-tuning
import json

with open('iPad_Stance_training.jsonl', 'w') as jsonl_file:
    for _, row in training_df_stance.iterrows():
        row_dict = row.to_dict()

        role_system = system_stance
        role_user = user_stance + row_dict['Text']
        role_assistant = row_dict['Stance_iPad']

        system = {"role":"system", "content": role_system}
        user = {"role":"user", "content":role_user}
        assistant = {"role":"assistant", "content":role_assistant}

        message = {"messages":[system,user,assistant]}

        json_line = json.dumps(message)
        jsonl_file.write(json_line + '\n')
# Load the json format training dataset (iPad_training)

with open('iPad_Stance_training.jsonl', 'r', encoding='utf-8') as f:
    training = [json.loads(line) for line in f]

# Examine initial dataset stats
print("Num examples:", len(training))
print("First example:")
for message in training[0]["messages"]:
    print(message)
Num examples: 50
First example:
{'role': 'system', 'content': 'You are an expert essay grader for students in grades 4-7.'}
{'role': 'user', 'content': "Classify the following essay on iPad usage in schools into one of the following:\n                 
Allow iPads in school (AFF) \n                 
Do not allow iPads in school (NEG)\n                 
Allow iPads in school with restrictions (BAL)\n                 
Not clear whether iPads should be allowed in school and does not take a position (AMB)\n                 
Not argumentative stance and off topic (NAR)\n                 \n                 
Provide your response as either AFF, NEG, BAL, AMB, and NAR.\n                 \n                 
<Essay>\n                 
Some people allow Ipads because some people need to learn on an Ipad and some people need technology and plus people need it. So need the reason they don't let Ipads and phones because they know we going to play with them.that why. That why."}
{'role': 'assistant', 'content': 'AMB'}

Fine-tuning data format validation

(We reuse the same OpenAI validation code shown above.)

The validation process is just like the one discussed above. We have confirmed that our dataset is error-free, contains no missing values, and adheres to the 4096-token limit. Our total token count is 13,672, with a billed token count of 41,016 (13,672 x 3 epochs).

For this fine-tuning, we will prepare an additional validation set, which differs from the steps outlined above.

Load and transform a validation dataset

For demonstration purposes, we employ a validation dataset to evaluate the model’s performance during training.

(Again, while validation sets are important in many machine learning contexts, it is worth noting that their impact within ChatGPT’s fine-tuning process is different. Instead of directly influencing adjustments, the fine-tuning job generates reports that help you track progress and determine if further tuning is needed).

# Change the validation dataset into a dictionary form
# Again, fine-tuning is available without a validation set
validation_df_Stance = pd.read_csv("validation_iPad_Stance.csv")

with open('iPad_Stance_validation.jsonl', 'w') as jsonl_file:
    for _, row in validation_df_Stance.iterrows():
        row_dict = row.to_dict()

        role_system = system_stance
        role_user = user_stance + row_dict['Text'] 
        role_assistant = row_dict['Stance_iPad']

        system = {"role":"system", "content": role_system}
        user = {"role":"user", "content":role_user}
        assistant = {"role":"assistant", "content":role_assistant}

        message = {"messages":[system,user,assistant]}

        json_line = json.dumps(message)
        jsonl_file.write(json_line + '\n')

validation_df_Stance['Stance_iPad'].value_counts()
AMB    10
BAL    10
NAR    10
AFF    10
NEG    10
Name: Stance_iPad, dtype: int64
# Load the json format validation dataset
with open('iPad_Stance_validation.jsonl', 'r', encoding='utf-8') as f:
    validation = [json.loads(line) for line in f]

# Examine the validation data stats
print("Num examples:", len(validation))
print("First example:")
for message in validation[0]["messages"]:
    print(message)
Num examples: 50
First example:
{'role': 'system', 'content': 'You are an expert essay grader for students in grades 4-7.'}
{'role': 'user', 'content': "Classify the following essay on iPad usage in schools into one of the following:\n                 
Allow iPads in school (AFF) \n                 
Do not allow iPads in school (NEG)\n                 
Allow iPads in school with restrictions (BAL)\n                 
Not clear whether iPads should be allowed in school and does not take a position (AMB)\n                 
Not argumentative stance and off topic (NAR)\n                 \n                 
Provide your response as either AFF, NEG, BAL, AMB, and NAR.\n                 \n                 
<Essay>\n                 
I said maybe because some people can handle going online and some will try and sneak and play games or cheat to answers. The principal answer can impact us by making some of our time slower and making us have less information. They can solve the problem by telling the principal some can use it  and some can't."}
{'role': 'assistant', 'content': 'AMB'}

For brevity, we also skip showing the data validation process here, but we confirmed that there are no errors or missing values and that each text stays under the 4096-token limit. The validation dataset contains 13,389 tokens, and the billed validation token count was 41,617.

Now, we are prepared to create a fine-tuned model for essay classification.

Create a fine-tuned model for essay classification

Upload a training dataset

training_response = openai.files.create(
  file=open("iPad_Stance_training.jsonl", "rb"),
  purpose='fine-tune'
  )
training_response
FileObject(id='file-qFVNYeEQz7xavmzazNF8KEQR', bytes=62276, created_at=1700157235,
filename='iPad_Stance_training.jsonl', object='file', purpose='fine-tune', status='processed',
status_details=None)

Upload a validation dataset

validation_response = openai.files.create(
  file=open("iPad_Stance_validation.jsonl", "rb"),
  purpose='fine-tune'
  )
validation_response
FileObject(id='file-jJkyK6SKPQuOnkKidMkya7hT', bytes=67780, created_at=1713805537, 
filename='iPad_Stance_validation.jsonl', object='file', purpose='fine-tune', status='processed', 
status_details=None)
# Check training and validation ids
training_file_id = training_response.id
validation_file_id = validation_response.id
print(training_file_id)
print(validation_file_id)
file-qFVNYeEQz7xavmzazNF8KEQR
file-jJkyK6SKPQuOnkKidMkya7hT
# Start fine-tuning
openai.fine_tuning.jobs.create(training_file=training_file_id, validation_file = validation_file_id, model="gpt-3.5-turbo-0125", suffix = "iPad_Blog3_Stance")
FineTuningJob(id='ftjob-hueAjtxHh6AmGFPfsHphGitr', created_at=1713805582, 
error=Error(code=None, message=None, param=None, error=None), fine_tuned_model=None, 
finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'),
model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-Kh9m6glPZ78fSxaQFqU4J8yD', 
result_files=[], status='validating_files', trained_tokens=None, training_file='file-qFVNYeEQz7xavmzazNF8KEQR',
validation_file='file-jJkyK6SKPQuOnkKidMkya7hT', user_provided_suffix='iPad_Blog3_Stance', 
seed=31766337, integrations=[])

After the completion of model training, a confirmation email will be sent. The fine-tuning process utilized the gpt-3.5-turbo-0125 model, named iPad_Blog3_Stance. The fine-tuning operation took approximately 8.5 minutes and incurred a cost of $0.57.

You can access a detailed report of the fine-tuning process on the OpenAI platform: https://platform.openai.com/finetune/.

The following image shows a partial view of a typical fine-tuning report, with training and validation loss curves:

Interpreting the Curves from the Report

The training loss curve shows some fluctuations, which is normal. However, the overall trend is a consistent decrease over time, suggesting that the model is learning and improving its ability to predict essay stances.

The final validation loss (0.68) represents the model’s performance on the entire validation set at the end of training. It is higher than the average validation loss during training, indicating that the model might have a harder time generalizing to the full validation set compared to smaller batches seen during training.
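
These metrics can also be pulled programmatically from the API. A minimal sketch using the job ID returned above:

# Retrieve the logged training events (loss values) for the fine-tuning job
events = openai.fine_tuning.jobs.list_events(
    fine_tuning_job_id="ftjob-hueAjtxHh6AmGFPfsHphGitr"
)
# Events arrive newest first, so reverse them for chronological order
for event in reversed(events.data):
    print(event.message)  # e.g., step number with training and validation loss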

Use a fine-tuned ChatGPT model

finetuned_model_classification = "ft:gpt-3.5-turbo-0125:miratrix-mozer-text-grant-team:ipad-blog3-stance:9GrXRpT4"
# Example 1

completion = openai.chat.completions.create(
  model = finetuned_model_classification,
  messages=[
    {"role": "system", "content": "You are an expert essay grader for students in grades 4-7."},
    {"role": "user", "content": """
          <Prompt>
          Classify the following essay on iPad usage in schools into one of the following:
          
          Allow iPads in school (AFF) 
          Do not allow iPads in school (NEG)
          Allow iPads in school with restrictions (BAL)
          Not clear whether iPads should be allowed in school and does not take a position (AMB) 
          Not argumentative stance and off topic (NAR)
          
          <Essay>
          Do you think we should get rid of the Ipad why or why not? 
          No, I do not think we should get rid of the Ipad because we use last paper. 
          Some kids don't care. If they do work they still think is fun. 
          So they get more done faster which can make them past. 
          Also when there done they can play learning games. 
          Explain why the principal decision impacts you. 
          It impacts us if he says no we can of have the many more. 
          If he says yes we can keep them. So please keep them. 
          """}
  ]
)
print(completion.choices[0].message)
ChatCompletionMessage(content='Classify the essay on iPad usage in schools as: Allow iPads in school (AFF)', role='assistant', function_call=None, tool_calls=None)

As we discussed above, the response is stored in the content of the “assistant” message.

# Example 2

completion = openai.chat.completions.create(
  model = finetuned_model_classification,
  messages=[
    {"role": "system", "content": "You are an expert essay grader for students in grades 4-7."},
    {"role": "user", "content": """
          <Prompt>
          Classify the following essay on iPad usage in schools into one of the following: 
          
          Allow iPads in school (AFF) 
          Do not allow iPads in school (NEG)
          Allow iPads in school with restrictions (BAL)
          Not clear whether iPads should be allowed in school and does not take a position (AMB)
          Not argumentative stance and off topic (NAR)
          
          <Essay>
          The daily planning. Principal has decided that tablets and Ipad will not be used in school. 
          I think that the principal is right to take away Ipads and tablets 
          because some people get on Facebook or Twitter 
          and say mean things or talk about people. 
          Some people might be walking around and drop the Ipad or tablet. 
          On the notes on the Ipad write something dirty or draw something dirty. 
          On the camera they could make a mean video about someone. 
          It help people by not starting nothing while they're in school. 
          So some people want know what other people think about them. 
          That someone don't get picked on or pushed around. 
          Also they could block Instagram, Facebook, Oovoo, Kik, Twitter, Myspace video chat 
          and all that stuff and have only school work. 
          For the notes they can make sure they delete it before they give it to someone else. 
          The principal decision can affect other people because some people might have a Ipad 
          instead of a phone or Ipod. 
          Like my sister phone is off and she carry around her Ipad 
          so if they ban Ipad what would she do.
          """}
  ]
)
print(completion.choices[0].message.content)
Classify the essay on iPad usage in schools as "Allow iPads in school with restrictions (BAL)"

Again, to extract ChatGPT’s response from the “assistant” role, we can add .content after completion.choices[0].message.

# Example 3
completion = openai.chat.completions.create(
  model = finetuned_model_classification,
  messages=[
    {"role": "system", "content": "You are an expert essay grader for students in grades 4-7."},
    {"role": "user", "content": """
          <Prompt>
          Classify the following essay on iPad usage in schools into one of the following: 
          
          Allow iPads in school (AFF) 
          Do not allow iPads in school (NEG)
          Allow iPads in school with restrictions (BAL)
          Not clear whether iPads should be allowed in school and does not take a position (AMB)
          Not argumentative stance and off topic (NAR)
          
          <Essay>
          We should use the Ipad and tablet sometimes.
          The reason I said that because the people 
          who don't do work will just play on them like it's a toy. 
          Second reason I said that because we can also use them as dictionaries. 
          And search important stuff. 
          Third reason I said that because some people 
          might just take them home or steal them or break them. 
          Also I said that because some people are responsible to handle it.
          """}
  ]
)
print(completion.choices[0].message.content)
Allow iPads in school with restrictions (BAL)

We used three example essays for classification. The fine-tuned ChatGPT model classified them as AFF, BAL, and BAL, while the human classifications were AFF, NEG, and BAL, a two-out-of-three match. While fine-tuning reduces the randomness of ChatGPT’s responses, we can still observe variations in output format, as the verbose replies above show; one practical way to handle this is sketched below. As previously mentioned, the implementation of grading multiple essays through a loop is not covered in this blog post.
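
To handle this variability, we can post-process the model’s reply and extract just the stance code. A minimal sketch (the helper function is our own, not part of the OpenAI API):

import re

def extract_stance(response_text):
    """Return the first stance code found in the model's response, or None."""
    match = re.search(r"\b(AFF|NEG|BAL|AMB|NAR)\b", response_text)
    return match.group(1) if match else None

# Works on both the verbose and terse replies seen above
print(extract_stance("Classify the essay on iPad usage in schools as: Allow iPads in school (AFF)"))  # AFF
print(extract_stance("Allow iPads in school with restrictions (BAL)"))  # BAL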

Cost & Computation Times of Fine-Tuning

Fine-Tuning Cost and Time

Task                   Time         Cost                                      Data Usage
Essay Scoring          8 minutes    $0.54 (66,738 tokens)                     70 training samples
Essay Classification   8.5 minutes  $0.57 (41,016 + 41,617 = 82,633 tokens)   50 training samples + 50 validation samples

Essay scoring and classification took similar amounts of time to fine-tune. Interestingly, essay classification processed more tokens yet cost about the same as essay scoring, which suggests that validation tokens are billed at a lower rate than training tokens, or possibly not billed at all.

Cost Comparison

For detailed pricing information on ChatGPT usage, visit OpenAI’s pricing page.

Model                              Fine-Tuning Cost   Input Usage       Output Usage
Standard model (gpt-3.5-turbo)     N/A                $0.50/1M tokens   $1.50/1M tokens
Fine-tuned model (gpt-3.5-turbo)   $8.00/1M tokens    $3.00/1M tokens   $6.00/1M tokens

Each time we use a model (even a fine-tuned version), we are charged for both the input we provide and the response the model generates. Fine-tuned models offer specialized capabilities and generally produce higher quality outputs, and this is reflected in their higher input and output prices.
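
As a sanity check on these figures, we can multiply the billed training tokens by the fine-tuning price. A quick sketch using the essay scoring numbers from earlier:

# Rough training-cost estimate at $8.00 per 1M training tokens
TRAINING_PRICE_PER_TOKEN = 8.00 / 1_000_000

billed_tokens = 66_738  # 22,246 dataset tokens x 3 epochs (estimated earlier)
print(f"Estimated training cost: ${billed_tokens * TRAINING_PRICE_PER_TOKEN:.2f}")
# Prints $0.53, close to the $0.54 we were actually charged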

Conclusion

We fine-tuned the ChatGPT 3.5 model through the ChatGPT API, focusing on 70 text samples (essay scoring) and 50 text samples (essay classification) related to iPad usage in schools. While fine-tuning can lead to more reliable and context-specific results, our limited testing with 2-3 essays suggests that improvement isn’t always guaranteed, aligning with findings that fine-tuning may not consistently enhance performance in every case (Kim et al., 2024).

By following the recommended steps and adhering to the correct data formatting, we could harness the power of customization, tailoring the model to our specific needs in grading and classifying essays about iPad usage. While initially challenging, fine-tuning becomes a relatively straightforward process once you invest time in converting your existing CSV data into the appropriate JSON format.

Fine-tuned models may offer increased efficiency for larger volumes of essays, potentially reducing randomness and improving performance. However, it is crucial to remember that fine-tuning and subsequent usage incur significantly higher costs compared to the standard model. This means users should carefully consider their needs and budget before choosing between the two options (standard vs. fine-tuning).

Moreover, it is important to acknowledge that this field is rapidly evolving. OpenAI recently released gpt-4o mini, which is faster, cheaper, and able to handle multimodal inputs such as images alongside text. Staying abreast of these advancements is essential to use the latest capabilities and make informed decisions about which models and techniques best suit your needs.

We hope this blog post demonstrates the potential of fine-tuning for essay grading and inspires you to explore its possibilities within education. As technology advances, the capacity to fine-tune models like ChatGPT presents opportunities and innovations that can positively impact student outcomes and formative assessment, while also alleviating the burdens on educators and researchers.

References

  • Kim, Y., Mozer, R., Miratrix, L., & Al-Adeimi, S. (2024). ChatGPT vs. Machine Learning: Assessing the Efficacy and Accuracy of Large Language Models for Automated Essay Scoring (in preparation).
