How to Grade Essays with ChatGPT

by Youngwon Kim, Reagan Mozer, Shireen Al-Adeimi, and Luke Miratrix

Introduction

The rise of large language models (LLMs) like OpenAI’s ChatGPT has opened up exciting possibilities for essay grading. With its advanced natural language processing capabilities, ChatGPT offers a new dimension in assessing written work, potentially revolutionizing the grading process for educators and researchers. Let’s delve into how ChatGPT could make essay grading easier, more efficient, and more accurate.

ChatGPT can analyze written content for various parameters, including content quality, argument structure, coherence, and adherence to guidelines. Whether you use a continuous scoring system (e.g., quality of writing) or a discrete one (e.g., essay positions), ChatGPT can be tailored to your specific needs, offering customized feedback for different writing styles and assignments. The literature also suggests that LLMs can significantly increase grading efficiency, alleviating some of the burden on educators (Abedi et al., 2023; Okonkwo & Ade-Ibijola, 2021; Zawacki-Richter et al., 2019). Imagine grading hundreds of essays and providing feedback on each one: a time-consuming and tiring task. ChatGPT can automate the initial assessment, flagging essays that require further attention based on specific criteria. Additionally, ChatGPT can identify stylistic strengths and weaknesses, analyze the use of literary devices, and even point out potential inconsistencies in an argument’s logic. This could free up valuable educator time for student interaction and curriculum development.

However, caution against over-reliance on this new technology is advised, particularly in scenarios where biased or inaccurate models could unfairly impact individual students. It is essential to recognize both the potential advantages and limitations of LLMs. This blog post aims to delve into and reflect on ChatGPT’s capabilities for grading and classifying essays and to provide insights into the practical application of ChatGPT in educational settings.

Objectives

In this blog, we will explore:

  1. Essay grading with ChatGPT and ChatGPT API
  2. Steps for essay grading with ChatGPT API
  3. Steps for essay classification with ChatGPT API
  4. Cost & computation times

For steps 2 and 3, we will provide detailed instructions on how to access and set up the ChatGPT API, prepare and upload your text dataset, and efficiently grade or classify numerous essays. Additionally, we will compare the outcomes of human grading to those obtained through GPT grading.

Essay Grading with ChatGPT and ChatGPT API

For a single essay, we can simply paste the essay into the ChatGPT web interface and ask it to assign a grade; ChatGPT typically replies with a short sentence such as “I would rate this essay a 4 out of 7.”

For multiple essays, we could ask ChatGPT to grade each one individually. However, when dealing with a large number of essays (e.g., 50, 100, 1000, etc.), manually grading them in this way becomes a laborious and time-consuming task. In such cases, we can leverage the ChatGPT API service to evaluate numerous essays at once, providing greater flexibility and efficiency. The ChatGPT API is a versatile tool that enables developers to integrate ChatGPT into their own applications, services, or websites. When you use the API, you also gain more control over the interaction, such as the ability to adjust the temperature, the maximum number of tokens, and the presence of system messages (a sketch of these controls appears after the first example below).

It is important to understand the distinctions between ChatGPT’s web interface and the pretrained models accessible through the OpenAI API.

ChatGPT’s web version provides a user-friendly chat interface, requiring no coding knowledge and offering features like integrated system tools. However, it is less customizable and is not designed for managing high volumes of requests. Additionally, due to its internal short-term memory span, previous conversations can influence later responses. In contrast, the OpenAI API offers pretrained models without a built-in interface, necessitating coding experience for integration. These models excel at managing large request volumes, but lack ChatGPT’s conversational memory; they process each input independently. This fundamental difference can lead to variations in the outputs generated by ChatGPT’s web interface and the OpenAI API.

Here’s an example of grading a single essay using the ChatGPT API with Python:

# Import the "openai" library
import openai

# OpenAI API key, obtained from "https://platform.openai.com/api-keys"
openai.api_key = "YOUR-API-KEY"

# A function to generate a response
def chatGPT(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0 # this is the degree of randomness of the model's output
    )
    return response.choices[0].message.content
# Prompt requesting OpenAI to grade an essay
prompt = """ Rate the overall quality of the essay below, 
             using a scale from 1 to 7, 
             where higher scores indicate better quality. 
             Provide only the overall quality score.

             >>> Essay >>>
             My perspective on this school community issue is that Ipads 
             should be allowed in the school. 
             Even though students abused their opportunity and got it abolished, 
             there is a chance that it can be solved. 
             Each Ipad comes with a Internet Protocol address. 
             After that the principal can dismiss their privileges 
             and punish them depending on what they did. 
             One reason I believe that Ipads should be allowed 
             in school is innovative learning. 
             Children can easily do hands on learning. 
             It can be convenient for use, and make a way 
             to search for things you don't understand, 
             instead of asking your neighbor beside you for help. 
             Another reason I believe that Ipads should be allowed in school 
             is that it provides, a way to communicate 
             to school teachers, principals, stuff, et cetera. 
             Sometimes it can provide a way to say. 
             If a mean message is sent, as I stated above, 
             you can track them down. 
             They also can program or set up a senior for certain words. 
             A final reason why I believe that the school should distribute 
             Ipads is because some students do not have computers, 
             And if they have to do a report, they can study it on their Ipads. 
             This will increase academic performance of some students 
             and provide essential information for the student. 
             This is my view on whether or not the school should have Ipads."""

# Grade the essay above
chatGPT(prompt)
'4'

Interestingly, this example produces a single score rather than the sentence generated above via the ChatGPT web interface. This difference could be attributed to the ChatGPT API interpreting the prompt more directly than the ChatGPT online service, even though they both use the same pretrained model. Alternatively, the variability in ChatGPT’s results might be due to inherent randomness in its responses.
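
As noted above, the API also exposes finer controls over the interaction. Here is a minimal sketch, not part of the original analysis, of how the chatGPT() function above could be extended to set a system message and cap the reply length; the role text and token cap are illustrative choices only.

def chatGPT_controlled(prompt, model="gpt-3.5-turbo",
                       system="You are a strict writing instructor.",
                       max_tokens=5):
    messages = [
        {"role": "system", "content": system},  # sets the grader's role up front
        {"role": "user", "content": prompt},
    ]
    response = openai.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,          # keep the output as deterministic as possible
        max_tokens=max_tokens,  # cap the reply length when only a score is wanted
    )
    return response.choices[0].message.content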

By implementing a loop with multiple texts, we can acquire scores for an entire set of essays. Let’s see how to do that.

Steps for Essay Grading with ChatGPT API

Get and set up a ChatGPT API key

We assume that you have already installed the Python OpenAI library on your system and have an active OpenAI account. Setting up and obtaining access to the ChatGPT API involves the following steps:

  • Obtain an OpenAI API key: Visit the OpenAI API website at https://platform.openai.com/api-keys and click the Create new secret key button. Save your key securely, as OpenAI’s security policies do not let you view the same key again once it has been created.

  • Set up the API key: In your Python script or notebook, set up the API key using the following code, replacing “YOUR-API-KEY” with your actual API key:

import openai

openai.api_key = "YOUR-API-KEY"
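
Hard-coding the key is fine for a quick test, but a common alternative is to read it from an environment variable so the key never appears in your script. The sketch below assumes you have already exported OPENAI_API_KEY in your shell.

import os
import openai

# Assumes you ran, e.g., export OPENAI_API_KEY="YOUR-API-KEY" beforehand
openai.api_key = os.environ["OPENAI_API_KEY"]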

Load the text dataset

In this post, we will grade a series of essays about iPad usage in schools.

# Import the "pandas" library
import pandas as pd

text_df = pd.read_csv("iPad1.csv")
text_df.head()
Text Stance_iPad Scores
0 Some people allow Ipads because some people ne… AMB 1
1 I have a tablet. But it is a lot of money. But… AMB 1
2 Do you think we should get rid of the Ipad wh… AMB 1
3 I said yes because the teacher will not be tal… AMB 2
4 Well I would like the idea . But then for it … AMB 4

Score multiple essays

# Create an empty list for multiple results
GPT_score_results = []
# Import the "tqdm" library to show the processing time of each loop 
from tqdm import tqdm 

# Score the essays through the implementation of a loop
for i in tqdm(range(len(text_df["Text"])), desc="Processing", unit="iteration"):
    prompt = f"""
    Rate the overall quality of the essay below, 
    using a scale from 1 to 7, where higher scores indicate better quality. 
    
    Provide only the overall quality score.
    ```{text_df["Text"][i]}```
    """
    score = chatGPT(prompt)
    GPT_score_results.append(score)
Processing: 100%|████████████████████████| 50/50 [00:25<00:00,  1.96iteration/s]

Grading 50 essays takes only 25 seconds.
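
For much larger batches (hundreds or thousands of essays), you may run into OpenAI’s rate limits. One simple defensive pattern, sketched here under the assumption that the chatGPT() function defined earlier is available and that you are using version 1 or later of the openai library, is to pause and retry when a rate-limit error is raised; the number of retries and the wait time are arbitrary choices.

import time

def chatGPT_with_retry(prompt, retries=3, wait=10):
    """Call chatGPT(), pausing and retrying when the API reports a rate limit."""
    for attempt in range(retries):
        try:
            return chatGPT(prompt)
        except openai.RateLimitError:
            time.sleep(wait)  # back off before trying again
    raise RuntimeError("Rate limit retries exhausted")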

# Check the grading results by ChatGPT
GPT_score_results[0:5]
['2', '2', '2', '2', '4']
# Convert strings to floats using list comprehension
GPT_score_results2 = [float(x) for x in GPT_score_results] 
GPT_score_results2[0:5]
[2.0, 2.0, 2.0, 2.0, 4.0]
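
The float(x) conversion assumes ChatGPT returned a bare number, as it did here. If a pilot run instead produces replies like “Score: 4” or a full sentence, a small defensive parse (a hypothetical helper, not part of the original analysis) could replace float(x) above:

import re

def extract_score(text):
    """Return the first number found in a ChatGPT reply, or None if none is found."""
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) if match else None

extract_score("I would rate this essay a 4 out of 7")
4.0
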
# Add the results to the original dataset
text_df['Scores_GPT'] = GPT_score_results2
text_df.head()
Text Stance_iPad Scores Scores_GPT
0 Some people allow Ipads because some people ne… AMB 1 2.0
1 I have a tablet. But it is a lot of money. But… AMB 1 2.0
2 Do you think we should get rid of the Ipad wh… AMB 1 2.0
3 I said yes because the teacher will not be tal… AMB 2 2.0
4 Well I would like the idea . But then for it … AMB 4 4.0

Compare human grading scores with GPT grading scores

For these data, we happened to have scores given by human raters as well, allowing us to see how similar the human scores are to the scores generated by ChatGPT.

Using the code provided in the accompanying script, we get the following:

import matplotlib.pyplot as plt

# Calculate the overall range of scores
min_score = min(text_df['Scores'].min(), text_df['Scores_GPT'].min())
max_score = max(text_df['Scores'].max(), text_df['Scores_GPT'].max())

# Calculate the maximum frequency across both datasets
max_freq = max(max(text_df['Scores'].value_counts()), max(text_df['Scores_GPT'].value_counts()))

plt.subplot(1, 2, 1)
plt.hist(text_df['Scores'], bins=10, range=(min_score, max_score), color='blue', edgecolor='black')
plt.title('Histogram of Human Grading Scores')
plt.xlabel('Scores')
plt.ylabel('Frequency')
plt.ylim(0, max_freq)  # Set y-axis limits

plt.subplot(1, 2, 2)
plt.hist(text_df['Scores_GPT'], bins=10, range=(min_score, max_score), color='green', edgecolor='black')
plt.title('Histogram of GPT Grading Scores')
plt.xlabel('Scores_GPT')
plt.ylim(0, max_freq)  # Set y-axis limits

plt.tight_layout()
plt.show()

# Averages and standard deviations

import numpy as np

mean_human = np.mean(text_df['Scores'])
mean_gpt = np.mean(text_df['Scores_GPT'])
sd_human = np.std(text_df['Scores'])
sd_gpt = np.std(text_df['Scores_GPT'])

print(f"Average of Human Grading Scores: {mean_human}")
print(f"SD of Human Grading Scores: {round(sd_human, 2)}")
print(f"Average of GPT Grading Scores: {mean_gpt}")
print(f"SD of GPT Grading Scores: {round(sd_gpt, 2)}")
Average of Human Grading Scores: 2.54
SD of Human Grading Scores: 1.68
Average of GPT Grading Scores: 2.36
SD of GPT Grading Scores: 0.74

A contingency table (confusion matrix) of the scores is:

# Contingency table
pd.crosstab(text_df['Scores'],text_df['Scores_GPT'])
Scores_GPT  1.0  2.0  3.0  4.0  5.0
Scores
0             1    7    0    0    0
1             0    9    0    0    0
2             0    4    1    0    0
3             0    8    2    0    0
4             0    8    3    2    0
5             0    0    2    2    0
6             0    0    0    0    1
from sklearn.metrics import mean_squared_error

# Calculate Correlation and Root Mean Squared Error
correlation = text_df['Scores'].corr(text_df['Scores_GPT'])
rmse = np.sqrt(mean_squared_error(text_df['Scores'], text_df['Scores_GPT']))

print(f"Correlation of Human and GPT Scores: {round(correlation, 2)}")
print(f"Root Mean Squared Error (RMSE): {round(rmse, 2)}")
Correlation of Human and GPT Scores: 0.62
Root Mean Squared Error (RMSE): 1.36

The averages and standard deviations of the human grading and GPT grading scores are 2.54 (SD = 1.68) and 2.36 (SD = 0.74), respectively. The correlation between them is 0.62, indicating a fairly strong positive linear relationship. Additionally, the Root Mean Squared Error (RMSE) is 1.36, a measure of how far, on average, GPT’s scores deviate from the human grading scores.

Steps for Essay Classification with ChatGPT API

ChatGPT can be utilized not only for scoring essays but also for classifying essays based on some categorical variable such as writers’ opinions regarding iPad usage in schools. Here are the steps to guide you through the process, assuming you already have access to the ChatGPT API and have loaded your text dataset:

Classify multiple essays

# Create an empty list for multiple results
GPT_stance_results = []
# Import the "tqdm" library to show the processing time of each loop 
from tqdm import tqdm 

for i in tqdm(range(len(text_df["Text"])), desc="Processing", unit="iteration"):
    prompt = f"""
    Evaluate and classify the position on the use of iPads in schools into one of the following categories:

    Allow iPads in school (AFF)
    Do not allow iPads in school (NEG)
    Or, if the essay does not align with either of these categories, designate it as (OTHER). 
    
    Provide your response as either AFF, NEG, or OTHER.
    ```{text_df["Text"][i]}```
    """
    stance = chatGPT(prompt)
    GPT_stance_results.append(stance)
Processing: 100%|████████████████████████| 50/50 [00:27<00:00,  1.85iteration/s]

Classifying 50 essays takes only 27 seconds.

# Check the classification results by ChatGPT
GPT_stance_results[0:5]
['OTHER', 'OTHER', 'OTHER', 'OTHER', 'OTHER']
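
Because the contingency table and accuracy metrics below compare raw strings, it can be worth normalizing ChatGPT’s replies in case any come back with stray whitespace or different casing; this is a defensive step, and our 50 replies were already clean.

# Strip whitespace and force upper case so replies match AFF/NEG/OTHER exactly
GPT_stance_results = [s.strip().upper() for s in GPT_stance_results]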

We create a new column, re_Stance_iPad, by mapping the values of the existing Stance_iPad column. The AFF and NEG labels mark clear positions, while the AMB, BAL, and NAR labels do not; we therefore combine AMB, BAL, and NAR into a single OTHER category.

# Create a new column re_Stance_iPad 
text_df['re_Stance_iPad'] = text_df['Stance_iPad'].map({
                                                        'AFF': 'AFF',    # keep clear positions
                                                        'NEG': 'NEG',
                                                        'AMB': 'OTHER',  # combine unclear positions
                                                        'BAL': 'OTHER',
                                                        'NAR': 'OTHER',
                                                       })
# Add the results to the original dataset
text_df['Stance_iPad_GPT'] = GPT_stance_results
text_df.head()
Text Stance_iPad Scores Scores_GPT re_Stance_iPad Stance_iPad_GPT
0 Some people allow Ipads because some people ne… AMB 1 2.0 OTHER OTHER
1 I have a tablet. But it is a lot of money. But… AMB 1 2.0 OTHER OTHER
2 Do you think we should get rid of the Ipad wh… AMB 1 2.0 OTHER OTHER
3 I said yes because the teacher will not be tal… AMB 2 2.0 OTHER OTHER
4 Well I would like the idea . But then for it … AMB 4 4.0 OTHER OTHER

Compare human classification with GPT classification

# Contingency table
pd.crosstab(text_df['re_Stance_iPad'],text_df['Stance_iPad_GPT'])
Stance_iPad_GPT  AFF  NEG  OTHER
re_Stance_iPad
AFF                7    0      3
NEG                0    9      1
OTHER              3    1     26
# Evaluate the classification results
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, cohen_kappa_score

# Suppress warning messages
import warnings
from sklearn.exceptions import UndefinedMetricWarning
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)

# Assuming y_true is the true labels and y_pred is the predicted labels
y_true = text_df['re_Stance_iPad']
y_pred = text_df['Stance_iPad_GPT']

# Calculate metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
kappa = cohen_kappa_score(y_true, y_pred)

# Display the metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {round(precision,2)}")
print(f"Recall: {recall}")
print(f"F1 Score: {round(f1, 2)}")
print(f"Cohen's Kappa: {round(kappa, 2)}")
Accuracy: 0.84
Precision: 0.84
Recall: 0.84
F1 Score: 0.84
Cohen's Kappa: 0.71

ChatGPT achieves an accuracy of approximately 84%, matching the human classification for 42 of the 50 essays. The F1 score of 0.84, reflecting the harmonic mean of precision and recall, signifies a well-balanced performance on both measures. Additionally, the Cohen’s Kappa value of 0.71, which measures the agreement between predicted and actual classifications while accounting for chance, indicates substantial agreement beyond what would be expected by chance alone.

Cost & Computation times

How long does it take to assess all essays?

Grading and classifying 50 essays each took 25 and 27 seconds, resulting in a rate of about 2 essays per second.

What is the cost of assessing all essays?

In this blog, we utilized GPT-3.5-turbo-0125. According to OpenAI’s pricing page, the cost for input processing is $0.0005 per 1,000 tokens, and for output it is $0.0015 per 1,000 tokens; in other words, the ChatGPT API charges both for the tokens you send and for the tokens you receive.

The total expenditure for grading all essays—50 assessing essay quality and 50 for essay classification—was approximately $0.01.

What are tokens and how to count them?

Tokens can be viewed as fragments of words. When the API receives prompts, it breaks down the input into tokens. These divisions do not always align with the beginning or end of words; tokens may include spaces and even parts of words. To grasp the concept of tokens and their length equivalencies better, here are some helpful rules of thumb:

  • 1 token ≈ 4 characters in English.
  • 1 token ≈ ¾ of a word.
  • 100 tokens ≈ 75 words.

Or

  • 1 to 2 sentences ≈ 30 tokens.
  • 1 paragraph ≈ 100 tokens.
  • 1,500 words ≈ 2,048 tokens.

To get additional context on how tokens are counted, consider this:

# Import 'tiktoken' package and define the function to count tokens
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
num_tokens_from_string("""Spread love everywhere you go. 
                       Let no one ever come without leaving happier - Mother Teresa""", 
                       "cl100k_base")
17
num_tokens_from_string(prompt, "cl100k_base")
129
num_tokens_from_string("I would rate this essay a 4 out of 7", "cl100k_base")
12

The prompt at the beginning of this blog, requesting that OpenAI grade an essay, contains 129 tokens, and a sentence-style output such as “I would rate this essay a 4 out of 7” contains 12 tokens.

The input cost is $0.0000645, and the output cost is $0.000018.
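
These figures follow directly from the token counts and per-token prices above; the arithmetic is simply:

# Cost of the single request above, using the prices quoted earlier
input_cost = 129 * 0.0005 / 1000   # 129 prompt tokens at $0.0005 per 1,000 tokens
output_cost = 12 * 0.0015 / 1000   # 12 output tokens at $0.0015 per 1,000 tokens
print(f"Input cost: ${input_cost:.7f}")
print(f"Output cost: ${output_cost:.6f}")
Input cost: $0.0000645
Output cost: $0.000018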

Conclusion

ChatGPT provides an alternative approach to essay grading. This post has delved into the practical application of ChatGPT’s natural language processing capabilities, demonstrating how it can be used for efficient and accurate essay grading, with a comparison to human grading. The flexibility of ChatGPT is particularly evident when handling large volumes of essays, making it a viable alternative tool for educators and researchers. By employing the ChatGPT API, the grading process becomes not only streamlined but also adaptable to varying scales, from individual essays to hundreds or even thousands.

This technology has the potential to significantly enhance the efficiency of the grading process. By automating the assessment of written work, teachers and researchers can devote more time to other critical aspects of education. However, it’s important to acknowledge the limitations of current LLMs in this context. While they can assist in grading, relying solely on LLMs for final grades could be problematic, especially if the models are biased or inaccurate. Such scenarios could lead to unfair outcomes for individual students, highlighting the need for human oversight in the grading process. For large-scale research, where we look across many essays rather than at any single one, this is less of a concern (see, e.g., Mozer et al., 2023).

The guide in this blog has provided a step-by-step walkthrough of setting up and accessing the ChatGPT API for essay grading.

We also explored the reliability of ChatGPT’s grading as compared to human grading. The moderate positive correlation of 0.62 attests to some consistency between human grading and ChatGPT’s evaluations. The classification results reveal that the model achieves an accuracy of approximately 84%, and the Cohen’s Kappa value of 0.71 indicates substantial agreement beyond what would be expected by chance alone. See the related study (Kim et al., 2024) for more on this.

In essence, this guide underscores the transformative potential of ChatGPT in essay grading, presenting it as a valuable approach in the ever-evolving field of education. This post gives an overview; in a follow-up, we dig in a bit more, thinking about prompt engineering and providing examples to improve accuracy.

Writer’s Comments

The API Experience: A Blend of Ease and Challenge

Starting your journey with the ChatGPT API will be surprisingly smooth, especially if you have some Python experience. Copying and pasting code from this blog, acquiring your own ChatGPT API key, and tweaking the prompts and datasets might seem like a breeze. However, this simplicity masks the underlying complexity. Bumps along the road are inevitable, reminding us that “mostly” easy does not mean entirely challenge-free.

The biggest hurdle you will likely face is mastering the art of crafting effective prompts. While ChatGPT’s responses are impressive, they can also be unpredictably variable. Conducting multiple pilot runs with 5-10 essays is crucial. Experimenting with diverse prompts on the same essays can act as a stepping stone, refining your approach and building confidence for wider application.
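
One concrete way to run such a pilot, sketched below under the assumption that the text_df data frame and the chatGPT() function from earlier are still in memory, is to try a couple of prompt variants on the same handful of essays and compare the results; the prompt wordings here are hypothetical and should be adapted to your own rubric.

# Two illustrative prompt variants (wording is hypothetical; adapt to your rubric)
variants = [
    "Rate the overall quality of the essay below on a scale from 1 to 7. Provide only the score.",
    "You are an experienced writing teacher. Score the essay below from 1 (poor) to 7 (excellent). Reply with the number only.",
]

pilot_essays = text_df["Text"][0:5]  # a small pilot sample

for v, variant in enumerate(variants, start=1):
    scores = [chatGPT(f"{variant}\n```{essay}```") for essay in pilot_essays]
    print(f"Prompt variant {v}: {scores}")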

When things click, the benefits are undeniable. Automating the grading process with ChatGPT can save considerable time. Human graders, myself included, can struggle with maintaining consistent standards across a mountain of essays. ChatGPT, on the other hand, might be more stable when grading large batches in a row.

It is crucial to acknowledge that this method is not a magic bullet. Continuous scoring is not quite there yet, and limitations still exist. But the good news is that LLMs like ChatGPT are constantly improving, and new options are emerging.

Overall Reflections: A Journey of Discovery

The exploration of the ChatGPT API can be a blend of innovation, learning, and the occasional frustration. While AI grading systems like ChatGPT are not perfect, their ability to save time and provide a consistent grading scheme makes them an intriguing addition to the educational toolkit. As we explore and refine these tools, the horizon for their application in educational settings seems ever-expanding, offering a glimpse into a future where AI and human educators work together to enhance the learning experience. Who knows, maybe AI will become a valuable partner in the grading process in the future!

Call to Action

Have you experimented with using ChatGPT for grading? Share your experiences and questions in the comments below! We can all learn from each other as we explore the potential of AI in education.

References

  • Abedi, M., Alshybani, I., Shahadat, M. R. B., & Murillo, M. (2023). Beyond traditional teaching: The potential of large language models and chatbots in graduate engineering education. Qeios. https://doi.org/10.32388/MD04B0
  • Kim, Y., Mozer, R., Miratrix, L., & Al-Adeimi, S. (2024). ChatGPT vs. machine learning: Assessing the efficacy and accuracy of large language models for automated essay scoring (in preparation).
  • Mozer, R., Miratrix, L., Relyea, J. E., & Kim, J. S. (2023). Combining human and automated scoring methods in experimental assessments of writing: A case study tutorial. Journal of Educational and Behavioral Statistics. https://doi.org/10.3102/10769986231207886
  • Okonkwo, C. W., & Ade-Ibijola, A. (2021). Chatbots applications in education: A systematic review. Computers and Education: Artificial Intelligence, 2, 100033. https://doi.org/10.1016/j.caeai.2021.100033
  • OpenAI. (n.d.). Pricing. Retrieved March 2, 2024, from https://openai.com/pricing#language-models
  • Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education: Where are the educators? International Journal of Educational Technology in Higher Education, 16(1), 1-27. https://doi.org/10.1186/s41239-019-0171-0
