The Art of Crafting Prompts for Essay Grading with ChatGPT
by Youngwon Kim, Reagan Mozer, Shireen Al-Adeimi, and Luke Miratrix
Introduction
In our first blog, “A Comprehensive Guide to Essay Grading with ChatGPT,” we navigated the initial steps of using the ChatGPT API for grading essays. For those with some Python experience, incorporating the code from the first blog is straightforward. However, we also touched on a challenge: the intricacies of crafting effective prompts. This second entry in the “Essay Grading with ChatGPT” series delves deeper into that challenge, comparing the outcomes of essay grading under different prompts (the instructions we give to ChatGPT) to optimize AI for educational purposes.
Objectives
In this blog, our goal is to elucidate how varying prompts affect the grading performance of ChatGPT, offering educators and researchers insights into using ChatGPT for grading. By experimenting with a range of prompts, from simple to complex, we aim to provide a better understanding of how to use ChatGPT most effectively in assessing student essays, alongside a comparative analysis of results derived from these diverse prompts.
In this blog we will cover the following variations:
- Scoring ranges: Exploring the effects of providing specific score ranges (e.g., 1-5, 1-7, 1-10).
- Role: Examining how assigning ChatGPT a specific role (e.g., “grader for elementary, college, graduate students”) influences outcomes.
- Scoring criteria: Investigating the influence of providing explicit criteria (e.g., grammar, content, structure) versus leaving it more open.
- Form of outcome: Comparing results when asking for a numerical score, Likert scale rating, or different data structures (e.g., JSON).
- Advanced techniques: Delving into few-shot learning and chain-of-thought reasoning to enhance ChatGPT’s grading capabilities.
Approach
We will compare scores generated by ChatGPT against human-assigned scores for a dataset of 2,687 student essays on the topic of iPads in school. The human-assigned scores range from 1 to 7, with a mean score of 4.03 and a standard deviation of 1.13. The distribution of these scores is approximately normal, as visualized in the histogram below:
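For readers following along in code, a histogram like this takes only a few lines of pandas and matplotlib. The sketch below is ours, and the file name and column name (human_score) are placeholders rather than details from our pipeline.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the essays; the file name and column names here are placeholders for
# however your own data is stored.
essays = pd.read_csv("ipad_essays.csv")

# Plot the distribution of the 1-7 human-assigned scores.
essays["human_score"].plot(kind="hist", bins=range(1, 9), rwidth=0.9)
plt.xlabel("Human-assigned score")
plt.ylabel("Number of essays")
plt.title("Distribution of human scores")
plt.show()
```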
These essays, written by students in grades 4-8, were evaluated by ChatGPT using 15 distinct prompts of varying complexity and specificity. This multifaceted approach will allow us to identify which prompt variations elicit the most accurate and consistent grading results from ChatGPT, aligning closely with human assessments.
Let’s dive into the details of our experiment and uncover the secrets of prompt engineering for optimal essay grading with ChatGPT.
Grading an Essay Using Various Prompts
The way we compared prompts was to have ChatGPT grade all of the essays using each prompt of interest. Our final result was a dataset with, for each essay, 15 different scores (plus the actual human score). ChatGPT also gave written feedback; in our current paper in preparation we look at how the tone and substance of that feedback changes with different guidance. For this blog, however, we focus on the numeric scores.
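To make the setup concrete, here is a minimal sketch of such a grading loop using the OpenAI Python client. It is not our exact pipeline: the model name, the essays DataFrame with its id and text columns, and the grade_essay helper are all illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# The 15 prompt texts introduced in the sections below.
prompts = {
    "prompt1": "Evaluate the overall quality of the following essay out of 5.",
    "prompt2": "Evaluate the overall quality of the following essay out of 7.",
    # ... prompts 3-15 ...
}

def grade_essay(prompt_text: str, essay_text: str) -> str:
    """Send one prompt/essay pair to the chat API and return the raw reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative; swap in whichever model you are using
        messages=[{"role": "user", "content": f"{prompt_text}\n\nEssay:\n{essay_text}"}],
        temperature=0,
    )
    return response.choices[0].message.content

# One row per essay, one column per prompt, holding the raw replies.
results = []
for essay_id, essay_text in essays[["id", "text"]].itertuples(index=False):
    row = {"id": essay_id}
    for name, prompt_text in prompts.items():
        row[name] = grade_essay(prompt_text, essay_text)
    results.append(row)
```

In practice you would also want basic error handling and rate limiting around the API calls, and you would parse the raw replies into numbers before any analysis (more on that in the output-format section below).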
1. Variations in Scoring Ranges
First, we explore how different score ranges might impact scoring. In particular, we wonder whether the relative scores will shift, or whether ChatGPT will use more categories when provided with a wider range of scores. To examine this, we consider three prompts, each with a different maximum score:
- Prompt 1: “Evaluate the overall quality of the following essay out of 5.”
- Prompt 2: “Evaluate the overall quality of the following essay out of 7.”
- Prompt 3: “Evaluate the overall quality of the following essay out of 10.”
Comparing Results from Prompts 1-3
Overall Results: Prompts 1-3
| Scores | Mean | SD | R² | RMSE |
|---|---|---|---|---|
| Human Scores (Out of 7) | 4.03 | 1.13 | | |
| Prompt 1 (Out of 5) x 1.4 | 1.96 | 0.69 | 0.30 | 2.28 |
| Prompt 2 (Out of 7) | 2.02 | 0.53 | 0.36 | 2.21 |
| Prompt 3 (Out of 10) x 0.7 | 1.75 | 0.51 | 0.38 | 2.45 |
The table presents an evaluation of different prompts compared to human-generated scores. The scores for Prompt 1 (originally out of 5) and Prompt 3 (originally out of 10) are adjusted to a scale of 7 by multiplying by 1.4 and 0.7, respectively, to facilitate comparison.
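The rescaling and the two agreement metrics can be computed along the following lines. This is a sketch that assumes the human scores and the parsed ChatGPT scores have been merged into a single DataFrame df with the column names shown, and it takes R² to be the squared correlation between the two sets of scores (equivalently, the R² from regressing human scores on ChatGPT scores).

```python
import numpy as np

# df is assumed to hold the human scores and the parsed numeric scores for each prompt.
df["prompt1_on7"] = df["prompt1"] * 1.4   # rescale the out-of-5 scores
df["prompt3_on7"] = df["prompt3"] * 0.7   # rescale the out-of-10 scores

def agreement(human, predicted):
    """R² here is the squared correlation between the two sets of scores;
    RMSE is computed on the common 7-point scale."""
    human, predicted = np.asarray(human, dtype=float), np.asarray(predicted, dtype=float)
    r2 = np.corrcoef(human, predicted)[0, 1] ** 2
    rmse = np.sqrt(np.mean((human - predicted) ** 2))
    return round(r2, 2), round(rmse, 2)

for col in ["prompt1_on7", "prompt2", "prompt3_on7"]:
    r2, rmse = agreement(df["human_score"], df[col])
    print(f"{col}: R² = {r2}, RMSE = {rmse}")
```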
Overall, ChatGPT is harsh, and gives low scores no matter what scale we give it.
Prompt 1 demonstrates the weakest agreement with human scores, as indicated by its lowest R² value (0.30). This means that only 30% of the variation in the human scores can be explained by ChatGPT’s scores under Prompt 1. While Prompt 2 shows slightly better agreement (R² = 0.36), Prompt 3 shows the strongest agreement with human scores (R² = 0.38). That said, an \(R^2\) of 0.38 is still quite low, indicating that ChatGPT’s scores are not highly correlated with human scores.
An analysis of the means and standard deviations reveals an interesting trend: as the score range widens, the variability of the scores decreases and agreement with the human scores (as measured by R²) improves. This suggests that a wider range of possible scores may allow for more nuanced and accurate evaluations, which in turn could lead to higher R² values and better agreement with human judgments.
2. Role-Specific Prompts and Targeted Scoring Ranges
To investigate how ChatGPT’s scoring is influenced by the age-specific role assigned to the essay grader (elementary, college, or graduate), we employ three distinct prompts, each assigning ChatGPT the role of an essay grader for a specific age group while maintaining a consistent 7-point scoring scale (one way to pass such a role to the API is sketched after the list):
- Prompt 4: “You are an essay grader for elementary school students. Evaluate the overall quality of the following essay out of 7.”
- Prompt 5: “You are an essay grader for college students. Evaluate the overall quality of the following essay out of 7.”
- Prompt 6: “You are an essay grader for graduate students. Evaluate the overall quality of the following essay out of 7.”
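One natural way to implement such a role is to pass it as the system message, keeping the grading instruction in the user message. Whether this exactly matches our setup is not shown here, so treat the sketch below (which reuses the client and model choice from the earlier sketch) as one possible implementation.

```python
def grade_with_role(role_description: str, essay_text: str) -> str:
    """Assign ChatGPT a grader role via the system message (one possible implementation)."""
    response = client.chat.completions.create(  # `client` from the earlier sketch
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[
            {"role": "system", "content": role_description},
            {"role": "user",
             "content": "Evaluate the overall quality of the following essay out of 7.\n\n"
                        f"Essay:\n{essay_text}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

# Prompt 4: grader for elementary school students.
reply = grade_with_role("You are an essay grader for elementary school students.", essay_text)
```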
Comparing Results from Prompts 4-6
Overall Results: Prompts 4-6
| Scores | Mean | SD | R² | RMSE |
|---|---|---|---|---|
| Human Scores (Out of 7) | 4.04 | 1.13 | | |
| Prompt 4 (Elementary School Student) | 3.23 | 0.94 | 0.42 | 1.20 |
| Prompt 5 (College Student) | 2.41 | 0.70 | 0.38 | 1.85 |
| Prompt 6 (Graduate Student) | 2.34 | 0.67 | 0.36 | 1.93 |
Prompt 4 (Elementary School Student) demonstrates the highest agreement with human scores, as indicated by its R² value of 0.42. Prompts 5 (College Student) and 6 (Graduate Student) show slightly lower agreement, with R² values of 0.38 and 0.36, respectively.
The average scores generated by the prompts decrease markedly from the elementary (3.23) to the graduate level (2.34). This suggests that the model, when prompted with higher educational levels, tends to become more critical or stringent in its evaluations. The trend is further supported by bar charts, which show an increase in the frequency of lower scores (e.g., 2 and 3) as the educational level in the prompt increases.
While the differences in R² values are modest, the observed trends suggest that role-specific prompts can elicit distinct evaluation patterns from the model. This highlights the potential of tailoring prompts to specific roles or educational levels to achieve desired evaluation outcomes.
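Score distributions like the bar charts described above can be reproduced roughly as follows, assuming the parsed scores sit in columns prompt4 through prompt6 of a results DataFrame df (these names are ours).

```python
import matplotlib.pyplot as plt

# df is assumed to hold the parsed scores in columns prompt4, prompt5, prompt6.
fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, col, label in zip(axes,
                          ["prompt4", "prompt5", "prompt6"],
                          ["Elementary", "College", "Graduate"]):
    df[col].value_counts().sort_index().plot(kind="bar", ax=ax)
    ax.set_title(label)
    ax.set_xlabel("Score (out of 7)")
axes[0].set_ylabel("Number of essays")
plt.tight_layout()
plt.show()
```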
3. Variations in Specific Scoring Criteria
Next, we incorporate specific scoring criteria into the prompts to evaluate ChatGPT’s sensitivity to different aspects of essay quality, and to investigate whether it genuinely considers the specified criteria when assigning scores:
- Prompt 7: “Evaluate the overall quality of the following essay on the use of iPads in school, considering structure, content, argument strength, and writing quality. Use a 7-point scale with higher scores indicating greater quality.”
- Prompt 8: “Evaluate the overall quality of the following essay on the use of iPads in school, considering grammar, coherence, organization, and relevance. Use a 7-point scale with higher scores indicating greater quality.”
- Prompt 9: “Evaluate the overall quality of the following essay on the use of iPads in school, considering development of ideas, organization, and language facility and convention. Use a 7-point scale with higher scores indicating greater quality.”
Comparing Results from Prompts 7-9
Overall Results: Prompts 7-9
| Scores | Mean | SD | R² | RMSE |
|---|---|---|---|---|
| Human Scores (Out of 7) | 4.04 | 1.12 | | |
| Prompt 7 (Struc., Content, Arg. Strength, & Writing Qual.) | 1.90 | 0.49 | 0.37 | 2.33 |
| Prompt 8 (Grammar, Coherence, Org., & Relevance) | 2.03 | 0.44 | 0.30 | 2.23 |
| Prompt 9 (Dev. of Ideas, Org., & Lan. Fac. and Convention) | 1.99 | 0.48 | 0.34 | 2.25 |
Although Prompt 9 explicitly names development of ideas, organization, and language facility and convention as the scoring criteria for the 2,687 essays, Prompt 7, which emphasizes structure, content, argument strength, and writing quality, demonstrates slightly higher agreement with human scores (R² = 0.37) than Prompts 8 and 9 (R² = 0.30 and 0.34, respectively).
Interestingly, despite the differences in the specific wording of the scoring criteria across the prompts, the resulting means, standard deviations, and bar chart distributions are quite similar. This may suggest that when ChatGPT is given only the names of scoring criteria, without detailed descriptions or rubrics to guide their interpretation, its results tend to converge toward similar evaluation patterns. This convergence may indicate a lack of nuance in the model’s understanding and application of the criteria.
4. Guided Prompts: Defining Outcomes
To assess whether ChatGPT’s evaluation is sensitive to the specified output format, we design three prompts that are identical in their core request (to evaluate the overall quality of an essay on the use of iPads in school using a 7-point scale) but differ in how the response is to be presented (a sketch of how each format can be parsed back into a number follows the list):
- Prompt 10: “Evaluate the overall quality of the following essay regarding the use of iPads in school. Use a 7-point scale with higher scores indicating greater quality. Present your response as a single numeric score.”
- Prompt 11: “Evaluate the overall quality of the following essay regarding the use of iPads in school. Use a 7-point scale with higher scores indicating greater quality. Format your response as a JSON object with ‘score’ as the key.”
- Prompt 12: “Evaluate the overall quality of the following essay regarding the use of iPads in school. Use a 7-point scale where the descriptors range from ‘Very Poor’ to ‘Excellent.’ Provide your response as either ‘Very Poor’, ‘Poor’, ‘Below Average’, ‘Average’, ‘Above Average’, ‘Very Good’, or ‘Excellent.’”
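Each output format needs its own bit of parsing to get back to a number on the 1-7 scale. The sketch below shows one way to handle the three cases; the Likert-to-number mapping is our own assumption about how the verbal labels would be coded.

```python
import json
import re

LIKERT_TO_SCORE = {
    "Very Poor": 1, "Poor": 2, "Below Average": 3, "Average": 4,
    "Above Average": 5, "Very Good": 6, "Excellent": 7,
}

def parse_numeric(reply: str) -> float:
    """Prompt 10: pull the first number out of the reply."""
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else float("nan")

def parse_json(reply: str) -> float:
    """Prompt 11: expect a JSON object such as {"score": 5}."""
    return float(json.loads(reply)["score"])

def parse_likert(reply: str) -> float:
    """Prompt 12: map the verbal label back onto 1-7 (longest label checked first,
    so 'Above Average' is not mistaken for 'Average')."""
    for label in sorted(LIKERT_TO_SCORE, key=len, reverse=True):
        if label.lower() in reply.lower():
            return float(LIKERT_TO_SCORE[label])
    return float("nan")
```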
Comparing Results from Prompts 10-12
Overall Results: Prompts 10-12
| Scores | Mean | SD | R² | RMSE |
|---|---|---|---|---|
| Human Scores (Out of 7) | 4.03 | 1.14 | | |
| Prompt 10 (Numeric) | 2.65 | 0.94 | 0.33 | 1.69 |
| Prompt 11 (JSON) | 2.79 | 0.68 | 0.33 | 1.55 |
| Prompt 12 (Likert Scale) | 2.21 | 0.64 | 0.27 | 2.06 |
Prompts 10-12 are nearly identical, differing only in the format of the requested output. Despite this, Prompt 10 (Numeric) and Prompt 11 (JSON) exhibit slightly higher agreement with human scores (R² = 0.33) than Prompt 12 (Likert Scale) (R² = 0.27). Furthermore, the means and standard deviations of the scores produced by each prompt differ from one another. In particular, Prompt 10 and Prompt 11, while sharing the same R² value, exhibit different mean scores, standard deviations, and score distributions, as illustrated in the bar charts.
These findings suggest that defining the desired outcome format (numeric, JSON, or Likert scale) can influence the grading results produced by ChatGPT. This implies that ChatGPT may be sensitive not only to the content of the prompt but also to the specific format in which it is expected to provide its evaluation.
5. Advanced Grading Techniques: Few-Shot Learning and Chain-of-Thought Reasoning
We wrap up by using some more advanced ChatGPT guidance techniques to see how they help with grading. In particular, we investigate how ChatGPT’s responses are affected by incorporating few-shot learning and chain-of-thought reasoning into the prompts. Before we show you the prompts, we quickly describe what these essential terms from the realm of LLMs mean, to ensure a clearer understanding for everyone:
Prompts: Guiding LLM Responses
- Zero-shot prompting: This involves providing a basic task description without any examples. The LLM relies solely on its pre-existing knowledge and language understanding to generate a response.
- Few-shot prompting: This includes a few examples in the prompt to help guide the LLM’s response. The LLM learns from these examples and uses them to generate more relevant and accurate outputs.
Chain-of-Thought Reasoning
- Chain-of-thought reasoning (CoT) is a technique that enhances the reasoning capabilities of LLMs. It involves breaking down complex problems into a series of smaller, more manageable steps, similar to how humans approach problem-solving. By generating intermediate reasoning steps, CoT allows LLMs to tackle intricate tasks that require logical deduction and multi-step reasoning.
Previous studies (e.g., Wu et al., 2023) have indicated that both few-shot prompting and CoT reasoning can improve the accuracy and reliability of LLMs. These approaches have been shown to be effective in various domains, leading to more sophisticated and human-like responses from LLMs.
A Note on “Base” in Essay Grading
- In the context of essay grading, there is no truly “zero-shot” approach. To obtain relevant scores, the prompt for essay grading must include instructions describing how to assess the essays on a specific scale and using a specific rubric. We refer to this comprehensive prompt as the “base prompt,” though it’s not a standard term in the field.
OK, armed with the above, we tried the following three prompts:
- Prompt 13: “Evaluate the overall quality of the following essay regarding the use of iPads in school based on the following 3 criteria: (1) Development of Ideas, measuring the depth, complexity, and richness of details and examples; (2) Organization, focusing on the logical structure, coherence, and overall focus of ideas; (3) Language Facility and Convention, evaluating clarity, effectiveness in sentence structure, word choice, voice, tone, grammar, usage, and mechanics. Use a 7-point scale with higher scores indicating greater quality. Present your response as a numeric score.”
- Prompt 14: “Evaluate the overall quality of the following essay regarding the use of iPads in school based on the following 3 criteria: (1) Development of Ideas, measuring the depth, complexity, and richness of details and examples; (2) Organization, focusing on the logical structure, coherence, and overall focus of ideas; (3) Language Facility and Convention, evaluating clarity, effectiveness in sentence structure, word choice, voice, tone, grammar, usage, and mechanics.
{fewshot_examples}
Use a 7-point scale with higher scores indicating greater quality. Present your response as a numeric score.”
- Prompt 15: “Evaluate the overall quality of the following essay regarding the use of iPads in school based on the following 3 criteria: (1) Development of Ideas, measuring the depth, complexity, and richness of details and examples; (2) Organization, focusing on the logical structure, coherence, and overall focus of ideas; (3) Language Facility and Convention, evaluating clarity, effectiveness in sentence structure, word choice, voice, tone, grammar, usage, and mechanics.
{fewshot_examples}
Use a 7-point scale with higher scores indicating greater quality. When evaluating and scoring the given essay, consider the three criteria (Development of Ideas, Organization, and Language Facility and Convention) and the few-shot examples. Present your response as a numeric score.”
(Few-shot examples can be found in the appendix.)
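Mechanically, the {fewshot_examples} placeholder is simply substituted into the prompt string before the API call. The sketch below uses Python’s str.format for this; the template name, the abbreviated criteria text, and the reuse of the grade_essay helper from the earlier sketch are all illustrative assumptions.

```python
# `fewshot_examples` is the string defined in the appendix; the criteria text is
# abbreviated here, so use the full wording from Prompt 14 above in practice.
prompt14_template = (
    "Evaluate the overall quality of the following essay regarding the use of iPads in school "
    "based on the following 3 criteria: (1) Development of Ideas, ...; (2) Organization, ...; "
    "(3) Language Facility and Convention, ....\n\n"
    "{fewshot_examples}\n\n"
    "Use a 7-point scale with higher scores indicating greater quality. "
    "Present your response as a numeric score."
)

prompt14 = prompt14_template.format(fewshot_examples=fewshot_examples)
reply = grade_essay(prompt14, essay_text)  # reuses the helper sketched earlier
```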
Comparing Prompts 13-15
Overall Results: Prompts 13-15
| Scores | Mean | SD | R² | RMSE |
|---|---|---|---|---|
| Human Scores (Out of 7) | 4.03 | 1.13 | | |
| Prompt 13 (Base) | 2.10 | 0.90 | 0.13 | 2.26 |
| Prompt 14 (Few-shot) | 2.23 | 0.98 | 0.21 | 2.11 |
| Prompt 15 (Few-shot + CoT) | 3.42 | 1.49 | 0.35 | 1.37 |
Prompt 15 (Few-shot + CoT) demonstrates the highest agreement with human scores (R² = 0.35), followed by Prompt 14 (Few-shot) (R² = 0.21) and Prompt 13 (Base) (R² = 0.13). This indicates that incorporating few-shot examples and chain-of-thought (CoT) reasoning improves the alignment between model-generated scores and human evaluations. The trend is further supported by the increase in average scores as the complexity of the prompting technique increases: 2.10 for Prompt 13 (Base), 2.23 for Prompt 14 (Few-shot), and 3.42 for Prompt 15 (Few-shot + CoT), moving closer to the human average score. Bar charts also visually confirm this trend, showing an increasing prevalence of scores in the 4-7 range as prompting complexity increases.
However, it turns out that Prompt 15 (Few-shot + CoT) occasionally generates scores of 8, exceeding the maximum score of 7. We can see this in the cross-tabulation of its scores against the human scores:
| human_score \ prompt15_fewshot_cot | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 20 | 20 | 0 | 5 | 0 | 2 | 1 | 0 |
| 2 | 35 | 172 | 1 | 7 | 0 | 0 | 0 | 0 |
| 3 | 14 | 369 | 7 | 64 | 12 | 3 | 0 | 0 |
| 4 | 6 | 429 | 22 | 444 | 133 | 54 | 0 | 4 |
| 5 | 0 | 84 | 9 | 311 | 130 | 92 | 4 | 14 |
| 6 | 0 | 16 | 1 | 66 | 35 | 52 | 6 | 6 |
| 7 | 0 | 3 | 0 | 7 | 6 | 13 | 2 | 6 |
This anomaly artificially inflates the average score and increases variability for this prompt. Despite this issue, the overall trend suggests that few-shot examples and CoT reasoning can enhance the model’s ability to align with human evaluations, highlighting the potential of these techniques in improving the accuracy and reliability of automated scoring systems.
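For completeness, the cross-tabulation above is just a pandas crosstab, and a simple guard against out-of-range scores is to flag and clip anything outside 1-7 before computing summary statistics; the column names in this sketch are our own.

```python
import pandas as pd

# df is assumed to hold human_score and the parsed Prompt 15 scores in column prompt15.
print(pd.crosstab(df["human_score"], df["prompt15"]))

# Flag and clip anything outside the requested 1-7 range before summarizing.
out_of_range = ~df["prompt15"].between(1, 7)
print(f"{out_of_range.sum()} essays received an out-of-range score")
df["prompt15_clipped"] = df["prompt15"].clip(lower=1, upper=7)
```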
Conclusion
We collect all our above investigations into a single table:
| Scores | Mean | SD | R² | RMSE |
|---|---|---|---|---|
| Human Scores (Out of 7) | 4.05 | 1.12 | | |
| Prompt 1 (Out of 5) x 1.4 | 1.97 | 0.69 | 0.30 | 2.28 |
| Prompt 2 (Out of 7) | 2.03 | 0.53 | 0.35 | 2.22 |
| Prompt 3 (Out of 10) x 0.7 | 1.76 | 0.50 | 0.37 | 2.46 |
| Prompt 4 (Elementary School Student) | 3.24 | 0.93 | 0.41 | 1.20 |
| Prompt 5 (College Student) | 2.42 | 0.69 | 0.37 | 1.85 |
| Prompt 6 (Graduate Student) | 2.34 | 0.67 | 0.35 | 1.93 |
| Prompt 7 (Struc., Content, Arg. Strength, & Writing Qual.) | 1.90 | 0.49 | 0.36 | 2.33 |
| Prompt 8 (Grammar, Coherence, Org., & Relevance) | 2.03 | 0.44 | 0.29 | 2.23 |
| Prompt 9 (Dev. of Ideas, Org., & Lan. Fac. and Convention) | 1.99 | 0.47 | 0.34 | 2.25 |
| Prompt 10 (Numeric) | 2.66 | 0.94 | 0.32 | 1.70 |
| Prompt 11 (JSON) | 2.80 | 0.68 | 0.31 | 1.55 |
| Prompt 12 (Likert Scale) | 2.22 | 0.63 | 0.25 | 2.07 |
| Prompt 13 (Base) | 2.10 | 0.90 | 0.13 | 2.27 |
| Prompt 14 (Few-shot) | 2.23 | 0.97 | 0.22 | 2.12 |
| Prompt 15 (Few-shot + CoT) | 3.44 | 1.48 | 0.34 | 1.37 |
Essay grading with ChatGPT presents exciting possibilities for the field of education, but it’s clear that a nuanced approach is needed to unlock its full potential. Our investigation has revealed that prompt design impacts the outcomes of ChatGPT’s essay evaluations. We also find that we never achieved a particularly high \(R^2\) value, and the RMSEs (an estimate of how many points off we tend to be from the human scores) are generally high considering the 7-point scale. Overall, ChatGPT also seems to be a harsh grader! No prompt, even one saying we are grading elementary school essays, gave scores similar on average to the true average of about 4.
Key findings:
Prompting Strategy Matters: The choice of prompt significantly influences the model’s performance. Specifically, Prompt 4, designed for elementary school students, outperformed other prompts, while Prompt 13, the base prompt, had the lowest agreement with human scores.
Scoring Scales Matter (Prompts 1-3): The choice of scoring scale (e.g., 5-point, 7-point, 10-point) influences essay scores. Prompts with wider scoring ranges tend to align more closely with human judgments, suggesting the need for further exploration into optimal scale selection (e.g., out of 50, 100, etc.).
Potential of Tailored Grading (Prompts 4-6): Assigning ChatGPT a specific role, such as “grader for elementary school students,” can improve the accuracy of its evaluations. However, more research is needed to fully understand the impact of role assignment on grading essays from different educational levels (e.g., college or graduate students).
Grading Criteria vs. ChatGPT’s Internal Criteria (Prompts 7-9): Explicitly stating grading criteria in prompts without providing a detailed rubric did not significantly affect scores. This suggests that ChatGPT may rely on its own internal criteria, producing similar results regardless of the prompt’s wording.
Output Format Matters (Prompts 10-12): The choice of output format (numeric score, JSON object, Likert scale) impacts scores, even when the scale remains the same. Careful consideration is needed when selecting the output format to ensure consistent and meaningful results.
Advanced Techniques Work, but Require Refinement (Prompts 13-15): While techniques like few-shot learning and chain-of-thought reasoning show promise in improving alignment with human evaluations, results can vary. Kim et al. (2024) found that these techniques can be beneficial, but our investigation reveals that the base model sometimes outperforms them, particularly in assessing writing quality. Further research is necessary to refine these techniques and achieve more consistent assessments.
We Need Human Expertise: Our findings highlight the importance of human judgment in essay evaluation. While ChatGPT can be a valuable tool, its sensitivity to prompt wording necessitates careful design and continuous human oversight to ensure accurate and consistent grading.
In conclusion, thoughtful prompt design is key to maximizing ChatGPT’s potential in educational assessment. By selecting appropriate scoring scales, tailoring prompts to specific roles or audiences, and carefully incorporating advanced techniques like few-shot and CoT, we can use this technology to streamline grading processes and alleviate the burden on teachers and researchers. However, it is crucial to remember that ChatGPT is not a replacement for human expertise, but rather a tool to augment it. The collaboration between humans and AI in essay evaluation can lead to a more efficient, effective, and equitable assessment process.
References
- Kim, Y., Mozer, R., Miratrix, L., & Al-Adeimi, S. (2024). ChatGPT vs. Machine Learning: Assessing the Efficacy and Accuracy of Large Language Models for Automated Essay Scoring (in preparation).
- Wu, T., He, S., Liu, J., Sun, S., Liu, K., Han, Q.-L., & Tang, Y. (2023). A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica, 10(5), 1122–1136. https://doi.org/10.1109/JAS.2023.123618
Appendix (Essay Examples for Few-shot Approach)
fewshot_examples = """
<Few-shot Examples>
Text1: I think that those kids are mine were are
that parents are worried about other kids online now. Many.
Overall Quality - 1
Text2: yes because you can study more and do the homeworks
like math and science for science test.
Overall Quality - 2
Text3: I don't like the Ipads, they're too easy to break.
And I don't use it. So it's kind a useless
We have used them only maybe once or twice.
Also when we do use them it's only for a couple of minutes.
So I think that they are not really necessary to learn.
I think that you can do everything you can do on a Ipad on a computer.
Overall Quality - 3
Text4: The reason why you should use Ipad
because you can use it to help you with school.
And help with word with online dictionaries.
But the that post mean videos will lose the Ipad for one or two weeks.
It depend how bad they make the video be.
If they don't stop doing
it can lead to getting suspended or lose the Ipad
for a long time. And by using the Ipads
it will make the students very smart.
And will the student understand more thing in a advanced way.
While they're using their Ipads for education
they can have fun with the Ipads to play math games.
And in the morning you can play games on the Ipad.
It will be good to have Ipads in the school.
Overall Quality - 4
Text5: I believe that Ipads should be used in school because
when kids get older they should be able to know how to use it
because when kids get older a lot of jobs use Ipads or tablets.
Also I believe that you can find things for learning
more online than on books and on paper.
just because some kids do stupid things
online doesn't mean all the kids should suffer.
Also a lot of things are not on books anymore than <what is> online.
Some kids have a bad spelling problem
and on a tablet or Ipad have work check and it helps a lot.
Also it make school more fun for kids,
I know if I didn't like school we had Ipads or tablets to do work.
It would make me want to learn.
that's why I believe we should have Ipads or tablets in our school .
Overall Quality - 5
Text6: In my opinion I believe
that students should totally be able to use the Ipod or Ipad.
My reason for this statement is
because I believe that it is a better way to learn.
Now in days we have Google_Chrome
and kids can basically find all they need on the Internet.
Keep in mind that it two thousand fourteen
we have Ipads and tablets so why not use it.
Also you may even play cool math games and educational games.
Another reason why is because kids can also find words
and use words they have never seen on the Internet.
it's basically a fun but cool way.
Not only that but kids now and
days just would prefer the Ipod or Ipad over the computer.
So just think about it also the computer is just as bad really
because either way you can still post negative comments.
Also to solve the problems about negative comments
you can just simply block websites like Facebook, Instagram,
and Vine so that the people's comments cannot stop the other kids
from having fun and learning on top of that.
Why should one group of kids stop the other kids from learning.
So yes I disagree.
Overall Quality - 6
Text7: I believe all students should have
the opportunity to use Ipads while in school.
Without these electronic devices some students
may have trouble finding access
to other electronic devices
in order to complete homework and school work.
I also believe that Ipads will be beneficial in a classroom
because it will let students research topics
and that research may be needed for an in school project.
These are some of the pros to having access to Ipads during school.
Although there are many pros to the Ipads there are also a few cons.
Such as the Ipad being distracting for students.
The students with Ipads may be playing games
or looking things up on the Internet
that has nothing to do with the classroom topic.
Also the students may be posting harsh comment geared
towards other students on social media websites during class or at home.
All of these problem can be fixed easily.
To cut down on the number of students playing games
during class just have the student keep the Ipads off and in their bags
or under their seat until the teacher instructs them
to take them out and use the Ipad for a certain purpose.
This reduce the number of cruel comments being posted during class.
Another solution to stop mean comments from going viral at home is
to have the students return the Ipads to a cart
at the end of the day and receive them again in the morning.
Overall there are pros and cons to classroom Ipads.
But the cons can be fixed with simple rules.
This is way I believe the Ipads are an asset to the classroom.
Overall Quality - 7
"""