Scoring Student Essays by Matching a Model Answer against Student Answers
Here we give an overview of the scoring algorithm. The instructor prepares a model answer containing the core knowledge required to achieve a 100% score; the system can also score a student essay against several model answers, in case the instructor wishes to use multiple content models. The instructor additionally provides 100 human-graded essays and their scores for training purposes. The model and training answers are then processed as follows. The system performs a content-matching task in which the model answer's content summary is compared against each training essay's content summary. Many aspects of the relationships between the model and training essays are then computed, and a linear regression model is fitted to derive a scoring equation. Unmarked student essays are then processed to build their content summaries, and finally the scoring equation is used to produce a score for each essay.
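The training and scoring steps above can be sketched as a small regression pipeline. This is a minimal illustration, assuming content-match features have already been extracted per essay; the feature values, scores, and function names below are invented, not the system's actual implementation.

```python
import numpy as np

def fit_scoring_equation(features, human_scores):
    """Fit a linear regression mapping content-match features
    (hypothetical, one row per training essay) to human scores."""
    X = np.hstack([np.ones((features.shape[0], 1)), features])  # add intercept
    coeffs, *_ = np.linalg.lstsq(X, human_scores, rcond=None)
    return coeffs

def score_essay(essay_features, coeffs):
    """Apply the fitted scoring equation to one unmarked essay."""
    return float(coeffs[0] + essay_features @ coeffs[1:])

# Toy training data: five essays, two content-match features each.
train_X = np.array([[0.2, 0.3], [0.5, 0.1], [0.7, 0.6], [0.9, 0.4], [0.3, 0.8]])
train_y = np.array([23.0, 23.0, 42.0, 40.0, 40.0])  # human scores out of 54
w = fit_scoring_equation(train_X, train_y)
print(round(score_essay(np.array([0.6, 0.5]), w), 1))  # prints 37.0
```

In practice the feature vector would hold the "many aspects of the relationships" between an essay's content summary and the model answer's; the regression then weights those aspects to reproduce the human grades.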
Comparison of Human and System Scores
In a large-scale test of the system, 390 essays handwritten by year 10 high school students on the topic of "The School Leaving Age" were transcribed to Microsoft Word document format. These essays were graded on a number of categories by three different human graders. The essays and scores were forwarded to the system development team for processing. A model answer was chosen from amongst the essays by selecting the essay with the highest average score given by the three human graders; this essay had a score of 48.5 out of a possible 54, an overall score of 90%.
Figure 1 shows the variation between the first two graders on the essays. The essay scores are arranged in ascending order of one of the human-assigned grades. Note the substantial disagreement in the scores for some essays.
Figure 1: Comparison of Two Human Graders' Scores on Essays
The mean score given by grader AS was 29.40, while the mean given by grader JB was 30.80, a difference of 1.40. The correlation between these two graders was 0.80. After the scoring algorithm was built using 100 training essays, the remaining 289 essays were scored by MarkIT. Figure 2 shows the results, arranged in ascending order of the computer-assigned score.
Figure 2: Results of Computer Scoring of 289 Essays vs Average Human
The mean of the averaged human grades for these 289 essays was 30.36, while the mean computer-assigned grade was 29.68, a difference of 0.68. The correlation between the human and computer grades was 0.81.
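Agreement statistics of this kind are straightforward to compute. The sketch below shows the two measures used above, mean difference and Pearson correlation; the score lists are invented toy values, not the study's data.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two sets of scores."""
    return float(np.corrcoef(np.asarray(a, float), np.asarray(b, float))[0, 1])

# Toy scores for five essays (not the study's data).
human = [28.0, 35.0, 22.0, 40.0, 31.0]
computer = [27.0, 36.0, 24.0, 38.0, 30.0]

print(round(abs(np.mean(human) - np.mean(computer)), 2))  # mean difference
print(round(pearson(human, computer), 2))                 # correlation
```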
The computer-human agreement was close to the agreement between the human graders themselves, and the error rates were similar. We conclude that, in this particular test, the system performed as well as the human graders.
A key strength of the Blue Wren system over its counterparts is its emphasis on feedback beyond a single grade or number. The report covers numerous aspects of an assessed essay that are useful to students from an improvement point of view: spelling, grammar, reading ease, and grade-level statistics. These data are derived from existing technology and are incorporated by MarkIT into a comprehensive report on each essay. Essay content relative to the model answer is presented as a graph of concepts, juxtaposing the student answer's concept counts with those of the model answer. The graph is interactive: one can drill down to the thesaurus level and to the assignment level to discover where, and to some degree how, errors and omissions can be rectified.
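The juxtaposition of concept counts can be illustrated with a minimal sketch. The mini-thesaurus and essay texts below are invented for illustration; the real system's thesaurus and concept extraction are far richer.

```python
from collections import Counter

# Hypothetical mini-thesaurus mapping surface words to concept labels.
THESAURUS = {
    "school": "education", "learning": "education",
    "job": "employment", "work": "employment",
    "age": "maturity", "mature": "maturity",
}

def concept_counts(text):
    """Count concept occurrences for the essay words found in the thesaurus."""
    return Counter(THESAURUS[w] for w in text.lower().split() if w in THESAURUS)

model = concept_counts("School learning prepares students for work and age")
student = concept_counts("School matters but a job and another job beat mature study")

# Juxtapose model vs student counts per concept, as in the feedback graph.
for concept in sorted(set(model) | set(student)):
    print(concept, model.get(concept, 0), student.get(concept, 0))
```

Each printed row corresponds to one pair of bars in the feedback graph: a concept over-represented or missing in the student column signals an omission to investigate.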
The following figures show the system's visual feedback components, which are available to the teacher and student on completion of grading. Figures 3 and 4 show the essay selection screen, with essay identifiers in the left window, and grading reports showing reading ease and level, and spelling and grammatical errors.
Figure 3: Main Control Panel and Selected Essay Grading Report
Figure 4: Main Control Panel and Selected Essay Grading Report
When an essay identifier is selected, the screens shown in Figures 3 and 4 result. The upper window can be toggled via the tabs to display the Student essay or the Model essay. The lower window can be toggled via the tabs to show further features:
- the Graph of the concept counts for the Student and Model essays (Figure 5), and
- a Thesaurus entry, shown by clicking on a concept bar in Figure 5 (Figure 6).

Figure 5 presents a graph of the 'concepts' associated with both the model answer and the student answer. Naturally, the better the correspondence between the 'concepts' in both, the better the score.
Figure 5: Concept Frequencies
Figure 6: Thesaurus Entry for a Concept (Being)
Additional reports can be provided to show:
- the Graph of the Top Concepts covered by the student cohort (Figure 7),
- a Similarity Detection feature, which may be used to detect possible plagiarism (Figure 8).
Figure 7: Top Concept Frequencies
Figure 8: Similarity Detection
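Similarity detection of the kind shown in Figure 8 can be approximated with a word-set Jaccard measure: a high overlap between two essays' vocabularies flags a pair for human review. This is a generic sketch, not the system's actual algorithm.

```python
def jaccard(essay_a, essay_b):
    """Jaccard similarity between the word sets of two essays (0.0 to 1.0)."""
    sa, sb = set(essay_a.lower().split()), set(essay_b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

e1 = "the leaving age should be raised"
e2 = "the leaving age should be lowered"  # near-duplicate of e1
e3 = "students deserve a wider choice"    # unrelated essay

print(round(jaccard(e1, e2), 2))  # high similarity: flag for review
print(round(jaccard(e1, e3), 2))  # low similarity: no flag
```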