CheckMyC - Qualitative Code Evaluator Using Large Language Models

ROMANO, FEDERICO MARIA
2024/2025

Abstract

Traditional automated assessment of programming assignments relies on functional testing based on predefined inputs and expected outputs. While effective for detecting functional defects, this approach fails to measure qualitative aspects central to educational goals. This thesis investigates the reliability of large language models (LLMs) when used to support qualitative assessment of student-written source code in academic contexts. The study is guided by three research questions: (RQ1) the degree of correspondence between LLM and human evaluations; (RQ2) the stability and internal reliability of evaluations across repeated applications; and (RQ3) the systematic differences and comparative reliability among multiple LLMs under identical criteria. To address these questions, the thesis proposes CheckMyC, an evaluation framework designed to produce structured, traceable, and comparable qualitative assessments. The framework combines a topic-based rubric with constrained prompting strategies and rigid output schemas, enabling LLMs to generate localized evidence and per-topic scores while avoiding unstructured or purely subjective judgments. The methodology is implemented through a modular pipeline that performs repeated LLM-based evaluations across multiple models and programs, followed by post-processing and statistical analysis on a dataset of 44 C programs. The experimental results on this dataset show that high-capacity models such as Gemini 2.5 Pro achieve superior alignment with human judgment, reaching high detection ability for most of the defined topic conditions and high stability across repeated responses. In contrast, smaller architectures exhibit significant limitations due to omission bias and high hallucination rates. The study concludes that model architecture is the primary determinant of performance: while large models demonstrate the reliability necessary for student self-assessment and instructional support, smaller models remain unsuitable for academic grading due to stochastic noise and intrinsic consistency limitations.
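
The abstract mentions rigid per-topic output schemas and repeated evaluations whose stability is then analysed. As a purely illustrative aid, the following minimal Python sketch shows what such a schema and a simple run-to-run agreement check might look like; all names (Evidence, TopicScore, exact_agreement) and the example topics are hypothetical and are not taken from the thesis.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class Evidence:
        file: str        # source file the evidence refers to
        line: int        # line where the issue is located
        excerpt: str     # short code excerpt supporting the judgment

    @dataclass
    class TopicScore:
        topic: str                # rubric topic, e.g. "memory management"
        score: int                # per-topic score on a fixed ordinal scale
        evidence: List[Evidence]  # localized evidence backing the score

    def exact_agreement(runs: List[List[TopicScore]]) -> float:
        """Fraction of topics receiving the same score in every repeated run."""
        by_topic: Dict[str, List[int]] = {}
        for run in runs:
            for ts in run:
                by_topic.setdefault(ts.topic, []).append(ts.score)
        stable = sum(1 for scores in by_topic.values() if len(set(scores)) == 1)
        return stable / len(by_topic) if by_topic else 0.0

    # Three hypothetical repeated evaluations of one program on two topics.
    runs = [
        [TopicScore("pointers", 2, []), TopicScore("style", 3, [])],
        [TopicScore("pointers", 2, []), TopicScore("style", 3, [])],
        [TopicScore("pointers", 1, []), TopicScore("style", 3, [])],
    ]
    print(exact_agreement(runs))  # 0.5: "style" is stable across runs, "pointers" is not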
Files in this item:
thesis.pdf (open access), 2.22 MB, Adobe PDF

Users may download and share the full-text documents available in UNITESI UNIPV under the terms of the Creative Commons CC BY-NC-ND license.
For further information, or to check the availability of the file, write to: unitesi@unipv.it.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14239/34979