CheckMyC - Qualitative Code Evaluator Using Large Language Models

ROMANO, FEDERICO MARIA
2024/2025

Abstract

Traditional automated assessment of programming assignments relies on functional testing based on predefined inputs and expected outputs. While effective for detecting functional defects, this approach fails to measure qualitative aspects central to educational goals. This thesis investigates the reliability of large language models (LLMs) when used to support qualitative assessment of student-written source code in academic contexts. The study is guided by three research questions: (RQ1) the degree of correspondence between LLM and human evaluations; (RQ2) the stability and internal reliability of evaluations across repeated applications; and (RQ3) the systematic differences and comparative reliability among multiple LLMs under identical criteria. To address these questions, the thesis proposes CheckMyC, an evaluation framework designed to produce structured, traceable, and comparable qualitative assessments. The framework combines a topic-based rubric with constrained prompting strategies and rigid output schemas, enabling LLMs to generate localized evidence and per-topic scores while avoiding unstructured or purely subjective judgments. The methodology is implemented through a modular pipeline that performs repeated LLM-based evaluations across multiple models and programs, followed by post-processing and statistical analysis on a dataset of 44 C programs. The experimental results on this dataset show that high-capacity models such as Gemini 2.5 Pro achieve superior alignment with human judgment, reaching high detection ability for most of the defined topic conditions and high stability across repeated responses. In contrast, smaller architectures exhibit significant limitations due to omission bias and high hallucination rates. The study concludes that model architecture is the primary determinant of performance: while large models demonstrate the reliability necessary for student self-assessment and instructional support, smaller models remain unsuitable for academic grading due to stochastic noise and intrinsic consistency limitations.
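
The abstract mentions rigid per-topic output schemas and repeated evaluations whose stability is then analysed. As a purely illustrative aid, the following minimal Python sketch shows what such a schema and a simple run-to-run agreement check might look like; all names (Evidence, TopicScore, exact_agreement) and the example topics are hypothetical and are not taken from the thesis.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class Evidence:
        file: str        # source file the evidence refers to
        line: int        # line where the issue is located
        excerpt: str     # short code excerpt supporting the judgment

    @dataclass
    class TopicScore:
        topic: str                # rubric topic, e.g. "memory management"
        score: int                # per-topic score on a fixed ordinal scale
        evidence: List[Evidence]  # localized evidence backing the score

    def exact_agreement(runs: List[List[TopicScore]]) -> float:
        """Fraction of topics receiving the same score in every repeated run."""
        by_topic: Dict[str, List[int]] = {}
        for run in runs:
            for ts in run:
                by_topic.setdefault(ts.topic, []).append(ts.score)
        stable = sum(1 for scores in by_topic.values() if len(set(scores)) == 1)
        return stable / len(by_topic) if by_topic else 0.0

    # Three hypothetical repeated evaluations of one program on two topics.
    runs = [
        [TopicScore("pointers", 2, []), TopicScore("style", 3, [])],
        [TopicScore("pointers", 2, []), TopicScore("style", 3, [])],
        [TopicScore("pointers", 1, []), TopicScore("style", 3, [])],
    ]
    print(exact_agreement(runs))  # 0.5: "style" is stable across runs, "pointers" is not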
Files in this item:
thesis.pdf (open access), 2.22 MB, Adobe PDF

Users may download and share the full-text documents available in UNITESI UNIPV under the terms of the Creative Commons CC BY-NC-ND license.
For further information, or to check the availability of the file, write to: unitesi@unipv.it.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14239/34979