| Title | Automated Test Generation and Marking Using LLMs |
| Publication Type | Journal Article |
| Year of Publication | 2025 |
| Authors | Papachristou I, Dimitroulakos G, Vassilakis C |
| Journal | Electronics |
| Volume | 14 |
| Pagination | 2835 |
| ISSN | 2079-9292 |
| Keywords | artificial intelligence, DeepSeek, exams, large language models, Llama, named entity recognition, natural language processing, question–answer generation, retrieval-augmented generation, semantic similarity, test generation, test marking |
| Abstract | This paper presents an innovative exam-creation and grading system powered by advanced natural language processing and local large language models. The system automatically generates clear, grammatically accurate questions from both short passages and longer documents across different languages, supports multiple formats and difficulty levels, and ensures semantic diversity while minimizing redundancy, thereby maximizing the proportion of the source material covered by the generated exam paper. For grading, it employs a semantic-similarity model to evaluate essays and open-ended responses, awards partial credit, and mitigates bias from phrasing or syntax via named entity recognition. A major advantage of the proposed approach is its ability to run entirely on standard personal computers, without specialized artificial intelligence hardware, promoting privacy and exam security while maintaining low operational and maintenance costs. Moreover, its modular architecture allows the seamless swapping of models with minimal intervention, ensuring adaptability and the easy integration of future improvements. A requirements-compliance evaluation, combined with established performance metrics, was used to review and compare two popular multilingual LLMs and monolingual alternatives, demonstrating the system's effectiveness and flexibility. The experimental results show that the system's grades fall within a 17% normalized error margin of human expert grades, with generated questions reaching up to 89.5% semantic similarity to the source content. The full exam generation and grading pipeline runs efficiently on consumer-grade hardware, with average inference times under 30 s. |
| DOI | 10.3390/electronics14142835 |
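The semantic-similarity grading with partial credit described in the abstract above can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: it assumes the `sentence-transformers` library, a multilingual encoder (`paraphrase-multilingual-MiniLM-L12-v2`, chosen arbitrarily for illustration), and a hypothetical `grade_answer` helper that maps cosine similarity to a score. The paper's NER-based bias mitigation step is omitted.

```python
# Illustrative sketch only: shows the general idea of grading an open-ended
# response by semantic similarity to a reference answer, with partial credit.
from sentence_transformers import SentenceTransformer, util

# Multilingual encoder chosen arbitrarily for this sketch; the paper compares
# multilingual and monolingual models but does not prescribe this one.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def grade_answer(reference: str, response: str, max_points: float = 10.0) -> float:
    """Hypothetical helper: award partial credit proportional to the cosine
    similarity between the reference answer and the student's response."""
    embeddings = model.encode([reference, response], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    # Clamp to [0, 1] so near-orthogonal answers cannot yield negative credit.
    return max(0.0, min(1.0, similarity)) * max_points

reference = "Photosynthesis converts light energy into chemical energy stored in glucose."
response = "Plants use sunlight to produce glucose, storing the energy chemically."
print(f"Awarded: {grade_answer(reference, response):.1f} / 10.0")
```

Note that an encoder of this size runs on CPU alone, which is consistent with the paper's claim that the pipeline operates on consumer-grade hardware without specialized AI accelerators.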