Master's Thesis · MSc Artificial Intelligence · 2025
Multi-Ontology Augmentation for LLM-Based Cooking Instruction
Does giving an LLM structured knowledge actually make it a better teacher? A multi-agent study of ontology-grounded models, and why the answer turns out to depend heavily on which model you use.
Degree
MSc Artificial Intelligence
Institution
Vrije Universiteit Amsterdam
First Supervisor
Jiahuan Pei
Second Reader
Ilias Gerostathopoulos
3 LLM providers compared
270 simulated teaching conversations
8 pedagogical dimensions evaluated
13.9% performance drop for Claude 3.5 Sonnet
The Problem
Smart, but ungrounded and unsafe
Large language models are increasingly used in educational settings, but they operate as black boxes: they hallucinate plausible-but-wrong information, lack grounding in specialised domain knowledge, and offer little of the transparency that verifiable, reliable instruction requires.
Cooking instruction is a sharp test case for this. Most prior work focuses on single-agent recipe generation, but real teaching is interactive and safety-critical: an ingredient substitution suggested without regard to allergens or food safety isn't just unhelpful, it's dangerous. I wanted to know whether grounding an LLM in structured knowledge sources could improve both its reliability and its effectiveness as a teacher.
What I Built
A multi-agent, ontology-grounded teaching system
I designed a multi-agent conversational AI system in which two LLMs interact as a chef and a trainee, generating structured teaching dialogues. On top of this, I built an ontology-driven ingredient-substitution mechanism that combines the FoodOn ontology with USDA nutritional databases, grounding the model's suggestions in verified, structured data rather than free-form generation.
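A minimal sketch of the substitution idea, under stated assumptions: a toy in-memory table stands in for FoodOn class membership and USDA nutrient values. The ingredient entries, class labels, nutrient figures, allergen tags, and the `suggest_substitutes` function are all hypothetical illustrations, not the thesis implementation or real FoodOn/USDA data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Ingredient:
    name: str
    parent_class: str   # stand-in for an ontology class (hypothetical label)
    protein_g: float    # per 100 g, illustrative value only
    fat_g: float        # per 100 g, illustrative value only
    allergens: frozenset = field(default_factory=frozenset)


# Toy knowledge base: NOT real FoodOn or USDA data.
KB = [
    Ingredient("butter", "solid_fat", 0.9, 81.0, frozenset({"milk"})),
    Ingredient("ghee", "solid_fat", 0.3, 99.0, frozenset({"milk"})),
    Ingredient("coconut oil", "solid_fat", 0.0, 99.0),
    Ingredient("margarine", "solid_fat", 0.2, 80.0),
    Ingredient("olive oil", "liquid_fat", 0.0, 100.0),
]


def nutrient_distance(a, b):
    """Sum of absolute differences over the two toy nutrient fields."""
    return abs(a.protein_g - b.protein_g) + abs(a.fat_g - b.fat_g)


def suggest_substitutes(target_name, avoid_allergens, kb=KB):
    """Return names of same-class ingredients free of the given allergens,
    ordered from nutritionally closest to furthest."""
    by_name = {item.name: item for item in kb}
    target = by_name[target_name]
    avoid = frozenset(avoid_allergens)
    candidates = [
        item for item in kb
        if item.name != target.name
        and item.parent_class == target.parent_class  # ontology constraint
        and not (item.allergens & avoid)              # safety constraint
    ]
    candidates.sort(key=lambda item: nutrient_distance(target, item))
    return [item.name for item in candidates]


print(suggest_substitutes("butter", {"milk"}))  # ['margarine', 'coconut oil']
```

The point of the sketch is the ordering of concerns: structured constraints (shared class, allergen exclusion) filter the candidates before any ranking happens, so an unsafe option never reaches the model's reply.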
To measure the effect rigorously, I built a deterministic benchmarking and LLM-as-judge evaluation pipeline and ran a controlled comparison of baseline (LLM-only) versus ontology-augmented approaches across three providers: GPT-4 Mini, Claude 3.5 Sonnet, and Grok 3 Mini. The comparison covers 270 simulated teaching conversations that vary conversation type and user experience level, each evaluated on eight pedagogical dimensions. The framework also tracks tool usage and produces structured analytics for comparing knowledge-augmented LLM systems.
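The shape of such a paired, reproducible comparison can be sketched as follows. This is an illustration under assumptions, not the thesis pipeline: the provider and dimension names are placeholders, and `stub_judge` is a hypothetical stand-in that returns a seeded pseudo-random score where a real pipeline would call an LLM judge.

```python
import random
from collections import defaultdict
from statistics import mean

PROVIDERS = ["provider_a", "provider_b"]     # placeholder names
CONDITIONS = ["baseline", "ontology"]        # LLM-only vs ontology-augmented
DIMENSIONS = ["dimension_1", "dimension_2"]  # placeholder dimension names


def stub_judge(provider, condition, dimension, seed):
    """Hypothetical stand-in for an LLM judge: a reproducible score in [0, 5]."""
    rng = random.Random(f"{provider}|{condition}|{dimension}|{seed}")
    return round(rng.uniform(0, 5), 2)


def run_benchmark(n_conversations, judge=stub_judge):
    """Score every (provider, condition, dimension) cell over the same seeds
    and return the mean score per cell."""
    scores = defaultdict(list)
    for provider in PROVIDERS:
        for condition in CONDITIONS:
            for seed in range(n_conversations):
                for dimension in DIMENSIONS:
                    scores[(provider, condition, dimension)].append(
                        judge(provider, condition, dimension, seed)
                    )
    return {cell: mean(values) for cell, values in scores.items()}


results = run_benchmark(n_conversations=5)
for cell, score in sorted(results.items()):
    print(cell, round(score, 2))
```

Reusing the same seeds for both conditions keeps each baseline run paired with its ontology-augmented counterpart, and seeding the stub from its inputs makes repeated runs return identical aggregates.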
Key Findings
Knowledge integration helps, but not universally
The headline result is that ontology grounding produced provider-specific outcomes rather than a universal improvement. The same structured knowledge made one model better and another markedly worse:
- GPT-4 Mini showed minimal change with ontology integration (−0.2% to +3.1% across pedagogical metrics), and Grok 3 Mini stayed similarly stable (−3.2% to +0.7%).
- Claude 3.5 Sonnet, by contrast, experienced a 13.9% overall performance drop, concentrated in ingredient-substitution accuracy (−17.7%) and safety risk management (−6.6%).
- Safety Risk Management was the single weakest dimension across every provider (scoring just 2.25 to 2.56 on a 0–5 scale), a consistent blind spot regardless of model.
- Each provider showed a distinct pedagogical 'personality': GPT-4 Mini adapts its readability to the learner, Claude 3.5 Sonnet uses sophisticated vocabulary suited to advanced users, and Grok 3 Mini maintains consistent simplicity across experience levels.
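For readers who want to reproduce this kind of comparison, the percentage figures above can be read as relative changes in mean score between the baseline and ontology-augmented conditions. That reading is an assumption, since the summary does not state the formula, and the scores below are made-up numbers on a 0–5 scale, not thesis results.

```python
def relative_change_pct(baseline, augmented):
    """Relative change from baseline to augmented, in percent."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (augmented - baseline) / baseline * 100.0


# Hypothetical (baseline, augmented) mean scores per dimension.
scores = {
    "dimension_a": (4.00, 3.50),
    "dimension_b": (2.50, 2.55),
}

changes = {
    name: round(relative_change_pct(base, aug), 1)
    for name, (base, aug) in scores.items()
}
print(changes)  # {'dimension_a': -12.5, 'dimension_b': 2.0}
```

A negative value means the augmented condition scored below the baseline on that dimension; a value near zero corresponds to the "minimal change" pattern described for the stable providers.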
Takeaways
Why this matters for deploying educational AI
These findings challenge the common assumption that adding structured knowledge will reliably make an LLM more capable. Instead, the effect of ontology grounding depends on the underlying model, so educational-AI systems need provider-specific integration strategies, not a one-size-fits-all knowledge layer.
The work also surfaces safety as the consistent weak point of these systems, and contributes reusable tooling (automated evaluation and structured analysis for comparative assessment of knowledge-augmented LLMs) for the broader study of when, and for whom, grounding actually helps.