Modeling the Formula 1 Universe: Ontology, Databases, and the Question of Provenance

Designing a Formula 1 ontology means deciding what truly counts as knowledge in motorsport. In this article, I explore why F1 needs an ontology-driven knowledge base, how to balance graphs with tabular telemetry, and why provenance and “greatest ever” questions really matter.

Lukas Raich, CC BY-SA 4.0, via Wikimedia Commons

In my last article, I outlined the requirements for designing a domain-specific ontology, using Formula 1 as a case study. There, I defined the ontology’s purpose and the kinds of questions it should answer, and discussed challenges such as data sparsity and completeness.

Ultimately, my primary goal in developing an ontology for Formula 1 is to build an authoritative search engine and knowledge base for motorsport as a whole. This engine should become a definitive resource: a hybrid between Wikipedia’s open, contributor-driven model and the verified accuracy of official sources such as the Formula 1 and Ferrari websites.

Before moving toward implementation, however, these ideas deserve deeper scrutiny. Why should an ontology serve as the foundational architecture for such a search engine?

Ontology Development 101: A Guide to Creating Your First Ontology, by Natalya Noy and Deborah McGuinness, offers valuable insight here. It outlines several key motivations for developing an ontology:

Sharing a common understanding of information structures among people or software agents.
Enabling the reuse of domain knowledge.
Making domain assumptions explicit.
Separating domain knowledge from operational knowledge.
Analyzing domain knowledge systematically.

These principles lie at the heart of the Semantic Web, whose goal is to make information reusable, discoverable, and meaningful to both humans and machines.

What I did not explore previously, however, were the more technical and philosophical dimensions of designing an ontology for Formula 1. Consider the seemingly simple statement: “Lewis Hamilton is a driver for Ferrari.” At first glance, it appears unambiguous, yet it is dense with implicit meaning often overlooked in everyday language.

First, there is a temporal dimension. The statement is implicitly about the present: it asserts that Lewis Hamilton drives for Ferrari now, while saying nothing about past or hypothetical states of the world.

Second, in ordinary language, we describe entities in terms of who and what. Ontology design, on the other hand, requires breaking these entities into their most fundamental, data-centric representations. We are less concerned with who Hamilton is as a person and more concerned with what he is (a driver) and when he occupies that role. This way of thinking enables more sophisticated temporal modeling: how long Hamilton has driven for Ferrari, how many championships he had by 2017, or how long Sauber carried BMW branding.
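
To make the temporal point concrete, here is a minimal Python sketch of a time-scoped role. The class and function names (DrivesFor, team_in, titles_by) are my own illustrations rather than part of any existing ontology or library, and I use whole seasons instead of exact dates to keep it short. The stints and title years are Hamilton's well-known career record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrivesFor:
    """A driver-team relationship that only holds for a span of seasons."""
    driver: str
    team: str
    first_season: int
    last_season: Optional[int] = None  # None means the stint is still open

    def holds_in(self, season: int) -> bool:
        # An open-ended stint is treated as valid for any later season.
        end = self.last_season if self.last_season is not None else season
        return self.first_season <= season <= end


# Illustrative data: Hamilton's stints and championship-winning seasons.
STINTS = [
    DrivesFor("Lewis Hamilton", "McLaren", 2007, 2012),
    DrivesFor("Lewis Hamilton", "Mercedes", 2013, 2024),
    DrivesFor("Lewis Hamilton", "Ferrari", 2025),
]
TITLE_SEASONS = {"Lewis Hamilton": [2008, 2014, 2015, 2017, 2018, 2019, 2020]}


def team_in(driver: str, season: int) -> Optional[str]:
    """Return the team the driver raced for in the given season, if any."""
    for stint in STINTS:
        if stint.driver == driver and stint.holds_in(season):
            return stint.team
    return None


def titles_by(driver: str, season: int) -> int:
    """Count championships won up to and including the given season."""
    return sum(1 for year in TITLE_SEASONS.get(driver, []) if year <= season)


print(team_in("Lewis Hamilton", 2017))    # Mercedes
print(titles_by("Lewis Hamilton", 2017))  # 4
```

The plain statement "Hamilton is a driver for Ferrari" becomes a record with a start and an optional end, so the same structure can answer both "who does he drive for now?" and "who did he drive for then?".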

This perspective quickly raises deeper questions about scope and fit. How far should the ontology reach? Should it capture lap-by-lap telemetry data, or is that beyond its intended scope? Ontologies excel at defining concepts, hierarchies, and relationships—the semantic structure of meaning and how things interrelate within a defined universe. Attempting to merge these abstractions with granular, time-dependent data like lap timings risks creating a Frankenstein model—an architecture caught between incompatible paradigms.

In reality, there may be no way to escape the use of tabular data. Projects such as FastF1 and OpenF1 already provide structured telemetry in tabular form, well-suited for time-series analysis.

Take, for example, qualifying for the 1997 European Grand Prix, where Jacques Villeneuve, Michael Schumacher, and Heinz-Harald Frentzen each set an identical time of 1:21.072, with pole going to Villeneuve because he set the time first. How could an ontology represent and retrieve such an anomaly? If the framework cannot accommodate temporal or numerical data like lap times effectively, perhaps a conventional database, optimized for structured, time-dependent data, would be more appropriate.
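
As a rough illustration of why this kind of question sits comfortably in tabular form, the sketch below stores the three times as integer milliseconds and looks for ties. The helper names (to_ms, tied_times, grid_order) and the set_order column are my own assumptions about how such a table might look; only the drivers, the shared 1:21.072 time, and the resulting grid order come from the actual session.

```python
from collections import defaultdict


def to_ms(lap: str) -> int:
    """Convert 'M:SS.mmm' to integer milliseconds (assumes three decimals)."""
    minutes, rest = lap.split(":")
    seconds, millis = rest.split(".")
    return (int(minutes) * 60 + int(seconds)) * 1000 + int(millis)


# (driver, best qualifying time, order in which the time was set)
# The set_order column is a hypothetical tie-breaker field.
QUALIFYING = [
    ("Jacques Villeneuve", "1:21.072", 1),
    ("Michael Schumacher", "1:21.072", 2),
    ("Heinz-Harald Frentzen", "1:21.072", 3),
]


def tied_times(rows):
    """Group drivers by lap time and keep only times shared by several."""
    by_time = defaultdict(list)
    for driver, lap, _order in rows:
        by_time[to_ms(lap)].append(driver)
    return {ms: drivers for ms, drivers in by_time.items() if len(drivers) > 1}


def grid_order(rows):
    """Sort by lap time, breaking ties by who set the time first."""
    ranked = sorted(rows, key=lambda r: (to_ms(r[1]), r[2]))
    return [driver for driver, _lap, _order in ranked]


print(tied_times(QUALIFYING))
print(grid_order(QUALIFYING))
```

Storing times as integers avoids floating-point comparisons, which matters when the interesting fact is exact equality. The ontology would then only need to point at the session, not hold every lap.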

This distinction raises a fundamental boundary question: what should an ontology model, and what should remain within a relational or time-series database? Furthermore, how can we answer complex, interpretive questions such as “Who was the greatest team principal?” or, more specifically, “Who was the best team principal within Ferrari?” (it's Jean Todt, by the way).

Tony Harrison, CC BY-SA 2.0, via Wikimedia Commons

Current search engines like Google and Perplexity tend to surface subjective content—Reddit threads, rankings, or opinion articles—rather than structured, data-driven reasoning. To model such a question computationally, we would need clearly defined rules. Do we measure “best” by total team wins under a principal’s tenure? By championships earned? By driver performance within that same period?
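
One way to see how much the answer depends on the rule we pick is to rank the same tenures under several metrics. The sketch below does exactly that. Everything in it is hypothetical: the Tenure fields, the metric names, and above all the figures, which are placeholders I invented to show the mechanics and not real statistics for any team principal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tenure:
    """Aggregate results for one principal's time in charge."""
    principal: str
    races: int
    wins: int
    constructors_titles: int


# Placeholder figures only; these are NOT real statistics.
TENURES = [
    Tenure("Principal A", races=200, wins=60, constructors_titles=3),
    Tenure("Principal B", races=80, wins=40, constructors_titles=2),
    Tenure("Principal C", races=150, wins=30, constructors_titles=4),
]

# Each metric is one explicit definition of "best".
METRICS = {
    "total_wins": lambda t: t.wins,
    "win_rate": lambda t: t.wins / t.races,
    "constructors_titles": lambda t: t.constructors_titles,
}


def best_by(metric: str) -> str:
    """Return the principal who ranks first under the chosen metric."""
    return max(TENURES, key=METRICS[metric]).principal


for name in METRICS:
    print(name, "->", best_by(name))
```

With these placeholder numbers, each of the three metrics crowns a different principal, which is the point: the query only becomes answerable once the definition of "best" is written down.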

Ontology design, by necessity, forces abstraction and explicit logic around questions that humans answer intuitively. It turns subjective curiosity into structured reasoning.

These considerations bring us back to a foundational inquiry: what framework best supports the representation and reasoning required for this ontology? Which relationships and properties must be formalized, and which can remain implicit?

An equally pressing concern is provenance. Consider the 2020 championship-winning Mercedes-AMG F1 W11 EQ Performance. Much of the data associated with this car now points to defunct or inaccessible sources. This raises questions of trust and reliability: how much should the origin of a data point influence its inclusion in the ontology? Should details like the car’s designers come from Wikipedia, or only from its original citations—even when those sources are offline or outdated?
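
To think through what provenance could look like in practice, here is a small sketch in which every claim carries its sources and a status for each one. The types (Source, Claim, SourceStatus), the trust labels, and the policy in trust_level are assumptions of mine, one possible rule among many. The URLs and the designer value are deliberate placeholders, not real data about the W11.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class SourceStatus(Enum):
    LIVE = "live"          # still reachable at its original address
    ARCHIVED = "archived"  # reachable only through a web archive
    DEAD = "dead"          # no longer reachable anywhere


@dataclass(frozen=True)
class Source:
    url: str
    kind: str  # "primary" (e.g. team or official site) or "secondary"
    status: SourceStatus


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    value: str
    sources: Tuple[Source, ...]


def trust_level(claim: Claim) -> str:
    """Hypothetical policy: prefer any reachable primary source."""
    reachable = [s for s in claim.sources if s.status is not SourceStatus.DEAD]
    if any(s.kind == "primary" for s in reachable):
        return "verified"
    if reachable:
        return "secondary-only"
    return "unverifiable"


# Placeholder claim: the value and URLs are illustrative, not real data.
claim = Claim(
    subject="Mercedes-AMG F1 W11 EQ Performance",
    predicate="designedBy",
    value="<designer name>",
    sources=(
        Source("https://example.org/original-citation", "primary", SourceStatus.DEAD),
        Source("https://example.org/encyclopedia-entry", "secondary", SourceStatus.LIVE),
    ),
)

print(trust_level(claim))  # secondary-only
```

Under this rule the claim is kept but flagged, so a reader can see that the only reachable evidence is secondary. A stricter ontology might exclude such claims entirely, while a more permissive one might accept them without comment.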

There is no universal answer. Balancing data completeness, accuracy, and traceability demands a pragmatic approach—one guided by the ontology’s ultimate goal: to serve as a transparent, authoritative, and context-aware representation of the Formula 1 universe, and perhaps a model for how structured knowledge can clarify any complex domain.