The original idea for Choose My Plane was a search interface for aircraft buyers. Not organised by engine type and count, which is how the industry categorises things, but by what a buyer actually knows: how many seats they need, what range, what kind of cabin. The UX was easy to sketch. Then came the immediate practical question: where does the data come from?
That question took longer to answer than the UX did. What followed was a data sourcing and schema design problem that is, in some form, common to almost any product built for a specific domain. The aviation details are specific to Choose My Plane. The pattern is not.
What the data landscape looked like
The first thing to establish was whether the data already existed somewhere accessible. It did, in two forms, neither of which worked.
The first form: closed databases used by brokers, insurers, and fleet operators. These exist, they are comprehensive, and they are priced for commercial buyers. A licence fee that makes sense for an insurance underwriter does not make sense for an early-stage product that does not yet know if it has users. Commercial data was ruled out immediately.
The second form: public sources, primarily the FAA. I found a substantial file, a genuine starting point: manufacturer names, model names, and registration counts, along with approach and stall speeds, wingspan, height, and other dimensional data. What it did not contain was the information a buyer comparison tool actually needs: passenger capacity and cruise speed. It also included airliners, but a tool for general aviation buyers does not need the Airbus A320 family in its database, and filtering them out requires a judgment call about where GA ends and commercial aviation begins.
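That judgment call ends up as a filter somewhere in the pipeline. A minimal sketch, assuming illustrative record fields and an exclusion-by-manufacturer rule (the actual FAA file layout and the rule Choose My Plane uses may differ):

```python
# Illustrative filter for airliners in an FAA-style aircraft file.
# The manufacturer set and record shape are assumptions, not the real schema.
COMMERCIAL_MANUFACTURERS = {"AIRBUS", "BOEING"}

def is_general_aviation(record: dict) -> bool:
    """The GA/commercial boundary as a simple exclusion rule."""
    return record["manufacturer"].upper() not in COMMERCIAL_MANUFACTURERS

records = [
    {"manufacturer": "Cessna", "model": "172"},
    {"manufacturer": "Airbus", "model": "A320"},
]
ga_records = [r for r in records if is_general_aviation(r)]
```

However the rule is expressed, the point is that it is a rule: written down once, applied uniformly, rather than decided per aircraft.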
So: the free source was incomplete and required significant curation. The paid sources were out of reach. The product needed data that did not exist in one place at any accessible price.
The schema question comes first
Before sourcing anything, there is a prior question: what does the database actually need to contain? This sounds obvious but it is easy to skip. The temptation is to start with whatever data is available and build the product around it. The problem with that approach is that you end up with a product shaped by your data sources rather than by what users need.
For Choose My Plane, the core fields the product required were passenger capacity and cruise speed. Those two fields are what make a buyer search possible: filter by seats, filter by range, return a list. Without them, the interface cannot do its job. The FAA file did not have them. That gap, identified precisely, is what turned a sourcing problem into a design problem: I needed to define each field clearly enough that it could be sourced consistently from whatever combination of sources I ended up using.
The schema also evolved. Passenger capacity and cruise speed were the first iteration. Fuel burn and operating cost data came later, once the core comparison feature was working and the next product question became total cost of ownership. A domain database is not something you define once and freeze. It grows with the product, and each addition requires the same sourcing question answered again.
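The two iterations described above can be sketched as a single evolving record type. Field names here are illustrative assumptions, not the actual Choose My Plane schema; the optional fields are the later additions, absent until the sourcing question has been answered for them:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AircraftProfile:
    # First iteration: the fields that make a buyer search possible.
    manufacturer: str
    model: str
    passenger_capacity: int
    cruise_speed_kts: int
    # Second iteration: cost-of-ownership fields, added once the core
    # comparison feature worked. Optional because older profiles may
    # not have them sourced yet.
    fuel_burn_gph: Optional[float] = None
    hourly_operating_cost: Optional[float] = None

profile = AircraftProfile("Cessna", "172", 4, 122)
```

Making the later fields optional rather than required is one way to let the database grow without invalidating every existing profile on each schema addition.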
Building a source hierarchy
The practical solution was a hierarchy: a ranked list of sources, applied in order, with a rule for what to do when they conflict.
The FAA file serves as the foundation. It provides the aircraft identity layer: manufacturer, model, and the dimensional and speed data it does contain. For a general aviation database, that means filtering out commercial airliners and working with what remains as the starting point for each profile.
For the fields the FAA file does not cover, the hierarchy runs: manufacturer sites first, including archived versions of pages for discontinued models; Wikipedia and aviation reference articles second; owner forums third. The ranking reflects reliability and specificity. A manufacturer’s published cruise speed figure is more trustworthy than a forum post, but forums are sometimes the only source for data on older or less common aircraft that manufacturers no longer support.
The hierarchy matters because sources conflict. When they do, you need a rule for which one wins that is not made case by case. Ad hoc decisions produce an inconsistent database. A defined hierarchy produces a consistent one, and it produces a process that someone else could follow, or that can be automated.
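The conflict rule is small enough to state in code. A minimal sketch, assuming the three-tier ranking described above and illustrative lookup results; the rule is simply "highest-ranked source with a value wins":

```python
# Ranked source hierarchy: earlier entries win conflicts.
SOURCE_RANK = ["manufacturer", "wikipedia", "forum"]

def resolve(field_values: dict):
    """Return (value, source) from the highest-ranked source that has one."""
    for source in SOURCE_RANK:
        value = field_values.get(source)
        if value is not None:
            return value, source
    return None, None

# A conflict: manufacturer page and Wikipedia disagree on cruise speed.
# The hierarchy decides, not a case-by-case judgment.
value, source = resolve({"manufacturer": 124, "wikipedia": 122})
```

Because the rule lives in one place, every field in every profile is resolved the same way, which is what makes the process repeatable and, later, automatable.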
Automating the workflow
Once the source hierarchy was defined and the structured output format was clear, the per-aircraft data retrieval process became a candidate for automation. The workflow now uses an LLM instructed on the source hierarchy and given a structured output schema. It retrieves data for a new aircraft profile and returns it in that format, along with sourcing notes: where each figure came from and any cases where sources disagreed or data was absent.
The sourcing notes are the important part. The automation handles retrieval and formatting. The judgment calls — what to do when the manufacturer page and a Wikipedia article give different cruise speeds, or when a figure only appears in a forum post — still require a human decision. The sourcing notes surface those cases rather than resolving them silently. An admin dashboard receives the structured output and flags anything that needs review before the profile goes live.
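The review gate can be sketched as a check over the sourcing notes. The note structure here is an assumption about what the LLM's structured output might look like, not the actual format; the logic is just "flag any field where sources disagreed or only a low-trust source had a figure":

```python
# Sources that always warrant a human look before a profile goes live.
LOW_TRUST = {"forum"}

def needs_review(sourcing_notes: dict) -> list:
    """Return the fields an admin should check before publishing."""
    flagged = []
    for field, note in sourcing_notes.items():
        if note.get("conflict") or note.get("source") in LOW_TRUST:
            flagged.append(field)
    return flagged

notes = {
    "cruise_speed_kts": {"source": "manufacturer", "conflict": True},
    "passenger_capacity": {"source": "manufacturer", "conflict": False},
}
```

Here `needs_review(notes)` would flag only the cruise speed, since its sources conflicted; the capacity figure passes straight through.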
What changed with automation was speed and consistency. Adding a new aircraft profile went from an hour of manual research across multiple tabs to a review-and-validate workflow. The database grew faster and with fewer inconsistencies introduced by fatigue or varying attention to source priority.
What this generalises to
The sequence is not specific to aviation. Domain-specific products regularly encounter the same situation: the data you need exists, but it is either closed and expensive or public and incomplete, and neither option works as-is.
The steps that follow are the same regardless of domain. Define what the data needs to do before evaluating any source against it. Find the closest public source and establish what it covers and what it does not. Fill the gaps with a defined source hierarchy, ranked by reliability, with explicit rules for conflicts. Automate retrieval once the process is stable enough to describe precisely, and preserve the judgment layer for cases the automation cannot resolve cleanly.
The design decision precedes the data work. Getting that order right is most of the problem.
Get in touch if that is where you are.