Chapter 2: Data Models and Query Languages

6 min read

Excellent. Let’s move on.

We’ve established the “what” – we need reliable, scalable, and maintainable systems. Now we get to the first big “how”: how do we represent the data? This is Chapter 2: Data Models and Query Languages.

This is probably the most important decision you’ll make. It profoundly affects not only how you write the software, but how you even think about the problem you’re solving. Get this wrong, and you’ll be fighting your database for years.

What are we ultimately trying to achieve here? We’re trying to choose a structure for our data that fits the access patterns of our application. We want to make common operations easy and fast, without making uncommon (but necessary) operations impossible.

Think of it like building with LEGOs. You have different types of bricks. If you’re building a car, you’ll want wheels and axles. If you’re building a house, you’ll want rectangular bricks and roof pieces. Using the wrong bricks makes the job awkward and the result fragile. Data models are your different types of LEGO bricks.

The Big Three: Relational vs. Document vs. Graph

For the last 30 years, one model dominated: the Relational Model (SQL). Data is in tables (relations), made of rows (tuples). It’s fantastic for structured data and was a huge leap forward because it hid the messy implementation details from developers.

But in the last 15 years, two other models have become major players, driven by the needs of big web companies.

Document Model (NoSQL): Data is stored in self-contained “documents,” usually JSON.
Graph Model: Data is represented as vertices (nodes) and edges (relationships).

Let’s compare them.

Relational Model vs. Document Model

This is the most common battleground. Imagine we’re building a LinkedIn-style resume.

In a Relational (SQL) model, you’d normalize the data. You’d have a users table, a positions table, an education table, etc. The positions table would have a user_id foreign key pointing back to the users table.

erDiagram
    users {
        int user_id PK
        varchar first_name
        varchar last_name
    }
    positions {
        int position_id PK
        int user_id FK
        varchar job_title
        varchar organization
    }
    education {
        int education_id PK
        int user_id FK
        varchar school_name
    }
    users ||--o{ positions : "has"
    users ||--o{ education : "has"

To get a full resume: You need to perform multiple queries or a complex JOIN across several tables.
Relationships: Handles many-to-one and many-to-many relationships beautifully (user_id is a many-to-one link).

In a Document model, you’d likely store the entire resume as a single JSON document. This is similar to the hierarchical model of IBM’s IMS database from the 1960s – history repeats itself.

{
  "user_id": 251,
  "first_name": "Bill",
  "last_name": "Gates",
  "positions": [
    {"job_title": "Co-chair", "organization": "Gates Foundation"},
    {"job_title": "Co-founder", "organization": "Microsoft"}
  ],
  "education": [
    {"school_name": "Harvard University"},
    {"school_name": "Lakeside School"}
  ]
}

To get a full resume: You fetch one document. That’s it. This is a huge win for locality.
The “Impedance Mismatch”: This JSON structure maps very cleanly to the objects in your application code. You don’t need an awkward translation layer (like an ORM) that SQL databases often require.

So, which is better? The classic “it depends.”

If your data has a document-like structure (a central object with nested one-to-many relationships, like a resume or a blog post with comments), the document model is a natural fit. Fetching the whole thing at once is efficient.
If your data is highly interconnected, with complex many-to-many relationships, the relational model is superior. Joins are its superpower.

Case Study: Interconnected Data What if we want to make “Microsoft” and “Harvard University” in our resume clickable entities with their own pages, logos, etc.? Now we have a many-to-many relationship. Many people worked at Microsoft; many people went to Harvard.

In the document model, this gets awkward.

You could store the organization’s name as a string, but then if “Microsoft” rebrands, you have to update millions of resume documents (denormalization).
Or, you can store just an organization_id and then emulate a join in your application code: fetch resume -> for each position, fetch organization details. This is slow and shifts complexity from the database to your code.

The trend (Post-2024 Era): The lines are blurring.

Timeless: Pick the model that fits your primary access patterns.
Post-2024: Relational databases like PostgreSQL have excellent JSON support. Document databases like MongoDB are adding better join-like capabilities. This is a good thing. It’s not “SQL vs NoSQL” anymore, it’s about using the right tool for the job, and increasingly, databases are becoming multi-model.

Graph-Like Data Models

What if everything is potentially related to everything else?

Social networks (who is friends with whom)
Web graphs (which pages link to which)
Knowledge graphs (connecting entities like people, organizations, products)

This is where graph models shine. The model is simple:

Vertices (or nodes): The entities (e.g., people, companies).
Edges (or relationships): The connections between them (e.g., (Lucy)-[:MARRIED_TO]->(Alain), (Alain)-[:WORKS_AT]->(BigTech)).

You can model this in SQL, but it’s painful. A query like “Find all of my friends’ friends who live in London” requires a variable number of joins, which SQL handles awkwardly with recursive common table expressions (CTEs).

Declarative Graph Query Languages like Cypher (Neo4j) or SPARQL (RDF triple stores) are designed for this. A query looks like you’re drawing the pattern you want to find.

Cypher example: Find people who emigrated from the US to Europe.

MATCH
  (person:Person)-[:BORN_IN]->(:Location)-[:WITHIN*]->(us:Location {name:'United States'}),
  (person)-[:LIVES_IN]->(:Location)-[:WITHIN*]->(eu:Location {name:'Europe'})
RETURN person.name

The query optimizer figures out the most efficient way to traverse the graph to find this pattern. This is a powerful, declarative way to handle highly connected data.

Key Takeaways for System Design Interviews

When designing a system, your choice of data model is a primary decision. Justify it based on the data’s structure and your application’s access patterns.

Identify the main entities and their relationships.
- Is it structured, like user profiles? Relational is a good default.
- Is it self-contained documents with nested data? Document model might be simpler and faster.
- Is it a web of complex relationships? Think Graph.
Analyze the access patterns.
- Do you need powerful joins? Relational/SQL.
- Do you primarily fetch an entire object by its ID? Document.
- Do you need to traverse relationships (e.g., find friends of friends)? Graph.
Consider evolvability.
- How will the data change? Schemas are not a bad thing; they enforce structure. Schema-on-read (Document DBs) gives flexibility but can lead to a mess if you’re not careful. Schema-on-write (SQL) is more rigid but provides guarantees. Modern binary formats like Avro and Protobuf offer a middle ground.

Don’t just say “I’ll use MongoDB.” Say, “The primary data entity is a user profile, which includes their list of recent posts. Since this is a self-contained, document-like structure and our main read pattern is fetching the full profile, a document model is a natural fit. It offers good locality and avoids complex joins for this common operation.”

Now, we’ve talked about how to model the data. Next, we need to talk about how a database actually stores and retrieves it. That’s Chapter 3. Any questions on data models?