
Chapter 1: Reliable, Scalable, and Maintainable Applications

This isn’t fluff. These three words—Reliability, Scalability, and Maintainability—are the pillars of any serious system. In a system design interview, these are the top-level concerns you must address. Everything else is an implementation detail that serves these goals.

What are we ultimately trying to achieve here? We’re trying to build systems that work, can grow, and don’t become a nightmare to manage. Let’s break that down.


1. Reliability: It Works, Even When Things Go Wrong

Reliability means the system continues to work correctly, performing its function at the desired level of performance, even in the face of adversity.

The key idea is Faults vs. Failures.

  • A fault is when one component of the system deviates from its spec (e.g., a server crashes, a network link is slow).
  • A failure is when the system as a whole stops providing the required service to the user.

You cannot build a system with zero faults. It’s impossible. Your job is to design a fault-tolerant system that prevents faults from causing failures.

Types of Faults:

  • Hardware Faults: Disks crash, RAM goes bad, someone unplugs the wrong cable.

    • Timeless: Redundancy. RAID for disks, dual power supplies.
    • Post-2024 Era: We now live in the cloud. We don’t just plan for a disk to fail; we plan for the entire virtual machine to disappear without warning. We build resilience at the software layer, not just the hardware layer. This is a fundamental shift from the old on-premise world. Think of Netflix’s famous Chaos Monkey—it deliberately introduces faults to ensure the system is resilient. That’s the mindset.
  • Software Errors: Bugs that cause a system-wide issue. A bad input crashes every server instance. These are often harder to deal with than random hardware faults.

  • Human Errors: Operators make mistakes. Configuration errors are a leading cause of outages.

In a system design interview, when you say “reliability,” you should be thinking about specific fault scenarios and your plan to mitigate them. How do you handle a node going down? A network partition between data centers?
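
A common software-level mitigation for transient faults (a slow link, a briefly unreachable node) is to wrap remote calls in retries with exponential backoff and jitter. A minimal sketch, with illustrative parameter values — `fn` stands in for any network call:

```python
import random
import time

def call_with_retries(fn, max_attempts=3, base_delay=0.1):
    """Retry a flaky call with exponential backoff plus jitter.

    fn is any callable that may raise on a transient fault.
    The delay values here are illustrative, not prescriptive.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the fault surface as a failure
            # Jitter spreads out retries so clients don't hammer a
            # recovering service in lockstep.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

The point is the mindset: the fault (one failed call) is expected and absorbed; only after repeated faults does it surface as a failure.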


2. Scalability: Having a Plan for Growth

Scalability isn’t a magic label you slap on a system. Saying “my system is scalable” is meaningless. Scalability is about answering the question: “As the system grows in a particular way, what’s our plan for coping with that growth?”

First, you have to Describe Load. You can’t talk about scaling if you don’t know what you’re scaling for. Use specific load parameters.

  • Requests per second?
  • Read-to-write ratio?
  • Simultaneously active users?
  • Cache hit rate?
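
As a sketch, some of these parameters can be derived from raw request counters. The field names and formulas below are illustrative, not a standard API:

```python
def describe_load(reads, writes, window_seconds):
    """Turn raw counters over a time window into concrete load parameters."""
    total = reads + writes
    return {
        "requests_per_second": total / window_seconds,
        "read_write_ratio": reads / writes if writes else float("inf"),
    }
```

Which parameters matter depends on the system: for a timeline service, the fan-out per write may matter far more than raw requests per second.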

Case Study: Twitter Timeline (DDIA Figures 1-2 and 1-3)

This is a classic. The problem isn’t just “lots of tweets.” It’s the fan-out. A celebrity with 30 million followers tweets once. How do you deliver that tweet to all 30 million timelines?

There are two naive approaches:

  1. Read-time Fan-out (Pull): A user requests their timeline. The system looks up everyone they follow, gets the recent tweets for each, and merges them.
  2. Write-time Fan-out (Push): A user posts a tweet. The system looks up everyone who follows them and inserts the new tweet into each of their timeline “inboxes” (caches).

```mermaid
graph TD
    subgraph "Approach 1: Read-time Fan-out (Slow Reads)"
        User[User Requests Timeline] --> A{Find all followed users}
        A --> B[For each followed user...]
        B --> C[Fetch their recent tweets]
        C --> D{Merge all tweets}
        D --> Result[Show Timeline]
    end
    subgraph "Approach 2: Write-time Fan-out (Slow Writes)"
        Tweet[User Posts Tweet] --> E{Find all followers}
        E --> F[For each follower...]
        F --> G[Insert tweet into their timeline cache]
        G --> H[Done]
    end
```

Twitter started with Approach 1, but timeline reads were too slow. They switched to Approach 2. This is a perfect example of a scalability trade-off: they made writes more expensive to make reads cheaper. The real system is a hybrid: most users are “push,” but for celebrities, it’s “pull” to avoid a single tweet overwhelming the system.
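
The hybrid can be sketched in a few lines. This is a toy in-memory model; `CELEBRITY_THRESHOLD`, the data structures, and the function names are all assumptions for illustration, not Twitter’s actual implementation:

```python
CELEBRITY_THRESHOLD = 100_000  # illustrative cutoff, not Twitter's real number

def post_tweet(user, tweet, followers, timeline_caches, celebrity_tweets):
    """Write path: push to follower inboxes unless the author is a celebrity."""
    if len(followers[user]) >= CELEBRITY_THRESHOLD:
        # Celebrities: store the tweet once; followers pull it at read time.
        celebrity_tweets.setdefault(user, []).append(tweet)
    else:
        # Regular users: fan out on write to every follower's cached timeline.
        for follower in followers[user]:
            timeline_caches.setdefault(follower, []).append(tweet)

def read_timeline(user, following, followers, timeline_caches, celebrity_tweets):
    """Read path: the cached inbox, merged with pulled celebrity tweets."""
    timeline = list(timeline_caches.get(user, []))
    for followee in following[user]:
        if len(followers[followee]) >= CELEBRITY_THRESHOLD:
            timeline.extend(celebrity_tweets.get(followee, []))
    return timeline
```

The trade-off is visible in the code: a regular user’s tweet costs one write per follower, while a celebrity’s tweet costs one write total but makes every follower’s read do extra work.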

Next, you Describe Performance.

Don’t lead with “average response time.” The arithmetic mean is skewed by outliers and tells you nothing about how many users actually experienced a given delay. Talk about percentiles.

  • Median (p50): Half your requests complete in this time or faster; the other half take longer.
  • 95th, 99th, 99.9th percentiles (p95, p99, p999): These are your tail latencies. They represent the experience of your unluckiest users.
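
A quick way to compute these from a list of measured response times is the nearest-rank method — a sketch; production systems typically use streaming approximations (e.g. HdrHistogram or t-digest) rather than sorting every sample:

```python
import math

def percentile(latencies, p):
    """Nearest-rank percentile: the smallest sample covering p% of requests."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]
```

For example, `percentile(latencies, 99.9)` gives the response time that all but the slowest 0.1% of requests beat.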

Why do tail latencies matter? Amazon found that the customers with the slowest requests were often their most valuable—the ones with long purchase histories. Making the p999 fast makes your best customers happy.

In modern systems (microservices, ML inference pipelines), you get tail latency amplification.

```mermaid
graph LR
    UserRequest --> API_Gateway
    subgraph "Backend Services"
        API_Gateway --> ServiceA
        API_Gateway --> ServiceB
        API_Gateway --> ServiceC
        API_Gateway --> ServiceD
    end
    ServiceA --> API_Gateway
    ServiceB --> API_Gateway
    ServiceC --> API_Gateway
    ServiceD --> API_Gateway
    API_Gateway --> UserResponse

    style ServiceC fill:#f9f,stroke:#333,stroke-width:2px
```

If a user request requires calling 5 backend services in parallel, and each service has a 1% chance of being slow (its p99), the chance that the user sees a slow response is 1 − 0.99^5 ≈ 4.9%—roughly one request in twenty. It only takes one slow service to delay the entire request. This is why focusing on p99 and p999 is critical.
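
The arithmetic behind this amplification is easy to check, assuming slow responses are independent across services:

```python
def p_slow_overall(services, p_slow_each=0.01):
    """Probability that at least one of N parallel backend calls is slow.

    The request is fast only if *every* call is fast, so we take the
    complement of (1 - p) ** N.
    """
    return 1 - (1 - p_slow_each) ** services
```

With 5 services this gives about 4.9%; with 100 services (not unusual in a deep microservice graph), about 63% of requests would hit at least one p99-slow call.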


3. Maintainability: Avoiding a Future Mess

This is the most overlooked aspect by junior engineers, but it’s where most of the cost of software lies. The goal is to design a system that future engineers (including you in 6 months) can work on productively.

It boils down to three principles:

  1. Operability: Make it easy for operations teams to keep the system running smoothly. This means good monitoring, good automation, standard tools, and predictable behavior. No magic.
  2. Simplicity: Manage complexity. This isn’t about dumbing things down. It’s about finding the right abstractions that hide a great deal of implementation detail behind a clean interface. A well-designed database is a great abstraction.
  3. Evolvability (or Plasticity): Make it easy to change the system in the future. Business needs change. You’ll need to add features. How easy is it to “refactor” your architecture, like Twitter did with its timeline?

Summary for System Design Interviews

When you get a design question, these three concepts are your high-level checklist. For any component you propose, ask:

  • Reliability: What are its failure modes? How will we make it fault-tolerant?
  • Scalability: What are the load parameters? Where are the bottlenecks? What is our strategy for handling 10x or 100x the load? How will we measure performance?
  • Maintainability: Is this design easy to understand? Can a new engineer get up to speed quickly? How will we evolve it when requirements change?

This chapter gives us the vocabulary and the framework. In the next chapter, we’ll dive into the first major design choice that impacts all three of these: the Data Model.