
System Design Fundamentals: A Complete Guide for Developers


By Kenil Sangani · 22 min read

Learning System Design Backwards: A Practical Guide for Developers

Creativity Coder Engineering


Most engineers learn system design backwards. They memorize the vocabulary (load balancers, sharding, CAP theorem) and can draw the boxes-and-arrows diagram in an interview. Then they ship to production, traffic grows, and they discover that knowing what a read replica is tells you nothing about when you actually need one, or what breaks when you add it.

This guide is about the second kind of knowledge. We’ll cover the same fundamentals you’d find anywhere, but grounded in the decisions that actually matter: when each technique earns its complexity, what it costs, and where it fails. The goal isn’t to make you sound like you work at Google. It’s to help you make better decisions on the system you’re building right now.


What system design actually is

System design is the practice of deciding how the pieces of a software system fit together to meet a set of requirements and, more importantly, deciding which requirements you’re willing to sacrifice.

That second clause is the whole game. Every interesting design decision is a trade-off:

  • You can have strong consistency or high availability under network partition, not both.
  • You can optimize for low latency or high throughput, rarely both at once.
  • You can keep the system simple or make it infinitely scalable, but simplicity and scalability pull in opposite directions.

A senior engineer isn’t someone who knows more patterns. It’s someone who knows which trade-off the situation calls for and can defend the choice. Junior engineers ask “what’s the best database?” Senior engineers ask “best for what, and what are we giving up?”

The best architecture is the simplest one that satisfies your actual requirements. Everything beyond that is cost you’re paying for problems you don’t have yet.


Start with requirements, or you’ll design the wrong thing

Before any architecture, you need two kinds of requirements. Skipping this is one of the most common reasons systems get rebuilt within 18 months.

Functional requirements describe what the system does: users can upload files, send messages, make payments. These define features.

Non-functional requirements describe how well it does them: 99.99% uptime, sub-200ms responses, 1 million concurrent users, data durability of eleven nines. These define architecture.

Here’s why the distinction matters. The functional requirement “users can post a photo” is identical whether you have a thousand users or a hundred million. The non-functional requirements are what force every hard decision. A system targeting 99% availability (3.65 days of downtime a year) and one targeting 99.99% (about 53 minutes a year) are structurally different systems: different redundancy, different failover, and a cost that differs by an order of magnitude.
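The downtime figures above are simple arithmetic, and it is worth running them for whatever target you are about to promise. A minimal sketch (the function name is illustrative, and the year is taken as 365 days):

```javascript
// Convert an availability target into the downtime it allows per year.
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600, ignoring leap years

function allowedDowntimeMinutes(availabilityPercent) {
  return MINUTES_PER_YEAR * (1 - availabilityPercent / 100);
}

for (const target of [99, 99.9, 99.99]) {
  const minutes = allowedDowntimeMinutes(target);
  console.log(`${target}% allows about ${(minutes / 60).toFixed(2)} hours of downtime a year`);
}
```

Each extra nine cuts the budget by a factor of ten, which is why each one tends to cost more than the last.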

So before designing, get specific:

  • How many users, and what’s the read-to-write ratio?
  • How much data, growing how fast?
  • What latency is acceptable at p99, not just average?
  • How consistent must the data be? Can a user tolerate seeing a stale value for two seconds?
  • What’s the blast radius when a component fails?

Write the answers down. Most architecture arguments are actually disagreements about unstated requirements.


Do the capacity math before you draw boxes

Back-of-the-envelope estimation is the cheapest way to discover that your plan is impossible. It takes ten minutes and routinely saves months.

Say you’re building a photo-sharing app targeting 10 million daily active users.

Traffic. If each user is active for ~30 minutes a day and makes a request roughly every 10 seconds:

Average concurrent users ≈ (10M × 30 min) / (24 × 60 min) ≈ 200,000
Average requests/second  ≈ 200,000 / 10 s ≈ 20,000 RPS; allowing for daily peaks, plan for roughly 20,000–35,000 RPS

That number alone tells you a single application server is out of the question and a single unsharded database will be under serious pressure.

Storage. Two photos per user per day, 2 MB each:

10M users × 2 photos × 365 days × 2 MB ≈ 14.6 PB/year

Fourteen petabytes a year is more than most companies will ever store. Seeing this number before designing forces the right decisions early: you will not store originals at full resolution, you will generate thumbnails, you will push everything to object storage and a CDN rather than serving from your own disks.

Bandwidth. If each photo is viewed ~50 times at 1 MB per view, the 20 million photos uploaded each day generate about 1 PB of egress per day, or roughly 30 PB a month. Serving that from your own origin is financially absurd. This is the math that makes a CDN non-optional rather than a nice-to-have.

None of this requires precision. The point of the estimate isn’t to be right to two decimal places; it’s to be right about the order of magnitude, because order of magnitude is what dictates architecture.
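The three estimates above fit in a few lines of code, which makes it easy to change one assumption and watch the order of magnitude move. This is a sketch using only the inputs stated in this section; the variable names are illustrative and units are decimal (1 PB = 1e9 MB).

```javascript
// Back-of-the-envelope numbers for the photo-sharing example.
const dailyActiveUsers = 10_000_000;
const activeMinutesPerDay = 30;
const secondsBetweenRequests = 10;
const photosPerUserPerDay = 2;
const photoSizeMB = 2;
const viewsPerPhoto = 50;
const viewSizeMB = 1;
const MB_PER_PB = 1e9;

// Traffic: users online at an average moment, and the request rate they generate.
const avgConcurrentUsers = (dailyActiveUsers * activeMinutesPerDay) / (24 * 60);
const avgRequestsPerSecond = avgConcurrentUsers / secondsBetweenRequests;

// Storage: one year of uploads kept at full size.
const storagePBPerYear =
  (dailyActiveUsers * photosPerUserPerDay * 365 * photoSizeMB) / MB_PER_PB;

// Bandwidth: each day's uploads are eventually viewed viewsPerPhoto times.
const egressPBPerDay =
  (dailyActiveUsers * photosPerUserPerDay * viewsPerPhoto * viewSizeMB) / MB_PER_PB;

console.log({
  avgConcurrentUsers: Math.round(avgConcurrentUsers),
  avgRequestsPerSecond: Math.round(avgRequestsPerSecond),
  storagePBPerYear,
  egressPBPerDay,
});
```

These are averages. Real traffic is not spread evenly over 24 hours, so peak figures will be some multiple of them.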


The shape of a typical system

Most web systems converge on a similar layered structure, where each layer owns exactly one concern:

Client  →  CDN  →  Load Balancer  →  API layer  →  Services
                                                      ├→  Database (+ replicas)
                                                      ├→  Cache
                                                      ├→  Object storage
                                                      └→  Message queue → Workers

The discipline here is separation. The API layer shouldn’t be reading large files off disk. The database shouldn’t be storing images. The cache shouldn’t be your source of truth. When one component starts doing another’s job, you lose the ability to scale them independently, which is the entire reason you split them in the first place.

We’ll walk through the load-bearing pieces and, for each, the question that actually matters: when do you need it, and what does it cost you?


Choosing a database (the decision you’ll regret longest)

Database choice has the longest half-life of any architectural decision. Changing it later means a migration, and migrations at scale are measured in quarters.

The real split isn’t “SQL vs NoSQL”; it’s “do I need transactional consistency and relational queries, or do I need horizontal write scalability and schema flexibility?”

Engineering Recommendation | Relational (Postgres, MySQL)                | Document/Wide-column (MongoDB, Cassandra, DynamoDB)
---------------------------+---------------------------------------------+----------------------------------------------------
Consistency                | ACID transactions                           | Usually eventual, tunable
Queries                    | Joins, complex filters, aggregations        | Key-based access, limited cross-entity queries
Scaling writes             | Hard (sharding required)                    | Designed for horizontal scale
Schema                     | Enforced, migrations needed                 | Flexible per-document
Best when                  | You have relationships and need correctness | You have high write volume and a known access pattern

In practice, the right default for most SaaS products is Postgres. It does relational work, has excellent JSON support when you need flexibility, scales further than people assume with replicas and good indexing, and you can defer the harder NoSQL questions until you have real access patterns to design around. Reach for a NoSQL store when you have a specific workload (high-volume event ingestion, a feature store, time-series telemetry) where its model is a clear fit, not as a general-purpose default because it sounds more scalable.

Most teams that “needed NoSQL for scale” actually needed an index and a cache. Prove your relational database is the bottleneck before you abandon it.


Scaling: up first, out when you have to

There are two ways to handle more load, and the order in which you reach for them matters.

Vertical scaling means a bigger machine: more CPU, more RAM. Its virtue is that it requires zero code changes: you resize the instance and move on. Its limit is physics and price. You can’t buy an infinitely large server, and cost grows faster than capacity at the top end. But for a startup, vertical scaling is underrated. Moving from a 4-core to a 32-core box can buy you a year of runway while you build the things that actually need horizontal scale.

Horizontal scaling means more machines. There is no hard ceiling, cost grows roughly linearly, and fault tolerance improves, but it forces distributed-systems thinking onto your whole stack.

The key distinction that determines how painful horizontal scaling will be is whether your services are stateless.

A stateless service keeps no per-user state in memory between requests, so any server can handle any request and you add capacity by adding boxes. A stateful service, one that holds session data locally, breaks the moment a user’s second request lands on a different server. The fix is to push state out of the application tier entirely: store sessions in Redis, or make them stateless with signed JWTs, so your application servers stay disposable. Disposable servers are the foundation everything else rests on.

Stateless (good):           Stateful (problematic):
User → LB → any server      User → LB → Server 1 (session lives here)
        every server          next request → Server 2 (no session → logged out)
        is identical

Make the application tier stateless. Put state in dedicated layers built to hold it.
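To see why a signed token lets any server handle any request, here is a stripped-down sketch of the idea using Node’s built-in crypto module. It is not a real JWT implementation (use a maintained library for that); the `SECRET` value and the `sign`/`verify` names are placeholders for illustration.

```javascript
const crypto = require('node:crypto');

// Placeholder secret; in practice this would come from a secrets manager.
const SECRET = 'replace-me';

// Encode the session payload and attach an HMAC so it cannot be altered.
function sign(payload, secret = SECRET) {
  const body = Buffer.from(JSON.stringify(payload)).toString('base64url');
  const mac = crypto.createHmac('sha256', secret).update(body).digest('base64url');
  return `${body}.${mac}`;
}

// Any server that knows the secret can check the token without shared memory.
function verify(token, secret = SECRET) {
  const [body, mac] = String(token).split('.');
  if (!body || !mac) return null;
  const expected = crypto.createHmac('sha256', secret).update(body).digest('base64url');
  const given = Buffer.from(mac);
  const wanted = Buffer.from(expected);
  // timingSafeEqual requires equal lengths, so check that first.
  if (given.length !== wanted.length || !crypto.timingSafeEqual(given, wanted)) return null;
  const payload = JSON.parse(Buffer.from(body, 'base64url').toString('utf8'));
  if (typeof payload.exp === 'number' && payload.exp < Date.now()) return null; // expired
  return payload;
}
```

Because the session travels with the request, Server 2 can validate it exactly as Server 1 would. The trade-off is that a token cannot easily be revoked before it expires, which is one reason some teams prefer sessions in Redis.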


Load balancing

Once you have more than one server, something has to decide where each request goes. That’s the load balancer, and it does two jobs: distribute traffic and stop sending it to servers that have died.

The distribution algorithm matters more than people think:

  • Round-robin cycles through servers evenly. Fine when requests are uniform and servers are identical.
  • Least connections routes to whoever’s least busy. Better when request durations vary wildly: a server stuck on one slow report shouldn’t keep getting more work piled on it.
  • Consistent hashing sends the same key (user, session) to the same server. Necessary when you do have local caches or affinity, and it minimizes disruption when servers join or leave.
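The first two strategies in the list are small enough to sketch directly. This is an in-memory illustration of the selection logic only, not how any particular load balancer implements it; the function names are made up for the example.

```javascript
// Round-robin: hand out servers in a fixed rotation.
function roundRobin(servers) {
  let next = 0;
  return () => servers[next++ % servers.length];
}

// Least connections: track in-flight requests and pick the idlest server.
function leastConnections(servers) {
  const active = new Map(servers.map((server) => [server, 0]));
  return {
    acquire() {
      let best = servers[0];
      for (const server of servers) {
        if (active.get(server) < active.get(best)) best = server;
      }
      active.set(best, active.get(best) + 1);
      return best;
    },
    // Call when the request finishes so the count stays accurate.
    release(server) {
      active.set(server, Math.max(0, active.get(server) - 1));
    },
  };
}
```

Round-robin needs no feedback from the backends. Least connections does, which is the price of adapting to uneven request durations.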

For most teams, a managed layer-7 balancer (AWS ALB, Cloudflare, or Nginx/HAProxy if self-hosting) is the right answer. A single load balancer comfortably fronts many backend servers, so it rarely becomes the bottleneck, but it does become a single point of failure, which is why production setups run at least two.


Caching: the highest-ROI optimization you have

Caching is usually the single biggest performance win available, because the math is brutal in your favor. If your cache absorbs 90% of reads, your database sees 10% of the traffic. Push the hit rate to 99% and the database sees 1%: a further tenfold reduction in load from nine more points of hit rate.
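As a quick check of that arithmetic, assuming a hypothetical 10,000 reads per second arriving at the application:

```javascript
// Reads that miss the cache are the ones the database still has to serve.
function databaseReadsPerSecond(totalReadsPerSecond, cacheHitRate) {
  return totalReadsPerSecond * (1 - cacheHitRate);
}

for (const hitRate of [0, 0.9, 0.99]) {
  const load = Math.round(databaseReadsPerSecond(10000, hitRate));
  console.log(`hit rate ${hitRate}: about ${load} reads/s reach the database`);
}
```

The last few points of hit rate matter most: going from 90% to 99% removes nine tenths of the remaining database load.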

The strategy you choose determines your consistency guarantees:

Cache-aside (lazy loading) is the common default. On a read, check the cache; on a miss, query the database and populate the cache. You only ever cache data someone actually asked for. The cost is that the first request for any item is always a miss, and you need a story for invalidation.

Write-through writes to cache and database together, so the cache is never stale, at the price of slower writes.

Write-behind writes to cache immediately and flushes to the database asynchronously. Fast, high-throughput, and risky: if the cache dies before the flush, you lose data. Reserve it for cases where that loss is acceptable.

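// Cache-aside read, with explicit invalidation on write.
// Assumes `redis` is an ioredis-style client, and that `db` and `eventBus`
// are your own data-access and messaging wrappers; `db.query` is assumed
// to resolve to the row, or null when the user does not exist.
async function getUser(userId) {
  const key = `user:${userId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached); // hit: skip the database

  // Miss: load from the database, then populate the cache.
  const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
  if (user) {
    await redis.set(key, JSON.stringify(user), 'EX', 3600); // 1h TTL
  }
  return user;
}

async function updateUser(userId, data) {
  await db.update('users', { id: userId }, data);     // write the source of truth first
  await redis.del(`user:${userId}`);                   // invalidate immediately
  await eventBus.publish('user.updated', { userId }); // let others react
}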

The hard part, as the saying goes, is invalidation. Three approaches, in increasing order of freshness and effort: let entries expire on a TTL (simple, but you serve stale data until expiry); delete the entry explicitly on every write (fresh, but easy to forget a write path); or publish a change event that all interested caches subscribe to (decoupled and reliable, but more moving parts). Match the TTL to how stale the data is allowed to be: an hour for a user profile, a minute for inventory that affects whether you can take an order, fifteen minutes for an auth token because security caps it.
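The first two invalidation approaches, expiry and explicit delete, can be seen side by side in a small in-memory cache. This is a sketch for illustration, not a replacement for Redis; the `TtlCache` class and the injectable `now` clock are assumptions made so the behavior is easy to test.

```javascript
// Minimal in-memory cache with per-entry TTL and explicit invalidation.
class TtlCache {
  constructor(now = Date.now) {
    this.now = now;           // injectable clock, handy in tests
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  set(key, value, ttlMs) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    return entry.value;
  }

  invalidate(key) {
    this.entries.delete(key);
  }
}

// TTLs from the text: staleness budget differs by data type.
const TTL_MS = {
  userProfile: 60 * 60 * 1000,
  inventory: 60 * 1000,
  authToken: 15 * 60 * 1000,
};
```

With TTL alone, a stale value can survive for the full TTL after a write. Calling `invalidate` on every write path closes that window, provided no write path is missed.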


CDNs

A CDN is a cache for static content positioned physically close to your users. Instead of a request from Mumbai traveling to a server in Virginia and back, it’s served from an edge node in Mumbai in tens of milliseconds.

Cache aggressively at the edge for anything static and public (images, CSS, JS, fonts, video, downloads) using long max-age headers. Never cache personalized or rapidly changing content there; a user seeing another user’s cached profile is a serious bug, not a performance win. The line is simple: if two different users should see different bytes, it doesn’t belong in a shared edge cache.
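In HTTP terms, that line is drawn with the Cache-Control response header. The sketch below shows one plausible policy; the `personalized` and `fingerprinted` flags and the specific lifetimes are assumptions for illustration, not a standard.

```javascript
// Pick a Cache-Control header for a response.
// `resource.personalized`: bytes differ per user.
// `resource.fingerprinted`: URL contains a content hash, so it never changes.
function cacheControlFor(resource) {
  if (resource.personalized) {
    return 'private, no-store'; // must never sit in a shared edge cache
  }
  if (resource.fingerprinted) {
    return 'public, max-age=31536000, immutable'; // one year; the URL changes on deploy
  }
  return 'public, max-age=300'; // public but mutable: keep the lifetime short
}

console.log(cacheControlFor({ fingerprinted: true }));
console.log(cacheControlFor({ personalized: true }));
```

Checking `personalized` first is deliberate: if a resource is ever flagged both ways, the safe answer wins.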


Database replication and sharding

When the database becomes the bottleneck (and it usually is the first thing to saturate), you have two distinct tools for two distinct problems.

Replication solves read scaling. You keep a primary that takes all writes and one or more replicas that serve reads. Most applications read far more than they write, so this buys a lot of headroom cheaply.

The catch is replication lag. Replicas catch up to the primary asynchronously, usually within milliseconds, occasionally seconds under load. So immediately after a write, a read from a replica may return the old value:

Primary:  like count → 1000   (write committed)
Replica:  like count →  999   (hasn't caught up yet)

For a like counter, nobody cares. For “did my payment go through,” you read from the primary or you read your own writes through a session-pinned route. Knowing which of your reads can tolerate lag, and which can’t, is the actual skill here.
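One way to express that rule in code is a small router that sends a session’s reads to the primary for a short window after it writes, and always for reads flagged as critical. This is a sketch under assumptions: the `ReadRouter` name, the two-second pin window, and the `critical` flag are all illustrative.

```javascript
// Decide whether a read may go to a replica or must go to the primary.
class ReadRouter {
  constructor({ pinMs = 2000, now = Date.now } = {}) {
    this.pinMs = pinMs;           // how long to pin a session after a write
    this.now = now;               // injectable clock for testing
    this.lastWriteAt = new Map(); // sessionId -> timestamp of last write
  }

  recordWrite(sessionId) {
    this.lastWriteAt.set(sessionId, this.now());
  }

  targetFor(sessionId, { critical = false } = {}) {
    if (critical) return 'primary'; // e.g. "did my payment go through"
    const last = this.lastWriteAt.get(sessionId);
    if (last !== undefined && this.now() - last < this.pinMs) {
      return 'primary'; // read-your-writes window is still open
    }
    return 'replica';
  }
}
```

The pin window should comfortably exceed your observed replication lag; if lag can spike to seconds under load, a fixed window is only a partial guarantee.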

Sharding solves write scaling, and it’s a much bigger commitment. You split data across independent databases by some key:

hash(user_id) % num_shards  →  which database holds this user

Sharding works, but it costs you things that were free before. Cross-shard queries (“find every user named John”) now have to hit every shard and merge results. Transactions spanning shards are painful. And resharding (going from 3 shards to 4) moves data and is genuinely dangerous, which is why consistent hashing exists: to minimize how much moves.

The sequence matters: add replicas and caching first, exhaust them, and only shard when a single primary genuinely can’t absorb your write volume (very roughly, when you’re pushing tens of thousands of writes per second or your data outgrows a single node). Choose the shard key carefully, because changing it later is among the most expensive operations in all of infrastructure. A good key (user ID, tenant ID) distributes evenly and keeps related data together; a bad one (timestamp, status) creates hot shards where all the new traffic piles onto one node.
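The resharding cost is easy to measure in a simulation. The sketch below compares the modulo scheme shown earlier with a consistent-hash ring when going from 3 shards to 4. The shard names, the 100 virtual nodes per shard, and the use of SHA-256 as the hash are all choices made for this illustration.

```javascript
const crypto = require('node:crypto');

// Map a string to an unsigned 32-bit integer (first 4 bytes of SHA-256).
function hash32(text) {
  return crypto.createHash('sha256').update(text).digest().readUInt32BE(0);
}

// Naive placement: hash(key) % numShards.
function moduloShard(key, numShards) {
  return hash32(key) % numShards;
}

// Consistent-hash ring: each shard owns many points ("virtual nodes").
function buildRing(shardNames, vnodes = 100) {
  const ring = [];
  for (const shard of shardNames) {
    for (let i = 0; i < vnodes; i++) {
      ring.push({ point: hash32(`${shard}#${i}`), shard });
    }
  }
  return ring.sort((a, b) => a.point - b.point);
}

// A key belongs to the first ring point at or after its hash, wrapping around.
function ringShard(ring, key) {
  const h = hash32(key);
  let lo = 0;
  let hi = ring.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (ring[mid].point < h) lo = mid + 1;
    else hi = mid;
  }
  return ring[lo % ring.length].shard;
}

// Fraction of keys whose shard changes between two placement functions.
function fractionMoved(keys, before, after) {
  return keys.filter((k) => before(k) !== after(k)).length / keys.length;
}

const keys = Array.from({ length: 10000 }, (_, i) => `user:${i}`);
const ring3 = buildRing(['s0', 's1', 's2']);
const ring4 = buildRing(['s0', 's1', 's2', 's3']);

const movedModulo = fractionMoved(keys, (k) => moduloShard(k, 3), (k) => moduloShard(k, 4));
const movedRing = fractionMoved(keys, (k) => ringShard(ring3, k), (k) => ringShard(ring4, k));
console.log({ movedModulo, movedRing });
```

With modulo placement, most keys change shards when the shard count changes (around three quarters for 3 to 4). With the ring, only the keys claimed by the new shard move, which tends toward a quarter.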


Message queues: stop making users wait

Not every task triggered by a request needs to finish before you respond. When a user signs up, you create their account, send a welcome email, kick off analytics, and build an onboarding flow. Only the first one needs to happen before you return a response. The rest can happen in the background.

Synchronous (user waits for everything):
User → API → create account → send email → analytics → onboarding → response (slow)

Asynchronous (user waits only for what matters):
User → API → create account → enqueue the rest → response (fast)
                                    ↓
                              workers process the queue

A queue decouples accepting work from doing work. There are two shapes worth knowing. A task queue delivers each message to exactly one worker; use it for jobs like image resizing or email. A pub/sub stream delivers each event to many independent subscribers; use it when several systems need to react to the same thing, like a user.created event that simultaneously triggers email, profile creation, and preference initialization.
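The difference between the two shapes is only in who receives each message. A toy in-process sketch makes that visible; real brokers add persistence, acknowledgements, and asynchronous delivery, none of which is modeled here, and the class names are illustrative.

```javascript
// Task queue: each message goes to exactly one worker (round-robin here).
class TaskQueue {
  constructor() {
    this.workers = [];
    this.next = 0;
  }
  addWorker(fn) {
    this.workers.push(fn);
  }
  enqueue(message) {
    if (this.workers.length === 0) throw new Error('no workers registered');
    const worker = this.workers[this.next++ % this.workers.length];
    worker(message);
  }
}

// Pub/sub: each event goes to every subscriber.
class PubSub {
  constructor() {
    this.subscribers = [];
  }
  subscribe(fn) {
    this.subscribers.push(fn);
  }
  publish(event) {
    for (const fn of this.subscribers) fn(event);
  }
}
```

Adding a second worker to a task queue halves each worker’s share. Adding a second subscriber to a topic doubles the total work done per event.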

Pay attention to delivery guarantees. Most practical queues give you at-least-once delivery, which means a message can be processed twice if a worker crashes after doing the work but before acknowledging it. The defense is idempotency: design handlers so that processing the same message twice produces the same result as processing it once. This is not optional; it’s the price of admission for queues.
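A common way to get idempotency is to give every message a unique ID and skip IDs you have already handled. The sketch below keeps the set of processed IDs in memory purely for illustration; in a real system it would live in a durable store and ideally be updated in the same transaction as the side effect. The `makeIdempotent` helper and the `id` field are assumptions.

```javascript
// Wrap a handler so that redelivered messages are ignored.
function makeIdempotent(handler, processedIds = new Set()) {
  return (message) => {
    if (processedIds.has(message.id)) return false; // duplicate delivery: skip
    handler(message);
    processedIds.add(message.id); // record only after the work succeeded
    return true;
  };
}

// Example: crediting an account must not happen twice for one message.
const balances = new Map([['acct-1', 0]]);
const creditAccount = makeIdempotent((msg) => {
  balances.set(msg.account, balances.get(msg.account) + msg.amount);
});

const message = { id: 'msg-42', account: 'acct-1', amount: 100 };
creditAccount(message);
creditAccount(message); // redelivery after a lost acknowledgement
console.log(balances.get('acct-1'));
```

If the handler throws, the ID is not recorded, so the message can be retried. That is the behavior you want from an at-least-once queue.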

Kafka suits high-throughput event streaming, RabbitMQ suits complex routing, and managed options like SQS suit teams who’d rather not operate the infrastructure.


Monolith vs microservices

This debate generates more heat than any other in system design, and the honest answer is anticlimactic: start with a monolith.

A monolith is one deployable application. It is simpler to build, simpler to deploy, simpler to debug, and a single transaction can span your whole data model without distributed-systems gymnastics. For nearly every startup, the monolith is not a compromise; it’s the correct choice, and it stays correct far longer than microservices advocates admit.

Microservices (independently deployable services, each owning its own data) solve real problems, but those problems are mostly organizational and scale-driven. They let large teams ship without colliding and let you scale a hot path independently. In return you take on network calls between services that can fail, distributed debugging, and operational overhead that demands real investment in observability before it pays off.

Engineering Reality            | Monolith                     | Microservices
-------------------------------+------------------------------+------------------------------------------------------
Deployment                     | One unit, simple             | Many units, orchestration needed
Development speed (small team) | Fast                         | Slowed by coordination overhead
Scaling                        | Whole app together           | Per-service, granular
Debugging                      | In-process, straightforward  | Distributed tracing required
Right for                      | Startups, small/medium teams | Large orgs, independent teams, proven scale bottlenecks

The failure mode is splitting into microservices before you have the team size or scale that justifies them. You inherit all the operational cost and none of the benefit. Split when a specific service has a clear, measured reason to be independent, not on principle.


Reliability: assume everything fails

At scale, component failure isn’t an edge case; it’s a daily occurrence. Disks fail, networks partition, dependencies time out. Reliable systems are the ones that expect this.

The foundation is redundancy: no single point of failure, at any layer. Multiple app servers behind multiple load balancers, database replicas across availability zones, ideally more than one region. If killing any single machine takes down your product, you have a design bug, not bad luck.

Beyond redundancy, three patterns earn their keep:

Circuit breakers stop cascading failures. When a downstream dependency is failing, continuing to call it wastes resources and ties up threads, which then causes your own callers to time out, so the failure spreads upstream. A circuit breaker detects repeated failures and “opens,” failing fast instead of waiting, then periodically tests whether the dependency has recovered.

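class CircuitBreaker {
  constructor(threshold = 5, cooldownMs = 30000) {
    this.failures = 0;            // consecutive failures seen so far
    this.threshold = threshold;   // failures needed to open the circuit
    this.cooldownMs = cooldownMs; // how long to stay open before probing
    this.state = 'closed';        // 'closed' | 'open' | 'half-open'
    this.openedAt = null;
  }

  async call(fn) {
    if (this.state === 'open') {
      // After the cooldown, let a trial call through; otherwise fail fast.
      if (Date.now() - this.openedAt > this.cooldownMs) this.state = 'half-open';
      else throw new Error('circuit open: failing fast');
    }
    try {
      const result = await fn();
      this.failures = 0;          // any success closes the circuit again
      this.state = 'closed';
      return result;
    } catch (err) {
      // A failed half-open probe reopens at once: the count is still at the threshold.
      if (++this.failures >= this.threshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}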

Bulkheads isolate failures so one exhausted resource doesn’t sink the ship. If report generation can drain the database connection pool, give it its own pool; then a flood of reports degrades reports while the rest of the API keeps serving.
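The same idea applies to any limited resource: give each workload its own cap so one cannot starve the other. A minimal sketch of a concurrency-limiting bulkhead follows; the class name, the limits of 2 and 20, and the choice to reject rather than queue when full are assumptions for the example.

```javascript
// Caps how many tasks of one kind may run at once.
class Bulkhead {
  constructor(name, maxConcurrent) {
    this.name = name;
    this.maxConcurrent = maxConcurrent;
    this.inFlight = 0;
  }

  tryAcquire() {
    if (this.inFlight >= this.maxConcurrent) return false;
    this.inFlight++;
    return true;
  }

  release() {
    if (this.inFlight > 0) this.inFlight--;
  }

  // Run a task inside the bulkhead, rejecting immediately when it is full.
  async run(task) {
    if (!this.tryAcquire()) throw new Error(`${this.name} bulkhead full`);
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

// Separate compartments: a flood of reports cannot use the API's capacity.
const reportsBulkhead = new Bulkhead('reports', 2);
const apiBulkhead = new Bulkhead('api', 20);
```

Usage would look like `await reportsBulkhead.run(() => generateReport(id))`, where `generateReport` is whatever slow operation you are isolating.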

Retries with exponential backoff and jitter handle transient failures without turning them into outages. Retry, but back off geometrically (100ms, 200ms, 400ms…) and add randomness, so that ten thousand clients don’t all retry in lockstep and hammer a recovering service into the ground again (the “thundering herd”).
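Here is one common variant, often called “full jitter”: the delay is drawn uniformly between zero and the exponential ceiling. The helper names, the 10-second cap, and the default of five attempts are assumptions for this sketch.

```javascript
// Delay before retry number `attempt` (0-based), in milliseconds.
function backoffDelayMs(attempt, { baseMs = 100, capMs = 10_000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt); // 100, 200, 400, ... up to the cap
  return Math.floor(random() * ceiling);                  // uniform in [0, ceiling)
}

// Retry an async operation, sleeping with jittered backoff between attempts.
async function withRetries(fn, { maxAttempts = 5, ...backoff } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // out of attempts: surface the error
      const delay = backoffDelayMs(attempt, backoff);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

The cap matters as much as the jitter: without it, a long outage pushes delays into minutes or hours. Retries should also be limited to errors that are plausibly transient, which this sketch does not distinguish.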


Observability: you can’t fix what you can’t see

When something breaks in production at 2 a.m., the difference between a five-minute fix and a five-hour outage is whether you can see what’s happening. Observability rests on three pillars.

Logs record discrete events: what happened. Metrics record numbers over time: how much, how fast, how often. Traces follow a single request across every service it touches, which is the only practical way to answer “why was this request slow?” in a distributed system.

The metrics that actually matter:

  • Availability, stated honestly: 99.9% is nearly 9 hours of downtime a year; 99.99% is under an hour. Know which one you’re promising.
  • Latency at percentiles, not averages. Average latency hides your worst experiences. If p99 is 500ms, one in every hundred requests waits half a second or longer, and your heaviest users hit that tail constantly. p99 is the number that reflects real user pain.
  • Error rate, broken down by type and by component, so a spike points you somewhere.
  • Resource saturation (CPU, memory, connections), so you see the wall before you hit it.

Then alert on a small number of things you can actually act on (error rate over 1%, p99 over your SLO, free disk space under 20%) and resist alerting on everything else. Alert fatigue is how teams learn to ignore the page that finally matters.
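To make the percentile point concrete, here is a small sketch that computes the mean and p99 of the same latency sample. The sample data is invented, and the percentile uses the simple nearest-rank method; monitoring systems typically use approximations over histograms instead.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// 98 fast requests and 2 very slow ones.
const latenciesMs = [...Array(98).fill(50), 2000, 2000];

console.log({
  mean: mean(latenciesMs),
  p50: percentile(latenciesMs, 50),
  p99: percentile(latenciesMs, 99),
});
```

The mean here is 89 ms, which looks healthy, while p99 is 2000 ms. An alert on the average would never fire for the users who are actually suffering.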


Security has to be in the design, not bolted on after

Security retrofitted after launch is always more expensive and less effective than security designed in. The fundamentals:

Authentication verifies who someone is (sessions, JWTs, OAuth). Authorization verifies what they’re allowed to do, and it’s a separate question; authenticating a user tells you nothing about whether they may delete this record. Enforce authorization on the server for every sensitive action, never in the client.

Encryption protects data in transit (TLS everywhere, no exceptions) and at rest (disk and field-level encryption for sensitive data). Rate limiting protects you from abuse and from accidental floods: cap requests per client and fail the excess cleanly. And secrets management means API keys, database passwords, and tokens live in a vault or your platform’s secrets manager, never in source control. One of the most common breaches in startups is a credential committed to a repo.
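Rate limiting is often implemented as a token bucket: each client has a bucket that refills at a steady rate, and a request is allowed only if a token is available. The sketch below is a single-process illustration with an injectable clock; the class name and parameters are assumptions, and a multi-server deployment would keep the counters in a shared store such as Redis.

```javascript
// Token bucket: allows bursts up to `capacity`, sustained rate of `refillPerSecond`.
class TokenBucket {
  constructor({ capacity, refillPerSecond, now = Date.now }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start full
    this.lastRefill = now();
  }

  tryRemoveToken() {
    // Top up based on the time elapsed since the last call.
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = current;

    if (this.tokens < 1) return false; // caller would typically respond with HTTP 429
    this.tokens -= 1;
    return true;
  }
}
```

“Failing the excess cleanly” in practice usually means returning 429 Too Many Requests promptly, rather than letting the request queue up and time out.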


The trade-offs, on one page

Every technique in this guide buys you something and charges you for it. The whole discipline compressed into a table:

Decision       | What it buys                               | What it costs
---------------+--------------------------------------------+---------------------------------------------
Caching        | Massive read-load reduction, lower latency | Invalidation complexity, stale-data risk
Replication    | Read scaling, failover                     | Replication lag, eventual consistency
Sharding       | Write scaling beyond one node              | Cross-shard queries, resharding pain
Microservices  | Independent scaling and deploys            | Operational + observability overhead
Message queues | Responsiveness, decoupling                 | At-least-once duplicates, idempotency burden
CDN            | Global low latency, less origin load       | Cache management, invalidation

There is no architecture that wins every column. There’s only the architecture that’s right for the requirements you wrote down at the start.


A concrete example: a SaaS platform that grows up

Theory is cheap. Here’s how the pieces assemble for a project-management SaaS and, more usefully, how the architecture should evolve rather than arrive fully formed.

Stage 1: launch (0 to ~10k users). A Next.js frontend, a single monolithic API, one Postgres instance, Redis for sessions and hot reads. That’s it. This stack will take you remarkably far, and every hour spent making it fancier is an hour not spent finding product-market fit.

Stage 2: traction (~10k to ~500k users). Now the database feels read pressure. Add read replicas and route reads to them. Put a CDN in front of static assets. Move the slow post-request work (emails, exports, notifications) onto a queue with background workers. Notice that none of this is a rewrite; it’s additive.

Stage 3: scale (500k+). Specific services now have genuine, measured reasons to scale independently, so peel them off the monolith one at a time. Shard the tables that have outgrown a single primary. Go multi-region for latency and resilience. By the time you’re here, you have the traffic data to make each of these calls with evidence instead of guesswork.

Stage 3 topology:
Users → Cloudflare → Load Balancer → API Gateway
                                       ├→ Auth service
                                       ├→ Projects service
                                       ├→ Billing service
                                       └→ Notifications service
                                            ↓
                     Postgres (sharded + replicas) · Redis · Queue workers

The lesson isn’t the diagram. It’s that you arrive at the diagram in steps, each justified by a problem you can measure, and that Stage 1 is a perfectly good place to spend a long time.


The mistakes that show up again and again

Premature microservices. The most expensive early mistake. You take on distributed-systems cost before you have distributed-systems problems.

No observability. Shipping without logs, metrics, and tracing means flying blind the first time production misbehaves, which is exactly when you can least afford it.

Treating the database as infinitely fast. It’s almost always your first bottleneck. A missing index or an uncached hot query will surface long before anything exotic does.

Single points of failure. One box, one balancer, one region: each is a future outage with a date you haven’t met yet.

Over-engineering for scale you don’t have. Building for ten million users while serving ten thousand is just complexity you pay to maintain and learn nothing from.


Principles that survive contact with production

  1. Start simple. Complexity should be earned, not assumed.
  2. Scale incrementally, and let measurements, not fashion, drive each step.
  3. Measure before optimizing. Intuition about bottlenecks is usually wrong.
  4. Design for failure, because failure is the default state of distributed systems.
  5. Cache what’s read often; it’s the cheapest large win you have.
  6. Keep the application tier stateless so servers stay disposable.
  7. Make every queue consumer idempotent.
  8. Read percentiles, not averages.
  9. Build security in from day one.
  10. Solve the business problem in front of you, not the one you imagine having at a hundred times your current scale.

Final thoughts

System design isn’t about reproducing the architecture diagrams of large tech companies. Those diagrams are answers to their requirements at their scale; copying them gives you their complexity without their reasons.

The real skill is judgment: understanding how systems behave under load and failure, and making deliberate trade-offs among scalability, performance, reliability, security, and cost for the system you’re actually building. The strongest engineers are defined as much by what they choose not to build as by what they do build: by knowing when a single Postgres instance is the right answer and when it’s time to shard.

Get the fundamentals right, keep things as simple as the requirements allow, and add complexity only when a measured problem demands it. That’s the whole discipline.


