Learning Distributed Systems by Building a Social Media Platform
Hey folks!
A lot of people use social media and wonder how it works behind the scenes. How is everything so fast? Instant notifications, chat messages, status updates, posts, comments, likes and much more. It’s mind-boggling that a single application can do so much for millions of concurrent users. We’re going to unpack this mystery in this four-phase series.
We will build a feature-rich social media application from scratch. We’ll cover frontend, backend, databases, caching, sharding, partitioning, replication and much more. We’ll implement features, learn about scalability and make the product reliable and maintainable.
This is not a simple tutorial on implementing yet another “todo list”. This series is intended to discuss the tradeoffs being made when choosing tools and techniques. We will scale our application for millions of users, so we’ll have to answer a lot of questions.
This project is divided into four phases because we don’t want to set everything up before implementing a single feature. We’ll proceed gradually and learn by making mistakes and correcting them, just as a normal developer does.
- Phase 1: Building the Functional Application. This phase will focus on building the product. We’ll do what a Software Engineer is supposed to do: ship features. We won’t think about scalability yet. This phase focuses on a monolithic implementation of our full-stack application. It will use a single Postgres database, and we won’t even use caching yet.
- Phase 2: Making the Application Production-Grade. Here we’ll discuss architectural patterns, telemetry, improved error handling, better testing, CI/CD, etc.
- Phase 3: Scaling a Monolith into a High-Traffic Architecture. This is where caching, sharding, replication and load balancing come in. We’ll scale our application using recommended tools and techniques, but we won’t be fully distributed yet.
- Phase 4: Turning a Social Media Platform into a Fully Distributed System. This is where we carefully design our architecture to go fully distributed. We’ll cover which patterns are useful for us and why we’re using them. Microservices, Event-Driven Architecture, Distributed Caching and API gateways are some topics that come to mind. We’ll see what we need and why we need it when we arrive at Phase 4.
It should be clear by now that this series is intended for full-stack developers who want to learn how to scale systems in a distributed manner. The series assumes that you have a decent understanding of programming languages, client-server architecture, APIs, frameworks, databases and testing. It will not cover the basics of JavaScript/TypeScript, React/Next.js, Node.js/NestJS or SQL/NoSQL, as there are already plenty of resources for that.
Let’s talk about our phases and what to expect from each iteration.
Phase 1: Building the Functional Application
The first phase is just about implementing features. We’ll take one feature at a time, figure out what it does, design the frontend, design the backend, create or modify database tables, and write simple tests.
Remember: Our current phase is not about scaling. It doesn’t matter if our app can’t handle more than a hundred users right now; we just want to make things work.
Technology Stack
- TypeScript: We’re not building distributed infrastructure from scratch, a game engine, or an operating system. We just want to make sense of how the parts of a complex application work together. We should prefer a language that is mature enough to do that, yet simple enough to understand and build modules in easily.
- Next.js & React: Next.js gives us server-side rendering, and React is a standard choice for building large, manageable applications. For styling, we’ll consider shadcn/ui and Tailwind CSS if necessary. Since our focus is on building complex logic, we won’t invest heavily in custom design, but the app will still look clean.
- Node.js & NestJS: Since most of our API work in this phase is I/O-bound, we’ll use Node.js. When optimizing for performance and scalability later, we’ll need to figure out the bottlenecks in our application. We can eventually rewrite specific services (the bottlenecks, not the whole app) in another language. Rust will be used later where we need maximum performance, but we’ll get to that in future phases.
- Postgres: A relational database is the obvious choice when storing data for a social media application. It’s fast, open-source, modern and sufficient for Phases 1 and 2. We’ll consider other databases later.
- MongoDB: This database will be used to store logs and authentication tokens. It’s simple and well suited to these tasks. We’ll consider Redis for authentication tokens and caching in later phases.
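Because the plan is to keep authentication tokens in MongoDB now and possibly move them to Redis later, it can help to hide token storage behind a small interface from day one. The sketch below is only an illustration under that assumption: `TokenStore` and `InMemoryTokenStore` are hypothetical names, the store is an in-memory stand-in rather than a real MongoDB or Redis client, and the methods are synchronous for brevity (a database-backed version would return promises).

```typescript
// Hypothetical contract for token storage; not an API from any library.
interface TokenStore {
  save(token: string, userId: string, ttlMs: number): void;
  findUserId(token: string): string | undefined;
  revoke(token: string): void;
}

// In-memory stand-in that only illustrates the contract.
// A MongoDB- or Redis-backed class could implement the same interface later.
class InMemoryTokenStore implements TokenStore {
  private readonly entries = new Map<string, { userId: string; expiresAt: number }>();
  private readonly now: () => number;

  // The clock is injectable so expiry can be exercised without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  save(token: string, userId: string, ttlMs: number): void {
    this.entries.set(token, { userId, expiresAt: this.now() + ttlMs });
  }

  findUserId(token: string): string | undefined {
    const entry = this.entries.get(token);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(token); // lazily drop expired tokens on read
      return undefined;
    }
    return entry.userId;
  }

  revoke(token: string): void {
    this.entries.delete(token);
  }
}
```

The rest of the backend would depend only on `TokenStore`, so swapping the storage engine in a later phase should not ripple through the authentication code.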
Features
We’re not going to create another Facebook or Instagram, as those platforms have been built up over many years by thousands of engineers. But we do want to build something big enough to understand complex system interactions. We’ll implement the following core features.
- Users: Follows, Profile Management, Suggested Users
- Authentication: Signup, Login, Forgot Password
- Posts: Media support, Comments and Replies
- Feed: Hot/Trending posts
- Likes: On Posts, Comments, Replies
- Chat: Direct Messaging with another person (Groups will not be supported initially).
- Notifications: Alerts when someone posts, comments, likes, follows, or accepts a request.
- Recommendations: Recommended Posts based on interactions.
- Search: Decent, organized search results. (These will be improved in Phase 4 when we introduce dedicated search databases).
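To make the “Hot/Trending” feed item a little more concrete, here is one possible ranking rule. The series has not fixed a formula yet, so treat everything here as an assumption: the `PostStats` shape, the weight of 2 for comments, and the `gravity` exponent are all made up for illustration. The idea is simply that engagement pushes a post up while age pulls it down.

```typescript
// Hypothetical shape: just the fields this ranking sketch needs.
interface PostStats {
  id: string;
  likes: number;
  comments: number;
  createdAt: number; // epoch milliseconds
}

const HOUR_MS = 60 * 60 * 1000;

// Engagement divided by a power of the post's age.
// The comment weight (2) and default gravity (1.5) are arbitrary choices.
function hotScore(post: PostStats, now: number, gravity = 1.5): number {
  const ageHours = Math.max(0, now - post.createdAt) / HOUR_MS;
  const engagement = post.likes + 2 * post.comments;
  // The +2 keeps brand-new posts from dividing by a value near zero.
  return engagement / Math.pow(ageHours + 2, gravity);
}

// Returns a new array sorted from hottest to coldest; the input is not mutated.
function rankHot(posts: PostStats[], now: number): PostStats[] {
  return [...posts].sort((a, b) => hotScore(b, now) - hotScore(a, now));
}
```

With this rule, a one-hour-old post with 10 likes outranks a 100-hour-old post with 100 likes, which is the kind of behaviour a trending feed usually wants. In Phase 1 a score like this could be computed at query time; later phases are where precomputing and caching it would come in.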
Planned Articles
This is the planned list of topics we will cover. As new articles are published, each entry in this list will be updated to link to its article. Stay tuned!
- Getting Started - UI | Backend
- Authentication - UI | Backend
- Profile - UI | Backend
- Making Friends - UI | Backend
- Posts - UI | Backend
- Comments - UI | Backend
- Likes - UI | Backend
- Feed - UI | Backend
- Search - UI | Backend
- Notifications - UI | Backend
- Chat - UI | Backend
Since this is not a book, but a series of articles written step-by-step, I haven’t fully planned the later phases in detail yet. However, here is an overview of what to expect.
Phase 2: Making the Application Production-Grade
This phase is about making the product robust and ready to ship for production use. We will improve:
- Architecture / Design patterns
- Error Handling
- Telemetry (Logging/Metrics)
- Test coverage
- CI/CD pipelines
- Security best practices
We’ll cover basic CI/CD to avoid the pain of manual deployments. We’ll also use S3-compatible object storage (MinIO) to store our media files.
Phase 3: Scaling a Monolith into a High-Traffic Architecture
What we build in Phase 1 won’t be able to handle thousands of users, let alone millions. In this phase, we’ll learn how to scale databases and utilize caching strategies effectively.
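As a preview of what a “caching strategy” can look like, here is a minimal sketch of the cache-aside pattern: read from the cache first, fall back to the source of truth on a miss, then remember the result for a while. This is not the implementation the series will end up with. `CacheAside` is a hypothetical class, the cache is a plain in-memory `Map` rather than Redis, and the loader is synchronous to keep the example short (a real database loader would be asynchronous).

```typescript
// Minimal cache-aside helper. All names here are illustrative assumptions.
class CacheAside<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly load: (key: string) => V;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(load: (key: string) => V, ttlMs: number, now: () => number = () => Date.now()) {
    this.load = load; // stands in for a database query
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful for tests
  }

  get(key: string): V {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value; // cache hit: the loader is not called
    }
    // Cache miss or expired entry: go to the source of truth, then store the result.
    const value = this.load(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }

  // Call this after a write so the next read reloads fresh data.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

The tradeoff to keep in mind is staleness: until an entry expires or is invalidated, readers may see old data. Choosing TTLs and invalidation points is exactly the kind of question Phase 3 will have to answer.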
Phase 4: Turning a Social Media Platform into a Fully Distributed System
This is where we carefully design our architecture to go fully distributed. We’ll cover which patterns are useful and why. Microservices, Event-Driven Architecture, Distributed Caching and API gateways are key topics here. We’ll analyze what we need and why we need it when we arrive at Phase 4.
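To give a feel for the event-driven idea before we get there, here is a tiny in-process sketch. It is an assumption-heavy illustration, not the Phase 4 design: `EventBus`, `AppEvents` and the event names are hypothetical, and a real distributed setup would put a message broker between services instead of a `Map` of callbacks. What it does show is the decoupling: the code that records a like only publishes an event, and it does not need to know that notifications exist.

```typescript
// Hypothetical event catalogue; names and payloads are made up for this sketch.
type AppEvents = {
  "post.liked": { postId: string; likedBy: string };
  "user.followed": { followerId: string; followeeId: string };
};

type Handler<T> = (payload: T) => void;

// In-process publish/subscribe, typed by an event catalogue such as AppEvents.
class EventBus<E extends Record<string, unknown>> {
  private readonly handlers = new Map<keyof E, Handler<any>[]>();

  subscribe<K extends keyof E>(event: K, handler: Handler<E[K]>): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Delivers the payload to every handler registered for this event, in order.
  publish<K extends keyof E>(event: K, payload: E[K]): void {
    for (const handler of this.handlers.get(event) ?? []) {
      handler(payload);
    }
  }
}
```

A notifications module could call `subscribe("post.liked", ...)` once at startup, while the likes module calls `publish("post.liked", { postId, likedBy })` whenever a like is stored. Moving from this to a broker changes the transport, but the shape of the contract stays similar.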