
sean goedecke

Everything I know about good system design

I see a lot of bad system design advice. One classic is the LinkedIn-optimized “bet you never heard of queues” style of post, presumably aimed at people who are new to the industry. Another is the Twitter-optimized “you’re a terrible engineer if you ever store booleans in a database” clever trick1. Even good system design advice can be kind of bad. I love Designing Data-Intensive Applications, but I don’t think it’s particularly useful for most system design problems engineers will run into.

What is system design? In my view, if software design is how you assemble lines of code, system design is how you assemble services. The primitives of software design are variables, functions, classes, and so on. The primitives of system design are app servers, databases, caches, queues, event buses, proxies, and so on.

This post is my attempt to write down, in broad strokes, everything I know about good system design. A lot of the concrete judgment calls do come down to experience, which I can’t convey in this post. But I’m trying to write down what I can.

Recognizing good design

What does good system design look like? I’ve written before that it looks underwhelming. In practice, it looks like nothing going wrong for a long time. You can tell that you’re in the presence of good design if you have thoughts like “huh, this ended up being easier than I expected”, or “I never have to think about this part of the system, it’s fine”. Paradoxically, good design is self-effacing: bad design is often more impressive than good. I’m always suspicious of impressive-looking systems. If a system has distributed-consensus mechanisms, many different forms of event-driven communication, CQRS, and other clever tricks, I wonder if there’s some fundamental bad decision that’s being compensated for (or if the system is just straightforwardly over-designed).

I’m often alone on this. Engineers look at complex systems with many interesting parts and think “wow, a lot of system design is happening here!” In fact, a complex system usually reflects an absence of good design. I say “usually” because sometimes you do need complex systems. I’ve worked on many systems that earned their complexity. However, a complex system that works always evolves from a simple system that works. Beginning from scratch with a complex system is a really bad idea.

State and statelessness

The hard part about software design is state. If you’re storing any kind of information for any amount of time, you have a lot of tricky decisions to make about how you save, store and serve it. If you’re not storing information2, your app is “stateless”. As a non-trivial example, GitHub has an internal API that takes a PDF file and returns an HTML rendering of it. That’s a real stateless service. Anything that writes to a database is stateful.

You should try to minimize the number of stateful components in any system. (In a sense this is trivially true, because you should try to minimize the number of components in a system overall, but stateful components are particularly dangerous.) The reason you should do this is that stateful components can get into a bad state. Our stateless PDF-rendering service will safely run forever, as long as you’re doing broadly sensible things: e.g. running it in a restartable container so that if anything goes wrong it can be automatically killed and restored to working order. A stateful service can’t be automatically repaired like this. If your database gets a bad entry in it (for instance, an entry with a format that triggers a crash in your application), you have to manually go in and fix it up. If your database runs out of room, you have to figure out some way to prune unneeded data or expand it.

What this means in practice is having one service that knows about the state - i.e. it talks to a database - and other services that do stateless things. Avoid having five different services all write to the same table. Instead, have four of them send API requests (or emit events) to the first service, and keep the writing logic in that one service. If you can, it’s worth doing this for the read logic as well, although I’m less absolutist about this. It’s sometimes better for services to do a quick read of the user_sessions table than to make a 2x slower HTTP request to an internal sessions service.

Databases

Since managing state is the most important part of system design, the most important component is usually where that state lives: the database. I’ve spent most of my time working with SQL databases (MySQL and PostgreSQL), so that’s what I’m going to talk about.

Schemas and indexes

If you need to store something in a database, the first thing to do is define a table with the schema you need. Schema design should be flexible, because once you have thousands or millions of records, it can be an enormous pain to change the schema. However, if you make it too flexible (e.g. by sticking everything in a “value” JSON column, or using “keys” and “values” tables to track arbitrary data) you load a ton of complexity into the application code (and likely buy some very awkward performance constraints). Drawing the line here is a judgment call and depends on specifics, but in general I aim to have my tables be human-readable: you should be able to go through the database schema and get a rough idea of what the application is storing and why.

If you expect your table to ever be more than a few rows, you should put indexes on it. Try to make your indexes match the most common queries you’re sending (e.g. if you query by email and type, create an index with those two fields). Don’t index on every single thing you can think of, since each index adds write overhead.
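
The composite-index advice can be sketched with SQLite standing in for the production database (the users table and index name here are hypothetical):

```python
import sqlite3

# Hypothetical users table with one composite index matching the common
# query shape (email AND type), rather than an index on every column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, type TEXT, name TEXT)"
)
conn.execute("CREATE INDEX idx_users_email_type ON users (email, type)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ? AND type = ?",
    ("a@example.com", "admin"),
).fetchall()
# The plan should report a SEARCH using the composite index, not a full SCAN.
print(plan[0][-1])
```

Running EXPLAIN QUERY PLAN (or your database’s equivalent) is a cheap way to confirm that a common query actually hits the index you built for it.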

Bottlenecks

Accessing the database is often the bottleneck in high-traffic applications. This is true even when the compute side of things is relatively inefficient (e.g. Ruby on Rails running on a preforking server like Unicorn). That’s because complex applications need to make a lot of database calls - hundreds and hundreds for every single request, often sequentially (because you don’t know if you need to check whether a user is part of an organization until after you’ve confirmed they’re not abusive, and so on). How can you avoid getting bottlenecked?

When querying the database, query the database. It’s almost always more efficient to get the database to do the work than to do it yourself. For instance, if you need data from multiple tables, JOIN them instead of making separate queries and stitching them together in-memory. Particularly if you’re using an ORM, beware accidentally making queries in an inner loop. That’s an easy way to turn a select id, name from table to a select id from table and a hundred select name from table where id = ?.
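
A minimal illustration of the N+1 trap and the JOIN that replaces it, using SQLite and a made-up users/orders schema:

```python
import sqlite3

# Made-up schema: users and their orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

# The N+1 shape to avoid: one query for ids, then one more query per row.
user_ids = [row[0] for row in conn.execute("SELECT id FROM users")]
totals_n_plus_one = {}
for user_id in user_ids:  # 1 + N round trips to the database
    row = conn.execute(
        "SELECT SUM(total) FROM orders WHERE user_id = ?", (user_id,)
    ).fetchone()
    totals_n_plus_one[user_id] = row[0]

# The JOIN shape: one query, and the database does the stitching.
joined = conn.execute("""
    SELECT u.name, SUM(o.total)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.id ORDER BY u.id
""").fetchall()
print(joined)  # → [('ada', 12.5), ('grace', 3.0)]
```

Both produce the same totals, but the loop version costs one database round trip per user, which is exactly what an ORM will quietly do for you if you’re not careful.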

Every so often you do want to break queries apart. It doesn’t happen often, but I’ve run into queries that were ugly enough that it was easier on the database to split them up than to try to run them as a single query. I’m sure it’s always possible to construct indexes and hints such that the database can do it better, but the occasional tactical query-split is a tool worth having in your toolbox.

Send as many read queries as you can to database replicas. A typical database setup will have one write node and a bunch of read-replicas. The more you can avoid reading from the write node, the better - that write node is already busy enough doing all the writes. The exception is when you really, really can’t tolerate any replication lag (since read-replicas are always running at least a handful of ms behind the write node). But in most cases replication lag can be worked around with simple tricks: for instance, when you update a record but need to use it right after, you can fill in the updated details in-memory instead of immediately re-reading after a write.

Beware spikes of queries (particularly write queries, and particularly transactions). Once a database gets overloaded, it gets slow, which makes it more overloaded. Transactions and writes are good at overloading databases, because they require a lot of database work for each query. If you’re designing a service that might generate massive query spikes (e.g. some kind of bulk-import API), consider throttling your queries.
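
One simple shape for that throttling, sketched as fixed-size batches (the batch size is illustrative, not a recommendation; real code would also sleep or consume a token bucket between batches):

```python
# Split a bulk import into fixed-size batches instead of firing every
# write at once. batch_size=100 is a made-up illustrative number.
def throttled_batches(rows, batch_size=100):
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]
        # a time.sleep(...) or token-bucket wait would go here

batch_sizes = [len(b) for b in throttled_batches(list(range(250)), batch_size=100)]
print(batch_sizes)  # → [100, 100, 50]
```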

Slow operations, fast operations

A service has to do some things fast. If a user is interacting with something (say, an API or a web page), they should see a response within a few hundred ms3. But a service has to do other things that are slow. Some operations just take a long time (converting a very large PDF to HTML, for instance). The general pattern for this is splitting out the minimum amount of work needed to do something useful for the user and doing the rest of the work in the background. In the PDF-to-HTML example, you might render the first page to HTML immediately and queue up the rest in a background job.

What’s a background job? It’s worth answering this in detail, because “background jobs” are a core system design primitive. Every tech company will have some kind of system for running background jobs. There will be two main components: a collection of queues, e.g. in Redis, and a job runner service that will pick up items from the queues and execute them. You enqueue a background job by putting an item like {job_name, params} on the queue. It’s also possible to schedule background jobs to run at a set time (which is useful for periodic cleanups or summary rollups). Background jobs should be your first choice for slow operations, because they’re typically such a well-trodden path.
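
The queue-plus-runner pattern can be sketched in a few lines, with an in-memory deque standing in as the Redis queue (the job name and params are hypothetical):

```python
import json
from collections import deque

queue = deque()  # stand-in for the Redis-backed queue
JOBS = {}        # registry mapping job names to functions

def job(fn):
    JOBS[fn.__name__] = fn
    return fn

def enqueue(job_name, **params):
    # The queued item is just {job_name, params}, serialized.
    queue.append(json.dumps({"job_name": job_name, "params": params}))

@job
def render_pdf_pages(document_id, start_page):
    return f"rendered {document_id} from page {start_page}"

def run_one():
    # The job runner service: pick an item off the queue and execute it.
    item = json.loads(queue.popleft())
    return JOBS[item["job_name"]](**item["params"])

enqueue("render_pdf_pages", document_id="doc-1", start_page=2)
result = run_one()
print(result)  # → rendered doc-1 from page 2
```

A real runner would loop forever, block on the queue, and handle retries and failures, but the core shape is just this: serialize the job name plus params, and have a worker look the name up and call it.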

Sometimes you want to roll your own queue system. For instance, if you want to enqueue a job to run in a month, you probably shouldn’t put an item on the Redis queue. Redis persistence is typically not guaranteed over that period of time (and even if it is, you likely want to be able to query for those far-future enqueued jobs in a way that would be tricky with the Redis job queue). In this case, I typically create a database table for the pending operation with columns for each param plus a scheduled_at column. I then use a daily job to check for these items with scheduled_at <= today, and either delete them or mark them as complete once the job has finished.
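
A sketch of that table-backed approach, using SQLite and a hypothetical pending_operations table (the param column and dates are made up):

```python
import sqlite3

# One column per param, plus scheduled_at, swept by a daily job.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pending_operations (
        id INTEGER PRIMARY KEY,
        account_id TEXT,        -- hypothetical param column
        scheduled_at TEXT,      -- ISO date, so string comparison sorts correctly
        completed INTEGER DEFAULT 0
    )
""")
conn.execute("INSERT INTO pending_operations (account_id, scheduled_at) VALUES (?, ?)",
             ("acct-1", "2025-01-01"))
conn.execute("INSERT INTO pending_operations (account_id, scheduled_at) VALUES (?, ?)",
             ("acct-2", "2999-01-01"))

def run_daily_sweep(today):
    # The daily job: find everything due (scheduled_at <= today) and run it.
    due = conn.execute(
        "SELECT id, account_id FROM pending_operations "
        "WHERE scheduled_at <= ? AND completed = 0", (today,)
    ).fetchall()
    for op_id, account_id in due:
        # ... perform the operation here, then mark it complete ...
        conn.execute("UPDATE pending_operations SET completed = 1 WHERE id = ?", (op_id,))
    return [account_id for _, account_id in due]

ran = run_daily_sweep("2025-06-21")
print(ran)  # → ['acct-1']
```

Unlike a Redis queue, this survives restarts and lets you query the far-future backlog with ordinary SQL.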

Caching

Sometimes an operation is slow because it needs to do an expensive (i.e. slow) task that’s the same between users. For instance, if you’re calculating how much to charge a user in a billing service, you might need to do an API call to look up the current prices. If you’re charging users per-use (like OpenAI does per-token), that could (a) be unacceptably slow and (b) cause a lot of traffic for whatever service is serving the prices. The classic solution here is caching: only looking up the prices every five minutes, and storing the value in the meantime. It’s easiest to cache in-memory, but using some fast external key-value store like Redis or Memcached is also popular (since it means you can share one cache across a bunch of app servers).
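
The five-minute price cache might look something like this minimal in-memory TTL cache (fetch_prices is a made-up stand-in for the slow pricing call):

```python
import time

class TTLCache:
    """Hold one value, refetching only when it's older than ttl_seconds."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.value = None
        self.fetched_at = None

    def get(self, fetch, now=None):
        now = time.monotonic() if now is None else now
        if self.fetched_at is None or now - self.fetched_at >= self.ttl:
            self.value = fetch()
            self.fetched_at = now
        return self.value

calls = []
def fetch_prices():  # stand-in for the expensive API call
    calls.append(1)
    return {"per_token": 0.002}

cache = TTLCache(ttl_seconds=300)
cache.get(fetch_prices, now=0)     # miss: does the slow lookup
cache.get(fetch_prices, now=100)   # hit: serves the stored value
cache.get(fetch_prices, now=301)   # expired: looks up again
print(len(calls))  # → 2
```

The same interface works if you swap the instance variables for Redis or Memcached reads and writes, which is what you’d do to share the cache across app servers.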

The typical pattern is that junior engineers learn about caching and want to cache everything, while senior engineers want to cache as little as possible. Why is that? It comes down to the first point I made about the danger of statefulness. A cache is a source of state. It can get weird data in it, or get out-of-sync with the actual truth, or cause mysterious bugs by serving stale data, and so on. You should never cache something without first making a serious effort to speed it up. For instance, it’s silly to cache an expensive SQL query that isn’t covered by a database index. You should just add the database index!

I use caching a lot. One useful caching trick to have in the toolbox is using a scheduled job and a document storage like S3 or Azure Blob Storage as a large-scale persistent cache. If you need to cache the result of a really expensive operation (say, a weekly usage report for a large customer), you might not be able to fit the result in Redis or Memcached. Instead, stick a timestamped blob of the results in your document storage and serve the file directly from there. Like the database-backed long-term queue I mentioned above, this is an example of using the caching idea without using a specific cache technology.

Events

As well as some kind of caching infrastructure and background job system, tech companies will typically have an event hub. The most common implementation of this is Kafka. An event hub is just a queue - like the one for background jobs - but instead of putting “run this job with these params” on the queue, you put “this thing happened” on the queue. One classic example is firing off a “new account created” event for each new account, and then having multiple services consume that event and take some action: a “send a welcome email” service, a “scan for abuse” service, a “set up per-account infrastructure” service, and so on.
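
The fan-out shape can be sketched as a toy in-process event hub (a real system would use Kafka or similar; the event name and handlers follow the account-creation example above):

```python
from collections import defaultdict

subscribers = defaultdict(list)  # event name -> list of consumer handlers

def subscribe(event_name, handler):
    subscribers[event_name].append(handler)

def publish(event_name, payload):
    # The publisher just says "this thing happened"; it doesn't know or
    # care what the consumers do with it.
    return [handler(payload) for handler in subscribers[event_name]]

subscribe("account.created", lambda e: f"welcome email to {e['email']}")
subscribe("account.created", lambda e: f"abuse scan for {e['account_id']}")

results = publish("account.created", {"account_id": "a1", "email": "x@example.com"})
print(results)
```

The key property is that adding a third consumer requires no change to the code that fires the event.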

You shouldn’t overuse events. Much of the time it’s better to just have one service make an API request to another service: all the logs are in the same place, it’s easier to reason about, and you can immediately see what the other service responded with. Events are good for when the code sending the event doesn’t necessarily care what the consumers do with the event, or when the events are high-volume and not particularly time-sensitive (e.g. abuse scanning on each new Twitter post).

Pushing and pulling

When you need data to flow from one place to a lot of other places, there are two options. The simplest is to pull. This is how most websites work: you have a server that owns some data, and when a user wants it they make a request (via their browser) to the server to pull that data down to them. The problem here is that users might do a lot of pulling down the same data - e.g. refreshing their email inbox to see if they have any new emails, which will pull down and reload the entire web application instead of just the data about the emails.

The alternative is to push. Instead of allowing users to ask for the data, you allow them to register as clients, and then when the data changes, the server pushes the data down to each client. This is how GMail works: you don’t have to refresh the page to get new emails, because they’ll just appear when they arrive.

If we’re talking about background services instead of users with web browsers, it’s easy to see why pushing can be a good idea. Even in a very large system, you might only have a hundred or so services that need the same data. For data that doesn’t change much, it’s much easier to make a hundred HTTP requests (or RPC, or whatever) whenever the data changes than to serve up the same data a thousand times a second.

Suppose you did need to serve up-to-date data to a million clients (like GMail does). Should those clients be pushing or pulling? It depends. Either way, you won’t be able to run it all from a single server, so you’ll need to farm it out to other components of the system. If you’re pushing, that will likely mean sticking each push on an event queue and having a horde of event processors each pulling from the queue and sending out your pushes. If you’re pulling, that will mean standing up a bunch (say, a hundred) of fast4 read-replica cache servers that will sit in front of your main application and handle all the read traffic5.

Hot paths

When you’re designing a system, there are lots of different ways users can interact with it or data can flow through it. It can get a bit overwhelming. The trick is to mainly focus on the “hot paths”: the part of the system that is most critically important, and the part of the system that is going to handle the most data. For instance, in a metered billing system, those pieces might be the part that decides whether or not a customer gets charged, and the part that needs to hook into all user actions on the platform to identify how much to charge.

Hot paths are important because they have fewer possible solutions than other design areas. There are a thousand ways you can build a billing settings page and they’ll all mainly work. But there might be only a handful of ways that you can sensibly consume the firehose of user actions. Hot paths also go wrong more spectacularly. You have to really screw up a settings page to take down the entire product, but any code you write that’s triggered on all user actions can easily cause huge problems.

Logging and metrics

How do you know if you’ve got problems? One thing I’ve learned from my most paranoid colleagues is to log aggressively during unhappy paths. If you’re writing a function that checks a bunch of conditions to see if a user-facing endpoint should respond 422, you should log out the condition that was hit. If you’re writing billing code, you should log every decision made (e.g. “we’re not billing for this event because of X”). Many engineers don’t do this because it adds a bunch of logging boilerplate and makes it hard to write beautifully elegant code, but you should do it anyway. You’ll be happy you did when an important customer is complaining that they’re getting a 422 - even if that customer did something wrong, you still need to figure out what they did wrong for them.

You should also have basic observability into the operational parts of the system. That means CPU/memory on the hosts or containers, queue sizes, average time per-request or per-job, and so on. For user-facing metrics like time per-request, you also need to watch the p95 and p99 (i.e. how slow your slowest requests are). Even one or two very slow requests are scary, because they’re disproportionately from your largest and most important users. If you’re just looking at averages, it’s easy to miss the fact that some users are finding your service unusable.
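
A quick worked example of why averages hide tail latency, using a simple nearest-rank percentile (the latency numbers are made up):

```python
def percentile(values, p):
    # Nearest-rank percentile: small and dependency-free, fine for a sketch.
    ordered = sorted(values)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

# 98 fast requests and 2 pathological ones
latencies_ms = [20] * 98 + [2000, 5000]
avg = sum(latencies_ms) / len(latencies_ms)
print(round(avg, 1), percentile(latencies_ms, 95), percentile(latencies_ms, 99))
# → 89.6 20 2000
```

The average (under 100ms) looks healthy, and even p95 is fine, but p99 reveals that some users are waiting two full seconds.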

Killswitches, retries, and failing gracefully

I wrote a whole post about killswitches that I won’t repeat here, but the gist is that you should think carefully about what happens when the system fails badly.

Retries are not a magic bullet. You need to make sure you’re not putting extra load on other services by blindly retrying failed requests. If you can, put high-volume API calls inside a “circuit breaker”: if you get too many 5xx responses in a row, stop sending requests for a while to let the service recover. You also need to make sure you’re not retrying write events that may or may not have succeeded (for instance, if you send a “bill this user” request and get back a 5xx, you don’t know if the user has been billed or not). The classic solution to this is to use an “idempotency key”, which is a special UUID in the request that the other service uses to avoid re-running old requests: every time they do something, they save the idempotency key, and if they get another request with the same key, they silently ignore it.
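
The idempotency-key mechanism on the receiving side can be sketched like this (the billing function, key format, and in-memory store are all hypothetical):

```python
processed = {}  # idempotency key -> saved result (a real service persists this)

def bill_user(user_id, amount, idempotency_key):
    if idempotency_key in processed:
        # Replayed request: return the saved result, don't charge twice.
        return processed[idempotency_key]
    # ... actually charge the user here ...
    result = {"user_id": user_id, "charged": amount}
    processed[idempotency_key] = result
    return result

first = bill_user("u1", 500, "key-abc")
retry = bill_user("u1", 500, "key-abc")  # client retried after an ambiguous 5xx
print(len(processed))  # → 1
```

The client generates the UUID once per logical operation and reuses it on every retry; the server's job is just to remember which keys it has already acted on.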

It’s also important to decide what happens when part of your system fails. For instance, say you have some rate limiting code that checks a Redis bucket to see if a user has made too many requests in the current window. What happens when that Redis bucket is unavailable? You have two options: fail open and let the request through, or fail closed and block the request with a 429.
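
A sketch of the fail-open choice, with a fake counter standing in for the Redis bucket (RedisDown is a made-up stand-in for a connection error):

```python
class RedisDown(Exception):
    pass

class FlakyCounter:
    """Fake Redis counter that can simulate an outage."""
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.counts = {}

    def incr(self, key):
        if not self.healthy:
            raise RedisDown()
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

def allow_request(counter, user_id, limit=100):
    try:
        return counter.incr(user_id) <= limit
    except RedisDown:
        # Fail open: a rate-limiter outage shouldn't become a user-facing one.
        return True

allowed_when_healthy = allow_request(FlakyCounter(healthy=True), "u1")
allowed_during_outage = allow_request(FlakyCounter(healthy=False), "u1")
print(allowed_when_healthy, allowed_during_outage)  # → True True
```

Flipping that except branch to return False (and a 429) is all it takes to fail closed instead, which is why the decision deserves an explicit line of code rather than an accident of exception handling.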

Whether you should fail open or closed depends on the specific feature. In my view, a rate limiting system should almost always fail open. That means that a problem with the rate limiting code isn’t necessarily a big user-facing incident. However, auth should (obviously) always fail closed: it’s better to deny a user access to their own data than to give a user access to some other user’s data. There are a lot of cases where it’s not clear what the right behavior is. It’s often a difficult tradeoff.

Final thoughts

There are some topics I’m deliberately not covering here. For instance, whether or when to split your monolith out into different services, when to use containers or VMs, tracing, good API design. Partly this is because I don’t think it matters that much (in my experience, monoliths are fine), or because I think it’s too obvious to talk about (you should use tracing), or because I just don’t have the time (API design is complicated).

The main point I’m trying to make is what I said at the start of this post: good system design is not about clever tricks, it’s about knowing how to use boring, well-tested components in the right place. I’m not a plumber, but I imagine good plumbing is similar: if you’re doing something too exciting, you’re probably going to end up with crap all over yourself.

Especially at large tech companies, where these components already exist off the shelf (i.e. your company already has some kind of event bus, caching service, etc), good system design is going to look like nothing. There are very, very few areas where you want to do the kind of system design you could talk about at a conference. They do exist! I have seen hand-rolled data structures make features possible that wouldn’t have been possible otherwise. But I’ve only seen that happen once or twice in ten years. I see boring system design every single day.


  1. You’re supposed to store timestamps instead, and treat the presence of a timestamp as true. I do this sometimes but not always - in my view there’s some value in keeping a database schema immediately-readable.

  2. Technically any service stores information of some kind for some duration, at least in-memory. Typically what’s meant here is storing information outside of the request-response lifecycle (e.g. persistently on-disk somewhere, such as in a database). If you can stand up a new version of the app by simply spinning up the application server, that’s a stateless app.

  3. Gamedevs on Twitter will say that anything slower than 10ms is unacceptable. Whether or not that ought to be the case, it’s just factually not true of successful tech products - users will accept slower responses if the app is doing something that’s useful to them.

  4. They’re fast because they don’t have to talk to a database in the way the main server does. In theory, this could just be a static file on-disk that they serve up when asked, or even data held in-memory.

  5. Incidentally, those cache servers will either poll your main server (i.e. pulling) or your main server will send the new data to them (i.e. pushing). I don’t think it matters too much which you do. Pushing will give you more up-to-date data but pulling is simpler.

If you liked this post, consider subscribing to email updates about my new posts.

June 21, 2025 │ Tags: good engineers, software design