Your Database Skills Are Not "Nice to Have"
And an ode to hating object-relational mappers
By Andrei Taranchenko (LinkedIn)
Created: November 12, 2023
A MySQL war story
It’s 2006, and the New York Magazine digital team set out to create a new search experience for its Fashion Week portal. It was
one of
those projects where technical feasibility was not even discussed with the tech team - a common occurrence back then. Agile was
still new, let alone in publishing. It was just a vision, a real friggin’ moonshot, and 10 to 12 weeks to develop
the wireframed version of the product. There would be almost no time left for proper QA. Fashion Week does not start slowly but
rather goes from zero to sixty in a blink.
The vision? Thousands of near-real-time fashion show images, each one with its sub-items categorized: “2006”, “bag”, “red”, “
leather”, and so on. A user will land on the search page and have
the ability to “drill down” and narrow the results based on those properties. To make things much harder, all of these
properties would come with exact counts.
The workflow was going to be intense. Photographers will courier their digital cartridges from downtown NYC to
our offices on Madison Avenue, where the images will be processed, tagged by interns, and then indexed every hour by our Perl
script, reading the tags from the embedded EXIF information. Failure to build the search product on our side would have collapsed
the entire ecosystem already in place, primed and ready to rumble.
“Oh! Just use the facets in Solr, dude”. Yeah, not so fast - dude. In 2006 that kind of technology didn’t even exist yet. I sat
through multiple enterprise search engine demos, and none of the products (which cost a LOT of money) could do a deep
faceted search. We already had an Autonomy license and my first try proved that… it just couldn’t do it. It was supposed to be
able to, but the counts were all wrong. Endeca (now owned by Oracle) came out of stealth when the design part of the project was already underway. Too new, too raw, too risky. The idea was just a little too
ambitious for its time, especially for a tiny team in a non-tech company.
So here we were, a team of three, myself and two consultants, writing Perl for the indexing script, query-parsing logic, and
modeling the data - in MySQL 4. It was one of those projects where one single insurmountable technical risk would have sunk the
whole thing. I will cut the story short and spare you the excitement. We did it, and then we went out to celebrate at a karaoke
bar (where I got my very first work-stress-related severe hangover) 🤮
For someone who was in charge of the SQL model and queries, it was days and days of tuning those, timing every query and studying
the EXPLAIN output to see what else I could do to squeeze another 50ms out of the database. In the end, it was a combination of trial and error, digging deep into MySQL server settings, and crafting GROUP BY
queries that would make you nauseous. The MySQL query analyzer was fidgety back then, and sometimes re-arranging the fields in the
SELECT clause could change a query’s performance. Imagine if SELECT field1, field2 FROM my_table was faster than SELECT
field2, field1 FROM my_table. Why would it do that? I have no idea to this day, and I don’t even want to know.
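For flavor, here is a toy version of that kind of faceted-count query, with SQLite standing in for MySQL 4 and invented table and column names - the real queries were far hairier:

```python
import sqlite3

# A miniature stand-in for the tagged-image catalog; all names here are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE image_tags (image_id INTEGER, facet TEXT, value TEXT);
    INSERT INTO image_tags VALUES
        (1, 'color', 'red'),   (1, 'item', 'bag'),
        (2, 'color', 'red'),   (2, 'item', 'shoe'),
        (3, 'color', 'black'), (3, 'item', 'bag');
""")

# Exact counts per facet value - the core of a faceted search results page.
rows = conn.execute("""
    SELECT facet, value, COUNT(DISTINCT image_id) AS n
    FROM image_tags
    GROUP BY facet, value
    ORDER BY facet, n DESC
""").fetchall()

for facet, value, n in rows:
    print(facet, value, n)
```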
Unfortunately, I lost examples of this work, but the Wayback Machine has proof of our final product.
The point here is - if you really know your database, you can do pretty crazy things with it, and with the modern generation of
storage
technologies and beefier hardware, you don’t even need to push the limits - it should easily handle what I refer to as
“common-scale”.
The fading art of SQL
In the past few years I have been noticing an unsettling trend - software engineers are eager to use exotic “planet-scale”
databases for pretty rudimentary problems, while at the same time not having a good grasp of the very powerful
relational
database engine they are likely already using, let alone understanding the technology’s more advanced and useful capabilities.
The SQL layer is buried so deep beneath libraries and too-clever-by-half ORMs that it all just becomes high-level code.
- Why is it slow?
- No idea - let's add Cassandra to it!
Modern hardware certainly allows us to go way up from the CPU into the higher abstraction layers, while it wasn’t
that uncommon in the past to convert certain functions to assembly code in order to squeeze every bit of performance out of the
processor. Now compute and storage are cheaper - it's true - but abusing this abundance has trained us into laziness and complacency. Suddenly, that Cloud bill is a wee bit too high, and heaven knows how much energy the world is burning by just running billions of auto-generated "Squeel" queries every second against mammoth database instances.
The morning of my first job interview in 2004, I was on a subway train memorizing
the nine levels of database normalization. Or is it five levels? I don't remember, and it doesn't even matter - no one will ever ask you this now in a software engineer interview.
Just skimming through the table of contents of your database of choice, say the now
freshly in vogue Postgres,
you will find an absolute treasure trove of features fit to handle everything but the most gruesome
planet-scale computer science problems. Petabyte-sized Postgres boxes, replicated, are effortlessly running now as you
are reading this.
The trick is to not expect your database or your ORM to read your mind. Speaking of…
ORMs are not magic
I was a new hire at an e-commerce outfit, and right off the bat I was thrown into fixing serious performance issues with the
company’s product catalog pages. Just a straight-forward, paginated grid of product images. How hard could it be? Believe it or
not - it be. The pages took over 10 seconds to load, sometimes longer, the database was struggling, and the solution was to “just
cache it”. One last datapoint - this was not a high-traffic site. The pages were dead-slow even if there was no traffic at all.
That’s a
rotten sign that something is seriously off.
After looking a bit closer, I realized that I hit the motherlode - all top three major database and coding mistakes in one.
❌ Mistake #1: There is no index
The column that was hit in every single mission-critical query had no index. None. After adding the much-needed index in
production, you could practically hear MySQL exhaling in relief. Still, the performance was not quite there yet, so I
had to dig deeper, now in the code.
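The before-and-after is visible in any engine's plan output. A minimal sketch, with SQLite standing in for the production MySQL and made-up table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, name TEXT)")

query = "SELECT name FROM products WHERE category = ?"

# Without an index on the filtered column, the planner's only option is a full scan.
before = conn.execute("EXPLAIN QUERY PLAN " + query, ("bags",)).fetchall()
print(before[0][3])  # e.g. "SCAN products"

conn.execute("CREATE INDEX idx_products_category ON products (category)")

# With the index in place, the very same query becomes an index search.
after = conn.execute("EXPLAIN QUERY PLAN " + query, ("bags",)).fetchall()
print(after[0][3])   # e.g. "SEARCH products USING INDEX idx_products_category"
```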
❌ Mistake #2: Assuming each ORM call is free
Activating the query logs locally and reloading a product listing page, I see… 200, 300, 500 queries fired off just to load one
single page. What the shit? Turns out, this was the result of a classic ORM abuse of going through every record in a loop, to the
effect of:
products = []
for product_id in product_ids:
    product = my_orm.products.get(id=product_id)  # one database round-trip per item
    products.append(product)
The high number of queries was also due to the fact that some of this logic was nested. The obvious solution is to keep the number of queries in each request to a minimum, using ORM capabilities to join and combine the data into one single blob. This is what relational databases do - it's in the name.
What is happening above is that each separate query needs to travel to the database,
get parsed, transformed, analyzed, planned, executed,
and then travel back to the caller. It is one of
the most expensive operations you can do, and ORMs will happily do the worst possible thing for you in terms of
performance. How does that ORM call translate to SQL? If it’s not what you think it should be, is it an ORM limitation or are you
just not using the
right library call? Is it a particular flavor of non-ANSI vendor SQL that your choice of ORM has a tough time with?
Do you ultimately need to drop into raw SQL for this call but not the others? And so on.
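The difference is easy to measure. A small SQLite sketch (the table and data are invented) counts the round-trips the looping version fires off versus a single batched IN (...) query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(i, f"product {i}") for i in range(1, 101)])

statements = []
conn.set_trace_callback(statements.append)  # record every SQL statement executed

product_ids = list(range(1, 101))

# N+1 style: one query per id - what the looping ORM code degenerates into.
slow = [conn.execute("SELECT id, name FROM products WHERE id = ?", (pid,)).fetchone()
        for pid in product_ids]
n_plus_one = len(statements)

statements.clear()

# Batched: a single IN (...) query fetches the same rows in one round-trip.
placeholders = ",".join("?" * len(product_ids))
fast = conn.execute(
    f"SELECT id, name FROM products WHERE id IN ({placeholders})",
    product_ids).fetchall()

print(n_plus_one, "queries vs", len(statements))  # 100 queries vs 1
```

Over a real network, each of those 100 round-trips also pays parse/plan/transfer latency, so the gap in wall-clock time is far larger than the query count alone suggests.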
❌ Mistake #3: Pulling in the world
To make matters worse, the amount of data here was relatively small, but there were dozens and dozens of columns. What do ORMs
usually do by default in order to make your life “easier”? They send the whole thing, all the columns, clogging your network
pipes with the data that you don’t even need. It is a form of toxic technical debt, where the speed of development will
eventually start eating into performance.
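A toy illustration of how much of each row a default "fetch everything" drags along, using an invented wide table to stand in for the real catalog:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A wide table standing in for the real catalog; the column names are invented.
cols = ["name", "thumbnail_url"] + [f"attr_{i}" for i in range(40)]
conn.execute(f"CREATE TABLE products (id INTEGER PRIMARY KEY, "
             f"{', '.join(c + ' TEXT' for c in cols)})")
conn.execute(f"INSERT INTO products VALUES (1, {', '.join(['?'] * len(cols))})",
             ["x" * 100] * len(cols))

# What a default ORM fetch amounts to: every single column comes over the wire.
everything = conn.execute("SELECT * FROM products").fetchone()

# What the page actually needs: two columns, a fraction of the payload.
needed = conn.execute("SELECT name, thumbnail_url FROM products").fetchone()

print(len(everything), "columns fetched vs", len(needed))
```

In Django-flavored ORM code the equivalent fix is spellings like `.only(...)` or `.values(...)`, which restrict the generated SELECT to the named fields.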
I spent hours within the same project hacking the dark corners of the Django admin, overriding default ORM queries to be less "eager". This led to a much better office-facing experience.
Performance IS a feature
Serious, mission-critical systems have been running on classic and boring relational databases for decades, serving thousands of
requests per second. These systems have become more advanced, more capable, and more relevant. They are wonders of computer
science, one can claim. You would think that an ancient database like Postgres (in development since 1982) is in some kind of
legacy maintenance mode at this point, but the opposite is true. In fact, the work has been only accelerating, with the scale
and
features becoming pretty impressive.
What took multiple queries just a few years ago now takes a single one.
Why is this significant? It has been known for a long time,
as discovered by Amazon,
that every additional 100ms of a user waiting for a page to load loses a business money. We also know now that from a user’s
perspective,
the maximum target response time for a web page is around 100 milliseconds:
A delay of less than 100 milliseconds feels instant to a user, but a delay between 100 and 300 milliseconds is perceptible. A delay between 300 and 1,000 milliseconds makes the user feel like a machine is working, but if the delay is above 1,000 milliseconds, your user will likely start to mentally context-switch.
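Those bands are concrete enough to encode; a trivial helper, if only as a checklist for latency budgets:

```python
def perceived_delay(ms: float) -> str:
    """Map a response time to the user-perception bands quoted above."""
    if ms < 100:
        return "instant"
    if ms < 300:
        return "perceptible"
    if ms < 1000:
        return "machine is working"
    return "mental context switch"

print(perceived_delay(80), "/", perceived_delay(450))
```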
The “just add more CPU and RAM if it’s slow” approach may have worked for a while, but
many are finding out the hard way
that this kind of laziness is not sustainable in a frugal business environment where costs matter.
Database anti-patterns
Knowing what not to do is as important as knowing what to do. Some of the below mistakes are all too common:
❌ Anti-pattern #1. Using exotic databases for the wrong reasons
Technologies like DynamoDB are designed to handle scale at which Postgres and MySQL begin to fail. This is achieved by
denormalizing, duplicating the data aggressively, where the database is not doing much real-time data manipulation or joining.
Your data is now modeled after how it is queried, not after how it is related. Regular relational concepts disintegrate at this
insane level of scale. Needless to say, if you are resorting to this kind of storage for “common-scale” problems, you are already
solving problems you don’t have.
❌ Anti-pattern #2. Caching things unnecessarily
Caching is a necessary evil - but it’s not always necessary. There is an entire class of bugs and on-call issues that
stem from stale cached data. Read-only database replicas are a classic architecture pattern that
is still very much not outdated, and it will buy you insane levels of performance before you have to worry about anything. It
should
not be a
surprise that mature relational databases already have query caching in place - it just
has to be tuned
for your specific needs.
Cache invalidation is hard. It adds more complexity and states of uncertainty to your system. It makes debugging more difficult. Throughout my career I have received more emails than I care to count from content teams wondering "why is the data not there, I updated it 30 minutes ago?!"
Caching should not act as a bandaid for bad architecture and non-performant code.
❌ Anti-pattern #3. Storing everything and a kitchen sink
As much punishment as an industry-standard database can take, it’s probably not a good idea to not care at all about what’s going
into
it, treating it like a data landfill of sorts. Management, querying, backups, migrations - all becomes painful once the DB grows
substantially. Even if that is of no concern as you are using a
managed cloud DB - the costs should be. An RDBMS is a sophisticated piece of technology, and storing data in it is expensive.
Figure out common-scale first
It is fairly easy to make a beefy Postgres or a MySQL database grind to a halt if you expect it to do magic without any extra
work. “It’s
not web-scale, boss. Our 2 million records seem to be too much of a lift. We need DynamoDB, Kafka, and event
sourcing!”
A relational database is not some antiquated technology that only us tech fossils choose to be experts in, a thing that can be
waved off like an annoying insect. “Here we React and GraphQL all the things, old man”. In legal speak, a modern RDBMS is
innocent until proven guilty, and the burden of proof should be extremely
high - and almost entirely on you.
Finally, if I have to figure out “why it’s slow”, my approximate runbook is:
- Compile a list of unique queries, from logging, slow query log, etc.
- Look at the most frequent queries first
- Use EXPLAIN to check slow query plans for index usage
- Select only the data that needs to travel across the wire
- If an ORM is doing something silly without a workaround, pop the hood and get dirty with the raw SQL plumbing
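The first two steps can be mechanized. A rough sketch that collapses a query log into a frequency table of query "shapes" - the regex normalization is deliberately crude, and the sample log is invented:

```python
import re
from collections import Counter

def normalize(sql: str) -> str:
    """Collapse literals so structurally identical queries group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals -> ?
    return re.sub(r"\s+", " ", sql).strip().lower()

log = [
    "SELECT name FROM products WHERE id = 7",
    "SELECT name FROM products WHERE id = 42",
    "SELECT name FROM products WHERE id = 99",
    "SELECT * FROM orders WHERE status = 'open'",
]

# Most frequent query shapes first - those are the ones worth EXPLAINing.
freq = Counter(normalize(q) for q in log)
for shape, count in freq.most_common():
    print(count, shape)
```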
Most importantly, study your database (and SQL). Learn it, love it, use it, abuse it. Spending a couple of days just leafing
through that Postgres manual to see what it can do will probably make you a better engineer than spending more time on the next
flavor-of-the-month JavaScript framework hotness. Again.
Further reading
Use the index, Luke - SQL Indexing and Tuning e-Book

