Lecture 7

Welcome!
Flat-File Database
Relational Databases
SELECT
INSERT
DELETE
UPDATE
IMDb
JOINs
Indexes
Using SQL in Python
Race Conditions
SQL Injection Attacks
Summing Up

Welcome!

In previous weeks, we introduced you to Python, a high-level programming language that uses the same building blocks we learned in C. However, we introduced this new language not for the purpose of learning “just another language.” Instead, we did so because some tools are better for some jobs and not so great for others!
This week, we will continue with more syntax related to Python.
Further, we will be integrating this knowledge with data.
Finally, we will be discussing SQL, or Structured Query Language, a domain-specific language with which we can interact with and modify data.
Overall, one of the goals of this course is to learn to program generally, not simply how to program in the languages described in this course.

Flat-File Database

As you have likely seen before, data can often be described in patterns of columns and rows.
Spreadsheets like those created in Microsoft Excel and Google Sheets can be exported to a csv, or comma-separated values, file.
If you look at a csv file, you’ll notice that the file is flat in that all of our data is stored in a single table represented by a text file. We call this form of data a flat-file database.
All data is stored row by row. Each column is separated by a comma or another delimiter.
Python comes with native support for csv files.
First, download favorites.csv and upload it to your file explorer inside cs50.dev. Second, examining this data, notice that the first row is special in that it defines each column. Then, each record is stored row by row.
In your terminal window, type code favorites.py and write code as follows:
```
# Prints all favorites in CSV using csv.reader

import csv

# Open CSV file
with open("favorites.csv", "r") as file:

    # Create reader
    reader = csv.reader(file)

    # Skip header row
    next(reader)

    # Iterate over CSV file, printing each favorite
    for row in reader:
        print(row[1])
```
Notice that the csv library is imported. Further, we created a reader that holds the result of csv.reader(file). The reader yields each row of the file as a list of strings, so row[1] is the second column, and print(row[1]) prints the language from each row of favorites.csv.

You can improve your code as follows:

```
# Stores favorite in a variable

import csv

# Open CSV file
with open("favorites.csv", "r") as file:

    # Create reader
    reader = csv.reader(file)

    # Skip header row
    next(reader)

    # Iterate over CSV file, printing each favorite
    for row in reader:
        favorite = row[1]
        print(favorite)
```

Notice that favorite is stored and then printed. Also, notice that we use the next function to skip the header row of our reader.

One of the disadvantages of the above approach is that we are trusting that row[1] is always the favorite. However, what would happen if the columns had been moved around?
We can fix this potential issue. Python’s csv.DictReader provides each row as a dictionary, which allows you to index by column name rather than by position. Modify your code as follows:
```
# Prints all favorites in CSV using csv.DictReader

import csv

# Open CSV file
with open("favorites.csv", "r") as file:

    # Create DictReader
    reader = csv.DictReader(file)

    # Iterate over CSV file, printing each favorite
    for row in reader:
        favorite = row["language"]
        print(favorite)
```
Notice that this example looks up each value by the language key. For each row, favorite is assigned row["language"], the value stored under that key in the dictionary that the reader provides.

This could be further simplified to:

```
# Prints all favorites in CSV using csv.DictReader

import csv

# Open CSV file
with open("favorites.csv", "r") as file:

    # Create DictReader
    reader = csv.DictReader(file)

    # Iterate over CSV file, printing each favorite
    for row in reader:
        print(row["language"])
```

To count how many times each favorite language appears in the csv file, we can do the following:

```
# Counts favorites using variables

import csv

# Open CSV file
with open("favorites.csv", "r") as file:

    # Create DictReader
    reader = csv.DictReader(file)

    # Counts
    scratch, c, python = 0, 0, 0

    # Iterate over CSV file, counting favorites
    for row in reader:
        favorite = row["language"]
        if favorite == "Scratch":
            scratch += 1
        elif favorite == "C":
            c += 1
        elif favorite == "Python":
            python += 1

# Print counts
print(f"Scratch: {scratch}")
print(f"C: {c}")
print(f"Python: {python}")
```

Notice that each language is counted using if statements. Further, notice the double equal signs (==) in those if statements, which compare values rather than assign them.

Python allows us to use a dictionary to keep the count of each language. Consider the following improvement upon our code:
```
# Counts favorites using dictionary

import csv

# Open CSV file
with open("favorites.csv", "r") as file:

    # Create DictReader
    reader = csv.DictReader(file)

    # Counts
    counts = {}

    # Iterate over CSV file, counting favorites
    for row in reader:
        favorite = row["language"]
        if favorite in counts:
            counts[favorite] += 1
        else:
            counts[favorite] = 1

# Print counts
for favorite in counts:
    print(f"{favorite}: {counts[favorite]}")
```
Notice that the value in counts with the key favorite is incremented when it exists already. If it does not exist, we define counts[favorite] and set it to 1. Further, the formatted string has been generalized to present each key along with counts[favorite].

Python also allows sorting counts. Improve your code as follows:

```
# Sorts favorites by key

import csv

# Open CSV file
with open("favorites.csv", "r") as file:

    # Create DictReader
    reader = csv.DictReader(file)

    # Counts
    counts = {}

    # Iterate over CSV file, counting favorites
    for row in reader:
        favorite = row["language"]
        if favorite in counts:
            counts[favorite] += 1
        else:
            counts[favorite] = 1

# Print counts
for favorite in sorted(counts):
    print(f"{favorite}: {counts[favorite]}")
```

Notice the sorted(counts) at the bottom of the code, which orders the output alphabetically by key.

If you look at the sorted function in the Python documentation, you will find that it accepts optional parameters such as key and reverse. You can leverage these parameters as follows:
```
# Sorts favorites by value using .get

import csv

# Open CSV file
with open("favorites.csv", "r") as file:

    # Create DictReader
    reader = csv.DictReader(file)

    # Counts
    counts = {}

    # Iterate over CSV file, counting favorites
    for row in reader:
        favorite = row["language"]
        if favorite in counts:
            counts[favorite] += 1
        else:
            counts[favorite] = 1

# Print counts
for favorite in sorted(counts, key=counts.get, reverse=True):
    print(f"{favorite}: {counts[favorite]}")
```
Notice the arguments passed to sorted. The key argument allows you to tell Python which function to call on each item to obtain the value by which items are sorted. In this case, counts.get is used to sort by the values. reverse=True tells sorted to sort from largest to smallest.
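To see what the key argument changes, here is a minimal sketch on a small dictionary. The counts are made up for illustration and are not taken from favorites.csv.

```python
# Hypothetical counts, not taken from favorites.csv
counts = {"Scratch": 3, "C": 7, "Python": 12}

# By default, sorted iterates over the keys and orders them alphabetically
by_key = sorted(counts)

# With key=counts.get, each key is ranked by the value counts.get returns for it
by_value = sorted(counts, key=counts.get, reverse=True)

print(by_key)    # ['C', 'Python', 'Scratch']
print(by_value)  # ['Python', 'C', 'Scratch']
```

In both cases sorted returns a list of keys; only the ordering rule differs.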
Python has numerous libraries that we can utilize in our code. One of these libraries is collections, from which we can import Counter. Counter will allow you to access the counts of each language without the headaches of all the if statements seen in our previous code. You can implement as follows:
```
# Counts favorites using Counter

import csv

from collections import Counter

# Open CSV file
with open("favorites.csv", "r") as file:

    # Create DictReader
    reader = csv.DictReader(file)

    # Counts
    counts = Counter()

    # Iterate over CSV file, counting favorites
    for row in reader:
        favorite = row["language"]
        counts[favorite] += 1

# Print counts
for favorite, count in counts.most_common():
    print(f"{favorite}: {count}")
```
Notice how counts = Counter() creates an instance of the Counter class imported from collections. A Counter treats a missing key as having a count of 0, which is why counts[favorite] += 1 works without an if statement.
You can learn more about sorted in the Python Documentation.
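As a small, self-contained illustration of Counter, the sketch below counts a hard-coded list rather than reading favorites.csv. The list contents are invented for the example.

```python
from collections import Counter

# Hypothetical data standing in for the language column of the CSV
languages = ["Python", "C", "Python", "Scratch", "Python", "C"]

# Counter can tally any iterable directly
counts = Counter(languages)

# most_common(n) returns the n highest counts as (item, count) pairs
top = counts.most_common(1)

print(counts["Python"])  # 3
print(counts["SQL"])     # 0, since a missing key counts as zero
print(top)               # [('Python', 3)]
```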

Relational Databases

Google, X, and Meta all use relational databases to store their information at scale.
Relational databases store data in rows and columns in structures called tables.
SQL allows for four types of commands:
```
  Create
  Read
  Update
  Delete
```
These four operations are affectionately called CRUD.
We can create a table with the SQL syntax CREATE TABLE table (column type, ...);. But where do you run this command?
sqlite3 is a type of SQL database that has the core features required for this course.
We can create a SQL database at the terminal by typing sqlite3 favorites.db. Upon being prompted, we will agree that we want to create favorites.db by pressing y.
You will notice a different prompt as we are now using a program called sqlite3.
We can put sqlite3 into csv mode by typing .mode csv. Then, we can import our data from our csv file by typing .import favorites.csv favorites. It seems that nothing has happened!
We can type .schema to see the structure of the database.
You can read items from a table using the syntax SELECT columns FROM table.
For example, you can type SELECT * FROM favorites; which will print every row in favorites.
You can get a subset of the data using the command SELECT language FROM favorites;.
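The same CREATE TABLE and SELECT statements can be tried without the sqlite3 prompt. The following is a sketch that assumes Python's standard sqlite3 module and an in-memory database; the two-column table and its rows are hypothetical stand-ins for the imported CSV data.

```python
import sqlite3

# In-memory database, so nothing is written to disk
db = sqlite3.connect(":memory:")

# CREATE TABLE table (column type, ...);
db.execute("CREATE TABLE favorites (language TEXT, problem TEXT)")

# Hypothetical rows standing in for the imported CSV data
db.executemany(
    "INSERT INTO favorites (language, problem) VALUES (?, ?)",
    [("Python", "Speller"), ("C", "Mario"), ("Python", "DNA")],
)

# SELECT * returns every column of every row
all_rows = db.execute("SELECT * FROM favorites").fetchall()

# Naming a column returns only that column
languages = db.execute("SELECT language FROM favorites").fetchall()

print(all_rows)
print(languages)
```

Each row comes back as a tuple, even when only one column is selected.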

SQL supports many functions and keywords for accessing data, including:

  AVG
  COUNT
  DISTINCT
  LOWER
  MAX
  MIN
  UPPER

For example, you can type SELECT COUNT(*) FROM favorites;. Further, you can type SELECT DISTINCT language FROM favorites; to get a list of the individual languages within the database. You could even type SELECT COUNT(DISTINCT language) FROM favorites; to get a count of those.
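A quick way to experiment with these functions is a small in-memory table. This sketch assumes Python's standard sqlite3 module, and the rows are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (language TEXT, problem TEXT)")
db.executemany(
    "INSERT INTO favorites VALUES (?, ?)",
    [
        ("Python", "Speller"),
        ("C", "Mario"),
        ("Python", "DNA"),
        ("Scratch", "Starting from Scratch"),
    ],
)

# COUNT(*) counts rows
total = db.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]

# COUNT(DISTINCT ...) counts each different value once
distinct = db.execute("SELECT COUNT(DISTINCT language) FROM favorites").fetchone()[0]

# UPPER transforms each value; sorting is done in Python to keep the query short
upper = sorted(row[0] for row in db.execute("SELECT DISTINCT UPPER(language) FROM favorites"))

print(total)     # 4
print(distinct)  # 3
print(upper)     # ['C', 'PYTHON', 'SCRATCH']
```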

SQL offers additional commands we can utilize in our queries:

  WHERE       -- adding a Boolean expression to filter our data
  LIKE        -- filtering responses more loosely
  ORDER BY    -- ordering responses
  LIMIT       -- limiting the number of responses
  GROUP BY    -- grouping responses together

Notice that we use -- to write a comment in SQL.

SELECT

For example, we can execute SELECT COUNT(*) FROM favorites WHERE language = 'C';. A count is presented.
Further, we could type SELECT COUNT(*) FROM favorites WHERE language = 'C' AND problem = 'Hello, World';. Notice how the AND is utilized to narrow our results.
Similarly, we could execute SELECT language, COUNT(*) FROM favorites GROUP BY language;. This would offer a temporary table that would show the language and count.
We could improve this by typing SELECT language, COUNT(*) FROM favorites GROUP BY language ORDER BY COUNT(*);. This will order the resulting table by the count.
Likewise, we could execute SELECT COUNT(*) FROM favorites WHERE language = 'C' AND (problem = 'Hello, World' OR problem = 'Hello, It''s Me');. Notice that the apostrophe in It's is written as two single quotes ('') so that SQL does not mistake it for the end of the string.
Further, we could execute SELECT COUNT(*) FROM favorites WHERE language = 'C' AND problem LIKE 'Hello, %'; to find any problems that start with Hello, (including a space).
We can also group the values of each language by executing SELECT language, COUNT(*) FROM favorites GROUP BY language;.
We can order the output as follows: SELECT language, COUNT(*) FROM favorites GROUP BY language ORDER BY COUNT(*) DESC;.
We can even create aliases, like variables in our queries: SELECT language, COUNT(*) AS n FROM favorites GROUP BY language ORDER BY n DESC;.
Finally, we can limit our output to 1 or more values: SELECT language, COUNT(*) AS n FROM favorites GROUP BY language ORDER BY n DESC LIMIT 1;.
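The queries above can be checked against a tiny table whose answers are easy to verify by eye. This is a sketch using Python's standard sqlite3 module; the rows are hypothetical, not the real favorites data.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (language TEXT, problem TEXT)")
db.executemany(
    "INSERT INTO favorites VALUES (?, ?)",
    [
        ("C", "Hello, World"),
        ("C", "Hello, It's Me"),
        ("C", "Mario"),
        ("Python", "Hello, World"),
        ("Python", "DNA"),
        ("Python", "Speller"),
        ("Python", "Mario"),
        ("Scratch", "Starting from Scratch"),
    ],
)

# WHERE with LIKE: C problems whose name starts with "Hello, "
hello_c = db.execute(
    "SELECT COUNT(*) FROM favorites WHERE language = 'C' AND problem LIKE 'Hello, %'"
).fetchone()[0]

# A doubled '' inside a SQL string literal stands for one single quote
its_me = db.execute(
    "SELECT COUNT(*) FROM favorites WHERE problem = 'Hello, It''s Me'"
).fetchone()[0]

# GROUP BY with an alias, ordered from most to least popular
ranking = db.execute(
    "SELECT language, COUNT(*) AS n FROM favorites GROUP BY language ORDER BY n DESC"
).fetchall()

# LIMIT 1 keeps only the top row
top = db.execute(
    "SELECT language, COUNT(*) AS n FROM favorites GROUP BY language ORDER BY n DESC LIMIT 1"
).fetchall()

print(hello_c)  # 2
print(its_me)   # 1
print(ranking)  # [('Python', 4), ('C', 3), ('Scratch', 1)]
print(top)      # [('Python', 4)]
```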

INSERT

We can also INSERT into a SQL database utilizing the form INSERT INTO table (column...) VALUES(value, ...);.
We can execute INSERT INTO favorites (language, problem) VALUES ('SQL', 'Fiftyville');.
You can verify the addition of this favorite by executing SELECT * FROM favorites;.

DELETE

DELETE allows you to delete parts of your data. For example, you could DELETE FROM favorites WHERE Timestamp IS NULL;. This deletes any record where the Timestamp is NULL.

UPDATE

We can also utilize the UPDATE command to update your data.
For example, you can execute UPDATE favorites SET language = 'SQL', problem = 'Fiftyville';. Because this statement has no WHERE clause, it overwrites every row in the table, including those where C and Scratch were the favorite programming language.
Notice that these queries have immense power. Accordingly, in the real-world setting, you should consider who has permissions to execute certain commands and if you have backups available!
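To make the effect of a WHERE clause concrete, the sketch below runs UPDATE with and without one, followed by a DELETE. It assumes Python's standard sqlite3 module and a throwaway in-memory table with made-up rows, so nothing of value can be lost.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE favorites (language TEXT, problem TEXT)")
db.executemany(
    "INSERT INTO favorites VALUES (?, ?)",
    [("C", "Mario"), ("Scratch", "Starting from Scratch"), ("Python", "DNA")],
)

# With WHERE: only the matching row changes
db.execute("UPDATE favorites SET language = 'SQL', problem = 'Fiftyville' WHERE language = 'C'")
after_where = sorted(row[0] for row in db.execute("SELECT language FROM favorites"))

# Without WHERE: every row changes
db.execute("UPDATE favorites SET language = 'SQL', problem = 'Fiftyville'")
after_all = db.execute("SELECT DISTINCT language FROM favorites").fetchall()

# DELETE with WHERE removes the matching rows, which by now is all of them
db.execute("DELETE FROM favorites WHERE language = 'SQL'")
remaining = db.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]

print(after_where)  # ['Python', 'SQL', 'Scratch']
print(after_all)    # [('SQL',)]
print(remaining)    # 0
```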

IMDb

We can imagine a database that we might want to create to catalog various TV shows. We could create a spreadsheet with columns like title, star, star, star, star, and more stars. A problem with this approach is that it has a lot of wasted space. Some shows may have one star. Others may have dozens.
We could separate our database into multiple sheets. We could have a shows sheet, a stars sheet, and a people sheet. On the people sheet, each person could have a unique id. On the shows sheet, each show could have a unique id too. On a third sheet called stars, we could relate shows to people by pairing a show_id with a person_id in each row. While this is an improvement, this is not an ideal database.
IMDb offers a database of people, shows, writers, stars, genres, and ratings. These tables are related to one another.
After downloading shows.db, you can execute sqlite3 shows.db in your terminal window.
Let’s zero in on the relationship between two tables within the database called shows and ratings.
To illustrate the relationship between these tables, we could execute the following command: SELECT * FROM ratings LIMIT 10;. Examining the output, we could execute SELECT * FROM shows LIMIT 10;.
Examining shows and ratings, we can see these have a one-to-one relationship: one show has one rating.
To understand the database, upon executing .schema you will find not only each of the tables but also the individual fields inside each of these tables.
More specifically, you could execute .schema shows to understand the fields inside shows. You can also execute .schema ratings to see the fields inside ratings.
As you can see, show_id exists in all of the tables. In the shows table, it is simply called id. This common field shared between the tables is called a key. Primary keys are used to identify a unique record in a table. Foreign keys are used to build relationships between tables by pointing to the primary key in another table. You can see in the schema of ratings that show_id is a foreign key that references id in shows.
By storing data in a relational database, as above, data can be more efficiently stored.

In sqlite, we have five data types, including:

  BLOB       -- binary large objects that are groups of ones and zeros
  INTEGER    -- an integer
  NUMERIC    -- for numbers that are formatted specially like dates
  REAL       -- like a float
  TEXT       -- for strings and the like

Additionally, columns can be set to add special constraints:
```
  NOT NULL
  UNIQUE
```
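The pieces above (data types, constraints, and primary and foreign keys) come together in a table definition. The sketch below uses Python's standard sqlite3 module with simplified, hypothetical versions of shows and ratings; the real shows.db schema has more columns. It also assumes foreign key enforcement is switched on, which SQLite does only when asked.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# SQLite enforces foreign keys only when this pragma is on
db.execute("PRAGMA foreign_keys = ON")

# Simplified, hypothetical versions of the shows and ratings tables
db.execute("CREATE TABLE shows (id INTEGER, title TEXT NOT NULL, PRIMARY KEY(id))")
db.execute(
    "CREATE TABLE ratings ("
    "show_id INTEGER NOT NULL UNIQUE, "
    "rating REAL NOT NULL, "
    "FOREIGN KEY(show_id) REFERENCES shows(id))"
)

db.execute("INSERT INTO shows (id, title) VALUES (1, 'Example Show')")
db.execute("INSERT INTO ratings (show_id, rating) VALUES (1, 8.5)")

# A rating that points at a nonexistent show violates the foreign key
try:
    db.execute("INSERT INTO ratings (show_id, rating) VALUES (999, 5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```

The UNIQUE constraint on show_id is what makes this relationship one-to-one: a second rating for show 1 would be rejected as well.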
We can further play with this data to understand these relationships. Execute SELECT * FROM ratings;. There are a lot of ratings!
We can further limit this data down by executing SELECT show_id FROM ratings WHERE rating >= 6.0 LIMIT 10;. From this query, you can see that there are 10 shows presented. However, we don’t know what show each show_id represents.
You can discover what shows these are by executing SELECT * FROM shows WHERE id = 626124;
We can make our query more efficient by nesting one query inside another:
```
SELECT title
FROM shows
WHERE id IN (
    SELECT show_id
    FROM ratings
    WHERE rating >= 6.0
    LIMIT 10
);
```
Notice that this query nests together two queries. An inner query is used by an outer query.

JOINs

We are pulling data from shows and ratings. Notice how both tables have a show’s id in common: it is called id in shows and show_id in ratings.
How could we combine tables temporarily? Tables could be joined together using the JOIN command.

Execute the following command:

```
SELECT * FROM shows
  JOIN ratings ON shows.id = ratings.show_id
  WHERE rating >= 6.0
  LIMIT 10;
```

Notice this results in a wider table than we have previously seen.
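The difference between a nested query and a JOIN is easiest to see side by side. This sketch assumes Python's standard sqlite3 module and two tiny hypothetical tables; the titles and ratings are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE ratings (show_id INTEGER, rating REAL)")
db.executemany(
    "INSERT INTO shows VALUES (?, ?)",
    [(1, "Show A"), (2, "Show B"), (3, "Show C")],
)
db.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(1, 8.1), (2, 4.2), (3, 6.0)],
)

# Nested query: the inner SELECT produces the ids the outer SELECT filters on
nested = db.execute(
    "SELECT title FROM shows WHERE id IN "
    "(SELECT show_id FROM ratings WHERE rating >= 6.0) ORDER BY title"
).fetchall()

# JOIN: rows from both tables are combined, so rating is available too
joined = db.execute(
    "SELECT title, rating FROM shows "
    "JOIN ratings ON shows.id = ratings.show_id "
    "WHERE rating >= 6.0 ORDER BY title"
).fetchall()

print(nested)  # [('Show A',), ('Show C',)]
print(joined)  # [('Show A', 8.1), ('Show C', 6.0)]
```

Both queries find the same shows, but only the JOIN can return columns from ratings alongside the title.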

Where the previous queries have illustrated the one-to-one relationship between these keys, let’s examine some one-to-many relationships. Focusing on the genres table, execute the following:
```
SELECT * FROM genres
LIMIT 10;
```
Notice how this provides us a sense of the raw data. You might notice that one show has three values. This is a one-to-many relationship.
We can learn more about the genres table by typing .schema genres.
Execute the following command to learn more about the various comedies in the database:
```
SELECT title FROM shows
WHERE id IN (
  SELECT show_id FROM genres
  WHERE genre = 'Comedy'
  LIMIT 10
);
```
Notice how this produces a list of comedies, including Catweazle.
To learn more about Catweazle, we can combine shows and genres with a join:
```
SELECT * FROM shows
JOIN genres
ON shows.id = genres.show_id
WHERE id = 63881;
```
Notice that this results in a temporary table. It is fine for the show’s details to be duplicated in this table, once for each of its genres.
In contrast to one-to-one and one-to-many relationships, there may be many-to-many relationships.
We can learn more about the show The Office and the actors in that show by executing the following command:
```
SELECT name FROM people WHERE id IN 
    (SELECT person_id FROM stars WHERE show_id = 
        (SELECT id FROM shows WHERE title = 'The Office' AND year = 2005));
```
Notice that this results in a table that includes the names of various stars through nested queries.

We can find all the shows in which Steve Carell starred:

```
SELECT title FROM shows WHERE id IN 
    (SELECT show_id FROM stars WHERE person_id = 
        (SELECT id FROM people WHERE name = 'Steve Carell'));
```

This results in a list of titles of shows wherein Steve Carell starred.

This could also be expressed in this way:

```
SELECT title FROM shows, stars, people 
WHERE shows.id = stars.show_id
AND people.id = stars.person_id
AND name = 'Steve Carell';
```

The wildcard operator % can be used with LIKE to match partial values. To find all people whose names start with Steve C, one could employ the syntax SELECT * FROM people WHERE name LIKE 'Steve C%';.
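A many-to-many relationship can be reproduced in miniature with three tables. The sketch below assumes Python's standard sqlite3 module; the people and show names are placeholders, and the tables are simplified compared with shows.db.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE stars (show_id INTEGER, person_id INTEGER)")
db.executemany("INSERT INTO shows VALUES (?, ?)", [(1, "Show A"), (2, "Show B")])
db.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(10, "Alice Example"), (11, "Bob Example")],
)
# Alice stars in both shows; Bob stars only in Show A
db.executemany("INSERT INTO stars VALUES (?, ?)", [(1, 10), (2, 10), (1, 11)])

# Nested form: person -> show ids -> titles
nested = db.execute(
    "SELECT title FROM shows WHERE id IN "
    "(SELECT show_id FROM stars WHERE person_id = "
    "(SELECT id FROM people WHERE name = ?)) ORDER BY title",
    ("Alice Example",),
).fetchall()

# Implicit join form: list all three tables and match their keys in WHERE
implicit_join = db.execute(
    "SELECT title FROM shows, stars, people "
    "WHERE shows.id = stars.show_id AND people.id = stars.person_id "
    "AND name = ? ORDER BY title",
    ("Alice Example",),
).fetchall()

# LIKE with % matches any name that starts with the given prefix
like = db.execute("SELECT name FROM people WHERE name LIKE 'Alice E%'").fetchall()

print(nested)         # [('Show A',), ('Show B',)]
print(implicit_join)  # [('Show A',), ('Show B',)]
print(like)           # [('Alice Example',)]
```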

Indexes

While relational databases have the ability to be faster and more robust than a CSV file, queries against a table can be optimized further using indexes.
Indexes can be utilized to speed up our queries.
We can track the speed of our queries by executing .timer on in sqlite3.
To understand how indexes can speed up our queries, run the following: SELECT * FROM shows WHERE title = 'The Office'; Notice the time that displays after the query executes.
Then, we can create an index with the syntax CREATE INDEX title_index ON shows (title);. This tells sqlite3 to create an index and perform some special under-the-hood optimization relating to this column title.
This will create a data structure called a B-tree, which looks similar to a binary tree. However, unlike a binary tree, each node can have more than two children.

Further, we can create indexes as follows:

```
CREATE INDEX name_index ON people (name);
CREATE INDEX person_index ON stars (person_id);
```

Run the following query, and you will notice that it runs much more quickly!

```
SELECT title FROM shows WHERE id IN 
    (SELECT show_id FROM stars WHERE person_id = 
        (SELECT id FROM people WHERE name = 'Steve Carell'));
```

Unfortunately, indexing all columns would result in utilizing more storage space. Therefore, there is a tradeoff for enhanced speed.
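Timings on a small table are too noisy to show much, so one alternative is to ask SQLite how it plans to run a query. The sketch below assumes Python's standard sqlite3 module and SQLite's EXPLAIN QUERY PLAN statement; the table is a hypothetical stand-in for shows, and the exact wording of the plan varies between SQLite versions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT)")
db.executemany(
    "INSERT INTO shows (title) VALUES (?)",
    [(f"Show {i}",) for i in range(1000)],
)

query = "SELECT * FROM shows WHERE title = 'Show 500'"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description
    return " ".join(row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + sql))


# Without an index, SQLite has to scan the whole table
before = plan(query)

# With an index on title, SQLite can search the index instead
db.execute("CREATE INDEX title_index ON shows (title)")
after = plan(query)

print(before)
print(after)
```

The second plan should mention title_index, which indicates that the query no longer reads every row.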

Using SQL in Python

To assist in working with SQL in this course, the CS50 Library can be utilized as follows in your code:
```
from cs50 import SQL
```
Similar to previous uses of the CS50 Library, this library will assist with the complicated steps of utilizing SQL within your Python code.
You can read more about the CS50 Library’s SQL functionality in the documentation.
Using our new knowledge of SQL, we can now leverage Python alongside it.
Modify your code for favorites.py as follows:
```
# Searches database for popularity of a language

from cs50 import SQL

# Open database
db = SQL("sqlite:///favorites.db")

# Prompt user for favorite
favorite = input("Favorite: ")

# Count rows matching the favorite
rows = db.execute("SELECT COUNT(*) AS n FROM favorites WHERE language = ?", favorite)

# Get first (and only) row
row = rows[0]

# Print popularity
print(row["n"])
```
Notice that db = SQL("sqlite:///favorites.db") provides Python the location of the database file. Then, the line that begins with rows executes SQL commands utilizing db.execute. Indeed, this command passes the syntax within the quotation marks to the db.execute function. We can issue any SQL command using this syntax. Further, notice that rows is returned as a list of dictionaries. In this case, there is only one result, one row, returned to the rows list as a dictionary.
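For readers without the CS50 Library installed, a similar list of dictionaries can be approximated with Python's standard sqlite3 module. This is a sketch and not how the CS50 Library is implemented; the in-memory table and its rows are hypothetical, and favorite is hard-coded where the program above would call input.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# sqlite3.Row lets each row be indexed by column name
db.row_factory = sqlite3.Row

db.execute("CREATE TABLE favorites (language TEXT, problem TEXT)")
db.executemany(
    "INSERT INTO favorites VALUES (?, ?)",
    [("Python", "DNA"), ("Python", "Speller"), ("C", "Mario")],
)

favorite = "Python"  # stands in for input("Favorite: ")

# The ? placeholder is filled in from the tuple passed as the second argument
cursor = db.execute(
    "SELECT COUNT(*) AS n FROM favorites WHERE language = ?", (favorite,)
)

# Convert each row to a plain dictionary, giving a list of dictionaries
rows = [dict(row) for row in cursor]

print(rows)          # [{'n': 2}]
print(rows[0]["n"])  # 2
```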

Race Conditions

Utilization of SQL can sometimes result in some problems.
You can imagine a case where multiple users could be accessing the same database and executing commands at the same time.
This could result in glitches where code is interrupted by other people’s actions. This could result in a loss of data.
Built-in SQL features such as BEGIN TRANSACTION, COMMIT, and ROLLBACK help avoid some of these race condition problems.
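As a rough sketch of how these statements fit together, the example below wraps a read-then-write sequence in a transaction and then abandons a second one. It assumes Python's standard sqlite3 module; the posts table and its likes column are hypothetical. With a single connection it cannot demonstrate a real race, only the commit and rollback behavior.

```python
import sqlite3

# isolation_level=None leaves transaction control to our explicit statements
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, likes INTEGER NOT NULL)")
db.execute("INSERT INTO posts (id, likes) VALUES (1, 0)")

# Read the current value and write the new one inside a single transaction
db.execute("BEGIN TRANSACTION")
likes = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
db.execute("UPDATE posts SET likes = ? WHERE id = 1", (likes + 1,))
db.execute("COMMIT")

# A transaction that is rolled back leaves the data untouched
db.execute("BEGIN TRANSACTION")
db.execute("UPDATE posts SET likes = 1000 WHERE id = 1")
db.execute("ROLLBACK")

final = db.execute("SELECT likes FROM posts WHERE id = 1").fetchone()[0]
print(final)  # 1
```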

SQL Injection Attacks

Now, still considering the code above, you might be wondering what the ? question marks do above. One of the problems that can arise in real-world applications of SQL is what is called an injection attack. An injection attack is where a malicious actor could input malicious SQL code.
For example, consider a login screen that asks for a username and a password.
Without the proper protections in our own code, a bad actor could run malicious code. Consider the following:
```
rows = db.execute("SELECT COUNT(*) FROM users WHERE username = ? AND password = ?", username, password)
```
Notice that because the ? placeholders are in place, username and password are passed to db.execute separately from the query, so they can be validated and escaped before they are used rather than being blindly accepted by the query.
You never want to build queries out of formatted strings or blindly trust the user’s input.
When you use placeholders with the CS50 Library, the library will sanitize the input, escaping any potentially malicious characters.
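The danger of formatted strings can be shown on a throwaway table. The sketch below assumes Python's standard sqlite3 module rather than the CS50 Library; the users table, the account, and the attacker's input are all invented, and the password is stored in plain text only to keep the example short.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('alice', 'correct-password')")

# Attacker-supplied input: the quote ends the string and -- starts a SQL comment
username = "alice' --"
password = "wrong"

# Unsafe: the input becomes part of the SQL text, so the password check is commented out
unsafe_sql = (
    f"SELECT COUNT(*) FROM users WHERE username = '{username}' AND password = '{password}'"
)
unsafe = db.execute(unsafe_sql).fetchone()[0]

# Safe: with placeholders, the input is treated purely as data
safe = db.execute(
    "SELECT COUNT(*) FROM users WHERE username = ? AND password = ?",
    (username, password),
).fetchone()[0]

print(unsafe)  # 1, the attacker matches alice without knowing the password
print(safe)    # 0, no user is literally named "alice' --"
```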

Summing Up

In this lesson, you learned more syntax related to Python. Further, you learned how to integrate this knowledge with data in the form of flat-file and relational databases. Finally, you learned about SQL. Specifically, we discussed…

Flat-file databases
Relational databases
SQL commands such as SELECT, CREATE, INSERT, DELETE, and UPDATE.
Primary and foreign keys
JOINs
Indexes
Using SQL in Python
Race conditions
SQL injection attacks

See you next time!