2020-03-02 11:43

Page 1
- Notes: chapter 6
- Book: chapters 2, 5, 16

- Sets, S = {t
_{1}, t_{2}, …, t_{n}} - Cartesian products, S
_{1}× S_{2}= {⟨x,y⟩ | x∈S_{1}, y∈S_{2}} - Relations are sets of tuples, R ⊆ S
_{1}× … × S_{n} - Operations on sets and relations: ∩, ∪, −, ×
- Operations from relational algebra: π, σ, ⨝

- SQL Query processing
- Relational Algebra
- Translation from SQL to Relational Algebra
- Query optimization
- Query execution and optimization in a real-world example

- Lexing
- Parsing
- Type checking
- Logical query plan generation
**Translate SQL query to a relational algebra expression**

- Optimization
**Simplifying relational algebra expressions**

- Physical query plan generation
- Query execution

`R`::=`relname`σ _{c}`R`selecting rows WHERE c, HAVING c π _{p}`R`projection SELECT p ρ _{r}`R`renaming AS γ _{g}`R`grouping GROUP BY τ _{a+}`R`sorting ORDER BY a+ δ `R`removing duplicates DISTINCT …

`R`::=… `R`×`R`cartesian product FROM, CROSS JOIN `R`∪`R`union UNION `R`∩`R`intersection INTERSECT `R`−`R`difference EXCEPT …

`R`::=… `R`⨝`R`NATURAL JOIN `R`⨝_{a+}`R`INNER JOIN USING (a+) `R`⨝_{c}`R`theta join JOIN ON `R`⨝^{o}_{a+}`R`FULL OUTER JOIN `R`⨝^{oL}_{a+}`R`LEFT OUTER JOIN `R`⨝^{oR}_{a+}`R`RIGHT OUTER JOIN

- π
_{A.name}(σ_{A.name=B.capital}(ρ_{A}Countries × ρ_{B}Countries)) - (Created with the SQL to Relational Algebra translator that is part of Query Converter online.)

- Write them over several lines and use indentation to show the structure.
- π
_{A.name}

(σ_{A.name=B.capital}

(ρ_{A}Countries × ρ_{B}Countries))

- π
- Split them in to several named parts.
- R1 = ρ
_{A}Countries × ρ_{B}Countries - R2 = σ
_{A.name=B.capital}R1 - Result = π
_{A.name}R2

- R1 = ρ

- …
- Logical query plan generation
**Translate SQL query to a relational algebra expression**

- …

SQL Relational Algebra **SELECT**`projections`**FROM**Table_{1},…,Table_{n}**WHERE**`condition`⟹ π _{projections}

σ_{condition}

(Table_{1}× … × Table_{n})

SQL Relational Algebra **SELECT*****FROM**Countries⟹ Countries **SELECT*****FROM**Countries**WHERE**name='UK'⟹ σ _{name='UK'}Countries- Table names can be used directly (
`SELECT *`

disappears)

SQL Relational Algebra **SELECT**capital,area/1000**FROM**Countries**WHERE**name='UK'⟹ π _{capital,area/1000}

σ_{name='UK'}Countries- The expressions and conditions are just copied

SQL Relational Algebra **SELECT**name**AS**country,

population/area**AS**density**FROM**Countries**WHERE**continent='EU'⟹ π _{name→country, population/area→density}

σ_{continent='EU'}

Countries- The

keyword turns into an arrow (→)**AS**

SQL Relational Algebra **SELECT**A.name**FROM**Countries**AS**A,

Countries**AS**B**WHERE**A.name = B.capital⟹ π _{A.name}

σ_{A.name=B.capital}

(ρ_{A}Countries × ρ_{B}Countries)

SQL Relational Algebra **SELECT**name, capital**FROM**Countries**ORDER BY**name⟹ π _{name, capital}τ_{name}Countries⟹ τ _{name}π_{name, capital}Countries**SELECT**name**FROM**Countries**ORDER BY**population⟹ π _{name}τ_{population}Countries- One can not sort by an attribute that has already been discarded by a projection.

SQL Relational Algebra **SELECT**currency**FROM**Countries⟹ π _{currency}Countries**SELECT DISTINCT**currency**FROM**Countries⟹ δ (π _{currency}Countries)

SQL Relational Algebra **SELECT**currency, COUNT(name)**FROM**Countries**GROUP BY**currency⟹ γ _{currency,COUNT(name)}Countries

SQL Relational Algebra **SELECT**currency,SUM(population)**FROM**Countries**GROUP BY**currency**HAVING**COUNT(name)>1⟹ π _{currency,SUM(population)}

σ_{COUNT(name)>1}

γ_{currency,SUM(population),COUNT(name)}

Countries- Aggregation functions from different parts of the SQL query appear together in the same γ operator in relational algebra.
- Both
`WHERE`

`cond`and`HAVING`

`cond`turn into σ_{cond}.

- …
- Logical query plan generation
**Translate SQL query to a relational algebra expression**

- Optimization
**Simplifying relational algebra expressions**

- …

- This is something that should be familiar from math. Example:
- a(b+c) = ab+ac
- (a+b)² = a² + 2ab + b²
- x+b+x-b = 2x

- The same idea can be applied also in relational algebra!

- There are often different ways of writing queries to solve a particular task.
- Since DBMSs contain a query optimizer that tries to transform each query into its most efficient form…
- …we can focus on formulating or SQL queries in a way that makes them easy for humans understand, and rely on the DBMS to make them run efficiently.

- σ
_{c₁}(σ_{c₂}R) = σ_{c₁ AND c₂}R

- π
_{p₁}(π_{p₂}R) = π_{p₁}R- Example:
- π
_{name}(π_{name,population}Countries) = π_{name}Countries

- π

- Example:
- Seems obvious.

- π
_{p₁}(π_{p₂}R) = π_{p₁}R - It only works if p₂ is a plain projection.
- Counter example:
- π
_{name,density}(π_{name,capital,population/area→density}Countries) ≠ π_{name,density}Countries

- π

- σ
_{c}(R₁ × R₂) = (σ_{c}R₁) × R₂- (if c only uses attributes from R₁)

- σ
_{c1 AND c2}(R₁ × R₂) = (σ_{c1}R₁) × (σ_{c2}R₂)- (if c1 & c2 only uses attributes from R₁ & R₂, respectively.)

- This can dramatically reduce the number of rows in the cartesian product!

- σ
_{c}(R₁ × R₂) = R₁ ⋈_{c}R₂ - If the join is using the primary keys,
- we can quickly find matching rows in R₁ and R₂.
- the resulting table will not have more rows than R₁ or R₂.

- A quick look at query execution and query optimization in action.
- Using data from the
Internet Movie Database (IMDb).
IMDB downloaded tables and sizes Name_Basic 9.8 million rows Title_Basics 6.4 million rows Title_Principals 37.2 million rows Title_Ratings 1.0 million rows - A subset of the data be downloaded from:
- SQL schemas for the IMDb data: movies.sql
- Example queries: examples.sql

- Query: Find movies and TV series in which Frida Hallgren has appeared.
- SQL query:
**SELECT**T**.**startYear**,**T**.**endYear**,**T**.**titleType**,**T**.**originalTitle**FROM**Title_Principals**AS**P**,**Name_basics**AS**N**,**Title_Basics**AS**T**WHERE**T**.**titleType**IN****(**'movie'**,**'tvSeries'**)****AND**N**.**primaryName**=**'Frida Hallgren'**AND**N**.**nconst**=**P**.**nconst**AND**T**.**tconst**=**P**.**tconst**;**- Note: the cartesian product in the FROM line is a table with 37e6*9.8e6*6.4e6 = 2.3e21 rows!

- π
_{T.startYear, T.endYear, T.titleType, T.originalTitle}σ_{T.titleType IN ('movie', 'tvSeries') AND N.primaryName='Frida Hallgren' AND N.nconst=P.nconst AND T.tconst=P.tconst}(ρ P Title_Principals × ρ T Title_Basics × ρ N Name_Basics)

- π
_{T.startYear, T.endYear, T.titleType, T.originalTitle}σ_{T.titleType IN ('movie', 'tvSeries') AND N.primaryName='Frida Hallgren' AND N.nconst=P.nconst AND T.tconst=P.tconst}(P × N × T) - Leaving out the long table names. Pretend they are still there!

- π
_{T.startYear, T.endYear, T.titleType, T.originalTitle}σ_{N.nconst=P.nconst AND T.tconst=P.tconst}(P × σ_{primaryName='Frida Hallgren'}N × σ_{T.titleType IN ('movie', 'tvSeries')}T) - Pushing selections inside the cartesian product
- Table size now: 37e6 * 1 * 716610 = 2.66e13
- 88e6 times reduction!

- π
_{T.startYear, T.endYear, T.titleType, T.originalTitle}σ_{T.tconst=P.tconst}((P ⋈ σ_{primaryName='Frida Hallgren'}N) × σ_{t.titletype IN ('movie', 'tvSeries')}t) - The σ
_{primaryName}selects 1 of 9.8e6 names. - The condition N.nconst=P.nconst can be converted into a natural join.
- The join selects 60 of 37e6 rows from P.
- Without indexes, the above two steps would need to examine 9.8e6 + 37e6 = 47e6 table rows
- The × creates a table with 60 * 716610 = 43e6 rows

- π
_{T.startYear, T.endYear, T.titleType, T.originalTitle}((P ⋈ σ_{primaryname='Frida Hallgren'}N) ⋈ σ_{T.titleType IN ('movie', 'tvSeries')}T) - The second ⋈ is a join of subsets of
`P`and`T`, using`tconst`. `tconst`is (part of) the primary key in both`p`and`t`.- By using the indexes, the DBMS can create a physical query plan to do this efficiently.

- π
_{T.startYear, T.endYear, T.titleType, T.originalTitle}((P ⋈ σ_{primaryName='Frida Hallgren'}N) ⋈ σ_{T.titleType IN ('movie', 'tvSeries')}T) **WITH**Names**AS****(****SELECT*********FROM**Name_Basics**WHERE**primaryName**=**'Frida Hallgren'**)****,**Titles**AS****(****SELECT*********FROM**Title_Basics**WHERE**titleType**IN****(**'movie'**,**'tvSeries'**)****)****SELECT**startYear**,**endYear**,**titleType**,**originalTitle**FROM**Title_Principals**NATURAL****JOIN**Names**NATURAL****JOIN**Titles**;**

**SELECT**startYear**,**endYear**,**titleType**,**originalTitle**FROM**Title_Principals**NATURAL****JOIN**Name_Basics**NATURAL****JOIN**Title_Basics**WHERE**titletype**IN****(**'movie'**,**'tvSeries'**)****AND**primaryName**=**'Frida Hallgren'**;**

- See also Jonas Duregård's slides.
- What's next:
- Wednesday: Exercise: JSON
- Thursday: Lecture CANCELLED
- Monday: Lecture: Recap and exam preparation