2020-03-02 11:43
Page 1

# Databases

## Lecture 11: Relational algebra and query compilation

• Notes: chapter 6
• Book: chapters 2, 5, 16
Page 2

# Earlier

## Relational data modelling (Notes: chapter 4)

### Relational databases are based on ideas from discrete math

• Sets, S = {t1, t2, …, tn}
• Cartesian products, S1 × S2 = {⟨x,y⟩ | x∈S1, y∈S2}
• Relations are sets of tuples, R ⊆ S1 × … × Sn
• Operations on sets and relations: ∩, ∪, −, ×
• Operations from relational algebra: π, σ, ⨝
Page 3

# Today

• SQL Query processing
• Relational Algebra
• Translation from SQL to Relational Algebra
• Query optimization
• Query execution and optimization in a real-world example
Page 4

# SQL Query processing

## A DBMS processes a query in several steps

• Lexing
• Parsing
• Type checking
• Logical query plan generation
• Translate SQL query to a relational algebra expression
• Optimization
• Simplifying relational algebra expressions
• Physical query plan generation
• Query execution
Page 5

# Relational Algebra overview

## Operators that transform a relation

• R ::=  relname σc R selecting rows WHERE c, HAVING c πp R projection SELECT p ρr R renaming AS γg R grouping GROUP BY τa+ R sorting ORDER BY a+ δ R removing duplicates DISTINCT …
Page 6

## Operations from set theory that combine two relations

• R ::=  … R × R cartesian product FROM, CROSS JOIN R ∪ R union UNION R ∩ R intersection INTERSECT R − R difference EXCEPT …
Page 7

## Join operations

• R ::=  … R ⨝ R NATURAL JOIN R ⨝a+ R INNER JOIN USING (a+) R ⨝c R theta join JOIN ON R ⨝oa+ R FULL OUTER JOIN R ⨝oLa+ R LEFT OUTER JOIN R ⨝oRa+ R RIGHT OUTER JOIN
Page 8

Page 9

# Presentation

## To make large relational algebra expressions more readable

• Write them over several lines and use indentation to show the structure.
• πA.name
(σA.name=B.capital
(ρA Countries × ρB Countries))
• Split them in to several named parts.
• R1 = ρA Countries × ρB Countries
• R2 = σA.name=B.capital R1
• Result = πA.name R2
Page 10

# From SQL to Relational Algebra

• Logical query plan generation
• Translate SQL query to a relational algebra expression
Page 11

## Basic queries

• SQLRelational Algebra
```SELECT projections FROM Table1,…,Tablen WHERE condition ``` πprojections
σcondition
(Table1 × … × Tablen)
Page 12

## Tables are relations

• SQLRelational Algebra
`SELECT * FROM Countries` Countries

`SELECT * FROM CountriesWHERE name='UK'` σname='UK' Countries
• Table names can be used directly (`SELECT *` disappears)
Page 13

## What happens with expressions?

• SQLRelational Algebra
```SELECT capital,area/1000 FROM CountriesWHERE name='UK'``` πcapital,area/1000
σname='UK' Countries
• The expressions and conditions are just copied
Page 14

## Renaming columns

• SQLRelational Algebra
```SELECT name AS country,        population/area AS density FROM CountriesWHERE continent='EU'``` πname→country, population/area→density
σcontinent='EU'
Countries
• The `AS` keyword turns into an arrow (→)
Page 15

## Renaming tables (relations)

• SQLRelational Algebra
```SELECT A.name FROM Countries AS A,      Countries AS B WHERE A.name = B.capital ``` πA.name
σA.name=B.capital
(ρA Countries × ρB Countries)
Page 16

## Sorting

• SQLRelational Algebra
```SELECT name, capital FROM Countries ORDER BY name ``` πname, capital τname Countries
τname πname, capital Countries

```SELECT nameFROM CountriesORDER BY population ``` πname τpopulation Countries
• One can not sort by an attribute that has already been discarded by a projection.
Page 17

## Removing duplicates

• SQLRelational Algebra
```SELECT currency FROM Countries ``` πcurrency Countries

```SELECT DISTINCT currency FROM Countries ``` δ (πcurrency Countries)
Page 18

## Grouping

• SQLRelational Algebra
```SELECT currency, COUNT(name) FROM Countries GROUP BY currency ``` γcurrency,COUNT(name) Countries
Page 19

## Grouping with several aggregation functions

• SQLRelational Algebra
```SELECT currency,SUM(population) FROM Countries GROUP BY currency HAVING COUNT(name)>1 ``` πcurrency,SUM(population)
σCOUNT(name)>1
γcurrency,SUM(population),COUNT(name)
Countries
• Aggregation functions from different parts of the SQL query appear together in the same γ operator in relational algebra.
• Both `WHERE` cond and `HAVING` cond turn into σcond.
Page 20

# SQL Query processing

## A DBMS processes a query in several steps

• Logical query plan generation
• Translate SQL query to a relational algebra expression
• Optimization
• Simplifying relational algebra expressions
Page 21

# Query Optimization

## Algebraic laws and algebraic manipulation

• This is something that should be familiar from math. Example:
• a(b+c) = ab+ac
• (a+b)² = a² + 2ab + b²
• x+b+x-b = 2x
• The same idea can be applied also in relational algebra!
Page 22

## Practical benefits of query optimization

• There are often different ways of writing queries to solve a particular task.
• Since DBMSs contain a query optimizer that tries to transform each query into its most efficient form…
• …we can focus on formulating or SQL queries in a way that makes them easy for humans understand, and rely on the DBMS to make them run efficiently.
Page 23

## Example: repeated selection

• σc₁c₂ R) = σc₁ AND c₂ R
Page 24

## Example: repeated projection

• πp₁p₂ R) = πp₁ R
• Example:
• πnamename,population Countries) = πname Countries
• Seems obvious.
Page 25
##### Query Optimization → Example: repeated projection
• πp₁p₂ R) = πp₁ R
• It only works if p₂ is a plain projection.
• Counter example:
• πname,densityname,capital,population/area→density Countries) ≠ πname,density Countries
Page 26

## Pushing selection inside a cartesian product

• σc(R₁ × R₂) = (σc R₁) × R₂
• (if c only uses attributes from R₁)
• σc1 AND c2(R₁ × R₂) = (σc1 R₁) × (σc2 R₂)
• (if c1 & c2 only uses attributes from R₁ & R₂, respectively.)
• This can dramatically reduce the number of rows in the cartesian product!
Page 27

## Doing a join instead of a cartesian product

• σc(R₁ × R₂) = R₁ ⋈c R₂
• If the join is using the primary keys,
• we can quickly find matching rows in R₁ and R₂.
• the resulting table will not have more rows than R₁ or R₂.
Page 28

# Real-world example

• A quick look at query execution and query optimization in action.
• Using data from the Internet Movie Database (IMDb). Name_Basic 9.8 million rows 6.4 million rows 37.2 million rows 1.0 million rows
• SQL schemas for the IMDb data: movies.sql
• Example queries: examples.sql
Page 29

# Query optimization example

• Query: Find movies and TV series in which Frida Hallgren has appeared.
• SQL query:
• ```SELECT T.startYear,T.endYear,T.titleType,T.originalTitle
FROM Title_Principals AS P,Name_basics AS N,Title_Basics AS T
WHERE T.titleType IN ('movie','tvSeries')
AND N.primaryName='Frida Hallgren'
AND N.nconst = P.nconst
AND T.tconst = P.tconst ;
```
• Note: the cartesian product in the FROM line is a table with 37e6*9.8e6*6.4e6 = 2.3e21 rows!
Page 30

## First step: translate to Relational Algebra

• πT.startYear, T.endYear, T.titleType, T.originalTitle σT.titleType IN ('movie', 'tvSeries') AND N.primaryName='Frida Hallgren' AND N.nconst=P.nconst AND T.tconst=P.tconst (ρ P Title_Principals × ρ T Title_Basics × ρ N Name_Basics)
Page 31
##### Query optimization example
• πT.startYear, T.endYear, T.titleType, T.originalTitle σT.titleType IN ('movie', 'tvSeries') AND N.primaryName='Frida Hallgren' AND N.nconst=P.nconst AND T.tconst=P.tconst (P × N × T)
• Leaving out the long table names. Pretend they are still there!
Page 32
##### Query optimization example
• πT.startYear, T.endYear, T.titleType, T.originalTitle σN.nconst=P.nconst AND T.tconst=P.tconst (P × σprimaryName='Frida Hallgren' N × σT.titleType IN ('movie', 'tvSeries') T)
• Pushing selections inside the cartesian product
• Table size now: 37e6 * 1 * 716610 = 2.66e13
• 88e6 times reduction!
Page 33
##### Query optimization example
• πT.startYear, T.endYear, T.titleType, T.originalTitle σT.tconst=P.tconst ((P ⋈ σprimaryName='Frida Hallgren' N) × σt.titletype IN ('movie', 'tvSeries') t)
• The σprimaryName selects 1 of 9.8e6 names.
• The condition N.nconst=P.nconst can be converted into a natural join.
• The join selects 60 of 37e6 rows from P.
• Without indexes, the above two steps would need to examine 9.8e6 + 37e6 = 47e6 table rows
• The × creates a table with 60 * 716610 = 43e6 rows
Page 34
##### Query optimization example
• πT.startYear, T.endYear, T.titleType, T.originalTitle ((P ⋈ σprimaryname='Frida Hallgren' N) ⋈ σT.titleType IN ('movie', 'tvSeries') T)
• The second ⋈ is a join of subsets of P and T, using tconst.
• tconst is (part of) the primary key in both p and t.
• By using the indexes, the DBMS can create a physical query plan to do this efficiently.
Page 35

## Translating back to SQL

• πT.startYear, T.endYear, T.titleType, T.originalTitle ((P ⋈ σprimaryName='Frida Hallgren' N) ⋈ σT.titleType IN ('movie', 'tvSeries') T)
• ```WITH
Names AS (SELECT * FROM Name_Basics
WHERE primaryName='Frida Hallgren'),
Titles AS (SELECT * FROM Title_Basics
WHERE titleType IN ('movie','tvSeries'))
SELECT startYear, endYear, titleType, originalTitle
FROM Title_Principals NATURAL JOIN Names NATURAL JOIN Titles ;
```
Page 36

## Another way to write the same query in SQL

• ```SELECT startYear, endYear, titleType, originalTitle
FROM Title_Principals NATURAL JOIN Name_Basics
NATURAL JOIN Title_Basics
WHERE titletype IN ('movie','tvSeries')
AND primaryName = 'Frida Hallgren';
```
Page 37