Loading in 5 sec....

CS 519: Big Data Exploration and AnalyticsPowerPoint Presentation

CS 519: Big Data Exploration and Analytics

- 127 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about ' CS 519: Big Data Exploration and Analytics' - kolina

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### CS 519: Big Data Exploration and Analytics

Relational Query Languages

Announcements

- The due data for selecting papers is tomorrow.
- Your top 3 choices
- First come, first server

- Some project ideas will be posted on the course Web site tomorrow.

Presentations

- 1 or 2 papers: 45 - 50 minutes presentation.
- 20 - 25 minutes discussion
- 5 minutes quiz.
- Posted articles on how to effectively read scientific papers.
- Which parts to read more carefully than others?
- Read and analyze the paper critically.

Presentations

- The story:
- Clear definition of the problem and its applications: Is it important?

- What makes this paper different from others
- Different version of the problem. Should we care?
- Novel solution: Is it sufficiently novel?

- Solution
- Give a general overview.
- Highlight the interesting and important parts.

Presentations

- Evaluation, evaluation, evaluation!
- Setting: Is it realistic?
- Results: Does the paper cover both advantages and disadvantages?

- Conclusion and discussion
- Strong points (3 or more). How to use them in problems?
- Weak points (3 or more) How to address the weak points?
- Possible future directions?
- Connections with your project/ research?

Presentations

- David Paterson’s guide on “How to give a bad presentation”
http://www.cs.berkeley.edu/~pattrsn/talks/BadTalk.pdf

Relational model and query languages

- Relational model defines data organization
- Relational query languages define data retrieval/manipulation operations.

Relational model

Relation name

Attribute names

Book:

Title Price Category Year

MySQL $102.1 computer 2001

Cell biology $201.69 biology 1954

French cinema $53.99 art 2002

NBA History $63.65 sport 2010

tuples

Relational Model

- Attributes
- Atomic values
- Domain: string, integer, real,
- Keys: no duplicate values

- Each relation must have keys
- A relation does not contain duplicate tuples.

Database Schema vs. Database Instance

- Schema of a Relation
- Names of the relation and their attributes.
- E.g.: Person (Name, Address, SSN)
- Domains of / constraints on the attributes.

- Schema of the database
- Set of relation schemata
- E.g.: Person (Name, Address, SSN)
Employment(Company, SSN)

Database Schema vs. Database Instance

- Schema: S:Book(Title, Price, Category, Year)
- Instance:
- Values of each attribute A in I: active domain of A, adom(A)

Title Price Category Year

MySQL $102.1 computer 2001

Cell biology $201.69 biology 1954

French cinema $53.99 art 2002

NBA History $63.65 sport 2010

SQL

- A declarative language for querying data stored in relational databases.
- Much easier to use than procedural languages.
- Say what instead of how
SELECT returned attribute(s)

FROM table(s)

WHERE conditions on the tuples of the table(s)

- Say what instead of how

SQL Example

Movie(id, title, year, total-gross)

Actor(id, name, b-year)

Plays(mid,aid)

What movies are made in 1998?

SELECT title

FROM movie

WHERE year = 1998;

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

SQL Example- Find actors who played in a movie whose total gross is more than $2,000,000.
SELECT *

FROM Actor, Movie, Plays

WHERE Movie.id= Plays.midAND

Plays.aid= Actor.id AND

total-gross > 2000000;

Formal Relational Query Languages

- Formal languages that express queries over relational schemas.
- Relational Algebra
- Datalog (recursion-free with negation)
- Relational calculus
- Used to explore the properties of relational model.
- Easier to use than SQL in some application domains.

Named versus unnamed perspective

- Named perspective
Movie(id, title, year, total-gross)

Actor(id, name, b-year)

Plays(mid,aid)

- Unnamed perspective
Movie: arity 4

Actor: arity 3

Plays: arity 2

Relational Algebra (RA)

- Used by RDBMS to execute queries
- Six operators
- Selection σ
- Projection Π
- Join ∞
- Union
- Difference –
- Renaming ρ (for named perspective)

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Relational Algebra- Selection σ
- Named perspective σtitle=‘Fargo’ (Movie)
- Unnamed perspective σ2=‘Fargo’ (Movie)

- Projection Π
- Named perspective Πb-year (Actor)
- Unnamed perspective Π3 (Actor)

- Join ∞
- Named perspective Movie ∞id=mid Plays
- Unnamed perspective Movie ∞1=1Plays

Datalog

- First created to support recursive queries over relational databases.
- Easier to use than relational algebra.
- Used extensively in research and industry
- Data integration, networking, logic programming, learning, distributed processing, …

- We talk about the recursion-free datalog.

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog- Each tuple in database is a fact
Movie(236878, ‘Godfather I’, 1972, 40000000)

Movie(879900, ‘Godfather II’, 1974, 3900000)

Actor(090988,’Robert De Niro’, 1943)

- Each query is a rule
Movies that were produced in 1998 and made more than $2,000.

Q1(y):- Movie(x,y,1998,z),z > 2000.

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog Example- Actors who played in a movie whose total gross is more than $2,000.
Q2(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 2000.

- Actors who played in a movie whose total gross is more than $2,000 and a movie made in 1998.
Q3(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 2000, Plays(g,x), Movie(g,l,1998,h).

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

DatalogQ2(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 2000.

y: head variable; x, z, t : existential variables

- Extensional Database Predicates (EDB)
- Movie, Actor, Plays

- Intentional Database Predicate (IDB)
- Q2

atom

atom

body

head

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog programs- A collection of rules: union
- Actors who played in a movie with gross of more than $2,000 or a movie made after 1990.
Q4(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 2000.

Q4(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

w > 1990.

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Views- Similar to views in SQL
- Actors who played in a movie with gross of more than $2000 and in a movie with ‘Robert De Niro’.
V(x,y,z):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 20000.

Q5(y):- V(x,y,z),Plays(t,x),Plays(t,f),Actor(f,’Robert De Niro’,g, h).

- Unfolding
Q5(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 20000,Plays(u,x),Plays(u,f),Actor(f,’Robert De Niro’,g, h).

Definition of V

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog with negation- All actors who did not play in a movie with ‘Robert De Niro’.
U(x,y,z):- Actor(x,y,z),Plays(t,x),Plays(t,f), Actor(f,’Robert De Niro’,g).

Q6(y):- Actor(x,y,z), not U(x,y,z).

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Safe datalog rules- Unsafe rules:
V(x,y,z):- Actor(x,y,1998), z > 200.

W(x,y,z):- Actor(x,y,z), not Plays(t,x).

- A datalog rule is safe if every variable appear in at least one positive predicate

Datalog to SQL

- Non-recursive datalog with negation represents the core functionalities of SQL
- We can translate each non-recursive datalog program to a core SQL query and vice versa.

Relational calculus

- First order logic or predicate calculus
- Two approaches:
- Domain Relational Calculus (DRC): Unnamed perspective.
- Our focus in this lecture.

- Tuple Relational Calculus (TRC): Named perspective.

- Domain Relational Calculus (DRC): Unnamed perspective.

Relational calculus

- Each relational predicate P is:
- Atom
- P ∧ P
- P ∨ P
- P P
- not (P)

- Each query Q(x1, …, xn) is a predicate P.

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Relational calculus- Actors who played in a movie with total gross of $2,000.
- Actors who played only in movies produced in 1990.

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Relational calculus- Actors who played in some movies with only one actor.

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Domain independent RC- Similar to datalog, one may write unsafe queries in RC, called domain dependent.
- RC queries should be domain independent.

Equivalency Theorem

- RA, non-recursive datalog with negation and RC express the same set of queries.
- Relational queries.

Conjunctive queries (CQ)

- One datalog rule.
- SELECT-DISTINCT-FROM-WHERE.
- Select/project/join (σ, Π, ∞) fragment of RA.
- Existential/ conjunctive fragment of RC
- There is not any comparison operator (<, ≠, …) in CQ. If used the family is called CQ<, CQ≠, …

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

CQexamplesActors who played in “LTR”.

Q7(y):- Actor(x,y,z),Plays(t,x),Movie(t,’LTR’,w,f).

Actors who played in a movie with total gross of $2000.

Non-CQ: Actors who played in some movies with only one actor.

Query equivalency and containment

- Interesting and long standing problems in database management.
- Queries q1 and q2 are equivalent if and only if for every database instance I, q1(I) = q2(I)
- Shown as q1 q2
- Query q1is contained in q2if and only if for every database instance I, q1(I) q2(I)
- Shown as q1 q2

Containment examples

Is q1 q2?

q1(x):- R(x,y),R(y,z),R(z,w).

q2(x):- R(x,y),R(y,z).

q1(x):-R(x,y),R(y,’Joe’).

q2(x):-R(x,y),R(y,z).

q1(x):- R(x,y),R(y,z),R(z,x).

q2(x):- R(x,y),R(y,x).

Download Presentation

Connecting to Server..