Cs 519 big data exploration and analytics
This presentation is the property of its rightful owner.
Sponsored Links
1 / 38

CS 519: Big Data Exploration and Analytics PowerPoint PPT Presentation


  • 89 Views
  • Uploaded on
  • Presentation posted in: General

CS 519: Big Data Exploration and Analytics. Relational Query Languages. Announcements . The due data for selecting papers is tomorrow. Your top 3 choices First come, first server Some project ideas will be posted on the course Web site tomorrow. Presentations .

Download Presentation

CS 519: Big Data Exploration and Analytics

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Cs 519 big data exploration and analytics

CS 519: Big Data Exploration and Analytics

Relational Query Languages


Announcements

Announcements

  • The due data for selecting papers is tomorrow.

    • Your top 3 choices

    • First come, first server

  • Some project ideas will be posted on the course Web site tomorrow.


Presentations

Presentations

  • 1 or 2 papers: 45 - 50 minutes presentation.

  • 20 - 25 minutes discussion

  • 5 minutes quiz.

  • Posted articles on how to effectively read scientific papers.

    • Which parts to read more carefully than others?

    • Read and analyze the paper critically.


Presentations1

Presentations

  • The story:

    • Clear definition of the problem and its applications: Is it important?

  • What makes this paper different from others

    • Different version of the problem. Should we care?

    • Novel solution: Is it sufficiently novel?

  • Solution

    • Give a general overview.

    • Highlight the interesting and important parts.


Presentations2

Presentations

  • Evaluation, evaluation, evaluation!

    • Setting: Is it realistic?

    • Results: Does the paper cover both advantages and disadvantages?

  • Conclusion and discussion

    • Strong points (3 or more). How to use them in problems?

    • Weak points (3 or more) How to address the weak points?

    • Possible future directions?

    • Connections with your project/ research?


Presentations3

Presentations

  • David Paterson’s guide on “How to give a bad presentation”

    http://www.cs.berkeley.edu/~pattrsn/talks/BadTalk.pdf


Relational query languages

Relational Query Languages


Relational model and query languages

Relational model and query languages

  • Relational model defines data organization

  • Relational query languages define data retrieval/manipulation operations.


Relational model

Relational model

Relation name

Attribute names

Book:

Title Price Category Year

MySQL $102.1 computer 2001

Cell biology $201.69 biology 1954

French cinema $53.99 art 2002

NBA History $63.65 sport 2010

tuples


Relational model1

Relational Model

  • Attributes

    • Atomic values

    • Domain: string, integer, real,

    • Keys: no duplicate values

  • Each relation must have keys

    • A relation does not contain duplicate tuples.


Database schema vs database instance

Database Schema vs. Database Instance

  • Schema of a Relation

    • Names of the relation and their attributes.

    • E.g.: Person (Name, Address, SSN)

    • Domains of / constraints on the attributes.

  • Schema of the database

    • Set of relation schemata

    • E.g.: Person (Name, Address, SSN)

      Employment(Company, SSN)


Database schema vs database instance1

Database Schema vs. Database Instance

  • Schema: S:Book(Title, Price, Category, Year)

  • Instance:

  • Values of each attribute A in I: active domain of A, adom(A)

Title Price Category Year

MySQL $102.1 computer 2001

Cell biology $201.69 biology 1954

French cinema $53.99 art 2002

NBA History $63.65 sport 2010


Cs 519 big data exploration and analytics

SQL

  • A declarative language for querying data stored in relational databases.

  • Much easier to use than procedural languages.

    • Say what instead of how

      SELECT returned attribute(s)

      FROM table(s)

      WHERE conditions on the tuples of the table(s)


Sql example

SQL Example

Movie(id, title, year, total-gross)

Actor(id, name, b-year)

Plays(mid,aid)

What movies are made in 1998?

SELECT title

FROM movie

WHERE year = 1998;


Sql example1

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

SQL Example

  • Find actors who played in a movie whose total gross is more than $2,000,000.

    SELECT *

    FROM Actor, Movie, Plays

    WHERE Movie.id= Plays.midAND

    Plays.aid= Actor.id AND

    total-gross > 2000000;


Formal relational query languages

Formal Relational Query Languages

  • Formal languages that express queries over relational schemas.

  • Relational Algebra

  • Datalog (recursion-free with negation)

  • Relational calculus

  • Used to explore the properties of relational model.

  • Easier to use than SQL in some application domains.


Named versus unnamed perspective

Named versus unnamed perspective

  • Named perspective

    Movie(id, title, year, total-gross)

    Actor(id, name, b-year)

    Plays(mid,aid)

  • Unnamed perspective

    Movie: arity 4

    Actor: arity 3

    Plays: arity 2


Relational algebra ra

Relational Algebra (RA)

  • Used by RDBMS to execute queries

  • Six operators

    • Selection σ

    • Projection Π

    • Join ∞

    • Union

    • Difference –

    • Renaming ρ (for named perspective)


Relational algebra

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Relational Algebra

  • Selection σ

    • Named perspectiveσtitle=‘Fargo’ (Movie)

    • Unnamed perspective σ2=‘Fargo’ (Movie)

  • Projection Π

    • Named perspectiveΠb-year(Actor)

    • Unnamed perspectiveΠ3(Actor)

  • Join ∞

    • Named perspectiveMovie ∞id=mid Plays

    • Unnamed perspectiveMovie ∞1=1Plays


Datalog

Datalog

  • First created to support recursive queries over relational databases.

  • Easier to use than relational algebra.

  • Used extensively in research and industry

    • Data integration, networking, logic programming, learning, distributed processing, …

  • We talk about the recursion-free datalog.


Datalog1

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog

  • Each tuple in database is a fact

    Movie(236878, ‘Godfather I’, 1972, 40000000)

    Movie(879900, ‘Godfather II’, 1974, 3900000)

    Actor(090988,’Robert De Niro’, 1943)

  • Each query is a rule

    Movies that were produced in 1998 and made more than $2,000.

    Q1(y):- Movie(x,y,1998,z),z > 2000.


Datalog example

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog Example

  • Actors who played in a movie whose total gross is more than $2,000.

    Q2(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

    f > 2000.

  • Actors who played in a movie whose total gross is more than $2,000 and a movie made in 1998.

    Q3(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

    f > 2000, Plays(g,x), Movie(g,l,1998,h).


Datalog2

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog

Q2(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 2000.

y: head variable; x, z, t : existential variables

  • Extensional Database Predicates (EDB)

    • Movie, Actor, Plays

  • Intentional Database Predicate (IDB)

    • Q2

atom

atom

body

head


Datalog programs

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog programs

  • A collection of rules: union

  • Actors who played in a movie with gross of more than $2,000 or a movie made after 1990.

    Q4(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

    f > 2000.

    Q4(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

    w > 1990.


Views

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Views

  • Similar to views in SQL

  • Actors who played in a movie with gross of more than $2000 and in a movie with ‘Robert De Niro’.

    V(x,y,z):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

    f > 20000.

    Q5(y):- V(x,y,z),Plays(t,x),Plays(t,f),Actor(f,’Robert De Niro’,g, h).

  • Unfolding

    Q5(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

    f > 20000,Plays(u,x),Plays(u,f),Actor(f,’Robert De Niro’,g, h).

Definition of V


Datalog with negation

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog with negation

  • All actors who did not play in a movie with ‘Robert De Niro’.

    U(x,y,z):- Actor(x,y,z),Plays(t,x),Plays(t,f), Actor(f,’Robert De Niro’,g).

    Q6(y):- Actor(x,y,z), not U(x,y,z).


Safe datalog rules

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Safe datalog rules

  • Unsafe rules:

    V(x,y,z):- Actor(x,y,1998), z > 200.

    W(x,y,z):- Actor(x,y,z), not Plays(t,x).

  • A datalog rule is safe if every variable appear in at least one positive predicate


Datalog to sql

Datalog to SQL

  • Non-recursive datalog with negation represents the core functionalities of SQL

  • We can translate each non-recursive datalog program to a core SQL query and vice versa.


Relational calculus

Relational calculus

  • First order logic or predicate calculus

  • Two approaches:

    • Domain Relational Calculus (DRC): Unnamed perspective.

      • Our focus in this lecture.

    • Tuple Relational Calculus (TRC): Named perspective.


Relational calculus1

Relational calculus

  • Each relational predicate P is:

    • Atom

    • P ∧ P

    • P ∨ P

    • P P

    • not (P)

  • Each query Q(x1, …, xn) is a predicate P.


Relational calculus2

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Relational calculus

  • Actors who played in a movie with total gross of $2,000.

  • Actors who played only in movies produced in 1990.


Relational calculus3

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Relational calculus

  • Actors who played in some movies with only one actor.


Domain independent rc

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Domain independent RC

  • Similar to datalog, one may write unsafe queries in RC, called domain dependent.

  • RC queries should be domain independent.


Equivalency theorem

Equivalency Theorem

  • RA, non-recursive datalog with negation and RC express the same set of queries.

  • Relational queries.


Conjunctive queries cq

Conjunctive queries (CQ)

  • One datalog rule.

  • SELECT-DISTINCT-FROM-WHERE.

  • Select/project/join (σ, Π, ∞) fragment of RA.

  • Existential/ conjunctive fragment of RC

  • There is not any comparison operator (<, ≠, …) in CQ. If used the family is called CQ<, CQ≠, …


Cq examples

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

CQexamples

Actors who played in “LTR”.

Q7(y):- Actor(x,y,z),Plays(t,x),Movie(t,’LTR’,w,f).

Actors who played in a movie with total gross of $2000.

Non-CQ: Actors who played in some movies with only one actor.


Query equivalency and containment

Query equivalency and containment

  • Interesting and long standing problems in database management.

  • Queries q1 and q2 are equivalent if and only if for every database instance I, q1(I) = q2(I)

  • Shown as q1q2

  • Query q1is contained in q2if and only if for every database instance I, q1(I) q2(I)

  • Shown as q1q2


Containment examples

Containment examples

Is q1 q2?

q1(x):- R(x,y),R(y,z),R(z,w).

q2(x):- R(x,y),R(y,z).

q1(x):-R(x,y),R(y,’Joe’).

q2(x):-R(x,y),R(y,z).

q1(x):- R(x,y),R(y,z),R(z,x).

q2(x):- R(x,y),R(y,x).


  • Login