cs 519 big data exploration and analytics
Download
Skip this Video
Download Presentation
CS 519: Big Data Exploration and Analytics

Loading in 2 Seconds...

play fullscreen
1 / 38

CS 519: Big Data Exploration and Analytics - PowerPoint PPT Presentation


  • 127 Views
  • Uploaded on

CS 519: Big Data Exploration and Analytics. Relational Query Languages. Announcements . The due data for selecting papers is tomorrow. Your top 3 choices First come, first server Some project ideas will be posted on the course Web site tomorrow. Presentations .

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' CS 519: Big Data Exploration and Analytics' - kolina


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
announcements
Announcements
  • The due data for selecting papers is tomorrow.
    • Your top 3 choices
    • First come, first server
  • Some project ideas will be posted on the course Web site tomorrow.
presentations
Presentations
  • 1 or 2 papers: 45 - 50 minutes presentation.
  • 20 - 25 minutes discussion
  • 5 minutes quiz.
  • Posted articles on how to effectively read scientific papers.
    • Which parts to read more carefully than others?
    • Read and analyze the paper critically.
presentations1
Presentations
  • The story:
    • Clear definition of the problem and its applications: Is it important?
  • What makes this paper different from others
    • Different version of the problem. Should we care?
    • Novel solution: Is it sufficiently novel?
  • Solution
    • Give a general overview.
    • Highlight the interesting and important parts.
presentations2
Presentations
  • Evaluation, evaluation, evaluation!
    • Setting: Is it realistic?
    • Results: Does the paper cover both advantages and disadvantages?
  • Conclusion and discussion
    • Strong points (3 or more). How to use them in problems?
    • Weak points (3 or more) How to address the weak points?
    • Possible future directions?
    • Connections with your project/ research?
presentations3
Presentations
  • David Paterson’s guide on “How to give a bad presentation”

http://www.cs.berkeley.edu/~pattrsn/talks/BadTalk.pdf

relational model and query languages
Relational model and query languages
  • Relational model defines data organization
  • Relational query languages define data retrieval/manipulation operations.
relational model
Relational model

Relation name

Attribute names

Book:

Title Price Category Year

MySQL $102.1 computer 2001

Cell biology $201.69 biology 1954

French cinema $53.99 art 2002

NBA History $63.65 sport 2010

tuples

relational model1
Relational Model
  • Attributes
    • Atomic values
    • Domain: string, integer, real,
    • Keys: no duplicate values
  • Each relation must have keys
    • A relation does not contain duplicate tuples.
database schema vs database instance
Database Schema vs. Database Instance
  • Schema of a Relation
    • Names of the relation and their attributes.
    • E.g.: Person (Name, Address, SSN)
    • Domains of / constraints on the attributes.
  • Schema of the database
    • Set of relation schemata
    • E.g.: Person (Name, Address, SSN)

Employment(Company, SSN)

database schema vs database instance1
Database Schema vs. Database Instance
  • Schema: S:Book(Title, Price, Category, Year)
  • Instance:
  • Values of each attribute A in I: active domain of A, adom(A)

Title Price Category Year

MySQL $102.1 computer 2001

Cell biology $201.69 biology 1954

French cinema $53.99 art 2002

NBA History $63.65 sport 2010

slide13
SQL
  • A declarative language for querying data stored in relational databases.
  • Much easier to use than procedural languages.
    • Say what instead of how

SELECT returned attribute(s)

FROM table(s)

WHERE conditions on the tuples of the table(s)

sql example
SQL Example

Movie(id, title, year, total-gross)

Actor(id, name, b-year)

Plays(mid,aid)

What movies are made in 1998?

SELECT title

FROM movie

WHERE year = 1998;

sql example1

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

SQL Example
  • Find actors who played in a movie whose total gross is more than $2,000,000.

SELECT *

FROM Actor, Movie, Plays

WHERE Movie.id= Plays.midAND

Plays.aid= Actor.id AND

total-gross > 2000000;

formal relational query languages
Formal Relational Query Languages
  • Formal languages that express queries over relational schemas.
  • Relational Algebra
  • Datalog (recursion-free with negation)
  • Relational calculus
  • Used to explore the properties of relational model.
  • Easier to use than SQL in some application domains.
named versus unnamed perspective
Named versus unnamed perspective
  • Named perspective

Movie(id, title, year, total-gross)

Actor(id, name, b-year)

Plays(mid,aid)

  • Unnamed perspective

Movie: arity 4

Actor: arity 3

Plays: arity 2

relational algebra ra
Relational Algebra (RA)
  • Used by RDBMS to execute queries
  • Six operators
    • Selection σ
    • Projection Π
    • Join ∞
    • Union
    • Difference –
    • Renaming ρ (for named perspective)
relational algebra

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Relational Algebra
  • Selection σ
    • Named perspective σtitle=‘Fargo’ (Movie)
    • Unnamed perspective σ2=‘Fargo’ (Movie)
  • Projection Π
    • Named perspective Πb-year (Actor)
    • Unnamed perspective Π3 (Actor)
  • Join ∞
    • Named perspective Movie ∞id=mid Plays
    • Unnamed perspective Movie ∞1=1Plays
datalog
Datalog
  • First created to support recursive queries over relational databases.
  • Easier to use than relational algebra.
  • Used extensively in research and industry
    • Data integration, networking, logic programming, learning, distributed processing, …
  • We talk about the recursion-free datalog.
datalog1

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog
  • Each tuple in database is a fact

Movie(236878, ‘Godfather I’, 1972, 40000000)

Movie(879900, ‘Godfather II’, 1974, 3900000)

Actor(090988,’Robert De Niro’, 1943)

  • Each query is a rule

Movies that were produced in 1998 and made more than $2,000.

Q1(y):- Movie(x,y,1998,z),z > 2000.

datalog example

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog Example
  • Actors who played in a movie whose total gross is more than $2,000.

Q2(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 2000.

  • Actors who played in a movie whose total gross is more than $2,000 and a movie made in 1998.

Q3(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 2000, Plays(g,x), Movie(g,l,1998,h).

datalog2

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog

Q2(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 2000.

y: head variable; x, z, t : existential variables

  • Extensional Database Predicates (EDB)
    • Movie, Actor, Plays
  • Intentional Database Predicate (IDB)
    • Q2

atom

atom

body

head

datalog programs

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog programs
  • A collection of rules: union
  • Actors who played in a movie with gross of more than $2,000 or a movie made after 1990.

Q4(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 2000.

Q4(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

w > 1990.

views

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Views
  • Similar to views in SQL
  • Actors who played in a movie with gross of more than $2000 and in a movie with ‘Robert De Niro’.

V(x,y,z):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 20000.

Q5(y):- V(x,y,z),Plays(t,x),Plays(t,f),Actor(f,’Robert De Niro’,g, h).

  • Unfolding

Q5(y):- Actor(x,y,z),Plays(t,x),Movie(t,v,w,f),

f > 20000,Plays(u,x),Plays(u,f),Actor(f,’Robert De Niro’,g, h).

Definition of V

datalog with negation

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Datalog with negation
  • All actors who did not play in a movie with ‘Robert De Niro’.

U(x,y,z):- Actor(x,y,z),Plays(t,x),Plays(t,f), Actor(f,’Robert De Niro’,g).

Q6(y):- Actor(x,y,z), not U(x,y,z).

safe datalog rules

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Safe datalog rules
  • Unsafe rules:

V(x,y,z):- Actor(x,y,1998), z > 200.

W(x,y,z):- Actor(x,y,z), not Plays(t,x).

  • A datalog rule is safe if every variable appear in at least one positive predicate
datalog to sql
Datalog to SQL
  • Non-recursive datalog with negation represents the core functionalities of SQL
  • We can translate each non-recursive datalog program to a core SQL query and vice versa.
relational calculus
Relational calculus
  • First order logic or predicate calculus
  • Two approaches:
    • Domain Relational Calculus (DRC): Unnamed perspective.
      • Our focus in this lecture.
    • Tuple Relational Calculus (TRC): Named perspective.
relational calculus1
Relational calculus
  • Each relational predicate P is:
    • Atom
    • P ∧ P
    • P ∨ P
    • P P
    • not (P)
  • Each query Q(x1, …, xn) is a predicate P.
relational calculus2

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Relational calculus
  • Actors who played in a movie with total gross of $2,000.
  • Actors who played only in movies produced in 1990.
relational calculus3

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Relational calculus
  • Actors who played in some movies with only one actor.
domain independent rc

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

Domain independent RC
  • Similar to datalog, one may write unsafe queries in RC, called domain dependent.
  • RC queries should be domain independent.
equivalency theorem
Equivalency Theorem
  • RA, non-recursive datalog with negation and RC express the same set of queries.
  • Relational queries.
conjunctive queries cq
Conjunctive queries (CQ)
  • One datalog rule.
  • SELECT-DISTINCT-FROM-WHERE.
  • Select/project/join (σ, Π, ∞) fragment of RA.
  • Existential/ conjunctive fragment of RC
  • There is not any comparison operator (<, ≠, …) in CQ. If used the family is called CQ<, CQ≠, …
cq examples

Movie(mid, title, year, total-gross)

Actor(aid, name, b-year)

Plays(mid,aid)

CQexamples

Actors who played in “LTR”.

Q7(y):- Actor(x,y,z),Plays(t,x),Movie(t,’LTR’,w,f).

Actors who played in a movie with total gross of $2000.

Non-CQ: Actors who played in some movies with only one actor.

query equivalency and containment
Query equivalency and containment
  • Interesting and long standing problems in database management.
  • Queries q1 and q2 are equivalent if and only if for every database instance I, q1(I) = q2(I)
  • Shown as q1 q2
  • Query q1is contained in q2if and only if for every database instance I, q1(I) q2(I)
  • Shown as q1 q2
containment examples
Containment examples

Is q1 q2?

q1(x):- R(x,y),R(y,z),R(z,w).

q2(x):- R(x,y),R(y,z).

q1(x):-R(x,y),R(y,’Joe’).

q2(x):-R(x,y),R(y,z).

q1(x):- R(x,y),R(y,z),R(z,x).

q2(x):- R(x,y),R(y,x).

ad