Parallel & Distributed Database Systems

Reid Exley

November 3, 2005

Overview
  • Parallel Databases
    • Background
    • Architectures
    • Applications
  • Distributed Databases
    • Background
    • Characteristics
    • Applications
What is the PROBLEM?
  • Performance
  • Mainframes expensive
  • I/O bottleneck: speed(disk) < speed(RAM) < speed(CPU)
What is a parallel database?
  • Definition: a storage and retrieval method that seeks to improve performance through parallelization of various operations, such as loading data, building indexes, and evaluating queries
Why parallel databases?
  • Relational model affords it
    • SQL is declarative, so the same query can be executed in parallel without being rewritten
  • Hardware trends
    • Decreasing costs of servers
    • Increasing speed of hardware
  • Complex queries need to be answered
    • Data warehouse
    • Search engines
Three Architectures
  • Shared-memory system
  • Shared-disk system
  • Shared-nothing system
Shared-memory system
  • Multiple CPUs are attached to an interconnection network and can access a common region of main memory
  • Closest to a conventional machine

DeWitt, D. J. & Gray, J. (1992). Parallel Database Systems: The Future of High Performance Database Processing. CACM.

What’s wrong with it?
  • Memory contention becomes a bottleneck as the # of CPUs increases
Shared-disk system
  • Each CPU has a private memory and direct access to all disks through an interconnection network

DeWitt, D. J. & Gray, J. (1992). Parallel Database Systems: The Future of High Performance Database Processing. CACM.

What’s wrong with it?
  • Disks over the network become the bottleneck
  • INTERFERENCE: as more CPUs are added, existing CPUs are slowed down because of the increased contention for memory accesses and network bandwidth
Shared-nothing system
  • Each CPU has local main memory and disk space, but no two CPUs can access the same storage area
  • All communication between CPUs is through a network connection
  • Widely considered the most scalable parallel architecture

DeWitt, D. J. & Gray, J. (1992). Parallel Database Systems: The Future of High Performance Database Processing. CACM.

Disadvantages:
  • Requires more extensive reorganization of the DBMS code
  • Load balancing is complex
Advantages:
  • Provides linear speed-up: Twice as much hardware can perform the task in half the elapsed time
  • Provides linear scale-up: Twice as much hardware can perform twice as large a task in the same elapsed time

DeWitt, D. J. & Gray, J. (1992). Parallel Database Systems: The Future of High Performance Database Processing. CACM.
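
As a rough illustration of the speed-up and scale-up definitions above, the sketch below computes both ratios from elapsed times. The function names and the timings are hypothetical, chosen only to show what "linear" means in each case.

```python
def speedup(small_system_time: float, big_system_time: float) -> float:
    """Same task on both systems: how many times faster is the big system?"""
    return small_system_time / big_system_time


def scaleup(small_task_on_small_system: float,
            big_task_on_big_system: float) -> float:
    """Task grows with the hardware: 1.0 means elapsed time stayed the same."""
    return small_task_on_small_system / big_task_on_big_system


# Hypothetical timings in seconds (not measurements from the presentation).
# Doubling the hardware halves the elapsed time: linear speed-up of 2.
linear_speedup = speedup(100.0, 50.0)

# Doubling both hardware and task size keeps elapsed time flat: scale-up of 1.
linear_scaleup = scaleup(100.0, 100.0)

print(linear_speedup, linear_scaleup)
```

A speed-up below the hardware factor, or a scale-up below 1.0, would indicate that interference or skew is eating into the added capacity.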

How is parallelism achieved?
  • Pipelining
  • Data Partitioning
Pipelining
  • The output of the first operator is worked on by the second operator as soon as it is generated

[Figure: pipeline timeline with Modules A, B, C (time = 0, 1, 2), Module D (time = 1, 2), and Module E (time = 2)]

Pipelining Challenges
  • Blocking – an operator is blocking if it does not output until it has consumed all inputs
    • Sorting & Aggregation
    • Blocking kills pipeline parallelism
  • Relational pipelines are rarely very long
  • Execution cost of one operator is sometimes much larger than that of the others (skew)
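
The difference between a pipelined and a blocking operator can be sketched with Python generators. This is only an illustration: the operator names are made up, and the generators interleave work in a single thread rather than running on separate processors. The log records the order in which operators touch rows.

```python
log = []  # order in which operators touch rows


def scan(rows):
    """Leaf operator: emits rows one at a time."""
    for row in rows:
        log.append(f"scan {row}")
        yield row


def select_even(rows):
    """Pipelined operator: passes each qualifying row on immediately."""
    for row in rows:
        if row % 2 == 0:
            log.append(f"select {row}")
            yield row


def blocking_sort(rows):
    """Blocking operator: must consume all input before emitting anything."""
    ordered = sorted(rows)
    log.append("sort done")
    yield from ordered


# Pipelined plan: select starts working before scan has finished.
pipelined_result = list(select_even(scan([4, 1, 2])))
pipelined_log = list(log)

# Plan with a sort in the middle: nothing flows until the scan completes.
log.clear()
blocked_result = list(select_even(blocking_sort(scan([4, 1, 2]))))
blocked_log = list(log)

print(pipelined_log)
print(blocked_log)
```

In the first log the scan and select entries alternate; in the second, every scan entry precedes "sort done", which is the point at which pipeline parallelism is lost.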
Data Partitioning
  • Spreads I/O and computation among processors
Partitioning Methods
  • Range
  • Hash
  • Round Robin
Range
  • Rows are sorted, and n ranges are chosen for the sort key values, so each range contains roughly the same # of rows
  • Rows in range i are assigned to processor i
  • Considerations:
    • Good for equijoins, range queries, group by
    • May need to sample data from disks to get even distribution (disk skew)
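
A minimal sketch of range partitioning, assuming integer sort keys and processors numbered from 0. The helper names `choose_boundaries` and `range_partition` are hypothetical. The boundaries are taken from the sorted keys so that each range receives roughly the same number of rows.

```python
from bisect import bisect_right


def choose_boundaries(keys, n):
    """Pick n - 1 split points so the n ranges hold roughly equal row counts."""
    ordered = sorted(keys)
    return [ordered[(i * len(ordered)) // n] for i in range(1, n)]


def range_partition(key, boundaries):
    """Return the index of the range (and so the processor) that owns key."""
    return bisect_right(boundaries, key)


keys = [5, 1, 9, 3, 7, 2, 8, 4]
boundaries = choose_boundaries(keys, 4)

partitions = {}
for key in sorted(keys):
    partitions.setdefault(range_partition(key, boundaries), []).append(key)

print(boundaries)
print(partitions)
```

In practice the boundaries would come from a sample of the data rather than a full sort, which is the sampling consideration noted above.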
Hash
  • Hash function is applied to fields of a row to determine its processor
  • Considerations:
    • Good for equijoins
    • Good for accessing a subset of a relation
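
Hash partitioning can be sketched as below. The function name, the row layout, and the choice of CRC32 as the hash function are assumptions for illustration; any deterministic hash would do. Rows from two relations that share a join-key value land on the same processor, which is why the method suits equijoins.

```python
import zlib


def hash_partition(row, fields, n):
    """Map a row to one of n processors by hashing the chosen fields."""
    key = "|".join(str(row[field]) for field in fields).encode("utf-8")
    return zlib.crc32(key) % n


# Hypothetical rows from two relations that join on dept_id.
employee = {"name": "Ann", "dept_id": 7}
department = {"dept_id": 7, "budget": 1000}

employee_node = hash_partition(employee, ("dept_id",), 4)
department_node = hash_partition(department, ("dept_id",), 4)

print(employee_node == department_node)
```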
Round-robin
  • If there are n processors, the ith tuple is assigned to processor i mod n
  • Considerations:
    • Good if accessing the entire table or relation
    • Good for load balance
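
The round-robin rule can be written in a few lines. This is a sketch with a hypothetical function name, with tuples and processors both numbered from 0.

```python
def round_robin_partition(rows, n):
    """Assign the i-th row to processor i mod n."""
    partitions = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        partitions[i % n].append(row)
    return partitions


parts = round_robin_partition(list("abcdefg"), 3)
print(parts)
```

Partition sizes differ by at most one row, which is the load-balance property noted above; the price is that no partition can be skipped when looking for a particular value.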
Parallel Join Example
[Figure: relations R (fragments R1, R2) and S (fragments S1, S2) distributed across Nodes 1 to 4]
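
One common way to run a join in parallel is a partitioned hash join: both relations are partitioned on the join key, each node joins only its own pair of fragments, and the results are combined. The sketch below is an assumption about what such a plan might look like, not a reconstruction of the slide's figure. It uses integer join keys in column 0, and a thread pool stands in for the four nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(rows, n):
    """Send each row to node (join key mod n); integer keys are assumed."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[0] % n].append(row)
    return parts


def local_hash_join(r_part, s_part):
    """Join one node's fragments: build a hash table on R, probe with S."""
    table = {}
    for r in r_part:
        table.setdefault(r[0], []).append(r)
    return [r + s[1:] for s in s_part for r in table.get(s[0], [])]


def parallel_join(r, s, n=4):
    r_parts = partition_by_key(r, n)
    s_parts = partition_by_key(s, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = pool.map(local_hash_join, r_parts, s_parts)
        return sorted(row for part in results for row in part)


R = [(1, "a"), (2, "b"), (5, "c")]
S = [(1, "x"), (5, "y"), (5, "z"), (3, "w")]
joined = parallel_join(R, S)
print(joined)
```

Because matching keys always hash to the same node, no node needs to see another node's fragments after the initial partitioning step.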
Real-life Application
  • Search engines
    • Google
Open Problems
  • Database query optimizers do not consider all possible plans
    • min{cost = data transmission + local processing}
  • Highly skewed value distributions
  • Hybrid Architectures
  • Support of higher functionality such as rules and objects
What is the PROBLEM?
  • Availability
    • Redundancy
  • Local Ownership
    • Data needs are often local
  • Accessing Distributed Data
    • Data at multiple sites must be accessible
What is a distributed database?
  • A collection of multiple, logically interrelated databases distributed over a computer network
Characteristics
  • Distributed Data Independence
    • User should not have to know where data is located
  • Distributed Transaction Atomicity
    • To a transaction, the database should appear to be one local database
    • Atomic: all-or-nothing transactions
  • Data is logically related
  • Most DDBs are relational
Dimensions
  • Autonomy
    • How independent is each of the DBs?
  • Distribution
    • How many DBs are there?
  • Heterogeneity
    • How different are the data models, query languages, interfaces, and transaction management protocols?
Architectures
  • Client-Server
  • Collaborating Server
  • Middleware
Client-server
  • Client sends query to single site
  • All query processing done at one server

[Figure: a client sends a query to a single server]

Collaborating Server
  • A collection of DB servers that cooperatively execute transactions
  • When data is distributed, a server generates appropriate subqueries to be executed by other servers and puts the results together

[Figure: a query submitted to five collaborating DB servers]
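
The collaborating-server idea of generating subqueries and putting the results together can be sketched as a scatter/gather loop. Everything here is hypothetical: the site names, the `orders` rows, and the filter stand in for real servers and a real query.

```python
def run_subquery(local_rows, min_total):
    """Subquery executed at one site: filter that site's local rows."""
    return [row for row in local_rows if row["total"] >= min_total]


def collaborating_query(sites, min_total):
    """Send the subquery to every site, then merge the partial results."""
    merged = []
    for site_name, local_rows in sites.items():
        for row in run_subquery(local_rows, min_total):
            merged.append({**row, "site": site_name})
    return sorted(merged, key=lambda row: row["order_id"])


# Hypothetical data: one relation spread over three sites.
sites = {
    "site_a": [{"order_id": 1, "total": 40}, {"order_id": 4, "total": 90}],
    "site_b": [{"order_id": 2, "total": 75}],
    "site_c": [{"order_id": 3, "total": 10}],
}

result = collaborating_query(sites, min_total=50)
print(result)
```

Pushing the filter to each site means only qualifying rows cross the network, which matters because data transmission is part of the query cost.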

Middleware
  • Coordinates the execution of queries and transactions across one or more independent database servers
  • Does not contain any data itself

[Figure: a query passes through the middleware layer to four DB servers, and a result is returned]
Fragmentation
  • Breaking a relation into smaller relations or fragments and storing the fragments possibly at different sites
    • Horizontal fragmentation (rows)
    • Vertical fragmentation (columns)
    • Hybrid (horizontal and vertical combined)
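
A small sketch of the two basic fragmentation styles, using a hypothetical `employees` relation held as a list of dicts. Horizontal fragments are recombined with a union; vertical fragments each keep the key column (an assumption of this sketch, needed so that rows can be matched up again) and are recombined with a join on that key.

```python
employees = [
    {"id": 1, "name": "Ann", "city": "Austin", "salary": 50},
    {"id": 2, "name": "Bob", "city": "Boston", "salary": 60},
    {"id": 3, "name": "Cy", "city": "Austin", "salary": 70},
]


def horizontal_fragment(rows, predicate):
    """Keep whole rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def vertical_fragment(rows, columns, key="id"):
    """Keep only some columns, plus the key so rows can be rejoined."""
    return [{name: row[name] for name in (key, *columns)} for row in rows]


austin = horizontal_fragment(employees, lambda r: r["city"] == "Austin")
others = horizontal_fragment(employees, lambda r: r["city"] != "Austin")

public = vertical_fragment(employees, ["name", "city"])
payroll = vertical_fragment(employees, ["salary"])

# Reconstruction: union for horizontal, join on the key for vertical.
from_horizontal = sorted(austin + others, key=lambda r: r["id"])
payroll_by_id = {row["id"]: row for row in payroll}
from_vertical = [{**row, **payroll_by_id[row["id"]]} for row in public]

print(from_horizontal == employees, from_vertical == employees)
```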
Replication
  • Storing several copies of a relation or fragments
  • Why?
    • Increased availability of data (redundancy)
    • Faster Query Evaluation (local is faster)
  • Necessary for Data Warehousing
Replication: Two Methods
  • Synchronous
    • All copies are synchronized before the transaction commits
  • Asynchronous
    • Modified copies are only periodically updated
    • More common
    • Sacrifices data independence for efficiency
    • Two types: primary site or peer-to-peer

Q. When would you use each method?

Primary Site
  • One copy of a relation is designated as the primary or master
  • Secondary copies are made from the primary and are not directly updated
  • Publisher & Subscriber
Peer-to-peer
  • More than one copy can be master
  • Changes to a master copy must be propagated to other copies
  • If two master copies are changed in a conflicting manner, the conflict must be resolved (e.g., Site 1 changes Joe’s age to 35; Site 2 changes it to 36)
  • Best used in situations that don’t produce conflicts
    • Each master site owns a disjoint fragment
    • Updating rights are held by only one master at a time
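
The conflict in the Joe's-age example can be detected when updates are exchanged between masters. The sketch below is one possible scheme, not something the presentation specifies: it assumes every update carries the version of the record that the site saw before changing it, so two different new values built on the same base version signal a conflict. The `Update` type and `find_conflicts` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    site: str
    key: str
    base_version: int  # version the site saw before making its change
    new_value: int


def find_conflicts(updates):
    """Pair up updates that changed the same version of the same record."""
    first_seen = {}
    conflicts = []
    for update in updates:
        slot = (update.key, update.base_version)
        earlier = first_seen.setdefault(slot, update)
        if earlier is not update and earlier.new_value != update.new_value:
            conflicts.append((earlier.site, update.site, update.key))
    return conflicts


updates = [
    Update("site1", "joe.age", base_version=1, new_value=35),
    Update("site2", "joe.age", base_version=1, new_value=36),
    Update("site2", "ann.age", base_version=4, new_value=29),
]

conflicts = find_conflicts(updates)
print(conflicts)
```

Detection is the easy part; how the conflict is resolved (for example, which site wins) is an application decision that this sketch leaves open.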
Distributed Transactions
  • Distributed Concurrency Control
  • Distributed Recovery
Distributed Concurrency Control
  • How do we manage locks for objects across all sites?
    • Centralized: One site does all locking
      • Vulnerable to single-site failure
    • Primary Copy: All locking for an object done at the primary copy for this object
      • Reading requires access to the locking site as well as the site where the object is stored
    • Fully Distributed: Locking for a copy done at site where copy is stored
      • Locks at all sites while writing an object
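
The three locking schemes differ mainly in which sites must be contacted to lock an object. The sketch below encodes that as a lookup; the function, its parameters, and the site names are hypothetical, and it assumes that under the fully distributed scheme a read locks only the one copy it reads (here, the first in the list).

```python
def lock_sites(scheme, copies, primary, central, writing):
    """Return the set of sites that must grant a lock for one object."""
    if scheme == "centralized":
        return {central}  # one site handles all locking
    if scheme == "primary_copy":
        return {primary}  # the object's primary copy handles its locks
    if scheme == "fully_distributed":
        # Writers lock every copy; a reader locks only the copy it reads.
        return set(copies) if writing else {copies[0]}
    raise ValueError(f"unknown scheme: {scheme}")


copies = ["s1", "s2", "s3"]  # hypothetical sites holding copies of the object

central = lock_sites("centralized", copies, "s2", "s0", writing=True)
primary = lock_sites("primary_copy", copies, "s2", "s0", writing=False)
dist_write = lock_sites("fully_distributed", copies, "s2", "s0", writing=True)
dist_read = lock_sites("fully_distributed", copies, "s2", "s0", writing=False)

print(central, primary, dist_write, dist_read)
```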
Distributed Recovery
  • Two new issues:
    • Link and remote system failures
    • Must ensure atomicity of sub-transactions
  • Solution: Two-phase commit
    • Coordinator (originator) asks each subordinate (the other DBs) for a vote (yes or abort)
    • If the vote is unanimously yes, the coordinator decides to commit and tells all subordinates to commit; otherwise it tells them to abort
    • Coordinator waits for an “ack” from every subordinate, then writes the end record to its transaction log
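
The message flow of two-phase commit can be traced with a short simulation. This is a simplified sketch under stated assumptions: votes are supplied up front as booleans, no site or link fails, and log writes are represented only as strings. The function name and message labels are hypothetical.

```python
def two_phase_commit(votes):
    """Simulate one 2PC round; votes maps subordinate name -> True (yes)."""
    trace = []

    # Phase 1: the coordinator asks every subordinate to vote.
    for site, vote in votes.items():
        trace.append(f"prepare -> {site}")
        trace.append(f"{site} votes {'yes' if vote else 'abort'}")

    # Commit only on a unanimous yes; the decision is logged before it is sent.
    decision = "commit" if all(votes.values()) else "abort"
    trace.append(f"coordinator logs {decision}")

    # Phase 2: the coordinator announces the decision and collects acks.
    for site in votes:
        trace.append(f"{decision} -> {site}")
        trace.append(f"ack <- {site}")

    trace.append("coordinator logs end")
    return decision, trace


committed, commit_trace = two_phase_commit({"db1": True, "db2": True})
aborted, abort_trace = two_phase_commit({"db1": True, "db2": False})

print(committed, aborted)
print(commit_trace)
```

A real implementation also has to handle timeouts and restarts, which is where the link and remote system failures mentioned above come in.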
Popular Support
  • Oracle
    • Grid Databases/Computing
    • http://www.oracle.com/technologies/grid/index.html
  • MySQL
    • Cluster DBMS
    • http://www.mysql.com/products/database/cluster/
  • DB2, Microsoft SQL Server, etc.
Applications
  • Manufacturing – especially at multiple locations
  • Military command and control
  • Airlines
  • Hotel Chains
  • (Any organization which has a decentralized structure)
Open Problems
  • Increasing performance of locking protocols
    • Speculative Locking – waiting transactions execute on both the before and after states of preceding transactions prior to a commit (Krishna & Kitsuregawa, 2004)
Summary
  • Parallel DBMSs designed for scalable performance; relational operators very well-suited for parallel execution
    • Pipeline and partitioned parallelism
  • Distributed DBMSs offer site autonomy and distributed administration

Questions?