

  1. ICS 214B: Transaction Processing and Distributed Data Management Lecture 8: Distributed Database Systems Professor Chen Li

  2. So far: Centralized DB systems
  [Diagram: one processor P with memory M running the software stack Application → SQL Front End → Query Processor → Transaction Proc. → File Access.]
  • Simplifications:
    • single front end
    • one place to keep locks
    • if the processor fails, the system fails, ...

  3. Next: distributed database systems
  • Multiple processors (+ memories)
  • Heterogeneity and autonomy of “components”

  4. Why do we need Distributed Databases?
  • Example: Big Corp. has offices in London, New York, and Hong Kong.
  • Employee data: EMP(ENO, NAME, TITLE, SALARY, …)
  • Where should the employee data table reside?

  5. Big Corp. Data Access Pattern
  • Mostly, employee data is managed at the office where the employee works
    • E.g., payroll, benefits, hire and fire
  • Periodically, Big Corp. needs consolidated access to employee data
    • E.g., Big Corp. changes benefit plans, which affects all employees.
    • E.g., the annual bonus depends on global net profit.

  6. [Diagram: the EMP table is stored only in London; payroll apps in London, New York, and Hong Kong access it over the Internet.]
  Problem: the NY and HK payroll apps run very slowly!

  7. [Diagram: EMP is split by office; London, New York, and Hong Kong each store only their own employees (London Emp, NY Emp, HK Emp), and each payroll app reads its local fragment.]
  Much better!!

  8. [Diagram: same fragmentation as slide 7, plus an Annual Bonus app that reads the London, NY, and HK fragments over the Internet.]
  Distribution provides opportunities for parallel execution.

  9. [Diagram: the same configuration as slide 8, shown again.]

  10. [Diagram: fragments are replicated across sites: London holds the London and NY employees, New York holds the NY and HK employees, and Hong Kong holds the HK and London employees.]
  Replication improves availability.

  11. Heterogeneity and Autonomy
  [Diagram: a Portfolio application accesses heterogeneous, autonomous sources: a stock ticker tape, flat files, and an RDBMS holding the history of dividends, ratios, ...]

  12. We will study data management with multiple processors and possible autonomy and heterogeneity
  • Impact on:
    • Data organization
    • Query processing
    • Access structures
    • Concurrency control
    • Recovery

  13. We also study transaction monitors
  • Coordinate transaction execution
  • Multiple DBMSs
  • High performance
  • Have workflow facilities
  • Manage communications with client “terminals”

  14. DB architectures (1): Shared memory
  [Diagram: processors P, P, P, ... all connected to a single shared memory M.]

  15. DB architectures (2): Shared disk
  [Diagram: each processor P has its own memory M; all processors share the disks.]

  16. DB architectures (3): Shared nothing
  [Diagram: each processor P has its own memory M and its own disk; nodes communicate only over the interconnect.]

  17. DB architectures (4): Hybrid example – Hierarchical or Clustered
  [Diagram: shared-memory clusters (several processors P sharing one memory M) connected to each other shared-nothing style.]

  18. Issues for selecting an architecture
  • Reliability
  • Scalability
  • Geographic distribution of data
  • Data “clusters”
  • Performance
  • Cost

  19. Parallel or distributed DB system? • More similarities than differences!

  20. Typically, parallel DBs:
  • Fast interconnect
  • Homogeneous software
  • High performance is the goal
  • Transparency is the goal

  21. Typically, distributed DBs:
  • Geographically distributed
  • Data sharing is the goal (may run into heterogeneity, autonomy)
  • Disconnected operation possible

  22. Distributed Database Challenges: Distributed Database Design
  • Deciding what data goes where
  • Depends on the data access patterns of the major applications
  • Two subproblems:
    • Fragmentation: partition tables into fragments
    • Allocation: allocate fragments to nodes

  23. Distributed Database Challenges: Distributed Query Processing
  • Centralized query plan goal: minimize the number of disk I/Os
  • Additional factors in the distributed scenario:
    • Communication costs
    • Opportunity for parallelism
  • The space of possible query plans is much larger!
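  To make the communication-cost point concrete, here is a minimal sketch in Python (the cost constants and page counts are assumptions for illustration, not from the lecture) comparing two plans for the same join across two sites; the cheaper plan ships the smaller relation, a choice a purely I/O-based centralized optimizer never has to make.

    # Toy cost model: the constants are illustrative assumptions only.
    IO_COST = 1.0        # cost per page read from local disk
    NET_COST = 10.0      # cost per page shipped between sites (assumed >> IO_COST)

    def join_at_r_site(r_pages: int, s_pages: int) -> float:
        """Ship S to R's site, then scan both relations locally to join."""
        return s_pages * NET_COST + (r_pages + s_pages) * IO_COST

    def join_at_s_site(r_pages: int, s_pages: int) -> float:
        """Ship R to S's site, then scan both relations locally to join."""
        return r_pages * NET_COST + (r_pages + s_pages) * IO_COST

    # R has 1000 pages at one site, S has 50 pages at another.
    print(join_at_r_site(1000, 50))   # 1550.0  (ship the small relation: cheaper)
    print(join_at_s_site(1000, 50))   # 11050.0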

  24. Distributed Database Challenges: Distributed Concurrency Control
  • Transactions span nodes
    • Must be globally serializable
  • Two main approaches:
    • Locking
    • Timestamps
  • Distributed deadlock management
  • Multiple data copies need to be kept in sync when updates occur

  25. Distributed Database Challenges: Reliability of Distributed Databases
  • Centralized database failure model:
    • the processor fails
  • Distributed database failure model:
    • One or more processors may fail
    • The network may fail
    • The network may be partitioned
    • Data must be kept in sync

  26. To illustrate synchronization problems: the “Two Generals” Problem

  27. The one-general problem (Trivial!)
  [Diagram: a single general G commanding troops on the battlefield.]

  28. The two-generals problem:
  [Diagram: the blue army (Blue G) and the red army (Red G) sit on opposite sides of the enemy; they can communicate only via messengers who must cross enemy territory.]

  29. Rules:
  • The blue and red armies must attack at the same time
  • The blue and red generals synchronize through messengers
  • Messengers can be lost
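  A minimal Python sketch (not from the slides; the loss probability and names are assumptions) simulating the naive one-messenger protocol under these rules, counting how often the armies attack without coordination:

    import random

    def messenger_delivers(loss_prob: float = 0.3) -> bool:
        """A messenger crossing enemy lines is lost with probability loss_prob."""
        return random.random() > loss_prob

    def naive_protocol() -> tuple[bool, bool]:
        """Blue commits when it sends the order; Red attacks only if the order arrives."""
        blue_attacks = True
        red_attacks = messenger_delivers()
        return blue_attacks, red_attacks

    outcomes = [naive_protocol() for _ in range(10_000)]
    uncoordinated = sum(b != r for b, r in outcomes)
    print(f"uncoordinated attacks: {uncoordinated} / {len(outcomes)}")

    # Acknowledgments only shift the problem: whoever sends the last messenger can
    # never learn whether it arrived, so no finite protocol over a lossy channel
    # guarantees that both armies attack together.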

  30. Distributed Database Challenges: Heterogeneity
  [Diagram: as on slide 11, a Portfolio application accesses a stock ticker tape, flat files, and an RDBMS holding the history of dividends, ratios, ...]

  31. Distributed Database Challenges: Autonomy
  • Example: unable to get statistics for query optimization
  • Example: the blue general may have a mind of his (or her) own!

  32. Next Topic • Distributed DB Design

  33. Distributed DB Design
  Top-down approach:
   - we already have the DB; decide how to split it and allocate the fragments to the sites
  Bottom-up approach:
   - multi-database (possibly heterogeneous, autonomous)
   - the databases already exist, so there are no design issues!

  34. Two issues in DDB design:
  • Fragmentation
  • Allocation
  Note: the issues are not independent, but we will cover them separately

  35. Motivation
  Two sites: Sa and Sb. Employee relation E(#, name, loc, sal, …).
  Qa runs at Sa; Qb runs at Sb.
  40% of queries are Qa: select * from E where loc = Sa and …
  40% of queries are Qb: select * from E where loc = Sb and …

  36. The fragmentation F:
  E:
    #  NM     Loc  Sal
    5  Joe    Sa   10
    7  Sally  Sb   25
    8  Tom    Sa   15
    ..
  At Sa:
    5  Joe    Sa   10
    8  Tom    Sa   15
  At Sb:
    7  Sally  Sb   25
    ..

  37. F = { F1, F2 }
  F1 = σloc=Sa(E)
  F2 = σloc=Sb(E)
  This is called primary horizontal fragmentation
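  A minimal Python sketch (the tuple values are made up for illustration) of this primary horizontal fragmentation, expressed as relational selections on E's own attribute loc:

    # E(#, name, loc, sal) as a list of tuples; values are illustrative only.
    E = [
        (5, "Joe",   "Sa", 10),
        (7, "Sally", "Sb", 25),
        (8, "Tom",   "Sa", 15),
    ]

    def sigma(pred, relation):
        """Relational selection: keep only the tuples satisfying pred."""
        return [t for t in relation if pred(t)]

    F1 = sigma(lambda t: t[2] == "Sa", E)   # F1 = σloc=Sa(E), allocated to site Sa
    F2 = sigma(lambda t: t[2] == "Sb", E)   # F2 = σloc=Sb(E), allocated to site Sb

    assert sorted(F1 + F2) == sorted(E)     # reconstruction: E = F1 ∪ F2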

  38. Fragmentation
  • Horizontal (split R by rows)
    • Primary: depends on local attributes of R
    • Derived: depends on a foreign relation
  • Vertical (split R by columns)

  39. Three common horizontal fragmentation techniques
  • Round robin (used mostly in parallel DBs)
  • Hash partitioning (used mostly in parallel DBs)
  • Range partitioning (used in both parallel and distributed DBs)

  40. Round robin
  [Diagram: tuples t1, t2, t3, t4, t5, ... of R are assigned to disks D0, D1, D2 in rotation.]
  • Evenly distributes data
  • Good for scanning the full relation
  • Not good for point or range queries
  • Not suitable for databases distributed over a WAN
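  A minimal Python sketch (illustrative only) of round-robin partitioning: the i-th tuple goes to node i mod n, which balances the data evenly but gives no way to locate a tuple later without reading every fragment.

    def round_robin(tuples, n_nodes):
        """Assign the i-th tuple of R to fragment i mod n_nodes."""
        fragments = [[] for _ in range(n_nodes)]
        for i, t in enumerate(tuples):
            fragments[i % n_nodes].append(t)
        return fragments

    R = ["t1", "t2", "t3", "t4", "t5"]
    print(round_robin(R, 3))   # [['t1', 't4'], ['t2', 't5'], ['t3']]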

  41. Hash partitioning
  [Diagram: tuple ti of R goes to disk Dh(ki), e.g., h(k1)=2 sends t1 to D2, h(k2)=0 sends t2 to D0, h(k3)=0 sends t3 to D0, h(k4)=1 sends t4 to D1.]
  • Good for point queries on the key; also for joins on the key
  • Not good for range queries, or for point queries not on the key
  • If the hash function is good, the distribution is even
  • Not suitable for databases distributed over a WAN
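  A minimal Python sketch (illustrative only) of hash partitioning: a tuple goes to node h(key) mod n, so a point query or join on the key touches one node, while a range query on the key must still visit all nodes.

    def hash_partition(tuples, key, n_nodes):
        """Assign tuple t to fragment hash(key(t)) mod n_nodes."""
        fragments = [[] for _ in range(n_nodes)]
        for t in tuples:
            fragments[hash(key(t)) % n_nodes].append(t)
        return fragments

    E = [(5, "Joe", "Sa", 10), (7, "Sally", "Sb", 25), (8, "Tom", "Sa", 15)]
    print(hash_partition(E, key=lambda t: t[0], n_nodes=3))   # partition on employee #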

  42. Range partitioning
  [Diagram: partitioning vector V0 = 4, V1 = 7; t3 (A=2) and t4 (A=3) go to D0, t1 (A=5) goes to D1, t2 (A=8) goes to D2.]
  • Good for point queries on A; also for joins on A
  • Good for some range queries on A
  • Need to select a good vector, else fragments are unbalanced
    • data skew, execution skew
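  A minimal Python sketch (illustrative only) of range partitioning with the vector [4, 7] from the slide: A ≤ 4 goes to D0, 4 < A ≤ 7 to D1, and A > 7 to D2; a badly chosen vector causes data skew and execution skew.

    import bisect

    def range_partition(tuples, attr, vector):
        """Assign each tuple to a fragment based on where attr(t) falls in the vector."""
        fragments = [[] for _ in range(len(vector) + 1)]
        for t in tuples:
            fragments[bisect.bisect_left(vector, attr(t))].append(t)
        return fragments

    R = [("t1", 5), ("t2", 8), ("t3", 2), ("t4", 3)]        # (tuple id, value of A)
    print(range_partition(R, attr=lambda t: t[1], vector=[4, 7]))
    # [[('t3', 2), ('t4', 3)], [('t1', 5)], [('t2', 8)]]  -> D0, D1, D2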

  43. Which are good fragmentations?
  Example: F = { F1, F2 }
  F1 = σsal<10(E)
  F2 = σsal>20(E)
  Problem: some tuples are lost (those with 10 ≤ sal ≤ 20)!

  44. Which are good fragmentations?
  Second example: F = { F3, F4 }
  F3 = σsal<10(E)
  F4 = σsal>5(E)
  Problem: tuples with 5 < sal < 10 are duplicated...

  45. Better design
  Example: F = { F5, F6, F7 }
  F5 = σsal≤5(E)
  F6 = σ5<sal<10(E)
  F7 = σsal≥10(E)
  Then replicate F6 if convenient (part of the allocation problem)

  46. Desired properties for fragmentation
  R is fragmented into F = {F1, F2, …, Fn}
  • Completeness
    • For every data item x ∈ R, ∃ Fi ∈ F such that x ∈ Fi
  • Disjointness
    • x ∈ Fi ⟹ ¬∃ Fj such that x ∈ Fj, i ≠ j
  • Reconstruction
    • There is a function g such that R = g(F1, F2, …, Fn)

  47. Fi FiF Desired properties for horizontal fragmentation R F = {F1, F2, …, Fn} • Completeness • For every tuple tR,  FiFsuch that tFi • Disjointness • tFi,  Fj such that tFj, i  j • Reconstruction – can safely ignore • Completeness  R = Notes 08

  48. How do we get completeness and disjointness?
  (1) Check it “manually”!
  e.g., F1 = σsal<10(E); F2 = σsal≥10(E)

  49. How do we get completeness and disjointness?
  (2) “Automatically” generate fragments with these properties
  • Horizontal fragments are defined by selection predicates
  • Generate a set of selection predicates with the desired properties

  50. Example of generation
  • Say queries use the predicates: A<10, A>5, Loc = SA, Loc = SB
  • Next:
    - generate “minterm” predicates
    - eliminate useless ones
  • Given simple predicates Pr = { p1, p2, …, pn },
    minterm predicates are of the form
    p1* ∧ p2* ∧ … ∧ pn*
    where pk* is pk or ¬pk
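  A minimal Python sketch (illustrative only) that generates the 2^n minterm predicates over these simple predicates; the "useless" ones, e.g. any minterm containing both Loc=SA and Loc=SB, would then be eliminated as unsatisfiable.

    from itertools import product

    simple = ["A<10", "A>5", "Loc=SA", "Loc=SB"]

    def minterms(predicates):
        """Yield every conjunction p1* AND ... AND pn*, where pk* is pk or NOT pk."""
        for signs in product([True, False], repeat=len(predicates)):
            yield " AND ".join(p if keep else f"NOT({p})"
                               for p, keep in zip(predicates, signs))

    for m in minterms(simple):
        print(m)                     # 16 candidates; contradictory ones get pruned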
