
ITCS 6163


Presentation Transcript


  1. ITCS 6163 View Maintenance

  2. Implementing a Warehouse • Monitoring: Sending data from sources • Integrating: Loading, cleansing,... • Processing: Query processing, indexing, ... • Managing: Metadata, Design, ...

  3. Monitoring • Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … • Incremental vs. Refresh

  4. Monitoring Techniques • Periodic snapshots • Database triggers • Log shipping • Data shipping (replication service) • Transaction shipping • Polling (queries to source) • Application level monitoring → Advantages & Disadvantages!

  5. Monitoring Issues • Frequency • periodic: daily, weekly, … • triggered: on “big” change, lots of changes, ... • Data transformation • convert data to uniform format • remove & add fields (e.g., add date to get history) • Standards (e.g., ODBC) • Gateways

  6. Monitoring Products • Gateways: Info Builders EDA/SQL, Oracle Open Connect, Informix Enterprise Gateway, … • Data Shipping: Oracle Replication Server, Praxis OmniReplicator, … • Transaction Shipping: Sybase Replication Server, Microsoft SQL Server • Extraction: Aonix, ETI, CrossAccess, DBStar • Monitoring/Integration products later on

  7. Integration [Architecture diagram: Clients → Query & Analysis → Warehouse (with Metadata) → Integration → Sources] • Data Cleaning • Data Loading • Derived Data

  8. Change detection • Detect & send changes to integrator • Different classes of sources • Cooperative • Queryable • Logged • Snapshot/dump

  9. Data transformation • Convert data to uniform format • Byte ordering, string termination • Internal layout • Remove, add, & reorder attributes • Add (regeneratable) key • Add date to get history

  10. Data transformation (2) • Sort tuples • May use external utilities • Can be much faster (10x) than SQL engine • E.g., Perl script to reorder attributes
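The external-script idea can be sketched in Python in place of the Perl script the slide mentions; the column names and target order here are illustrative assumptions:

```python
def reorder_and_sort(rows, src_order, dst_order, sort_key):
    """Reorder each tuple's attributes to dst_order, then sort by sort_key.

    Runs outside the SQL engine, like the external utilities the slide
    mentions; for large flat files such a pipeline can beat the DBMS.
    """
    idx = [src_order.index(col) for col in dst_order]
    out = [tuple(row[i] for i in idx) for row in rows]
    out.sort(key=lambda r: r[dst_order.index(sort_key)])
    return out

# Hypothetical source layout (date, item, price) -> warehouse layout (item, price, date)
rows = [("2024-01-02", "hat", "12.00"), ("2024-01-01", "bag", "30.00")]
clean = reorder_and_sort(rows, ["date", "item", "price"],
                         ["item", "price", "date"], sort_key="item")
# clean == [("bag", "30.00", "2024-01-01"), ("hat", "12.00", "2024-01-02")]
```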

  11. External functions (EFs) • Special transformation functions • E.g., Yen_to_dollars • User defined • Specified in warehouse table definition • Aid in integration • Must be applied to updates, too
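A sketch of how an EF such as Yen_to_dollars might be declared per column and applied uniformly to the initial load and to later updates; the exchange rate, record layout, and the apply_efs helper are assumptions, not any product's API:

```python
RATE = 0.0067  # assumed JPY -> USD rate, for illustration only

def yen_to_dollars(yen):
    """User-defined transformation function named in the slide."""
    return round(yen * RATE, 2)

def apply_efs(record, efs):
    """Apply the EFs declared for a warehouse table to one record."""
    return {col: efs.get(col, lambda v: v)(val) for col, val in record.items()}

efs = {"price": yen_to_dollars}          # as if specified in the table definition
loaded = apply_efs({"item": "hat", "price": 1800}, efs)  # initial load
update = apply_efs({"item": "bag", "price": 4500}, efs)  # updates transformed too
# loaded == {"item": "hat", "price": 12.06}
```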

  12. Data integration • Rules for matching data from different sources • Build composite view of data • Eliminate duplicate, unneeded attributes

  13. Data Cleaning [Diagram: billing DB customer1(Joe) and service DB customer2(Joe) are fused into merged_customer(Joe)] • Migration (e.g., yen → dollars) • Scrubbing: use domain-specific knowledge (e.g., social security numbers) • Fusion (e.g., mail list, customer merging) • Auditing: discover rules & relationships (like data mining)

  14. Data cleansing • Find (& remove) duplicate tuples • E.g., Jane Doe & Jane Q. Doe • Detect inconsistent, wrong data • Attributes that don’t match • E.g., city, state and zipcode • Patch missing, unreadable data • Want to “backflush” clean data • Notify sources of errors found
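The Jane Doe / Jane Q. Doe case can be caught with a crude normalization key; a minimal sketch (the normalization rules are illustrative; real scrubbing tools use much richer domain knowledge):

```python
def normalize(name):
    """Build a match key: lowercase, strip punctuation, drop initials."""
    parts = [p.strip(".,") for p in name.lower().split()]
    return " ".join(p for p in parts if len(p) > 1)

def find_duplicates(names):
    """Report pairs of records that collide on the match key."""
    seen, dups = {}, []
    for n in names:
        key = normalize(n)
        if key in seen:
            dups.append((seen[key], n))
        else:
            seen[key] = n
    return dups

dups = find_duplicates(["Jane Doe", "Jane Q. Doe", "John Smith"])
# dups == [("Jane Doe", "Jane Q. Doe")]
```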

  15. Loading Data • Incremental vs. refresh • Off-line vs. on-line • Frequency of loading • At night, 1x a week/month, continuously • Parallel/Partitioned load

  16. Derived Data • Derived Warehouse Data • indexes • aggregates • materialized views (next slide) • When to update derived data? • Incremental vs. refresh

  17. The “everything is a view” view • Pure programs: e.g., “canned queries.” Always the same cost. No data is materialized. (DBMSs) • Derived data: Materialized views. Data always there but must be updated. (Good for warehouses.) • Pure data: Snapshot. Procedure is thrown away! Not maintainable. • Approximate: Snapshot + refresh procedure applied under some conditions (quasi-copies). Approximate models, e.g., statistical (quasi-cubes).

  18. Materialized Views • Define new warehouse relations using SQL expressions • The materialized view does not exist at any source
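A minimal illustration of a warehouse relation defined by an SQL expression and then materialized, using SQLite in place of real source systems (the table layouts are assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE catalog(price INT, item TEXT);   -- stand-ins for source tables
CREATE TABLE sale(item TEXT, clerk TEXT);
INSERT INTO catalog VALUES (12, 'hat');
INSERT INTO sale VALUES ('hat', 'Sue');
-- The warehouse relation is an SQL expression over the sources; materializing
-- it copies the result into a real table that exists only at the warehouse.
CREATE TABLE sold AS
  SELECT c.price, s.item, s.clerk FROM catalog c JOIN sale s USING (item);
""")
rows = con.execute("SELECT price, item, clerk FROM sold").fetchall()
# rows == [(12, 'hat', 'Sue')]
```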

  19. Integration Products • Monitoring & Integration: Apertus, Informatica, Prism, Sagent, … • Merging: DataJoiner, SAS, … • Cleaning: Trillium, ... • Typically take warehouse off-line • Typically refresh • or simple incremental: e.g., Red Brick Table Management Utility, Prism

  20. Managing [Architecture diagram: Clients → Query & Analysis → Warehouse (with Metadata) → Integration → Sources] • Metadata • Warehouse Design • Tools

  21. Metadata • Administrative • definition of sources, tools, ... • schemas, dimension hierarchies, … • rules for extraction, cleaning, … • refresh, purging policies • user profiles, access control, ...

  22. Metadata • Business • business terms & definition • data ownership, charging • Operational • data lineage • data currency (e.g., active, archived, purged) • use stats, error reports, audit trails

  23. Tools • Development • design & edit: schemas, views, scripts, rules, queries, reports • Planning & Analysis • what-if scenarios (schema changes, refresh rates), capacity planning • Warehouse Management • performance monitoring, usage patterns, exception reporting • System & Network Management • measure traffic (sources, warehouse, clients) • Workflow Management • “reliable scripts” for cleaning & analyzing data

  24. Tools - Products • Management Tools • HP Intelligent Warehouse Advisor, IBM Data Hub, Prism Warehouse Manager • System & Network Management • HP OpenView, IBM NetView, Tivoli

  25. Current State of Industry • Extraction and integration done off-line • Usually in large, time-consuming, batches • Everything copied at warehouse • Not selective about what is stored • Query benefit vs storage & update cost • Query optimization aimed at OLTP • High throughput instead of fast response • Process whole query before displaying anything

  26. Future Directions • Better performance • Larger warehouses • Easier to use • What are companies & research labs working on?

  27. Research (1) • Incremental Maintenance • Data Consistency • Data Expiration • Recovery • Data Quality • Error Handling (Back Flush)

  28. Research (2) • Rapid Monitor Construction • Temporal Warehouses • Materialization & Index Selection • Data Fusion • Data Mining • Integration of Text & Relational Data

  29. Make warehouse self-maintainable [Diagram: Sales, Items sold, and Catalog tables] • Add auxiliary tables to minimize update cost • Original + auxiliary are self-maintainable • E.g., auxiliary table of all unsold catalog items • Some updates may still be self-maintainable • E.g., insert into catalog if item (the join attribute) is a key
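One way the auxiliary unsold-items table helps can be sketched in a few lines (the table layouts are assumptions): because item is a key of catalog, an insert into sale finds its join partner locally, without querying the source:

```python
# Warehouse state: the view Sold = catalog ⋈ sale, plus an auxiliary table
# of catalog items that have no sale yet.
sold = {}                         # item -> (price, clerk)
unsold = {"hat": 12, "bag": 30}   # auxiliary: unsold catalog items

def insert_sale(item, clerk):
    """Maintain Sold without contacting the source: an unsold
    catalog item supplies the join partner for the new sale."""
    if item in unsold:
        sold[item] = (unsold.pop(item), clerk)

insert_sale("hat", "Sue")
# sold == {"hat": (12, "Sue")}; unsold == {"bag": 30}
```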

  30. Detection of self-maintainability • Most algorithms are at table level • Most algorithms are compile-time • Tuple level at runtime [Huyn 1996, 1997] • Use state of tables and update to determine if self-maintainable • E.g., check whether sale is for item previously sold

  31. Warehouse maintenance • Current systems ignore integration of new data • Or assume warehouse can be rebuilt periodically • Depend on long “downtime” to regenerate warehouse • Technology gap: continuous incremental maintenance

  32. Maintenance research • Change detection • Data consistency • Single table consistency • Multiple table consistency • Expiration of data • Crash recovery

  33. Snapshot change detection • Compare old & new snapshots • Join-based algorithms • Hash old data, probe with new • Window algorithm • Sliding window over snapshots • Good for local changes
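The hash-the-old, probe-with-the-new idea maps directly onto a dictionary; a minimal sketch (the tuple layout and key choice are assumptions):

```python
def snapshot_diff(old, new, key=lambda t: t[0]):
    """Join-based change detection: hash the old snapshot, probe with the new."""
    old_by_key = {key(t): t for t in old}
    inserts, updates = [], []
    for t in new:
        prev = old_by_key.pop(key(t), None)
        if prev is None:
            inserts.append(t)          # key not in old snapshot
        elif prev != t:
            updates.append((prev, t))  # key present but tuple changed
    deletes = list(old_by_key.values())  # keys never probed were deleted
    return inserts, deletes, updates

ins, dels, upd = snapshot_diff([("hat", 12), ("bag", 30)],
                               [("hat", 15), ("cap", 9)])
# ins == [("cap", 9)]; dels == [("bag", 30)]; upd == [(("hat", 12), ("hat", 15))]
```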

  34. Integrated data consistency • Conventional maintenance inadequate • Sources report changes but: • No locking, no global transactions (sources don't communicate or coordinate with each other) • Inconsistencies caused by interleaving of updates

  35. Example anomaly • View Sold(price, item, clerk, age) = catalog ⋈ sale ⋈ emp • Source state: catalog(price, item) = {[$12, hat]}, sale(item, clerk) = {}, emp(clerk, age) = {[Sue, 26]} • insert into sale [hat, Sue] • delete from catalog [$12, hat]

  36. Anomaly (2) • insert into sale [hat, Sue] reaches the DW, which sends Q1 = catalog ⋈ [hat, Sue] • A(Q1) = [$12, hat, Sue], so the DW sends Q2 = [$12, hat, Sue] ⋈ emp • Meanwhile, delete from catalog [$12, hat] occurs at the source • A(Q2) = [$12, hat, Sue, 26], which is installed in Sold • The delete of catalog [$12, hat] is effectively ignored: the stale tuple [$12, hat, Sue, 26] remains

  37. Choices to deal with anomalies • Keep all relations in the DW (storage-expensive!) • Run all queries as distributed queries (may not be feasible with legacy systems, and performance is poor) • Use specialized algorithms, e.g., the Eager Compensating Algorithm (ECA), STROBE.

  38. Another anomaly example • View V = π_{price, clerk}(catalog ⋈ sale), with catalog(price, item) = {[$12, hat]} and sale(item, clerk) = {[hat, Sue]}, so V = {[$12, Sue]} • Delete(catalog [$12, hat]): Q1 = π_{p,c}([$12, hat] ⋈ sale) • Delete(sale [hat, Sue]) occurs before Q1 is answered • A(Q1) = ∅ and A(Q2) = ∅, so V −= A(Q1) and V −= A(Q2) leave V = {[$12, Sue]} • WRONG! Both base tuples are gone, but the view tuple survives
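The interleaving in this anomaly can be replayed in a few lines; a toy simulation of uncoordinated maintenance queries, not any system's actual protocol:

```python
# View V = pi_{price, clerk}(catalog ⋈ sale), maintained by querying the
# *current* source state with no locking or global transactions.
catalog = [(12, "hat")]        # (price, item)
sale = [("hat", "Sue")]        # (item, clerk)
V = {(12, "Sue")}

def delta_for_catalog_delete(price, item):
    return {(price, c) for (i, c) in sale if i == item}

def delta_for_sale_delete(item, clerk):
    return {(p, clerk) for (p, i) in catalog if i == item}

catalog.remove((12, "hat"))    # first delete reaches the source
sale.remove(("hat", "Sue"))    # second delete interleaves before any query runs
V -= delta_for_catalog_delete(12, "hat")  # empty: sale tuple already gone
V -= delta_for_sale_delete("hat", "Sue")  # empty: catalog tuple already gone
# V still contains (12, "Sue") although both base tuples are deleted
```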

  39. Yet another anomaly example • View Depts = π_Dept(catalog ⋈ Store), with catalog(Dept, City) = {[Shoes, NY]} • Insert(Store [NY, Madison Ave]): Q1 = π_Dept(catalog ⋈ [NY, Madison Ave]) • Insert(catalog [Bags, NY]): Q2 = π_Dept([Bags, NY] ⋈ Store) • A(Q1) = {[Shoes], [Bags]} (the concurrent catalog insert is visible to Q1) • A(Q2) = {[Bags]}, so [Bags] is reported twice

  40. Eager Compensating Algorithm (ECA) • Principle: send compensating queries to offset the effect of concurrent updates • ONLY GOOD IF ALL THE SOURCE RELATIONS ARE STORED AT ONE NODE (ONE SOURCE)

  41. Anomaly example revisited (ECA) • View Depts = π_Dept(catalog ⋈ Store), with catalog(Dept, City) = {[Shoes, NY]} • Insert(Store [NY, Madison Ave]): Q1 = π_Dept(catalog ⋈ [NY, Madison Ave]) • Insert(catalog [Bags, NY]): compensated query Q2 = π_Dept([Bags, NY] ⋈ Store) − π_Dept([Bags, NY] ⋈ [NY, Madison Ave]) • A(Q1) = {[Shoes], [Bags]}; A(Q2) = ∅ • The compensation cancels the effect of the concurrent insert: no duplicate [Bags]

  42. ECA Algorithm (ssi = current source state; UQS = unanswered query set)
  SOURCE:
    S_upi: Execute Ui; send Ui to DW; trigger W_upi at DW
    S_qui: Receive Qi; let Ai = Qi(ssi); send Ai to DW; trigger W_ansi at DW
  DATA WAREHOUSE (DW):
    W_upi: Receive Ui; Qi = V(Ui) − Σ_{Qj ∈ UQS} Qj(Ui); UQS = UQS + {Qi}; send Qi to S; trigger S_qui at S
    W_ansi: Receive Ai; COL = COL + Ai; UQS = UQS − {Qi}; if UQS = ∅ then MV = MV + COL and COL = ∅

  43. ECA-key Avoids the need for compensating queries. Necessary condition: the view contains key attributes for each of the base tables (e.g., star schema)

  44. Example of ECA-key • View Sells(Item, Clerk) over catalog(Item, Dept) and emp(Dept, Clerk); initially catalog = {[hat, acc]}, emp = {[acc, Sue]} • Interleaved updates: Insert(catalog [bags, acc]), Delete(catalog [hat, acc]), Insert(emp [acc, Jane]) • Each update's query is sent without compensation; UQS grows to {Q1, Q2} and drains to ∅ as answers arrive • Key-based deletion filters stale tuples out of answers, and COL accumulates {[bags, Jane], [bags, Sue]} • Final Sells = {[bags, Sue], [bags, Jane]}

  45. Strobe algorithm ideas • Apply actions only after a set of interleaving updates are all processed • Wait for sources to quiesce • Compensate effects of interleaved updates • Subtract effects of later updates before installing changes • Can combine these ideas STROBE IS A FAMILY OF ALGORITHMS

  46. Strobe Terminology • The materialized view MV is the current state of the view at the warehouse, V(ws) • Given a query Q that needs to be evaluated, the function next_source(Q) returns the pair (x, Qi), where x is the next source to contact and Qi the portion of the query that can be answered by x • Example: if V = r1 ⋈ r2 ⋈ r3, and U is an update received from r2, then Q = (r1 ⋈ U ⋈ r3) and next_source(Q) = (r1, r1 ⋈ U)

  47. Strobe terminology (2)
  Source_evaluate(Q): /* returns the answer to Q */
  Begin
    i = 0; WQ = Q; A0 = ∅;
    (x, Q1) ← next_source(WQ);
    While x is not nil do
      Let i = i + 1;
      Send Qi to source x;
      When x returns Ai, let WQ = WQ⟨Ai⟩;
      Let (x, Qi+1) ← next_source(WQ);
    Return(Ai);
  End

  48. Strobe Algorithm
  SOURCE:
    After executing Ui, send Ui to DW
    When receiving Qi, compute Ai over ss[x] and send Ai to DW
  DATA WAREHOUSE (DW):
    Initially AL = ∅
    When Ui is received:
      if Ui is a deletion: ∀ Qj ∈ UQS, add Ui to pend(Qj), and add key_del(MV, Ui) to AL
      if Ui is an insertion: Qi = V⟨Ui⟩; pend(Qi) = ∅; Ai = source_evaluate(Qi); ∀ Uj ∈ pend(Qi), apply key_del(Ai, Uj); add insert(MV, Ai) to AL
    When UQS = ∅, apply AL to MV as a single transaction, without adding duplicate tuples to MV; reset AL
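A minimal sketch of the key_del step used above, assuming view tuples carry each base table's key attributes (the dict layout and attribute names are illustrative):

```python
def key_del(answer, deleted, key_attrs):
    """Drop from a query answer every tuple that agrees with the
    deleted base tuple on that table's key attributes."""
    return [t for t in answer
            if any(t[a] != deleted[a] for a in key_attrs)]

# Replaying part of the Strobe example: A12 joined with a catalog tuple
# that a pending deletion (U2) has since removed; item is catalog's key.
A12 = [{"price": 12, "item": "hat", "clerk": "Sue", "age": 26}]
A2 = key_del(A12, {"price": 12, "item": "hat"}, key_attrs=["item"])
# A2 == []
```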

  49. Example with Strobe
  View Sold = catalog ⋈ sale ⋈ emp, with catalog(price, item) = {[$12, hat]} and emp(clerk, age) = {[Sue, 26]}
  U1 = Insert(sale, [hat, Sue]): Q1 = catalog ⋈ [hat, Sue] ⋈ emp; pend(Q1) = ∅
  Q11 = catalog ⋈ [hat, Sue] is sent; A11 = [$12, hat, Sue]
  U2 = Del(catalog, [$12, hat]) arrives: pend(Q1) = {U2}; AL = {key_del(MV, U2)}
  Q12 = [$12, hat, Sue] ⋈ emp is sent; A12 = [$12, hat, Sue, 26]
  Apply key_del(A12, U2): the answer becomes ∅, so nothing is added to AL
  UQS = ∅: apply AL; MV = ∅

  50. Transaction-Strobe
  Transaction T1 = {delete(sale, [hat, Sue]), insert(sale, [shoes, Jane])}; sale goes from {[hat, Sue]} to {[shoes, Jane]}
  Processing the delete: AL = {del([hat, Sue])}
  Processing the insert: AL = {del([hat, Sue]), ins([shoes, Jane])}
  Applying AL as one transaction: MV goes from {[hat, Sue]} to {[shoes, Jane]}
