
ITCS 6163

View Maintenance


Implementing a Warehouse

  • Monitoring: Sending data from sources

  • Integrating: Loading, cleansing,...

  • Processing: Query processing, indexing, ...

  • Managing: Metadata, Design, ...


Monitoring

new

Monitoring

  • Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, …

  • Incremental vs. Refresh
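The incremental-vs-refresh distinction can be sketched with a toy keyed table (the schema and change format are illustrative assumptions, not from the slides):

```python
# Toy contrast between refresh (recompute everything) and incremental
# maintenance (apply only the reported changes). A table is a dict
# keyed by primary key.

def refresh(warehouse, source):
    """Refresh: throw the warehouse copy away and recopy the source."""
    warehouse.clear()
    warehouse.update(source)

def incremental(warehouse, inserts, deletes):
    """Incremental: apply only the changes the monitor reported."""
    for key in deletes:
        warehouse.pop(key, None)
    warehouse.update(inserts)

source = {1: "hat", 2: "bag"}
wh = {}
refresh(wh, source)                               # full copy of the source
incremental(wh, inserts={3: "shoe"}, deletes=[2]) # apply a small delta
print(wh)                                         # {1: 'hat', 3: 'shoe'}
```

Refresh is simple but rereads everything; incremental maintenance is the focus of the rest of the lecture.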


Monitoring Techniques

  • Periodic snapshots

  • Database triggers

  • Log shipping

  • Data shipping (replication service)

  • Transaction shipping

  • Polling (queries to source)

  • Application level monitoring

⇒ Advantages & Disadvantages!!


Monitoring Issues

  • Frequency

    • periodic: daily, weekly, …

    • triggered: on “big” change, lots of changes, ...

  • Data transformation

    • convert data to uniform format

    • remove & add fields (e.g., add date to get history)

  • Standards (e.g., ODBC)

  • Gateways


Monitoring Products

  • Gateways: Info Builders EDA/SQL, Oracle Open Connect, Informix Enterprise Gateway, …

  • Data Shipping: Oracle Replication Server, Praxis OmniReplicator, …

  • Transaction Shipping: Sybase Replication Server, Microsoft SQL Server

  • Extraction: Aonix, ETI, CrossAccess, DBStar

  • Monitoring/Integration products later on


Integration

[Architecture diagram: Source ×3 → Integration → Warehouse (with Metadata) → Query & Analysis → Client ×2]

  • Data Cleaning

  • Data Loading

  • Derived Data

Change detection

  • Detect & send changes to integrator

  • Different classes of sources

    • Cooperative

    • Queryable

    • Logged

    • Snapshot/dump


Data transformation

  • Convert data to uniform format

    • Byte ordering, string termination

    • Internal layout

  • Remove, add, & reorder attributes

    • Add (regeneratable) key

    • Add date to get history


Data transformation (2)

  • Sort tuples

  • May use external utilities

    • Can be much faster (10x) than SQL engine

    • E.g., perl script to reorder attributes
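The slide mentions a perl script for reordering attributes; the same kind of transformation can be sketched in a few lines of Python (the source and target layouts below are assumed for illustration):

```python
from datetime import date

# Reorder attributes and add a load date to each tuple, then sort.
# Assumed layouts: source rows (item, clerk, price) become
# warehouse rows (price, item, clerk, load_date).
def transform(row):
    item, clerk, price = row
    return (price, item, clerk, date.today().isoformat())

rows = [("hat", "Sue", 12), ("bag", "Jane", 30)]
warehouse_rows = sorted(transform(r) for r in rows)  # sort tuples, as the slide suggests
```

An external utility or script like this can bypass the SQL engine entirely, which is where the ~10x speedup comes from.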


External functions (EFs)

  • Special transformation functions

    • E.g., Yen_to_dollars

  • User defined

  • Specified in warehouse table definition

  • Aid in integration

  • Must be applied to updates, too
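A minimal sketch of an EF like Yen_to_dollars, applied on both the load path and the update path (the exchange rate and schemas are stand-in values, not from the slides):

```python
# External function (EF) sketch: a user-defined conversion that must
# run both when loading and when propagating updates.
YEN_PER_DOLLAR = 150.0          # assumed rate, for illustration only

def yen_to_dollars(amount_yen):
    return round(amount_yen / YEN_PER_DOLLAR, 2)

def load_row(row):              # initial load path
    item, price_yen = row
    return (item, yen_to_dollars(price_yen))

def apply_update(table, row):   # the same EF must be applied to updates, too
    item, price_yen = row
    table[item] = yen_to_dollars(price_yen)

table = dict(load_row(r) for r in [("hat", 1800.0)])
apply_update(table, ("hat", 3000.0))
print(table)                    # {'hat': 20.0}
```

If the EF were applied only at load time, updates would silently reintroduce unconverted values into the warehouse.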


Data integration

  • Rules for matching data from different sources

  • Build composite view of data

  • Eliminate duplicate, unneeded attributes


Data Cleaning

[Diagram: customer1(Joe) from the billing DB and customer2(Joe) from the service DB are fused into merged_customer(Joe)]

  • Migration (e.g., yen → dollars)

  • Scrubbing: use domain-specific knowledge (e.g., social security numbers)

  • Fusion (e.g., mail list, customer merging)

  • Auditing: discover rules & relationships (like data mining)


Data cleansing

  • Find (& remove) duplicate tuples

    • E.g., Jane Doe & Jane Q. Doe

  • Detect inconsistent, wrong data

    • Attributes that don’t match

    • E.g., city, state and zipcode

  • Patch missing, unreadable data

  • Want to “backflush” clean data

    • Notify sources of errors found
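Duplicate detection like the Jane Doe / Jane Q. Doe case can be approximated by normalizing names before comparing; a naive sketch (the normalization rule is an assumption, real scrubbers use richer domain knowledge):

```python
import re

# Naive duplicate detection: lowercase, strip punctuation, and keep
# only first and last name tokens, so "Jane Doe" and "Jane Q. Doe"
# collide on the same key.
def normalize(name):
    parts = re.sub(r"\.", "", name).lower().split()
    return parts[0] + " " + parts[-1]      # drop middle tokens

seen, dupes = {}, []
for t in ["Jane Doe", "Jane Q. Doe", "John Smith"]:
    key = normalize(t)
    if key in seen:
        dupes.append((seen[key], t))       # candidate duplicate pair
    else:
        seen[key] = t
print(dupes)    # [('Jane Doe', 'Jane Q. Doe')]
```

The detected pairs are exactly what a "backflush" step would report back to the sources.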


Loading Data

  • Incremental vs. refresh

  • Off-line vs. on-line

  • Frequency of loading

    • At night, 1x a week/month, continuously

  • Parallel/Partitioned load


Derived Data

  • Derived Warehouse Data

    • indexes

    • aggregates

    • materialized views (next slide)

  • When to update derived data?

  • Incremental vs. refresh


The “everything is a view” view

  • Pure programs: e.g., “canned queries.” Always the same cost. No data is materialized. (DBMSs)

  • Derived data: Materialized views. Data always there but must be updated. (Good for warehouses.)

  • Pure data: Snapshot. Procedure is thrown away! Not maintainable.

  • Approximate: Snapshot plus a refresh procedure applied under some conditions (quasi-copies); approximate models, e.g., statistical (quasi-cubes).


Materialized Views

[Diagram: a materialized view at the warehouse may contain data that does not exist at any single source]

  • Define new warehouse relations using SQL expressions


Integration Products

  • Monitoring & Integration: Apertus, Informatica, Prism, Sagent, …

  • Merging: DataJoiner, SAS,…

  • Cleaning: Trillium, ...

  • Typically take warehouse off-line

  • Typically refresh

    • or simple incremental: e.g., Red Brick Table Management Utility, Prism


Managing

[Architecture diagram: Source ×3 → Integration → Warehouse (with Metadata) → Query & Analysis → Client ×2]

  • Metadata

  • Warehouse Design

  • Tools


Metadata

  • Administrative

    • definition of sources, tools, ...

    • schemas, dimension hierarchies, …

    • rules for extraction, cleaning, …

    • refresh, purging policies

    • user profiles, access control, ...


Metadata

  • Business

    • business terms & definition

    • data ownership, charging

  • Operational

    • data lineage

    • data currency (e.g., active, archived, purged)

    • use stats, error reports, audit trails


Tools

  • Development

    • design & edit: schemas, views, scripts, rules, queries, reports

  • Planning & Analysis

    • what-if scenarios (schema changes, refresh rates), capacity planning

  • Warehouse Management

    • performance monitoring, usage patterns, exception reporting

  • System & Network Management

    • measure traffic (sources, warehouse, clients)

  • Workflow Management

    • “reliable scripts” for cleaning & analyzing data


Tools - Products

  • Management Tools

    • HP Intelligent Warehouse Advisor, IBM Data Hub, Prism Warehouse Manager

  • System & Network Management

    • HP OpenView, IBM NetView, Tivoli


Current State of Industry

  • Extraction and integration done off-line

    • Usually in large, time-consuming, batches

  • Everything copied at warehouse

    • Not selective about what is stored

    • Query benefit vs storage & update cost

  • Query optimization aimed at OLTP

    • High throughput instead of fast response

    • Process whole query before displaying anything


Future Directions

  • Better performance

  • Larger warehouses

  • Easier to use

  • What are companies & research labs working on?


Research (1)

  • Incremental Maintenance

  • Data Consistency

  • Data Expiration

  • Recovery

  • Data Quality

  • Error Handling (Back Flush)


Research (2)

  • Rapid Monitor Construction

  • Temporal Warehouses

  • Materialization & Index Selection

  • Data Fusion

  • Data Mining

  • Integration of Text & Relational Data


Make warehouse self-maintainable

[Diagram: warehouse view "Items sold" joins the Sales and Catalog tables]

  • Add auxiliary tables to minimize update cost

  • Original + auxiliary are self-maintainable

    • E.g., auxiliary table of all unsold catalog items

  • Some updates may still be self-maintainable

    • E.g., insert into catalog if item (the join attribute) is a key
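The auxiliary-table idea can be sketched concretely (schemas and names below are illustrative assumptions): keep an Unsold table of catalog items with no sales, so every catalog row an update could need is already at the warehouse.

```python
# Self-maintenance sketch: ItemsSold = Sales ⋈ Catalog (on item).
# Auxiliary table `unsold` holds catalog rows for items with no sales
# yet, so a new sale never requires a query back to the Catalog source.
items_sold = []                        # view tuples: (item, clerk, price)
unsold = {"hat": 12, "bag": 30}        # auxiliary: item -> catalog price

def insert_sale(item, clerk):
    if item in unsold:                 # catalog info held locally
        price = unsold.pop(item)       # item is no longer unsold
    else:                              # recover it from an existing view tuple
        price = next(p for (i, _, p) in items_sold if i == item)
    items_sold.append((item, clerk, price))

insert_sale("hat", "Sue")
insert_sale("hat", "Jane")
print(items_sold)   # [('hat', 'Sue', 12), ('hat', 'Jane', 12)]
print(unsold)       # {'bag': 30}
```

Together, the view plus the auxiliary table are self-maintainable for sale insertions: the update cost is paid locally instead of with a round trip to the source.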


Detection of self-maintainability

  • Most algorithms are at table level

  • Most algorithms are compile-time

  • Tuple level at runtime [Huyn 1996, 1997]

    • Use state of tables and update to determine if self-maintainable

    • E.g., check whether sale is for item previously sold
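A minimal illustration of a runtime, tuple-level test (the schema and decision rule are assumptions in the spirit of the slide's example, not Huyn's actual algorithm):

```python
# Runtime, tuple-level check: a sale insert is self-maintainable here
# iff the item already appears in the view, since its catalog
# attributes can then be copied locally instead of queried remotely.
def self_maintainable(view_tuples, sale_item):
    return any(item == sale_item for (item, _, _) in view_tuples)

view = [("hat", "Sue", 12)]            # (item, clerk, price)
print(self_maintainable(view, "hat"))  # True  -> maintain locally
print(self_maintainable(view, "bag"))  # False -> must query the source
```

Unlike compile-time, table-level tests, this decision depends on the current state of the tables and on the specific update.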


Warehouse maintenance

  • Current systems ignore integration of new data

    • Or assume warehouse can be rebuilt periodically

  • Depend on long “downtime” to regenerate warehouse

  • Technology gap: continuous incremental maintenance


Maintenance research

  • Change detection

  • Data consistency

    • Single table consistency

    • Multiple table consistency

  • Expiration of data

  • Crash recovery


Snapshot change detection

  • Compare old & new snapshots

  • Join-based algorithms

    • Hash old data, probe with new

  • Window algorithm

    • Sliding window over snapshots

    • Good for local changes
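The join-based approach can be sketched directly (keyed rows assumed; a real implementation would hash compressed row images rather than full values):

```python
# Join-based snapshot diff: hash the old snapshot by key, probe with
# the new snapshot, and classify rows as inserted, updated, or deleted.
def snapshot_diff(old_rows, new_rows):
    old = dict(old_rows)                  # build hash table on the old data
    inserts, updates = [], []
    for k, v in new_rows:                 # probe with the new data
        if k not in old:
            inserts.append((k, v))
        else:
            if old.pop(k) != v:           # same key, different value
                updates.append((k, v))
    deletes = list(old.items())           # keys never probed were deleted
    return inserts, deletes, updates

old = [(1, "hat"), (2, "bag")]
new = [(1, "hat"), (2, "cap"), (3, "shoe")]
print(snapshot_diff(old, new))
# ([(3, 'shoe')], [], [(2, 'cap')])
```

The window algorithm replaces the full hash table with a sliding window over both snapshots, which works well when changes are local (rows stay near their old positions).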


Integrated data consistency

  • Conventional maintenance inadequate

    • Sources report changes but:

    • No locking, no global transactions (sources don’t communicate or coordinate with each other)

  • Inconsistencies caused by interleaving of updates


Example anomaly

  • table Sold = catalog ⋈ sale ⋈ emp

  • insert into sale [hat, Sue]

  • delete from catalog [$12, hat]

[Diagram: Sold(price, item, clerk, age) over sale(item, clerk), catalog(price, item) = {[$12, hat]}, and emp(clerk, age) = {[Sue, 26]}]


Anomaly (2)

[Diagram: the same tables; the updates interleave with the maintenance queries as follows]

  • insert into sale [hat, Sue] arrives; the warehouse issues Q1 = catalog ⋈ [hat, Sue]

  • A(Q1) = [$12, hat, Sue]; the warehouse then issues Q2 = [$12, hat, Sue] ⋈ emp

  • delete from catalog [$12, hat] executes at the source, but its effect on the pending query is ignored

  • A(Q2) = [$12, hat, Sue, 26], so the stale tuple [$12, hat, Sue, 26] ends up in Sold


Choices to deal with anomalies

  • Keep all relations in the DW (storage-expensive!)

  • Run all queries as distributed (may not be feasible with legacy systems, and performance is poor)

  • Use specialized algorithms. E.g., Eager Compensation Algorithm (ECA), STROBE.


Another anomaly example

[Diagram: Sold = π_price,clerk(catalog ⋈ sale), with catalog(price, item) = {[$12, hat]} and sale(item, clerk) = {[hat, Sue]}, so V = {[$12, Sue]}]

  • Delete(catalog[$12, hat]) arrives; the warehouse issues Q1 = π_p,c([$12, hat] ⋈ sale)

  • Delete(sale[hat, Sue]) also executes at the source before Q1 is answered

  • A(Q1) = ∅ and A(Q2) = ∅, so nothing is deleted from the view

  • V stays {[$12, Sue]}: WRONG! (the correct view is empty)
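The double-deletion anomaly can be reproduced in a few lines (a direct simulation of the example, with the join spelled out as set comprehensions):

```python
# Simulating the interleaved-deletion anomaly: both deletes reach the
# source before either maintenance query runs, so both answers are
# computed over the post-delete state, both come back empty, and the
# stale view tuple is never removed.
catalog = {("$12", "hat")}               # (price, item)
sale = {("hat", "Sue")}                  # (item, clerk)
view = {("$12", "Sue")}                  # π_price,clerk(catalog ⋈ sale)

catalog.discard(("$12", "hat"))          # U1 executes at the source
sale.discard(("hat", "Sue"))             # U2 executes before Q1 is answered

# Q1 for U1: join the deleted catalog tuple with the CURRENT sale table
a1 = {("$12", c) for (i, c) in sale if i == "hat"}        # empty
# Q2 for U2: join the deleted sale tuple with the CURRENT catalog table
a2 = {(p, "Sue") for (p, i) in catalog if i == "hat"}     # empty
view -= a1 | a2
print(view)   # {('$12', 'Sue')} -- WRONG: the correct view is empty
```

With no locking and no global transactions, nothing prevents the source state from changing between the update notification and the query evaluation.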


Yet another anomaly example

[Diagram: Depts = π_Dept(catalog ⋈ Store), with catalog(Dept, City) = {[Shoes, NY]} and Store(City, Add) initially empty]

  • Insert(Store[NY, Madison Ave]); the warehouse issues Q1 = π_Dept(catalog ⋈ [NY, Madison Ave])

  • Insert(catalog[Bags, NY]); the warehouse issues Q2 = π_Dept([Bags, NY] ⋈ Store)

  • A(Q1) = [[Shoes], [Bags]] (the second insert is already visible at the source)

  • A(Q2) = [[Bags]], so Depts = [[Shoes], [Bags], [Bags]]: the tuple [Bags] is counted twice


Eager Compensating Algorithm (ECA)

Principle: send compensating queries to offset the effect of concurrent updates

ONLY GOOD IF ALL THE SOURCE RELATIONS ARE STORED IN ONE NODE (ONE SOURCE).
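The compensation idea can be sketched on the duplicate-insert anomaly from the Depts example (the compensating term is hard-coded here to mirror the slide; a real ECA derives it from the pending queries):

```python
# ECA-style compensation sketch: the query for the second update
# subtracts the portion already counted by the first update's still
# pending query, so the duplicate [Bags] never reaches the view.
catalog, store, depts = set(), set(), []   # depts keeps duplicates, as on the slide

catalog.add(("Shoes", "NY"))
store.add(("NY", "Madison Ave"))           # U1; its query Q1 is pending
catalog.add(("Bags", "NY"))                # U2 arrives before A(Q1)

# A(Q1): π_Dept(catalog ⋈ [NY, Madison Ave]) over the CURRENT catalog
a1 = [d for (d, c) in catalog if c == "NY"]               # Shoes and Bags
# ECA Q2: π_Dept([Bags, NY] ⋈ Store) MINUS π_Dept([Bags, NY] ⋈ [NY, Madison Ave])
a2_plain = ["Bags"] if any(c == "NY" for (c, a) in store) else []
a2_comp = ["Bags"]                         # part already counted by Q1
a2 = [d for d in a2_plain if d not in a2_comp]            # empty after compensation
depts.extend(a1 + a2)
print(sorted(depts))   # ['Bags', 'Shoes'] -- no duplicate 'Bags'
```

The subtraction offsets exactly the effect of the concurrent update, which is why ECA assumes all source relations live at one node: the compensating term must be evaluable there.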


Anomaly example revisited (ECA)

[Diagram: same setup as before: Depts = π_Dept(catalog ⋈ Store), catalog(Dept, City) = {[Shoes, NY]}, Store initially empty]

  • Insert(Store[NY, Madison Ave]); the warehouse issues Q1 = π_Dept(catalog ⋈ [NY, Madison Ave])

  • Insert(catalog[Bags, NY]); the warehouse issues the compensated query Q2 = π_Dept([Bags, NY] ⋈ Store) − π_Dept([Bags, NY] ⋈ [NY, Madison Ave])

  • A(Q1) = [[Shoes], [Bags]]

  • A(Q2) = ∅, so Depts = [[Shoes], [Bags]]: correct


ECA Algorithm

SOURCE

S_upi: Execute Ui; send Ui to DW; trigger W_upi at DW

S_qui: Receive Qi; let Ai = Qi(ssi); send Ai to DW; trigger W_ansi at DW

DATA WAREHOUSE (DW)

W_upi: receive Ui; Qi = V(Ui) − Σ_{Qj ∈ UQS} Qj(Ui); UQS = UQS + {Qi}; send Qi to S; trigger S_qui at S

W_ansi: receive Ai; COL = COL + Ai; UQS = UQS − {Qi}; if UQS = ∅ then MV = MV + COL, COL = ∅

(ssi = current source state; UQS = unanswered query set)


ECA-key

Avoids the need for compensating queries.

Necessary condition: the view contains key attributes for each of the base tables (e.g., star schema)
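When the view carries each base table's key, a deletion can be applied to the view directly by key; a minimal sketch of that key_del step (key position and schema assumed for illustration):

```python
# ECA-key sketch: because the view contains the base table's key, a
# source deletion maps straight to view tuples, with no compensating
# query back to the sources.
def key_del(view, deleted_key, key_pos=0):
    """Remove every view tuple whose key attribute matches the deletion."""
    return {t for t in view if t[key_pos] != deleted_key}

mv = {("hat", "Sue"), ("bags", "Sue"), ("bags", "Jane")}
mv = key_del(mv, "hat")     # e.g., Delete(catalog[hat, ...]) by its key
print(sorted(mv))           # [('bags', 'Jane'), ('bags', 'Sue')]
```

This is why star schemas, whose fact tables carry dimension keys, fit the ECA-key condition naturally.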


Example of ECA-key

[Diagram: view Sells(item, clerk) over catalog(item, dept) = {[hat, acc], [bags, acc]} and emp(dept, clerk) = {[acc, Sue]}; the updates Insert(catalog[bags, acc]), Delete(catalog[hat, acc]), and Insert(emp[acc, Jane]) interleave with the maintenance queries]

  • Q1 is issued for the first update; UQS = {Q1}, COL = ∅

  • Q2 = π_item,clerk([bags, acc] ⋈ emp) is issued; UQS = UQS + {Q2} = {Q1, Q2}

  • A(Q2) = {[bags, Sue], [bags, Jane]}; COL = {[bags, Jane], [bags, Sue]}

  • Because the view contains the base-table keys, the deletion is applied directly by key (key_del), with no compensating query

  • When UQS = ∅, COL is installed into the view


Strobe algorithm ideas

  • Apply actions only after a set of interleaving updates are all processed

    • Wait for sources to quiesce

  • Compensate effects of interleaved updates

    • Subtract effects of later updates before installing changes

  • Can combine these ideas

    STROBE IS A FAMILY OF ALGORITHMS
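Strobe's deferral idea in miniature (a toy sketch, not the full algorithm: actions collect in an action list AL and hit MV only when no queries are pending):

```python
# Strobe's action-list idea: answers and key-deletes accumulate in AL
# and are applied to MV as one batch only when the unanswered query
# set UQS is empty, i.e., when the interleaved updates have quiesced.
MV = set()
AL = []          # deferred ("ins"/"del", tuple) actions
UQS = set()      # unanswered queries; empty here, so we may apply AL

def when_quiescent():
    global AL
    if not UQS:                               # nothing pending
        for op, tup in AL:
            (MV.add if op == "ins" else MV.discard)(tup)
        AL = []                               # reset the action list

AL.append(("ins", ("$12", "hat", "Sue", 26)))
AL.append(("del", ("$12", "hat", "Sue", 26)))  # a later update compensates it
when_quiescent()
print(MV)   # set()
```

Batching the actions means the insert and its compensating delete cancel before the view is ever exposed in an inconsistent state.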


Strobe Terminology

  • The materialized view MV is the current state of the view at the warehouse, V(ws).

  • Given a query Q that needs to be evaluated, the function next_source(Q) returns the pair (x, Qi), where x is the next source to contact and Qi the portion of the query that can be answered by x.

  • Example: if V = r1 ⋈ r2 ⋈ r3, and U is an update received from r2, then Q = (r1 ⋈ U ⋈ r3) and next_source(Q) = (r1, r1 ⋈ U)


Strobe terminology (2)

Source_evaluation(Q):  /* returns the answer to Q */

Begin
  i = 0; WQ = Q; A0 = Q;
  (x, Q1) ← next_source(WQ);
  While x is not nil do
    Let i = i + 1;
    Send Qi to source x;
    When x returns Ai, let WQ = WQ⟨Ai⟩;
    Let (x, Qi+1) ← next_source(WQ);
  Return(Ai);
End


Strobe Algorithm

Source:
  - After executing Ui, send Ui to DW
  - When receiving Qi, compute Ai over ss[x] and send Ai to DW

Data warehouse (DW): initially AL = ∅
  - When Ui is received:
    - if Ui is a deletion: ∀ Qj ∈ UQS, add Ui to pend(Qj); add key_del(MV, Ui) to AL
    - if Ui is an insertion: Qi = V⟨Ui⟩ and pend(Qi) = ∅; Ai = source_evaluate(Qi); ∀ Uj ∈ pend(Qi), apply key_del(Ai, Uj); add insert(MV, Ai) to AL
  - When UQS = ∅, apply AL to MV as a single transaction, without adding duplicate tuples to MV; reset AL


Example with Strobe

[Diagram: Sold(price, item, clerk, age) over sale(item, clerk), catalog(price, item) = {[$12, hat]}, and emp(clerk, age) = {[Sue, 26]}]

  • U1 = Insert(sale, [hat, Sue]); the DW forms Q1 = catalog ⋈ [hat, Sue] ⋈ emp, with pend(Q1) = ∅

  • Q11 = (catalog ⋈ [hat, Sue]) is sent; A11 = [$12, hat, Sue]

  • U2 = Del([$12, hat]) arrives; pend(Q1) = {U2} and AL = {key_del(MV, U2)}

  • Q12 = [$12, hat, Sue] ⋈ emp; A12 = [$12, hat, Sue, 26]

  • Apply key_del(A12, U2), so A2 = ∅ and nothing is added to AL

  • UQS = ∅, so AL is applied: MV = ∅ (correct)


Transaction-Strobe

[Diagram: sale(item, clerk) goes from {[hat, Sue]} to {[shoes, Jane]}]

  • T1 = {delete(sale, [hat, Sue]), insert(sale, [shoes, Jane])}

  • Processing the delete: AL = {del([hat, Sue])}

  • Processing the insert: AL = {del([hat, Sue]), ins([shoes, Jane])}

  • AL is applied as one transaction: MV = {[shoes, Jane]}


Multiple table consistency

[Diagram: source tables Sales and Customer feed warehouse views V1: Customer-info, V2: Cust-prefs, and V3: Total-sales]

  • More than 1 table at warehouse

  • Multiple tables share source data

  • Updates at source should be reflected in all warehouse tables at the same time


Multiple table consistency (2)

[Diagram: at the warehouse, views V1, V2, …, Vn together require multiple-table consistency and each Vi requires single-table consistency; the sources S1, S2, …, Sm require source consistency]


Painting algorithm

  • Use merge process (MP) to coordinate sending updates to warehouse

  • MP holds update actions for each table

  • MP charts potential table states arising from each set of update actions

  • MP sends batch of update actions together when tables will be consistent

