
A Deferred Cleansing Method for RFID Data Analytics

Jun Rao, Sangeeta Doraiswamy, Latha S. Colby (IBM Almaden Research Center); Hetal Thakkar (University of California at Los Angeles).



  1. A Deferred Cleansing Method for RFID Data Analytics
     • IBM Almaden Research Center: Jun Rao, Sangeeta Doraiswamy, Latha S. Colby
     • University of California at Los Angeles: Hetal Thakkar

  2. RFID and Its Applications
     • Radio Frequency Identification
       • Radio-based barcode
       • Becoming widely used in supply-chain, asset tracking, …
       • Standardization based on Electronic Product Code (EPC)
     • Analytics on RFID data
       • simple: where is e1 at time t1+50?
       • complex: average time spent per hop in the supply chain

  3. RFID Data Tends to be Dirty
     • Various types of anomalies
       • Physical: radio interference, media type, etc.
         Redundant reads: (e1, t1, r1, l1) (e1, t1+2 secs, r1, l1)
         False reads: (e1, t1, r1, l1) ---> (e1, t1, r2, l2)
         Missing reads: (e1, t1, r1, l1) <--- (e1, t1+3, r2, l2) (e1, t1+10, r3, l3)
       • Logical: tend to be application dependent
         (e1, t1, r1, back room) (e1, t1+2, r2, sales floor) (e1, t1+5, r1, back room) (e1, t1+9, r2, sales floor)
     • Small number of anomalies ---> large error in analysis
     • Cleaning RFID data is imperative!

  4. Eager Cleansing vs. Deferred Cleansing
     • Conventional approach to cleansing is eager
       • At the edge server: de-dup, smoothing, …
       • Before loading into a warehouse (ETL)
         • has more context than the edge
       • Clean once, reuse at query time
       • Typically reduces data size downstream
       • Best strategy if applicable
     • Sometimes eager cleansing is not applicable
       • Don't know how to clean until analyzing the data
       • More than one cleaned version (app-dependent anomalies)
       • Law enforcement (pharmaceutical e-pedigree tracking)
     • We propose deferred cleansing
       • Load everything
       • Clean at query time
       • Has runtime overhead, but offers flexibility
       • Complementary to eager cleansing

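  5. Overview of Our Approach
     [Architecture diagram. Components: USER RULE, USER QUERY, CLEANSING RULES ENGINE, QUERY REWRITE ENGINE, and a DATABASE holding the EPC READS TABLE and the RULES TABLE. Arrows numbered 1 to 6 mark the order of the steps.]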

  6. Outline
     • Cleansing rules and their implementation
     • Query rewrite over cleansing rules
     • Experimental results
     • Conclusion

  7. RFID Data Characteristics
     • EPC sequences, each of which has all reads of an EPC in rtime order
       • Very useful for cleansing as well as querying
     • Many sequence-based languages proposed
     • But SQL/OLAP (standardized in SQL 99) can do sequence processing!
     Duplicate removal, e.g. for the reads (e1, t1, r1, l1) and (e1, t1+2 secs, r1, l1):
       with v1 as (
         select biz_loc as loc_current,
                max(biz_loc) over (partition by epc order by rtime asc
                                   rows between 1 preceding and 1 preceding) as loc_before
         from R
       )
       select * from v1
       where loc_current != loc_before or loc_before is null;
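The duplicate-removal query can be tried end to end with a small in-memory database. The following is a minimal sketch, not the authors' setup: it assumes Python's bundled SQLite (version 3.25 or later, for window functions) instead of DB2, and the sample reads and the extra `epc` and `rtime` output columns are hypothetical additions for readability.

```python
import sqlite3

# Hypothetical sample reads: (epc, rtime in seconds, reader, biz_loc).
reads = [
    ("e1", 100, "r1", "l1"),
    ("e1", 102, "r1", "l1"),  # redundant read at the same location
    ("e1", 110, "r2", "l2"),
    ("e2", 105, "r1", "l1"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table R (epc text, rtime integer, reader text, biz_loc text)")
conn.executemany("insert into R values (?, ?, ?, ?)", reads)

# Same shape as the slide's query: look one row back within each EPC sequence
# and keep a read only if its location differs from the previous one.
dedup_sql = """
with v1 as (
  select epc, rtime, biz_loc as loc_current,
         max(biz_loc) over (partition by epc order by rtime asc
                            rows between 1 preceding and 1 preceding) as loc_before
  from R
)
select epc, rtime, loc_current
from v1
where loc_current != loc_before or loc_before is null
order by epc, rtime
"""
cleaned = conn.execute(dedup_sql).fetchall()
print(cleaned)
```

The window frame `rows between 1 preceding and 1 preceding` contains at most one row, so `max` simply returns the previous read's location, or NULL for the first read of an EPC.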

  8. Exploit SQL/OLAP for Sequence-based Cleansing
     • Pros
       • more efficient (compared with self-joins)
       • standardized (supported by major DB vendors)
       • integrated: parallelism, optimization
     • Cons
       • complex syntax
     • Solution
       • specify cleansing rules in a simpler language (based on SQL-TS)
         • has an impact on query rewrite as well
       • implement rules in DBMS using SQL/OLAP

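  9. Cycle Rule
     • Scenario: a case (epc1) is read back and forth between the back room (X) and the sales floor (Y); the sequence [X Y X Y X Y] is cleansed to [X Y]
     • Rule clauses shown: CLUSTER BY epc SEQUENCE BY rtime
     • Annotations on the rule: target reference; the pattern is an ordered list of singleton references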
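The slide does not spell out the rule's pattern and condition, so the following is only one possible reading that reproduces the [X Y X Y X Y] to [X Y] outcome. It assumes a three-reference pattern (A, B, C) in which the middle read B is deleted when A and C share a location that differs from B's. The function name and the `(rtime, biz_loc)` tuple layout are hypothetical.

```python
def apply_cycle_rule(reads):
    """Assumed cycle rule for one EPC sequence.

    reads: list of (rtime, biz_loc) tuples sorted by rtime.
    A read is dropped when its predecessor and successor in the ORIGINAL
    sequence share one location and that location differs from its own.
    """
    kept = []
    for i, (rtime, loc) in enumerate(reads):
        if 0 < i < len(reads) - 1:
            prev_loc = reads[i - 1][1]
            next_loc = reads[i + 1][1]
            if prev_loc == next_loc and prev_loc != loc:
                continue  # middle of a back-and-forth cycle: delete it
        kept.append((rtime, loc))
    return kept


# The slide's scenario: back room (X) and sales floor (Y) alternate.
reads = [(1, "X"), (2, "Y"), (3, "X"), (4, "Y"), (5, "X"), (6, "Y")]
cycle_cleaned = apply_cycle_rule(reads)
print(cycle_cleaned)
```

Under this assumption the first X and the last Y survive, because they are the only reads without a matching neighbour on both sides.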

  10. Reader Rule
      • Scenario (diagram): docking door (reader D), warehouse (has location tag), forklift (reader X); read r1 (readerD) is followed by read r2 (readerX) within t2 mins
      • In the rule pattern, B is a set reference
      • SQL/OLAP implementation:
          max(case when reader = 'readerX' then 1 else 0 end)
            over (… range between 1 microsecond following and t2 min following) as has_readerX_after
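The look-ahead flag computed by the window expression can be mimicked in plain Python. This is a minimal sketch under stated assumptions: the function name, the `(rtime, reader)` tuple layout, and the 300-second window are hypothetical, and it reproduces only the flag, not whatever action the rule then takes on flagged reads.

```python
def has_reader_after(reads, target_reader, window_secs):
    """Return one 0/1 flag per read of a single EPC sequence.

    reads: list of (rtime_in_seconds, reader) tuples sorted by rtime.
    The flag is 1 when a read by target_reader occurs strictly after the
    current read and no more than window_secs later, mirroring a
    'range between <smallest step> following and t2 following' frame.
    """
    flags = []
    for i, (rtime, _reader) in enumerate(reads):
        flag = 0
        for later_rtime, later_reader in reads[i + 1:]:
            if later_rtime - rtime > window_secs:
                break  # sorted input: nothing later can be inside the window
            if later_rtime > rtime and later_reader == target_reader:
                flag = 1
                break
        flags.append(flag)
    return flags


reads = [(0, "readerD"), (120, "readerX"), (1000, "readerD")]
flags = has_reader_after(reads, "readerX", window_secs=300)
print(flags)
```

Because the condition only asks whether some row in the look-ahead window matches, it behaves like the existential check that a set reference implies.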

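  11. Missing Rule
      • Scenario (diagram): locations L1, L2, L3. The case (epcC) has reads (X) at two of the locations, while its pallet (epcP) has reads at all three; (X) marks the missing case read.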

  12. Query RFID Data over Cleansing Rules
      • Q = σ_s(R)
      • Q[C] is the answer to Q with respect to rule C
      • Naïve implementation: Q[C] = σ_s(Φ_C(R)), where Φ_C cleanses its input using rule C
      • Traditional predicate pushdown through a view is not directly applicable
      • Can we do this: Q[C] = Φ_C(σ_s(R))? (incorrect)

  13. Example 1: Reader rule
      • Timeline (case on forklift): r1 (readerD) at t1-2, r2 (readerX) at t1+2
      • Q1: σ_{rtime<t1}(R)
      • σ_s(Φ_C(R)): {}
      • Φ_C(σ_s(R)): {r1}
      • e1 = σ_{rtime<t1}(Φ_C(σ_{rtime<t1+5}(R))) (expanded rewrite)
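Example 1 can be replayed on two reads to see why the order of selection and cleansing matters. This is a minimal sketch with assumptions labeled in the code: the rule is taken to delete a readerD read that is followed by a readerX read within t2 = 5 minutes (consistent with the empty correct answer and the t1+5 bound), all reads belong to one EPC, and the concrete value of t1 is hypothetical.

```python
T1 = 100        # hypothetical query bound t1, in minutes
T2_WINDOW = 5   # assumed rule window t2, matching the t1+5 bound in the rewrite

# The two reads of the example, all for one EPC.
R = [
    {"id": "r1", "rtime": T1 - 2, "reader": "readerD"},
    {"id": "r2", "rtime": T1 + 2, "reader": "readerX"},
]


def clean(reads):
    """Assumed reader rule: drop a readerD read that is followed by a
    readerX read within T2_WINDOW minutes."""
    kept = []
    for r in reads:
        followed = any(
            o["reader"] == "readerX" and 0 < o["rtime"] - r["rtime"] <= T2_WINDOW
            for o in reads
        )
        if r["reader"] == "readerD" and followed:
            continue
        kept.append(r)
    return kept


def select_before(reads, bound):
    """Selection rtime < bound."""
    return [r for r in reads if r["rtime"] < bound]


def ids(reads):
    return [r["id"] for r in reads]


naive = ids(select_before(clean(R), T1))    # clean first, then select: correct
pushed = ids(clean(select_before(R, T1)))   # select first: r2 is gone, r1 survives
expanded = ids(select_before(clean(select_before(R, T1 + T2_WINDOW)), T1))
print(naive, pushed, expanded)
```

The widened inner selection keeps r2 visible to the rule, so the expanded rewrite agrees with the naïve plan while still filtering before cleansing.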

  14. Example 2: Duplicate rule
      • Timeline (case): r3 (loc1) at t2-2, r4 (loc1) at t2+2
      • Q2: σ_{rtime>t2}(R)
      • σ_s(Φ_C(R)): {}
      • Φ_C(σ_s(R)): {r4}
      • e2 = σ_{rtime>t2}(Φ_C(R ⋈_{epc} Π_{epc}(σ_{rtime>t2}(R)))) (join-back rewrite, always applicable)
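The join-back rewrite can be replayed in the same way. This is a minimal sketch: the duplicate rule is modeled as "drop a read whose location equals that of the previous read of the same EPC", the value of t2 is hypothetical, and a third read r5 for a second EPC is added (not on the slide) to show that join-back prunes sequences the query never touches.

```python
T2 = 100  # hypothetical query bound t2

R = [
    {"id": "r3", "epc": "e1", "rtime": T2 - 2, "biz_loc": "loc1"},
    {"id": "r4", "epc": "e1", "rtime": T2 + 2, "biz_loc": "loc1"},
    {"id": "r5", "epc": "e2", "rtime": T2 - 5, "biz_loc": "loc2"},  # added for illustration
]


def clean(reads):
    """Duplicate rule: within each EPC sequence ordered by rtime, drop a read
    at the same location as the read just before it."""
    kept = []
    last_loc = {}
    for r in sorted(reads, key=lambda r: (r["epc"], r["rtime"])):
        if last_loc.get(r["epc"]) != r["biz_loc"]:
            kept.append(r)
        last_loc[r["epc"]] = r["biz_loc"]
    return kept


def select_after(reads, bound):
    """Selection rtime > bound."""
    return [r for r in reads if r["rtime"] > bound]


def ids(reads):
    return sorted(r["id"] for r in reads)


naive = ids(select_after(clean(R), T2))    # correct answer
pushed = ids(clean(select_after(R, T2)))   # incorrect: r3 is no longer visible

# Join-back: find EPCs with at least one qualifying read, fetch their FULL
# sequences, cleanse those, then apply the query predicate.
epcs = {r["epc"] for r in select_after(R, T2)}
joined = [r for r in R if r["epc"] in epcs]
join_back = ids(select_after(clean(joined), T2))
print(naive, pushed, ids(joined), join_back)
```

Join-back pays for an extra join but needs no knowledge of the rule's conditions, which is why it is always applicable.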

  15. Rewrite Summary
      • Expanded rewrite
        • works at the rule level, instead of the SQL/OLAP level
        • collect conditions in cleansing rules referencing the target reference
        • keep only position-preserving conditions
        • run transitivity between surviving rule conditions and query conditions
        • predicates derived on the target reference can be pushed down
      • Choose the rewrite between expanded and join-back
      • Extended to support multiple rules and join queries

  16. Experimental Setup: RFID Data Schema
      Table (rows): columns
      • locs (13k): gln, desc, site, state, city, comment
      • caseR (s*1.5k): epc, rtime, reader, biz_loc, biz_step
      • palletR (s*30): epc, …
      • EPC_info (s*50): epc, product, lot, manufacture_date, expiration_date, comment
      • parent (s*50): child_epc, parent_epc
      • steps (100): biz_step, desc, type, comment
      • product (1,000): product, manufacturer, comment

  17. Queries and Rules
      • 1 GB base data
      • Varying anomaly percentage
        • implemented by inverting the rules
      • DB2 UDB V8.2
      • Indexes on queried attributes
      q1. "Dwell" analysis: average staying time between adjacent locations.
        with v1 as (
          select biz_loc as current_loc, rtime,
                 max(rtime) over (… rows 1 preceding) as prev_time,
                 max(biz_loc) over (… rows 1 preceding) as prev_loc
          from caseR
          where rtime <= T1
        )
        select l1.loc_desc, l2.loc_desc, avg(rtime-prev_time)
        from v1, locs l1, locs l2
        where v1.prev_loc = l1.gln and v1.current_loc = l2.gln
        group by l1.loc_desc, l2.loc_desc
      q2. Site analysis
        select p.manufacturer, count(distinct s.type), count(distinct c.reader)
        from caseR c, steps s, locs l, epc_info i, product p
        where c.biz_step = s.biz_step and c.biz_loc = l.gln
          and c.epc = i.epc and i.product = p.product
          and c.rtime >= T2 and l.site = 'distribution center 2'
        group by p.manufacturer
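A dwell query in the spirit of q1 can be run on a toy data set. This is a minimal sketch, not the benchmark itself: it assumes Python's bundled SQLite (3.25 or later) rather than DB2, fills the elided window specification with `partition by epc order by rtime` as a guess, and uses hypothetical sample rows and location names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table caseR (epc text, rtime integer, biz_loc text)")
conn.execute("create table locs (gln text, loc_desc text)")
conn.executemany("insert into caseR values (?, ?, ?)", [
    ("e1", 0, "g1"), ("e1", 10, "g2"), ("e1", 30, "g3"),
    ("e2", 5, "g1"), ("e2", 25, "g2"),
])
conn.executemany("insert into locs values (?, ?)", [
    ("g1", "factory"), ("g2", "dc"), ("g3", "store"),
])

# For each read, fetch the previous read's time and location in the same EPC
# sequence, then average the elapsed time per (previous, current) location pair.
dwell_sql = """
with v1 as (
  select biz_loc as current_loc, rtime,
         max(rtime) over (partition by epc order by rtime
                          rows between 1 preceding and 1 preceding) as prev_time,
         max(biz_loc) over (partition by epc order by rtime
                            rows between 1 preceding and 1 preceding) as prev_loc
  from caseR
  where rtime <= :t1
)
select l1.loc_desc, l2.loc_desc, avg(rtime - prev_time)
from v1, locs l1, locs l2
where v1.prev_loc = l1.gln and v1.current_loc = l2.gln
group by l1.loc_desc, l2.loc_desc
order by l1.loc_desc, l2.loc_desc
"""
dwell = conn.execute(dwell_sql, {"t1": 1000}).fetchall()
print(dwell)
```

The first read of each EPC has no predecessor, so its `prev_loc` is NULL and the join against `locs` drops it from the averages.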

  18. Single Rule, 10% anomalies, Varying Selectivity
      • Both rewrites are more efficient than naïve
      • Cleansing overhead comes from sort and scalar aggregates in SQL/OLAP
        • sort required by cleansing is shared by q1
      • Tradeoffs between expanded and join-back rewrite
        • Expanded can't use all predicates in the query; join-back has to do extra joins
      • Cleansing overhead amortized over joins and aggregates

  19. 10% selectivity, 10% anomalies, Varying Rules
      • Additional overhead per extra rule is moderate
        • sort required in SQL/OLAP is amortized over multiple rules
      • "Missing rule" adds the most overhead
        • Has to sort both case reads and pallet reads

  20. Conclusion
      • Proposed a deferred cleansing approach to RFID data
        • Complementary to eager cleansing
        • Has overhead, but offers flexibility
      • SQL-TS based cleansing rules for simplicity
      • SQL/OLAP implementation for efficiency
      • Two query rewrites exploit query predicates and guarantee correctness
      • Experimental results show deferred cleansing is affordable for typical analytical queries

  21. Extended SQL-TS
      Rule syntax:
        DEFINE [rule name]
        ON [table name]
        FROM [table name]
        CLUSTER BY [cluster key]
        SEQUENCE BY [sequence key]
        AS [pattern]
        WHERE [condition]
        ACTION [DELETE | MODIFY | KEEP]
      • CLUSTER BY (epc) and SEQUENCE BY (rtime) define sequences
      • Pattern defines an ordered list of references
        • a reference with no * sign refers to a single row
        • a reference with a * sign refers to a set of rows
      • WHERE clause specifies conditions on attributes in references
        • existential semantics on set references
      • Action is defined on a singleton reference (target reference)
      Example: AS (A, B) WHERE A.biz_loc = B.biz_loc DELETE B
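To make the rule anatomy concrete, the following sketch evaluates the example rule `AS (A, B) WHERE A.biz_loc = B.biz_loc DELETE B` in plain Python. It is a simplified illustration, not the authors' engine: the `Rule` class and `apply_delete_rule` function are hypothetical names, only singleton references and the DELETE action are handled, and every match is evaluated against the original sequence.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Callable


@dataclass
class Rule:
    width: int                      # number of singleton references in the pattern
    condition: Callable[..., bool]  # WHERE clause over the matched rows
    target: int                     # position of the target reference in the pattern


def apply_delete_rule(rows, rule, cluster_key, sequence_key):
    """Slide the pattern over each cluster's sequence and delete the target
    row of every window that satisfies the condition."""
    rows = sorted(rows, key=lambda r: (r[cluster_key], r[sequence_key]))
    kept = []
    for _, group in groupby(rows, key=lambda r: r[cluster_key]):
        seq = list(group)
        doomed = set()
        for start in range(len(seq) - rule.width + 1):
            window = seq[start:start + rule.width]
            if rule.condition(*window):
                doomed.add(start + rule.target)
        kept.extend(r for i, r in enumerate(seq) if i not in doomed)
    return kept


# AS (A, B) WHERE A.biz_loc = B.biz_loc DELETE B
duplicate_rule = Rule(
    width=2,
    condition=lambda a, b: a["biz_loc"] == b["biz_loc"],
    target=1,
)

rows = [
    {"epc": "e1", "rtime": 1, "biz_loc": "l1"},
    {"epc": "e1", "rtime": 3, "biz_loc": "l1"},  # duplicate of the read before it
    {"epc": "e1", "rtime": 5, "biz_loc": "l2"},
]
kept = apply_delete_rule(rows, duplicate_rule, cluster_key="epc", sequence_key="rtime")
print([r["rtime"] for r in kept])
```

Set references (those marked with *) would need a window of variable extent and an existential check, which this sketch leaves out.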
