
Trustworthy Keyword Search for Regulatory Compliant Record Retention


Presentation Transcript


  1. Trustworthy Keyword Search for Regulatory Compliant Record Retention. Soumyadeb Mitra (University of Illinois / IBM Almaden), Windsor W. Hsu (IBM Almaden), Marianne Winslett (University of Illinois)

  2. There is a need for trustworthy record keeping
  • Digital information explosion: IDC forecasts 60B business emails annually by 2006, plus instant messaging and files
  • Corporate misconduct and a growing focus on compliance (e.g. HIPAA)
  • Soaring discovery costs: spending on eDiscovery is growing at 65% CAGR, and the average F500 company has 125 non-frivolous lawsuits at any given time
  Sources: IDC, Network World (2003), Socha / Gelbmann (2004)

  3. What is trustworthy record keeping? Establishing solid proof of events that have occurred. [Slide figure: Alice commits a record to the storage device; some time later she regrets it; when Bob queries, he should still get back Alice's data despite the adversary.]

  4. This leads to a unique threat model
  • The commit is trustworthy: the record is created properly
  • The query is trustworthy: the record is queried properly
  • In between, the adversary has super-user privileges: access to the storage device and to any keys
  • The adversary could be Alice herself

  5. Traditional schemes do not work: we cannot rely on Alice's signature, because the adversary may be Alice herself.

  6. WORM (Write Once Read Many) storage helps address the problem: once written, a record cannot be overwritten or deleted, so the adversary cannot destroy Alice's record; at most a new record can be added.

  7. An index is required due to the high volume of records. [Slide figure: Alice commits a record and the index is updated; Bob later answers his query from the index; the adversary and the "regret" event sit in between.]

  8. In effect, records can be hidden or altered by modifying the index: hide record B from the index, or replace B with B'. The index must also be secured (fossilized).

  9. Most business records are unstructured and are searched with an inverted index, with one WORM file for each posting list. Example keywords and their posting lists (document IDs):
  • Query: 1, 3, 11, 17
  • Data: 3, 9
  • Base: 3, 19
  • Worm: 7, 36
  • Index: 3
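  As a rough illustration of this layout, the following Python sketch (not the paper's implementation; the class and file names are invented) keeps one append-only, WORM-like file per keyword's posting list.

    import os

    class AppendOnlyInvertedIndex:
        """Toy inverted index: one append-only posting-list file per keyword."""

        def __init__(self, directory):
            self.directory = directory
            os.makedirs(directory, exist_ok=True)

        def _path(self, keyword):
            return os.path.join(self.directory, f"{keyword}.postings")

        def add_document(self, doc_id, keywords):
            # One append (and, on disk, roughly one seek) per distinct keyword.
            for kw in set(keywords):
                with open(self._path(kw), "a") as f:
                    f.write(f"{doc_id}\n")

        def lookup(self, keyword):
            # Return the posting list (document IDs) for a keyword.
            try:
                with open(self._path(keyword)) as f:
                    return [int(line) for line in f]
            except FileNotFoundError:
                return []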

  10. The index must be updated as new documents arrive: appending document 79 to the posting list of each of its keywords (Query, Data, Index, ...) costs one random I/O per keyword.
  • 500 keywords = 500 disk seeks
  • ~1 sec per document
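  A back-of-envelope check of the slide's figures, assuming roughly 2 ms per random I/O (the seek time is an assumption for illustration, not a number from the paper):

    # Rough cost of naive per-document index updates.
    distinct_keywords_per_doc = 500
    random_io_sec = 0.002              # assumed average seek + write time

    cost_per_doc = distinct_keywords_per_doc * random_io_sec
    print(f"~{cost_per_doc:.1f} s per document")   # ~1.0 s, i.e. about 1 doc/sec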

  11. Amortize the cost by updating in batch: buffer the postings of incoming documents (79, 80, 81, ...) in memory and flush them to the posting lists together.
  • 1 seek per keyword per batch
  • A large buffer is needed for infrequent terms to benefit
  • Over 100,000 documents must be buffered to achieve 2 docs/sec
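  A minimal sketch of this batching idea (names and batch size are invented); it is only meant to show where the amortization comes from:

    import os
    from collections import defaultdict

    class BufferedInvertedIndex:
        """Toy batched inverted index: flush buffered postings once per batch."""

        def __init__(self, directory, batch_size=100_000):
            self.directory = directory
            self.batch_size = batch_size
            self.buffer = defaultdict(list)   # keyword -> buffered document IDs
            self.buffered_docs = 0
            os.makedirs(directory, exist_ok=True)

        def add_document(self, doc_id, keywords):
            for kw in set(keywords):
                self.buffer[kw].append(doc_id)
            self.buffered_docs += 1
            if self.buffered_docs >= self.batch_size:
                self.flush()

        def flush(self):
            # One append per keyword for the whole batch, instead of one per document.
            for kw, doc_ids in sorted(self.buffer.items()):
                with open(os.path.join(self.directory, f"{kw}.postings"), "a") as f:
                    f.writelines(f"{d}\n" for d in doc_ids)
            self.buffer.clear()
            self.buffered_docs = 0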

  12. With buffering, the index is not updated immediately, which leaves a window in which the adversary can alter or omit records before they reach the index.
  • Prevailing practice: email must be committed before it is delivered

  13. Can the storage server cache help?
  • Storage servers have a huge cache
  • Data committed into the cache is effectively on disk
  • The cache is battery backed-up
  • It sits inside the WORM box, so it is trustworthy

  14. Caching works in blocks, so it does not benefit infrequent terms. [Slide figure: appending documents 79 and 80 hits the cache for a frequent term like Query but misses for infrequent terms such as Base and Index.]

  15. Simulation results show caching is not enough

  16. Simulation results show caching is not enough. But what if the number of posting lists were no larger than the number of cache blocks? Then every update would hit the cache.

  17. So, merge posting lists so that the tail blocks fit in cache: only 1 random I/O per document, for a 4K block size. Keywords that share a merged list are distinguished by a short keyword encoding stored next to each document ID. [Slide figure: three posting lists merged into one list of (keyword encoding, document ID) pairs.]
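  A toy sketch of a merged posting list, assuming an in-memory list of (keyword encoding, document ID) pairs; the names and encodings are illustrative, not the paper's on-disk format:

    class MergedPostingList:
        """Several keywords share one append-only list; postings carry a keyword code."""

        def __init__(self, keywords):
            self.code = {kw: i for i, kw in enumerate(keywords)}  # keyword -> small code
            self.postings = []                                    # (code, doc_id) pairs

        def add(self, keyword, doc_id):
            # One append per document, regardless of how many keywords share the list.
            self.postings.append((self.code[keyword], doc_id))

        def lookup(self, keyword):
            # The price of merging: scan the whole merged list and filter by code.
            c = self.code[keyword]
            return [doc_id for code, doc_id in self.postings if code == c]

    merged = MergedPostingList(["query", "data", "base"])
    merged.add("query", 1); merged.add("data", 3); merged.add("query", 3)
    print(merged.lookup("query"))   # [1, 3]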

  18. The tradeoff is longer lists to scan during lookup: a query such as "VLDB 2006" is answered by scanning both posting lists. Let t_w be the length of the posting list for keyword w and q_w the number of times w is queried in the workload.
  • Workload lookup cost before merging: Σ_w t_w · q_w
  • After merging into A = {A_1, ..., A_n}: Σ_{A ∈ A} (Σ_{w ∈ A} t_w) · (Σ_{w ∈ A} q_w), i.e. the length of each merged list times the number of times it is searched
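  The same cost formulas in executable form; the posting-list lengths t are taken from the slide-9 example, while the query counts q are made up for illustration:

    def cost_unmerged(t, q):
        # Sum over keywords of (posting-list length) x (query frequency).
        return sum(t[w] * q[w] for w in t)

    def cost_merged(partition, t, q):
        # partition is a list of keyword groups A_1, ..., A_n.
        return sum(sum(t[w] for w in A) * sum(q[w] for w in A) for A in partition)

    t = {"query": 4, "data": 2, "base": 2, "worm": 2, "index": 1}     # lengths from slide 9
    q = {"query": 100, "data": 10, "base": 5, "worm": 1, "index": 1}  # assumed frequencies

    print(cost_unmerged(t, q))                                                # 433
    print(cost_merged([["query"], ["data", "base", "worm", "index"]], t, q))  # 519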

  19. Which lists to merge? We need a heuristic solution (an illustrative sketch follows below).
  • Choose A = {A_1, A_2, ..., A_n}, with n = number of cache blocks
  • Minimize Σ_A (Σ_{w ∈ A} t_w) · (Σ_{w ∈ A} q_w)
  • The problem is NP-complete (reduction from minimum-sum-of-squares)
  • So, try some merging heuristics on a real-world workload: 1 million documents from IBM's intranet and 300,000 queries
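  One plausible heuristic in this spirit, shown purely as an illustration (the paper's actual heuristics differ in their details): keep the highest q_w·t_w contributors in their own lists and spread the remaining keywords over the leftover cache blocks.

    def partition_keywords(t, q, n_blocks):
        """Illustrative only: top contributors get private lists, the rest are merged."""
        ranked = sorted(t, key=lambda w: q[w] * t[w], reverse=True)
        n_private = n_blocks // 2                      # assumed split, for illustration
        private = [[w] for w in ranked[:n_private]]
        merged = [[] for _ in range(n_blocks - n_private)]
        for i, w in enumerate(ranked[n_private:]):
            merged[i % len(merged)].append(w)          # round-robin over remaining blocks
        return private + [g for g in merged if g]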

  20. A few terms contribute most of the query workload cost (t_w · q_w).

  21. Different merging heuristics were tried
  • Separate lists for high-contributor terms
  • Merging heuristics based on q_w · t_w
  • Random merging
  • Details of the heuristics and their evaluation are in the paper

  22. Additional index support is needed to answer conjunctive queries (e.g. "VLDB and 2006") quickly. With posting lists of length m and n:
  • Merge join: O(m + n), scanning both lists in step
  • Index join: O(m log n), probing the index of one list for each element of the other
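  Both strategies on sorted posting lists, as a sketch; the example lists are illustrative rather than the slide's exact figures:

    from bisect import bisect_left

    def merge_join(a, b):
        # O(m + n): advance two cursors over the sorted lists in step.
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    def index_join(a, b):
        # O(m log n): binary-search the longer list b for each element of a.
        out = []
        for x in a:
            k = bisect_left(b, x)
            if k < len(b) and b[k] == x:
                out.append(x)
        return out

    vldb, y2006 = [2, 3, 7, 13, 24, 31], [3, 13, 24, 31]   # illustrative lists
    print(merge_join(vldb, y2006))    # [3, 13, 24, 31]
    print(index_join(vldb, y2006))    # [3, 13, 24, 31]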

  23. How to maintain B+trees on WORM
  • B+trees normally require node splits and joins, which rewrite existing nodes
  • A B+tree over a posting list is a special case: document IDs are inserted in increasing order
  • So the tree can be built bottom-up without splits or joins (a simplified sketch follows below)
  • Please refer to our paper for details
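  A simplified bulk-build sketch of the bottom-up idea: because document IDs only grow, sorted keys fill leaves left to right and parent levels are stacked on top, so no written node ever needs to be split. The fanout and the list-of-lists representation are choices made for brevity, not the paper's layout.

    def build_bptree_bottom_up(sorted_keys, fanout=4):
        """Toy bottom-up B+tree build over already-sorted (increasing) keys."""
        level = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
        levels = [level]                      # levels[0] holds the leaves
        while len(level) > 1:
            # Each internal node stores the smallest key under each of its children.
            parents = []
            for i in range(0, len(level), fanout):
                children = level[i:i + fanout]
                parents.append([child[0] for child in children])
            levels.append(parents)
            level = parents
        return levels                          # levels[-1][0] is the root node

    print(build_bptree_bottom_up(list(range(1, 11)), fanout=4))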

  24. Even on WORM, a B+tree index is insecure: the path to an element depends on elements inserted later, so an adversary can attack it. [Slide figure: a B+tree over document IDs 2, 4, 7, 11, 13, 19, 23, 29, 31 whose internal paths change as later IDs such as 25, 26, 27 and 30 are inserted.]

  25. Our solution is jump indexes
  • The path to an element depends only on elements inserted before it
  • The jump index is provably trustworthy
  • It leverages the fact that document IDs are increasing
  • O(log N) lookup, where N is the number of documents
  • Supports range queries too
  • Reasonable performance compared to B+trees for conjunctive queries, in experiments with a real workload
  • For details, see our paper

  26. Conclusions
  • WORM storage by itself is not enough; we need a trustworthy index too
  • Trustworthy inverted indexes can be built efficiently: 10-15% slowdown for non-conjunctive queries, and within 1.5x of optimal B-tree performance for conjunctive queries
  • Other possible uses, such as indexing time

  27. Future work
  • Ranking attacks: an adversary can add a lot of junk documents
  • A secure generic index structure whose path to an element is fixed and which supports range queries

  28. Questions

  29. A new index structure is required: the jump index. Each element keeps an array of element pointers; the i-th pointer from element n points to an element n_i with n + 2^i ≤ n_i < n + 2^(i+1). [Slide figure: element n with pointers 0 through 5; pointer 1 leads to n+2, since n + 2^1 ≤ n+2 < n + 2^2.]

  30. Jump index in action: insert the elements 1, 2, 5, 7 in order (each element keeps about log N pointers).
  • Insert 2: from element 1, 1 + 2^0 ≤ 2 < 1 + 2^1, so set pointer 0 of element 1
  • Insert 5: from element 1, 1 + 2^2 ≤ 5 < 1 + 2^3, so set pointer 2 of element 1
  • Insert 7: from element 1, 1 + 2^2 ≤ 7 < 1 + 2^3, but pointer 2 is already set, so follow it to 5; then 5 + 2^1 ≤ 7 < 5 + 2^2, so set pointer 1 of element 5

  31. Lookup(7): start at element 1; 1 + 2^2 ≤ 7 < 1 + 2^3, so follow pointer 2 to element 5; then 5 + 2^1 ≤ 7 < 5 + 2^2, so follow pointer 1 and reach 7. The path to an element does not depend on future elements.
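  A minimal per-element sketch of the jump index as described on slides 29-31. The paper's version groups p entries per block with branch factor B (slide 32); this toy keeps one entry per node and effectively uses B = 2, so it illustrates the idea rather than the actual data structure.

    class JumpIndexNode:
        def __init__(self, value, max_bits=32):
            self.value = value
            self.ptr = [None] * max_bits   # ptr[i] covers [value + 2^i, value + 2^(i+1))

    class JumpIndex:
        """Toy jump index: document IDs must be inserted in strictly increasing order."""

        def __init__(self):
            self.head = None

        @staticmethod
        def _slot(base, x):
            # The i with base + 2^i <= x < base + 2^(i+1).
            return (x - base).bit_length() - 1

        def insert(self, x):
            node = JumpIndexNode(x)
            if self.head is None:
                self.head = node
                return
            cur = self.head
            while True:
                i = self._slot(cur.value, x)
                if cur.ptr[i] is None:        # first element in this range: set the pointer
                    cur.ptr[i] = node
                    return
                cur = cur.ptr[i]              # already set: follow the pointer

        def contains(self, x):
            # O(log N) pointer follows; the path never depends on later insertions.
            cur = self.head
            while cur is not None and cur.value < x:
                cur = cur.ptr[self._slot(cur.value, x)]
            return cur is not None and cur.value == x

    ji = JumpIndex()
    for v in (1, 2, 5, 7):
        ji.insert(v)
    print(ji.contains(7), ji.contains(3))   # True False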

  32. Jump index elements are stored in blocks
  • Storing pointers with every element is inefficient, so p entries are grouped together per block, with branch factor B
  • Pointer (i, j) from a block (with base element l) points to the block b' containing the smallest element x with l + j·B^i ≤ x < l + (j+1)·B^i

  33. Jump index evaluation parameters
  • p: number of entries grouped together
  • B: branching factor
  • L: block size, with p + (B-1)·log_B N ≤ L
  • Evaluation covers index update performance (how many pointers have to be set) and query performance

  34. Update performance levels off at reasonable cache size

  35. Query performance is close to optimal (Btree)

  36. Btree on WORM

  37. Is this a real threat? Would someone want to delete a record only a day after it is created?
  • Intrusion detection logging: once an adversary gains control, he would want to delete the records of his initial attack, so a record may be regretted moments after its creation
  • Email best practice: a message must be committed before it is delivered
