Bootstrapping Privacy Compliance in Big Data Systems

Bootstrapping Privacy Compliance in Big Data Systems. Shayak Sen, Saikat Guha, et al. Carnegie Mellon University, Microsoft Research. Presenter: Cheng Li. We have everything of yours: your bank account, your mobile, your social network, your shopping account. We will keep it secret.

Presentation Transcript


  1. Bootstrapping Privacy Compliance in Big Data Systems • Shayak Sen, Saikat Guha, et al. • Carnegie Mellon University, Microsoft Research • Presenter: Cheng Li

  2. We have everything of yours • Your bank account • Your mobile • Your social network • Your shopping account

  3. We will keep it secret

  4. This is how we work • Legal team crafts the privacy policy • Privacy champion interprets the policy • Developer writes code • Audit team verifies compliance • Lots of meetings! Low efficiency!

  5. Life could be much easier • Encode • Refine • Code analysis • Fewer meetings

  6. Outline • Introduction • LEGALEASE • Goal • Syntax • Domain-Specific Attribute • Formal Semantics • Properties • GROK • Validation • Discussion • Conclusion

  7. LEGALEASE • Goal • Usability: Policy clauses are structured very similarly to clauses in English language policy. • Expressivity: Clauses are built around an attribute abstraction that allows the language to evolve as policy evolves. • Compositional Reasoning: LEGALEASE provides meaningful syntactic restrictions to allow compositional reasoning.

  8. Outline • Introduction • LEGALEASE • Goal • Syntax • Domain-Specific Attribute • Formal Semantics • Properties • GROK • Validation • Discussion • Conclusion

  9. LEGALEASE • Syntax • Domain-specific attributes are defined in a concept lattice. • LEGALEASE policies are checked at each node in the data dependency graph. • Each node is labeled with attribute names and a set of values for each attribute. • ALLOW: permits nodes labeled with a subset of the clause's attribute values. • DENY: forbids nodes labeled with sets that overlap the clause's attribute values.

  10. LEGALEASE • Example • Full IP address will not be used for advertising. IP address may be used for detecting abuse. In such cases it will not be combined with account information. • DENY DataType IPAddress UseForPurpose Advertising EXCEPT ALLOW DataType IPAddress:Truncated • ALLOW DataType IPAddress UseForPurpose AbuseDetect EXCEPT DENY DataType IPAddress, AccountInfo

  11. Outline • Introduction • LEGALEASE • Goal • Syntax • Domain-Specific Attribute • Formal Semantics • Properties • GROK • Validation • Discussion • Conclusion

  12. LEGALEASE • Domain-specific Attribute • Attribute values are organized as a concept lattice. • Advantages of concept lattice: • Abstracts away semantics. • The lattice structure allows users to concisely define sets of elements through their least upper bound. • The lattice structure allows us to statically check the policy for certain classes of errors.
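
The least-upper-bound idea on this slide can be illustrated with a small sketch. It assumes a tree-shaped simplification of a concept lattice, and the element names (UniqueIdentifier, SearchQuery, Top) are hypothetical placeholders rather than the lattice used in the paper.

```python
# Hypothetical fragment of a DataType concept lattice, stored as child -> parent.
# Element names are illustrative only; a real lattice would be larger.
PARENT = {
    "IPAddress": "UniqueIdentifier",
    "AccountInfo": "UniqueIdentifier",
    "UniqueIdentifier": "Top",
    "SearchQuery": "Top",
}


def ancestors(x):
    """Return x followed by every element above it, bottom-up."""
    chain = [x]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def leq(a, b):
    """True if a sits at or below b in the (tree-shaped) lattice."""
    return b in ancestors(a)


def least_upper_bound(a, b):
    """Lowest element that lies above both a and b."""
    above_b = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in above_b:
            return candidate
    raise ValueError("no common upper bound")
```

With this layout, naming UniqueIdentifier in a clause concisely covers both IPAddress and AccountInfo, which is the "define sets through their least upper bound" advantage listed above.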

  13. LEGALEASE • Attributes defined in the implementation • InStore attribute: encodes policies around the collection and storage of data.

  14. LEGALEASE • Attributes defined in the implementation • UseForPurpose attribute: encodes how the data is used.

  15. LEGALEASE • Attributes defined in the implementation • AccessByRole attribute: encodes internal access-control policies.

  16. LEGALEASE • Attributes defined in the implementation • DataType attribute: • Policy datatypes: types of data

  17. LEGALEASE • Attributes defined in the implementation • DataType attribute: • Policy datatypes: categories of data types • Limited typestate: a limited way of tracking history.

  18. LEGALEASE • Attributes defined in the implementation • DataType attribute: • Combining policy datatypes and typestates: • t:s, where t is a policy datatype and s is a typestate.
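
As a rough illustration of the t:s notation, the sketch below parses a label such as IPAddress:Truncated into its two parts. The `DataType` tuple, `parse_datatype`, and `refines` are hypothetical helpers, and the ordering used by `refines` (t:s sits below t) is an assumption suggested by the example on slide 10, not something the slides state.

```python
from typing import NamedTuple, Optional


class DataType(NamedTuple):
    policy_type: str                 # the policy datatype t
    typestate: Optional[str] = None  # the typestate s, if any


def parse_datatype(label: str) -> DataType:
    """Split a 't:s' label; a bare 't' has no typestate."""
    t, sep, s = label.partition(":")
    return DataType(t, s if sep else None)


def refines(a: DataType, b: DataType) -> bool:
    """Assumed ordering: t:s is at or below t, and every label is below itself."""
    return a.policy_type == b.policy_type and b.typestate in (None, a.typestate)
```

For example, `parse_datatype("IPAddress:Truncated")` yields a value that refines plain `IPAddress`, but not the other way round.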

  19. Outline • Introduction • LEGALEASE • Goal • Syntax • Domain-Specific Attribute • Formal Semantics • Properties • GROK • Validation • Discussion • Conclusion

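  20. LEGALEASE • Formal Semantics • Notation: • T – a vector of sets of lattice elements. • Tx – the value of attribute x in T. • TG – the vector labeling a graph node. • TC – the vector of a policy clause.

  21. LEGALEASE • Formal Semantics • ALLOW TC applies to a graph node TG if TG ⊑ TC, where ⊑ is checked for each attribute x. • DENY TC applies to TG if the value sets of TG overlap those of TC. (The slide's formulas were images and are not preserved in the transcript.)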

  22. LEGALEASE • Formal Semantics • A graph node is allowed by an ALLOW clause if and only if the clause applies to the node and the node is allowed by each of the clause's exceptions.

  23. LEGALEASE • Formal Semantics • A graph node is denied by a DENY clause if and only if the clause applies to the node and the node is denied by each of the clause's exceptions.
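
The two rules above can be sketched as a pair of mutually recursive functions. This is a simplified illustration, not the paper's formal definition: the `Clause` class, the `BELOW` ordering, and in particular the overlap test used for DENY are assumptions made for the sketch, and it does not model clauses that restrict combinations of datatypes.

```python
from dataclasses import dataclass, field

# Assumed ordering: a typestate-qualified type sits below its base type.
BELOW = {"IPAddress:Truncated": "IPAddress"}


def leq(a, b):
    """True if a equals b or lies below it in the assumed ordering."""
    while a is not None:
        if a == b:
            return True
        a = BELOW.get(a)
    return False


@dataclass
class Clause:
    kind: str    # "ALLOW" or "DENY"
    attrs: dict  # attribute name -> set of values
    exceptions: list = field(default_factory=list)


def applies(clause, node):
    """ALLOW: node values fall under the clause's values. DENY: they overlap."""
    for name, wanted in clause.attrs.items():
        have = node.get(name, set())
        if clause.kind == "ALLOW":
            ok = all(any(leq(h, w) for w in wanted) for h in have)
        else:  # assumed overlap test: some pair of values is comparable
            ok = any(leq(h, w) or leq(w, h) for h in have for w in wanted)
        if not ok:
            return False
    return True


def allows(clause, node):
    if clause.kind == "ALLOW":
        return applies(clause, node) and all(allows(e, node) for e in clause.exceptions)
    return not denies(clause, node)


def denies(clause, node):
    if clause.kind == "DENY":
        return applies(clause, node) and all(denies(e, node) for e in clause.exceptions)
    return not allows(clause, node)


# First clause of the slide-10 example.
ip_policy = Clause(
    "DENY",
    {"DataType": {"IPAddress"}, "UseForPurpose": {"Advertising"}},
    [Clause("ALLOW", {"DataType": {"IPAddress:Truncated"}})],
)
```

Under these assumptions a node labeled with the full IPAddress and the Advertising purpose is denied, while the same node labeled IPAddress:Truncated is allowed through the exception. Because `allows` and `denies` are defined as each other's negation, every node gets exactly one verdict in this sketch.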

  24. Outline • Introduction • LEGALEASE • Goal • Syntax • Domain-Specific Attribute • Formal Semantics • Properties • GROK • Validation • Discussion • Conclusion

  25. LEGALEASE • Properties • Totality: a clause C either allows TG or denies it. • Unicity: C cannot both allow and deny the same TG. • Monotonicity: if C1 ⊑ C2, then for any TG, C1 allows TG implies C2 allows TG, and C2 denies TG implies C1 denies TG.

  26. Outline • Introduction • LEGALEASE • GROK • Validation • Discussion • Conclusion

  27. GROK • GROK System • Nodes are labeled with attributes • Different granularities • Confidence values

  28. GROK • Data Flow Edges and Labeling Nodes • Log Analysis: use logs to bootstrap the coarse-grained data flow graph. • Label file nodes with the InStore attribute and entity nodes with the AccessByRole attribute. (high confidence) • Label each job with the UseForPurpose attribute. (low confidence)
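
A minimal sketch of the log-analysis step follows. The log record layout, the `purpose_hint` table, and the rule that derives InStore from the first path component are all hypothetical; the slides do not describe GROK's actual log format or labeling heuristics.

```python
# Hypothetical job-log records (job, submitting entity, files read and written).
LOG = [
    {"job": "job1", "entity": "adsTeam", "reads": ["/logs/raw"], "writes": ["/ads/clicks"]},
    {"job": "job2", "entity": "abuseTeam", "reads": ["/ads/clicks"], "writes": ["/abuse/report"]},
]


def build_graph(log, purpose_hint):
    """Return (edges, labels) for a coarse-grained data flow graph.

    labels maps node -> {attribute: (value, confidence)}.
    """
    edges, labels = set(), {}
    for rec in log:
        job, entity = rec["job"], rec["entity"]
        edges.add((entity, job))
        # Entity nodes: AccessByRole, high confidence.
        labels[entity] = {"AccessByRole": (entity, "high")}
        # Job nodes: UseForPurpose is only a guess, so low confidence.
        labels[job] = {"UseForPurpose": (purpose_hint.get(entity, "Unknown"), "low")}
        for path in rec["reads"]:
            edges.add((path, job))
            labels[path] = {"InStore": (path.split("/")[1], "high")}
        for path in rec["writes"]:
            edges.add((job, path))
            labels[path] = {"InStore": (path.split("/")[1], "high")}
    return edges, labels


edges, labels = build_graph(LOG, {"adsTeam": "Advertising", "abuseTeam": "AbuseDetect"})
```

The resulting graph has file-to-job and job-to-file edges only, which is why later analyses are needed to refine it to column-level nodes.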

  29. Log Analysis

  30. GROK • Data Flow Edges and Labeling Nodes • Syntactic Analysis: label the DataType attribute by syntactically analyzing the source code of the jobs that read or wrote the data. (low confidence)

  31. Syntactic Analysis

  32. GROK • Data Flow Edges and Labeling Nodes • Semantic Analysis: Refine file nodes to a collection of column nodes. Refine job nodes to a sub-graph of nodes.

  33. Semantic Analysis

  34. GROK • Data Flow Analysis • Copy the DataType attribute of a node to all nodes that its data flows to. • Join two attributes that have the same confidence value. • If data flows through a UDF (user-defined function), check whether the typestate has been modified. If it has, assign a low confidence value.

  35. GROK • Verifying Labels • Attributes verified by developers are assigned a high confidence value. • [Diagram: a low-confidence attribute (e.g. low = IPAddress) is reverse-mapped to its source files; the developer of the highest-ranking source file is contacted; related source files carry related low-confidence attributes (low = IPAddress, low = UserAgent, …).]

  36. GROK • Implementation • The static semantic analyzer processes individual jobs from the cluster log into the nodes and edges of the data dependency graph, without attributes. • The GROK data flow analyzer collates all the graph nodes and applies syntactic analysis and conservative data flow analysis, producing a graph augmented with attributes.
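
The propagation rule on this slide can be sketched as a worklist algorithm. This is a simplified illustration under stated assumptions: labels are (datatype, confidence) pairs, every UDF is treated conservatively as if it might modify the typestate, and the join of equal-confidence attributes is not modeled. The function and parameter names are hypothetical.

```python
from collections import deque


def propagate(edges, seeds, udf_nodes):
    """Push DataType labels along data flow edges.

    edges: list of (src, dst) pairs
    seeds: node -> set of (datatype, confidence) pairs
    udf_nodes: nodes that are user-defined functions
    """
    labels = {node: set(pairs) for node, pairs in seeds.items()}
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    work = deque(seeds)
    while work:
        node = work.popleft()
        for nxt in successors.get(node, []):
            incoming = labels.get(node, set())
            if nxt in udf_nodes:
                # Conservative assumption: a UDF may change the typestate,
                # so everything flowing through it drops to low confidence.
                incoming = {(datatype, "low") for datatype, _ in incoming}
            current = labels.setdefault(nxt, set())
            if not incoming <= current:
                current |= incoming
                work.append(nxt)  # labels changed, so revisit downstream nodes
    return labels
```

Label sets only ever grow, so the loop terminates even on graphs with cycles.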

  37. Outline • Introduction • LEGALEASE • GROK • Validation • Discussion • Conclusion

  38. Validation • Scale • 100-day period, 77 thousand jobs each day, submitted by over 7 thousand entities in over 300 functional units. • 1.1 million unique lines of code, 21% of which change on a day-to-day basis.

  39. Validation • Coverage • [Figure labels: add manual verification; add dataflow analysis; simulate syntactic analyses on the real-world DDG.]

  40. Validation • Usability • Online survey • 12 participants drawn from Microsoft's privacy champions. • The majority of participants were able to use LEGALEASE to encode policy clauses.

  41. Validation • Expressiveness

  42. Outline • Introduction • LEGALEASE • GROK • Validation • Discussion • Conclusion

  43. Discussion • Expressiveness: LEGALEASE cannot express policies based on first-order temporal logic. However, LEGALEASE is expressive enough to capture privacy policies. • Inferring sensitive data: unless it is explicitly labeled, GROK cannot detect the inference of sensitive data from non-sensitive data. • Precision: the major source of imprecision is the overly conservative treatment of UDFs.

  44. Discussion • False negatives: the authors are unable to characterize the exact nature of false negatives in the system because there is no ground truth. • Assurance: the system cannot guarantee its results in the face of adversarial developer behavior.

  45. Outline • Introduction • LEGALEASE • GROK • Validation • Discussion • Conclusion

  46. Conclusion • Automated privacy compliance checking • LEGALEASE: states privacy policies as a form of restrictions on information flows. • GROK: a data inventory that maps low-level data types in code to high-level policy concepts. • Evaluation results show that • LEGALEASE is expressive enough to capture real-world privacy policies. • GROK can bootstrap the labeling of the graph with LEGALEASE attributes at massive scale.

  47. Thank you! Questions?