
Trust and Security in Biological Databases: Security When Collaborating

Learn about the importance of privacy protection and access control mechanisms in biological databases and the need for release checking to ensure the controlled release of private information.


Presentation Transcript


  1. Trust and Security in Biological Databases: Security When Collaborating. Gio Wiederhold, Depts. of Computer Science, Electrical Engineering, and Medicine, Stanford University, CA. www-db.stanford.edu/people/gio.html • Four related points will be made. They are not primarily technological, since the majority of failures we experience in protecting privacy are caused by misunderstanding of settings and objectives. • Protection of Privacy requires checking what goes OUT. • Access control mechanisms only keep bad guys from getting IN. • In bioinformatics and medicine there are many types of collaborators. • Collaborators are allowed in, but what they take out must be controlled. Gio AAAS 04

  2. 1 & 3. Protection of Privacy requires checking what goes OUT. Privacy requires that data considered private do not fall into inappropriate or public hands. Google view. Private data resides in a variety of systems used by a variety of collaborators • Medical record systems - holistic Caregivers and researchers • Drug toxicity and effectiveness studies Researchers and pharmas • Hospital and clinic admission records Caregivers and managers • Financial records Managers and accountants • Billing and payment information Accountants and payors Those participants have legitimate access rights. They don’t have the right to reveal the information they need. They don’t have the right to read related information. Gio AAAS 04

  3. (Slide diagram of collaborating parties: Accounting, Laboratory Accreditation, Laboratory staff, Clinics, Medical Research, Insurance Carriers, Inpatient Billing, Pharmacy, Ward staff, Patient, Physician, CDC, etc.) The complexity of biological/medical information disables fine-grained data access specifications • Data providers cannot predict information usage • Collaborations are dynamic, and grow when productive. Gio AAAS 04

  4. 2. Access control mechanisms keep bad guys from getting IN. Current solution -- SECURITY: keep bad guys OUT • Access control requires authentication and authorization • Collaborators and customers get into authorized areas • Once they are IN, no further checking occurs in computer systems • Further checking is done when physical assets are protected; examples: warehouses, even warehouse stores. (Slide diagram: collected contents sit behind an access control gate on the way in; a release filter on the way out is not done now in computer systems.) Why this omission? • Privacy is entrusted to security specialists and surrogates • Cryptographers: important tools, but they serve binary settings • Database administrators: valued for making data available • Network administrators: keep systems accessible to users. Gio AAAS 04

  5. 4. Collaborators are allowed in, but what they take out must be controlled. PRIVACY solution (release filter): more research needed, some startups • Symmetric checking: check access to information systems and also check the subsequent release of their contents • Act like a warehouse store • Check and/or remove restricted topics in outgoing documents • Researchers: names, employers, addresses, emails, . . . • Payors: other incidents, prior diseases, admissions, . . . • . . . : check specific contents for each collaborating and authorized role • Better: check that all terms in outgoing documents are acceptable, as in the sketch below • Use a topic-specific inclusive word/phrase list, and filter others • Most application/usage areas use fewer than 3000 terms • Paranoia is safest, and the cost is bearable • Trapped documents can be released by a security officer • Extract text from images, such as x-rays, and then check those texts • Many media contain unexpected private or identifying data. Gio AAAS 04
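A minimal sketch of the inclusive-list release filter described in the slide above, in Python; the approved-term vocabulary and the sample document are illustrative assumptions, not from the talk.

```python
# Hedged sketch: trap any outgoing document containing terms that are not on a
# topic-specific inclusive list; a security officer can later release it manually.
import re

APPROVED_TERMS = {  # hypothetical topic vocabulary; real lists run to ~3000 terms
    "cohort", "placebo", "dosage", "outcome", "trial", "mg", "daily",
}

def release_or_trap(document: str):
    """Return (released, offending_terms); unknown terms block release (paranoia is safest)."""
    words = {w.lower() for w in re.findall(r"[A-Za-z][A-Za-z'-]*", document)}
    offending = words - APPROVED_TERMS
    return (not offending, offending)

released, offending = release_or_trap("Placebo cohort outcome: dosage 5 mg daily for John Smith")
if not released:
    # Unrecognized terms ('for', 'john', 'smith') hold the document for manual review.
    print("trapped for review by a security officer:", sorted(offending))
```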

  6. Inference protection has the same duality. Approach: assure that each result cell is based on many instances, reducing the likelihood of successful inference • Access control: estimate the expected lowest counts for any possible query result, and provide only data limited to a safe hierarchical level • Release filter: use the actual cell counts; if inadequate, aggregate up the hierarchy until the count is adequate (see the sketch below) • Never absolutely secure, but the best one can do now? Gio AAAS 04
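A sketch of the release-filter side of this duality, assuming a simple geographic hierarchy and made-up record counts: actual cell counts are checked at release time, and cells are aggregated up the hierarchy until every count is adequate.

```python
# Hedged sketch: roll result cells up a hierarchy until every cell is backed by
# enough instances; the hierarchy, threshold, and records are illustrative.
from collections import Counter

MIN_COUNT = 10                              # minimum instances per released cell
HIERARCHY = ["zip", "county", "state"]      # most specific -> most general

# Hypothetical records: (zip, county, state) for patients sharing some condition.
records = [("94305", "Santa Clara", "CA")] * 4 + [("94040", "Santa Clara", "CA")] * 7

def release_counts(records, level=0):
    """Return (granularity, counts) at the finest level whose cells are all adequate."""
    counts = Counter(r[level] for r in records)
    if all(c >= MIN_COUNT for c in counts.values()):
        return HIERARCHY[level], dict(counts)
    if level + 1 < len(HIERARCHY):
        return release_counts(records, level + 1)   # aggregate up the hierarchy
    return None, {}                                 # even the coarsest cells are too small

print(release_counts(records))   # ('county', {'Santa Clara': 11}): zip-level cells were too small
```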

  7. Release checking can also protect privacy in commercial domains. Customers are collaborators; you want customers IN, not OUT. In simple and perfect systems they cannot access private areas, but • System failures - trap doors, etc. - abound • Release checking provides a backstop and intrusion detection • Updates for customer convenience create unexpected interactions • Helpful query modification broadens access • New usages were not foreseen during design partitioning • Customer access to inventory for rapid supply-line verification • New, unthought-of collaborators -- Russians in Kosovo needed access to information. Techniques -- much content has signatures that are (nearly) unique, as in the sketch below • Check to stop credit card numbers in outgoing data, as from music sites • Check to stop email addresses in outgoing reports • Don't rely exclusively on access control when the objective is to protect the release of private information! Gio AAAS 04
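A sketch of such a signature check on outgoing data, in Python; the regular expressions are rough illustrations, not production-grade detectors.

```python
# Hedged sketch: credit-card numbers and e-mail addresses have distinctive shapes,
# so a release filter can flag outgoing content that should not leave the system.
import re

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")         # rough card-number shape
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")  # rough e-mail shape

def outgoing_is_suspicious(payload: str) -> bool:
    """True if the outgoing payload carries restricted signatures."""
    return bool(CARD_RE.search(payload) or EMAIL_RE.search(payload))

print(outgoing_is_suspicious("track #8841 shipped"))            # False
print(outgoing_is_suspicious("4111 1111 1111 1111 exp 09/26"))  # True
print(outgoing_is_suspicious("contact alice@example.org"))      # True
```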

  8. Abstract: Security when Collaborating. Panel presentation on "Trust and Security in Biological Databases"; Gio Wiederhold, Ph.D., Stanford University, CA

Traditional security mechanisms have focused on access control, assuming that we can distinguish the good guys from the bad guys, and can label any data collection as being accessible to the good guys. If those assumptions hold, the technology is conceptually simple, and only made hard by technical faults. However, there are many practical situations where such sharp distinctions cannot be made, so that the technologies developed for access control become inadequate. In medicine, but also in many commercial data collections, we find unstructured data. Such data are collected and stored without the submitter being fully aware of their future use, and hence unable to consider all future access needs. A complementary technology to augment access control is result filtering: namely, inspecting the contents of documents before they leave the boundary of the protected system.

I will briefly cite the issue in two settings, one simple and one more complex. Military documents have long been classified into mandatory and discretionary classifications, and legitimate accessors are identified with respect to those categories. But when a new situation arises, the old labels are inadequate. When we had to share information with the Russians in Kosovo, no adequate labeling existed, and relabeling all stored documents was clearly impractical. A filter can be written to check the text for limited, locally relevant contents, and make those available. Any document containing unrecognized noun-phrases would be withheld, or could be handed over to a security officer for manual processing (a sketch of such a check follows this abstract).

More complex situations occur when we have statistical data, as in a census or, as in bioinformatics, phenotypic and genomic data. We want to prevent the release of statistical summaries for cells that have fewer than, say, 10 instances, to reduce the likelihood of inference back to an individual. If we use access control, we have to precompute the minima for columns and rows and aggregate their categorizations for access to prevent release. However, the distributions in those cells are very uneven. So if we check the actual contents at the time of release, we can allow much smaller categories to be used for access and only omit or aggregate cells that are too small. Checking results being released can also provide a barrier against credit-card theft and the like. If a person who masquerades as a customer locates a trapdoor and removes 10,000 credit cards instead of an MP3 tune, that can easily be recognized, since those data have very different signatures.

In summary, many of our accessors are collaborators or customers, although we know little about them. We want to give them the best possible service, and still protect our property or the privacy that individuals are trusting us to keep. Focusing only on access control, and then not checking what is released, is an inadequate, even naive, approach for systems involving collaboration.

Research leading to these concepts and supporting technologies was supported by NSF under the HPCC and DL2 programs. Gio AAAS 04
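A sketch of the noun-phrase check mentioned in the abstract, assuming spaCy and its en_core_web_sm model are installed; the list of locally relevant phrases is a made-up placeholder.

```python
# Hedged sketch: withhold any document whose noun phrases fall outside a small,
# locally relevant vocabulary; withheld documents go to a security officer.
import spacy

nlp = spacy.load("en_core_web_sm")                                 # assumes the model is installed
KNOWN_PHRASES = {"the convoy", "the bridge", "the supply route"}   # hypothetical local list

def withhold(document: str) -> bool:
    """True if the document contains an unrecognized noun phrase."""
    return any(chunk.text.lower() not in KNOWN_PHRASES
               for chunk in nlp(document).noun_chunks)

print(withhold("The convoy crossed the bridge."))            # expected: False, release
print(withhold("The convoy carries classified payloads."))   # expected: True, hold for review
```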

  9. Trust and Security in Biological Databases: Brief Biography. Security when Collaborating; Gio Wiederhold, Stanford University, CA. Gio Wiederhold is an emeritus professor of Computer Science, Medicine and Electrical Engineering at Stanford University. Since 1976 he has supervised 33 PhD theses in these departments. Currently Gio is continuing part time at Stanford and consulting. He still offers seminars on Business on the Internet and on Genome Databases. Research being disseminated includes privacy protection in collaborative settings, large-scale software composition, and enabling interoperation of semantically heterogeneous information systems, including simulations for projecting outcomes. His consulting now focuses on valuation of intellectual property inherent in software. Gio Wiederhold was born in Italy, received a degree in Aeronautical Engineering in Holland in 1957 and a PhD in Medical Information Science from the University of California at San Francisco in 1976. Prior to his academic career he spent 16 years in the software industry. Wiederhold has authored and coauthored more than 350 publications and reports on computing and medicine. He spent 1991-1994 in Washington as a program manager at DARPA. Wiederhold has been elected a fellow of the ACMI, the IEEE, and the ACM. His web page is http://www-db.stanford.edu/people/gio.html. Information about protecting the release of private information can be found at http://www-db.stanford.edu/pub/gio/TIHI/TIHI.html. Gio AAAS 04

  10. Mark Schildhauer, KNB (Knowledge Network for Biodiversity) • CEED (Caveat Emptor Ecology DB) repository - ecoinformatics • Sharing is essential (NIH site) • [Long-term ecological research network] • Vast scope of ecology - virus spreading, killer bees • Just now moving from individual research records to repositories • What happens when a researcher dies? • Remove location info on sensitive protected species, such as rhinos and abalone • Trust is needed. Concerns: • fear of being scooped (less of a concern for mature scientists) • fear of being misinterpreted • fear of your mistakes being discovered • Whose data are they anyhow? IP rights • Developing flexible access controls; new technology: PKI, authentication, etc.; need metadata. Gio AAAS 04

  11. Latanya Sweeney: Must do better with genomic data; failures there could cause much damage to science. Data Privacy Lab: 1. data detectives - discover linkages; 2. data protection, based on experiences. Example: accessing a video tape that is still protected - ad hoc methods with face masking and pixelation. (Peter Denning?) Biosurveillance data passed through PrivaCert -- NY used discharge data and voter lists; changed to monthly dates and higher-level disease codes. Work by Bradley Malin: split genotypes and phenotypes; the data can still be used for analysis. Trails attack problem - the many institutions seen on the path to tertiary care can be reconstructed. <publish in TODS, not in World Scientific> The Iceland DB has a data protection committee and used encrypted links. Gent model (Belgium) - semi-trusted third party with double encryption. Gio AAAS 04

  12. Separate data from identity (don't do that by field); addresses the trails attack, but is vulnerable to a family attack. Utah - GeneTrustee system - vulnerable to inference and trails attacks. Have to understand the real . . . Gio AAAS 04

  13. Maria Zemankova, NSF: 'Privacy', 'Security', and 'Trust' are poorly defined in an operational sense. Medical informatics: a diagnosis aid - Addison's disease? (who is Addison?) Rights to information regarding contagious diseases; locality-related diseases (Mad Cow). <Some actress - Barbra Streisand? - lost her privacy case versus the California coastal survey> Rakesh Agrawal, VLDB 2002. Look at grants and awards at NSF.GOV. Gio AAAS 04

  14. What are the key issues for biological DBs? How pressing are they? What are your suggestions for addressing them? What are the implications of not addressing them? What advice do you have for the audience? Consider the FERPA law about releasing information about student participation. Gio AAAS 04
