
Data Quality is Bad? Deal With It



  1. Data Quality is Bad? Deal With It Dennis Shasha, New York University

  2. Data Quality Problem –challenges • Two companies merge or two divisions want to share data. Problem: identify common customers even though their names are spelled differently (work with Bellcore/Telcordia colleagues: Munir Cochinwala, Verghese Kurien, and Gail Lalk) • Real-time sensor network. Problem: sensors fail; want to avoid false alarms (work with physicist Alan Mincer and student Yunyue Zhu)

  3. My Approach • Let’s look at fields that have dealt with data quality problems for years, though they consider these problems part of business as usual. • We will ask: what do these fields do and how might that help us?

  4. Data Quality Problem – biology • Take two genetically identical plants, treat them in the same way, and measure the RNA expression levels. You get vastly different results. • Differences increase if experiments are done in different labs or by different people in the same lab. • Even breathing can be dangerous… • Goal: find causal relationships among genes.

  5. What Can One Do? • One way to tease out causality is to perform a time series experiment on closely spaced time points. • Want close spacing to be able to say gene expression level at time t depends on gene expression levels at t-1. • Start with noise-free model.

  6. Noise-free Modeling of Transcriptome Time Series Data • [Figure: a time series t, t+1, …, t+4 of expression levels for genes zi through zk; red squares mark the transition function f to be learned from each time point to the next.] • Explain target gene expression as a function of up to 4 input TFs (transcription factors). Krouk et al 2010 submitted [19]

  7. Modeling Noise (poor quality) • There is reason to believe that Gaussian noise is a decent model of the inconsistencies in biological replicates. • So model the relationship between observations and “true” value by a Gaussian noise component. • We’ll see whether this is a good idea or not.
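
The setup of slides 6 and 7 can be sketched as a state-space model: a deterministic transition function f carries the true expression levels forward, and each observation is the true value plus Gaussian noise. The linear dynamics and noise level below are illustrative stand-ins, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(f, z0, steps, noise_sd=0.1):
    """Simulate hidden expression levels z(t+1) = f(z(t)) and
    noisy observations y(t) = z(t) + Gaussian noise."""
    z = z0
    ys = []
    for _ in range(steps):
        ys.append(z + rng.normal(0.0, noise_sd, size=z.shape))
        z = f(z)
    return np.array(ys)

# A toy linear transition as a stand-in for the learned function f.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
ys = simulate(lambda z: A @ z, np.array([1.0, 0.5]), steps=6)
```

Learning then means fitting f (and the noise level) so that the noisy observations y(t) are predicted well at held-out time points.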

  8. Transcriptome Data Set – time series • Measurements taken at 0, 3, 6, 9, 12, 15, and 20 min. • Noisy model: a dynamic model f carries the hidden state Z(t) to Z(t+1); an observation model g (Gaussian noise, the black box in the original figure) maps Z(t) to the observed Y(t). • “Leave-out-last” test: train on 0–15 min, then predict the direction of change of each gene at 20 min: 71% correct. • Naive “trend-forecast” baseline on the same test: 51% correct. Krouk et al 2010 submitted [19]

  9. Test and Adaptation • Test the model by predicting values at a time point not used in the training. • Predictions are not generally perfect, so adaptation is to figure out which other time points to test. • One way to do this is to perform the training and testing process with one fewer experiment. If the most critical experiment is at time t, then gather more data at time t.
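
The adaptation step on slide 9 (retrain with one fewer experiment to find the most critical time point) can be sketched as a leave-one-experiment-out loop. The helper `train_and_score` is a hypothetical callback standing in for the full training-and-testing pipeline.

```python
def most_critical_timepoint(timepoints, train_and_score):
    """Drop each experiment in turn, retrain, and return the time point
    whose removal hurts prediction accuracy the most -- the best place
    to gather more data.

    train_and_score(used_timepoints) -> accuracy in [0, 1] (hypothetical
    stand-in for the real training/testing process)."""
    baseline = train_and_score(timepoints)
    losses = {t: baseline - train_and_score([u for u in timepoints if u != t])
              for t in timepoints}
    return max(losses, key=losses.get)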

  10. Lessons from Network Inference • The objective is predictive power. • Use the training set to train noise model and causal relationships among the genes. • If predictions work out, then good. • Modeling data quality is part of the learning problem.

  11. Physics -- supernovas • Look at the sky and observe showers of gamma particles. • Model the background as a Poisson process. • Look for exceptionally high bursts (these can last seconds, minutes, hours, up to days). • Aim telescopes at the appropriate part of the sky.

  12. Astrophysical Application • Motivation: in astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. An unusual burst may signal an event interesting to physicists. • Technical overview: the sky is partitioned into 1800 × 900 buckets, and 14 sliding window lengths are monitored, from 0.1 s to 39.81 s.
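
The detection scheme above can be sketched as a sliding-window count compared against a Poisson tail threshold: under the background model, a window of length w has a Poisson(λw) count, and a window is flagged when its count is so high that the background would produce it with probability below some tiny α. The rate, window length, and α below are illustrative, not the experiment's actual values.

```python
import math

def poisson_threshold(lam, alpha=1e-6):
    """Smallest count c with P[Poisson(lam) >= c] < alpha."""
    c = 0
    p = math.exp(-lam)   # pmf at c
    cum = p              # cdf up to c
    while 1.0 - cum >= alpha:
        c += 1
        p *= lam / c
        cum += p
    return c + 1

def find_bursts(counts, lam, window, alpha=1e-6):
    """Slide a window over per-tick counts; flag window start indices whose
    total exceeds the Poisson tail threshold for background rate lam * window."""
    thresh = poisson_threshold(lam * window, alpha)
    total = sum(counts[:window])
    bursts = []
    for i in range(len(counts) - window + 1):
        if i > 0:  # advance the window by one tick
            total += counts[i + window - 1] - counts[i - 1]
        if total >= thresh:
            bursts.append(i)
    return bursts
```

In the real system this check runs simultaneously for all 14 window lengths over every sky bucket.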

  13. Physics -- adaptation • A burst is only the first filter for detecting a supernova. • If certain kinds of bursts (e.g. 10 second long bursts) lead to false positives often, then adjust the thresholds.

  14. Physics -- lessons • Once again the noise model is an integral part of the problem setting. • Adaptation is ongoing (no fixed training set). • Because physicists are looking for a single piece of information, e.g. there is a supernova at location X,Y, redundancy can overcome noise.

  15. Drug Testing • Give N patients a drug and N patients a placebo. • This is a classic “data quality”/”biological variation” situation. Different patients will react differently to a drug and almost all patients will benefit from a placebo. • Two questions: is the drug better than the placebo and how much?

  16. Drug Testing -- Resampling • Suppose you arrange the results in a table (patient id, drug/placebo, improvement). • Compute the average improvement for the drug population. • Evaluate significance using a permutation test. • Evaluate the size of the benefit using confidence intervals. • Neither technique requires assumptions about the underlying distribution.

  17. Typical table

  Improvement   Drug/Placebo
  10            Drug
  12            Placebo
  8             Drug
  -3            Placebo
  20            Drug
  4             Placebo

  Drug mean improvement: 38/3; Placebo: 13/3

  18. One Permutation of the table

  Improvement   Drug/Placebo
  10            Drug
  12            Placebo
  8             Drug
  -3            Placebo
  20            Placebo
  4             Drug

  Drug mean improvement: 22/3; Placebo: 29/3

  19. Significance Test – is the drug’s apparent effect due to luck? • count = 0 • do 10,000 times: permute the drug/placebo column; recompute the improvement under the permutation; if the recomputed improvement >= the measured improvement in the real test, then count += 1 • P-value = count/10,000: the chance that the apparent improvement was due to luck.
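
The permutation test above is short enough to write out in full. This is a minimal sketch: the drug-minus-placebo difference in mean improvement serves as the test statistic, and the labels are shuffled while the improvement values stay fixed.

```python
import random

def permutation_test(table, trials=10_000, seed=0):
    """table: list of (improvement, label) rows, label 'Drug' or 'Placebo'.
    Returns the p-value: the fraction of label shufflings whose
    drug-minus-placebo mean difference is at least the observed one."""
    rng = random.Random(seed)
    vals = [v for v, _ in table]
    labels = [lab for _, lab in table]

    def diff(labs):
        drug = [v for v, l in zip(vals, labs) if l == 'Drug']
        plac = [v for v, l in zip(vals, labs) if l == 'Placebo']
        return sum(drug) / len(drug) - sum(plac) / len(plac)

    observed = diff(labels)
    count = 0
    for _ in range(trials):
        rng.shuffle(labels)  # break any real drug/placebo association
        if diff(labels) >= observed:
            count += 1
    return count / trials
```

Applied to the six-patient table from the talk, only 3 of the 20 possible label assignments reach the observed gap, so the p-value comes out near 0.15: far too few patients to call the effect significant.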

  20. Confidence interval – what’s a good estimate of the drug’s benefit? • do 10,000 times: take 2N rows from the original table with replacement; compute the improvement • Sort the 10,000 improvement scores and take the 95% confidence interval as the 250th through the 9,750th score.
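
The bootstrap procedure above, sketched under the same table layout as the permutation test (rows of improvement and label, with the drug-minus-placebo mean difference as the quantity of interest):

```python
import random

def bootstrap_ci(table, trials=10_000, seed=0):
    """Resample the (improvement, label) rows with replacement and report
    the 2.5th-97.5th percentile range of the drug-minus-placebo mean
    difference over all resamples."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        sample = [rng.choice(table) for _ in range(len(table))]
        drug = [v for v, l in sample if l == 'Drug']
        plac = [v for v, l in sample if l == 'Placebo']
        if drug and plac:  # skip degenerate resamples with one group empty
            diffs.append(sum(drug) / len(drug) - sum(plac) / len(plac))
    diffs.sort()
    lo = diffs[int(0.025 * len(diffs))]
    hi = diffs[int(0.975 * len(diffs))]
    return lo, hi
```

Because the interval comes from the resampled data itself, no distributional assumption (e.g. normality) is needed.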

  21. Lessons from Drug Testing • Assume different patients can react differently. • Is the drug benefit significant? • How much of a benefit does it have? • Lesson: questions are simple; individual noise is overcome with redundancy.

  22. Data Quality Problem – adversaries • A farmer in the developing world wants to do a banking transaction. • The bank has appointed the shopkeeper the bank agent. The shopkeeper will call the bank over an insecure phone line. • The farmer doesn’t know whether the shopkeeper is truly honest and even whether messages can be intercepted and mangled (poor quality due to adversary).

  23. Basic Solution • The bank provides a collection of (essentially) one-time nonces and one-time pads to each of the farmer and the shopkeeper ahead of time. • Per transaction: the farmer and the shopkeeper each send a one-time nonce and a message to the bank stating the amount of the transaction. • The bank verifies their identities via the nonces, and the farmer and shopkeeper verify the amounts via the one-time pads.
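
A minimal sketch of the farmer's side of this protocol, assuming hex-string nonces and XOR one-time pads; the helper names and the fixed-width amount encoding are illustrative, not part of the talk.

```python
import secrets

def make_credentials(n, msg_len=8):
    """Bank pre-issues n (one-time nonce, one-time pad) pairs."""
    return [(secrets.token_hex(8), secrets.token_bytes(msg_len))
            for _ in range(n)]

def xor(data, pad):
    """Encrypt/decrypt under a one-time pad (XOR is its own inverse)."""
    return bytes(a ^ b for a, b in zip(data, pad))

# The bank issues credentials; the bank and the farmer each hold a copy.
farmer_creds = make_credentials(10)
bank_table = dict(farmer_creds)  # nonce -> pad, as the bank would store it

# Farmer side: send the next unused nonce plus the amount under its pad.
nonce, pad = farmer_creds.pop(0)
message = (nonce, xor(b'00042.00', pad))

# Bank side: the nonce identifies the farmer and proves freshness; popping
# it from the table makes any replay of this message fail.
recv_nonce, ciphertext = message
amount = xor(ciphertext, bank_table.pop(recv_nonce))
```

The shopkeeper runs the same exchange with his own credentials, and the bank accepts the transaction only when both reported amounts agree.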

  24. “Quality Issues” this Solves • Replay is impossible because nonces are one-time. • Mangling will be detected because of one-time pads. • False confederates and hacking of telephone network will be detected thanks to one-time pads. • Even a determined adversary can be overcome. Never mind a little random noise.

  25. Application – record matching • Develop a noise model: how are sounds misheard or symbols mistyped? • Develop a training set having correct outcomes but also metadata properties (e.g. who took the information and when it was taken), in case the noise characteristics/probabilities depend on them. • Model the cost of errors vs. the cost to clean.

  26. Application – sensor reading • Be conscious of what the goals of the sensor are, e.g. fire/no fire; earthquake/no earthquake. • Use burst detection to locate possibly troublesome sensors in quiet times. • Error model is key: could there be an adversary? Can you use non-parametric stats?

  27. Lessons • Data quality problems (i.e. noise or adversarial attacks) are an everyday occurrence in many fields. • First lesson: model the amount of noise and design system to answer critical question (e.g. what is causal network, is drug effective, where is supernova) in spite of noise.

  28. More Lessons • Second lesson: If you can design for an adversary, then get noise correction for free. • Third lesson: Use the meta-data to try to localize bursts of errors to try to shut down the reason for noise.
