
Big Data: Size, Complexity and Analytics

Nicoleta Serban, PhD, Associate Professor, H. Milton Stewart School of Industrial & Systems Engineering, Georgia Institute of Technology.


Presentation Transcript


  1. Big Data: Size, Complexity and Analytics
     Nicoleta Serban, PhD, Associate Professor
     H. Milton Stewart School of Industrial & Systems Engineering
     Georgia Institute of Technology

  2. What is Big?
     Size or Quantity
     • Gigabyte (10^9 bytes) vs. Terabyte (10^12 bytes) vs. Petabyte (10^15 bytes) vs. Exabyte (10^18 bytes)
     Complexity or Heterogeneity
     • Dependencies: temporal, spatial or network
     • Randomness: sampling scheme
     • High dimensionality: multiple features
     • Depth: multiple hierarchies
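
To make the size scale on slide 2 concrete, here is a minimal sketch of the four units. It assumes the decimal (SI) convention, which the transcript does not confirm; under binary units (2^30, 2^40, ...) the values differ by a few percent. The names `UNITS` and `humanize` are illustrative only.

```python
# Decimal (SI) byte units. The decimal convention is an assumption;
# the slide transcript does not say which convention was intended.
UNITS = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
}


def humanize(n_bytes: int) -> str:
    """Express a byte count in the largest listed unit that does not exceed it."""
    best_name, best_size = "byte", 1
    for name, size in UNITS.items():
        if best_size < size <= n_bytes:
            best_name, best_size = name, size
    return f"{n_bytes / best_size:.2f} {best_name}s"


if __name__ == "__main__":
    # Each step up the scale is a factor of 1,000.
    for name, size in UNITS.items():
        print(f"1 {name} = 10^{len(str(size)) - 1} bytes")
    print(humanize(2 * 10**12))
```

Each unit is 1,000 times the previous one, so an exabyte is a billion gigabytes.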

  3. Why Size Matters?
     Infrastructure for managing information:
     • Storage – relational database vs. distributed systems vs. cloud computing
     • Retrieval – random vs. sequential access
     • Representation – level of knowledge vs. derivation of features
     • Safeguards – protection of privacy and confidentiality

  4. Why Complexity Matters?
     Translation of information to data to knowledge:
     • Infrastructure – supercomputers vs. distributed computers
     • Computation – single-threaded vs. parallelizable computational methods
     • Analytics – exponentially growing number of hypotheses
     • Inference – the dangers of ‘blind’ data mining vs. mathematical rigor
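
One way to read the "exponentially growing number of hypotheses" point on slide 4 is sketched below. It is an illustration, not the speaker's method. It assumes each non-empty subset of p features defines one candidate hypothesis, that tests are independent, and that the significance level is 0.05. The Bonferroni correction is a standard remedy that the slides do not name. All function names are hypothetical.

```python
def subset_hypotheses(p: int) -> int:
    """Number of non-empty feature subsets among p features: 2^p - 1."""
    return 2**p - 1


def familywise_error(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across m independent
    tests, each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_level(m: int, alpha: float = 0.05) -> float:
    """Per-test level that keeps the family-wise error rate at or below alpha."""
    return alpha / m


if __name__ == "__main__":
    for p in (5, 10, 20):
        m = subset_hypotheses(p)
        print(p, m, round(familywise_error(m), 4), bonferroni_level(m))
```

With only 10 features there are already 1,023 candidate subsets, and testing them all uncorrected makes a false positive nearly certain. This is the risk the slide calls 'blind' data mining.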

  5. Data Science Framework • Data • Representation • Sampling • Information • Infrastructure • Management • Decisions • System engineering • Knowledge • Computation • Tools • Data architectures • Data integration, sharing and federation • Data privacy rules • Data wrangling • Deriving hypotheses • Validating hypotheses • Eliciting causal relations • Designing, planning, and optimizing • Testing, ranking, scoring • System dynamics • Data mining • Machine learning • Statistical inference • Network analysis • Simulations • Visualization

  6. A Proof of Concept: Medicaid Project
     • Information:
       • Identifiable patient-level claims data
       • 5 years + 14 states = 266,839,307,070 observations
       • 2 terabytes of information
     • Data:
       • Represented as patient care trajectories: utilization, cost and patient characteristics
       • Sampled by disease
     Challenge #1: HIPAA and CMS data safeguards compliance
       - data environment: access, sharing, linking, storage
     Challenge #2: Database backbone
       - projected research needs
       - projected computational needs
     Challenge #3: Data processing
       - unavailability of tools to process-mine claims
       - additional data and information needs
       - expert opinion & collaborations
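
The two figures on slide 6 can be related with a quick back-of-the-envelope calculation. The sketch assumes a decimal terabyte (10^12 bytes) and that both figures describe the same stored data set, neither of which the slide confirms. The chunk size of 10 million rows and the helper `chunks_needed` are hypothetical, chosen only for illustration.

```python
N_OBSERVATIONS = 266_839_307_070   # from the slide
TOTAL_BYTES = 2 * 10**12           # "2 terabytes", assuming decimal units

# Average storage per observation, about 7.5 bytes under these assumptions.
bytes_per_observation = TOTAL_BYTES / N_OBSERVATIONS


def chunks_needed(n_rows: int, rows_per_chunk: int) -> int:
    """Number of fixed-size chunks needed to cover n_rows (ceiling division)."""
    return -(-n_rows // rows_per_chunk)


if __name__ == "__main__":
    print(f"{bytes_per_observation:.2f} bytes per observation")
    # Hypothetical chunk size: 10 million rows per sequential pass.
    print(chunks_needed(N_OBSERVATIONS, 10_000_000), "chunks")
```

An average of a few bytes per observation suggests that an "observation" here is a single recorded value rather than a full claim record, though the slide does not define the term.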

  7. Medicaid Project: Health Analytics
     • Data:
       • Condition: Pediatric Asthma
       • Baseline Metrics
       • Care Pathway
       • Access & Outcomes
     • Knowledge:
       • Systematic disparities in access, outcomes and cost
       • Network of providers
       • Profiles of patient-level care pathways
     Process Mining Spatial Statistical Models Functional Data Analysis Unsupervised classification Sequence clustering Markov-decision processes Optimization

  8. Medicaid Project: Health Analytics
     • Knowledge:
       • Systematic disparities in access, outcomes and cost
       • Network of providers
       • Profiles of patient-level care pathways
     • Decision Making:
       • Policy interventions
       • Network interventions
     Markov-decision processes Causal Inference Optimization Modeling Simulations

  9. Medicaid Project: Resources
     • Legal Process & CMS Approval (~2 yrs)
     • Costly IT infrastructure implementation
     • Extensive IT support
     • Constrained computing infrastructure
     • Large team of students
     • Funding & Deliverables
     • Visibility

  10. Medicaid Project: Opportunities
      • Developing a proof of concept for building larger infrastructures for protected information
      • Becoming the center for deployment of tools for mining claims data
      • Advancing rigor in health analytics
      • Educating students and visiting researchers
      • Informing policy making in understanding and managing the healthcare system

  11. Acknowledgements
      Co-Principal investigator: Dr. Swann
      Supporting Institutes and Organizations
      • National Science Foundation (CAREER Award)
      • Institute for People and Technology
      • Children’s Healthcare of Atlanta
      Research Team
      IT Staff: Matthew Sanders and Paul Diederich
      Postdoctoral fellow: Dr. Monica Gentili
      Undergraduate students: Yuchen Zheng, Alex Terry, Pravara Harati, Qiming Zhang, Sean Monahan
      Graduate students: Kevin Johnson (MS), Erin Garcia, Ben Johnson, Zihao Li, Ross Hilton

  12. Contact Us
      Nicoleta Serban: nserban@isye.gatech.edu
      Julie Swann: jswann@isye.gatech.edu
