KDDRG Research Projects

KDDRG Research Projects. Prof. Carolina Ruiz (ruiz@cs.wpi.edu), Department of Computer Science, Worcester Polytechnic Institute. Some current analytical data mining research projects at WPI. Mining complex data (set and sequence mining): systems performance data, sleep data, financial data.


Presentation Transcript


  1. KDDRG Research Projects
     • Prof. Carolina Ruiz, ruiz@cs.wpi.edu
     • Department of Computer Science, Worcester Polytechnic Institute

  2. Some Current Analytical Data Mining Research Projects at WPI
     • Mining Complex Data: Set and Sequence Mining
       • Systems Performance Data
       • Sleep Data
       • Financial Data
       • Web Data
     • Data Mining for Genetic Analysis
       • Correlating genetic information with diseases
       • Predicting gene expression patterns
     • Data Mining for Electronic Commerce
       • Collaborative and Content-Based Filtering
       • Using Association Rules and using Neural Networks

  3. Analyzing Sleep Data
     • Purpose:
       • Associations between sleep patterns and health/pathology
       • Obtain patterns of the different sleep stages (4 sleep stages + REM + Wake)
     • Data set:
       • Clinical (sequential)
         • Electro-encephalogram (EEG)
         • Electro-oculogram (EOG)
         • Electro-myogram (EMG)
         • Probe measuring the flow of oxygen in blood, etc.
       • Diagnostic (tabular)
         • Questionnaire responses
         • Patient’s demographic info
         • Patient’s medical history
       (Source: http://www.blsc.com)
     • Potential rules:
       • (A) Association Rules
         • (Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy; confidence = 92%, support = 13%
       • (B) Classification Rules
         • (snoring = HEAVY) & (AHI* > 30/hour): severe OSA** => (Race = Caucasian); confidence = 70%, support = 8%
     • *AHI = Apnea-Hypopnea Index, **OSA = Obstructive Sleep Apnea
     • WPI, UMass Medical, BC
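The support and confidence figures quoted for the rules on this slide follow the usual association-rule definitions: support is the fraction of all records that satisfy both sides of the rule, and confidence is the fraction of records satisfying the left-hand side that also satisfy the right-hand side. The sketch below is only an illustration of those two definitions. The field names (`sleep_latency_min`, `hereditary_disorder`, `diagnosis`) and the five patient records are invented for the example and are not taken from the sleep study.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def support_and_confidence(
    records: List[Record],
    antecedent: Callable[[Record], bool],
    consequent: Callable[[Record], bool],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent."""
    n_total = len(records)
    n_antecedent = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Invented toy records; NOT data from the study described on the slide.
patients: List[Record] = [
    {"sleep_latency_min": 2, "hereditary_disorder": True, "diagnosis": "Narcolepsy"},
    {"sleep_latency_min": 1, "hereditary_disorder": True, "diagnosis": "Narcolepsy"},
    {"sleep_latency_min": 2, "hereditary_disorder": True, "diagnosis": "None"},
    {"sleep_latency_min": 15, "hereditary_disorder": False, "diagnosis": "OSA"},
    {"sleep_latency_min": 20, "hereditary_disorder": True, "diagnosis": "None"},
]

# Rule shape from the slide: (sleep latency < 3 min) & (hereditary disorder) => Narcolepsy
support, confidence = support_and_confidence(
    patients,
    antecedent=lambda r: r["sleep_latency_min"] < 3 and bool(r["hereditary_disorder"]),
    consequent=lambda r: r["diagnosis"] == "Narcolepsy",
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

On the toy records, three patients match the left-hand side and two of them are narcoleptic, so the rule has support 2/5 and confidence 2/3.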

  4. Input Data
     • Each instance: [Tabular | set | sequential]* attributes

            attr1      attr2       attr3  attr4   attr5   [class]
            illnesses  heart rate  age    oxygen  gender  Epworth
       P1
       P2
       P3
       …
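One way to picture an instance that mixes tabular, set, and sequential attributes is as a record whose values may be scalars, sets, or ordered lists. The sketch below is an assumption about how such an instance could be represented, not the representation used in the project. The `Instance` class, the `attribute_kind` helper, the assignment of kinds to the attributes named on the slide, and all values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Union

# A value is a scalar (tabular), a set of items, or an ordered sequence of readings.
AttributeValue = Union[int, float, str, FrozenSet[str], List[float]]


@dataclass
class Instance:
    attributes: Dict[str, AttributeValue] = field(default_factory=dict)
    label: Optional[str] = None  # optional class attribute


def attribute_kind(value: AttributeValue) -> str:
    """Classify a value as 'set', 'sequential', or 'tabular'."""
    if isinstance(value, (set, frozenset)):
        return "set"
    if isinstance(value, (list, tuple)):
        return "sequential"
    return "tabular"


# Hypothetical patient P1; the values are invented for illustration.
p1 = Instance(
    attributes={
        "illnesses": frozenset({"hypertension", "asthma"}),  # assumed set-valued
        "heart_rate": [72.0, 75.0, 71.0],                    # assumed sequential
        "age": 54,                                           # tabular
        "oxygen": [0.97, 0.95, 0.96],                        # assumed sequential
        "gender": "F",                                       # tabular
    },
    label="high",  # assumed discretized Epworth class
)

kinds = {name: attribute_kind(value) for name, value in p1.attributes.items()}
print(kinds)
```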

  5. Analyzing Financial Data
     • Sequential data: daily stock values
     • “Normal” (tabular/relational) data
       • sector (computers, agricultural, educational, …), type of government, product releases, company awards, …
     • Desired rules:
       • If DELL’s stock value increases & 1999 < year < 2002 => IBM’s stock value decreases

  6. Events: Financial Data
     • Basic events: 16 or so financial templates [Little&Rhodes78]
     • Difficult pattern matching: alignments and time warping
     • Example templates:
       • Panic Reversal
       • Head & Shoulders Reversal
       • Rounding Top Reversal
       • Descending Triangle Reversal
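The slide names alignment and time warping as the hard part of matching a price series against a template. The slides do not say which alignment method the group used; as a stand-in, the sketch below implements classic dynamic time warping (DTW), a standard technique for comparing sequences that are stretched or compressed in time. The function name `dtw_distance` and the three short series are made up for the example.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] repeats against b[j-1]
                cost[i][j - 1],      # b[j-1] repeats against a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


# Invented series: a peak-shaped template, a time-stretched copy, and a valley.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
stretched = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
valley = [3.0, 2.0, 1.0, 2.0, 3.0]

print(dtw_distance(template, stretched))
print(dtw_distance(template, valley))
```

The stretched copy aligns with the template at zero cost because DTW lets one point match several consecutive points, while the valley-shaped series cannot be warped into the peak and gets a positive distance.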

  7. WPI Weka
     • Tool for mining complex temporal/spatial associations

  8. Data Mining for Genetic Analysis
     w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)
     • SNP analysis
       • discovering correlations between sequence variations and diseases
     • Gene expression
       • discovering patterns that cause a gene to be expressed in a particular cell

  9. Correlating Genetics with Diseases
     • Utilize data mining techniques with actual genetic data sampled from research
     • Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness

  10. Genomic Data Resources
      • Wirth, B. et al. Journal of Human Molecular Genetics

  11. Our System: CAGE
      • To predict gene expression based on DNA sequences.
      [Figure: CAGE indicates, for Gene 1, Gene 2, and Gene 3, whether each gene is On or Off in muscle cells, neural cells, and seam cells.]

  12. Gene Expression Analysis
      Genes: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, Gene 6, Gene 7, Gene 8, Gene 9
      CELL TYPES (in slide order): neural, neural, muscle, neural, muscle, neural, neural, neural, muscle

      PROMOTER(S)  MOTIFS       SEQUENCE
      PR1          M1 M4 M2 M5  CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT
      PR2          M1 M4 M5     AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA
      PR3          M4 M1        CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA
      PR4          M1 M2 M5     GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA
      PR5          M1 M4        ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA
      PR6          M3 M4 M5     GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC
      PR7          M5 M2 M3 M1  TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA
      PR8          M2 M4 M5     ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC
      PR9          M4 M3        ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA

  13. Gene Expression
      • Transcription of DNA into RNA
      [Figure: MUSCLE CELL. TRANSCRIPTIONAL PROTEINS TF 1, TF 2, and TF 3 and the GENE are shown over the PROMOTER REGION ..CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGA, which contains MOTIFS M1, M4, and M2 separated by distances 240 and 100.]

  14. Genes: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, Gene 6, Gene 7, Gene 8, Gene 9
      Cell types (in slide order): neural, neural, muscle, neural, muscle, neural, neural, neural, muscle

      PROMOTER(S)  MOTIFS       SEQUENCE
      PR1          M1 M4 M2 M5  CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT
      PR2          M1 M4 M5     AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA
      PR3          M4 M1        CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA
      PR4          M1 M2 M5     GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA
      PR5          M1 M4        ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA
      PR6          M3 M4 M5     GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC
      PR7          M5 M2 M3 M1  TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA
      PR8          M2 M4 M5     ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC
      PR9          M4 M3        ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA

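  15. Coefficient of variation of distances (cvd) between two motifs: “Well-clustered” motifs
      Motif layouts, one per promoter (each number is the distance between neighboring motifs):
        M1 240 M4 100 M2 150 M5
        M1 260 M4 210 M5
        M4 360 M1
        M1 100 M2 350 M5
        M1 190 M4
        M3 120 M4 150 M5
        M5 210 M2 100 M3 110 M1
        M2 18 M4 21 M5
        M4 60 M3
      IR1 = {M1, M2, M5}
      Distances between M1 and M2 in the three layouts that contain both: 340, 100, 210
      σ(M1,M2) = 120.1 (standard deviation of the distances)
      μ(M1,M2) = 216.6 (mean of the distances)
      cvd(M1,M2) = σ(M1,M2) / μ(M1,M2) = 0.55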
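The cvd value on this slide can be reproduced from the motif layouts. The sketch below assumes that each number in a layout is the gap between two neighboring motifs, that the distance between non-adjacent motifs is the sum of the gaps between them, and that the standard deviation is the sample (n - 1) form; under those assumptions the M1 and M2 figures on the slide come out as shown. The promoter labels PR1 to PR9 are assigned in slide order, and the function names are illustrative.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

# (motif, gap to the previous motif); the first motif in each layout has gap 0.
Layout = List[Tuple[str, int]]

PROMOTERS: Dict[str, Layout] = {
    "PR1": [("M1", 0), ("M4", 240), ("M2", 100), ("M5", 150)],
    "PR2": [("M1", 0), ("M4", 260), ("M5", 210)],
    "PR3": [("M4", 0), ("M1", 360)],
    "PR4": [("M1", 0), ("M2", 100), ("M5", 350)],
    "PR5": [("M1", 0), ("M4", 190)],
    "PR6": [("M3", 0), ("M4", 120), ("M5", 150)],
    "PR7": [("M5", 0), ("M2", 210), ("M3", 100), ("M1", 110)],
    "PR8": [("M2", 0), ("M4", 18), ("M5", 21)],
    "PR9": [("M4", 0), ("M3", 60)],
}


def motif_positions(layout: Layout) -> Dict[str, int]:
    """Turn consecutive gaps into positions relative to the first motif."""
    positions: Dict[str, int] = {}
    offset = 0
    for motif, gap in layout:
        offset += gap
        positions[motif] = offset
    return positions


def pair_distances(promoters: Dict[str, Layout], a: str, b: str) -> List[int]:
    """Distances between motifs a and b in every promoter that contains both."""
    distances = []
    for layout in promoters.values():
        pos = motif_positions(layout)
        if a in pos and b in pos:
            distances.append(abs(pos[a] - pos[b]))
    return distances


def cvd(distances: List[int]) -> float:
    """Coefficient of variation: sample std / mean (needs at least 2 distances)."""
    return stdev(distances) / mean(distances)


d = pair_distances(PROMOTERS, "M1", "M2")
print(d, round(stdev(d), 1), round(mean(d), 1), round(cvd(d), 2))
```

A low cvd means the two motifs sit at a nearly constant distance from each other across promoters, which is what the slide calls "well-clustered".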

  16. Distance-based Association Rules
      [Figure: sample distance-based association rule]
      • Given:
        • min-support
        • min-confidence
        • max-cvd thresholds
      • Mine:
        • all distance-based association rules
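The mining task above can be illustrated for the simplest case, pairs of motifs. The sketch below is not the group's algorithm: it only enumerates motif pairs and keeps those that satisfy a min-support and a max-cvd threshold, using motif positions derived from the layouts on the previous slide. The min-confidence check is left out because it needs a class label per promoter, which the transcript does not give reliably. The threshold values and the function name `well_clustered_pairs` are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev
from typing import Dict, Tuple

# Motif positions per promoter, relative to the first motif in each layout
# (cumulative sums of the gaps shown on the cvd slide).
POSITIONS: Dict[str, Dict[str, int]] = {
    "PR1": {"M1": 0, "M4": 240, "M2": 340, "M5": 490},
    "PR2": {"M1": 0, "M4": 260, "M5": 470},
    "PR3": {"M4": 0, "M1": 360},
    "PR4": {"M1": 0, "M2": 100, "M5": 450},
    "PR5": {"M1": 0, "M4": 190},
    "PR6": {"M3": 0, "M4": 120, "M5": 270},
    "PR7": {"M5": 0, "M2": 210, "M3": 310, "M1": 420},
    "PR8": {"M2": 0, "M4": 18, "M5": 39},
    "PR9": {"M4": 0, "M3": 60},
}


def well_clustered_pairs(
    positions: Dict[str, Dict[str, int]],
    min_support: float,
    max_cvd: float,
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Map each qualifying motif pair to its (support, cvd)."""
    motifs = sorted({m for pos in positions.values() for m in pos})
    kept: Dict[Tuple[str, str], Tuple[float, float]] = {}
    for a, b in combinations(motifs, 2):
        distances = [
            abs(pos[a] - pos[b])
            for pos in positions.values()
            if a in pos and b in pos
        ]
        support = len(distances) / len(positions)
        # The sample standard deviation needs at least two co-occurrences.
        if support < min_support or len(distances) < 2:
            continue
        pair_cvd = stdev(distances) / mean(distances)
        if pair_cvd <= max_cvd:
            kept[(a, b)] = (support, pair_cvd)
    return kept


# Hypothetical thresholds chosen only for this illustration.
pairs = well_clustered_pairs(POSITIONS, min_support=0.3, max_cvd=0.3)
for pair, (support, pair_cvd) in pairs.items():
    print(pair, f"support={support:.2f}", f"cvd={pair_cvd:.2f}")
```

With these thresholds the pair (M1, M2) is rejected: it appears in three of the nine promoters, but its cvd of 0.55 exceeds the limit.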

  17. Grad. & Undergrad. Students
      Ali Benamara. Dharmesh Thakkar. Senthil K Palanisamy. Zachary Stoecker-Sylvia. Keith A. Pray. Jonathan Freyberger. Maged El-Sayed. Parameshvyas Laxminarayan. Aleksandar Icev. Wendy Kogel. Michael Sao Pedro. Christopher Shoemaker. Weiyang Lin. Jonathan Rudolph Eduardo Paredes Iavor N. Trifonov. Takeshi Kawato Cindy Leung and Sam Holmes. John Baird (BB), Jay Farmer, Rebecca Gougian (BB), Ken Monterio (BB), Paul Young. Zachary Stoecker-Sylvia. Kristin Blitsch (BB), Ben Lucas, Sarah Towey (BB) Wendy Kogel, Brooke LeClair, Christopher St. Yves. Brian Murphy, David Phu (CS/BB), Ian Pushee, Frederick Tan (CS/BB). Daniel Doyle, Jared Judecki, James Lund, Bryan Padovano (BB). Christopher Cole. Michael Ciman and John Gulbrandsen. Tara Halwes Christopher Martino. Matthew Berube. Anna Novikov. Amy Kao and Dana Rock.
