ADVENTURES IN DATA MINING
Margaret H. Dunham

Southern Methodist University

Dallas, Texas 75275

[email protected]

This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841


Some slides used by permission from Dr. Eamonn Keogh, University of California, Riverside; [email protected]


The 2000 ozone hole over the Antarctic, as seen by EP/TOMS

http://jwocky.gsfc.nasa.gov/multi/multi.html#hole


Data Mining Outline

  • Introduction

  • Techniques

    • Classification

    • Clustering

    • Association Rules

  • Examples

Explore some interesting data mining applications


Introduction

  • Data is growing at a phenomenal rate

  • Users expect more sophisticated information

  • How?

UNCOVER HIDDEN INFORMATION

DATA MINING


But it isn’t Magic

  • You must know what you are looking for

  • You must know how to look for it

Suppose you knew that a specific cave had gold:

  • What would you look for?

  • How would you look for it?

  • Might need an expert miner


Description

Behavior

Associations

“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”

“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”

Classification (Profiling), Clustering (Similarity), Link Analysis


CLASSIFICATION

Assign data into predefined groups or classes.


Classification Ex: Grading

Decision tree on the score x:

  • x >= 90: A

  • 80 <= x < 90: B

  • 70 <= x < 80: C

  • 60 <= x < 70: D

  • x < 60: F
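
The grading tree can be written as a chain of threshold tests, one per internal node. This is only a minimal sketch: the function name `grade` is made up for illustration, and it assumes the lowest split is at 60 so that the D and F ranges meet without a gap.

```python
def grade(x: float) -> str:
    """Walk the decision tree from the root; each test peels off one class."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:  # assumed boundary between D and F
        return "D"
    return "F"


if __name__ == "__main__":
    for score in (95, 85, 72, 60, 41):
        print(score, grade(score))
```

Because the classes are predefined and the thresholds are fixed in advance, this is classification in its simplest form: no training is needed, only a rule that maps each score to a label.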


Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

(c) Eamonn Keogh, [email protected]


The classification problem can now be expressed as:

  • Given a training database, predict the class label of a previously unseen instance.

(c) Eamonn Keogh, [email protected]


[Figure: scatter plot of Antenna Length against Abdomen Length (both axes from 1 to 10) for Katydids and Grasshoppers]

(c) Eamonn Keogh, [email protected]
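
One simple way to predict the label of an unseen insect from these two features is a nearest-neighbor rule. The sketch below is an illustration only, not necessarily the method used on the slides: the measurements in `TRAINING` are invented, and the function name `nearest_neighbor` is hypothetical.

```python
import math

# Hypothetical (abdomen_length, antenna_length) pairs, invented for illustration.
TRAINING = [
    ((8.0, 9.0), "Katydid"),
    ((7.0, 8.5), "Katydid"),
    ((6.5, 9.5), "Katydid"),
    ((9.0, 8.0), "Katydid"),
    ((7.5, 7.5), "Katydid"),
    ((2.0, 3.0), "Grasshopper"),
    ((3.5, 2.5), "Grasshopper"),
    ((4.0, 4.0), "Grasshopper"),
    ((5.0, 3.0), "Grasshopper"),
    ((3.0, 1.5), "Grasshopper"),
]


def nearest_neighbor(instance, training=TRAINING):
    """Return the label of the training example closest to `instance`."""
    closest = min(training, key=lambda row: math.dist(row[0], instance))
    return closest[1]


if __name__ == "__main__":
    print(nearest_neighbor((7.0, 8.0)))
    print(nearest_neighbor((3.0, 3.0)))
```

The training database is the list of labeled pairs, and the previously unseen instance is any new pair passed to the function.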


Facial Recognition

(c) Eamonn Keogh, [email protected]


Handwriting Recognition

[Figure: plot with a vertical axis from 0 to 1 and a horizontal axis from 0 to 450]

(c) Eamonn Keogh, [email protected]

George Washington Manuscript


Rare Event Detection


Dallas Morning News

October 7, 2005


CLUSTERING

Partition data into previously undefined groups.


http://149.170.199.144/multivar/ca.htm


What is Similarity?

(c) Eamonn Keogh, [email protected]


Two Types of Clustering

Partitional

Hierarchical

(c) Eamonn Keogh, [email protected]


Hierarchical Clustering Example: Iris Data Set

Versicolor

Setosa

Virginica

The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188.

Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster


ASSOCIATION RULES/ LINK ANALYSIS

Find relationships between data


ASSOCIATION RULES EXAMPLES

People who buy diapers also buy beer

If gene A is highly expressed in this disease, then gene B is also expressed

Relationships between people

Book Stores

Department Stores

Advertising

Product Placement

http://www.amazon.com/Data-Mining-Introductory-Advanced-Topics/dp/0130888923/ref=sr_1_1?ie=UTF8&s=books&qid=1235564485&sr=1-1
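
A rule such as "people who buy diapers also buy beer" is usually judged by its support (how often the items occur together) and its confidence (how often the consequent appears when the antecedent does). The sketch below computes both under stated assumptions: the baskets in `BASKETS` are invented, and the helper names `support` and `confidence` are illustrative.

```python
# Hypothetical market baskets, invented for illustration.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets=BASKETS):
    """Fraction of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets=BASKETS):
    """Estimate P(consequent | antecedent) from the baskets."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(set(antecedent) | set(consequent), baskets) / base


if __name__ == "__main__":
    print(support({"diapers", "beer"}))
    print(confidence({"diapers"}, {"beer"}))
```

With these made-up baskets, diapers and beer appear together in two of five baskets, and two of the three diaper baskets also contain beer.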


Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.

DILBERT reprinted by permission of United Feature Syndicate, Inc.


Data Mining Outline

  • Introduction

  • Techniques

  • Examples

    • Vision Mining

    • Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior,…)

    • Bioinformatics


Vision Mining

  • License Plate Recognition

    • Red Light Cameras

    • Toll Booths

    • http://www.licenseplaterecognition.com/

  • Computer Vision

    • http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/vid/


How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facial-recognition1.htm


Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.


No/Little Cheating



Rampant Cheating



Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman, and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag, Volume 3495, 2005, p. 287.


http://www.time.com/time/magazine/article/0,9171,1541283,00.html


DNA

http://www.visionlearning.com/library/module_viewer.php?mid=63

Basic building blocks of organisms

Located in nucleus of cells

Composed of 4 nucleotides

Two strands bound together


Central Dogma: DNA -> RNA -> Protein

  • Transcription: DNA -> RNA

  • Translation: RNA -> Protein (amino acids)

DNA: CCTGAGCCAACTATTGATGAA

RNA: CCUGAGCCAACUAUUGAUGAA

www.bioalgorithms.info; chapter 6; Gene Prediction
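
The two steps of the central dogma can be sketched on the sequence shown on this slide. This is a simplified illustration: it treats the DNA string as the coding strand, so transcription reduces to replacing T with U, and `CODON_TABLE` is deliberately partial, covering only the codons in this example (the full genetic code has 64). The function names are illustrative.

```python
# Partial codon table: only the codons that occur in the slide's example.
CODON_TABLE = {
    "CCU": "Pro", "CCA": "Pro",
    "GAG": "Glu", "GAA": "Glu",
    "ACU": "Thr", "AUU": "Ile", "GAU": "Asp",
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: thymine (T) becomes uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> list:
    """Read the mRNA three nucleotides at a time and look up each codon."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE.get(codon, "?") for codon in codons]


if __name__ == "__main__":
    rna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(rna)
    print("-".join(translate(rna)))
```

Codons missing from the partial table come back as "?", which makes it obvious when the sketch is applied to a sequence it was not written for.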


Human Genome

Scientists originally thought there would be about 100,000 genes

Appear to be about 20,000

WHY?

Almost identical to that of Chimps. What makes the difference?

Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)


RNAi – Nobel Prize in Medicine 2006

siRNA may be artificially added to cell!

Double stranded RNA

Short Interfering RNA (~20-25 nt)

RNA-Induced Silencing Complex

Binds to mRNA

Cuts RNA

Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3


miRNA

  • Short (20-25nt) sequence of noncoding RNA

  • Known since 1993 but significance not widely appreciated until 2001

  • Impact / Prevent translation of mRNA

  • Generally reduce protein levels without impacting mRNA levels (animal cells)

  • Functions

    • Cause some cancers

    • Guide embryo development

    • Regulate cell differentiation

    • Associated with HIV


TCGR: Mature miRNA (Window=5; Pattern=3)

[Figure: TCGR panels for C. elegans, Homo sapiens, Mus musculus, and all mature miRNA; labeled patterns: ACG, CGC, GCG, UCG]


TCGRs for Xue Training Data

C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.


Affymetrix GeneChip® Array

http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx


Microarray Data Analysis

  • Each probe location associated with gene

  • Measure the amount of mRNA

  • Color indicates degree of gene expression

  • Compare different samples (normal/disease)

  • Track same sample over time

  • Questions

    • Which genes are related to this disease?

    • Which genes behave in a similar manner?

    • What is the function of a gene?

  • Clustering

    • Hierarchical

    • K-means
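
To make the question "which genes behave in a similar manner?" concrete, here is a minimal k-means sketch over expression profiles. Everything in it is an assumption for illustration: the profiles in `PROFILES` are invented, the function name `kmeans` is hypothetical, and the centroids are seeded with the first k points to keep the run deterministic (real analyses typically use random or smarter seeding).

```python
import math

# Hypothetical expression levels for four genes at three time points.
PROFILES = [
    (1.0, 1.2, 0.9),
    (5.0, 5.1, 4.8),
    (0.8, 1.1, 1.0),
    (5.2, 4.9, 5.0),
]


def kmeans(points, k, iterations=20):
    """Alternate between assigning points to centroids and moving centroids."""
    centroids = [list(p) for p in points[:k]]  # deterministic seeding (assumption)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


if __name__ == "__main__":
    labels, centers = kmeans(PROFILES, k=2)
    print(labels)
```

Genes that end up with the same label have similar profiles across the time points, which is the grouping a microarray clustering analysis looks for.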


Microarray Data - Clustering

"Gene expression profiling identifies clinically relevant subtypes of prostate cancer"

Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004


BIG BROTHER ?

  • Total Information Awareness

    • http://infowar.net/tia/www.darpa.mil/iao/index.htm

    • http://www.govtech.net/magazine/story.php?id=45918

    • http://en.wikipedia.org/wiki/Information_Awareness_Office

  • Terror Watch List

    • http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm

    • http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/

    • http://blog.wired.com/27bstroke6/2008/02/us-terror-watch.html

  • CAPPS

    • http://www.theregister.co.uk/2004/04/26/airport_security_failures/

    • http://www.heritage.org/Research/HomelandDefense/BG1683.cfm

    • http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/

    • http://en.wikipedia.org/wiki/CAPPS


http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236


Thanks!

