- By
**erika** - Follow User

- 663 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Data mining and its application and usage in medicine' - erika

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Data Mining and Medicine

- History
- Past 20 years with relational databases
- More dimensions to database queries
- earliest and most successful area of data mining
- Mid 1800s in London hit by infectious disease
- Two theories
- Miasma theory Bad air propagated disease
- Germ theory Water-borne
- Advantages
- Discover trends even when we don’t understand reasons
- Discover irrelevant patterns that confuse than enlighten
- Protection against unaided human inference of patterns provide quantifiable measures and aid human judgment
- Data Mining
- Patterns persistent and meaningful
- Knowledge Discovery of Data

The future of data mining

- 10 biggest killers in the US
- Data mining = Process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data

Major Issues in Medical Data Mining

- Heterogeneity of medical data
- Volume and complexity
- Physician’s interpretation
- Poor mathematical categorization
- Canonical Form
- Solution: Standard vocabularies, interfaces between different sources of data integrations, design of electronic patient records
- Ethical, Legal and Social Issues
- Data Ownership
- Lawsuits
- Privacy and Security of Human Data
- Expected benefits
- Administrative Issues

Why Data Preprocessing?

- Patient records consist of clinical, lab parameters, results of particular investigations, specific to tasks
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- Noisy: containing errors or outliers
- Inconsistent: containing discrepancies in codes or names
- Temporal chronic diseases parameters
- No quality data, no quality mining results!
- Data warehouse needs consistent integration of quality data
- Medical Domain, to handle incomplete, inconsistent or noisy data, need people with domain knowledge

What is Data Mining? The KDD Process

Knowledge

Pattern Evaluation

Data Mining

Task-relevant

Data

Selection

Data

Warehouse

Data Cleaning

Data Integration

Databases

From Tables and Spreadsheets to Data Cubes

- A data warehouse is based on a multidimensional data model that views data in the form of a data cube
- A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
- Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)
- Fact table contains measures (such as dollars_sold) and keys to each of related dimension tables
- W. H. Inmon:“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”

Data Warehouse vs. Heterogeneous DBMS

- Data warehouse: update-driven, high performance
- Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
- Do not contain most current information
- Query processing does not interfere with processing at local sources
- Store and integrate historical information
- Support complex multidimensional queries

Data Warehouse vs. Operational DBMS

- OLTP (on-line transaction processing)
- Major task of traditional relational DBMS
- Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
- OLAP (on-line analytical processing)
- Major task of data warehouse system
- Data analysis and decision making
- Distinct features (OLTP vs. OLAP):
- User and system orientation: customer vs. market
- Data contents: current, detailed vs. historical, consolidated
- Database design: ER + application vs. star + subject
- View: current, local vs. evolutionary, integrated
- Access patterns: update vs. read-only but complex queries

Why Separate Data Warehouse?

- High performance for both systems
- DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
- Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
- Different functions and different data:
- Missing data: Decision support requires historical data which operational DBs do not typically maintain
- Data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
- Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Typical OLAP Operations

- Roll up (drill-up): summarize data
- by climbing up hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up
- from higher level summary to lower level summary or detailed data, or introducing new dimensions
- Slice and dice:
- project and select
- Pivot (rotate):
- reorient the cube, visualization, 3D to series of 2D planes.
- Other operations
- drill across: involving (across) more than one fact table
- drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

sources

Extract

Transform

Load

Refresh

Operational

DBs

Multi-Tiered Architecture

Monitor

&

Integrator

OLAP Server

Metadata

Analysis

Query

Reports

Data mining

Serve

Data

Warehouse

Data Marts

Data Sources

OLAP Engine

Front-End Tools

Data Storage

Steps of a KDD Process

- Learning the application domain:
- relevant prior knowledge and goals of application
- Creating a target data set: data selection
- Data cleaning and preprocessing: (may take 60% of effort!)
- Data reduction and transformation:
- Find useful features, dimensionality/variable reduction, invariant representation.
- Choosing functions of data mining
- summarization, classification, regression, association, clustering.
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
- visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge

Common Techniques in Data Mining

- Predictive Data Mining
- Most important
- Classification: Relate one set of variables in data to response variables
- Regression: estimate some continuous value
- Descriptive Data Mining
- Clustering: Discovering groups of similar instances
- Association rule extraction
- Variables/Observations
- Summarization of group descriptions

Leukemia

- Different types of cells look very similar
- Given a number of samples (patients)
- can we diagnose the disease accurately?
- Predict the outcome of treatment?
- Recommend best treatment based of previous treatments?
- Solution: Data mining on micro-array data
- 38 training patients, 34 testing patients ~ 7000 patient attributes
- 2 classes: Acute Lymphoblastic Leukemia(ALL) vs Acute Myeloid Leukemia (AML)

Clustering/Instance Based Learning

- Uses specific instances to perform classification than general IF THEN rules
- Nearest Neighbor classifier
- Most studied algorithms for medical purposes
- Clustering– Partitioning a data set into several groups (clusters) such that
- Homogeneity: Objects belonging to the same cluster are similar to each other
- Separation: Objects belonging to different clusters are dissimilar to each other.
- Three elements
- The set of objects
- The set of attributes
- Distance measure

Measure the Dissimilarity of Objects

- Find best matching instance
- Distance function
- Measure the dissimilarity between a pair of data objects
- Things to consider
- Usually very different for interval-scaled, boolean, nominal, ordinal and ratio-scaled variables
- Weights should be associated with different variables based on applications and data semantic
- Quality of a clustering result depends on both the distance measure adopted and its implementation

Minkowski Distance

- Minkowski distance: a generalization
- If q = 2, d is Euclidean distance
- If q = 1, d is Manhattan distance

xi

Xi (1,7)

12

8.48

q=2

q=1

6

6

xj

Xj(7,1)

K-nearest neighbors algorithm

- Initialization
- Arbitrarily choose k objects as the initial cluster centers (centroids)
- Iteration until no change
- For each object Oi
- Calculate the distances between Oi and the k centroids
- (Re)assign Oi to the cluster whose centroid is the closest to Oi
- Update the cluster centroids based on current assignment

Dataset

- Data set from UCI repository
- http://kdd.ics.uci.edu/
- 768 female Pima Indians evaluated for diabetes
- After data cleaning 392 data entries

Hierarchical Clustering

- Groups observations based on dissimilarity
- Compacts database into “labels” that represent the observations
- Measure of similarity/Dissimilarity
- Euclidean Distance
- Manhattan Distance
- Types of Clustering
- Single Link
- Average Link
- Complete Link

Hierarchical Clustering: Comparison

5

1

5

5

4

1

3

1

4

1

2

2

5

2

5

5

2

1

5

2

5

2

2

2

3

3

6

6

3

6

3

1

6

3

3

1

4

4

4

1

3

4

4

4

Single-link

Complete-link

Average-link

Centroid distance

Compare Dendrograms

1 2 5 3 6 4

1 2 5 3 6 4

1 2 5 3 6 4

Single-link

Complete-link

Centroid distance

Average-link

2 5 3 6 4 1

Which Distance Measure is Better?

- Each method has both advantages and disadvantages; application-dependent
- Single-link
- Can find irregular-shaped clusters
- Sensitive to outliers
- Complete-link, Average-link, and Centroid distance
- Robust to outliers
- Tend to break large clusters
- Prefer spherical clusters

Dendrogram from dataset

- Minimum spanning tree through the observations
- Single observation that is last to join the cluster is patient whose blood pressure is at bottom quartile, skin thickness is at bottom quartile and BMI is in bottom half
- Insulin was however largest and she is 59-year old diabetic

Dendrogram from dataset

- Maximum dissimilarity between observations in one cluster when compared to another

Dendrogram from dataset

- Average dissimilarity between observations in one cluster when compared to another

Supervised versus Unsupervised Learning

- Supervised learning (classification)
- Supervision: Training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on training set
- Unsupervised learning (clustering)
- Class labels of training data are unknown
- Given a set of measurements, observations, etc., need to establish existence of classes or clusters in data

Classification and Prediction

- Derive models that can use patient specific information, aid clinical decision making
- Apriori decision on predictors and variables to predict
- No method to find predictors that are not present in the data
- Numeric Response
- Least Squares Regression
- Categorical Response
- Classification trees
- Neural Networks
- Support Vector Machine
- Decision models
- Prognosis, Diagnosis and treatment planning
- Embed in clinical information systems

Least Squares Regression

- Find a linear function of predictor variables that minimize the sum of square difference with response
- Supervised learning technique
- Predict insulin in our dataset :glucose and BMI

Decision Trees

- Decision tree
- Each internal node tests an attribute
- Each branch corresponds to attribute value
- Each leaf node assigns a classification
- ID3 algorithm
- Based on training objects with known class labels to classify testing objects
- Rank attributes with information gain measure
- Minimal height
- least number of tests to classify an object
- Used in commercial tools eg: Clementine
- ASSISTANT
- Deal with medical datasets
- Incomplete data
- Discretize continuous variables
- Prune unreliable parts of tree
- Classify data

Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm)
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Tree is constructed in a top-down recursive divide-and-conquer manner
- At start, all training examples are at the root
- Test attributes are selected on basis of a heuristic or statistical measure (e.g., information gain)
- Examples are partitioned recursively based on selected attributes

Construction of A Decision Tree for “Condition X”

[P4,P5,P10]

Yes: 3, No:0

[P6,P14]

Yes: 0, No:2

YES

YES

YES

NO

NO

[P1,…P14]

Yes: 9, No:5

Age?

30…40

<=30

>40

[P1,P2,P8,P9,P11]

Yes: 2, No:3

[P3,P7,P12,P13]

Yes: 4, No:0

[P4,P5,P6,P10,P14]

Yes: 3, No:2

Vision

History

no

yes

excellent

fair

[P9,P11]

Yes: 2, No:0

[P1,P2,P8]

Yes: 0, No:3

Entropy and Information Gain

- S contains si tuples of class Ci for i = {1, ..., m}
- Information measures info required to classify any arbitrary tuple
- Entropy of attribute A with values {a1,a2,…,av}
- Information gained by branching on attribute A

Entropy and Information Gain

- Select attribute with the highest information gain (or greatest entropy reduction)
- Such attribute minimizes information needed to classify samples

Rule Induction

- IF conditions THEN Conclusion
- Eg: CN2
- Concept description:
- Characterization: provides a concise and succinct summarization of given collection of data
- Comparison: provides descriptions comparing two or more collections of data
- Training set, testing set
- Imprecise
- Predictive Accuracy
- P/P+N

Example used in a Clinic

- Hip arthoplasty trauma surgeon predict patient’s long-term clinical status after surgery
- Outcome evaluated during follow-ups for 2 years
- 2 modeling techniques
- Naïve Bayesian classifier
- Decision trees
- Bayesian classifier
- P(outcome=good) = 0.55 (11/20 good)
- Probability gets updated as more attributes are considered
- P(timing=good|outcome=good) = 9/11 (0.846)
- P(outcome = bad) = 9/20 P(timing=good|outcome=bad) = 5/9

Bayesian Classification

- Bayesian classifier vs. decision tree
- Decision tree: predict the class label
- Bayesian classifier: statistical classifier;predict class membership probabilities
- Based on Bayes theorem; estimate posterior probability
- Naïve Bayesian classifier:
- Simple classifier that assumes attribute independence
- High speed when applied to large databases
- Comparable in performance to decision trees

Bayes Theorem

- Let X be a data sample whose class label is unknown
- Let Hi be the hypothesis that X belongs to a particular class Ci
- P(Hi) is class prior probability that X belongs to a particular class Ci
- Can be estimated by ni/n from training data samples
- n is the total number of training data samples
- ni is the number of training data samples of class Ci

Formula of Bayes Theorem

More classification Techniques

- Neural Networks
- Similar to pattern recognition properties of biological systems
- Most frequently used
- Multi-layer perceptrons
- Input with bias, connected by weights to hidden, output
- Backpropagation neural networks
- Support Vector Machines
- Separate database to mutually exclusive regions
- Transform to another problem space
- Kernel functions (dot product)
- Output of new points predicted by position
- Comparison with classification trees
- Not possible to know which features or combination of features most influence a prediction

Multilayer Perceptrons

- Non-linear transfer functions to weighted sums of inputs
- Werbos algorithm
- Random weights
- Training set, Testing set

Support Vector Machines

- 3 steps
- Support Vector creation
- Maximal distance between points found
- Perpendicular decision boundary
- Allows some points to be misclassified
- Pima Indian data with X1(glucose) X2(BMI)

What is Association Rule Mining?

- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories

Example of Association Rules

{High LDL, Low HDL} {Heart Failure}

- People who have high LDL (“bad” cholesterol), low HDL (“good cholesterol”) are at

higher risk of heart failure.

Association Rule Mining

- Market Basket Analysis
- Same groups of items bought placed together
- Healthcare
- Understanding among association among patients with demands for similar treatments and services
- Goal : find items for which joint probability of occurrence is high
- Basket of binary valued variables
- Results form association rules, augmented with support and confidence

Association Rule Mining

Trans containing both X and Y

D

Trans containing X

Trans containing Y

- Association Rule
- An implication expression of the form X Y, where X and Y are itemsets and XY=
- Rule Evaluation Metrics
- Support (s): Fraction of transactions that contain both X and Y
- Confidence (c): Measures how often items in Y appear in transactions thatcontain X

The Apriori Algorithm

- Starts with most frequent 1-itemset
- Include only those “items” that pass threshold
- Use 1-itemset to generate 2-itemsets
- Stop when threshold not satisfied by any itemset
- L1 = {frequent items};

for (k = 1; Lk !=; k++) do

- Candidate Generation: Ck+1 = candidates generated from Lk;
- Candidate Counting: for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t
- Lk+1 = candidates in Ck+1 with min_sup

return k Lk;

Apriori-based Mining

Data base D

1-candidates

Freq 1-itemsets

2-candidates

TID

Items

Itemset

Sup

Itemset

Sup

Itemset

10

a, c, d

a

2

a

2

ab

Scan D

20

b, c, e

b

3

b

3

ac

30

a, b, c, e

c

3

c

3

ae

40

b, e

d

1

e

3

bc

Min_sup=0.5

e

3

be

ce

Counting

3-candidates

Freq 2-itemsets

Scan D

Itemset

Sup

Itemset

Itemset

Sup

ab

1

bce

ac

2

Scan D

ac

2

bc

2

Freq 3-itemsets

ae

1

be

3

bc

2

ce

2

Itemset

Sup

be

3

bce

2

ce

2

Principle Component Analysis

- Principle Components
- In cases of large number of variables, highly possible that some subsets of the variables are very correlated with each other. Reduce variables but retain variability in dataset
- Linear combinations of variables in the database
- Variance of each PC maximized
- Display as much spread of the original data
- PC orthogonal with each other
- Minimize the overlap in the variables
- Each component normalized sum of square is unity
- Easier for mathematical analysis
- Number of PC < Number of variables
- Associations found
- Small number of PC explain large amount of variance
- Example 768 female Pima Indians evaluated for diabetes
- Number of times pregnant, two-hour oral glucose tolerance test (OGTT) plasma glucose, Diastolic blood pressure, Triceps skin fold thickness, Two-hour serum insulin, BMI, Diabetes pedigree function, Age, Diabetes onset within last 5 years

National Cancer Institute

- CancerNet http://www.nci.nih.gov
- CancerNet for Patients and the Public
- CancerNet for Health Professionals
- CancerNet for Basic Reasearchers
- CancerLit

Conclusion

- About ¾ billion of people’s medical records are electronically available
- Data mining in medicine distinct from other fields due to nature of data: heterogeneous, with ethical, legal and social constraints
- Most commonly used technique is classification and prediction with different techniques applied for different cases
- Associative rules describe the data in the database
- Medical data mining can be the most rewarding despite the difficulty

Download Presentation

Connecting to Server..