predictive modeling with heterogeneous sources n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Predictive Modeling with Heterogeneous Sources PowerPoint Presentation
Download Presentation
Predictive Modeling with Heterogeneous Sources

Loading in 2 Seconds...

play fullscreen
1 / 19

Predictive Modeling with Heterogeneous Sources - PowerPoint PPT Presentation


  • 135 Views
  • Uploaded on

Predictive Modeling with Heterogeneous Sources. Xiaoxiao Shi 1 Qi Liu 2 Wei Fan 3 Qiang Yang 4 Philip S. Yu 1 1 University of Illinois at Chicago 2 Tongji University, China 3 IBM T. J. Watson Research Center 4 Hong Kong University of Science and Technology.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Predictive Modeling with Heterogeneous Sources' - nieve


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
predictive modeling with heterogeneous sources

Predictive Modeling with Heterogeneous Sources

Xiaoxiao Shi1 Qi Liu2 Wei Fan3 Qiang Yang4 Philip S. Yu1

1 University of Illinois at Chicago

2 Tongji University, China

3 IBM T. J. Watson Research Center

4Hong Kong University of Science and Technology

why learning with heterogeneous sources
Why learning with heterogeneous sources?

Standard Supervised Learning

Training (labeled)

Test (unlabeled)

Classifier

85.5%

New York Times

New York Times

slide3

Why heterogeneous sources?

How to improve

the performance?

In Reality…

Training (labeled)

Test (unlabeled)

47.3%

Labeled data are insufficient!

New York Times

New York Times

why heterogeneous sources
Why heterogeneous sources?

Labeled data from other sources

Target domain

test (unlabeled)

82.6%

47.3%

New York Times

Reuters

  • Different distributions
  • Different outputs
  • Different feature spaces
real world examples
Real world examples
  • Social Network:
    • Can various bookmarking systems help predict social tags for a new system given that their outputs (social tags) and data (documents) are different?

Wikipedia

ODP

Backflip

Blink

……

?

4/18

real world examples1
Real world examples
  • Applied Sociology:
    • Can the suburban housing price census data help predict the downtown housing prices?

?

#rooms #bathrooms #windows price

5 2 12 XXX

6 3 11 XXX

#rooms #bathrooms #windows price

2 1 4 XXXXX

4 2 5 XXXXX

5/18

other examples
Other examples
  • Bioinformatics
    • Previous years’ flu data  new swine flu
    • Drug efficacy data against breast cancer  drug data against lung cancer
    • ……
  • Intrusion detection
    • Existing types of intrusions  unknown types of intrusions
  • Sentiment analysis
    • Review from SDM Review from KDD

6/18

learning with heterogeneous sources
Learning with Heterogeneous Sources
  • The paper mainly attacks two sub-problems:
    • Heterogeneous data distributions
      • Clustering based KL divergence and a corresponding sampling technique
    • Heterogeneous outputs (to regression problem)
      • Unifying outputs via preserving similarity.

7/18

learning with heterogeneous sources1
Learning with Heterogeneous Sources
  • General Framework

Source data

Unifying

data distributions

Unifying outputs

Target data

Target data

Source data

8/18

unifying data distributions
Unifying Data Distributions
  • Basic idea:
    • Combine the source and target data and perform clustering.
    • Select the clusters in which the target and source data are similarly distributed, evaluated by KL divergence.

9/18

an example
An Example

T

D

Adaptive Clustering

Combined Data

10/18

unifying outputs
Unifying Outputs
  • Basic idea:
    • Generate initial outputs according to the regression model
    • For the instances similar in the original output space, make their new outputs closer.

11/18

slide13

Modification

Modification

Initial Outputs

Initial Outputs

21.25

31.75

37

16

26.5

12/18

experiment
Experiment
  • Bioinformatics data set:

13/18

experiment2
Experiment
  • Applied sociology data set:

15/18

slide18

Conclusions

  • Problem: Learning with Heterogeneous Sources:
    • Heterogeneous data distributions
    • Heterogeneous outputs
  • Solution:
    • Clustering based KL divergence help perform sampling
    • Similarity preserving output generation help unify outputs