Hubness in the Context of Feature Selection and Generation

Miloš Radovanović¹, Alexandros Nanopoulos², Mirjana Ivanović¹

¹ Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Serbia

² Institute of Computer Science, University of Hildesheim, Germany

FGSIR'10



k-occurrences (Nk)

Nk(x), the number of k-occurrences of point x, is the number of times x occurs among the k nearest neighbors of all other points in the data set

  • Nk(x) is the in-degree of node x in the k-NN digraph

    It was observed that the distribution of Nk can become skewed, resulting in the emergence of hubs – points with high Nk

  • Music retrieval [Aucouturier 2007]

  • Speech recognition [Doddington 1998]

  • Fingerprint identification [Hicklin 2005]
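
A minimal computational sketch of Nk, assuming a NumPy data matrix X and scikit-learn's NearestNeighbors (the function name and the random example data are illustrative, not from the slides):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_occurrences(X, k=10):
    """N_k(x) for every point x: how many times x appears among
    the k nearest neighbors of the other points in X."""
    # Query k+1 neighbors, since each point is returned as its own first neighbor.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbors = idx[:, 1:]                     # drop the self-neighbor column
    return np.bincount(neighbors.ravel(), minlength=X.shape[0])

# Illustrative run on random high-dimensional data
X = np.random.randn(1000, 100)
Nk = k_occurrences(X, k=10)
print(Nk.max(), Nk.mean())   # the mean of N_k is always k; a large max reveals hubs
```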


Skewness of Nk

What causes the skewness of Nk?

  • Artefact of data?

    • Are some songs more similar to others?

    • Do some people have fingerprints or voices that are harder to distinguish from other people’s?

  • Specifics of modeling algorithms?

    • Inadequate choice of features?

  • Something more general?


Contributions - Outline

Demonstrate the phenomenon

  • Skewness in the distribution of k-occurrences

    Explain its main reasons

  • Not an artifact of the data

  • Not due to specifics of the models (e.g., inadequate features)

  • A new aspect of the “curse of dimensionality”

    Impact on Feature Selection and Generation


Outline

Demonstrate the phenomenon

Explain its main reasons

Impact on FSG

Conclusions


Collection of 23 real text data sets

SNk is the standardized third moment (skewness) of the distribution of Nk

If SNk = 0 there is no skew; positive (negative) values signify right (left) skew

High skewness indicates hubness
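
Spelled out, with μ and σ denoting the mean and standard deviation of Nk over all points in the data set, this is the standard skewness formula:

```latex
S_{N_k} = \frac{\mathbb{E}\left[(N_k - \mu_{N_k})^3\right]}{\sigma_{N_k}^{3}}
```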


Collection of 14 real UCI data sets + microarray data


Outline

Demonstrate the phenomenon

Explain its main reasons

Impact on FSG

Conclusions


Where are the hubs located?

Spearman correlation between N10 and distance from data set mean

Hubs are closer to the data center
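
A sketch of this measurement, assuming Euclidean distances, a data matrix X, and SciPy's spearmanr (the helper repeats the Nk computation from the earlier sketch):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.neighbors import NearestNeighbors

def hubness_vs_centrality(X, k=10):
    """Spearman correlation between N_k and the distance from the data set mean.
    A strongly negative value means hubs tend to lie near the data center."""
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1][:, 1:]
    Nk = np.bincount(idx.ravel(), minlength=X.shape[0])
    dist_to_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
    return spearmanr(Nk, dist_to_mean)[0]

print(hubness_vs_centrality(np.random.randn(1000, 100)))   # typically well below zero
```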


Centrality and its amplification

  • Hubs due to centrality

    • vectors closer to the center tend to be closer to all other vectors

    • and are thus more frequently among the k nearest neighbors of other points

  • Centrality is amplified by dimensionality

If point A is closer to the data center than point B, the difference ∑_x sim(A, x) − ∑_x sim(B, x) is positive, and it grows with increasing dimensionality.


Concentration of similarity

Concentration: as dimensionality grows to infinity

  • The ratio between the standard deviation of pairwise similarities (distances) and their expectation shrinks to zero

  • Shown for Minkowski distances [François 2007, Beyer 1999, Aggarwal 2001]

    • Meaningfulness of nearest neighbors?

      Analytical proof for cosine similarity [Radovanović 2010]
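
The concentration effect is easy to observe empirically; a small sketch on i.i.d. Gaussian data (the distribution and sample sizes are illustrative choices, not from the slides):

```python
import numpy as np
from scipy.spatial.distance import pdist

# The ratio std/mean of pairwise Euclidean distances shrinks as dimensionality grows.
rng = np.random.default_rng(0)
for d in (3, 30, 300, 3000):
    X = rng.standard_normal((500, d))
    dists = pdist(X)
    print(f"d={d}: std/mean = {dists.std() / dists.mean():.4f}")
```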


The hyper-sphere view

[Figure: the hyper-sphere view, with expected distance E from the data center and spread √V around it]

  • Most vectors are about equidistant from the center and from each other, and lie on the surface of a hyper-sphere

  • A few vectors lie in the interior of the hyper-sphere, closer to its center and thus closer to all other vectors

    • This is expected for large but finite dimensionality, since the spread √V around the expected distance E is non-negligible


What happens with real data?

Spearman correlation between N10 and distance from data/cluster center

Real text data are usually clustered (mixture of distributions)

Cluster with k-means (#clusters = 3 × Cls, i.e., three times the number of classes)

Compare with the correlation to the distance from the data set mean (previous slide)

Generalization of the hyper-sphere view with clusters
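
A sketch of that comparison, assuming labelled data (X, y), Euclidean distances, and scikit-learn's KMeans; the 3 × (number of classes) cluster count follows the slide (assuming "Cls" denotes the number of classes), the rest is illustrative:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def hubness_vs_cluster_centers(X, y, k=10):
    """Spearman correlation between N_k and the distance to the nearest
    k-means centroid, using 3 * (number of classes) clusters."""
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1][:, 1:]
    Nk = np.bincount(idx.ravel(), minlength=X.shape[0])
    km = KMeans(n_clusters=3 * len(np.unique(y)), n_init=10).fit(X)
    dist_to_centroid = km.transform(X).min(axis=1)   # distance to the nearest centroid
    return spearmanr(Nk, dist_to_centroid)[0]
```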


UCI data


Can dim reduction help?

Skewness of Nk is not reduced until the intrinsic dimensionality is reached


UCI data


Outline

Demonstrate the phenomenon

Explain its main reasons

Impact on FSG

Conclusions


“Bad” hubs as obstinate results

  • Based on information about classes, k-occurrences can be distinguished into:

    • “Bad” k-occurrences, BNk(x)

    • “Good” k-occurrences, GNk(x)

    • Nk(x) = BNk(x) + GNk(x)
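
A sketch of this decomposition given class labels y (the function name and looping style are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def good_bad_k_occurrences(X, y, k=10):
    """Split N_k into GN_k (the labels of the two points match) and
    BN_k (label mismatch), so that N_k(x) = GN_k(x) + BN_k(x)."""
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1][:, 1:]
    y = np.asarray(y)
    GNk = np.zeros(X.shape[0], dtype=int)
    BNk = np.zeros(X.shape[0], dtype=int)
    for i, nbrs in enumerate(idx):   # point i "votes" for each of its k neighbors
        for j in nbrs:
            if y[i] == y[j]:
                GNk[j] += 1
            else:
                BNk[j] += 1
    return GNk, BNk
```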


How do “bad” hubs originate?

  • The mixture (cluster structure) of the data is important here as well:

    • High dimensionality and skewness of Nk do not automatically induce “badness”

    • “Bad” hubs originate from a combination of high dimensionality and violation of the CA

      • Cluster Assumption (CA): Most pairs of vectors in a cluster should be of the same class [Chapelle 2006]


Skewness of Nk vs. #features

Skewness stays relatively constant

It abruptly drops when intrinsic dimensionality is reached

Further feature selection may incur loss of information.
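
One way to trace such a curve, sketched under the assumption of a non-negative (e.g., tf-idf) matrix X with labels y and chi-squared feature ranking; the slides do not specify which selector was used:

```python
import numpy as np
from scipy.stats import skew
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import NearestNeighbors

def skewness_of_Nk(X, k=10):
    """S_{N_k}: standardized third moment of the k-occurrence distribution."""
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1][:, 1:]
    return skew(np.bincount(idx.ravel(), minlength=X.shape[0]))

def skewness_vs_num_features(X, y, feature_counts, k=10):
    """Keep the m best-ranked features and re-measure S_{N_k} for each m."""
    return {m: skewness_of_Nk(SelectKBest(chi2, k=m).fit_transform(X, y), k=k)
            for m in feature_counts}
```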


Badness vs. #features

Similar observations

When the intrinsic dimensionality is reached, the BNk ratio increases

The representation no longer reflects the information provided by the labels well


Feature generation

When adding features to bring new information to the data:

  • The representation will ultimately increase SNk and, thus, produce hubs

  • The reduction of BNk ratio “flattens out” fairly quickly, limiting the usefulness of adding new features in the sense of being able to express the “ground truth”

    If instead of BNk ratio we use classifier error rate, the results are similar


Conclusion

  • Research on feature selection/generation has paid little attention to the fact that, in intrinsically high-dimensional data, hubs will:

    • Result in an uneven distribution of cluster-assumption violation (hubs emerge that attract more label mismatches with neighboring points)

    • Result in an uneven distribution of responsibility for classification or retrieval error among data points

  • Investigating further the interaction between:

    • hubness and

    • different notions of CA violation

  • This may yield important new insights into feature selection/generation


Thank You! Alexandros Nanopoulos, nanopoulos@ismll.de
