
Data & Knowledge Engineering

Data & Knowledge Engineering. Presenter: YAN-SHOU SIE. Authors: Jeroen de Knijff, Flavius Frasincar, Frederik Hogenboom. 2012. DKE.


Presentation Transcript


  1. Data & Knowledge Engineering Presenter: YAN-SHOU SIE Authors: Jeroen de Knijff, Flavius Frasincar, Frederik Hogenboom 2012. DKE

  2. Outline • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments

  3. Motivation • In the past, data were stored physically, not digitally, and were often structured manually so that the desired information could be found easily. • Today, data are often stored digitally and are usually unstructured, as in documents. Manually structuring documents is time consuming.

  4. Objectives • The cost of manual structuring makes it interesting to investigate ways to organize documents automatically. This could be done by automatically generating a concept taxonomy from a document corpus. • In our current work, we present a framework for automatically constructing a domain taxonomy from text corpora. We call this framework Automatic Domain Taxonomy Construction from Text (ADTCT).

  5. Methodology ADTCT Framework

  6. Methodology-ADTCT Framework • Term Extraction: use a part-of-speech parser • Term Filtering: • domain pertinence (DP) • lexical cohesion (LC)

  7. Methodology-ADTCT Framework • domain consensus (DC) • norm_freq • final domain score

  8. Methodology-ADTCT Framework • Concept hierarchy creation • subsumption method • hierarchical clustering algorithm

  9. Methodology-ADTCT Framework • subsumption method • Concept x potentially subsumes concept y if: • A score calculated for each potential parent

  10. Methodology • Explained • P(p|x) example: ‘Technology adaptation’ has the potential parents ‘Technology’, ‘Technological’, and ‘Adaptation’ • [Figure: worked example with threshold t = 0.4; values 0.05, 0.2, 0.4, 0.32, and 0.6 shown for ‘Technological’ and ‘Technology adaptation’]
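The subsumption condition itself did not survive in this transcript, so the following is only a sketch under stated assumptions. It assumes document co-occurrence estimates of the conditional probabilities and assumes the test "P(x|y) >= t and P(y|x) < t"; the function names and the document sets are hypothetical.

```python
def cond_prob(docs_a, docs_b):
    """Estimate P(a|b) as the share of b's documents that also contain a."""
    return len(docs_a & docs_b) / len(docs_b) if docs_b else 0.0


def potentially_subsumes(docs_x, docs_y, t):
    """Assumed test: x is a potential parent of y when x appears in most of
    y's documents (P(x|y) >= t) but not the other way round (P(y|x) < t)."""
    return cond_prob(docs_x, docs_y) >= t and cond_prob(docs_y, docs_x) < t


# Hypothetical document sets for two concepts.
technology = {1, 2, 3, 4, 5, 6}
technology_adaptation = {2, 3}

print(potentially_subsumes(technology, technology_adaptation, t=0.4))
print(potentially_subsumes(technology_adaptation, technology, t=0.4))
```

In this toy data the broader concept passes the test as a parent of the narrower one, while the reverse direction fails because the narrower concept covers too few of the broader concept's documents.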

  11. Methodology-ADTCT Framework • Hierarchical clustering method • Algorithm: • 1. Start with n clusters (each term is a cluster). • 2. Compute the distances between clusters. • 3. Merge the two nearest clusters into one cluster. Return to step 2 if more than one cluster remains; otherwise, the algorithm has finished. • distance measures • document co-occurrence similarity • window-based similarity
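The three steps above can be sketched as a small agglomerative loop. This is a minimal illustration, not the ADTCT implementation: average linkage is assumed as the cluster distance, and the names `agglomerate`, `average_linkage`, and the toy coordinates are hypothetical.

```python
from itertools import combinations


def average_linkage(a, b, dist):
    """Mean pairwise distance between the members of two clusters."""
    return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))


def agglomerate(terms, dist):
    """Merge clusters bottom-up and return the merge history."""
    clusters = [(t,) for t in terms]  # step 1: every term is its own cluster
    merges = []
    while len(clusters) > 1:
        # step 2: compute distances and pick the two nearest clusters
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], dist),
        )
        # step 3: merge them, then loop back to step 2
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges


# Toy distance: absolute difference of made-up 1-D coordinates.
coords = {"a": 0.0, "b": 1.0, "c": 5.0}
history = agglomerate(list(coords), lambda x, y: abs(coords[x] - coords[y]))
print(history)
```

The merge history read from last to first gives the hierarchy from the root down. Recomputing all pairwise distances on every pass is quadratic per step, which is acceptable for a sketch but not for a large vocabulary.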

  12. Methodology-ADTCT Framework • document co-occurrence similarity • window-based similarity • Suppose that we have a document with four concepts: ‘Ad,’‘Bert,’ ‘Cees,’ and ‘Dirk.’ If the window size is 2, the following windows are created for this document: {Ad}, {Ad, Bert}, {Bert, Cees},{Cees, Dirk}, and {Dirk}.

  13. Methodology-Implementation • Hierarchical clustering algorithm example: ‘System’ appears in documents {1,3,6,8} and windows {1,5,10,14,18,20,28}; ‘Process’ appears in documents {1,3,6,12} and windows {1,5,12,14,18,25,30}. • Document similarity and window similarity are computed for this pair • The similarities are converted to distances [Figure labels: Max, Avg = 0.15, Min]

  14. Methodology • ADTCT Implementation
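The similarity formulas for this example are not preserved in the transcript. As a stand-in only, the sketch below applies Jaccard overlap to the document and window sets from the slide and converts each similarity to a distance as one minus the similarity. Both choices are assumptions, not the framework's measures, and the results are not expected to match the figure on the slide.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets (stand-in similarity, not the slide's)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Occurrence sets taken from the slide.
system_docs, process_docs = {1, 3, 6, 8}, {1, 3, 6, 12}
system_windows = {1, 5, 10, 14, 18, 20, 28}
process_windows = {1, 5, 12, 14, 18, 25, 30}

doc_sim = jaccard(system_docs, process_docs)           # 3 shared of 5 total
window_sim = jaccard(system_windows, process_windows)  # 4 shared of 10 total

# Assumed conversion: distance = 1 - similarity.
print(round(1 - doc_sim, 2), round(1 - window_sim, 2))
```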

  15. Experiments • Experimental setup • lexical precision : • common semantic cotopy : • local taxonomic precision : • taxonomic precision and recall : • taxonomic F-measure (TF):
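The metric formulas on this slide were images and are missing here. The sketch below uses the usual definitions as assumptions: lexical precision as the share of learned concepts that also occur in the reference taxonomy, and the taxonomic F-measure as the harmonic mean of taxonomic precision and recall. The function names and concept sets are hypothetical.

```python
def lexical_precision(learned, reference):
    """Share of learned concepts that also appear in the reference taxonomy."""
    return len(learned & reference) / len(learned) if learned else 0.0


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical concept sets for a learned and a reference taxonomy.
learned = {"system", "process", "market"}
reference = {"system", "market", "firm", "strategy"}

print(round(lexical_precision(learned, reference), 3))
print(round(f_measure(0.6, 0.3), 3))
```

The harmonic mean penalizes an imbalance between precision and recall more strongly than an arithmetic mean would, which is why it is the common choice for a single summary score.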

  16. Experiments • Experimental results

  17. Experiments • Making the trade-off decision mathematically • Suppose the minimal average depth is 3 and the minimal quality is 0.60; t = 0.20, t = 0.25, and t = 0.30 satisfy these constraints. • γ = 0.40 and λ = 0.60 • [Figure: resulting values for t = 0.20, t = 0.25, and t = 0.30]

  18. Experiments

  19. Conclusions • Our evaluation in the field of management and economics indicates that a trade-off between taxonomy quality and depth must be made when choosing one of these methods. • The subsumption method is preferable for shallow taxonomies, whereas the hierarchical clustering algorithm is recommended for deep taxonomies.

  20. Comments • Advantages • Automatically creates taxonomies that approach the quality of manually created taxonomies while saving considerable time • Applications - Clustering, Classification, etc.
