Clustering Related Terms with Definitions

Scott Piao, John McNaught and Sophia Ananiadou

{scott.piao,john.mcnaught,sophia.ananiadou}@manchester.ac.uk

National Centre for Text Mining, School of Computer Science, The University of Manchester

LREC 2008 Marrakech

Outline of talk
  • Task: match related terms in an ontology.
  • Approach: detect and cluster related terms based on definitions.
  • Implementation: definition matching and term clustering, user interface.
  • Evaluation on GO terms.
  • Conclusion.

Task: matching terms for ontology enrichment
  • Matching similar or related terms/expressions is an important task in NLP and text mining applications.
  • Ontology term matching is also closely related to ontology enrichment.
  • In the EU BOOTStrep Project, some techniques have been tested for ontology entity matching and alignment.
  • Our work focuses on testing and evaluating a text matching tool that identifies related ontology terms by means of their definitions.

Definitions of term definitions
  • Ontology terms, such as GO (Gene Ontology) terms, often contain detailed definitions:
    • id: GO:0000124
    • name: SAGA complex
    • def:"A large multiprotein complex that possesses histone acetyltransferase and is involved in regulation of transcription. The budding yeast complex includes Gcn5p, several proteins of the Spt and Ada families, and several TBP-associate proteins (TAFs); analogous complexes in other species have analogous compositions, and usually contain homologs of the yeast proteins."
    • id: GO:0005671
    • name: Ada2/Gcn5/Ada3 transcription activator complex
    • def:"A multiprotein complex that possesses histone acetyltransferase and is involved in regulation of transcription. The budding yeast complex includes Gcn5p, two proteins of the Ada family, and two TBP-associate proteins (TAFs); analogous complexes in other species have analogous compositions, and usually contain homologs of the yeast proteins."

Our approach to the issue
  • The definitions can provide a fundamental information source for detecting relations between terms.
  • Lexicon definitions have previously been used for analyzing relations between words/terms (Castillo et al., 2003).
  • We assume text matching tools can be used to detect related terms based on the definitions.

A tool for clustering related texts
  • Align similar sentences between texts.
  • Measure the distances between texts based on the aligned sentences.
  • Cluster similar texts based on a distance matrix.
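The first two steps above can be illustrated with a minimal sketch. It is not the authors' tool: the naive sentence splitter, the Jaccard token overlap, the `threshold` value and all function names are assumptions chosen for illustration, and the tool's actual metric is the one described in the paper.

```python
import re

def sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence):
    """Lower-cased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def jaccard(a, b):
    """Token-set overlap in [0, 1]; a stand-in for the tool's real metric."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def align(text_a, text_b, threshold=0.5):
    """Pair each sentence of text_a with its best match in text_b,
    keeping only pairs whose score reaches the (assumed) threshold."""
    sents_b = sentences(text_b)
    pairs = []
    for sa in sentences(text_a):
        best = max(sents_b, key=lambda sb: jaccard(tokens(sa), tokens(sb)), default=None)
        if best is not None:
            score = jaccard(tokens(sa), tokens(best))
            if score >= threshold:
                pairs.append((sa, best, score))
    return pairs

def text_similarity(text_a, text_b, threshold=0.5):
    """Mean alignment score over the sentences of text_a (unaligned sentences count as 0)."""
    sents_a = sentences(text_a)
    if not sents_a:
        return 0.0
    return sum(score for _, _, score in align(text_a, text_b, threshold)) / len(sents_a)

def_a = ("A large multiprotein complex that possesses histone acetyltransferase "
         "and is involved in regulation of transcription.")
def_b = ("A multiprotein complex that possesses histone acetyltransferase "
         "and is involved in regulation of transcription.")
print(text_similarity(def_a, def_b))
```

Note that this simplified score is directional (it averages over the sentences of the first text only); a symmetric variant could average the two directions.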

Metrics for pairwise text comparison

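[Equations not preserved in this transcript. Surviving parameter settings: δ1 = 0.85, δ2 = 0.05, δ3 = 0.1; the resulting score d satisfies 0 <= d <= 1.]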

For further details, see the paper.

An effective algorithm for text comparison

Cited from Clough et al. (2002)

Clustering texts
  • Using the text comparison tool, produce a distance matrix.

Matrix elements: eij = 1 - dij, with 0 <= eij <= 1.

  • Error Sum of Squares (ESS) hierarchical clustering
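As a rough illustration of these two steps, the sketch below converts pairwise scores d_ij into distances e_ij = 1 - d_ij and then runs an agglomerative clustering with the Ward (error sum of squares) criterion, using the standard Lance-Williams update on squared distances. The slides do not say how the ESS clustering was implemented, so the update rule, the treatment of e_ij as a Euclidean-like distance, and the function names are all assumptions.

```python
def to_distance_matrix(scores):
    """Turn similarity scores d_ij into distances e_ij = 1 - d_ij."""
    return [[1.0 - d for d in row] for row in scores]

def ward_cluster(dist):
    """Agglomerative clustering with the Ward (ESS) Lance-Williams update.

    dist is a symmetric matrix of distances. Returns the merge history as
    a list of (members_a, members_b, squared_distance_at_merge) tuples.
    """
    n = len(dist)
    # Ward's update is defined on squared distances.
    sq = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    members = {i: [i] for i in range(n)}
    merges = []
    next_id = n

    def d(a, b):
        return sq[(a, b) if a < b else (b, a)]

    while len(members) > 1:
        ids = sorted(members)
        # Pick the pair of clusters whose merge is cheapest.
        a, b = min(((x, y) for k, x in enumerate(ids) for y in ids[k + 1:]),
                   key=lambda pair: d(*pair))
        na, nb = len(members[a]), len(members[b])
        new = next_id
        next_id += 1
        # Lance-Williams update: distance from every other cluster to the merged one.
        for c in ids:
            if c in (a, b):
                continue
            nc = len(members[c])
            sq[(c, new)] = ((na + nc) * d(a, c) + (nb + nc) * d(b, c)
                            - nc * d(a, b)) / (na + nb + nc)
        merges.append((sorted(members[a]), sorted(members[b]), d(a, b)))
        members[new] = members.pop(a) + members.pop(b)
    return merges

# Toy similarity scores for four texts: 0 and 1 are close, 2 and 3 are close.
scores = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
for left, right, cost in ward_cluster(to_distance_matrix(scores)):
    print(left, right, round(cost, 3))
```

The merge history can be read bottom-up as a cluster tree: early merges correspond to the deepest layers, later merges to the layers nearer the root.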

Sample of cluster tree

{layer=9
  {layer=10
    {layer=11
      {layer=12 GO:0009897 GO:0010339 }
      {layer=12 GO:0010282 }
    }
    {layer=11
      {layer=12 GO:0045284 }
      {layer=12 GO:0045293 }
    }
  }
  {layer=10
    {layer=11
      {layer=12 GO:0017117 GO:0033202 }
      {layer=12 GO:0017119 }
    }

A package for definition comparison and term clustering

[Architecture diagram not preserved. Labels: synonym lexicon; distance matrix; term clusterer; user interface; check; update; clusters; term database; extended Porter's stemmer; pairwise definitions comparison.]

Evaluation
  • The text comparison and clustering components are evaluated on a set of GO terms as test data.
  • In the evaluation, we consider GO terms to be related if they:
    • share a parent term within three layers of ancestor trees via IS_A relation, or
    • have direct parent/child relations (e.g. X is_a Y), or
    • have direct part-of relations (e.g. X is part of Y).
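The three criteria above can be expressed as a small predicate. This is a sketch under assumptions: the ontology is represented as two hypothetical dictionaries mapping a term to its direct `is_a` parents and `part_of` targets, "within three layers" is read as "reachable in at most three is_a steps from each term", and the term IDs in the example are invented, not real GO IDs.

```python
def ancestors_within(term, is_a, max_depth=3):
    """Ancestors reachable from term in at most max_depth is_a steps."""
    found, frontier = set(), {term}
    for _ in range(max_depth):
        frontier = {parent for t in frontier for parent in is_a.get(t, ())}
        found |= frontier
    return found

def related(x, y, is_a, part_of, max_depth=3):
    """True if x and y satisfy any of the three relatedness criteria."""
    # Direct parent/child relation (X is_a Y, or Y is_a X).
    if y in is_a.get(x, ()) or x in is_a.get(y, ()):
        return True
    # Direct part-of relation in either direction.
    if y in part_of.get(x, ()) or x in part_of.get(y, ()):
        return True
    # Shared ancestor within max_depth is_a layers of both terms.
    return bool(ancestors_within(x, is_a, max_depth)
                & ancestors_within(y, is_a, max_depth))

# Invented toy ontology for illustration only.
is_a = {"B": ["A"], "C": ["A"], "D": ["B"],
        "E": ["X1"], "X1": ["X2"], "X2": ["X3"], "X3": ["A"]}
part_of = {"F": ["C"]}

print(related("B", "C", is_a, part_of))  # siblings under A
print(related("E", "C", is_a, part_of))  # A is four is_a steps above E
```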

Evaluation
  • Test data
    • GO terms under the namespace of cellular_component
    • 2,027 terms found, of which 2,010 have definitions; these 2,010 terms form the actual test data.
    • Each of the 2,010 test terms is related, as defined previously, to one or more other test terms.
  • Our evaluation strategy is to examine:
    • How many clustered terms have the relations defined previously, and
    • How many of the related terms can be covered by the clusters.
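One possible reading of these two questions is a precision-like score over the clustered terms and a coverage-like score over all related test terms. The sketch below follows that reading; the exact measures used in the evaluation are not given on the slides, so the definitions, the `evaluate` function and the toy data are assumptions.

```python
def evaluate(clusters, all_terms, related):
    """Return (precision_like, coverage_like) for one layer of clusters.

    clusters  : list of sets of terms (singletons are treated as unclustered)
    all_terms : every term in the test data
    related   : two-argument predicate, e.g. built from ontology relations
    """
    multi = [c for c in clusters if len(c) > 1]
    clustered = [t for c in multi for t in c]
    # A clustered term is a hit if some other term in its cluster is related to it.
    hits = [t for c in multi for t in c
            if any(related(t, u) for u in c if u != t)]
    # Terms that have at least one related term anywhere in the test data.
    relevant = [t for t in all_terms
                if any(related(t, u) for u in all_terms if u != t)]
    precision = len(hits) / len(clustered) if clustered else 0.0
    coverage = len(hits) / len(relevant) if relevant else 0.0
    return precision, coverage

# Toy example: a-b and c-d are the related pairs; e is related to nothing.
gold = {frozenset(("a", "b")), frozenset(("c", "d"))}
is_related = lambda x, y: frozenset((x, y)) in gold
clusters = [{"a", "b", "e"}, {"c"}, {"d"}]
print(evaluate(clusters, ["a", "b", "c", "d", "e"], is_related))
```

Under this reading, the denominator of the first score is the number of clustered terms reported for each layer, and the denominator of the second is the number of test terms that have at least one related term.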

Evaluation of bottom-layer clusters

Total_clustered_terms=1,076

Evaluation of the second layer clusters

Total_clustered_terms=2,010

Evaluation of the third layer clusters

Total_clustered_terms=2,010


Application of this package

  • This package can be used as a support tool for modifying and enriching ontologies and terminologies. (Brief demo of interface)

Conclusion
  • Ontology term definitions provide an important information source for term matching.
  • A text comparison and clustering tool can be useful for matching the terms.
  • For better performance, the tool needs domain knowledge resources.

Acknowledgements

This research was supported by the EC BOOTStrep Project (ref. FP6-028099).

The UK National Centre for Text Mining is sponsored by the JISC/BBSRC/EPSRC.

References
  • BOOTStrep Project website: http://www.BOOTStrep.org.
  • Castillo, Gabriel, Gerardo Sierra, John McNaught (2003). An improved Algorithm for Semantic Clustering. Proceedings of the 1st international symposium on Information and communication technologies, Dublin.
  • Clough, Paul, Robert Gaizauskas, Scott Piao, Yorick Wilks (2002). METER: MEasuring TExt Reuse. Proceedings of the ACL-2002, University of Pennsylvania, Philadelphia, USA, pp. 152-159.
  • Gene Ontology: http://www.geneontology.org.
  • Piao, Scott and Tony McEnery (2003). A tool for text comparison. Proceedings of the Corpus Linguistics
