Text mining tool for ontology engineering based on use of product taxonomy and web directory

Jan Nemrava and Vojtech Svatek

Department of Information and Knowledge Engineering

VSE Praha

Current state
  • Information extraction (IE) and ontology learning are frequently discussed issues in the field of the Semantic Web.
  • Semi-automatic and automatic methods for ontology-based information extraction are needed.
  • The Web is a great source of unstructured text.

DATESO 2005

Task is …
  • Collect specific words (verbs in our case) that usually occur together with a particular product category, as support for ontology designers.
  • Focus on small, specialized ontologies, each concerning one product category and describing its frequent relations in common text.
  • Make use of full-text search engines and the DMOZ directory for retrieving information,
  • and of the UNSPSC (United Nations Standard Products and Services Code) product catalogue.

Web directories are rarely valid taxonomies.
  • It is easy to see that subheadings are often not specializations of headings.
  • Some of them are not even concepts (names of entities) but properties that implicitly restrict the extension of a preceding concept in the hierarchy. Consider, for example, .../Industries/Construction and Maintenance/Materials and Supplies/ /Masonry_and_Stone/Natural Stone/International Sources/Mexico.

Proposal of method …
  • Obtain so-called "indicator verbs" that characterize a particular term (a product category in our case) in UNSPSC.
  • The particular terms will then be generalized, so that verbs indicative of the upper levels of these terms can be mined as well.
  • Join the UNSPSC taxonomy and its list of products with the content of company websites to gain valuable information about verbs that usually occur in the same sentence as some product category from the taxonomy.
  • Use hand-classified web directories containing relevant web sites.

Task sequence decomposition
  • Manually select a UNSPSC product and the corresponding product category from the DMOZ Business branch:
    • Search in directory heading names
    • Search in web site descriptions
    • Use full-text search
  • 1) Input: URL of a DMOZ directory containing companies that manufacture the desired product.
  • Output: List of company URLs.
  • 2) Input: URL of a company website.
  • Output: List of web pages containing the target term.
  • 3) Input: Web page containing the term.
  • Output: File with the extracted sentences containing the term.
  • 4) Input: Sentences containing the term.
  • Output: Tagged sentences.
  • 5) Input: Verbs.
  • Output: Lemmatized, grouped and saved verbs.
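
Steps 3 and 5 of the sequence above could be sketched roughly as follows. This is a minimal illustration, not the tool described in the slides: the function names are hypothetical, sentence splitting is a naive regular expression, the part-of-speech tags are assumed to be Penn Treebank-style labels produced by some external tagger (step 4), and the small lemma dictionary stands in for a real lemmatizer.

```python
import re
from collections import Counter


def extract_sentences(text, term):
    """Step 3 (sketch): return the sentences of `text` that mention `term`."""
    # Naive splitter: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


def count_verbs(tagged_sentences, lemmas):
    """Step 5 (sketch): count verb lemmas in (token, tag) sentences.

    Tags starting with "VB" are treated as verbs (assumed Penn Treebank
    convention). `lemmas` is a hypothetical word-to-lemma lookup table.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        for token, tag in sentence:
            if tag.startswith("VB"):
                word = token.lower()
                counts[lemmas.get(word, word)] += 1
    return counts
```

In a fuller pipeline, the output of `extract_sentences` would be passed through a tagger before `count_verbs` is called once per product category.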

Experiment
  • Handling equipment branch: UNSPSC products paired with their corresponding DMOZ categories.
  • The goal is to find verbs that are:
    • common to most products,
    • characteristic of one branch of products,
    • specific to a small group of products, or even to only one product.
  • 7 product categories; 303 verbs collected, occurring 7300 times on web sites.
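
One simple way to approximate the three groups of verbs is to look at how many product categories each verb occurs in. The sketch below is only an illustration under assumptions: the function name, the `common_share` and `specific_max` thresholds, and the use of category coverage as a proxy for "one branch" are not taken from the slides.

```python
def bucket_verbs(freq, common_share=0.8, specific_max=1):
    """Label each verb as 'common', 'branch' or 'specific'.

    `freq` maps verb -> {category: count}. Both thresholds are
    hypothetical and would need tuning on real data.
    """
    categories = {c for counts in freq.values() for c in counts}
    buckets = {}
    for verb, counts in freq.items():
        # Number of categories in which the verb occurs at least once
        covered = sum(1 for c in categories if counts.get(c, 0) > 0)
        if covered <= specific_max:
            buckets[verb] = "specific"
        elif covered / len(categories) >= common_share:
            buckets[verb] = "common"
        else:
            buckets[verb] = "branch"
    return buckets
```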

Experiments
  • Some verbs are obviously entirely neutral and do not characterize the products at all (be, have, provide, use).
  • Some are connected with manufacturing (design, require, offer, make, contact, manufacture, develop, supply).
  • Others describe activities of manipulating material (handle, lift, install, move).

Normalization
  • Fij = fij * (Vtj / V)
  • Croft's normalization moderates the effect of high-frequency verbs:
    • cf = K + (1 - K) * fij / mij
  • TF/IDF:
    • wij = fij * log2(N / n)
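
The last two formulas can be written out directly, as in the sketch below. The slides do not define the symbols, so the usual information-retrieval reading is assumed here: `f_ij` is the frequency of verb i in category j, `m_ij` is the maximum verb frequency in that category, `N` is the number of categories and `n` is the number of categories containing the verb. The default `K = 0.5` is likewise an assumption. The first formula is left out because Vtj and V are not defined in the transcript.

```python
import math


def croft(f_ij, m_ij, K=0.5):
    """Croft-style normalization: cf = K + (1 - K) * f_ij / m_ij."""
    return K + (1 - K) * f_ij / m_ij


def tf_idf(f_ij, N, n):
    """TF/IDF weight: w_ij = f_ij * log2(N / n)."""
    return f_ij * math.log2(N / n)
```

Under this reading, a verb that occurs in every category (n = N) gets a TF/IDF weight of zero, which matches the observation that verbs such as "be" or "have" do not characterize any product.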

Problem remaining …
  • Automate the assignment of UNSPSC categories to DMOZ categories.
  • Some UNSPSC categories have no appropriate DMOZ category, leading to few or no web sites.
  • Some categories are less informative.

Thank you!
