Unsupervised Information Extraction from Unstructured, Ungrammatical Data Sources on the World Wide Web

Mathew Michelson and Craig A. Knoblock

Presentation Transcript


  1. Unsupervised Information Extraction from Unstructured, Ungrammatical Data Sources on the World Wide Web Mathew Michelson and Craig A. Knoblock

  2. Abstract • Extracting unstructured data is difficult • Traditional methods do not apply • Solution: Unsupervised extraction • Results are competitive with supervised methods

  3. Introduction • Web data could be useful if extracted (e.g., Craigslist)

  4. Introduction • Posts are not structured • The Phoebus method works on this data • But it requires substantial user input (it is supervised) • The paper presents an unsupervised alternative • This work builds on unsupervised semantic annotation

  5. Introduction • Current work on unsupervised information extraction (UIE) relies on redundancy • This approach does not use structural assumptions • This approach relies on similarity, not redundancy • This approach creates relational data

  6. Unsupervised Extraction Steps of the algorithm: • Automatically choosing the Reference Set • Matching Posts to the Reference Set • Unsupervised Extraction

  7. Unsupervised Extraction • Automatically choosing the Reference Sets - They choose a reference set based on its similarity to the posts - They calculate a similarity score and sort the candidate sets - They use the average score and the percent difference between consecutive scores to decide the cut-off - The algorithm scales linearly with the size of the data - They try multiple metrics as the similarity score (a rough sketch follows below)
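
As a rough illustration of this selection step, the sketch below scores each candidate reference set by its average best-match token similarity to the posts, sorts the candidates, and cuts the list at the largest percent-difference drop between consecutive scores. This is a sketch under assumptions, not the authors' code: the Jaccard token metric, the function names, and the details of the cut-off rule are illustrative.

```python
from typing import Dict, List


def jaccard(a: set, b: set) -> float:
    """Token-set Jaccard similarity (one of several metrics that could be plugged in)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def score_reference_set(posts: List[str], records: List[str]) -> float:
    """Average best-match similarity between each post and the reference set."""
    ref_tokens = [set(r.lower().split()) for r in records]
    total = 0.0
    for post in posts:
        p = set(post.lower().split())
        total += max(jaccard(p, r) for r in ref_tokens)
    return total / len(posts)


def choose_reference_sets(posts: List[str],
                          candidates: Dict[str, List[str]]) -> List[str]:
    """Sort candidate reference sets by score and keep those above the
    largest percent-difference drop between consecutive scores."""
    scored = sorted(((score_reference_set(posts, recs), name)
                     for name, recs in candidates.items()), reverse=True)
    best_cut, best_drop = len(scored), 0.0
    for i in range(1, len(scored)):
        prev, cur = scored[i - 1][0], scored[i][0]
        drop = (prev - cur) / prev if prev > 0 else 0.0
        if drop > best_drop:
            best_drop, best_cut = drop, i
    return [name for _, name in scored[:best_cut]]
```

For a fixed pool of candidate reference sets, the cost of this sketch grows linearly with the number of posts, consistent with the linear-scaling claim above.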

  8. Unsupervised Extraction • Matching Posts to the Reference Set - A vector-space model is used to match posts to reference records - The Jaro-Winkler metric is used to match individual tokens - Attributes that do not agree are removed - Now the posts can be queried (see the sketch below)
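
A minimal sketch of the matching step is shown below. It assumes a simple coverage score over near-duplicate tokens in place of the paper's full vector-space model, substitutes the standard library's difflib.SequenceMatcher for the Jaro-Winkler metric, and uses an illustrative 0.9 threshold; none of these choices come from the slides.

```python
import difflib
from typing import Dict, List, Optional, Tuple


def token_sim(a: str, b: str) -> float:
    """Character-level similarity between two tokens.
    (difflib stands in here for the Jaro-Winkler metric named in the paper.)"""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_post(post: str, reference: List[Dict[str, str]],
               threshold: float = 0.9) -> Tuple[Optional[Dict[str, str]], float]:
    """Return the reference record whose attribute tokens best cover the post."""
    post_tokens = post.lower().split()
    best, best_score = None, -1.0
    for record in reference:
        rec_tokens = " ".join(record.values()).lower().split()
        # Count post tokens that closely match some token of this record.
        hits = sum(1 for p in post_tokens
                   if any(token_sim(p, r) >= threshold for r in rec_tokens))
        score = hits / len(post_tokens) if post_tokens else 0.0
        if score > best_score:
            best, best_score = record, score
    return best, best_score
```

Once each post is paired with its best reference record in this way, the record's clean attribute values provide the queryable view of the post.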

  9. Unsupervised Extraction • Unsupervised Extraction - A baseline comparison is made between each extracted field and the corresponding reference-set field - Post tokens are removed or kept based on this baseline (sketched below)
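
The baseline idea can be pictured roughly as follows: each post token is kept only if it closely resembles some token of the matched reference record, and it is assigned to the attribute that token came from. The attribute names, the 0.85 threshold, and the assignment rule below are illustrative assumptions, not taken from the paper.

```python
import difflib
from typing import Dict, List


def extract_attributes(post: str, matched_record: Dict[str, str],
                       threshold: float = 0.85) -> Dict[str, str]:
    """Keep post tokens that closely resemble tokens of the matched record,
    grouped by the attribute they resemble; drop everything else as noise."""
    post_tokens = post.lower().split()
    extracted: Dict[str, List[str]] = {attr: [] for attr in matched_record}
    for token in post_tokens:
        # Assign the token to the attribute containing its closest reference token.
        best_attr, best_sim = None, 0.0
        for attr, value in matched_record.items():
            for ref_tok in value.lower().split():
                sim = difflib.SequenceMatcher(None, token, ref_tok).ratio()
                if sim > best_sim:
                    best_attr, best_sim = attr, sim
        if best_attr is not None and best_sim >= threshold:
            extracted[best_attr].append(token)
    return {attr: " ".join(toks) for attr, toks in extracted.items() if toks}
```

For example, a post like "93 civic ex low miles" matched against the (hypothetical) record {"make": "honda", "model": "civic", "trim": "ex"} would keep only "civic" and "ex", under model and trim respectively.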

  10. Experimental Results: Reference Sets and Post Sets

  11. Experimental Results: Jensen-Shannon similarity

  12. Experimental Results: TF/IDF similarity

  13. Experimental Results: Jaccard similarity

  14. Experimental Results: Jaro-Winkler TF/IDF similarity

  15. Experimental Results: Results

  16. Experimental Results: Dice similarity

  17. Experimental Results: Jaccard similarity

  18. Experimental Results: TF/IDF similarity

  19. Experimental Results: Dice vs. Phoebus

  20. Experimental Results: Jaro-Winkler vs. Smith-Waterman

  21. Experimental Results: Comparison with other methods

  22. Related Work • SemTag is a similar system • But it relies on a hand-crafted taxonomy • Unlike this work, SemTag focuses on disambiguation • CRAM is also similar, but it requires labeled data

  23. Conclusion • This paper introduces an unsupervised information extraction technique • The Jensen-Shannon distance metric performed best among the similarity metrics tested • Handling acronyms in the post text would be beneficial • Adding entity extraction is a promising direction for future work
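
Since the conclusion singles out the Jensen-Shannon metric, here is a small generic sketch of how such a score can be computed over token-frequency distributions; it is not the paper's code, and treating posts and reference values as unigram distributions is an assumption.

```python
import math
from collections import Counter
from typing import Dict, List


def _kl(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Kullback-Leibler divergence D(p || q) over a shared vocabulary."""
    return sum(pv * math.log2(pv / q[t]) for t, pv in p.items() if pv > 0)


def jensen_shannon_similarity(a_tokens: List[str], b_tokens: List[str]) -> float:
    """1 minus the Jensen-Shannon divergence (base-2 logs, so the value is in [0, 1])."""
    if not a_tokens or not b_tokens:
        return 0.0
    a_counts, b_counts = Counter(a_tokens), Counter(b_tokens)
    vocab = set(a_counts) | set(b_counts)
    a_total, b_total = sum(a_counts.values()), sum(b_counts.values())
    p = {t: a_counts[t] / a_total for t in vocab}
    q = {t: b_counts[t] / b_total for t in vocab}
    m = {t: (p[t] + q[t]) / 2 for t in vocab}
    js = 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
    return 1.0 - js
```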

  24. Questions
