
Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations

Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations. Christian Biemann, Uwe Quasthoff, Karsten Böhm, Christian Wolff. I-KNOW'03, Friday, 4th of July. Goals: extraction of multiterms from unannotated text corpora for use in information visualisation.


Presentation Transcript


  1. Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations
     Christian Biemann, Uwe Quasthoff, Karsten Böhm, Christian Wolff
     I-KNOW'03, Friday, 4th of July

  2. Goals
     • extraction of multiterms
     • from unannotated text corpora
     • for use in information visualisation
     Example: Company Names
     Chris Biemann - IKnow'03 Graz

  3. Topics
     • Patterns of Company Names
     • From Patterns to Pattern Rules
     • Search and Verification Algorithm: Use Pattern Rules to find Name Parts
     • Term Extraction using Name Parts
     • Experiments and Evaluation
     • Aggregation of Name Variants
     • Application to Semantic Networks

  4. Patterns of Company Names
     Name parts: ABBR (abbreviation), NAME (name parts), KIND (legal form)
     Examples:
     • A & W Elektrogeräte GmbH & Co. KG
     • A. Baumgarten GmbH
     • Hagedorn GmbH
     • Institut für Angewandte Kreativität
     • DASAG GmbH
     • Japan Steel Works Ltd.
     • K.F.C. Germany Inc.
     • LABSCO Laboratory Supply Company GmbH & Co. KG
     Regular expression to capture the structure:
     (ABBR(FS|CONJ)?)* (NAME|CONN)* (KIND(FS|CONJ)?)*
     • FS: full stop
     • CONJ: conjunctions, like +, &, ...
     • ?: zero or one
     • *: zero or more
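The tag-level pattern on slide 4 can be checked mechanically once each token carries a class tag. The following is a minimal sketch, not the authors' implementation: the function name `matches_company_pattern` and the way the example name is tokenised and tagged by hand are assumptions made for illustration.

```python
import re

# Tag-level regular expression from the slide:
# (ABBR(FS|CONJ)?)* (NAME|CONN)* (KIND(FS|CONJ)?)*
# Every tag is followed by one space so that tags cannot run together.
COMPANY_PATTERN = re.compile(
    r"(?:ABBR (?:(?:FS|CONJ) )?)*"
    r"(?:(?:NAME|CONN) )*"
    r"(?:KIND (?:(?:FS|CONJ) )?)*"
)


def matches_company_pattern(tags):
    """Return True if the whole tag sequence fits the slide's pattern."""
    text = "".join(tag + " " for tag in tags)
    return COMPANY_PATTERN.fullmatch(text) is not None


# Hand-tagged reading (an assumption) of "A & W Elektrogeräte GmbH & Co. KG".
example = ["ABBR", "CONJ", "ABBR", "NAME", "KIND", "CONJ", "KIND", "FS", "KIND"]
print(matches_company_pattern(example))
```

Because every group in the pattern is starred, an empty tag sequence also matches, so a practical check would probably require at least one NAME or KIND in addition.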

  5. How to learn missing parts
     Example fragments:
     • ASD Japanese Steelworks Inc.
     • ...comprising Japanese Steelworks Inc., ...
     Suppose:
     • Japanese is a known NAME
     • Inc. is a known KIND
     Then:
     • Steelworks should be NAME
     • ASD should be ABBR
     • comprising should not be part of the name
     Use flat features:
     • _UC: Upper Case
     • _CAP: Capitalized
     • _LC: Lower Case
     • _MIX: Mixed Case
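The four flat features can be computed from the surface form of a word alone. A minimal sketch follows; the slide only names the features, so the exact definitions used here (for instance, treating a single upper-case letter as `_UC`) and the function name `flat_feature` are assumptions.

```python
def flat_feature(word):
    """Map a word to one of the four flat case features named on the slide."""
    letters = [ch for ch in word if ch.isalpha()]
    if not letters:
        return None  # no letters, e.g. punctuation; not covered by the slide
    if all(ch.isupper() for ch in letters):
        return "_UC"   # upper case throughout, e.g. "ASD"
    if all(ch.islower() for ch in letters):
        return "_LC"   # lower case throughout, e.g. "comprising"
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "_CAP"  # capitalized, e.g. "Steelworks"
    return "_MIX"      # anything else, e.g. "CineMedia"


for word in ["ASD", "Steelworks", "comprising", "CineMedia"]:
    print(word, flat_feature(word))
```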

  6. From Patterns to Pattern Rules
     (ABBR(FS|CONJ)?)* (NAME|CONN)* (KIND(FS|CONJ)?)*
     • Pattern
       ABBR NAME NAME KIND
     • Pattern Rules
       _CAP* NAME NAME KIND -> ABBR
       ABBR _UC* NAME KIND -> NAME
       ABBR NAME _UC* KIND -> NAME
       ABBR NAME NAME _MIX* -> KIND
       ...

  7. Pattern Rules Characteristics
     ABBR _UC* NAME KIND -> NAME
     • operate on sequences of
       - flat features
       - classes of known words
     • Problem: rules match too often
       - high coverage
       - low precision
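Applying a pattern rule amounts to sliding it over a token sequence and proposing a class for the word in the feature slot. The sketch below assumes one reading of the notation: the slides do not define the starred features, so `_UC*` is taken here to accept any word that begins with an upper-case letter, which is consistent with the candidates CineMedia and Die shown on slide 9. The names `FEATURES` and `apply_rule` are hypothetical.

```python
def starts_upper(word):
    """True if the first character of the word is an upper-case letter."""
    return word[:1].isupper()


# Hypothetical feature predicates; the slides do not define the starred forms.
FEATURES = {"_UC*": starts_upper}


def apply_rule(rule, target, tokens, classes):
    """Slide a rule over tokens and collect (word, target) for each match.

    rule    -- list of slots, each a known class name or a flat feature
    target  -- class proposed for the word in the feature slot
    classes -- dict mapping known words to their class
    """
    candidates = []
    for start in range(len(tokens) - len(rule) + 1):
        window = tokens[start:start + len(rule)]
        found = None
        for slot, word in zip(rule, window):
            if slot in FEATURES:
                if not FEATURES[slot](word):
                    break  # feature slot not satisfied
                found = word
            elif classes.get(word) != slot:
                break  # word is not a known member of the required class
        else:
            if found is not None:
                candidates.append((found, target))
    return candidates


known = {"Film": "NAME", "AG": "KIND"}
tokens = "Die CineMedia Film AG übernahm".split()
print(apply_rule(["_UC*", "NAME", "KIND"], "NAME", tokens, known))
```

The sketch also makes the precision problem visible: a fragment such as ". Die Film AG stellte" yields the article "Die" as a NAME candidate, which is why a separate verification step is needed.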

  8. Search and Verification Algorithm
     Initialise pattern rules
     Let unused elements := initial set of elements with class
     Loop:
       For each unused element
         Find candidates for new elements by the search step
         For each candidate
           Do the verification step
         Add accepted candidates to new unused elements
       Output new unused elements
       Unused elements := new unused elements
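The loop on slide 8 can be sketched as a bootstrapping procedure in which the search and verification steps are passed in as functions. This is an illustration under stated assumptions: the slide gives no termination condition, so the `max_rounds` parameter and the early exit when no new elements are found are additions, and the toy `search` and `verify` functions stand in for the corpus-based steps.

```python
def search_and_verify(seed_classes, search_step, verification_step, max_rounds=10):
    """Grow a word -> class dictionary from seed elements.

    search_step(element, classes)          -> iterable of (word, cls) candidates
    verification_step(word, cls, classes)  -> True to accept the candidate
    """
    classes = dict(seed_classes)
    unused = dict(seed_classes)
    for _ in range(max_rounds):  # hypothetical bound; not on the slide
        if not unused:
            break
        new_unused = {}
        for element in unused:
            for word, cls in search_step(element, classes):
                if word in classes or word in new_unused:
                    continue  # already known or already accepted this round
                if verification_step(word, cls, classes):
                    new_unused[word] = cls
        classes.update(new_unused)
        unused = new_unused
    return classes


# Toy stand-ins for the corpus-based steps (purely illustrative).
def search(element, classes):
    found = {"Film": [("Odeon", "NAME"), ("Die", "NAME")], "Odeon": [("Zwo", "NAME")]}
    return found.get(element, [])


def verify(word, cls, classes):
    return word != "Die"


print(search_and_verify({"Film": "NAME", "AG": "KIND"}, search, verify))
```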

  9. Search Step
     Rule: _UC* NAME KIND -> NAME
     Known: AG is KIND, Film is NAME
     • use unused element to find example sentences
       "Film" -> 100 sentences
     • apply Pattern Rules to obtain candidates
       CineMedia -> NAME
       Odeon -> NAME
       Senator -> NAME
       Lunaris -> NAME
       Die -> NAME
       ...
     • Fragments containing "Film":
       - Die CineMedia Film AG übernahm
       - die Odeon Film AG mit
       - darunter ein Film über
       - zu jedem Film interessante
       - die Senator Film AG über
       - zukunftsweisenden Film "Jurassic Park"
       - die Lunaris Film GmbH
       - erfolgreichsten Film der
       - . Die Film AG stellte nach...
       - der Odeon Film AG.

  10. Verification Step
      Rule: _UC* NAME KIND -> NAME
      Known: AG is KIND, Film is NAME
      • use candidate to find example sentences
        "Odeon" -> 30 sentences
      • apply Pattern Rules and check classifications of candidate
        "Odeon" is NAME in 17/30 cases -> accept
        "Senator" is NAME in 2/30 cases -> reject
        "Die" is NAME in 0/30 cases -> reject
      • Fragments containing "Odeon":
        - Die Odeon Film AG (3x)
        - des Vorstands der Odeon Film AG
        - rennomierte Viedovertriebskette Odeon
        - teilen sich Hecos Odeon Sub/200/Center
        - Rahmenvertrag mit Odeon Zwo
        - setzt auf Odeon Film AG
        - ...
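The accept/reject decision on slide 10 compares how often a candidate receives the proposed class against the number of example sentences. The slide does not state the cut-off, so the `threshold` of 0.5 below is a hypothetical value that merely reproduces the three decisions shown; the function name is likewise an assumption.

```python
def verify_candidate(classified_count, sentence_count, threshold=0.5):
    """Accept a candidate if its class was confirmed in enough sentences.

    threshold is a hypothetical cut-off; the slide does not give one.
    """
    if sentence_count == 0:
        return False  # no evidence at all
    return classified_count / sentence_count >= threshold


for word, hits in [("Odeon", 17), ("Senator", 2), ("Die", 0)]:
    decision = "accept" if verify_candidate(hits, 30) else "reject"
    print(word, f"{hits}/30", decision)
```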

  11. Extraction of Multiterms
      Pattern: _DELIML NAME NAME KIND _DELIMR = company
      Known: AG is KIND, Film is NAME, Orion is NAME
      • Patterns with delimiters can be used for extraction
      • Patterns only select appropriate multiterms, not single occurrences of name parts
      • Multiterms containing "Odeon":
        - Die Odeon Film AG (3x)
        - des Vorstands der Odeon Film AG
        - rennomierte Viedovertriebskette Odeon
        - teilen sich Hecos Odeon Sub/200/Center
        - Rahmenvertrag mit Odeon Zwo
        - setzt auf Odeon Film AG
      Extracted: "Odeon Film AG"
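An extraction pattern with delimiters can be sketched as a window match over classified tokens that is also bounded on both sides. The slides do not define `_DELIML` and `_DELIMR`; this sketch assumes that a sentence boundary or any word without a known class acts as a delimiter. The function name `extract_multiterms` is hypothetical.

```python
def extract_multiterms(tokens, classes, pattern=("NAME", "NAME", "KIND")):
    """Return multiterms whose words match pattern and are delimited on both sides."""

    def is_delimiter(index):
        # Assumption: sentence boundaries and words without a known class
        # count as delimiters.
        return index < 0 or index >= len(tokens) or tokens[index] not in classes

    found = []
    size = len(pattern)
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        matches = all(classes.get(w) == c for w, c in zip(window, pattern))
        if matches and is_delimiter(start - 1) and is_delimiter(start + size):
            found.append(" ".join(window))
    return found


known = {"Odeon": "NAME", "Film": "NAME", "AG": "KIND"}
print(extract_multiterms("des Vorstands der Odeon Film AG".split(), known))
print(extract_multiterms("Rahmenvertrag mit Odeon Zwo".split(), known))
```

Isolated occurrences of a name part, such as "Odeon" in "Rahmenvertrag mit Odeon Zwo", produce no output because the full pattern is not present.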

  12. Experiment
      • Prerequisites
        - take arbitrary company list
        - sort words by frequency
        - truncate top 1'000
        - assign classes NAME, KIND
      • Pattern Rules
        - Generate Patterns from Regexp
        - Generate Pattern Rules from Patterns
        - Add delimiters to Patterns to get Extraction Patterns

  13. Evaluation
      • Input
        - 1'002 Items
        - 47 Pattern Rules
        - 106 Extraction Patterns
      • Output
        - over 12'000 Items
        - over 6'000 multiterms (company names)
      Category | Example | Fraction
      Correct | Odeon Film AG | 75.80 %
      With Specifier | Grazer Andritz AG | 17.36 %
      Fractions | Großmarkt GmbH | 6.08 %
      Errors | Plastik MiniDIL | 0.76 %

  14. Aggregation of Name Variants
      Long candidate | Short candidate | Correct name
      Düsseldorfer Bank eG | Bank eG | Düsseldorfer Bank eG
      Düsseldorfer Rheinmetall AG | Rheinmetall AG | Rheinmetall AG
      Mannheimer Pharmexx GmbH | Pharmexx GmbH | Pharmexx GmbH
      Infomatec Media AG | Infomatec AG | Infomatec Media AG
      JENOPTIK Automatisierungstechnik GmbH | Jenoptik GmbH | JENOPTIK Automatisierungstechnik GmbH
      Jenoptik Bauentwicklung GmbH | Jenoptik GmbH | Jenoptik Bauentwicklung GmbH
      Kleindienst Datentechnik GmbH | Kleindienst GmbH | Kleindienst Datentechnik GmbH
      Nachrichtenagentur dpa-AFX | dpa-AFX | dpa-AFX
      Infomatec-Tochtergesellschaft Igel GmbH | Igel GmbH | Igel GmbH
      • Rule 1: first word is a location: remove it if the short form has high frequency
      • Rule 2: generic name not aligned to the term border: keep it to distinguish between subsidiaries
      • Rule 3: long form has a generic name as first word: remove it if the short form has higher frequency

  15. Application to Semantic Networks
      Media: Helkon Media AG, ProSieben Media AG, I-D Media AG
      http://www.wortschatz.uni-leipzig.de
      http://www.texttech.de

  16. Media Example (2): Helkon Media AG

  17. Media Example (3): ProSieben Media AG

  18. Media Example (4): I-D Media AG

  19. Telekom Example: Telekom vs. Deutsche Telekom AG

  20. Summary
      • Example-based unsupervised learning algorithm for multiterm extraction
      • Disambiguation of generic company name parts in knowledge representation
      • More fine-grained representation of complex concepts

  21. END. Thanks for your attention!
