Automatic Term Extraction from Domain Corpora

Piek Vossen, Irion Technologies / Vrije Universiteit Amsterdam
Guest lecture, Corpus-based Methods, Universiteit Nijmegen, 26 November 2007

Overview
• Corpus versus domain-based text collections
• Customer case
• Term extraction
• Demo

Corpus versus domain-based text collections
• Corpora for studying linguistic phenomena:
  • INL corpus: NRC Handelsblad
  • Corpus Geschreven Nederlands (corpus of written Dutch)
  • British National Corpus
  • Brown corpus -> SemCor
• Domain corpora:
  • portals
  • Wikipedia
• Customer corpora:
  • web sites
  • manuals

Customer case
• Connect suppliers and buyers, and create traffic and advertising
• B2B: companies with specialized products and services
  • terminology driven
  • branch driven
• C2B: consumers looking for products and services
  • general-language terminology -> folksonomy
  • bottom-up

• Product name in an ontology of 150,000 products: "kleppen, vlinder, pomp, hoge druk" (valves, butterfly, pump, high pressure)
• Product name on a company website: "Wij zijn gespecialiseerd in: pompen en pomponderdelen zoals kleppen" (We specialize in: pumps and pump components such as valves)
• User query searching for products or services: "vlinderkleppen voor een hoge drukpomp" (butterfly valves for a high-pressure pump)
• Subscription for product names
• Companies in database, 1.5 million websites

Term extraction
• morpho-syntactic analysis
• statistical analysis
• conceptual analysis
• contextual analysis

Term extraction: morpho-syntactic analysis
• Tokenization, tagging and NP chunking:
  • "een gele kaart voor de vleugelaanvaller" (a yellow card for the wing-player)
• Term candidates:
  • Syntactic head of NPs: kaart (card); vleugelaanvaller (wing-player)
  • Word combinations including the syntactic head: gele kaart (yellow card); kaart voor vleugelaanvaller (card for wing-player)
  • Head of compounds: aanvaller (attacker)
• A term is a concept:
  • Normalized form (plural-singular variants, synonyms)
  • Hypernym based on the syntactic head
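The three candidate types on this slide can be sketched in a few lines of Python. This is only an illustration, not the system's implementation: it assumes the chunker has already delivered each NP as a list of (word, tag) pairs with the head position marked, and that compound splitting is available as a lookup table. The names `NounPhrase`, `term_candidates` and `COMPOUND_HEADS`, and the tag labels, are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical compound-splitting table: compound -> head of the compound.
COMPOUND_HEADS = {"vleugelaanvaller": "aanvaller"}


@dataclass
class NounPhrase:
    tokens: list  # (word, tag) pairs; the tags DET, ADJ, N are illustrative
    head: int     # index of the syntactic head noun


def term_candidates(np_a, prep=None, np_b=None):
    """Collect term candidates from one NP, optionally linked to a second NP."""
    head_word = np_a.tokens[np_a.head][0]
    candidates = [head_word]  # 1. syntactic head of the NP

    # 2a. modifier + head combinations (determiners are skipped)
    modifiers = [w for w, tag in np_a.tokens[:np_a.head] if tag == "ADJ"]
    if modifiers:
        candidates.append(" ".join(modifiers + [head_word]))

    # 2b. head + preposition + head of the following NP
    if prep and np_b is not None:
        candidates.append(f"{head_word} {prep} {np_b.tokens[np_b.head][0]}")

    # 3. head of a compound
    if head_word in COMPOUND_HEADS:
        candidates.append(COMPOUND_HEADS[head_word])
    return candidates


card = NounPhrase([("een", "DET"), ("gele", "ADJ"), ("kaart", "N")], head=2)
winger = NounPhrase([("de", "DET"), ("vleugelaanvaller", "N")], head=1)

print(term_candidates(card, "voor", winger))
print(term_candidates(winger))
```

Normalization to a single concept (singular/plural variants, synonyms) would be a further step on top of these raw candidates and is left out here.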

Term extraction: statistical analysis
• Reference corpus based on 500 websites from a diverse range of companies
• Salience = normFreq * normRef
• normFreq = normalized frequency of the term on the website:
  • normFreq = nTermFrequency * nWords / nPages
• normRef = normalized number of websites on which the term occurs in the reference corpus:
  • multiwords: normRef = 1 - ((nWebsites * nWords) / referenceCorpusSize)
  • single words: normRef = 1 - (nWebsites / referenceCorpusSize)
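As a worked example, the salience score can be written out in Python. The sketch assumes that the normFreq and multiword normRef formulas multiply by nWords, and that nWords is the number of words in the term; neither is stated explicitly on the slide. Under that reading the single-word formula is just the multiword formula with nWords = 1. The function names and the input numbers are invented for illustration.

```python
def norm_freq(term_frequency, n_words, n_pages):
    """Term frequency on the website, weighted by term length, per page."""
    return term_frequency * n_words / n_pages


def norm_ref(n_websites, n_words, reference_corpus_size=500):
    """1 minus the (length-weighted) share of reference websites containing the term."""
    return 1 - (n_websites * n_words) / reference_corpus_size


def salience(term_frequency, n_words, n_pages, n_websites, reference_corpus_size=500):
    """Salience = normFreq * normRef."""
    return (norm_freq(term_frequency, n_words, n_pages)
            * norm_ref(n_websites, n_words, reference_corpus_size))


# Invented numbers: a two-word term seen 12 times on a 40-page website,
# and found on 5 of the 500 reference websites.
score = salience(term_frequency=12, n_words=2, n_pages=40, n_websites=5)
print(round(score, 3))  # 0.6 * 0.98 = 0.588
```

A term that is frequent on the customer site but rare across the reference websites scores high; a term found on most reference websites is pushed towards zero.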

Term extraction: conceptual analysis
• Structural properties of the term hierarchy
• Poor hierarchies:
  • many tops
  • few levels
  • diverse branches
• Each branch is a concept:
  • number of descendants and levels
  • cumulated frequency of descendants
• Branch profiling:
  • Domain classification of the hierarchy
  • Domain classification of each branch
  • Minimal overlap in domain
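The structural properties named on this slide (tops, levels, descendants, cumulated frequency) are straightforward to compute once the term hierarchy is available. The sketch below assumes the hierarchy is given as a child-to-parent mapping; the function names are hypothetical, the terms and frequencies are made up, and counting the branch root's own frequency in the cumulated total is a choice made here, not something the slide specifies.

```python
from collections import defaultdict


def build_children(parent_of):
    """Invert a child -> parent mapping into parent -> [children]."""
    children = defaultdict(list)
    for term, parent in parent_of.items():
        if parent is not None:
            children[parent].append(term)
    return children


def tops(parent_of):
    """Terms without a parent; many tops suggest a poor hierarchy."""
    return sorted(t for t, p in parent_of.items() if p is None)


def branch_stats(term, children, freq):
    """Return (descendants, levels below term, cumulated frequency incl. term)."""
    n_desc, levels, cum_freq = 0, 0, freq.get(term, 0)
    for child in children.get(term, []):
        d, l, c = branch_stats(child, children, freq)
        n_desc += 1 + d
        levels = max(levels, 1 + l)
        cum_freq += c
    return n_desc, levels, cum_freq


# Made-up hierarchy and frequencies for illustration.
parent_of = {
    "kaart": None, "gele kaart": "kaart", "rode kaart": "kaart",
    "aanvaller": None, "vleugelaanvaller": "aanvaller",
}
freq = {"kaart": 3, "gele kaart": 5, "rode kaart": 2,
        "aanvaller": 1, "vleugelaanvaller": 4}

children = build_children(parent_of)
print(tops(parent_of))                        # ['aanvaller', 'kaart']
print(branch_stats("kaart", children, freq))  # (2, 1, 10)
```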

Wordnet: domain information (diagram)
• Domains: Clothing, Sport, Finance, Culture, Music, Ball sports, Winter sports
• Diagram labels: Domains, Concepts, Relations, Vocabularies of languages
• Vocabulary: bank (1, 2), violin (1, 2), violist, string (1, 2)
• Concepts:
  • rec:12345 - financial institute
  • rec:54321 - river side
  • rec:9876 - small string instrument
  • rec:65438 - musician playing a violin
  • rec:42654 - musician
  • rec:35576 - string of an instrument
  • rec:29551 - underwear
  • rec:25876 - string instrument
• Relations: type-of, part-of

Term extraction: contextual analysis
• Anything can be a product or service: there are no intrinsic properties that define products
• Contextual features:
  • context patterns for products
  • product pages
  • special marking in HTML

Term extraction: contextual analysis
• Context patterns for products:
  • 144 patterns in English and 288 patterns in German
  • [we supply] [we deliver] [we provide] [our products are]
  • [we are one of the leading, producers on the market for]
  • [we are, leading, producers on the market for]
  • [is one of the leading, producers on the market for]
  • [is, leading, producer on the market for]
  • [we develop, products for] [we design, products for] [we produce, products for]
  • [Our most common products]
• Each term is scored for a product context in terms of the strength of the pattern and the distance
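The slide says that a term is scored by pattern strength and distance but does not give the formula. The following sketch therefore makes its own assumptions: each pattern has a hand-assigned strength, the distance is the number of tokens between the end of the pattern and the start of the term, and the score is strength / (1 + distance). The strengths, the formula and the name `product_context_score` are all hypothetical.

```python
import re

# Hypothetical strengths for a few of the simple patterns from the slide.
PATTERNS = {
    "we supply": 1.0,
    "we deliver": 1.0,
    "we provide": 0.8,
    "our products are": 0.9,
}


def product_context_score(text, term):
    """Best strength / (1 + distance) over all patterns that precede the term."""
    tokens = re.findall(r"\w+", text.lower())
    term_tokens = term.lower().split()
    n = len(term_tokens)
    term_starts = [i for i in range(len(tokens) - n + 1)
                   if tokens[i:i + n] == term_tokens]

    best = 0.0
    for pattern, strength in PATTERNS.items():
        pattern_tokens = pattern.split()
        m = len(pattern_tokens)
        for i in range(len(tokens) - m + 1):
            if tokens[i:i + m] != pattern_tokens:
                continue
            pattern_end = i + m
            for start in term_starts:
                if start >= pattern_end:  # term must follow the pattern
                    distance = start - pattern_end
                    best = max(best, strength / (1 + distance))
    return best


sentence = "We supply butterfly valves and high pressure pumps."
print(product_context_score(sentence, "butterfly valves"))  # directly after the pattern
print(product_context_score(sentence, "pumps"))             # five tokens further on
```

A term right after "we supply" receives the full pattern strength, while a term later in the same sentence is discounted. Multi-part patterns such as [we develop, products for] would need gap-tolerant matching, which this sketch does not attempt.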

Term extraction: contextual analysis
• Product pages:
  • landing page: index.html
  • HTML files with product names: product, service, solution
  • HTML files referred to by these pages
  • HTML files referred to by menus with such names
• Special marking in HTML:
  • meta keywords
  • headings and titles
  • menus

Product terms with feature bundles

<class>
  <name><![CDATA[arabica-kaffee gemahlene]]></name>
  <id>48</id>
  <pos>1</pos>
  <preferred_form><![CDATA[Arabica-Kaffee Gemahlener]]></preferred_form>
  <parent_form><![CDATA[Gemahlener]]></parent_form>
  <documents>1</documents>
  <frequency>1</frequency>
  <salience>0.0523</salience>
  <connectivity>10</connectivity>
  <modifiers>
    <modifier>arabica</modifier>
    <modifier>kaffee</modifier>
    <modifier>arabica-kaffee</modifier>
  </modifiers>
  <profileMatch>-1</profileMatch>
  <profile/>
  <termSource><![CDATA[#product]]></termSource>
  <cumfrequency_parent>1</cumfrequency_parent>
  <cumdocuments_parent>1</cumdocuments_parent>
  <siblings>1</siblings>
  <features>
    <feature>
      <featureName>RIGHT</featureName>
      <featureValue>kaffee</featureValue>
      <featureScore>1.0</featureScore>
    </feature>
  </features>
</class>
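A record like this can be read with Python's standard `xml.etree.ElementTree` module. The sketch below uses an abridged copy of the record above and pulls out a handful of fields; the helper name `read_term` and the choice of fields are illustrative only.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the term record shown above.
RECORD = """<class>
  <name><![CDATA[arabica-kaffee gemahlene]]></name>
  <salience>0.0523</salience>
  <modifiers>
    <modifier>arabica</modifier>
    <modifier>kaffee</modifier>
    <modifier>arabica-kaffee</modifier>
  </modifiers>
  <termSource><![CDATA[#product]]></termSource>
  <features>
    <feature>
      <featureName>RIGHT</featureName>
      <featureValue>kaffee</featureValue>
      <featureScore>1.0</featureScore>
    </feature>
  </features>
</class>"""


def read_term(xml_text):
    """Parse one <class> record into a plain dictionary."""
    node = ET.fromstring(xml_text)
    return {
        "name": node.findtext("name"),  # CDATA content is returned as text
        "salience": float(node.findtext("salience")),
        "source": node.findtext("termSource"),
        "modifiers": [m.text for m in node.findall("modifiers/modifier")],
        "features": [
            (f.findtext("featureName"),
             f.findtext("featureValue"),
             float(f.findtext("featureScore")))
            for f in node.findall("features/feature")
        ],
    }


term = read_term(RECORD)
print(term["name"], term["salience"], term["features"])
```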

Evaluation of French product extraction
