Automatic term extraction from domain corpora
Piek Vossen, Irion Technologies / Vrije Universiteit Amsterdam
Guest lecture, Corpus-based Methods, Universiteit Nijmegen, 26 November 2007
Overview
• Corpus versus Domain-based text collections
• Customer-case
• Term-extraction
• Demo
Corpus versus Domain-based text collections
• Corpus to study linguistic phenomena:
  • INL corpus: NRC-handelsblad
  • Corpus geschreven Nederlands
  • British National Corpus
  • Brown corpus -> SemCor
• Domain corpora:
  • portals
  • Wikipedia
• Customer corpora:
  • web sites
  • manuals
Customer-case
• Connect suppliers and buyers and create traffic and advertising
• B2B: companies with specialized products and services
  • terminology driven
  • branch driven
• C2B: consumers looking for products and services
  • general language terminology: -> folksonomy
  • bottom-up
• Product name in ontology of 150,000 products: "kleppen, vlinder, pomp, hoge druk" (valves, butterfly, pump, high pressure)
• Product name on company website: "Wij zijn gespecialiseerd in: pompen en pomponderdelen zoals kleppen" (We are specialized in: pumps and pump components such as valves)
• User query searching for products or services: "vlinderkleppen voor een hoge drukpomp" (butterfly valves for a high pressure pump)
• Diagram labels: Subscription for product names; Companies in database; 1.5 million websites
Term-extraction
• morpho-syntactic analysis
• statistical analysis
• conceptual analysis
• contextual analysis
Term-extraction: morpho-syntactic analysis
• Tokenization, tagging and NP-chunking:
  • “een gele kaart voor de vleugelaanvaller” (a yellow card for the wing-player)
• Term candidates:
  • Syntactic head of NPs: kaart (card); vleugelaanvaller (wing-player).
  • Word combinations including syntactic head: gele kaart (yellow card); kaart voor vleugelaanvaller (card for wing-player).
  • Head of compounds: aanvaller (attacker-player).
• Term is a concept:
  • Normalized form (plural-singular variants, synonyms)
  • Hypernym based on the syntactic head
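A minimal sketch of how term candidates could be generated from an NP chunk, using the example phrase from this slide. It assumes tagging and chunking have already been done, so each NP arrives as a list of premodifiers (determiners removed) plus a syntactic head. The function names `np_candidates` and `pp_candidate` are hypothetical and not part of the system described in the lecture; compound splitting (vleugelaanvaller -> aanvaller) is left out because it needs a lexicon.

```python
from typing import List


def np_candidates(modifiers: List[str], head: str) -> List[str]:
    """Return the head alone, then the head with progressively more premodifiers."""
    candidates = [head]
    # Walk from the modifier closest to the head outwards.
    for start in range(len(modifiers) - 1, -1, -1):
        candidates.append(" ".join(modifiers[start:] + [head]))
    return candidates


def pp_candidate(head: str, preposition: str, other_head: str) -> str:
    """Combine two NP heads linked by a preposition into one candidate."""
    return f"{head} {preposition} {other_head}"


# "een gele kaart voor de vleugelaanvaller", determiners already stripped
first_np = np_candidates(["gele"], "kaart")        # ['kaart', 'gele kaart']
second_np = np_candidates([], "vleugelaanvaller")  # ['vleugelaanvaller']
combined = pp_candidate("kaart", "voor", "vleugelaanvaller")
```

Every candidate contains the syntactic head, so the head can later serve as the hypernym of the longer candidates.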
Term-extraction: statistical analysis
• Reference corpus based on 500 websites of a diverse range of companies
• Salience = normFreq * normRef
• normFreq = normalized frequency of terms on the website:
  normFreq = nTermFrequencynWords / nPages
• normRef = normalized number of websites on which the term occurs in the reference corpus:
  • multiwords: normRef = 1 - ((nWebsitesnWords) / (referenceCorpusSize))
  • singlewords: normRef = 1 - ((nWebsites) / (referenceCorpusSize))
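The salience product can be illustrated with a small sketch for single-word terms only. It implements Salience = normFreq * normRef and the single-word normRef exactly as written on the slide. normFreq is passed in as a ready-made number, and the multiword variant is omitted, because the way nWords enters those two formulas is not legible in this transcript. The function names and the example numbers are hypothetical.

```python
def norm_ref_single_word(n_websites: int, reference_corpus_size: int) -> float:
    """normRef for single words: 1 - (nWebsites / referenceCorpusSize)."""
    if reference_corpus_size <= 0:
        raise ValueError("reference_corpus_size must be positive")
    return 1.0 - n_websites / reference_corpus_size


def salience(norm_freq: float, norm_ref: float) -> float:
    """Salience = normFreq * normRef."""
    return norm_freq * norm_ref


# Two terms with the same (assumed) normFreq of 0.5 on the customer website:
# one occurs on 450 of the 500 reference websites, the other on only 5.
common = salience(0.5, norm_ref_single_word(450, 500))
rare = salience(0.5, norm_ref_single_word(5, 500))
```

With equal frequency on the customer site, the term that is rare in the reference corpus ends up with the higher salience, which is the intended effect of the normRef factor.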
Term-extraction: conceptual analysis
• Structural properties of the term hierarchy
• Poor hierarchies:
  • many tops
  • few levels
  • diverse branches
• Each branch is a concept:
  • number of descendants and levels
  • cumulated frequency of descendants
• Branch profiling:
  • Domain classification of the hierarchy
  • Domain classification of each branch
  • Minimal overlap in domain
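The branch properties listed above (number of descendants, number of levels, cumulated frequency of descendants) can be computed with a short recursive walk. This is a sketch under assumptions: the hierarchy is given as a parent-to-children dictionary without cycles, the function name `branch_stats` is hypothetical, and the example terms and frequencies are invented for illustration.

```python
from typing import Dict, List, Tuple


def branch_stats(
    term: str,
    children: Dict[str, List[str]],
    frequency: Dict[str, int],
) -> Tuple[int, int, int]:
    """Return (descendants, levels below term, cumulated frequency of descendants)."""
    n_descendants = 0
    levels = 0
    cum_frequency = 0
    for child in children.get(term, []):
        d, l, f = branch_stats(child, children, frequency)
        n_descendants += 1 + d                       # the child plus its own descendants
        levels = max(levels, 1 + l)                  # deepest path below this term
        cum_frequency += frequency.get(child, 0) + f
    return n_descendants, levels, cum_frequency


# Invented example hierarchy: each longer term hangs under its syntactic head.
children = {
    "kaart": ["gele kaart", "rode kaart"],
    "gele kaart": ["tweede gele kaart"],
}
frequency = {"kaart": 10, "gele kaart": 4, "rode kaart": 3, "tweede gele kaart": 1}

stats = branch_stats("kaart", children, frequency)  # (3, 2, 8)
```

A hierarchy with many tops whose branches all return small values here would count as poor in the sense of the slide.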
Wordnet: Domain information (diagram)
• Domains: Clothing, Sport, Finance, Culture, Music, Ball sports, Winter sports
• Diagram columns: Concepts, Relations, Vocabularies of languages
• Concept records:
  • rec: 12345 - financial institute
  • rec: 54321 - river side
  • rec: 9876 - small string instrument
  • rec: 65438 - musician playing a violin
  • rec: 42654 - musician
  • rec: 35576 - string of an instrument
  • rec: 29551 - underwear
  • rec: 25876 - string instrument
• Relations: type-of, part-of
• Words (with numbered senses): bank (1, 2), violin (1, 2), violist, string (1, 2)
Term-extraction: contextual analysis
• Anything can be a product or service: there are no intrinsic properties to define products
• Contextual features:
  • context patterns for products
  • product pages
  • special marking in HTML
Term-extraction: contextual analysis
• Context patterns for products:
  • 144 patterns in English and 288 patterns in German
  • Examples:
    [we supply]
    [we deliver]
    [we provide]
    [our products are]
    [we are one of the leading, producers on the market for]
    [we are, leading, producers on the market for]
    [is one of the leading, producers on the market for]
    [is, leading, producer on the market for]
    [we develop, products for]
    [we design, products for]
    [we produce, products for]
    [Our most common products]
• Each term is scored for a product context in terms of the strength of the pattern and the distance
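The slide says only that a term is scored by pattern strength and distance; it does not give the formula. The sketch below is therefore an assumption throughout: it uses score = strength / (1 + distance), where distance is the number of tokens between the end of the pattern and the term, and it takes the best score over all matching patterns. The strength values and the name `product_context_score` are hypothetical, and only contiguous patterns are handled (not the comma-separated multi-part patterns shown above).

```python
from typing import Dict, List


def product_context_score(
    tokens: List[str], term_index: int, patterns: Dict[str, float]
) -> float:
    """Best score over all patterns that occur before the term (assumed formula)."""
    lowered = [t.lower() for t in tokens]
    best = 0.0
    for pattern, strength in patterns.items():
        p_tokens = pattern.lower().split()
        n = len(p_tokens)
        # Only consider matches that end at or before the term position.
        for start in range(0, term_index - n + 1):
            if lowered[start:start + n] == p_tokens:
                distance = term_index - (start + n)  # tokens between pattern and term
                best = max(best, strength / (1 + distance))
    return best


# Hypothetical strengths for two of the patterns listed on the slide.
patterns = {"we supply": 1.0, "our products are": 0.8}
tokens = "We supply high pressure pumps".split()

near = product_context_score(tokens, 2, patterns)  # "high": directly after the pattern
far = product_context_score(tokens, 4, patterns)   # "pumps": two tokens further away
```

Any monotonically decreasing function of distance would serve the same purpose; the reciprocal is chosen here only for simplicity.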
Term-extraction: contextual analysis
• Product pages:
  • landing page: index.html
  • html files with product names: product, service, solution
  • html files referred to by these pages
  • html files referred to by menus with such names
• Special marking in HTML:
  • meta keywords
  • headings and titles
  • menus
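Two of these contextual features can be sketched with the Python standard library: recognizing product pages by file name, and collecting text from meta keywords, titles and headings. This is an illustrative sketch, not the lecture's implementation. The names `is_product_page` and `MarkedTextParser` are hypothetical, and menus are left out because their markup differs from site to site.

```python
from html.parser import HTMLParser
from typing import Dict, List

PRODUCT_NAMES = ("product", "service", "solution")  # names listed on the slide


def is_product_page(path: str) -> bool:
    """True for the landing page or a file whose name contains a product name."""
    filename = path.lower().rsplit("/", 1)[-1]
    return filename == "index.html" or any(n in filename for n in PRODUCT_NAMES)


class MarkedTextParser(HTMLParser):
    """Collect meta keywords and the text of title and heading elements."""

    HEADING_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.marked: Dict[str, List[str]] = {"keywords": [], "headings": []}
        self._open_tag = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.marked["keywords"] += [k.strip() for k in content.split(",") if k.strip()]
        elif tag in self.HEADING_TAGS:
            self._open_tag = tag

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None

    def handle_data(self, data):
        if self._open_tag and data.strip():
            self.marked["headings"].append(data.strip())


page = (
    '<html><head><title>Pumps and valves</title>'
    '<meta name="keywords" content="pumps, valves, butterfly valves"></head>'
    '<body><h1>Our products</h1><p>Plain text</p></body></html>'
)
parser = MarkedTextParser()
parser.feed(page)
parser.close()
```

Terms found in `parser.marked` could then receive a higher contextual score than terms that occur only in running text.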
Product terms with feature bundles
<class>
  <name><![CDATA[arabica-kaffee gemahlene]]></name>
  <id>48</id>
  <pos>1</pos>
  <preferred_form><![CDATA[Arabica-Kaffee Gemahlener]]></preferred_form>
  <parent_form><![CDATA[Gemahlener]]></parent_form>
  <documents>1</documents>
  <frequency>1</frequency>
  <salience>0.0523</salience>
  <connectivity>10</connectivity>
  <modifiers>
    <modifier>arabica</modifier>
    <modifier>kaffee</modifier>
    <modifier>arabica-kaffee</modifier>
  </modifiers>
  <profileMatch>-1</profileMatch>
  <profile/>
  <termSource><![CDATA[#product]]></termSource>
  <cumfrequency_parent>1</cumfrequency_parent>
  <cumdocuments_parent>1</cumdocuments_parent>
  <siblings>1</siblings>
  <features>
    <feature>
      <featureName>RIGHT</featureName>
      <featureValue>kaffee</featureValue>
      <featureScore>1.0</featureScore>
    </feature>
  </features>
</class>
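A record like this can be read with Python's standard `xml.etree.ElementTree`. The sketch below parses a shortened copy of the record and pulls out a few fields; the helper name `read_term` and the choice of fields are illustrative assumptions, not part of the system shown in the lecture.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the record above (only the elements used below).
RECORD = """<class>
  <name><![CDATA[arabica-kaffee gemahlene]]></name>
  <frequency>1</frequency>
  <salience>0.0523</salience>
  <modifiers>
    <modifier>arabica</modifier>
    <modifier>kaffee</modifier>
    <modifier>arabica-kaffee</modifier>
  </modifiers>
  <features>
    <feature>
      <featureName>RIGHT</featureName>
      <featureValue>kaffee</featureValue>
      <featureScore>1.0</featureScore>
    </feature>
  </features>
</class>"""


def read_term(xml_text: str) -> dict:
    """Extract name, frequency, salience, modifiers and features from one <class>."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),  # CDATA content is returned as plain text
        "frequency": int(root.findtext("frequency")),
        "salience": float(root.findtext("salience")),
        "modifiers": [m.text for m in root.findall("modifiers/modifier")],
        "features": [
            (
                f.findtext("featureName"),
                f.findtext("featureValue"),
                float(f.findtext("featureScore")),
            )
            for f in root.findall("features/feature")
        ],
    }


term = read_term(RECORD)
```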
Evaluation of French product extraction