
Bilingual term extraction revisited

Špela Vintar

University of Ljubljana

spela.vintar@ff.uni-lj.si

Extracting terms from the Acquis corpus
  • Using a bilingual subcorpus on Nuclear Energy (EN-SL)
  • No linguistic preprocessing, only stop lists
  • Universal terms and collocations:
    • Council regulation
    • European Union
    • Member State
    • Commission directive
    • Article
    • Having regard to

Danger of “Acquis stoplists”: they would also filter out genuine domain terms such as European Atomic Energy Community

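“Keyness”

Measures of keyness:

  • subcorpus vs. general language corpus (here: Acquis): relative corpus frequency
  • document vs. document collection: tf.idf

$$\mathrm{weight}(i, j) = \left(1 + \log \mathit{tf}_{i,j}\right) \log \frac{N}{\mathit{df}_i}$$

where tf_{i,j} is the frequency of term i in document j, df_i is the number of documents containing term i, and N is the number of documents in the collection.

Applied to single or multi-word units.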
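
The tf.idf weight can be sketched in a few lines of Python. This is only an illustration of the formula weight(i, j) = (1 + log(tf_{i,j})) log(N / df_i); the function name `tfidf_weights` and the three toy documents are assumptions made for the example, not part of the original experiment.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Return one {term: weight} dict per tokenised document.

    weight(i, j) = (1 + log(tf_ij)) * log(N / df_i)
    """
    n_docs = len(documents)
    # df_i: number of documents that contain term i at least once
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)  # tf_ij for this document
        weights.append({
            term: (1 + math.log(tf)) * math.log(n_docs / doc_freq[term])
            for term, tf in term_freq.items()
        })
    return weights

# Toy documents (hypothetical, for illustration only)
docs = [
    "ionizing radiation dose limits for exposed workers".split(),
    "the council regulation on radiation protection".split(),
    "the council regulation on state aid".split(),
]
w = tfidf_weights(docs)
# "dose" occurs in one document only, so it outweighs "radiation",
# which occurs in two of the three documents.
print(sorted(w[0].items(), key=lambda kv: -kv[1])[:3])
```

A term that occurs in every document gets weight 0, which is why collection-wide items such as "Article" drop out without needing a stop list.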

Examples of unigrams extracted through rel. freq.

Words not found in the reference corpus

1  sievert
1  gray
1  Sv
1  wT
1  radon
1  becquerel
1  wTHT
1  DT
1  EDA
1  aboveground
1  APPRENTICES
1  Thermonuclear
1  wR
1  dN
1  ankles
1  mSv
1  after-effects
1  DOSE
1  forearms
1  avertable
1  ITER
1  Committed
1  cosmic
1  HT
1  Bq
1  dt

Words with high rel. freq.

0.54  Radiological
0.49  concerned
0.21  Board
0.11  aid
0.10  Potential
0.08  reasonably
0.08  Reconstruction
0.08  give
0.08  extend
0.07  alia
0.04  CHAPTER
0.01  qualified
0.01  measurement
0.01  Nuclear
0.01  materials
0.01  steps
0.01  energy
0.01  declared
0.01  relevant
0.01  contaminating
0.01  Design
0.01  developments
0.01  contribute
0.01  procedure
0.01  reduce
0.01  costs
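
The slides do not state the exact formula behind the relative-frequency scores. The sketch below assumes one plausible variant, 1 minus the ratio of the word's relative frequency in the reference corpus to its relative frequency in the subcorpus, chosen only because it gives a score of 1 to words absent from the reference corpus. The function name, the clipping at 0 and the toy token lists are all assumptions.

```python
from collections import Counter

def relative_frequency_keyness(sub_tokens, ref_tokens):
    """Score each subcorpus word against a reference corpus.

    ASSUMED formula (not given in the slides):
        score = max(0, 1 - rel_freq_reference / rel_freq_subcorpus)
    Words that never occur in the reference corpus score exactly 1.
    """
    sub_counts = Counter(sub_tokens)
    ref_counts = Counter(ref_tokens)
    sub_total = len(sub_tokens)
    ref_total = len(ref_tokens)

    scores = {}
    for word, count in sub_counts.items():
        sub_rel = count / sub_total
        ref_rel = ref_counts[word] / ref_total  # 0 if the word is unseen
        scores[word] = max(0.0, 1.0 - ref_rel / sub_rel)
    return scores

# Toy data (hypothetical)
subcorpus = ["radon", "dose", "the", "the"]
reference = ["the"] * 6 + ["dose"] + ["council"] * 3
scores = relative_frequency_keyness(subcorpus, reference)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.2f}  {word}")
```

Under this assumption, general-language words that are at least as frequent in the reference corpus fall to 0, while domain words unseen in the reference corpus rise to the top of the list.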
Tf.idf

radiological  0.67
exposures     0.25
JRC           0.20
lens          0.19
radiation     0.17
apprentices   0.14
ionizing      0.14
serviceable   0.13
dose          0.13
nuclear       0.12
doses         0.12
workplaces    0.11
EXPOSURE      0.11
radioactive   0.10
joule         0.10
Resolutions   0.10
Governors     0.10
Dose          0.08
students      0.08
Chernobyl     0.08
Cabinet       0.076
Nuclear       0.067
exposure      0.067
non-Member    0.059
gender        0.056
workers       0.052
Reactor       0.050
Euratom       0.049
proceeds      0.047
disregarded   0.043
Exchanges     0.042
Optimization  0.042
PRACTICES     0.042
dosimetric    0.042
exposed       0.037
population    0.036
contaminating 0.033
Tf.idf - Slovene

sevanju           0.19082
radiološkega      0.17864
dozimetrijo       0.17052
sivert            0.13804
radionuklidov     0.13804
sevanja           0.13195
Dana              0.12992
Černobil          0.12180
Izpostavljenost   0.12180
Jedrska           0.11368
dozo              0.09473
prebivalstva      0.09256
sevanjem          0.08932
ITER              0.08120
Oddelkom          0.07308
inovativnosti     0.07308
študente          0.07308
izpostavljenosti  0.07308
radioaktivne      0.06766
SRS               0.06766
doza              0.06496
posameznike       0.06090
pooblaščenimi     0.05684
cepitve           0.05684
nivoji            0.05684
efektivno         0.05684
medicine          0.05278
fuzije            0.05075
zaposlitvijo      0.04872
termonuklearni    0.04872
študentov         0.04872
guvernerjev       0.04872
prioritete        0.04872
reaktorja         0.04872
jedrske           0.04872
delodajalca       0.04669
izpostavljenih    0.04601
ionizirajočemu    0.04466
ekvivalentno      0.04263
dosegljive        0.04060
ionizirajočega    0.04060
jedrskem          0.04060
nuklearnih        0.04060
kontrolirana      0.04060
radiološki        0.04060
Other indicators of termhood
  • Acronyms (NPP, SG, RBB ...)
  • Unknown words
    • not found in the reference corpus
    • unknown to the lemmatizer
  • Cognates & Named entities

radioactive ### radioaktivna 1.0
radioactive ### Radioaktivna 1.0
Radioactive ### Radioaktivna 1.0
radioactive ### radioaktivne 1.0
radioactive ### radioaktivnih 1.0
radioactive ### radioaktivnimi 1.0
radioactive ### radioaktivno 1.0
radioactive ### radioaktivnosti 1.0
radioactive ### radiokativnega 1.0
radiography ### radiografijo 1.0
radionuclide ### radionuklid 1.0
radionuclide ### radionuklida 1.0
radionuclide ### radionuklidov 1.0
radionuclides ### radionuklide 1.0
radionuclides ### radionuklidov 1.0
ratify ### ratificirajo 1.0
Reactor ### reaktorja 1.0
reactor ### reaktorjev 1.0
reactors ### reaktorji 1.0
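
The slides do not say how cognate pairs were detected. As a rough illustration only, the sketch below assumes a character-similarity test on lowercased word forms using Python's standard `difflib.SequenceMatcher`; the function names and the 0.7 threshold are arbitrary assumptions, not the method used to produce the pairs listed here.

```python
from difflib import SequenceMatcher

def cognate_score(source_word, target_word):
    """Character similarity in [0, 1] between two lowercased word forms."""
    return SequenceMatcher(None, source_word.lower(), target_word.lower()).ratio()

def is_cognate(source_word, target_word, threshold=0.7):
    """ASSUMPTION: treat a pair as cognate above an arbitrary threshold."""
    return cognate_score(source_word, target_word) >= threshold

# English-Slovene pairs taken from the lists in this presentation
pairs = [
    ("radioactive", "Radioaktivna"),
    ("reactor", "reaktorja"),
    ("workers", "delodajalca"),
]
for en, sl in pairs:
    print(f"{en} ### {sl} {cognate_score(en, sl):.2f} cognate={is_cognate(en, sl)}")
```

A plain similarity ratio like this one handles inflected Slovene forms only up to a point, so a real system would probably need language-specific normalisation (for example c/k alternation) on top of it.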
Identifying multi-word units
  • Collocation extraction techniques
    • Mutual Information (Church & Hanks 1990)
    • Log-likelihood ratio (Dunning 1993)
    • Entropy-based (Shimohata et al. 1997)
    • Semantic non-compositionality (Pearce 2001)
  • Daille (1994): LL is the most appropriate measure
  • for n > 3: n-gram frequency (+ stopword filtering) also works
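
As an illustration of the log-likelihood ratio for 2-grams, here is a minimal sketch using the usual 2x2 contingency-table form, G2 = 2 Σ O ln(O / E). The function names and the toy sentence are assumptions made for the example; nothing here reproduces the scores on the following slides.

```python
import math
from collections import Counter

def log_likelihood(c12, c1, c2, n):
    """G2 statistic for a bigram (w1, w2).

    c12: count of the bigram itself
    c1:  count of bigrams whose first word is w1
    c2:  count of bigrams whose second word is w2
    n:   total number of bigrams
    """
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    expected = [
        c1 * c2 / n,
        c1 * (n - c2) / n,
        (n - c1) * c2 / n,
        (n - c1) * (n - c2) / n,
    ]
    # Cells with an observed count of 0 contribute nothing to the sum
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

def score_bigrams(tokens):
    """Return {bigram: G2} for all adjacent word pairs in a token list."""
    bigrams = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    second_counts = Counter(w2 for _, w2 in bigrams)
    n = len(bigrams)
    return {
        pair: log_likelihood(count, first_counts[pair[0]], second_counts[pair[1]], n)
        for pair, count in pair_counts.items()
    }

# Toy sentence (hypothetical)
tokens = ("exposure to ionizing radiation and protection against "
          "ionizing radiation in the workplace").split()
ll_scores = score_bigrams(tokens)
best = max(ll_scores, key=ll_scores.get)
print(best, round(ll_scores[best], 2))
```

Note that G2 measures the strength of association, not its direction, so in practice it is combined with a frequency threshold or stopword filter.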
N-gram term weighting
  • statistically extracted n-grams are not necessarily terms → need for filtering / weighting
  • Stopword filtering
  • Weighting with tf.idf, ll-rank/core frequency:

$$w(t_{w_1, w_2, w_3}) = \mathrm{tf.idf}_{w_1}\,\mathrm{tf.idf}_{w_2}\,\mathrm{tf.idf}_{w_3} / n \cdot 1/\mathit{rank}$$
2-grams, weighted with rel. freq.

Thermonuclear Experimental   1.91766291545192
International Thermonuclear  1.90047962704222
wR values                    1.74111305022281
cosmic radiation             1.68720469442766
non-Member States            1.67427461796584
Atomic Energy                0.996377043841846
European Atomic              0.995366262170687
Energy Community             0.995029334946967
Member States                0.994692407723247
Member State                 0.994355480499528
exposed workers              0.990312353814892
radiation protection         0.988290790472574
ionizing radiation           0.985847228548466
nuclear power                0.975824483194946
Nuclear Safety               0.97077057483915

3-grams

Thermonuclear Experimental Reactor        2.83532583090384
International Thermonuclear Experimental  2.81814254249414
mSv per year                              2.73507410483208
APPRENTICES AND STUDENTS                  2.69461709334949
exceed 1 mSv                              2.46078960008804
feet and ankles                           2.2734580636999
European Atomic Energy                    1.99321785686789
Atomic Energy Community                   1.99288092964417
DECIDED AS FOLLOWS                        1.95055494141597
nuclear power stations                    1.94785693049428
Nuclear Safety Account                    1.94301570053366
controlled nuclear fusion                 1.88877041751479
Energy Community represented              1.87461947411856
natural radiation sources                 1.87453104455193
nuclear power station                     1.87309490609042
apprentices and students                  1.86800777160465
Chernobyl nuclear power                   1.86257180721416
establishing the European                 1.85670559767151

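Treatment of nested terms

C-value (Frantzi & Ananiadou 1996)

$$\text{C-value}(a) = \left(\mathrm{length}(a) - 1\right)\left(\mathrm{freq}(a) - \frac{t(a)}{c(a)}\right)$$

where t(a) is the frequency of a as part of longer candidate terms and c(a) is the number of those longer candidate terms.

n-gram                         C-value
Chernobyl nuclear              7.3
nuclear power plant            15.2
Chernobyl nuclear power plant  20.4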
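
The C-value formula can be tried out on a handful of candidates. The sketch below is a minimal reading of the formula on this slide; the function name and the frequencies are invented for illustration and do not reproduce the C-values in the table.

```python
def c_value(candidates):
    """Compute C-values for candidate terms.

    candidates: dict mapping a tuple of words to its corpus frequency.
    C-value(a) = (length(a) - 1) * (freq(a) - t(a) / c(a)), where
    t(a) is the summed frequency of the longer candidates containing a
    and c(a) is the number of such longer candidates.
    Candidates that are not nested keep their raw frequency.
    """
    def contains(longer, shorter):
        n = len(shorter)
        return len(longer) > n and any(
            longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
        )

    scores = {}
    for term, freq in candidates.items():
        longer_terms = [other for other in candidates if contains(other, term)]
        if longer_terms:
            t = sum(candidates[other] for other in longer_terms)
            c = len(longer_terms)
            adjusted = freq - t / c
        else:
            adjusted = freq
        scores[term] = (len(term) - 1) * adjusted
    return scores

# Hypothetical frequencies, for illustration only
candidates = {
    ("power", "plant"): 12,
    ("nuclear", "power", "plant"): 10,
    ("Chernobyl", "nuclear", "power", "plant"): 4,
}
cv = c_value(candidates)
for term, score in sorted(cv.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), score)
```

The nested candidate "power plant" is penalised because most of its occurrences are explained by the longer terms that contain it.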

Bilingual lexicon extraction
  • using Twente (Hiemstra 1998)
    • based on the Iterative Proportional Fitting Procedure (IPFP), word-to-word translation model
    • outputs translation candidates + scores for each word in the corpus; both ways
    • using stopword-filtered corpora to improve results
  • bilingual lexicon expanded with cognates
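
For readers unfamiliar with the Iterative Proportional Fitting Procedure, the sketch below shows the generic procedure on a small matrix: rows and columns are rescaled in turn until they match the target marginals. It is not the Twente translation model itself; the function name, the seed matrix and the targets are assumptions chosen for illustration.

```python
def ipf(seed, row_targets, col_targets, iterations=50):
    """Generic iterative proportional fitting on a 2-D list of floats.

    Alternately rescales rows and columns of a copy of `seed` so that
    their sums approach `row_targets` and `col_targets`.
    """
    fitted = [list(row) for row in seed]
    for _ in range(iterations):
        # Row step: make every row sum equal to its target
        for i, row in enumerate(fitted):
            total = sum(row)
            if total > 0:
                factor = row_targets[i] / total
                fitted[i] = [value * factor for value in row]
        # Column step: make every column sum equal to its target
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in fitted)
            if total > 0:
                factor = target / total
                for row in fitted:
                    row[j] *= factor
    return fitted

# Hypothetical co-occurrence counts and marginals
seed = [[1.0, 2.0], [3.0, 4.0]]
fitted = ipf(seed, row_targets=[4.0, 6.0], col_targets=[5.0, 5.0])
for row in fitted:
    print([round(value, 3) for value in row])
```

The row and column targets must add up to the same total for the procedure to converge.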
Term alignment
  • for each source term candidate we collect all single-word equivalents from the bilingual lexicon
  • example: jedrska elektrarna Černobil (Chernobyl nuclear power plant)

Single-word equivalents and their lexicon scores:

power      0.50
plant      0.50
Chernobyl  1.00
nuclear    1.00

Target term candidates, each scored by the sum of the lexicon scores of its words:

Nuclear power plant            2.00
Power plant                    1.00
Chernobyl nuclear power plant  3.00
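
The alignment step for this example can be sketched as follows. The scoring rule (sum of the lexicon scores of the words in each target candidate) is inferred from the numbers on the slide; the function name, the lexicon layout and the case-insensitive matching are assumptions made for this illustration.

```python
def align_term(source_term, lexicon, target_candidates):
    """Rank target term candidates for one source term candidate.

    lexicon: {source_word: {target_word: score}}
    Returns (candidate, score) pairs sorted from best to worst.
    """
    # Step 1: collect all single-word equivalents of the source words
    equivalents = {}
    for word in source_term.split():
        for target_word, score in lexicon.get(word, {}).items():
            key = target_word.lower()
            equivalents[key] = max(score, equivalents.get(key, 0.0))

    # Step 2: score each target candidate by summing its word scores
    scored = [
        (candidate, sum(equivalents.get(w.lower(), 0.0) for w in candidate.split()))
        for candidate in target_candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Lexicon entries and candidates taken from the example on this slide
lexicon = {
    "jedrska": {"nuclear": 1.0},
    "elektrarna": {"power": 0.5, "plant": 0.5},
    "Černobil": {"Chernobyl": 1.0},
}
targets = ["Nuclear power plant", "Power plant", "Chernobyl nuclear power plant"]
ranked = align_term("jedrska elektrarna Černobil", lexicon, targets)
for candidate, score in ranked:
    print(f"{candidate} {score:.2f}")
```

The highest-scoring candidate, "Chernobyl nuclear power plant", covers every word of the source term, which is why it wins over its nested sub-terms.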

Outcome
  • Corpus: 17,000 tokens
  • Extracted 193 Slovene and 199 English term candidates
  • Bilingual (aligned): 112
  • What we miss:
    • Term variation:
      • disposal of waste / emplacement of waste
      • safety levels / levels of safety
    • Hapax:
      • radiation weighting factor, tissue weighting factor
Purpose of term extraction
  • Extraction vs. annotation
Problems
  • Distinguishing between generic and text-specific terms (same form, same frequency!)
  • Capturing low frequency terms in inflected languages
  • We want to capture domain-specific terms. But most texts are multi-domain!