Margit Bowler

Data Collection and Normalization for the Scenario-Based Lexical Knowledge Resource of a Text-to-Scene Conversion System Margit Bowler

Who I Am • Rising senior at Reed College in Portland • Linguistics major, concentration in Russian

Overview • WordsEye & Scenario-Based Lexical Knowledge Resource (SBLR) • Use of Amazon’s Mechanical Turk (AMT) for data collection • Manual normalization of the AMT data and definition of semantic relations • Automatic normalization techniques of AMT data with respect to building the SBLR • Future automatic normalization techniques

WordsEye Text-to-Scene Conversion the humongous white shiny bear is on the american mountain range. the mountain range is 100 feet tall. the ground is water. the sky is partly cloudy. the airplane is 90 feet in front of the nose of the bear. the airplane is facing right.

Scenario-Based Lexical Knowledge Resource (SBLR) • Information on semantic categories of words • Semantic relations between predicates (verbs, nouns, adjectives, prepositions) and their arguments • Contextual, common-sense knowledge about the visual scenes various actions and items occur in

How to build the SBLR… efficiently? • Manual construction of the SBLR is time-consuming and expensive • Past methods have included mining information from external semantic resources (e.g. WordNet, FrameNet, PropBank) & information extraction techniques from other corpora

Amazon’s Mechanical Turk (AMT) • Online marketplace for work • Anyone can work on AMT; however, workers can be screened by various criteria • We screened ours by: • Location in the USA • Approval rating of 99% or higher

AMT Tasks • In each task, we asked for up to 10 responses. A comment box was provided for any responses beyond the tenth. • Task 1: Given the object X, name 10 locations where you would find X. (Locations) • Task 2: Given the object X, name 10 objects found near X. (Nearby Objects) • Task 3: Given the object X, list 10 parts of X. (Part-Whole)

AMT Task Results • 17,200 total responses • Spent $106.90 for all three tasks • It took approximately 5 days to complete each task

Goal: How can data collected from AMT be automatically normalized so that it is useful for building the Scenario-Based Lexical Knowledge Resource (SBLR)?

Manual Normalization of AMT Data • Removal of uninformative target item-response item pairs between which no relevant semantic relationship held • Definition of the semantic relations holding between the remaining target item-response item pairs • This manually normalized data set was used as the standard against which we measured the various automatic normalization techniques.

Rejected Target-Response Pairs • Misinterpretation of an ambiguous target item (e.g. mobile) • Viable interpretation of the target item was not contained within the SBLR (e.g. crawfish as food rather than a living animal) • Overly generic responses (e.g. store in response to turntable)

Examples of Approved AMT Responses • Locations: mural - gallery lizard - desert • Nearby Objects: ambulance - stretcher cauldron - fire • Part-Whole: scissors - blade monument - granite

Semantic Relations • Defined a total of 34 relations • Focused on defining concrete, graphically depictable relationships • “Generic” relations accounted for most of the labeled pairs (e.g. containing.r, next-to.r) • Finer distinctions were made within these generic semantic relations (e.g. habitat.r, residence.r within the overarching containing.r relation)

Example Semantic Relations • Locations: mural - gallery - containing.r lizard - desert - habitat.r • Nearby Objects: ambulance - stretcher - next-to.r cauldron - fire - above.r • Part-Whole: scissors - blade - object-part.r monument - granite - stuff-object.r
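As a minimal sketch of how the labeled pairs above could be stored, each one can be treated as a (target, response, relation) triple, with the finer relations linked to their overarching relation. The names `LabeledPair`, `PARENT`, `generic_relation`, and `count_by_generic` are illustrative assumptions, not part of the SBLR. Only the habitat.r and residence.r links are stated on the slides, and every other relation is simply treated as top-level here.

```python
from collections import Counter
from typing import NamedTuple


class LabeledPair(NamedTuple):
    """One manually labeled target-response pair (hypothetical structure)."""
    target: str
    response: str
    relation: str


# Parent links for the finer-grained relations named on the slides.
# Relations absent from this map are treated as top-level in this sketch.
PARENT = {"habitat.r": "containing.r", "residence.r": "containing.r"}


def generic_relation(relation: str) -> str:
    """Walk up the PARENT links until a top-level relation is reached."""
    while relation in PARENT:
        relation = PARENT[relation]
    return relation


# The six example pairs from the slide above.
PAIRS = [
    LabeledPair("mural", "gallery", "containing.r"),
    LabeledPair("lizard", "desert", "habitat.r"),
    LabeledPair("ambulance", "stretcher", "next-to.r"),
    LabeledPair("cauldron", "fire", "above.r"),
    LabeledPair("scissors", "blade", "object-part.r"),
    LabeledPair("monument", "granite", "stuff-object.r"),
]


def count_by_generic(pairs):
    """Count pairs per overarching relation."""
    return Counter(generic_relation(p.relation) for p in pairs)
```

Calling `count_by_generic(PAIRS)` groups lizard - desert with mural - gallery under containing.r, which mirrors the way the finer distinctions sit inside the generic relations.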

Semantic Relations within Locations Task • We collected 6850 locations for 342 target objects from our 3D library.

Semantic Relations within Nearby Objects Task • We collected 6850 nearby objects for 342 target objects from our 3D library.

Semantic Relations within Part-Whole Task • We collected 3500 parts of 245 objects.

Automatic Normalization Techniques • Collected AMT data was classified into higher-scoring versus lower-scoring sets by: • Log-likelihood and log-odds of sentential co-occurrences in the Gigaword English corpus • WordNet path similarity • Resnik similarity • WordNet average pair-wise similarity • WordNet matrix similarity • Accuracy evaluated by comparison against manually normalized data
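To make the co-occurrence measures concrete, here is a rough sketch of scoring a target-response pair from sentence-level counts. The slides do not give the exact formulas, so this assumes one common formulation: a smoothed log-odds ratio and a Dunning-style log-likelihood ratio over a 2x2 contingency table. The function names, the 0.5 smoothing constant, and the toy sentences are all assumptions for illustration; the real counts came from the Gigaword English corpus.

```python
import math


def cooccurrence_table(sentences, w1, w2):
    """Return sentence counts (both, only_w1, only_w2, neither)."""
    both = only1 = only2 = neither = 0
    for sent in sentences:
        tokens = set(sent.lower().split())
        has1, has2 = w1 in tokens, w2 in tokens
        if has1 and has2:
            both += 1
        elif has1:
            only1 += 1
        elif has2:
            only2 += 1
        else:
            neither += 1
    return both, only1, only2, neither


def log_odds_ratio(both, only1, only2, neither, smooth=0.5):
    """Log odds ratio of the 2x2 table; `smooth` avoids division by zero."""
    a, b, c, d = (x + smooth for x in (both, only1, only2, neither))
    return math.log((a * d) / (b * c))


def log_likelihood_ratio(both, only1, only2, neither):
    """G^2 = 2 * sum(O * ln(O / E)) over the cells of the 2x2 table."""
    n = both + only1 + only2 + neither
    row = (both + only1, only2 + neither)   # w1 present / absent
    col = (both + only2, only1 + neither)   # w2 present / absent
    observed = ((both, only1), (only2, neither))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = row[i] * col[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Invented toy sentences standing in for a real corpus.
SENTENCES = [
    "the lizard basked in the desert sun",
    "a lizard crossed the desert road",
    "the desert was silent",
    "the turntable sat in the store window",
    "the store sold bread",
    "the store closed early",
    "she walked to the store",
    "a lizard hid under a rock",
]

lizard_desert = log_odds_ratio(*cooccurrence_table(SENTENCES, "lizard", "desert"))
turntable_store = log_odds_ratio(*cooccurrence_table(SENTENCES, "turntable", "store"))
```

On this toy data, lizard - desert scores higher than turntable - store, because "store" also appears in many sentences without "turntable". Pairs would then be split into higher-scoring and lower-scoring sets by a threshold on such a score.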

Precision & Recall • AMT data is quite cheap to collect, so we were concerned predominantly with precision (obtaining highly accurate data) rather than recall (avoiding loss of some data). • To achieve more accurate data (higher precision), we accept losing a portion of our AMT data (lower recall).
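The trade-off can be sketched as follows: keep only the pairs whose score clears a threshold, then compare the kept set against the manually approved pairs. The function name, the scores, and the thresholds are made-up assumptions for illustration; only the pairs themselves and their approved or rejected status come from the slides.

```python
def precision_recall(scored_pairs, gold_approved, threshold):
    """Precision and recall of the pairs scoring at or above `threshold`."""
    kept = {pair for pair, score in scored_pairs.items() if score >= threshold}
    true_pos = len(kept & gold_approved)
    precision = true_pos / len(kept) if kept else 0.0
    recall = true_pos / len(gold_approved) if gold_approved else 0.0
    return precision, recall


# Hypothetical scores; higher means stronger association.
SCORES = {
    ("lizard", "desert"): 2.1,
    ("mural", "gallery"): 1.7,
    ("turntable", "store"): 1.2,                  # too generic, manually rejected
    ("caliper", "architect's briefcase"): -0.4,   # rare but manually approved
}

# Pairs approved during manual normalization.
GOLD = {
    ("lizard", "desert"),
    ("mural", "gallery"),
    ("caliper", "architect's briefcase"),
}

strict = precision_recall(SCORES, GOLD, threshold=1.5)
lenient = precision_recall(SCORES, GOLD, threshold=-1.0)
```

With the strict threshold every kept pair is correct, but the caliper pair is lost; with the lenient threshold nothing is lost, but the overly generic turntable - store pair is kept. Since AMT responses are cheap to collect, the strict setting is the one favored here.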

Locations Task • Achieved best precision with log-odds. • Within high-scoring set, responses that were too general (e.g. turntable - store) were rejected. • Within low-scoring set, extremely specific locations that were unlikely to occur within a corpus or WordNet’s synsets were approved (e.g. caliper - architect’s briefcase)

Nearby Objects Task • Relatively few target-response pairs were discarded, resulting in high recall. • High precision due to the open-ended nature of the task; responses often fell under some relation, even if not next-to.r.

Part-Whole Task • Rejected target-response pairs from the high-scoring set were often due to responses that named attributes, rather than parts, of the target item (e.g. croissant - flaky) • Approved pairs from the low-scoring set were mainly due to obvious, “common sense” responses that would usually be inferred, not explicitly stated (e.g. bunny - brain)

Future Automatic Normalization Techniques • Computing word association measures on much larger corpora (e.g. Google’s 1-trillion-word corpus) • WordNet synonyms and hypernyms • Latent Semantic Analysis to build word similarity matrices

In Summary… • WordsEye & Scenario-Based Lexical Knowledge Resource (SBLR) • Amazon’s Mechanical Turk & our tasks • Manual normalization of AMT data • Automatic normalization techniques used on AMT data and results • Possible future automatic normalization methods

Thanks to… • Richard Sproat • Masoud Rouhizadeh • All the CSLU interns

Questions?
