Constructing Folksonomies from User-Specified Relations on Flickr

Anon Plangprasopchok and Kristina Lerman

Motivation [Diagram: users produce, consume, discover, annotate, and organize Web content; the resulting annotation/metadata can yield a hierarchical classification, which helps to organize, search, recommend, leverage, and categorize.]

Motivation • Metadata from an individual user may be too inaccurate and incomplete… • The metadata from different users may complement each other, making it, in combination, meaningful. • Goal: to induce category knowledge from social annotation produced by many users

Folksonomy • Original definition: classification emerging from the use of tags by users (Thomas Vander Wal) • In this work: hidden classification hierarchies from annotation created by many users

Hierarchical Relations in Social Web • Appear implicitly in tags, e.g., Insect, Grasshopper, Australian, Macro, Orthoptera • Appear explicitly in relations: folder (collection) -> sub-folder (set) • Goal: to induce deeper hierarchies from this metadata

Outline • Motivation • Approaches • Results • Discussion • Related work

Inducing Hierarchy from Tags • Existing approaches • Graph based (Mika05) • build a network of associated tags (node = tag, edge = co-occurrence of tags) • suggest applying betweenness centrality and set theory to determine broader/narrower relations • Hierarchical clustering (Brooks06; Heymann06+) • tags that appear more frequently have higher centrality and are thus treated as more abstract • Probabilistic subsumption (Sanderson99+, Schmitz06) • x is broader than y if x subsumes y • x subsumes y if p(x|y) > t and p(y|x) < t
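
A minimal sketch of the probabilistic subsumption rule above, assuming p(x|y) is estimated as the fraction of items tagged y that are also tagged x. The function name, the toy photo list, and the threshold t = 0.8 are illustrative assumptions, not values from the slides.

```python
from collections import Counter
from itertools import permutations

def subsumption_relations(items, t=0.8):
    """Return (x, y) pairs where x subsumes y: p(x|y) > t and p(y|x) < t."""
    tag_count = Counter()   # number of items carrying each tag
    pair_count = Counter()  # number of items carrying both tags of an ordered pair
    for tags in items:
        tags = set(tags)
        tag_count.update(tags)
        pair_count.update(permutations(tags, 2))
    relations = []
    for (x, y), n_xy in pair_count.items():
        p_x_given_y = n_xy / tag_count[y]
        p_y_given_x = n_xy / tag_count[x]
        if p_x_given_y > t and p_y_given_x < t:
            relations.append((x, y))  # x is broader than y
    return sorted(relations)

# Hypothetical tag sets for five photos.
photos = [
    {"insect", "grasshopper"},
    {"insect", "grasshopper"},
    {"insect", "moth"},
    {"insect", "moth", "macro"},
    {"insect"},
]
relations = subsumption_relations(photos)
# [('insect', 'grasshopper'), ('insect', 'macro'), ('insect', 'moth'), ('moth', 'macro')]
```

On this toy data the rule also yields moth -> macro, a relation between tags from different facets, which arises purely from co-occurrence counts.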

Inducing Hierarchy from Tags • Some difficulties when using tags to induce hierarchy • Notation: A -> B (A is broader than B) (hypernym relation) • Washington -> United States • Car -> Automobile • Specificity • Rarity • Insect -> Hongkong • Color -> Brazilian • Tags are from different facets* • Above relations induced using subsumption approach on tags [Sanderson99+, Schmitz06]

Inducing Hierarchy from user-specified relations • User-specified relations, e.g., • Flickr’s Collection-Set, • Delicious’ Bundle-Tag, • Bibsonomy’s Relation-Tag • Key intuition: Not so many people specify peculiar relations like • “automobile” -> “car”, or • “Washington” -> “United States”

Simple Strategy • Input: collections and their sets, e.g., collection “The Netherlands - Holanda” with set “Blijdorp - Rotterdam” • Tokenize + stem the names to obtain concept relations between collection terms (e.g., netherland, holanda) and set terms (e.g., blijdorp, rotterdam); other terms shown in the figure: countri, holland, china …… • 1. Remove “noisy” relations • Conflict resolution • Significance test • 2. Link concepts & select path …
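
A rough sketch of the tokenize + stem step, assuming every term of a collection name is paired with every term of a set name. Stemmed forms on the slides such as "countri" and "anim" suggest a Porter-style stemmer, but the slides do not name one, so `toy_stem` below is only a crude stand-in; the stopword list and all function names are likewise assumptions.

```python
import re
from itertools import product

STOPWORDS = {"the", "a", "an", "of", "and", "in"}  # assumed list, not from the slides

def toy_stem(token):
    """Crude stand-in for a real stemmer: strip a single plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def terms(name):
    """Lowercase, split into runs of letters, drop stopwords, stem."""
    tokens = re.findall(r"[^\W\d_]+", name.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

def concept_relations(collection_name, set_name):
    """Pair each collection term (broader) with each set term (narrower)."""
    pairs = product(terms(collection_name), terms(set_name))
    return [(c, s) for c, s in pairs if c != s]

rels = concept_relations("The Netherlands - Holanda", "Blijdorp - Rotterdam")
# [('netherland', 'blijdorp'), ('netherland', 'rotterdam'),
#  ('holanda', 'blijdorp'), ('holanda', 'rotterdam')]
```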

Remove noisy relations: 1st approach • Conflict resolution (when both a->b and b->a appear) • Relation conflicts occur because of noise • Voting scheme: keep a->b (and discard b->a) if Nu(a->b) > 1 and Nu(a->b) > Nu(b->a) • Example (figure): the two directions between insect and butterfly, with counts 2 and 10
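
The voting scheme can be sketched as follows, reading Nu(a->b) as the number of users who specified a->b (an assumption about the notation). Which direction of the insect/butterfly example carries 10 and which carries 2 is also assumed here, and relations without a reverse counterpart are simply passed through, since the slide only describes the conflicting case.

```python
def resolve_conflicts(user_counts):
    """user_counts maps (a, b) -> number of users who specified a -> b."""
    kept = {}
    for (a, b), n_ab in user_counts.items():
        if (b, a) not in user_counts:
            kept[(a, b)] = n_ab  # no conflict: leave the relation untouched
            continue
        n_ba = user_counts[(b, a)]
        # Keep a -> b only if it has more than one vote and outvotes b -> a.
        if n_ab > 1 and n_ab > n_ba:
            kept[(a, b)] = n_ab
    return kept

counts = {
    ("insect", "butterfly"): 10,  # assumed direction for the count of 10
    ("butterfly", "insect"): 2,
    ("anim", "moth"): 1,          # hypothetical non-conflicting relation
}
kept = resolve_conflicts(counts)
# {('insect', 'butterfly'): 10, ('anim', 'moth'): 1}
```

With this rule a tie between the two directions discards both of them.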

Remove noisy relations: 2nd approach • Significance test • Use a statistical significance test to decide if a->b is significant • Null hypothesis: the observed relation a->b was generated by chance, via the random, independent generation of the individual concepts a and b (according to the binomial distribution) • Is “b” narrower than “a” by chance? [Figure: distribution of the # of a->b over the # of observations, with accept and reject regions]
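
One way such a test could look is a one-sided binomial test. The slides do not give the exact formulation, so the following are assumptions: the chance probability of a->b is the product of the relative frequency of a as a broader term and of b as a narrower term, each observed relation is one trial, and the significance level is 0.05.

```python
import math

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def is_significant(n_ab, n_a, n_b, n_total, alpha=0.05):
    """Reject the 'generated by chance' null if a -> b occurs too often.

    n_ab: observed count of a -> b; n_a: relations with a as the broader term;
    n_b: relations with b as the narrower term; n_total: all observed relations.
    """
    p_chance = (n_a / n_total) * (n_b / n_total)  # independence assumption
    return binomial_upper_tail(n_ab, n_total, p_chance) < alpha

# Hypothetical counts: about 2 occurrences are expected by chance.
frequent = is_significant(n_ab=30, n_a=50, n_b=40, n_total=1000)  # True
rare = is_significant(n_ab=2, n_a=50, n_b=40, n_total=1000)       # False
```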

Link concepts and select path • Link concepts: assume that the same terms refer to the same concept, e.g., anim -> bug + anim -> insect are joined at anim • Select path: linking relations from many users can produce a “spaghetti” graph • 4 possible paths from anim -> moth (a = anim, b = bug, i = insect, m = moth): • a->b->i->m • a->i->m • a->m • a->b->m • Network bottleneck idea: “the flow bottleneck is a minimum flow capacity among all relations in the path” • Edge weights (as implied by the scores below): a->b = 26, a->i = 72, b->i = 1, a->m = 10, i->m = 18, b->m = 4 • a->b->i->m [BN score = min(26,1,18) = 1] • a->i->m [BN score = min(72,18) = 18] • a->m [BN score = min(10) = 10] • a->b->m [BN score = min(26,4) = 4]
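
The bottleneck scores can be reproduced with a short sketch. The edge weights are read off the BN scores listed on the slide; choosing the path with the highest bottleneck score is an assumption about how the path is selected, and the exhaustive path enumeration is only meant for small graphs like this one.

```python
def all_paths(graph, source, target, path=None):
    """Yield every simple path from source to target (depth-first)."""
    path = (path or []) + [source]
    if source == target:
        yield path
        return
    for nxt in graph.get(source, {}):
        if nxt not in path:  # avoid revisiting a concept
            yield from all_paths(graph, nxt, target, path)

def bottleneck(graph, path):
    """Smallest edge weight along the path."""
    return min(graph[a][b] for a, b in zip(path, path[1:]))

def best_path(graph, source, target):
    """Path with the largest bottleneck score (assumed selection rule)."""
    return max(all_paths(graph, source, target), key=lambda p: bottleneck(graph, p))

graph = {
    "anim": {"bug": 26, "insect": 72, "moth": 10},
    "bug": {"insect": 1, "moth": 4},
    "insect": {"moth": 18},
}
scores = {"->".join(p): bottleneck(graph, p) for p in all_paths(graph, "anim", "moth")}
chosen = best_path(graph, "anim", "moth")
# scores["anim->insect->moth"] == 18, and chosen == ['anim', 'insect', 'moth']
```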

Contribution #2: Learning Concept Hierarchies Evaluation & Data Set • Hypothesis: the approach that takes explicit relations into account can induce better hierarchies. • “Better” means more consistent with hand-built hierarchies (ODP ver. 10/08) • The baseline is the subsumption approach [Schmitz06]; collection and set terms are used instead of tags, making it comparable. • Data set: • Data from 17 user groups devoted to wildlife and naturalist photography • 21,792 of 39,922 users specify at least one collection • 110,543 unique terms (cf. 166,153 unique terms in ODP), 15,495 terms in common.

Evaluation methodology • ODP has many sub-hierarchies, so comparing all of them to the induced ones is impractical. • It is easier to compare after specifying a “root concept” and “leaf concepts”, i.e., a certain subtree to compare. • [Figure: relations (right after tokenizing) -> induce (remove noise + link) -> induced hierarchy, compared with the reference hierarchy (ODP)]

Metrics • Taxonomic Overlap [adapted from Maedche02+] • measures structural similarity between two trees • for each node, determines how many ancestor and descendant nodes overlap with those in the reference tree • Lexical Recall • measures how well an approach discovers concepts that exist in the reference hierarchy (coverage)
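
A simplified sketch of the two metrics on trees stored as child -> parent maps. The slides do not give the exact formulas, so this is only one plausible reading: Lexical Recall is the share of reference concepts that also occur in the induced tree, and Taxonomic Overlap is the Jaccard overlap of each shared node's ancestors and descendants, averaged over shared nodes. The example trees are hypothetical.

```python
def nodes(parent):
    """All concepts in a tree given as a child -> parent map."""
    return set(parent) | set(parent.values())

def ancestors(parent, node):
    out = set()
    while node in parent:
        node = parent[node]
        out.add(node)
    return out

def descendants(parent, node):
    out, frontier = set(), [node]
    while frontier:
        cur = frontier.pop()
        for child, par in parent.items():
            if par == cur and child not in out:
                out.add(child)
                frontier.append(child)
    return out

def lexical_recall(induced, reference):
    ref = nodes(reference)
    return len(nodes(induced) & ref) / len(ref)

def taxonomic_overlap(induced, reference):
    shared = nodes(induced) & nodes(reference)
    if not shared:
        return 0.0
    total = 0.0
    for n in shared:
        a = ancestors(induced, n) | descendants(induced, n)
        b = ancestors(reference, n) | descendants(reference, n)
        union = a | b
        total += len(a & b) / len(union) if union else 1.0
    return total / len(shared)

reference = {"mammal": "anim", "rodent": "mammal", "rat": "rodent"}
induced = {"rodent": "anim", "rat": "rodent"}  # misses the "mammal" level
lr = lexical_recall(induced, reference)     # 0.75
to = taxonomic_overlap(induced, reference)  # 2/3
```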

Quantitative Results

Quantitative Results • Manually selecting 32 root nodes • Taxonomic Overlap: • 27 of them are better than those by subsumption • 3 of them get zero score in both approaches • Lexical Recall: • 28 of them are better than those by subsumption • 2 of them get similar scores in both approaches • for the rest, subsumption only induces the root node • The proposed approach can induce deeper trees • The proposed approach can induce hierarchies more consistent with ODP in almost all cases.

Sport hierarchy

Invertebrate hierarchy

Country hierarchy

Discussion • Simple strategy to aggregate a large number of shallow relations specified by different users into a common, deeper hierarchy • Induced hierarchies are more consistent with ODP • Future work includes: • Term ambiguity • Relation types • Global path • Apply to other datasets

Related Work • Learning concept hierarchy from text data • Syntactic based [Hearst92, Caraballo99, Pasca04, Cimiano+05, Snow+06] • Word clustering [e.g. Segal+02, Blei+03] • Induce concept hierarchy from tags • Graph-based & clustering based [Mika05, Brooks+06, Heymann+06, Zhou07+] • Probabilistic subsumption [Schmitz06] • Ontology alignment [Udrea+07] • Exploit user-specified hierarchy • GiveALink [Markines06+]

Questions? • Is the metric used in evaluation meaningful? • How scalable is the system? • WordNet and ODP already exist. Why do we need this system? • How is this work related to ontology enrichment? • Is it ethical to collect users’ data? • ….?

Spare slides beyond here

Open Problems • Term ambiguity: the current approach assumes that the same term refers to the same concept, but this does not always hold (figure examples: “Victoria” under Canada and Australia; Lotus; person name) • It also has no explicit way to merge synonyms (figure example: Spain, España); there are also many acronyms & colloquial terms in Social Web • A possible solution: concept clustering

Open Problems • Inducing “related-to” relation • “Flora” and “Fauna”, “Pet” and “Family” • Prepositions or some connectors may give some clues, e.g., “flora & fauna” and “Pets – Family” • Tag distributions may also help • [Figure: Nature, Flora, and Fauna arranged in alternative structures]

Open Problems • True parent selection • Tokenizing collection/set names can cause another problem [Figure: Insect appears under “Flora & Fauna”, under Fauna, and under Flora] • A possible solution: conditional probability ratio

Conclusion • Propose statistical approaches for • inducing concepts; • inducing concept hierarchies, from social annotation • These approaches perform better than existing approaches • Ongoing work to improve the quality of the induced hierarchies includes: • Resolving term ambiguity • Inducing “related to” relations • Selecting the right parent • Evaluating on more data sets

Social Web (spare) [Diagram: a user consumes, produces, discovers, annotates, and organizes content] Adapted from The Social Web: an Information Revolution (courtesy of Kristina Lerman)

Social Web • 3 basic entities involved: (1) User (2) Content (3) Metadata • Users: • Produce • Consume • Annotate & organize content • Delicious: 5.3 million users; over 180 million unique URLs [blog.delicious.com, 2009] • Flickr: 2 billion photos [techcrunch.com, 2007]; 4000+ photos uploaded per minute (morning of 1/21/2009)

Motivation (spare) Social annotation is potentially a good source of evidence for inducing category knowledge, which is useful in many applications, e.g., • Organizing: arranging/visualizing users’ content (e.g., semantic directory) • Search/Discovery: especially for binary content like photos and videos, where social annotation functions as a semantic index • Recommendation: learning users’ tastes/interests • Leveraging knowledge bases: updating lexical systems and ontologies for semantic web applications • Categorization: understanding how new content fits with existing content

Motivation Although metadata from an individual user may be too inaccurate and incomplete, those from different users may complement each other, making them meaningful for the tasks. Goal: to induce category knowledge from social annotation produced by many users

Evaluation methodology • ODP has many sub-hierarchies, so comparing all of them to the induced ones is impractical. • It is easier to compare after specifying a “root concept” and “leaf concepts”, i.e., a certain subtree to compare.

[Figure: evaluation pipeline] • Data pre-processing: user-specified relations (Flickr Collection -> Set); find ODP root-leaf pairs that overlap with Flickr, from <root, leaf, odp path>, e.g., Animal/Mammal/Rodent/Rat, to <root, leaf>, e.g., <anim, rat>; outputs are the Flickr relations and the Flickr-ODP root-leaf overlaps • Hierarchy construction: Significance Test, Conflict Resolution, and Subsumption, followed by relation weighting & linking • Evaluation: compute Taxonomic Overlap, Lexical Recall

Why does subsumption not work so well? [Figure: ideal vs. reality for the terms Countri and China]

Africa hierarchy
