Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks

Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks Julián Urbano, Jorge Morato, Mónica Marrero and Diego Martín jurbano@inf.uc3m.es SIGIR CSE 2010 Geneva, Switzerland · July 23rd

Outline • Introduction • Motivation • Alternative Methodology • Crowdsourcing Preferences • Results • Conclusions and Future Work

Evaluation Experiments • Essential for Information Retrieval [Voorhees, 2002] • Traditionally followed the Cranfield paradigm • Relevance judgments are the most important part of test collections (and the most expensive) • In the music domain evaluation has not been taken too seriously until very recently • MIREX appeared in 2005 [Downie et al., 2010] • Additional problems with the construction and maintenance of test collections [Downie, 2004]

Music Similarity Tasks • Given a music piece (i.e. the query) return a ranked list of other pieces similar to it • Actual music contents, forget the metadata! • It comes in two flavors • Symbolic Melodic Similarity (SMS) • Audio Music Similarity (AMS) • It is inherently more complex to evaluate • Relevance judgments are very problematic

Relevance (Similarity) Judgments • Relevance is usually considered on a fixed scale • Relevant, not relevant, very relevant… • For music similarity tasks relevance is rather continuous [Selfridge-Field, 1998][Typke et al., 2005][Jones et al., 2007] • Single melodic changes are not perceived to change the overall melody • Move a note up or down in pitch, shorten it, etc. • But the similarity is weaker as more changes apply • Where is the line between relevance levels?

Partially Ordered Lists • The relevance of a document is implied by its position in a partially ordered list [Typke et al., 2005] • Does not need any predefined relevance scale • Ordered groups of equally relevant documents • Have to keep the order of the groups • Allow permutations within the same group • Assessors only need to be sure that any pair of documents is ordered properly

Partially Ordered Lists (II)

Partially Ordered Lists (and III) • Used in the first edition of MIREX in 2005 [Downie et al., 2005] • Widely accepted by the MIR community to report new developments [Urbano et al., 2010a][Pinto et al., 2008][Hanna et al., 2007][Grachten et al., 2006] • MIREX was forced to move to traditional level-based relevance since 2006 • Partially ordered lists are expensive • And have some inconsistencies

Expensiveness • The ground truth for just 11 queries took 35 music experts for 2 hours [Typke et al., 2005] • Only 11 of them had time to work on all 11 queries • This exceeds MIREX’s resources for a single task • MIREX had to move to level-based relevance • BROAD: Not Similar, Somewhat Similar, Very Similar • FINE: numerical, from 0 to 10 with one decimal digit • Problems with assessor consistency came up

Issues with Assessor Consistency • The line between levels is certainly unclear [Jones et al., 2007][Downie et al., 2010]

Original Methodology • Go back to partially ordered lists • Filter the collection • Have the experts rank the candidates • Arrange the candidates by rank • Aggregate candidates whose ranks are not significantly different (Mann-Whitney U) • There are known odd results and inconsistencies [Typke et al., 2005][Hanna et al., 2007][Urbano et al., 2010b] • Disregard changes that do not alter the actual perception, such as clef or key and time signature • Something like changing the language of a text and using synonyms [Urbano et al., 2010a]

Inconsistencies due to Ranking

Alternative Methodology • Minimize inconsistencies [Urbano et al., 2010b] • Cheapen the whole process • Reasonable Person hypothesis [Downie, 2004] • With crowdsourcing (finally) • Use Amazon Mechanical Turk • Get rid of experts [Alonso et al., 2008][Alonso et al., 2009] • Work with “reasonable turkers” • Explore other domains to apply crowdsourcing

Equally Relevant Documents • Experts were forced to give totally ordered lists • One would expect ranks to randomly average out • Half the experts prefer one document • Half the experts prefer the other one • That is hardly the case • Do not expect similar ranks if the experts cannot give similar ranks in the first place

Give Audio instead of Images • Experts may be guided by the images, not the music • Some irrelevant changes in the image can deceive • No music expertise should be needed • Reasonable turker (formerly person) hypothesis

Preference Judgments • In their heads, experts actually do preference judgments • Similar to a binary search • Accelerates assessor fatigue as the list grows • Already noted for level-based relevance • Go back and re-judge [Downie et al., 2010][Jones et al., 2007] • Overlapping between BROAD and FINE scores • Change the relevance assessment question • Which is more similar to Q: A or B? [Carterette et al., 2008]

Preference Judgments (II) • Better than traditional level-based relevance • Inter-assessor agreement • Time to answer • In our case, three-point preferences • A < B (A is more similar) • A = B (they are equally similar/dissimilar) • A > B (B is more similar)

Preference Judgments (and III) • Use a modified QuickSort algorithm to sort documents in a partially ordered list • Do not need all O(n²) judgments, but O(n·log n) • (Figure legend: one mark for the current pivot of the segment, another for documents that have already been a pivot)
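The sorting idea can be sketched as a three-way QuickSort in which each comparison is one preference judgment against the pivot. This is a minimal illustration, not the authors' implementation: the function names, the choice of the first document as pivot, and the `hidden_similarity` table that stands in for human judges are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

# prefer(a, b) mirrors the three-point scale of the slides:
# -1 if a is more similar to the query (A < B), 0 if equally similar (A = B),
# +1 if b is more similar (A > B).
Preference = Callable[[str, str], int]


def sort_into_groups(docs: Sequence[str], prefer: Preference) -> List[List[str]]:
    """Return ordered groups of equally similar documents, most similar first."""
    if not docs:
        return []
    pivot, rest = docs[0], docs[1:]  # pivot choice is arbitrary in this sketch
    more, equal, less = [], [pivot], []
    for doc in rest:
        outcome = prefer(doc, pivot)  # one preference judgment per document
        if outcome < 0:
            more.append(doc)
        elif outcome > 0:
            less.append(doc)
        else:
            equal.append(doc)  # judged "equal": same group as the pivot
    return sort_into_groups(more, prefer) + [equal] + sort_into_groups(less, prefer)


# Hypothetical stand-in for the judges: a hidden similarity value per document.
hidden_similarity = {"A": 3, "B": 3, "C": 2, "D": 1, "E": 1}


def prefer(a: str, b: str) -> int:
    return (hidden_similarity[b] > hidden_similarity[a]) - (
        hidden_similarity[a] > hidden_similarity[b]
    )


groups = sort_into_groups(["C", "A", "D", "B", "E"], prefer)
print(groups)
```

In this toy run the five documents need 6 judgments instead of the 10 required to compare all pairs. In the crowdsourced setting each recursion level would correspond to one batch sent to the workers, with the outcome for a pair decided from several workers' answers rather than from a single function call.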

How Many Assessors? • Ranks are given to each document in a pair • +1 if it is preferred over the other one • -1 if the other one is preferred • 0 if they were judged equally similar/dissimilar • Test for signed differences in the samples • In the original lists 35 experts were used • Ranks of a document between 1 and more than 20 • Our rank sample is less (and equally) variable • rank(A) = -rank(B) ⇒ var(A) = var(B) • Effect size is larger so statistical power increases • Fewer assessors are needed overall
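As an illustration of the +1/-1/0 ranks, the sketch below builds the two mirrored rank samples for one pair and checks that their variances match. The slides do not name the statistical test that was used, so the exact two-sided sign test shown here is an assumption, and the ten judgments are made-up values.

```python
from math import comb, isclose
from statistics import pvariance


def sign_test_p_value(ranks):
    """Exact two-sided sign test on +1/-1/0 ranks; zeros (ties) are dropped."""
    wins = sum(1 for r in ranks if r > 0)
    losses = sum(1 for r in ranks if r < 0)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this uneven under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical judgments from 10 workers for the pair (A, B):
# +1 means A was preferred, -1 means B was preferred, 0 means "equal".
ranks_a = [+1, +1, +1, 0, +1, +1, -1, +1, +1, +1]
ranks_b = [-r for r in ranks_a]  # rank(B) = -rank(A) by construction

same_variance = isclose(pvariance(ranks_a), pvariance(ranks_b))
p_value = sign_test_p_value(ranks_a)
print(same_variance, p_value)
```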

Crowdsourcing Preferences • Crowdsourcing seems very appropriate • Reasonable person hypothesis • Audio instead of images • Preference judgments • QuickSort for partially ordered lists • The task can be split into very small assignments • It should be much cheaper and more consistent • Do not need experts • Do not deceive and increase consistency • Easier and faster to judge • Need fewer judgments and judges

New Domain of Application • Crowdsourcing has been used mainly to evaluate text documents in English • How about other languages? • Spanish [Alonso et al., 2010] • How about multimedia? • Image tagging? [Nowak et al., 2010] • Music similarity?

Data • MIREX 2005 Evaluation collection • ~550 musical incipits in MIDI format • 11 queries also in MIDI format • 4 to 23 candidates per query • Convert to MP3 as it is easier to play in browsers • Trim the leading and trailing silence • 1 to 57 secs. (mean 6) to 1 to 26 secs. (mean 4) • 4 to 24 secs. (mean 13) to listen to all 3 incipits • Uploaded all MP3 files and a Flash player to a private server to stream data on the fly

HIT Design • Reward of 2 cents of a dollar

Threats to Validity • Basically had to randomize everything • Initial order of candidates in the first segment • Alternate between queries • Alternate between pivots of the same query • Alternate pivots as variations A and B • Let the workers know about this randomization • In first trials some documents were judged more similar to the query than the query itself! • Require at least 95% acceptance rate • Ask for 10 different workers per HIT [Alonso et al., 2009] • Beware of bots (always judged equal in 8 secs.)

Summary of Submissions • The 11 lists account for 119 candidates to judge • Sent 8 batches (QuickSort iterations) to MTurk • Had to judge 281 pairs (38%) = 2810 judgments • 79 unique workers over about a day and a half • A total cost (excluding trials) of $70.25
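The figures above can be cross-checked with a little arithmetic. The 2-cent reward and the 10 workers per HIT come from the slides; the half-cent platform fee per judgment is an assumption that is not stated anywhere in the deck, chosen here only because it reproduces the reported total.

```python
pairs = 281              # pairs that had to be judged (from the slides)
workers_per_pair = 10    # different workers requested per HIT (from the slides)
reward_cents = 2.0       # reward shown on the HIT design slide
assumed_fee_cents = 0.5  # ASSUMPTION: platform fee per judgment, not in the slides

judgments = pairs * workers_per_pair
total_dollars = judgments * (reward_cents + assumed_fee_cents) / 100
print(judgments, total_dollars)
```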

Feedback and Music Background • 23 of the 79 workers gave us feedback • 4 very positive comments: very relaxing music • 1 greedy worker: give me more money • 2 technical problems loading the audio in 2 HITs • Not reported by any of the other 9 workers • 5 reported no music background • 6 had formal music education • 9 professional practitioners for several years • 9 play an instrument, mainly piano • 6 performers in choir

Agreement between Workers • Forget about Fleiss’ Kappa • Does not account for the size of the disagreement • A<B and A=B is not as bad as A<B and B<A • Look at all 45 pairs of judgments per pair • +2 if total agreement (e.g. A<B and A<B) • +1 if partial agreement (e.g. A<B and A=B) • 0 if no agreement (i.e. A<B and B<A) • Divide by 90 (all pairs with total agreement) • Average agreement score per pair was 0.664 • From 0.506 (iteration 8) to 0.822 (iteration 2)
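The scoring scheme above can be sketched as follows. The encoding of judgments as -1 (A<B), 0 (A=B) and +1 (A>B), the function names and the example answers are assumptions made for illustration.

```python
from itertools import combinations


def pair_agreement(j1: int, j2: int) -> int:
    """+2 for total agreement, +1 for partial agreement, 0 for no agreement."""
    if j1 == j2:
        return 2
    if j1 == 0 or j2 == 0:
        return 1  # one worker said "equal", the other expressed a preference
    return 0      # opposite preferences


def agreement_score(judgments) -> float:
    """Score in [0, 1] over all pairs of judgments given to one document pair."""
    judgment_pairs = list(combinations(judgments, 2))
    points = sum(pair_agreement(a, b) for a, b in judgment_pairs)
    return points / (2 * len(judgment_pairs))  # 2 * 45 = 90 for 10 workers


# Hypothetical answers of 10 workers: seven A<B, two A=B, one A>B.
example = [-1] * 7 + [0] * 2 + [+1]
score = agreement_score(example)
print(round(score, 3))
```

With ten judgments there are 45 pairs of judgments, so the divisor is the 90 mentioned on the slide.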

Agreement Workers-Experts • Those 10 judgments were actually aggregated • (Table: percentages per row total) • 155 (55%) total agreement • 102 (36%) partial agreement • 23 (8%) no agreement • Total agreement score = 0.735 • Supports the reasonable person hypothesis
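The reported score can be approximately reproduced from the three counts if the same +2/+1/0 weighting used between workers is applied. That weighting is an assumption, since the slide does not say how the 0.735 was computed; note that the three counts on the slide add up to 280 pairs.

```python
# Counts taken from the slide.
total_agreement, partial_agreement, no_agreement = 155, 102, 23

pairs_counted = total_agreement + partial_agreement + no_agreement
# ASSUMPTION: same weights as the worker-worker score (+2, +1, 0).
points = 2 * total_agreement + 1 * partial_agreement + 0 * no_agreement
worker_expert_score = points / (2 * pairs_counted)
print(pairs_counted, worker_expert_score)
```

Under this assumption the result is 412/560, which is close to the reported value.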

Agreement Single Worker-Experts

Agreement (Summary) • Very similar judgments overall • The reasonable person hypothesis still stands • Crowdsourcing seems a doable alternative • No music expertise seems necessary • We could use just one assessor per pair • If we could keep him/her throughout the query

Ground Truth Similarity • Do high agreement scores translate into highly similar ground truth lists? • Consider the original lists (All-2) as ground truth • And the crowdsourced lists as a system’s result • Compute the Average Dynamic Recall [Typke et al., 2006] • And then the other way around • Also compare with the (more consistent) original lists aggregated in Any-1 form [Urbano et al., 2010b]

Ground Truth Similarity (II) • The result depends on the initial ordering • Ground truth = (A, B, C), (D, E) • Results1 = (A, B), (D, E, C) • ADR score = 0.933 • Results2 = (A, B), (C, D, E) • ADR score = 1 • Results1 is identical to Results2 • Generate 1000 (identical) versions by randomly permuting the documents within a group
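The example above can be reproduced with a short sketch. It assumes the usual reading of Average Dynamic Recall: at each position i, count how many of the first i results belong to the ground truth groups that could legitimately occupy the first i positions, divide by i, and average over all positions. Function names and the fixed random seed are choices made for this illustration.

```python
import random
from typing import List, Sequence

Groups = List[List[str]]


def average_dynamic_recall(ground_truth: Groups, ranking: Sequence[str]) -> float:
    """ADR of a flat ranking against a partially ordered ground truth."""
    n = sum(len(group) for group in ground_truth)
    allowed_at = []  # allowed_at[i - 1]: documents acceptable within the top i
    seen = set()
    for group in ground_truth:
        seen = seen | set(group)  # new set, so earlier entries stay unchanged
        allowed_at.extend([seen] * len(group))
    total = 0.0
    for i in range(1, n + 1):
        hits = sum(1 for doc in ranking[:i] if doc in allowed_at[i - 1])
        total += hits / i
    return total / n


def flatten_randomly(groups: Groups, rng: random.Random) -> List[str]:
    """Flatten groups in order, shuffling the documents inside each group."""
    flat = []
    for group in groups:
        flat.extend(rng.sample(group, len(group)))
    return flat


ground_truth = [["A", "B", "C"], ["D", "E"]]
adr1 = average_dynamic_recall(ground_truth, ["A", "B", "D", "E", "C"])
adr2 = average_dynamic_recall(ground_truth, ["A", "B", "C", "D", "E"])

# Average over 1000 equivalent orderings of the same partially ordered result.
rng = random.Random(0)
result_groups = [["A", "B"], ["C", "D", "E"]]
scores = [
    average_dynamic_recall(ground_truth, flatten_randomly(result_groups, rng))
    for _ in range(1000)
]
mean_adr = sum(scores) / len(scores)
print(round(adr1, 3), adr2, min(scores), max(scores))
```

Averaging over random within-group permutations removes the dependence on the arbitrary order in which equally similar documents happen to be listed.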

Ground Truth Similarity (and III) • (Table: min. and max. between square brackets) • Very similar to the original All-2 lists • Like the Any-1 version, also more restrictive • More consistent (workers were not deceived)

MIREX 2005 Revisited • Would the evaluation have been affected? • Re-evaluated the 7 systems that participated • Included our Splines system [Urbano et al., 2010a] • All systems perform significantly worse • ADR scores drop by 9% to 15% • But their ranking is just the same • Kendall’s τ = 1
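A τ of 1 means that every pair of systems is ordered the same way under both ground truths, even though all scores drop. The sketch below computes Kendall's τ (the simple tau-a variant, without tie correction) on placeholder scores; the numbers are hypothetical and are not the MIREX results.

```python
from itertools import combinations


def kendall_tau(scores_x, scores_y) -> float:
    """Kendall's tau-a between two score lists for the same systems."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_x)), 2):
        product = (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j])
        if product > 0:
            concordant += 1   # both ground truths order the pair the same way
        elif product < 0:
            discordant += 1   # the pair swaps order
    total_pairs = len(scores_x) * (len(scores_x) - 1) // 2
    return (concordant - discordant) / total_pairs


# HYPOTHETICAL ADR scores for four systems under each ground truth.
original_lists = [0.70, 0.65, 0.55, 0.50]
crowdsourced_lists = [0.62, 0.56, 0.50, 0.43]  # all lower, same ordering

tau = kendall_tau(original_lists, crowdsourced_lists)
print(tau)
```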

Conclusions • Partially ordered lists should come back • We proposed an alternative methodology • Asked for three-point preference judgments • Used Amazon Mechanical Turk • Crowdsourcing can be used for music-related tasks • Provided empirical evidence supporting the reasonable person hypothesis • What for? • More affordable and large-scale evaluations

Conclusions (and II) • We need fewer assessors • More queries with the same man-power • Preferences are easier and faster to judge • Fewer judgments are required • Sorting algorithm • Avoid inconsistencies (A=B option) • Using audio instead of images gets rid of experts • From 70 expert hours to 35 hours for $70

Future Work • Choice of pivots in the sorting algorithm • e.g. the query itself would not provide information • Study the collections for Audio Tasks • They have more data • Inaccessible • But no partially ordered list (yet) • Use our methodology with one real expert judging preferences for the same query • Try crowdsourcing too with one single worker

Future Work (and II) • Experimental study on the characteristics of music similarity perception by humans • Is it transitive? • We assumed it is • Is it symmetrical? • If these properties do not hold we have problems • If they do, we can start thinking about Minimal and Incremental Test Collections [Carterette et al., 2005]

And That’s It! Picture by 姒儿喵喵
