
# Performance measures



### Presentation Transcript

1. Performance measures

2. Performance measures for matching
   - The following counts are typically measured to assess matching performance:
     - TP: true positives, i.e. the number of correct matches
     - FN: false negatives, matches that were not correctly detected
     - FP: false positives, proposed matches that are incorrect
     - TN: true negatives, non-matches that were correctly rejected
   - Based on these counts, any particular matching strategy at a particular threshold can be rated by the following measures:
     - True Positive Rate (TPR), also referred to as True Acceptance Rate (TAR) = TP / (TP + FN) = TP / P
     - False Positive Rate (FPR), also referred to as False Acceptance Rate (FAR) = FP / (FP + TN) = FP / N
   - TAR @ 0.001 FAR is a typical performance index used in benchmarks. Ideally, the true positive rate will be close to 1 and the false positive rate close to 0.
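The two rates follow directly from the four counts. The snippet below is a minimal sketch; the function name `rates` and the example counts are hypothetical and chosen only for illustration.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """Return (TPR, FPR) computed from the four confusion counts."""
    positives = tp + fn  # P: all true matches
    negatives = fp + tn  # N: all true non-matches
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


# Hypothetical counts, not taken from any benchmark.
tpr, fpr = rates(tp=18, fn=2, fp=4, tn=76)
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```

With these counts, 18 of the 20 true matches are found and 4 of the 80 non-matches are wrongly accepted.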

3. ROC curves
   - As we vary the matching threshold at which TPR and FPR are obtained, we derive a set of points in the TPR-FPR space, which are collectively known as the receiver operating characteristic (ROC) curve.
   - The ROC curve plots the true positive rate against the false positive rate for a particular combination of feature extraction and matching algorithms. The area under the ROC curve (AUC) is often used as a scalar measure of performance.
   - As the threshold θ is increased (taking θ as the maximum distance at which a match is accepted), the numbers of true positives and false positives both increase. The closer the curve lies to the upper left corner, the better the performance.
   - The ROC curve can also be used to calculate the Mean Average Precision, which is the average precision as you vary the threshold to select the best results.
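A ROC curve can be traced by sorting candidate matches by score and accepting them one at a time. The sketch below assumes similarity scores (higher means more match-like) with no ties, and estimates the AUC with the trapezoidal rule; the names `roc_points` and `auc` and the sample data are illustrative assumptions.

```python
def roc_points(scores, labels):
    """Sweep the acceptance threshold from strict to permissive.

    scores: similarity scores (higher = more match-like; assumed distinct).
    labels: 1 for a true match, 0 for a true non-match.
    Returns a list of (FPR, TPR) points starting at (0, 0).
    """
    p = sum(labels)
    n = len(labels) - p
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _score, is_match in ranked:
        if is_match:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))
    return points


def auc(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores and ground-truth labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points[-1], round(auc(points), 3))
```

Accepting every candidate always ends the curve at (1, 1); a perfect ranking would give an AUC of 1.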

4. Performance measure for retrieval sets
   - [Figure: document selection versus document ranking. Selection: f(d,q) ∈ {0, 1}, returning a set R'(q) that approximates the true relevant set R(q). Ranking: f(d,q) is a score that orders the documents, e.g. 0.98 d1 (+), 0.95 d2 (+), 0.83 d3 (−), 0.80 d4 (+), 0.76 d5 (−), 0.56 d6 (−), 0.34 d7 (−), 0.21 d8 (+), 0.21 d9 (−), with R'(q) given by a cut-off on the list.]
   - The definition of performance measures for retrieval sets stems from information retrieval.
   - The case of document selection is distinguished from the case in which the position in the retrieval set is considered (document ranking).
   - With selection, the classifier is often inaccurate:
     - "Over-constrained" query (terms are too specific) → no relevant documents found
     - "Under-constrained" query (terms are too general) → over-delivery
   - Even if the classifier is accurate, not all relevant documents are equally relevant.
   - Ranking allows the user to control the boundary according to his/her preferences.

5. Performance measures for unranked retrieval sets
   - [Figure: the set of all documents split into retrieved / not retrieved and relevant / irrelevant, giving four regions: retrieved & relevant, retrieved & irrelevant, not retrieved but relevant, not retrieved & irrelevant.]
   - The two most frequent and basic measures for unranked retrieval sets are Precision and Recall. These are first defined for the simple case where the information retrieval system returns a set of documents for a query.
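Taking precision as the fraction of retrieved documents that are relevant, and recall as the fraction of relevant documents that are retrieved, both measures reduce to set operations. The sketch below uses hypothetical document identifiers and an illustrative function name.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Precision and recall of an unranked retrieval set."""
    hits = len(retrieved & relevant)  # retrieved & relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result and ground truth.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d7", "d8", "d9"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the five relevant documents are retrieved.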

6. [Figure: three retrieval sets drawn against the relevant set: very high precision with very low recall; high recall but low precision; high precision and high recall.]
   - The advantage of having two numbers is that one is more important than the other in many circumstances:
     - Surfers would like every result on the first page to be relevant (i.e. high precision).
     - Professional searchers are more concerned with high recall and will tolerate low precision.

7. F-Measure: a single measure that takes into account both recall and precision. It is the (equally weighted) harmonic mean of precision P and recall R: F = 2PR / (P + R).
   - Compared to the arithmetic mean, both precision and recall must be high for the harmonic mean to be high.

8. E-Measure (parameterized F-Measure): a variant of the F-measure that trades off precision versus recall through a parameter β, which allows the emphasis on precision relative to recall to be weighted.
   - The value of β controls the trade-off:
     - β = 1: equally weights precision and recall (E = F).
     - β > 1: weights recall more.
     - β < 1: weights precision more.
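The transcript does not preserve the slide's formula. A common parameterization that is consistent with the three cases above is (1 + β²)PR / (β²P + R); the sketch below assumes that form, and the function name and sample values are illustrative.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Parameterized F-measure, assuming (1 + b^2) P R / (b^2 P + R).

    beta = 1 gives the plain harmonic mean of precision and recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical system with high precision but low recall.
p, r = 0.9, 0.1
arithmetic_mean = (p + r) / 2
f1 = f_measure(p, r)              # equal weighting
f2 = f_measure(p, r, beta=2.0)    # recall weighted more
f05 = f_measure(p, r, beta=0.5)   # precision weighted more
print(arithmetic_mean, round(f1, 3), round(f2, 3), round(f05, 3))
```

Because recall is the weak side in this example, weighting recall more (β = 2) lowers the score and weighting precision more (β = 0.5) raises it, while the arithmetic mean hides the imbalance.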

9. Performance measures for ranked retrieval sets
   - In a ranking context, appropriate sets of retrieved documents are given by the top k retrieved documents. For each such set, precision and recall values can be plotted to give a Precision-Recall curve.
   - The Precision-Recall curve plots the trade-off between relevant and non-relevant items retrieved.
   - [Figure: Precision (0 to 1) against Recall (0 to 1), annotated "The ideal case", "Many relevant documents but many other useful missed", and "Most relevant documents, but also many non-relevant".]
   - Slide content from J. Ghosh

10. Computing Precision-Recall points
    - Precision-Recall plots are built as follows:
      - For each query, produce the ranked list of retrieved documents. Setting different thresholds on this ranked list results in different sets of retrieved documents, and therefore in different recall/precision measures.
      - Mark each document in the ranked list that is relevant.
      - Compute a recall/precision pair for each position in the ranked list that contains a relevant document.
    - Slide content from J. Ghosh

11. Example 1
    - Let the total number of relevant documents be 6. Check each new recall point:
      - R = 1/6 = 0.167; P = 1/1 = 1
      - R = 2/6 = 0.333; P = 2/2 = 1
      - R = 3/6 = 0.5; P = 3/4 = 0.75
      - R = 4/6 = 0.667; P = 4/6 = 0.667
      - R = 5/6 = 0.833; P = 5/13 = 0.38
    - One relevant document is missing, so the ranking does not reach 100% recall.
    - Slide from J. Ghosh

12. Example 2
    - Let the total number of relevant documents be 6. Check each new recall point:
      - R = 1/6 = 0.167; P = 1/1 = 1
      - R = 2/6 = 0.333; P = 2/3 = 0.667
      - R = 3/6 = 0.5; P = 3/5 = 0.6
      - R = 4/6 = 0.667; P = 4/8 = 0.5
      - R = 5/6 = 0.833; P = 5/9 = 0.556
      - R = 6/6 = 1.0; P = 6/14 = 0.429
    - Slide from J. Ghosh
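The procedure and both examples can be reproduced with a few lines of code. In the sketch below, the ranks of the relevant documents are inferred from the denominators of the precision values (for instance P = 3/4 means the third relevant document sits at rank 4); the list lengths and all names are assumptions made for illustration.

```python
def pr_points(relevance, total_relevant):
    """Return (recall, precision) at each rank holding a relevant document.

    relevance: booleans in ranked order, True where the document is relevant.
    total_relevant: number of relevant documents in the whole collection.
    """
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points


def ranking(relevant_ranks, length):
    """Build a ranked relevance list from the 1-based ranks of relevant items."""
    return [rank in relevant_ranks for rank in range(1, length + 1)]


# Ranks inferred from the fractions in Examples 1 and 2.
example_1 = ranking({1, 2, 4, 6, 13}, length=14)
example_2 = ranking({1, 3, 5, 8, 9, 14}, length=14)

points_1 = pr_points(example_1, total_relevant=6)
points_2 = pr_points(example_2, total_relevant=6)
for recall, precision in points_1:
    print(f"R = {recall:.3f}; P = {precision:.3f}")
```

Example 1 yields only five points because the sixth relevant document is never retrieved.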

13. Interpolated Precision-Recall curves
    - Precision-recall curves have a distinctive saw-tooth shape:
      - if the (k+1)th document retrieved is non-relevant, then recall is the same as for the top k documents, but precision drops;
      - if it is relevant, then both precision and recall increase, and the curve jags up and to the right.
    - Interpolated precision is often useful to remove these jiggles. The interpolated precision at a certain recall level r is defined as the highest precision found for any recall level r′ ≥ r: $p_{int}(r) = \max_{r' \ge r} p(r')$
    - [Figure: interpolated precision at recall level r.]
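Interpolation amounts to taking a maximum over the points at or beyond the requested recall level. The sketch below applies it to the recall/precision pairs of Example 2; returning 0 when no point reaches the requested recall is an assumption of this sketch, as are the names.

```python
def interpolated_precision(points, r):
    """Highest precision among points whose recall is at least r.

    points: list of (recall, precision) pairs.
    Returns 0.0 if no point reaches recall r (a convention assumed here).
    """
    return max((p for recall, p in points if recall >= r), default=0.0)


# Recall/precision pairs of Example 2.
example_2 = [(1 / 6, 1 / 1), (2 / 6, 2 / 3), (3 / 6, 3 / 5),
             (4 / 6, 4 / 8), (5 / 6, 5 / 9), (6 / 6, 6 / 14)]

p_at_06 = interpolated_precision(example_2, 0.6)
p_at_00 = interpolated_precision(example_2, 0.0)
print(round(p_at_06, 3), p_at_00)
```

At recall 0.6 the raw curve dips to 0.5 (at R = 0.667) before rising to 0.556 (at R = 0.833); interpolation keeps the higher value and so removes the dip.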

14. In order to obtain reliable performance measures, performance is averaged over a large set of queries:
    - Compute the average precision at each standard recall level across all queries.
    - Plot average precision/recall curves to evaluate overall system performance on a document/query corpus.
    - [Figure: two Precision against Recall plots, one with precision values 1, 1, 0.75, 0.667, 0.38 and one with 1, 0.67, 0.6, 0.5, 0.556, 0.429.]
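Averaging across queries can be sketched as follows. The choice of 11 standard recall levels (0.0, 0.1, ..., 1.0) and the use of interpolated precision at each level are assumptions, since the slide does not state which levels it uses; the two queries are the recall/precision pairs of Examples 1 and 2.

```python
def interpolated_precision(points, r):
    """Highest precision among (recall, precision) points with recall >= r."""
    return max((p for recall, p in points if recall >= r), default=0.0)


def average_curve(per_query_points, levels):
    """Average the interpolated precision over queries at each recall level."""
    n_queries = len(per_query_points)
    return [
        (r, sum(interpolated_precision(pts, r) for pts in per_query_points) / n_queries)
        for r in levels
    ]


# Assumed standard recall levels: 0.0, 0.1, ..., 1.0.
levels = [i / 10 for i in range(11)]

query_1 = [(1 / 6, 1 / 1), (2 / 6, 2 / 2), (3 / 6, 3 / 4),
           (4 / 6, 4 / 6), (5 / 6, 5 / 13)]
query_2 = [(1 / 6, 1 / 1), (2 / 6, 2 / 3), (3 / 6, 3 / 5),
           (4 / 6, 4 / 8), (5 / 6, 5 / 9), (6 / 6, 6 / 14)]

curve = average_curve([query_1, query_2], levels)
for r, p in curve:
    print(f"recall {r:.1f}: average precision {p:.3f}")
```

The first query never reaches full recall, so it contributes 0 at the highest levels and pulls the averaged curve down there.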

15. Comparing performance of two or more systems
    - When the performance of two or more systems is compared, the curve closest to the upper right-hand corner of the graph indicates the best performance.
    - [Figure: Precision against Recall curves for several systems, with one curve labelled "This system has the best performance".]
    - Slide from J. Ghosh

16. Other performance measures for ranked retrieval sets
    - Average Precision (AP) is a typical performance measure used for ranked sets. It is defined as the average of the precision scores after each relevant item (true positive, TP) in the scope S. Given a scope S = 7 and a ranked list (gain vector) G = [1,1,0,1,1,0,0,1,1,0,1,0,0,..], where 1/0 indicate the gains associated with relevant/non-relevant items, respectively: AP = (1/1 + 2/2 + 3/4 + 4/5) / 4 = 0.8875.
    - Mean Average Precision (MAP): the average of the average precision values over a set of queries.
    - Average Dynamic Precision (ADP) is also used. It is defined as the average of the precisions obtained with increasing scope S, with 1 ≤ S ≤ #relevant items: ADP = (1 + 1 + 0.667 + 0.75 + 0.80 + 0.667 + 0.571) / 7 = 0.779
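Both worked values can be checked numerically. The sketch below follows the definitions as stated on the slide (AP averages over the relevant items found within the scope; ADP averages the precision at every scope from 1 to the maximum scope); the function names are illustrative, and the gain vector is truncated where the slide's list ends.

```python
def average_precision(gains, scope):
    """Mean of the precision values at each relevant item within the scope."""
    hits = 0
    precisions = []
    for rank, gain in enumerate(gains[:scope], start=1):
        if gain:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def average_dynamic_precision(gains, max_scope):
    """Mean of precision@S for S = 1..max_scope (max_scope <= len(gains))."""
    hits = 0
    total = 0.0
    for rank, gain in enumerate(gains[:max_scope], start=1):
        hits += gain
        total += hits / rank
    return total / max_scope


# Gain vector from the slide (1 = relevant, 0 = non-relevant).
G = [1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

ap = average_precision(G, scope=7)
adp = average_dynamic_precision(G, max_scope=7)
print(round(ap, 4), round(adp, 3))
```

The listed part of G contains seven relevant items, which matches the upper bound S = 7 used in the ADP example.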

17. Other measures for ranked retrieval sets usually employed in benchmarks are the mean values of:
    - Recognition Rate: total number of queries for which a relevant item is in the 2nd position of the ranked list, divided by the number of items in the dataset.
    - 1st tier and 2nd tier: average number of relevant items retrieved in the first n and 2n positions of the ranked list, respectively (n = 7 is typically used in benchmarks).
    - Cumulated Gain (CG) at a particular rank position p: $CG_p = \sum_{i=1}^{p} rel_i$, where $rel_i$ is the graded relevance of the result at position i (rank 5 is typically used).
    - Discounted Cumulated Gain (DCG) at a particular rank position p: highly relevant documents appearing lower in a search result list are penalized by reducing the graded relevance value logarithmically proportional to the position of the result.
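The transcript does not preserve the DCG formula, and several variants are in use. The sketch below assumes one common form, in which the gain at position i is divided by log2(i + 1), alongside the plain cumulated gain; the graded relevance values are hypothetical.

```python
import math


def cumulated_gain(rel, p):
    """Sum of the graded relevance values of the first p results."""
    return sum(rel[:p])


def discounted_cumulated_gain(rel, p):
    """DCG at rank p, assuming a log2(i + 1) discount at position i."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(rel[:p], start=1))


# Hypothetical graded relevance of a ranked result list (0 = not relevant).
rel = [3, 2, 3, 0, 1, 2]

cg = cumulated_gain(rel, p=5)
dcg = discounted_cumulated_gain(rel, p=5)
print(cg, round(dcg, 3))
```

CG ignores where the relevant results appear, whereas DCG shrinks the contribution of results found further down the list.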
