Evaluation in Information Retrieval

Evaluation in Information Retrieval (Chapter 8)

Chapter Outline • A chapter on evaluating IR system performance • Discussion about measuring IR systems • Standard test collections • Evaluating unranked retrieval • Evaluating ranked retrieval • Developing reliable test collections • System interface and result summarization

Evaluating Search Engines • The standard approach to evaluating an IR system • Evaluate by measuring the relevance of the search results • What relevance measurement requires • A document collection • A test suite of information needs (≒ queries) • A set of relevance judgments (binary, R or N) • Relevance judgment • Binary classification by an assessor • A human expert judges each query-document pair directly • Relevant or non-relevant • Relevance is assessed against the information need, not the query text

Information Need • Slightly different in meaning from a query • Example of an information need • Is drinking red wine more effective at reducing risk of heart attacks than drinking white wine? • The same need expressed as a query • wine AND red AND white AND heart AND attack AND effective • In practice the two terms are often conflated • A query cannot fully express the user's actual information need

For unbiased evaluation • Many IR systems expose parameters that can be tuned • Tuning the parameters to maximize performance on the test collection itself is meaningless • The result overstates the system's true performance • Tuning the parameters on a separate development test collection and then evaluating on the test collection reduces evaluation bias

Standard Test Collections • The Cranfield collection (late 1950s~) • The pioneering test collection • 1,398 abstracts, 225 queries • Relevance judgments for all query-document pairs • Text Retrieval Conference (TREC) (1992~) • US National Institute of Standards and Technology • 6 CDs containing 1.89 million docs • Relevance judgments for 450 information needs • Relevance judgments are given only for the top k documents

Standard Test Collections • GOV2 • GOV2 web pages, 25 million pages • The largest readily available web collection • NTCIR • NII (Japan) test collections for IR systems • Collection sizes similar to TREC • Focus on East Asian languages • Used to evaluate query performance over documents in more than one language (cross-language retrieval)

Standard Test Collections • Cross Language Evaluation Forum (CLEF) • Centered on European languages and cross-language information retrieval • Reuters-21578 and Reuters-RCV1 • Reuters-21578 contains 21,578 news articles • Later released as Reuters Corpus Volume 1 • 806,791 documents (Ch4, p.63) • 20 Newsgroups • 1,000 articles drawn from each of 20 Usenet newsgroups • 18,941 articles remain after removing duplicates

Evaluation of Unranked Retrieval • Precision & Recall • Precision (P) = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved) • P = fraction of retrieved documents that are relevant = tp / (tp + fp)

Evaluation of Unranked Retrieval • Precision & Recall • Recall (R) = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant) • R = fraction of all relevant documents that are retrieved = tp / (tp + fn)
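The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the retrieved set and the relevance judgments are available as sets of document IDs; the function name `precision_recall` and the IDs `d1`, `d2`, ... are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for an unranked result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant items retrieved
    fp = len(retrieved - relevant)   # non-relevant items retrieved
    fn = len(relevant - retrieved)   # relevant items that were missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs: 4 retrieved, 3 relevant, 2 in common.
p, r = precision_recall(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d3", "d7"},
)
print(p, r)
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen here to avoid division by zero, not something the slides specify.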

Evaluation of Unranked Retrieval • Accuracy = (tp + tn) / (tp + fp + fn + tn) • Widely used to evaluate machine learning classifiers • Not an appropriate measure for IR evaluation • The class distribution of the data is extremely skewed • Typically 99.9% of documents are non-relevant • The more documents a system labels non-relevant, the higher its accuracy

Evaluation of Unranked Retrieval • Example of a search engine that achieves 99% accuracy on every query • Accuracy is unsuitable for evaluating an IR system
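To make the point about skewed data concrete, here is a small sketch of a "search engine" that returns nothing at all. The collection size and the number of relevant documents are made-up values chosen to match the 99.9% non-relevant figure mentioned earlier.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of documents classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical collection: 10,000 documents, 10 of them relevant.
n_docs, n_relevant = 10_000, 10

# The system retrieves nothing, i.e. labels every document non-relevant.
tp, fp = 0, 0
fn, tn = n_relevant, n_docs - n_relevant

acc = accuracy(tp, fp, fn, tn)
recall = tp / (tp + fn)
print(acc, recall)
```

Accuracy comes out very high even though recall is zero, which is why precision and recall are preferred.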

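Evaluation of Unranked Retrieval • Relationship between precision and recall • Both focus on the number of true positives (relevant documents retrieved) • Precision: how many of the retrieved results are true positives • Recall: what fraction of all relevant documents appear in the retrieved results • Both are important measures • Typical web surfers: expect the information they want to be on the first page • expecting high precision • Professional searchers: want the information to be somewhere in the results and will tolerate many false positives • expecting high recall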

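Evaluation of Unranked Retrieval • Relationship between precision and recall • The two trade off against each other and cannot both be maximized • Pushing for high precision lowers recall, and pushing for high recall lowers precision • e.g., retrieving every document gives R = 100% but low P • F-measure • Combines the two into a single measure • The trade-off weighting is set by a parameter α or β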

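Evaluation of Unranked Retrieval • F-Measure • Weighted harmonic mean of precision and recall: F = 1 / (α/P + (1−α)/R) = (β² + 1)PR / (β²P + R) • α ∈ [0,1], β² = (1−α)/α (β² ∈ [0,∞]) • The balanced F1 measure is the most commonly used • β = 1, or equivalently α = ½ • The formula then simplifies to F1 = 2PR / (P + R)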
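The weighted harmonic mean can be sketched as follows, using the standard textbook form F = (β² + 1)PR / (β²P + R) with β² = (1 − α)/α. The function names are hypothetical, and returning 0.0 when the denominator is zero is a convention assumed here.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta form)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # assumed convention when the measure is undefined
    return (b2 + 1) * precision * recall / denom


def f_measure_alpha(precision, recall, alpha=0.5):
    """Equivalent alpha form, 1 / (alpha/P + (1 - alpha)/R); needs P, R > 0."""
    return 1.0 / (alpha / precision + (1 - alpha) / recall)


f1 = f_measure(0.5, 2 / 3)                    # balanced: beta = 1
f2 = f_measure(0.5, 0.25, beta=2.0)           # beta = 2 weights recall more
f2_alpha = f_measure_alpha(0.5, 0.25, alpha=0.2)  # alpha = 0.2 gives beta^2 = 4
print(f1, f2, f2_alpha)
```

Values of β greater than 1 emphasize recall, and values less than 1 emphasize precision.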

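Evaluation of Unranked Retrieval • Why use the harmonic mean rather than the arithmetic mean • It is the smallest of the common means • It is not dominated by large values • The harmonic mean is the usual choice for averaging rates such as work efficiency or speed • Example: 1 relevant document in 10,000 • Retrieving every document gives Recall = 100% and Precision = 0.01% • Arithmetic mean: (100 + 0.01) / 2 ≈ 50% • Harmonic mean: F = 2 × 100 × 0.01 / 100.01 ≈ 0.02%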
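The arithmetic in this example can be checked with the standard library's `statistics` module. This is only a sketch of the slide's numbers, with precision and recall expressed as fractions rather than percentages.

```python
from statistics import harmonic_mean, mean

# 1 relevant document in a collection of 10,000; the system returns everything.
recall = 1.0
precision = 1 / 10_000

arith = mean([precision, recall])          # about 0.50005, i.e. roughly 50%
harm = harmonic_mean([precision, recall])  # about 0.0002, i.e. roughly 0.02%
print(arith, harm)
```

The harmonic mean stays close to the smaller of the two values, so a system cannot score well by maximizing only one of them.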

Evaluation of Ranked Retrieval • Precision, recall, and F-measure are set-based measures that apply to unordered retrieval • In ranked retrieval these measures take different values depending on the size of the retrieved result set (= top k documents), so they can be plotted as a precision-recall curve as k varies

Evaluation of Ranked Retrieval • A Precision-recall plot for a query

Evaluation of Ranked Retrieval • With precision on the y-axis and recall on the x-axis: • If the (k+1)th retrieved document is non-relevant, recall stays the same (the number of true positives is unchanged) but precision drops, so the plot moves straight down • If the (k+1)th retrieved document is relevant, recall goes up and precision rises slightly, so the plot moves up and to the right
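The sawtooth behaviour described above can be reproduced by walking down a ranked list and recording (recall, precision) after each document. This is a minimal sketch; the function name and the example ranking are hypothetical, and the ranking is assumed to be given as a list of booleans marking which positions hold relevant documents.

```python
def precision_recall_points(ranked_relevance, n_relevant):
    """Return a (recall, precision) pair for every cutoff k = 1..len(ranking)."""
    points = []
    tp = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            tp += 1
        points.append((tp / n_relevant, tp / k))
    return points


# Hypothetical ranking: relevant documents at ranks 1 and 3, 2 relevant in total.
ranking = [True, False, True, False, False]
points = precision_recall_points(ranking, n_relevant=2)
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Between ranks 1 and 2 recall is unchanged while precision falls, and at rank 3 both rise, matching the two cases on the slide.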

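Evaluation of Ranked Retrieval • Interpolated precision • Rationale: a user would accept a larger result set, including some non-relevant results, if enlarging it brings in more relevant documents • When a larger set has higher precision, the smaller set's precision is replaced by the maximum precision over the larger sets • P_interp(r) = max_{r' ≥ r} P(r'), where r is the recall level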
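The definition P_interp(r) = max over r' ≥ r of P(r') translates directly into code. The sketch below assumes the precision-recall curve is available as a list of (recall, precision) pairs; the function name and the sample points are hypothetical.

```python
def interpolated_precision(points, r):
    """Highest precision observed at any recall level >= r (0.0 if none)."""
    return max((p for recall, p in points if recall >= r), default=0.0)


# Hypothetical curve from a ranking whose first document is non-relevant
# and whose next two are relevant (2 relevant documents in total).
points = [(0.0, 0.0), (0.5, 0.5), (1.0, 2 / 3)]

raw_at_half = 0.5                                    # measured precision at recall 0.5
interp_at_half = interpolated_precision(points, 0.5)  # lifted to 2/3
interp_at_zero = interpolated_precision(points, 0.0)
print(raw_at_half, interp_at_half, interp_at_zero)
```

Here the measured precision at recall 0.5 is replaced by the higher precision reached at recall 1.0.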

Evaluation of Ranked Retrieval • 11-point interpolated average precision • Interpolated precision is measured at the 11 recall levels 0.0, 0.1, ..., 1.0, and the points are joined by straight lines • The standard measure used in the early TREC competitions • Figure: SabIR/Cornell 8A1 11-point precision from TREC 8 (1999)
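As a rough sketch of how such a curve could be computed: take the interpolated precision at each of the 11 recall levels for every query, then average level by level across queries. The function names and the two example curves below are hypothetical, and per-query curves are assumed to be lists of (recall, precision) pairs.

```python
def eleven_point_curve(points):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for one query."""
    levels = [i / 10 for i in range(11)]
    return [
        max((p for recall, p in points if recall >= level), default=0.0)
        for level in levels
    ]


def average_curves(curves):
    """Average the per-query curves level by level."""
    return [sum(values) / len(values) for values in zip(*curves)]


# Two hypothetical queries, each with 2 relevant documents.
query_a = [(0.5, 1.0), (0.5, 0.5), (1.0, 2 / 3), (1.0, 0.5), (1.0, 0.4)]
query_b = [(0.0, 0.0), (0.5, 0.5), (1.0, 2 / 3)]

curve_a = eleven_point_curve(query_a)
curve_b = eleven_point_curve(query_b)
averaged = average_curves([curve_a, curve_b])
print([round(v, 3) for v in averaged])
```

Plotting `averaged` against the 11 recall levels would give a graph of the kind shown in the figure.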

Evaluation of Ranked Retrieval • Mean Average Precision
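The slide gives only the name of this measure, so the sketch below assumes the usual definition: for each query, average the precision values obtained at the rank of each relevant document (relevant documents that are never retrieved contribute zero), then take the mean over all queries. Function names and example rankings are hypothetical.

```python
def average_precision(ranked_relevance, n_relevant):
    """Average of precision@k over the ranks k that hold a relevant document."""
    if n_relevant == 0:
        return 0.0  # assumed convention for queries with no relevant documents
    tp = 0
    total = 0.0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            tp += 1
            total += tp / k
    return total / n_relevant


def mean_average_precision(runs):
    """runs: list of (ranked_relevance, n_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)


# Two hypothetical queries, each with 2 relevant documents.
runs = [
    ([True, False, True, False, False], 2),  # AP = (1/1 + 2/3) / 2
    ([False, True, True], 2),                # AP = (1/2 + 2/3) / 2
]
map_score = mean_average_precision(runs)
print(map_score)
```

Because each query contributes equally, MAP weights all information needs the same regardless of how many relevant documents they have.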
