
I256: Applied Natural Language Processing


Marti Hearst

Sept 27, 2006

Evaluation Measures

• Precision:

• Proportion of the items you labeled X that the gold standard says really are X

• #correctly labeled by alg / all labels assigned by alg

• #True Positive / (#True Positive + #False Positive)

• Recall:

• Proportion of those items that are labeled X in the gold standard that you actually label X

• #correctly labeled by alg / all possible correct labels

• #True Positive / (#True Positive + #False Negative)
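The two definitions above can be sketched directly from the counts; the example counts below are hypothetical, chosen only to illustrate the formulas:

```python
def precision(tp, fp):
    """#correctly labeled by alg / all labels assigned by alg."""
    return tp / (tp + fp)

def recall(tp, fn):
    """#correctly labeled by alg / all possible correct labels."""
    return tp / (tp + fn)

# Example: the algorithm assigns 8 labels, 6 of them correct,
# and misses 4 items the gold standard labels X.
tp, fp, fn = 6, 2, 4
print(precision(tp, fp))  # 0.75
print(recall(tp, fn))     # 0.6
```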

F-measure

• Can “cheat” with precision scores by labeling (almost) nothing with X.

• Can “cheat” on recall by labeling everything with X.

• Precision and recall typically trade off: tuning to improve one tends to hurt the other

• The F-measure is a balance between the two.

• 2*precision*recall / (recall+precision)
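Because the F-measure is the harmonic mean, "cheating" on one measure alone cannot inflate it. A minimal sketch (the numbers are hypothetical):

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall: 2pr / (p + r)."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)

# Labeling everything X gives perfect recall but poor precision,
# and the F-measure stays low:
print(f_measure(0.1, 1.0))   # ~0.18

# A balanced system scores much better:
print(f_measure(0.75, 0.6))  # ~0.67
```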

Evaluation Measures

• Accuracy:

• Proportion that you got right

• (#True Positive + #True Negative) / N

N = TP + TN + FP + FN

• Error:

• (#False Positive + #False Negative)/N
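Accuracy and error follow the same pattern, dividing by all N judged items. The counts below are hypothetical, picked to show how many true negatives inflate accuracy even when recall is modest:

```python
def accuracy(tp, tn, fp, fn):
    """(#True Positive + #True Negative) / N, where N = TP + TN + FP + FN."""
    n = tp + tn + fp + fn
    return (tp + tn) / n

def error(tp, tn, fp, fn):
    """(#False Positive + #False Negative) / N."""
    n = tp + tn + fp + fn
    return (fp + fn) / n

# Few positives, many negatives: accuracy looks high (0.94)
# even though recall is only 6/10 = 0.6.
tp, tn, fp, fn = 6, 88, 2, 4
print(accuracy(tp, tn, fp, fn))  # 0.94
print(error(tp, tn, fp, fn))     # 0.06
```

This is exactly the situation in the next slide where precision/recall is the better choice: with rare positives, accuracy hides poor performance on the positive class.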

Prec/Recall vs. Accuracy/Error

• When to use Precision/Recall?

• Useful when there are only a few positives and many many negatives

• Also good for ranked ordering

• Search results ranking

• When to use Accuracy/Error

• When every item has to be judged, and it’s important that every item be correct.

• Error is better when the differences between algorithms are very small; it lets you focus on small improvements.

• Speech recognition

Evaluating Partial Parsing

• How do we evaluate it?

Testing our Simple Rule

• Let’s see where we missed:

Incorrect vs. Missed

• Add code to print out which chunks were labeled incorrectly (false positives) and which were missed entirely (false negatives)
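One way to sketch this, assuming each chunk is represented as a (start, end, label) span; the span data below is hypothetical, for illustration only:

```python
# Gold-standard chunks vs. chunks found by our simple rule.
gold = {(0, 2, "NP"), (3, 5, "NP"), (6, 7, "NP")}
predicted = {(0, 2, "NP"), (4, 5, "NP")}

correct = predicted & gold    # true positives
incorrect = predicted - gold  # false positives: labeled, but wrong
missed = gold - predicted     # false negatives: in gold, never labeled

for chunk in sorted(incorrect):
    print("incorrect:", chunk)
for chunk in sorted(missed):
    print("missed:", chunk)

# The same counts give precision and recall for the rule:
p = len(correct) / len(predicted)
r = len(correct) / len(gold)
print("precision:", p, "recall:", r)
```

Separating the two error lists matters for debugging: incorrect chunks suggest the rule is too permissive, while missed chunks suggest it is too narrow.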