Alias Detection in Link Data Sets

Master’s Thesis

Paul Hsiung


Alias Definition

  • Alias of names

    • Dubya = G.W. Bush

    • Usama = Osama

    • G.W. Bush = the President

    • Osama bin Laden = the Emir, the Prince

  • Misspelled words

    • Unintentional (typos)

    • Intentional : mortgage = m0rtg@ge (Spam)


In What Context Do Aliases Occur?

  • Newspaper articles

  • Web pages

  • Spam emails

  • Any collection of text


Link Data Set

  • A way to represent the context

  • Composed of a set of names and a set of links

    • Names are extracted from the text

    • Several names can refer to the same entity (“Dubya” and “G.W.Bush”)

    • A link is a collection of names and represents a relationship among those names


Example

Wanted al-Qaeda terror network chief Osama bin Laden and his top aide, Ayman al-Zawahri, have moved out of Pakistan and are believed to have crossed the mountainous border back into Afghanistan

  • (Osama bin Laden, Ayman al-Zawahri, al-Qaeda)

  • (Pakistan, Osama bin Laden)

  • (Afghanistan, Osama bin Laden)
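
As a rough illustration, the three links above could be held in memory as sets of names, from which the graph view follows directly. This is only a sketch: the helper names `names_in` and `neighbors` are hypothetical and do not come from the thesis.

```python
from collections import defaultdict

# The three links from the example: each link is a set of co-occurring names.
links = [
    {"Osama bin Laden", "Ayman al-Zawahri", "al-Qaeda"},
    {"Pakistan", "Osama bin Laden"},
    {"Afghanistan", "Osama bin Laden"},
]

def names_in(links):
    """Return every name that appears in at least one link."""
    return set().union(*links)

def neighbors(links):
    """Map each name to the set of names it shares at least one link with."""
    graph = defaultdict(set)
    for link in links:
        for name in link:
            graph[name] |= link - {name}
    return dict(graph)

graph = neighbors(links)
print(sorted(names_in(links)))
print(sorted(graph["Osama bin Laden"]))
```

Storing each link as a set keeps membership tests cheap and makes the adjacency map a one-pass computation.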


Graph Representation

[Figure: graph whose nodes are Pakistan, Afghanistan, al-Qaeda, Osama, and Ayman]


Advantages

  • Link data set is easily understood by computers

  • Mimics the way intelligence communities gather data


Alias Detection

  • Given two names in a link data set, are they aliases (i.e. do they refer to the same entity?)

  • How to measure their alias-ness?

  • Semi-supervised learning


Orthographic Measures

  • String edit distance

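    • Minimum total cost of the insertions, deletions, and substitutions required to transform one name into the other (the values below correspond to charging 2 per substitution, i.e. one deletion plus one insertion; with unit-cost substitutions they would be 1 and 5)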

    • SED(Osama, Usama) = 2

    • SED(Osama, Bush) = 7

    • Intuitive measure
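
A minimal sketch of string edit distance by dynamic programming follows. The quoted values SED(Osama, Usama) = 2 and SED(Osama, Bush) = 7 come out when a substitution is charged 2; the `sub_cost` parameter is an assumption introduced here to make that choice explicit, and the function name is hypothetical.

```python
def string_edit_distance(a, b, sub_cost=2):
    """Edit distance with unit insert/delete cost and a configurable substitution cost."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                   # delete ca
                curr[j - 1] + 1,                               # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]

print(string_edit_distance("Osama", "Usama"))
print(string_edit_distance("Osama", "Bush"))
```

Passing `sub_cost=1` gives the classic Levenshtein distance instead.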


Some Orthographic Measures

  • String edit distance

  • Normalized string edit distance

  • Discretized string edit distance


Semantic Measures

  • But what about aliases such as the Prince and Osama?

  • Define the friends of Osama as the names that occur in the same links as Osama

  • From the link data set, count how many times each friend co-occurs with Osama

  • Intuition: friends of the Prince look like friends of Osama

  • Treat each friends list as a probability vector
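
One way to collect friends and turn the counts into a probability vector is sketched below. The toy links and the function names `friends` and `to_probabilities` are hypothetical, chosen only for illustration.

```python
from collections import Counter

def friends(name, links):
    """Count how often each other name shares a link with `name`."""
    counts = Counter()
    for link in links:
        members = set(link)
        if name in members:
            counts.update(members - {name})
    return counts

def to_probabilities(counts):
    """Normalize co-occurrence counts so that they sum to 1."""
    total = sum(counts.values())
    return {friend: c / total for friend, c in counts.items()} if total else {}

# Hypothetical toy link data set.
toy_links = [
    ("Osama", "al-Qaeda"),
    ("Osama", "al-Qaeda", "CNN"),
    ("the Prince", "CNN"),
]

osama_friends = friends("Osama", toy_links)
print(osama_friends)
print(to_probabilities(osama_friends))
```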


Example of Friends

[Figure: Osama is linked to al-Qaeda (10), CNN (2), and Islam (5)]


Comparing Two Friends Lists

[Figure: Osama is linked to al-Qaeda (10), CNN (2), and Islam (5); the Prince is linked to al-Qaeda (2), CNN (8), and Music (50)]


Some Semantic Measures

  • Dot Product: 10 * 2 + 2 * 8 = 36

  • Normalized Dot Product

  • Common Friends: 2 (CNN, al-Qaeda)

  • KL Distance
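
The measures listed above can be sketched on the two friends lists from the example. Several details are assumptions, not taken from the slides: the normalized dot product is implemented as cosine similarity, the KL distance is computed on normalized counts with a small additive smoothing constant `epsilon` so that friends missing from one list do not cause a division by zero, and the counts for Islam (5) and Music (50) are read from the figure.

```python
import math

# Co-occurrence counts read from the example figure.
osama = {"al-Qaeda": 10, "CNN": 2, "Islam": 5}
prince = {"al-Qaeda": 2, "CNN": 8, "Music": 50}

def dot_product(p, q):
    """Sum of products over the friends the two lists share."""
    return sum(p[k] * q[k] for k in p.keys() & q.keys())

def normalized_dot_product(p, q):
    """Cosine similarity (an assumed reading of 'normalized')."""
    norm = math.sqrt(dot_product(p, p)) * math.sqrt(dot_product(q, q))
    return dot_product(p, q) / norm if norm else 0.0

def common_friends(p, q):
    """Number of friends that appear in both lists."""
    return len(p.keys() & q.keys())

def kl_distance(p, q, epsilon=1e-6):
    """KL divergence D(p || q) on smoothed, normalized counts."""
    keys = p.keys() | q.keys()
    p_total = sum(p.values()) + epsilon * len(keys)
    q_total = sum(q.values()) + epsilon * len(keys)
    total = 0.0
    for k in keys:
        pk = (p.get(k, 0) + epsilon) / p_total
        qk = (q.get(k, 0) + epsilon) / q_total
        total += pk * math.log(pk / qk)
    return total

print(dot_product(osama, prince))
print(common_friends(osama, prince))
print(normalized_dot_product(osama, prince))
print(kl_distance(osama, prince))
```

Note that KL divergence is not symmetric, so `kl_distance(osama, prince)` and `kl_distance(prince, osama)` generally differ.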


Classifier

  • So we have a link data set

  • We have some measures of what aliases are

  • We can easily hand-pick some examples of aliases

  • Let’s build a classifier!


Classifier Training Set

  • Positive examples: hand-pick pairs of names in link data set that are known aliases

  • Negative examples: randomly pick pairs of names from the same link data set

  • Calculate measures for all the pairs and insert them as attributes into the training set
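
The construction of the training set might look like the sketch below. The function name, the two toy measures, and the choice to skip known alias pairs when sampling negatives are all assumptions made for illustration; the slides only say that negative pairs are picked at random.

```python
import random

def build_training_set(names, alias_pairs, measures, n_negative, seed=0):
    """Return (attributes, label) rows: label 1 for known aliases, 0 for random pairs."""
    rng = random.Random(seed)
    names = sorted(names)
    known = {frozenset(pair) for pair in alias_pairs}
    pairs = [(a, b, 1) for a, b in alias_pairs]
    negatives, attempts = 0, 0
    # Cap the attempts so the loop ends even if few non-alias pairs exist.
    while negatives < n_negative and attempts < 100 * n_negative:
        attempts += 1
        a, b = rng.sample(names, 2)
        if frozenset((a, b)) not in known:  # assumed: avoid labeling a known alias as negative
            pairs.append((a, b, 0))
            negatives += 1
    return [([m(a, b) for m in measures], label) for a, b, label in pairs]

# Hypothetical stand-ins for the orthographic and semantic measures.
toy_measures = [
    lambda a, b: abs(len(a) - len(b)),
    lambda a, b: float(a[0].lower() == b[0].lower()),
]

rows = build_training_set(
    names=["Osama", "Usama", "Bush", "Dubya", "CNN"],
    alias_pairs=[("Osama", "Usama"), ("Bush", "Dubya")],
    measures=toy_measures,
    n_negative=3,
)
for attributes, label in rows:
    print(attributes, label)
```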


Classifier Example:


Classifier : Cross-Validation

  • Experimented with Decision Trees, k-Nearest Neighbors, Naïve Bayes, Support Vector Machines, and Logistic Regression

  • Logistic Regression performed the best


Prediction

  • Given a query name in the link data set with known aliases

  • Pair query name with ALL other names

  • Calculate attributes for all pairs

  • Run each pair through the classifier and obtain a score (how likely are they to be aliases?)
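
The prediction step can be sketched as scoring every candidate pair with a logistic function and sorting by score. The weights, bias, and the single toy measure below are hypothetical placeholders; in the thesis they would come from the trained classifier and the real attribute set.

```python
import math

def logistic_score(attributes, weights, bias):
    """Logistic regression output: estimated probability that a pair is an alias pair."""
    z = bias + sum(w * x for w, x in zip(weights, attributes))
    return 1.0 / (1.0 + math.exp(-z))

def rank_candidates(query, names, measures, weights, bias):
    """Pair the query with every other name and sort by score, highest first."""
    scored = []
    for other in names:
        if other == query:
            continue
        attributes = [m(query, other) for m in measures]
        scored.append((logistic_score(attributes, weights, bias), other))
    return sorted(scored, reverse=True)

# Hypothetical measure: 1.0 if the names agree on everything after the first letter.
toy_measures = [lambda a, b: float(a[1:] == b[1:])]

ranked = rank_candidates(
    query="Osama",
    names=["Osama", "Usama", "Bush"],
    measures=toy_measures,
    weights=[4.0],   # assumed, not learned
    bias=-2.0,       # assumed, not learned
)
for score, name in ranked:
    print(f"{name}: {score:.2f}")
```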


Example


Prediction

  • Use the score to sort the pairs from most likely to be an alias to least likely

  • See where the true aliases lie in the sorted list and produce a ROC curve

  • Evaluate classifier based on ROC curve


Summary

[Flowchart: true alias pairs (excluding the query name) and random pairs → Calc Attributes → Train Logistic Regression; query name → Calc Attributes → Run Classifier → ROC curve]


ROC Curve

  • Start from (0,0) on the graph

  • Go down the sorted list

  • If the name on the list is a true alias, move y by one unit

  • If the name on the list is not a true alias, move x by one unit
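
The walk described above translates almost line for line into code. In this sketch the input is assumed to be a list of booleans in sorted order (True for a true alias); the function name `roc_points` is hypothetical.

```python
def roc_points(sorted_labels):
    """Trace the unnormalized ROC curve for a ranked list of alias flags."""
    x, y = 0, 0
    points = [(x, y)]          # start from (0, 0)
    for is_alias in sorted_labels:
        if is_alias:
            y += 1             # true alias: move up one unit
        else:
            x += 1             # not a true alias: move right one unit
        points.append((x, y))
    return points

# A perfect ranking puts all three true aliases ahead of the three non-aliases.
print(roc_points([True, True, True, False, False, False]))
```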


Perfect ROC Example

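[Figure: perfect ROC curve on a 3 × 3 grid; all true aliases are ranked first, so the curve rises from (0,0) to (0,3) and then runs right to (3,3)]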


ROC Example

[Figure: ROC curve on a 3 × 3 grid for an imperfect ranking]


ROC: Normalize

  • Balance positive and negative examples

  • Area under curve (AUC) = 5/9

  • Able to average multiple curves
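
A short sketch of the normalized area under the curve: count the unit squares under the staircase, then divide by (number of true aliases) × (number of non-aliases) so both axes effectively run from 0 to 1. The ranking used below is a hypothetical one that happens to give 5/9; it is not necessarily the ranking shown on the slide.

```python
from fractions import Fraction

def normalized_auc(sorted_labels):
    """Area under the ROC staircase, scaled to the unit square."""
    positives = sum(1 for flag in sorted_labels if flag)
    negatives = len(sorted_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true alias and one non-alias")
    area, height = 0, 0
    for is_alias in sorted_labels:
        if is_alias:
            height += 1        # curve moves up
        else:
            area += height     # curve moves right: add one column of squares
    return Fraction(area, positives * negatives)

# Hypothetical ranking with 3 true aliases and 3 non-aliases.
print(normalized_auc([True, False, True, False, False, True]))
```

Because every normalized curve lives on the same unit square, curves from queries with different numbers of aliases can be averaged.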

[Figure: normalized ROC plot with both axes running from 0 to 1]


Empirical Results

  • Test on one web page link data set and two spam link data sets

  • Hand-pick aliases for each set


Empirical Results

  • Choose an alias from the set of hand-picked aliases as the query name

  • Build the classifier from the remaining alias pairs, excluding any that are aliases of the query name

  • Do prediction and obtain a ROC curve

  • Repeat for each alias in the set of hand-picked aliases

  • Average all ROC curves on the normalized axes


Evaluation

  • We want to know how significant each group of attributes is

  • Train one classifier with just orthographic attributes

  • Train another with just semantic attributes

  • Train a third with both sets of attributes

  • Compare the ROC curves and the area under the curve (AUC)


Terrorist Data Set

  • Manually extracted from public web pages

  • News and articles related to terrorism

  • Names mentioned in the articles are subjectively linked

  • Used 919 alias pairs for training


Web Page Chart


Spam Data Set

  • Collection of spam emails

  • Filter out HTML tags

  • Split the remaining text into tokens at whitespace

  • Common tokens are filtered (e.g. “the” “a”)

  • Each email represents a link

  • Each link contains tokens from corresponding email


Example

Subject: Mortgage rates as low as 2.95%

Ref<suyzvigcffl>ina<swwvvcobadtbo>nce to<shecpgkgffa>day to as low as

2.<sppyjukbywvbqc>95% Sa<scqzxytdcua>ve thou<sdzkltzcyry>sa<sefaioubryxkpl>nds of dol<scarqdscpvibyw>l<sklhxmxbvdr>ars or b<skaavzibaenix>uy the <br>

ho<solbbdcqoxpdxcr>me of yo<svesxhobppoy>ur dr<sxjsfyvhhejoldl>eams!<br>

  • Filtered to:

    (mortgage, rates, low, refinance, today,

    save, thousands, dollars, home, dreams)
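
The filtering in this example can be approximated in a few lines. This is a sketch under stated assumptions: the stop-word list is a small hypothetical stand-in for the real one, and stripping punctuation and dropping non-alphabetic tokens are guesses made here so that tokens such as "dreams!" and "2.95%" behave as in the filtered output; the thesis may have handled them differently.

```python
import re

# Hypothetical stop-word list; the list actually used is not given in the slides.
STOP_WORDS = {"the", "a", "of", "or", "to", "as", "your"}

def email_to_link(text, stop_words=STOP_WORDS):
    """Strip tags, split on whitespace, and drop common or non-alphabetic tokens."""
    without_tags = re.sub(r"<[^>]*>", "", text)   # removes bogus tags like <suyzvigcffl>
    tokens = []
    for raw in without_tags.lower().split():
        token = raw.strip(".,!?:;%\"'")           # assumed punctuation cleanup
        if token.isalpha() and token not in stop_words:
            tokens.append(token)
    return tokens

print(email_to_link("Ref<suyzvigcffl>ina<swwvvcobadtbo>nce to<shecpgkgffa>day"))
print(email_to_link("ho<solbbdcqoxpdxcr>me of yo<svesxhobppoy>ur dr<sxjsfyvhhejoldl>eams!<br>"))
```

Removing the tags before tokenizing is what reassembles words that the spammer split with invisible markup.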


Spam I Chart


Spam II Chart


Conclusion

  • Orthographic measures work well

  • Semantic measures are sometimes better and sometimes worse than orthographic ones

  • Combining them produces the best results

  • Future work includes adding other measures such as phonetic string edit distance

  • Larger question: many aliases to many names

