
Alias Detection in Link Data Sets

Master’s Thesis

Paul Hsiung


Alias Definition

  • Alias of names

    • Dubya = G.W. Bush

    • Usama = Osama

    • G.W. Bush = the President

    • Osama bin Laden = the Emir, the Prince

  • Misspelled words

    • Unintentional (typos)

    • Intentional: mortgage = m0rtg@ge (spam)


In What Context Do Aliases Occur?

  • Newspaper articles

  • Web pages

  • Spam emails

  • Any collection of text


Link Data Set

  • A way to represent the context

  • Composed of a set of names and a set of links

    • Names are extracted from the text

    • Different names can refer to the same entity (“Dubya” and “G.W. Bush”)

    • Each link is a collection of names and represents a relationship among them


Example

Wanted al-Qaeda terror network chief Osama bin Laden and his top aide, Ayman al-Zawahri, have moved out of Pakistan and are believed to have crossed the mountainous border back into Afghanistan.

  • (Osama bin Laden, Ayman al-Zawahri, al-Qaeda)

  • (Pakistan, Osama bin Laden)

  • (Afghanistan, Osama bin Laden)
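
The three links above can be held as plain sets of names. The following is a minimal sketch in Python, not the thesis implementation; the `build_graph` helper is a hypothetical name introduced here, and it assumes that two names are neighbors whenever they share at least one link.

```python
from collections import defaultdict

# Links copied from the example above; each link is a set of names.
links = [
    {"Osama bin Laden", "Ayman al-Zawahri", "al-Qaeda"},
    {"Pakistan", "Osama bin Laden"},
    {"Afghanistan", "Osama bin Laden"},
]

def build_graph(links):
    """Map each name to the set of names it shares at least one link with."""
    graph = defaultdict(set)
    for link in links:
        for name in link:
            graph[name] |= link - {name}
    return dict(graph)

graph = build_graph(links)
print(sorted(graph["Osama bin Laden"]))
```

Keeping each link as a set makes membership tests cheap and ignores the order in which names were extracted from the text.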


Graph Representation

[Figure: graph whose nodes are Pakistan, Afghanistan, al-Qaeda, Osama, and Ayman]


Advantages

  • Link data set is easily understood by computers

  • Mimics the way intelligence communities gather data


Alias Detection

  • Given two names in a link data set, are they aliases (i.e. do they refer to the same entity?)

  • How to measure their alias-ness?

  • Semi-supervised learning


Orthographic Measures

  • String edit distance

    • Minimum number of insertions, deletions, and substitutions required to transform one name into the other

    • SED(Osama, Usama) = 2

    • SED(Osama, Bush) = 7

    • Intuitive measure
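
A minimal sketch of string edit distance by dynamic programming follows. The `sub_cost` parameter is an assumption added here: the values quoted above (2 and 7) are reproduced when a substitution is counted as two operations (one deletion plus one insertion), whereas unit-cost substitution gives 1 and 5 for the same pairs. The comparison is case-sensitive.

```python
def string_edit_distance(a, b, sub_cost=1):
    """Edit distance with unit insert/delete cost and a configurable substitution cost."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else sub_cost
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]

print(string_edit_distance("Osama", "Usama", sub_cost=2))
print(string_edit_distance("Osama", "Bush", sub_cost=2))
```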


Some Orthographic Measures

  • String edit distance

  • Normalized string edit distance

  • Discretized string edit distance


Semantic Measures

  • But what about aliases such as the Prince and Osama?

  • Define friends of Osama as the names that have occurred in the same links as Osama

  • From the link data set, the number of co-occurrences with each friend can be counted

  • Intuition: friends of the Prince look like friends of Osama

  • Treat each friends list as a probability vector
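
One way to collect friends and turn the counts into a probability vector is sketched below. The function names and the toy links are hypothetical, and dividing each count by the total is an assumption about how the vectors are normalized.

```python
from collections import Counter

def friend_counts(links, name):
    """Count how often each other name appears in a link together with `name`."""
    counts = Counter()
    for link in links:
        if name in link:
            counts.update(n for n in link if n != name)
    return counts

def to_probabilities(counts):
    """Scale raw co-occurrence counts so that they sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {friend: c / total for friend, c in counts.items()}

# Toy links, invented for illustration only.
toy_links = [
    {"Osama", "al-Qaeda"},
    {"Osama", "al-Qaeda", "CNN"},
    {"CNN", "Bush"},
]
counts = friend_counts(toy_links, "Osama")
probs = to_probabilities(counts)
print(counts, probs)
```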


Example of Friends

[Figure: friends of Osama with co-occurrence counts: al-Qaeda (10), CNN (2), Islam (5)]


Comparing Two Friends Lists

[Figure: two friends lists compared. Osama: al-Qaeda (10), CNN (2), Islam (5). The Prince: al-Qaeda (2), CNN (8), Music (50).]


Some Semantic Measures

  • Dot Product: 10 * 2 + 2 * 8 = 36

  • Normalized Dot Product

  • Common Friends: 2 (CNN, al-Qaeda)

  • KL Distance:
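
The measures listed above can be sketched as follows, using the counts from the friends-list figure (al-Qaeda and CNN are confirmed by the dot product; the Islam and Music counts are read off the figure labels). Several choices here are assumptions, since the slides do not spell them out: the normalized dot product is taken to be cosine similarity, and the KL distance is taken to be the standard Kullback-Leibler divergence over the union of friends with a small additive smoothing term `eps` so that missing friends do not produce a division by zero.

```python
import math

osama = {"al-Qaeda": 10, "CNN": 2, "Islam": 5}
prince = {"al-Qaeda": 2, "CNN": 8, "Music": 50}

def dot_product(p, q):
    """Sum of count products over friends present in both lists."""
    return sum(v * q[k] for k, v in p.items() if k in q)

def normalized_dot_product(p, q):
    """Cosine similarity (assumed definition of the normalized dot product)."""
    norm = math.sqrt(dot_product(p, p)) * math.sqrt(dot_product(q, q))
    return dot_product(p, q) / norm if norm else 0.0

def common_friends(p, q):
    """Number of friends that appear in both lists."""
    return len(p.keys() & q.keys())

def kl_distance(p, q, eps=1e-6):
    """KL(P || Q) over the union of friends, with additive smoothing (assumption)."""
    keys = p.keys() | q.keys()
    p_total = sum(p.values()) + eps * len(keys)
    q_total = sum(q.values()) + eps * len(keys)
    kl = 0.0
    for k in keys:
        pk = (p.get(k, 0) + eps) / p_total
        qk = (q.get(k, 0) + eps) / q_total
        kl += pk * math.log(pk / qk)
    return kl

print(dot_product(osama, prince), common_friends(osama, prince))
```

Note that KL divergence is not symmetric, so `kl_distance(osama, prince)` and `kl_distance(prince, osama)` generally differ.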


Classifier

  • So we have a link data set

  • We have some measures of what aliases are

  • We can easily hand-pick some examples of aliases

  • Let’s build a classifier!


Classifier Training Set

  • Positive examples: hand-pick pairs of names in link data set that are known aliases

  • Negative examples: randomly pick pairs of names from the same link data set

  • Calculate measures for all the pairs and insert them as attributes into the training set
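
Assembling the training set described above might look like the sketch below. `build_training_set` and its parameters are hypothetical names; excluding known alias pairs from the random negatives is an assumption, as is sampling negatives without replacement. Each entry of `measures` is any function that takes two names and returns a number.

```python
import random
from itertools import combinations

def build_training_set(names, alias_pairs, measures, n_negative, seed=0):
    """Return (features, labels): label 1 for known aliases, 0 for random pairs."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in alias_pairs}
    # Candidate negatives: every pair of names not hand-picked as an alias.
    candidates = [p for p in combinations(names, 2) if frozenset(p) not in known]
    negatives = rng.sample(candidates, min(n_negative, len(candidates)))
    pairs = [(a, b, 1) for a, b in alias_pairs] + [(a, b, 0) for a, b in negatives]
    features = [[m(a, b) for m in measures] for a, b, _ in pairs]
    labels = [label for _, _, label in pairs]
    return features, labels

# Toy inputs, invented for illustration only.
names = ["Osama", "Usama", "Bush", "Dubya", "CNN"]
alias_pairs = [("Osama", "Usama"), ("Bush", "Dubya")]
measures = [
    lambda a, b: abs(len(a) - len(b)),   # stand-in orthographic attribute
    lambda a, b: float(a[0] == b[0]),    # stand-in attribute: same first letter
]
features, labels = build_training_set(names, alias_pairs, measures, n_negative=3)
print(features, labels)
```

Enumerating all pairs is quadratic in the number of names, so a large link data set would need a sampling strategy that avoids materializing every candidate.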



Classifier: Cross-Validation

  • Experimented with Decision Trees, k-Nearest Neighbors, Naïve Bayes, Support Vector Machines, and Logistic Regression

  • Logistic Regression performed the best


Prediction

  • Given a query name in the link data set with known aliases

  • Pair query name with ALL other names

  • Calculate attributes for all pairs

  • Run each pair through the classifier and obtain a score (how likely are they to be aliases?)



Prediction

  • Use the score to sort the pairs from most likely to be an alias to least likely

  • See where the true aliases lie in the sorted list and produce a ROC curve

  • Evaluate classifier based on ROC curve
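
The prediction steps can be sketched as a ranking function. Here `score` is a hypothetical callable standing in for the trained classifier: it takes a list of attributes and returns a number that is higher for more likely aliases. The toy measure and names are invented for illustration.

```python
def rank_candidates(query, names, measures, score):
    """Pair `query` with every other name and sort by classifier score, best first."""
    scored = []
    for name in names:
        if name == query:
            continue
        attributes = [m(query, name) for m in measures]
        scored.append((score(attributes), name))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scored]

# Toy setup: one attribute (number of shared letters) used directly as the score.
names = ["Osama", "Usama", "Bush", "CNN"]
measures = [lambda a, b: len(set(a.lower()) & set(b.lower()))]
ranking = rank_candidates("Osama", names, measures, score=lambda attrs: attrs[0])
print(ranking)
```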


Summary

[Figure: pipeline summary. True alias pairs (excluding the query name) and random pairs → calculate attributes → train logistic regression. Query name → calculate attributes → run classifier → ROC curve.]


ROC Curve

  • Start from (0,0) on the graph

  • Go down the sorted list

  • If the name on the list is a true alias, move y by one unit

  • If the name on the list is not a true alias, move x by one unit
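
The construction above translates directly into a few lines of Python. The function name `roc_path` and the toy ranking are hypothetical.

```python
def roc_path(ranked_names, true_aliases):
    """Trace the unnormalized ROC path: up for a true alias, right otherwise."""
    x, y = 0, 0
    path = [(x, y)]
    for name in ranked_names:
        if name in true_aliases:
            y += 1
        else:
            x += 1
        path.append((x, y))
    return path

path = roc_path(["Usama", "Bush", "the Prince", "CNN"], {"Usama", "the Prince"})
print(path)
```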


Perfect ROC Example

[Figure: ROC curve for a perfect ranking; both axes run from 0 to 3]


ROC Example

[Figure: example ROC curve; both axes run from 0 to 3]


ROC: Normalize

  • Balance positive and negative examples

  • Area under curve (AUC) = 5/9

  • Able to average multiple curves

[Figure: normalized ROC curve; both axes run from 0 to 1]
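
A sketch of the normalized area computation follows. The path is a list of (x, y) points, where y counts true aliases seen so far and x counts non-aliases. The example path is hypothetical: the ordering behind the 5/9 figure is not recoverable from the slide, so one ordering of three true aliases and three non-aliases that yields 5/9 was chosen for illustration.

```python
def normalized_auc(path):
    """Area under a step ROC path after scaling both axes to [0, 1]."""
    max_x, max_y = path[-1]
    if max_x == 0 or max_y == 0:
        raise ValueError("need at least one true alias and one non-alias")
    area = 0
    for (x0, y0), (x1, _) in zip(path, path[1:]):
        area += (x1 - x0) * y0  # each rightward step adds a column of height y0
    return area / (max_x * max_y)

# Hypothetical ranking: alias, non-alias, alias, non-alias, non-alias, alias.
example = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (3, 2), (3, 3)]
perfect = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
print(normalized_auc(example), normalized_auc(perfect))
```

Dividing by the product of the axis lengths is what makes curves from queries with different numbers of aliases comparable.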


Empirical Results

  • Test on one web page link data set and two spam link data sets

  • Hand-pick aliases for each set


Empirical Results

  • Choose an alias from the set of hand-picked aliases as the query name

  • Build the classifier from the remaining alias pairs, i.e. those that are not aliases of the query name

  • Run prediction and obtain the ROC curve

  • Repeat for each alias in the set of hand-picked aliases

  • Average all ROC curves on normalized axes
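
The averaging step could be done as sketched below. This is one possible reading, not necessarily the procedure used in the thesis: each curve is a list of (x, y) step points with true aliases counted on y, both axes are scaled to [0, 1], and the curve heights are averaged at a shared grid of x positions. The function names and the two toy curves are hypothetical.

```python
def curve_height(path, x):
    """Height of a normalized step ROC curve at horizontal position x in [0, 1]."""
    max_x, max_y = path[-1]
    height = 0
    for px, py in path:
        if px / max_x <= x:
            height = py  # points are in order, so the last match is the highest
    return height / max_y

def average_curves(paths, grid):
    """Average several normalized ROC curves at the given x positions."""
    return [sum(curve_height(p, x) for p in paths) / len(paths) for x in grid]

# Two toy curves on a 3 x 3 grid: a perfect ranking and a fully reversed one.
perfect = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
worst = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
averaged = average_curves([perfect, worst], [0.0, 0.5, 1.0])
print(averaged)
```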


Evaluation

  • We want to know how significant each group of attributes is

  • Train one classifier with just orthographic attributes

  • Train another with just semantic attributes

  • Train a third with both sets of attributes

  • Compare the ROC curves and the area under the curve (AUC)


Terrorist Data Set

  • Manually extracted from public web pages

  • News and articles related to terrorism

  • Names mentioned in the articles are subjectively linked

  • Used 919 alias pairs for training



Spam Data Set

  • Collection of spam emails

  • HTML tags are filtered out

  • All the words are converted to tokens with white spaces being the boundaries

  • Common tokens are filtered (e.g. “the” “a”)

  • Each email represents a link

  • Each link contains tokens from corresponding email


Example

Subject:Mortgage rates as low as 2.95%

Ref<suyzvigcffl>ina<swwvvcobadtbo>nce to<shecpgkgffa>day to as low as

2.<sppyjukbywvbqc>95% Sa<scqzxytdcua>ve thou<sdzkltzcyry>sa<sefaioubryxkpl>nds of dol<scarqdscpvibyw>l<sklhxmxbvdr>ars or b<skaavzibaenix>uy the <br>

ho<solbbdcqoxpdxcr>me of yo<svesxhobppoy>ur dr<sxjsfyvhhejoldl>eams!<br>

  • Filtered to:

    (mortgage, rates, low, refinance, today,

    save, thousands, dollars, home, dreams)
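
A minimal sketch of this filtering, applied to a fragment of the email body above, is given below. The stop-word list is a hypothetical stand-in for whatever list of common tokens was actually used, and lowercasing, stripping non-letter characters, and dropping duplicate or purely numeric tokens are assumptions made here.

```python
import re

STOP_WORDS = {"the", "a", "of", "to", "as", "or", "your"}  # hypothetical list

def email_to_link(text, stop_words=STOP_WORDS):
    """Strip tags, split on whitespace, and drop common or non-word tokens."""
    no_tags = re.sub(r"<[^>]*>", "", text)  # remove anything that looks like a tag
    tokens = []
    for raw in no_tags.split():
        token = re.sub(r"[^a-z]", "", raw.lower())  # keep letters only
        if token and token not in stop_words and token not in tokens:
            tokens.append(token)
    return tokens

body = ("Sa<scqzxytdcua>ve thou<sdzkltzcyry>sa<sefaioubryxkpl>nds of "
        "dol<scarqdscpvibyw>l<sklhxmxbvdr>ars")
link = email_to_link(body)
print(link)
```

Removing the tags before tokenizing is what rejoins words such as "thousands" that the spammer split with bogus markup.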




Conclusion

  • Orthographic measures work well

  • Semantic measures are sometimes better and sometimes worse than orthographic measures

  • Combining them produces the best results

  • Future work includes adding other measures such as phonetic string edit distance

  • Larger question: many aliases to many names