Large-Scale Entity-Based Online Social Network Profile Linkage



  1. Large-Scale Entity-Based Online Social Network Profile Linkage

  2. Background & Motivation • Footprints in different social networks • User identification in social analysis • Privacy & security • Commercial & government applications

  3. Outline • Problem definition • Related work • Approach • Experiment • Conclusion & future work

  4. Problem Definition • Terminology • Identity: a person • Profile/User: that person's footprint on a social media site • Profile Linkage: linking the footprints of the same person together • Input & Output • Input: the profiles of one site as QUERY and the profiles of the other site as TARGET. • Output: all profile pairs classified as matches.

  5. Characteristics of profiles • Name (semi-structured vs. structured) • {“given name”: “haochen”, “family name”: “zhang”} • name: zhanghaochen • Semi-structured schema • Incompleteness & missing attributes • Privacy policy • Virtual identification • Free-text description • Bio, About me, Tags • Multilingualism

  6. Multilingualism • Top 5 languages in the Facebook dataset • English • Portuguese • Spanish • Chinese • French • Most frequent tokens in different languages • chris, john, michael • chen, wang, lee • carlos, garcia, daniel • sergey, olga, alexander • About 70% of users use English • 7.2% of users are registered under different locales • Transliteration • 昊辰 => Haochen

  7. Related work • String similarity metrics • Name matching (Jaro-Winkler, edit distance) • VSM (TF-IDF) • Pair-wise comparison • Schema matching for different prototypes • Unsupervised vs. supervised • Indexing techniques • Blocking • Canopy
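
As an illustration of the name-matching metrics listed on this slide, here is a minimal Python sketch of Jaro-Winkler similarity. It follows the standard textbook definition; the prefix weight `p=0.1` and the 4-character prefix cap are the conventional defaults and are assumptions here, since the slides do not state which parameters were used.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

The prefix boost is what makes the metric attractive for names and usernames, where differences tend to occur toward the end of the string.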

  8. Overview of approach

  9. Entity-based representation • Extraction • Attribute => Entity • Named-entity recognition • Language detector • Regular expression • Involved entities • Username • Name • Location • Organization • URL • Language • Country • Gender • Birth • Tokenization • General representation • Microsoft Word Breaker • Preparation for canopy
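
The attribute-to-entity step can be pictured with a small regular-expression sketch in Python. Everything here is hypothetical: the patterns, the `extract_entities` helper, and the attribute names are illustrative assumptions, not the extraction rules used in the presentation, and only three of the listed entity types are covered.

```python
import re
from typing import Dict, List

# Hypothetical patterns; real profiles would need broader coverage.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")
BIRTH_PATTERN = re.compile(
    r"\b(?:born|birthday)\D{0,20}((?:19|20)\d{2})\b", re.IGNORECASE
)


def extract_entities(profile: Dict[str, str]) -> Dict[str, List[str]]:
    """Map raw profile attributes to typed entity lists."""
    entities: Dict[str, List[str]] = {"url": [], "birth": [], "name": []}
    for attribute, value in profile.items():
        # URLs and birth years can hide in any attribute, including free text.
        entities["url"].extend(URL_PATTERN.findall(value))
        entities["birth"].extend(BIRTH_PATTERN.findall(value))
        # Name-like attributes (but not the username) feed the Name entity.
        key = attribute.lower()
        if "name" in key and key != "username":
            entities["name"].append(value.strip().lower())
    return entities
```

Mapping every attribute to the same small set of entity types is what lets profiles with different, semi-structured schemas be compared field by field.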

  10. Canopy: design
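
The transcript carries no text for this slide, so the following Python sketch shows the generic textbook canopy procedure rather than the design presented in the talk. The cheap distance (Jaccard distance over token sets), the `loose` and `tight` thresholds, and the variant in which membership is tested only against points still on the list are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Cheap distance: 1 minus the Jaccard overlap of two token sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def build_canopies(
    items: Dict[str, Set[str]], loose: float, tight: float
) -> List[Set[str]]:
    """Group item ids into possibly overlapping canopies.

    Items within `loose` of a center join its canopy; items within `tight`
    are also removed from the list of future centers.
    """
    if tight > loose:
        raise ValueError("tight threshold must not exceed loose threshold")
    remaining = list(items)  # candidate centers, in insertion order
    canopies: List[Set[str]] = []
    while remaining:
        center = remaining[0]
        canopy = {center}
        still_remaining = []
        for other in remaining[1:]:
            d = jaccard_distance(items[center], items[other])
            if d <= loose:
                canopy.add(other)
            if d > tight:
                still_remaining.append(other)
        canopies.append(canopy)
        remaining = still_remaining
    return canopies
```

Only pairs that share a canopy would then be passed to the expensive classifier, which is where the efficiency gain comes from: a looser threshold keeps more candidate pairs, a tighter one prunes more.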

  11. Canopy: efficiency

  12. Classification: features • String similarity (Jaro-Winkler similarity) • Username, Name • Token similarity (n-gram, IDF) • Username, Name, Location, URL, Organization • All tokens • Enumeration identity • Language, Country
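
One plausible reading of the "token similarity (n-gram, IDF)" feature is an IDF-weighted cosine similarity over tokens or character n-grams. The Python sketch below works under that assumption; the smoothed IDF formula, the `#` padding character, and the function names are illustrative choices, not details taken from the slides.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Character n-grams of the lowercased text, padded with '#'."""
    pad = "#" * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def idf_table(documents: Iterable[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every token seen."""
    docs = [set(doc) for doc in documents]
    df = Counter(tok for doc in docs for tok in doc)
    total = len(docs)
    return {tok: math.log((1 + total) / (1 + c)) + 1.0 for tok, c in df.items()}


def idf_cosine(
    a: List[str], b: List[str], idf: Dict[str, float], default_idf: float = 1.0
) -> float:
    """Cosine similarity of two token lists under TF-IDF weighting."""
    va = {t: c * idf.get(t, default_idf) for t, c in Counter(a).items()}
    vb = {t: c * idf.get(t, default_idf) for t, c in Counter(b).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With IDF weighting, two profiles that share a rare token such as "haochen" score higher than two that share only a frequent one such as "zhang", which matches the intuition that common family names are weak evidence of a match.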

  13. Classification: learning • Supervised Learning • SVM • Naïve Bayes • C4.5 • AdaBoost with C4.5 • Problems • Imbalance between positive instances and negative instances.

  14. Experimental dataset • Data sources • Google+ • Twitter • Facebook • 20000+ profiles for each social network and 10000+ matched pairs

  15. Experiment on artificial dataset • Balanced dataset with equal numbers of positive instances and randomly selected negative instances, so matched and unmatched links differ considerably. • Name features are the most important features, although they largely mirror each other. • The country extracted from a URL may be misleading.
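
A balanced set of this kind can be built by randomly undersampling the negative pairs. The Python sketch below is one simple way to do that; the `undersample_negatives` helper, the `ratio` parameter, and the fixed seed are assumptions for illustration and are not described in the slides.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (query profile id, target profile id)


def undersample_negatives(
    positives: List[Pair],
    negatives: List[Pair],
    ratio: float = 1.0,
    seed: int = 0,
) -> List[Tuple[Pair, int]]:
    """Return shuffled (pair, label) data with about `ratio` negatives per positive."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    wanted = int(round(len(positives) * ratio))
    sampled = rng.sample(negatives, min(len(negatives), wanted))
    data = [(p, 1) for p in positives] + [(n, 0) for n in sampled]
    rng.shuffle(data)
    return data
```

Because randomly drawn negatives rarely resemble true matches, a classifier trained and tested this way faces an easier problem than it does on the pairs that survive canopy pruning.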

  16. Experiment on overall dataset • The dataset is imbalanced: the ratio of POS to NEG can be 1:100 even after pruning with canopy. • The remaining candidate pairs are more similar to one another, which hurts performance. • Failed pruning and excessive reliance on name features hurt recall.

  17. Parameter tuning • A larger threshold admits more candidates, which interfere with the classifier. • A smaller threshold mistakenly prunes more matched links.

  18. Efficiency

  19. Conclusion • We investigated the characteristics of social network user profiles. • We proposed a supervised approach with canopy to solve the large-scale profile linkage task. • The approach is shown to be both effective and efficient, and its run time and complexity can be controlled.

  20. Future work • Investigate the characteristics of profiles in different locales more deeply. • Improve the learning techniques. • Automatic semi-structured profile comparison or schema mapping. • Improve the approach for the web people search task.

  21. Theta vs. Corpus size

  22. Web People Search • Search via a search engine. • Query by username or tokens • 3 * 2 * 8 * 2 = 96 queries, 675 candidates, 63 ground truth • Evaluation (classifier trained on the overall training set) • SVM: P=0.30, R=0.77, F1=0.43 and Accuracy=0.81 • AdaBoosted C4.5: P=0.11, R=0.96, F1=0.21 and Accuracy=0.31 • Usernames or names are too similar for unmatched instances to be classified correctly.
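
The reported F1 values can be sanity-checked against the reported precision and recall using the usual harmonic-mean definition, F1 = 2PR / (P + R). The short Python sketch below does this; the `f1_score` helper is illustrative and not part of the original work.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded P and R values as reported on the slide.
svm_f1 = f1_score(0.30, 0.77)
adaboost_f1 = f1_score(0.11, 0.96)
```

From the rounded inputs, the SVM value comes out at about 0.43, matching the slide. The AdaBoosted C4.5 value comes out at about 0.20 rather than the reported 0.21, a gap small enough to be explained by the slide's P and R having been rounded to two decimals.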

  23. Thank you
