
LDA-based Dark Web Analysis



  1. LDA-based Dark Web Analysis

  2. Outline • What is Dark Web? • Why do we need to analyze it? • How to analyze Dark Web: Our Strategy • Web Crawling • Topic Discovery based on Latent Dirichlet Allocation (LDA) • Optimization Process • Conclusion

  3. What is Dark Web? • The Web is a global information platform accessible from different locations. • It is a fast way to spread information anonymously or with few regulations. • Its cost is relatively low compared with other media. • The Dark Web is the place where terrorist/extremist organizations and their sympathizers • exchange ideology • spread propaganda • recruit members • plan attacks • An example of a dark web site: www.natall.com

  4. Why do we need to analyze it? • To find the hidden topics in the Dark Web community, which are • embedded in other large-scale on-line web sites • overloaded with information • multi-lingual

  5. How to analyze Dark Web: architecture of our strategy • GS: Gibbs Sampling – a Markov-chain random walk through the sample space, used to approximate the maximum-likelihood estimate • LDA: Latent Dirichlet Allocation
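The Gibbs sampling step above can be illustrated with a minimal collapsed Gibbs sampler for LDA. This is a sketch, not the GibbsLDA tool the strategy actually uses; the function name `gibbs_lda` and the default values for `alpha`, `beta`, and `iters` are illustrative assumptions.

```python
import random

def gibbs_lda(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: repeatedly resample each word's
    topic assignment z from its conditional given all other assignments."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}

    # Count tables: topic-word counts, document-topic counts, topic totals.
    nkw = [[0] * V for _ in range(K)]
    ndk = [[0] * K for _ in docs]
    nk = [0] * K
    z = []  # one topic assignment per word occurrence
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(K)  # random initial assignment
            zd.append(t)
            nkw[t][wid[w]] += 1; ndk[d][t] += 1; nk[t] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                nkw[t][wid[w]] -= 1; ndk[d][t] -= 1; nk[t] -= 1
                # Unnormalized conditional p(z = t | all other assignments).
                weights = [(nkw[t][wid[w]] + beta) / (nk[t] + V * beta)
                           * (ndk[d][t] + alpha) for t in range(K)]
                r = rng.random() * sum(weights)
                t, acc = 0, weights[0]
                while acc < r:
                    t += 1
                    acc += weights[t]
                z[d][i] = t
                nkw[t][wid[w]] += 1; ndk[d][t] += 1; nk[t] += 1

    # Posterior-mean estimate of the per-topic word distributions phi.
    phi = [[(nkw[t][v] + beta) / (nk[t] + V * beta) for v in range(V)]
           for t in range(K)]
    return vocab, phi
```

Each resampling step is the "random walk" the slide refers to: the chain moves through the space of topic assignments, and the count tables after many iterations yield the estimated topic-word distributions.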

  6. How to analyze Dark Web: architecture of our strategy • Use a web crawler to download text-based documents • Prune them by removing: • all HTML tags • irrelevant content such as images and navigation instructions • Format the result into a plain text file F: F := header {doc}; header := a line containing the number of documents; doc := a line of whitespace-separated terms • Feed the text file to the GibbsLDA analyzer to discover the latent topics • Optimize the topic discovery
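The pruning and formatting steps above can be sketched as follows. This is a minimal illustration using Python's standard-library HTML parser; the helper names `prune` and `write_corpus` are hypothetical, and the output follows the file grammar from the slide (a header line with the document count, then one document per line).

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip HTML tags, keeping only visible text content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = 0  # depth inside <script>/<style>, which carry no prose
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.chunks.append(data)

def prune(html):
    """Remove HTML tags and non-textual residue, returning lowercase terms."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    return re.findall(r"[a-zA-Z]+", text.lower())

def write_corpus(pages, path):
    """Write file F: header := number of documents, doc := terms on one line."""
    docs = [prune(html) for html in pages]
    with open(path, "w") as f:
        f.write(f"{len(docs)}\n")
        for doc in docs:
            f.write(" ".join(doc) + "\n")
    return docs
```

The resulting file can then be handed to the GibbsLDA analyzer, which expects exactly this header-plus-documents layout.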

  7. Criteria to select web crawlers • Able to parse ill-coded web pages • Supports parameterized URLs • Flexible enough to handle different web site structures • The downloaded pages will be read by machine rather than by humans, so some normalization must be applied to ensure the text corpus is well formatted and readable • Easy to maintain, with minimal hardware requirements • Does not need to be extremely fast • Must not introduce intellectual property problems

  8. Web-harvest vs. others

  9. Web-harvest pipeline

  10. Topic discovery based on LDA • LDA is an Information Retrieval (IR) technique • Information Retrieval (IR) • reduces information overload • preserves the essential statistical relationships • Basic and traditional IR methods • tf-idf scheme: term-count pairs => term-by-document matrix • LSI (Latent Semantic Indexing) • pLSI (probabilistic LSI) • Clustering: divides the data set into subsets
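The tf-idf scheme mentioned above can be sketched in a few lines. This is a minimal illustration of the classic term-count-to-matrix construction; the helper name `tfidf_matrix` is hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a term-by-document tf-idf matrix from tokenized documents.
    Classic scheme: tfidf(t, d) = tf(t, d) * log(M / df(t)),
    where M is the number of documents and df(t) the document frequency."""
    M = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))  # document frequency
    counts = [Counter(d) for d in docs]            # raw term frequencies
    matrix = [
        [counts[j][t] * math.log(M / df[t]) for j in range(M)]
        for t in vocab  # one row per term, one column per document
    ]
    return vocab, matrix
```

A term occurring in every document gets idf = log(1) = 0, which is the "reduces information overload" effect: ubiquitous terms carry no discriminative weight.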

  11. Dirichlet Distribution • a generalization of the beta distribution

  12. Beta Distribution • a continuous probability distribution with the probability density function (pdf) defined on the interval [0, 1]
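For reference, the Beta density on [0, 1] and its multivariate generalization, the Dirichlet density (the prior LDA places on topic proportions), can be written as:

```latex
f(x;\alpha,\beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)},
\qquad x \in [0,1]

f(\theta_1,\dots,\theta_K;\alpha_1,\dots,\alpha_K)
  = \frac{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)}
         {\prod_{k=1}^{K}\Gamma(\alpha_k)}
    \prod_{k=1}^{K}\theta_k^{\alpha_k-1},
\qquad \theta_k \ge 0,\ \sum_{k=1}^{K}\theta_k = 1
```

Setting K = 2 in the Dirichlet density recovers the Beta density, which is the sense in which the Dirichlet generalizes the Beta distribution.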

  13. LDA graph • corpus level: • α: Dirichlet prior hyper-parameter on the per-document topic proportions • β: Dirichlet prior hyper-parameter on the per-topic word distributions • M: number of documents • document level: • θ: the document's topic mixture proportion • φ: the mixture components (per-topic word distributions) • N: number of words in a document • word level: • z: hidden topic variable • w: observed word variable [H. Zhang et al., 2007]

  14. LDA vs. Clustering • Clustering simply partitions the corpus; one document belongs to one category • LDA-based analysis allows one document to be classified into different categories because of its hierarchical structure

  15. Optimizing the results (1) • LDA does not infer how many topics there are; this value is set by the user • However, we can evaluate multiple “wild guesses” and choose the best one • f(x) is the number of documents that contain the word x • f(y) is the number of documents that contain the word y • f(x, y) is the number of documents that contain both word x and word y • M is the total number of documents
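The co-occurrence counts f(x), f(y), and f(x, y) defined above can be computed directly from the corpus. The slide does not reproduce the exact distance formula, so the `distance` below (one minus the Jaccard overlap of the document sets containing each word) is an assumption: one common choice that is consistent with these quantities, not necessarily the one the authors used.

```python
from collections import Counter
from itertools import combinations

def doc_freqs(docs):
    """f[x]: number of documents containing word x;
    f2[{x, y}]: number of documents containing both x and y."""
    f, f2 = Counter(), Counter()
    for doc in docs:
        words = set(doc)
        f.update(words)
        f2.update(frozenset(p) for p in combinations(sorted(words), 2))
    return f, f2

def distance(x, y, f, f2):
    """Hypothetical word distance from co-occurrence counts:
    1 - |docs(x) & docs(y)| / |docs(x) | docs(y)| (Jaccard distance)."""
    both = f2[frozenset((x, y))]
    union = f[x] + f[y] - both
    return 1.0 - both / union if union else 1.0
```

Words that always appear together get distance 0; words that never co-occur get distance 1, so a coherent topic should have a small average pairwise distance.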

  16. Optimizing the results (2) • For each candidate topic discovery, compute the average distance between the words in each topic, and select the discovery with the minimum.
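The selection loop described above can be sketched as follows. It is a hypothetical sketch: `dist` stands in for whatever word-distance function is used, and `candidates` maps each guessed topic count K to the per-topic lists of top words produced by an LDA run with that K.

```python
from itertools import combinations

def avg_topic_distance(topic_words, dist):
    """Average pairwise distance among one topic's top words."""
    pairs = list(combinations(topic_words, 2))
    return sum(dist(x, y) for x, y in pairs) / len(pairs)

def best_num_topics(candidates, dist):
    """candidates: {K: [list of top words for each of the K topics]}.
    Choose the K whose topics have the smallest mean average distance."""
    def score(K):
        topics = candidates[K]
        return sum(avg_topic_distance(t, dist) for t in topics) / len(topics)
    return min(candidates, key=score)
```

The intuition is that at the right K, each topic's top words co-occur tightly, so the average intra-topic distance is minimized across the "wild guesses".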

  17. Optimizing the results (3) Results: four topics have the minimum average distance between words in each topic.

  18. A list of topics discovered from www.natall.com: discovering new topics after optimization

  19. Conclusion Web-harvest integrated with LDA is able to • discover hidden latent topics from dark web sites • provide a more flexible and automated tool to counter terrorism • support a measurable way to optimize the results of LDA • provide a generic tool to analyze a variety of websites, such as financial and medical sites

  20. References • Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003. • Zhang, H., Qiu, B., Giles, C. L., Foley, H. C., and Yen, J. An LDA-based Community Structure Discovery Approach for Large-Scale Social Networks. In Proceedings of IEEE Intelligence and Security Informatics, 2007. • Yang, C. C., Shi, X., and Wei, C.-P. Tracing the Event Evolution of Terror Attacks from On-Line News. In Proceedings of IEEE Intelligence and Security Informatics, 2006. • Xu, J., Chen, H., Zhou, Y., and Qin, J. On the Topology of the Dark Web of Terrorist Groups. In Proceedings of IEEE Intelligence and Security Informatics, 2006.
