Building a scalable distributed WWW search engine … NOT in Perl!

Building a scalable distributed WWW search engine … NOT in Perl! Presented by Alex Chudnovsky (http://www.majestic12.co.uk) at Birmingham Perl Mongers User Group (http://birmingham.pm.org). V1.0 27/07/05. Contents: History, Goals, Architecture, Implementation, Why not Perl?, Conclusions


Presentation Transcript


  1. Building a scalable distributed WWW search engine … NOT in Perl! Presented by Alex Chudnovsky (http://www.majestic12.co.uk) at Birmingham Perl Mongers User Group (http://birmingham.pm.org) V1.0 27/07/05

  2. Contents • History • Goals • Architecture • Implementation • Why not Perl? • Conclusions • Credits • Recommended reading

  3. History (of my work in the area of information retrieval) • First primitive pathetic stone-age search engine: 1000 documents in the “index” (1997, Perl) • Second engine using proper inverted indexing for Jungle.com: 500,000 products indexed (Perl + Java, 2002) • Current: 50,000,000 pages indexed with a lot more to go (to be revealed, 2005)

  4. Goals • Build a distributed WWW search engine capable of dealing with at least 1 billion web pages, based on the principles of SETI@Home and D.NET • See to it that the language chosen for the implementation (more on this later) fits the purpose or, more likely, learn how to make it work • Eventually make some money out of it

  5. Architecture • Data collection (crawling) • Indexing: turning text into numbers • Merging: turning indexed barrels into single searchable index • Searching: locating documents for given keywords

  6. Data collection (crawling) • Distributed crawlers: receive lists of URLs to crawl, crawl them and send back compressed data • Base: issues URLs to crawl and receives compressed pages • In the future will do distributed indexing • Note: this stage is optional if you already have data to index, i.e. a list of products with their descriptions
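The exchange between the base and a crawler described on this slide can be sketched roughly as follows. This is only an illustration in Python, not the project's actual protocol: the function names `crawl_work_unit` and `unpack_work_unit`, the `fetch` callback, and the choice of JSON plus zlib for packaging are all assumptions made for the example.

```python
import json
import zlib
from typing import Callable, Dict, List


def crawl_work_unit(urls: List[str], fetch: Callable[[str], str]) -> bytes:
    """Crawler side: fetch every URL in the work unit, return one compressed blob.

    `fetch` is a hypothetical callback returning the page text for a URL.
    A real crawler would issue HTTP requests, respect robots.txt, and so on.
    """
    pages: Dict[str, str] = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception:
            # In this sketch a URL that cannot be fetched is simply skipped.
            continue
    return zlib.compress(json.dumps(pages).encode("utf-8"))


def unpack_work_unit(blob: bytes) -> Dict[str, str]:
    """Base side: decompress the pages a crawler sent back."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Usage with an in-memory stand-in for the web (assumed data, no network access).
fake_web = {
    "http://example.com/a": "Birmingham Perl Mongers",
    "http://example.com/b": "Birmingham City",
}
blob = crawl_work_unit(list(fake_web) + ["http://example.com/missing"], fake_web.__getitem__)
received = unpack_work_unit(blob)
```

Sending one compressed blob per work unit, rather than one message per page, keeps the number of round trips to the base low, which matters when crawlers run on volunteers' connections.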

  7. Crawler screenshot 1

  8. Crawler screenshot 2

  9. Crawler screenshot 3

  10. Crawler screenshot 4

  11. Crawler screenshot 5

  12. Current Stats Source: http://www.majestic12.co.uk/projects/dsearch/stats.php as of 27/07/05

  13. Indexing Indexing is the process of turning words into numbers and creating an inverted index.
      Lexicon (maps words to their numeric WordIDs):
        Birmingham – 0
        Perl – 1
        Mongers – 2
        City – 3
      Documents:
        Doc #0: Birmingham Perl Mongers
        Doc #1: Birmingham City
        Doc #2: Perl City
      Inverted index (each WordID has a list of (ideally sorted) DocIDs), data barrel:
        0 -> 0, 1
        1 -> 0, 2
        2 -> 0
        3 -> 1, 2
      Note: if you use a database then it makes sense to have a clustered index on WordID
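A minimal Python sketch of the indexing step, using the three example documents from this slide. The function name `build_index` and the whitespace tokenisation are assumptions for illustration; the real indexer is written in C# and certainly does more (case folding, HTML stripping, word positions, and so on).

```python
from typing import Dict, List, Tuple


def build_index(docs: List[str]) -> Tuple[Dict[str, int], Dict[int, List[int]]]:
    """Return (lexicon, inverted index) for a list of documents.

    The lexicon maps each word to a numeric WordID, assigned in order of
    first appearance. The inverted index maps each WordID to a sorted list
    of the DocIDs that contain the word.
    """
    lexicon: Dict[str, int] = {}
    inverted: Dict[int, List[int]] = {}
    for doc_id, text in enumerate(docs):
        for word in text.split():
            # New words get the next free WordID.
            word_id = lexicon.setdefault(word, len(lexicon))
            postings = inverted.setdefault(word_id, [])
            # DocIDs arrive in increasing order, so appending keeps the list
            # sorted; the check avoids duplicates when a word repeats.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return lexicon, inverted


docs = ["Birmingham Perl Mongers", "Birmingham City", "Perl City"]
lexicon, inverted = build_index(docs)
print(lexicon)   # {'Birmingham': 0, 'Perl': 1, 'Mongers': 2, 'City': 3}
print(inverted)  # {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}
```

Because documents are processed in DocID order, each posting list comes out sorted without an explicit sort, which is what makes the later intersection step cheap.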

  14. Merging Individual indexed barrels -> single searchable index. Note: this stage is not necessary if just one barrel is used, as there will be no need to remap all IDs from local to their global equivalents.
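The remapping of local IDs to global ones can be sketched as below. This is an assumed, simplified model in Python: each barrel is taken to be a tuple of (local lexicon, local inverted index, number of documents), and global DocIDs are formed by adding a running offset. The names `Barrel` and `merge_barrels` are hypothetical, and the slides do not say how the real merger assigns global IDs.

```python
from typing import Dict, List, Tuple

# Assumed barrel layout: (local lexicon, local inverted index, document count).
Barrel = Tuple[Dict[str, int], Dict[int, List[int]], int]


def merge_barrels(barrels: List[Barrel]) -> Tuple[Dict[str, int], Dict[int, List[int]]]:
    """Merge per-barrel indexes into one global lexicon and inverted index."""
    global_lexicon: Dict[str, int] = {}
    global_index: Dict[int, List[int]] = {}
    doc_offset = 0  # first global DocID available to the current barrel
    for lexicon, inverted, doc_count in barrels:
        for word, local_word_id in lexicon.items():
            # Remap the local WordID to its global equivalent.
            global_word_id = global_lexicon.setdefault(word, len(global_lexicon))
            postings = global_index.setdefault(global_word_id, [])
            # Remap local DocIDs; lists stay sorted because offsets only grow.
            postings.extend(doc_offset + d for d in inverted.get(local_word_id, []))
        doc_offset += doc_count
    return global_lexicon, global_index


# Barrel A holds one document, barrel B holds two; both start numbering at 0.
barrel_a: Barrel = ({"Birmingham": 0, "Perl": 1, "Mongers": 2}, {0: [0], 1: [0], 2: [0]}, 1)
barrel_b: Barrel = ({"Birmingham": 0, "City": 1, "Perl": 2}, {0: [0], 1: [0, 1], 2: [1]}, 2)
merged_lexicon, merged_index = merge_barrels([barrel_a, barrel_b])
print(merged_lexicon)  # {'Birmingham': 0, 'Perl': 1, 'Mongers': 2, 'City': 3}
print(merged_index)    # {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}
```

Note that "Perl" has local WordID 1 in one barrel and 2 in the other, which is exactly why the remapping is needed once more than one barrel exists.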

  15. Searching Searching is the process of finding documents that contain the words from a search query.
      Search query: “Birmingham Perl”
      Lexicon (maps words to their numeric WordIDs):
        Birmingham – 0
        Perl – 1
        Mongers – 2
        City – 3
      WordIDs for the query: 0, 1
      Inverted index (lists DocIDs for each WordID):
        0 -> 0, 1
        1 -> 0, 2
        2 -> 0
        3 -> 1, 2
      Result: the intersection of the DocIDs present in both lists (an implementation of boolean AND logic)
      Documents:
        Doc #0: Birmingham Perl Mongers
        Doc #1: Birmingham City
        Doc #2: Perl City
      Note: if you use a database then it makes sense to cluster on WordID
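The boolean AND lookup on this slide can be sketched in Python as follows, reusing the slide's lexicon and inverted index as literal data. The function names `intersect_sorted` and `search` are assumptions for illustration, and the linear merge shown is one common way to intersect sorted lists, not necessarily what the engine does.

```python
from typing import Dict, List


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted DocID lists in a single linear pass."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(query: str, lexicon: Dict[str, int], inverted: Dict[int, List[int]]) -> List[int]:
    """Return DocIDs that contain every word of the query (boolean AND)."""
    result = None
    for word in query.split():
        word_id = lexicon.get(word)
        if word_id is None:
            return []  # a word missing from the lexicon matches no document
        postings = inverted.get(word_id, [])
        result = list(postings) if result is None else intersect_sorted(result, postings)
    return result or []


lexicon = {"Birmingham": 0, "Perl": 1, "Mongers": 2, "City": 3}
inverted = {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}
print(search("Birmingham Perl", lexicon, inverted))  # [0]
```

Only Doc #0 contains both words, so the query returns `[0]`. Keeping the DocID lists sorted is what allows the intersection to run in time proportional to the combined length of the two lists.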

  16. Search engine screenshot 1

  17. Search engine screenshot 2

  18. Implementation • Microsoft .NET C# ported to Linux using Mono (http://www.mono-project.com) • ~90k lines of code (minimal copy/paste) written from scratch • Low level of dependencies (SharpZipLib/SQLite/NPlot)

  19. Why not Perl? (using C# instead) • Not strong in the GUI department • Hard to deal with multi-threading and asynchronous sockets • OOP is more of a hack • Lax compile-time checks due to not being strictly typed • Fear of performance bottlenecks forcing a move to C++ • Hard to profile for performance analysis • Managed memory lacks support for pointers (?) • Poor exception handling • I wanted something new :)

  20. Conclusions • Still a work in progress, but some conclusions can be made already: • The inverted indexing approach helps to achieve fast searches • It's tough to build one: don’t try if you ain’t going to see it through! • The crawler is one tough piece of code: 6 months vs 2 months on searching • .NET C# is a decent language suitable for heavy-duty tasks like this

  21. Credits • R&D: Alex Chudnovsky <alexc@majestic12.co.uk> • Pioneers*: FiddleAbout, dazza12, lazytom, Mordac, linuxbren, Cyber911, www.vanginkel.info, Vari, ASB, SEOBy.org, arni, japonicus, webstek.info | Pimpel, DimPrawn, Zyron, partys-bei-uns.de, jake, bull at webmasterworld, nada, dodgy4, sri-heinz * Volunteers running the crawler who had crawled at least 1 mln URLs as of 27/07/05

  22. Recommended reading • “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page of Google (http://www-db.stanford.edu/~backrub/google.html) • “Managing Gigabytes”, Ian H. Witten et al., ISBN 1-55860-570-3

  23. Join! • Join the project (unmetered broadband required!): majestic12.co.uk Your name could be here!
