
A Characterization of the Portuguese Web



  1. A Characterization of the Portuguese Web Daniel Gomes and Mário J. Silva University of Lisbon http://xldb.fc.ul.pt

  2. Presentation • Introduction • Setup • Statistics • Conclusions • Future Work

  3. Terminology • Document: file resulting from a successful HTTP download • Publisher: entity responsible for publishing the document on the Web • Web site: collection of documents referenced by URLs that share the same host name
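As a rough illustration of the "Web site" definition above, the sketch below groups URLs by host name. The function name `group_by_site` and the `about.html` path are hypothetical; this is not code from the Viúva Negra crawler.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_site(urls):
    """Group URLs into web sites, keyed by (lowercased) host name."""
    sites = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname  # None when the URL has no host part
        if host:
            sites[host].append(url)
    return dict(sites)


# Illustrative input only; the second path is made up.
sample = [
    "http://www.tumba.pt/",
    "http://www.tumba.pt/about.html",
    "http://xldb.fc.ul.pt/",
]
sites = group_by_site(sample)
```

Under this definition, two URLs that differ only in host name (for example with and without a `www.` prefix) count as different sites.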

  4. Why Characterize? • Extraction of cultural, commercial and social aspects: • Presence of natural languages • Most popular web servers • Adequate design and tuning of web applications: • The web is described through its characterization. • Parameters of the Web graph • How many nodes compose the graph • Types of these nodes

  5. Characterizing the WWW vs. Community Webs • WWW: • Huge sampling is a “must” • The WWW is not uniform • Small partitions are ignored • Community Webs: • + Relevant to a certain community • + Fewer resources • + A complete scan is possible, no sampling! • - Difficult to establish boundaries

  6. WWW.TUMBA.PT Publicly available: • Characterize • Search Almost: • Archive • The Portuguese Web

  7. Main objectives: • Estimate the resources needed to create a web archive of the Portuguese Web; • Validate crawls; • Gather guidelines to improve the systems (crawling, repository, index).

  8. Characterization Setup • Viúva Negra Crawlers: gather information from the Web and insert it into Versus. • Versus: keeps documents in files and meta-data in relations. • Web statistics are produced by issuing SQL queries to the Versus Repository.
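To make the "statistics via SQL queries" idea concrete, here is a minimal sketch using Python's built-in `sqlite3`. The `documents` table, its columns, and the sample rows are assumptions made for illustration; the slides do not describe the actual Versus schema.

```python
import sqlite3

# Hypothetical meta-data relation; NOT the real Versus schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "url TEXT PRIMARY KEY, host TEXT, http_status INTEGER, size_bytes INTEGER)"
)
rows = [
    ("http://a.example.pt/", "a.example.pt", 200, 1200),
    ("http://a.example.pt/x", "a.example.pt", 404, 0),
    ("http://b.example.pt/", "b.example.pt", 200, 5300),
    ("http://b.example.pt/y", "b.example.pt", 200, 800),
]
conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?)", rows)


def status_share(conn):
    """Percentage of crawled URLs per HTTP status code."""
    total = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
    query = "SELECT http_status, COUNT(*) FROM documents GROUP BY http_status"
    return {status: 100.0 * n / total for status, n in conn.execute(query)}


def docs_per_site(conn):
    """Number of successfully downloaded documents per host name."""
    query = (
        "SELECT host, COUNT(*) FROM documents "
        "WHERE http_status = 200 GROUP BY host"
    )
    return dict(conn.execute(query))
```

Keeping meta-data in relations means each new statistic is one more query rather than another pass over the stored files.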

  9. What is the Portuguese Web? • Set of documents of cultural and sociological interest to the Portuguese people. • Language • Brazilian/Portuguese community web sites • Both written in Portuguese • TLDs • Many sites hosted in gTLDs.

  10. Crawler configuration • Influences statistics • The depth of the crawl influences the number of documents gathered • Replication • Mirrors • URL Aliases • Crawl as many documents as possible • Maintain robustness against pathological situations

  11. VN Configuration Parameters • Text documents (list of selected MIME types) • Hosted under “.PT” • Hosted under “.COM”, “.NET”, “.ORG”, “.TV”: • Written in Portuguese • Host site had at least one incoming link originating under “.PT” • Download timeout = 60 s • Max size = 2 MB • Avoid traps: • max docs per site = 8000 • crawl the same document at most 50 times
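The parameters above can be read as a crawling policy. The sketch below encodes one possible reading: a host under ".PT" is always in scope, while a host under the listed gTLDs is in scope only if it is written in Portuguese and has an incoming link from ".PT". That reading, the class and function names, and the interpretation of 2 MB as 2 × 1024 × 1024 bytes are all assumptions, not the crawler's actual implementation.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class CrawlPolicy:
    timeout_s: int = 60                        # download timeout
    max_size_bytes: int = 2 * 1024 * 1024      # assumed meaning of "2 MB"
    max_docs_per_site: int = 8000              # trap avoidance
    max_fetches_per_doc: int = 50              # trap avoidance
    other_tlds: tuple = ("com", "net", "org", "tv")


def host_in_scope(policy, url, is_portuguese, has_pt_inlink):
    """Decide whether a URL's host belongs to the crawl (assumed reading)."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if tld == "pt":
        return True
    return tld in policy.other_tlds and is_portuguese and has_pt_inlink


def within_limits(policy, size_bytes, docs_from_site):
    """Check the size limit and the per-site document cap."""
    return (size_bytes <= policy.max_size_bytes
            and docs_from_site < policy.max_docs_per_site)


policy = CrawlPolicy()
```

Language detection and link provenance are passed in as booleans here because the slides do not say how the crawler determines them.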

  12. Collected Statistics • 4 million URLs and 78 GB. • 83% successfully downloaded (200) • 3.4% not found (404) • 1.2% took more than 1 minute to download • 0.5% bigger than 2 MB

  13. Site statistics • Sites per TLD • Documents per Site

  14. Language Distribution (.pt only)

  15. Size Distribution

  16. Other Statistics • Average length of a URL is 62 chars • Unknown Last-Modified date: 53% • HTML: 95% • 78 GB of data produced 8.7 GB of text • Meta-tags are scarce (description 17%, keywords 18%) • 15.5% Replication
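Two of these figures lend themselves to a small worked example. The slides do not say how replication was measured; the sketch below assumes it means byte-identical documents, detected by hashing their contents, and the function name `replication_rate` is hypothetical. The second computation simply divides the two volumes quoted on the slide.

```python
import hashlib


def replication_rate(bodies):
    """Fraction of documents whose bytes duplicate an earlier document.

    Assumption: "replication" means exact byte-level duplicates.
    """
    seen = set()
    duplicates = 0
    total = 0
    for body in bodies:
        total += 1
        digest = hashlib.sha1(body).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / total if total else 0.0


# Share of extracted text in the crawled data, from the slide's figures.
text_share = 8.7 / 78  # roughly 11% of the downloaded bytes
```

An exact-hash approach of this kind would miss near-duplicates, such as pages that differ only in a timestamp.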

  17. http://wealth.com.sapo.pt/gui/flat.swf?exbackground=993333&makenavfield0=HitHarvester&makenavfield10=ClickSilo&makenavfield11=BraStart&makenavfield12=AskMiky&makenavfield13=TrafficG&makenavfield14=Click4u&makenavfield1=YesMoreHits&makenavfield2=ClickityCash&makenavfield3=StartFrenzy&makenavfield4=NoMoreHits&makenavfield5=ILoveClicks&makenavfield6=ClixSwap&makenavfield7=EZHits4U&makenavfield8=HitSense&makenavfield9=Clickthru&makenavurl0=http://www.hitharvester.com/referral.asp?ref=kurtz53&makenavurl10=http://www.clicksilo.com/referrals/info.asp?Agent=kurtz53&makenavurl11=http://www.brastart.com/cgi-bin/join.cgi?r=kurtz53&makenavurl12=http://www.askmiky.com/home/signup.php?ref=kurtz53&makenavurl13=http://www.trafficg.com/home.php?member=kurtz53&makenavurl14=http://www.clicks4u.com/X92433/&makenavurl1=http://www.yesmorehits.com/cgi-bin/join.cgi?r=kurtz53&makenavurl2=http://www.clickitycash.com/cgi-bin/join.cgi?refer=52786&makenavurl3=http://www.startfrenzy.com/default.asp?userid=kurtz53&makenavurl4=http://www.nomorehits.com/cgi-bin/start.cgi?referrer=kurtz53&makenavurl5=http://www.iloveclicks.com/signup.asp?referrer=22014&makenavurl6=http://www.clixswap.com/?ref=csa12481&makenavurl7=http://www.ezhits4u.com/index.asp?ref=kurtz53&makenavurl8=http://www.hitsense.com/refer.php?ref=kurtz53&makenavurl9=http://www.clickthru.com/referral?ref=280693&tarframe=_blank

  18. Other Statistics • Average length of a URL is 62 chars • Unknown Last-Modified date: 53% • HTML: 95% • 78 GB of data produced 8.7 GB of text • Meta-tags are scarce (description 17%, keywords 18%) • 15.5% Replication

  19. Conclusions • Defined the Portuguese Web as a crawling policy. • Characterization cannot be dissociated from crawling technology. • A search engine repository is a source of interesting statistics. • Statistics are an important tool for validating and designing web applications.

  20. Future Work • Study the linkage structure • Crawl other document types, such as PostScript • Improve the algorithm used to find Portuguese web sites outside the .PT domain • Study the evolution of the Portuguese Web

  21. Thank you for your attention. daniel@tumba.pt http://xldb.fc.ul.pt http://www.tumba.pt
