Capturing the web: the Swedish experience (www.kb.se/kw3)
Kulturarw³

rigel-cervantes
Presentation Transcript

  1. Capturing the web: The Swedish experience • www.kb.se/kw3 • Kulturarw³

  2. Content • The Archive • priorities • storage • what we save • Development • IIPC • Tools, format • conclusion • Background • Kulturarw3 • goals • strategy • Sweden on the net? • Harvesting • Software • Finding links • problem • Statistics • What have we got?

  3. Background • Legal deposit, 1661 • Latest revision 1993 • Only electronic documents in fixed form • CD-ROM, diskettes • New law • July 1st, 2002: exemption from the personal privacy law • First Swedish web newspaper lost • Printed newspapers since 1645 • Kulturarw3 started 1996 • Still waiting for new legal deposit law

  4. Goals • All web pages in Sweden • pictures, video etc. • .se and other top-level domains • Electronic journals

  5. Strategy: two choices • Select what is important: How to know what will be considered important in the future? Labour-intensive. • Everything, using automatic software: Gets everything (well, not really). Less labour-intensive.

  6. Strategy • Take snapshots of the Swedish web a few times each year • Gets “all” • Needs less labour • Computer memory is cheap • However, large volumes make quality control difficult • Selective harvesting: about 150 newspapers every day • In the future: events, e.g. elections, with as little human intervention as possible.

  7. Sweden on the web? http://www.kb.se/kbstart.htm Only the domain part is relevant • .se • .nu (Niue), popular in Sweden: “nu” means “now” in Swedish • Others if the server is geographically located in Sweden • Language?
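
The scoping rule on this slide can be sketched in a few lines of Python. This is only an illustration, not the rule set Kulturarw3 actually runs: the function name `in_scope` and the `hosts_located_in_sweden` set are hypothetical, and the geographic check is reduced to a lookup in a set the caller supplies.

```python
from urllib.parse import urlsplit

# Assumption: the suffix list mirrors the slide (.se and .nu only).
SWEDISH_SUFFIXES = (".se", ".nu")


def in_scope(url, hosts_located_in_sweden=frozenset()):
    """Return True if the URL's host counts as "Swedish" under the slide's rules.

    Only the domain part of the URL is inspected; the path is ignored.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host.endswith(SWEDISH_SUFFIXES):
        return True
    # Hypothetical stand-in for "server is geographically located in Sweden".
    return host in hosts_located_in_sweden
```

With this sketch, `in_scope("http://www.kb.se/kbstart.htm")` is accepted because of the `.se` host, while a URL such as `http://example.com/page.se` is rejected because the `.se` appears only in the path.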

  8. Harvesting software • A harvester (crawler, spider) collects web pages by automatically following links and saving pages • Open-source harvester: Heritrix • Main developer: Internet Archive (IA) • Written in Java. Active community. • Designed for archiving, not indexing. • Earlier: modified version of Combine • From NetLab, Lund University. • Important! Indexing isn't archiving and archiving isn't indexing! • Also collects pictures, sound etc.
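
To make "automatically following links" concrete, here is a minimal Python sketch of the link-finding step, using only the standard library. It is not how Heritrix is implemented (Heritrix is written in Java); the names `LinkCollector` and `extract_links` are invented for this example. It looks at `href` and `src` attributes so that inline pictures are picked up as well as ordinary links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs found in href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the URLs a harvester would queue after saving this page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A harvester would save the fetched page first and then feed each returned URL back into its queue, subject to its scope rules.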

  9. Problems • …or challenges if you are an optimist… • Scripts • Interactive pages • Password-protected pages • Video/streaming material • Social sites

  10. Statistics – what did we get? Bulk crawls (everything Swedish) • First sweep, 1997, .se only: 6.8 million files, 160 GB of data • A sweep in 2007-2008, .se and other TLDs: 270 million files, 11500 GB of data
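
The two sweeps can be compared by average file size. The short Python calculation below uses only the figures on this slide; the assumption that 1 GB means 1 000 000 KB (decimal units) is ours, as the slide does not say which convention was used.

```python
def average_file_size_kb(total_gb, file_count):
    """Average size per file in KB, assuming decimal units (1 GB = 1,000,000 KB)."""
    return total_gb * 1_000_000 / file_count


sweep_1997 = average_file_size_kb(160, 6_800_000)        # about 23.5 KB per file
sweep_2007 = average_file_size_kb(11_500, 270_000_000)   # about 42.6 KB per file
```

Under that assumption the number of files grew roughly 40-fold between the two sweeps, while the average file nearly doubled in size.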

  11. Statistics – what did we get? • Periodicals (newspapers) • Started June 2002 • 88 million URLs • 4.0 TB • About 40 000 URLs every day

  12. More statistics Bulk (everything Swedish) • 823 100 web servers (including inlines) • 651 700 “Swedish” - .se 50% - .nu 21% - others 29% • 1549 different MIME types found. • HTML about 50% • text/html + image/gif + image/jpeg + application/pdf + text/plain about 97% of the documents. • A lot of garbage, misspellings etc.

  13. Trends • HTML: stable, 50-60%, increasing lately • JPEG: increasing, 11% (1997), 27% (2005) • GIF: decreasing, 23% (1997), 11% (2005) • PDF: increasing, from 9th to 4th position

  14. Accessing the archive • First priority is to access the archive using traditional web technologies • Surf in “space” and time • Free-text search • NB: not using traditional library methods such as cataloguing
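
Surfing "in time" implies that, for a requested URL and date, the access tool picks the stored capture closest to that date. The Python sketch below shows one way to do that lookup. It is an assumption about how such a lookup could work, not a description of the archive's access software, and `closest_capture` is a hypothetical name.

```python
from bisect import bisect_left
from datetime import date


def closest_capture(capture_dates, wanted):
    """Return the capture date nearest to `wanted`, or None if there are none.

    `capture_dates` must be sorted in ascending order.
    """
    if not capture_dates:
        return None
    i = bisect_left(capture_dates, wanted)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = capture_dates[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda d: abs(d - wanted))
```

Because snapshots are taken only a few times a year, the nearest capture may be months away from the requested date, which is worth showing to the user on a time line.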

  15. Development • International Internet Preservation Consortium (IIPC) • Started by the Internet Archive and the national libraries of Sweden, Norway, Finland, Denmark, Iceland, UK, France, Italy, Canada, Australia and USA (LoC). Now many more. • Develop common standards, tools and methods for web archiving. • Raise awareness

  16. Development, standards • Archiving formats • Earlier formats • MIME (Multipurpose Internet Mail Extensions) • ARC • NedLib • WARC (Web ARChive file format) • File format for saving web material: each web page is one record in a WARC file; a record contains metadata and content • ISO 28500.
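
The idea that "a record contains metadata and content" can be illustrated with a hand-built record in Python. This is a simplified sketch based on our reading of the WARC 1.0 layout (a version line, named header fields, a blank line, the content block, then two line breaks); the header set should be checked against ISO 28500 before any real use, and a production archive would rely on Heritrix or a dedicated WARC library. The function name `build_warc_record` is hypothetical.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, content, content_type="text/html"):
    """Build one simplified WARC 'resource' record as bytes.

    `content` is the raw bytes of the saved page; everything before it
    is metadata describing the capture.
    """
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    captured = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: {record_id}",
        f"WARC-Date: {captured}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(content)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Two CRLF pairs close the record so that records can be concatenated.
    return head + content + b"\r\n\r\n"
```

Appending the bytes returned for each saved page to one file gives a file of concatenated records, which is the "one record per web page" structure the slide describes.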

  17. Development, Tools • Tools • Harvesting: Heritrix • Designed for archiving (NOT a modified indexer) • Open source: Java, Linux etc. • Supported by IIPC • Mainly developed by Internet Archive, with contributions • WARC support planned (in progress). Supports ARC and MIME • Surfing tools • New Wayback Machine • WERA - surf with time line • WAXToolbar - support when using the new Wayback Machine • NutchWax • Free-text search (with time line) • Curator tool • Possible for a non-technician to do collection and quality control

  18. Advice • Use open standards, open source → IIPC • Get users for the archive • Think big: hundreds of terabytes, billions of files • Accept that what you do is a best effort

  19. Conclusion • The web is constantly changing → continuous development. • Possible to get a reasonable picture of the web. But never complete! • Do something now

  20. Questions? Comments?

  21. Links • IIPC: www.netpreserve.org • Kulturarw3: www.kb.se/kw3 • Internet Archive: www.archive.org