WebInfoMall: the Chinese Web Archive

How we got started and how it is now

Huang Lianen and Li Xiaoming

Peking University, China

Digital Archive Workshop

August 27, 2007, Xian, China


Outline

  • Motivation, developed in 2001

    • In 2001, I could not give an answer when someone asked me what had been on the Chinese web in 1996.

    • In 2100, I’d like to be able to answer concretely if someone asks me what was on the Chinese web in 2001.

  • Archiving technology

    • What technology should be used for long-term web crawling and storage, especially in a university lab environment?

  • Exhibition of the archive

    • How do we show the archive to society?


On the elapsing nature of Web data

  • Li Xiaoming, “On the estimation of the number of previous Chinese Web pages”, Journal of Peking University, Vol.39, No.3, May 2003, 394-398.

  • As a by-product, we also found that it takes about 0.99 years for 50% of current web pages to disappear.

Given this elapsing nature, can we archive the pages before they are gone?
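
To get a rough feel for what a 0.99-year half-life means, the sketch below estimates how many of today's pages would still be online after a given time. It assumes, purely for illustration, that pages disappear at a constant rate (exponential decay); the cited paper may use a different model, and the function name is hypothetical.

```python
HALF_LIFE_YEARS = 0.99  # from the slide: time for 50% of current pages to disappear


def surviving_fraction(years: float, half_life: float = HALF_LIFE_YEARS) -> float:
    """Fraction of today's pages still online after `years`.

    Assumption (not from the slides): constant-rate, exponential decay.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return 0.5 ** (years / half_life)


if __name__ == "__main__":
    for years in (0.99, 2, 5):
        print(f"after {years} years: {surviving_fraction(years):.1%} of pages remain")
```

Under this assumption only a small share of pages would survive five years, which is why crawling has to start early and run continuously.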


We have some advantage

With a search engine, 50% is done!

The system work started in 2001


The progress and current status

  • The crawl started in 2001, and the first batch of data was put online on Jan 18, 2002.

  • As of today, the repository holds over 2.5 billion distinct Chinese web pages; more precisely, pages crawled from mainland China’s web.

  • It grows by about 1 million pages every day.

  • Initially, we used tapes for storage, but changed to hard disks later.

  • Total online data volume (compressed) ≈ 30 TB, with an offline backup.

  • In spring 2002, “historical browsing” was provided; in summer 2006, “backward browsing” entered beta testing.


Example: the InfoMall interface


Example: entering www.sina.com.cn


Example: Sina on 2002.1.18

Bin Laden’s headquarters was bombed.


Links are preserved

The first air strike of the new year: the American Air Force bombed Bin Laden’s headquarters.


Links continue to be preserved


2002.10.8


2003.9.2


2004.5.28
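
Browsing one site across dates, as in the examples above, comes down to finding the right capture for a requested date. The sketch below shows one plausible policy: return the latest capture on or before the requested date. The slides do not say how WebInfoMall resolves dates; the in-memory dictionary and the function name are hypothetical, and the four dates are only the ones shown on the slides, not a full capture list.

```python
from bisect import bisect_right
from datetime import date
from typing import Dict, List, Optional

# Capture dates per URL, kept sorted. The dates are the ones shown on the
# slides for www.sina.com.cn; the dict is a stand-in for a real index.
SNAPSHOTS: Dict[str, List[date]] = {
    "www.sina.com.cn": [
        date(2002, 1, 18),
        date(2002, 10, 8),
        date(2003, 9, 2),
        date(2004, 5, 28),
    ],
}


def snapshot_on_or_before(url: str, wanted: date) -> Optional[date]:
    """Return the latest capture date of `url` that is not after `wanted`."""
    dates = SNAPSHOTS.get(url, [])
    i = bisect_right(dates, wanted)  # number of captures dated <= wanted
    return dates[i - 1] if i > 0 else None


if __name__ == "__main__":
    print(snapshot_on_or_before("www.sina.com.cn", date(2003, 1, 1)))
```

Binary search keeps the lookup fast even when a popular site has thousands of captures.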


Featured collections: SARS


Featured collections: the first manned space vehicle


We ask three questions:

  • What’s the use?

    • Preserving historical information before it’s lost

    • Opening up great opportunities for deep mining

    • Providing access to past information that is much more convenient than libraries, even where they have kept it

  • Can we do it? (or at least get a pretty good start)

    • “We”: a university lab.

  • How do we do it?


Can we do it? (resource requirements)

  • “Hard” resources

    • Crawler system: 4 computers at $5,000 each

    • Storage system: about 50 million pages per 1 TB, which amounts to $4,000. If you need a backup, double the investment.

    • Access web server: $4,000

    • Space (not big, but reliable) to house these machines

    • High-speed network connection, ? per month

  • “Soft” resources

    • Permission to crawl and keep the pages

    • A staff member to handle daily routine matters

    • Persistent enthusiasm for this undertaking
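
As a worked example of the arithmetic behind these figures, the sketch below scales the slide's rule of thumb to a given number of pages. The dollar amounts and the pages-per-terabyte ratio come from the slide. The function name, the rounding up to whole terabytes, and the reading of "double the investment" as doubling storage only are assumptions; space, network, and staff costs are left out because the slide gives no numbers for them.

```python
import math

PAGES_PER_TB = 50_000_000     # slide: about 50 million pages per 1 TB
COST_PER_TB_USD = 4_000       # slide: 1 TB of storage amounts to $4,000
CRAWLER_COST_USD = 4 * 5_000  # slide: 4 crawler computers at $5,000 each
WEB_SERVER_COST_USD = 4_000   # slide: access web server


def hardware_cost_usd(pages: int, with_backup: bool = True) -> int:
    """Rough one-time hardware cost for archiving `pages` pages.

    Assumptions (not from the slides): storage is bought in whole terabytes,
    prices stay fixed, and a backup doubles the storage cost only.
    """
    terabytes = math.ceil(pages / PAGES_PER_TB)
    storage = terabytes * COST_PER_TB_USD
    if with_backup:
        storage *= 2
    return CRAWLER_COST_USD + WEB_SERVER_COST_USD + storage


if __name__ == "__main__":
    # One year of crawling at about 1 million new pages a day.
    print(hardware_cost_usd(365_000_000))
```

For one year at about a million pages a day, this gives 8 TB of storage and a total well under $100,000, which supports the point that a university lab can afford to get started.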


How do we do it?

  • Incremental crawling

    • A scheduled daily operation that collects about one to two million new pages a day; each page’s fingerprint is compared with those of previously collected pages

  • Data storage and incorporation

    • Done once every few weeks, after enough data has been collected

  • Accessibility

    • Wayback Machine style

    • Featured exhibitions
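
The fingerprint comparison in the incremental crawl can be sketched as follows. This is a minimal illustration, not the WebInfoMall implementation: the slides do not name a fingerprint function, so SHA-1 over the raw page bytes is an assumption, as are the function names and the in-memory set standing in for the store of previous fingerprints. The sketch compares content globally; whether the real system compares per URL is not stated.

```python
import hashlib
from typing import Iterable, Iterator, Set, Tuple


def fingerprint(content: bytes) -> str:
    """Content fingerprint. SHA-1 is an assumption; the slides name no hash."""
    return hashlib.sha1(content).hexdigest()


def new_pages(
    crawled: Iterable[Tuple[str, bytes]], seen: Set[str]
) -> Iterator[Tuple[str, bytes]]:
    """Yield only (url, content) pairs whose content was not stored before.

    `seen` holds fingerprints of previously stored pages; it is updated in place.
    """
    for url, content in crawled:
        fp = fingerprint(content)
        if fp in seen:
            continue  # identical content already archived; skip it
        seen.add(fp)
        yield url, content


if __name__ == "__main__":
    seen: Set[str] = set()
    day1 = [("http://a.example/", b"hello"), ("http://b.example/", b"world")]
    day2 = [("http://a.example/", b"hello"), ("http://b.example/", b"world, updated")]
    print([url for url, _ in new_pages(day1, seen)])
    print([url for url, _ in new_pages(day2, seen)])
```

On the second day only the changed page is kept, so the repository grows with new content instead of with repeated copies.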


WebInfoMall: hierarchical module data organization

  • Ensures scalability and dynamic reconfigurability

    • Convenient for coping with changes at all levels

record : file : batch : disk : node : system

The logical data organization matches the physical device structure as closely as possible
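
One way to picture the record : file : batch : disk : node : system hierarchy is as an address with one field per level. The sketch below is illustrative only: the level names come from the slide, while the class, the field types, the path layout, and the example identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordAddress:
    """Location of one record within the system, one field per slide level."""

    node: str    # machine within the system
    disk: str    # disk attached to that node
    batch: str   # batch of crawled data stored on that disk
    file: str    # file within the batch
    record: int  # position of the record in the file (assumed to be an index)

    def path(self) -> str:
        """Directory-style path to the file; the layout is hypothetical."""
        return f"/{self.node}/{self.disk}/{self.batch}/{self.file}"

    def moved_to(self, node: str, disk: str) -> "RecordAddress":
        """Address of the same record after its batch moves to another disk."""
        return RecordAddress(node, disk, self.batch, self.file, self.record)


if __name__ == "__main__":
    addr = RecordAddress("node03", "disk2", "batch-2002-01", "part-0007.dat", 41)
    print(addr.path())
    print(addr.moved_to("node05", "disk1").path())
```

Because only the upper levels change when a batch is moved, the lower levels of the address stay valid, which is one reading of how such a layout copes with changes at each level.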


The architecture


The operations under the hood


Comparison

A survey done by the National Library of China

WebInfoMall is the only large-scale web archive in China, and it is operated in a university lab!

In the flattened world, “small can act big!”


Resource sharing

  • We have published the data storage format

  • We provide WebInfoMall data to the research community for free.

    • The beneficiary research units include Peking University, Tsinghua University, Chinese Academy of Sciences, Shanghai Jiaotong University, Renmin University of China, Harbin Institute of Technology, ...

  • In particular, we built the largest Chinese Web Test collection (CWT200g), with 200 GB of compressed web pages, for evaluating Chinese web information retrieval technologies


Summary

  • WebInfoMall (http://www.infomall.cn) has been the Chinese web archive since 2001, with over 2.5 billion pages in its repository as of 2007.

  • Straightforward technology has been used to build WebInfoMall

    • Linux box + Berkeley DB + hierarchical module data organization

  • We are looking into different ways of accessing the data to get value beyond information preservation and history browsing


Thanks for your attention

  • [email protected]

