
Finding Information on the web

Srinivasan Seshadri

CTO Kosmix

Early Internet (1992 – 1994)

  • Mozilla Browser

  • People linked to others' home pages and other interesting pages

  • People really browsed

Internet (1995 – 2002)

  • Search - AltaVista, Lycos

  • Google

    • Used Hyperlink Graph Structure to Rank Results

Internet Now

  • Kosmix bringing back joys of browsing and exploring

  • 360 degree view of any topic

    • Topic Home page (why not a topic )

  • Top Informational Sites for a topic and a preview (snippets) are the results!

Information Types

  • Factual Information (Wiki etc.)

  • Videos

  • Images

  • Forum Discussions

  • Question and Answers

  • News

  • Blogs

  • Structured Information

Future of Search

  • First step towards providing multiple pivot points for a topic or search

  • Need to make this conversational and stateful, like talking to an expert on the topic

Transient Intent and Persistent Intent


  • Transient intent

    • Searching for a needle in the haystack

    • Exploring the haystack for a topic

  • Persistent intent

    • Interested in the topic for a long time

      • Carnatic Music, Indian Cricket, Internet Industry, Venture Capital

Information

Deliver information to the consumer

what they want

when they want

how they want

where they want

Personalized Newspaper

  • My World is Changing

  • Cannot keep track of it

  • Can my world come to me?

Media Industry and Internet

  • Huge pressure on newspapers

    • Ad spending moving online

  • More and more content online

    • Reputed journalists have their own blogs

  • Content production, aggregation, and distribution are becoming disaggregated

  • Vanilla online newspaper does not exploit what the internet enables

    • Ability to personalize to nano interests

    • Publish a personalized newspaper for everyone any time

Key Technology Ingredients

  • Cloud Computing

  • Categorization

  • Relevance

Cloud Computing at Kosmix

  • Storage:

    • Biggest productivity boost at Kosmix in the first year

      • Getting machines to be remotely rebooted!

    • KFS (Kosmix File System) further lowered the time to make data accessible after machine failures

  • Computation:

    • Long Running Computations need to be broken into small restartable/replayable components
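A minimal sketch of the restartable idea, assuming a job can be expressed as a list of independent items and that progress can be recorded in a local JSON file. The function name `run_restartable` and the checkpoint format are illustrative assumptions, not a description of the Kosmix system.

```python
import json
import os

def run_restartable(items, process, checkpoint_path):
    """Process items in order, saving progress after each one.

    If the run is interrupted, calling this again with the same
    checkpoint_path resumes after the last completed item.
    """
    done, results = 0, []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
        done, results = state["done"], state["results"]

    for i in range(done, len(items)):
        results.append(process(items[i]))
        # Write to a temporary file and rename it, so a crash
        # never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"done": i + 1, "results": results}, f)
        os.replace(tmp_path, checkpoint_path)
    return results
```

After a failure, rerunning the same call repeats only the item that was in flight. Items completed before the failure are read back from the checkpoint.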

Cloud Computing at Kosmix

  • Computation Templates:

    • Most of the computation could be expressed as some variant of a single table scan followed by an aggregate operation (group by), which Google calls MapReduce

    • MapReduce is not friendly enough for non-programmers

    • SQL is not powerful enough in many situations

    • Need a nice scripting language
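The "table scan plus group by" template can be sketched in a few lines on a single machine. This is an in-memory illustration only. The `map_reduce` helper, the `pages` table, and its field names are assumptions made for the example, and a real system would distribute the scan and the grouping across machines.

```python
from collections import defaultdict

def map_reduce(rows, mapper, reducer):
    """Scan rows once, group the mapper's output by key, then aggregate."""
    groups = defaultdict(list)
    for row in rows:                       # the single table scan
        for key, value in mapper(row):
            groups[key].append(value)      # the "group by"
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical table of crawled pages.
pages = [
    {"host": "a.example", "words": 120},
    {"host": "b.example", "words": 300},
    {"host": "a.example", "words": 80},
]

def emit_host_words(row):
    yield row["host"], row["words"]

def total(key, values):
    return sum(values)

words_per_host = map_reduce(pages, emit_host_words, total)
```

In SQL terms this example is `SELECT host, SUM(words) FROM pages GROUP BY host`. The mapper and reducer can be arbitrary functions, which is where the template goes beyond plain SQL.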

Opportunity

  • Many companies are trying to provide interesting web services

    • A gold mine of information on the web that can be used by companies

    • Impractical for each of the companies to build a huge web-scale support system (crawling, indexing, KFS, MapReduce, etc.)

    • Further, most companies want only slivers of the web (typically category-based slivers such as health forums or travel news sites)

    • The web and all the information derived from it is perhaps the biggest database. Can someone make it accessible and easy to use, through a pay-as-you-go model or perhaps a non-profit (academic?) effort?

Categorization

  • Concept space: the space in which all connections are made within Kosmix

  • Documents, queries, external modules, advertisements, and people are all mapped to points in this space and matched

    • Internet Industry, Venture Capital documents need to be mapped to these categories even if they don’t contain the original words
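One way to picture matching in a concept space is to represent each item as a sparse vector of concept weights and compare vectors by cosine similarity. The sketch below is a toy under loud assumptions: the `TERM_TO_CONCEPTS` table is hand-written and hypothetical, and the slides do not say how Kosmix actually maps text to concepts or which similarity measure it uses.

```python
import math
from collections import Counter

# Hypothetical hand-written mapping from terms to weighted concepts.
TERM_TO_CONCEPTS = {
    "funding": {"Venture Capital": 1.0},
    "startup": {"Venture Capital": 0.6, "Internet Industry": 0.4},
    "series": {"Venture Capital": 0.5},
    "browser": {"Internet Industry": 1.0},
    "raga": {"Carnatic Music": 1.0},
}

def to_concept_vector(text):
    """Map free text to a sparse vector of concept weights."""
    vector = Counter()
    for term in text.lower().split():
        for concept, weight in TERM_TO_CONCEPTS.get(term, {}).items():
            vector[concept] += weight
    return vector

def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc = to_concept_vector("startup closes series funding")
venture_capital = Counter({"Venture Capital": 1.0})
carnatic_music = Counter({"Carnatic Music": 1.0})
```

Here `doc` lands close to the Venture Capital concept even though the text never contains the words "venture" or "capital". That is the property the slide asks for.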

Kategorization at Kosmix

  • Leverage human curated sources

    • Wiki corpus is a major source of knowledge

  • Huge Automatically Curated Taxonomy

    • 6 million concepts

  • Building a Concept Graph with relationship labels where possible

  • Use a web index to match short pieces of texts with concepts and use taxonomy to refine the matches

Relevance

  • Need to combine multiple signals into one number to enable ranking

    • Say Query Relevance Score and Page Relevance Score (text score and page rank)

    • Signals need to be made comparable

    • Normalization alone (making ranges the same) is not enough

    • Need to reconcile different distributions

    • Deviations from the mean
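One common reading of "deviations from the mean" is the z-score: express each signal as the number of standard deviations it lies from its own mean, then combine. The sketch below assumes that reading. The function names and the `text_weight` parameter are illustrative, and the slides do not state the formula actually used.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Rescale values to 'standard deviations from the mean'."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]   # a constant signal carries no ranking information
    return [(v - mu) / sigma for v in values]

def combine(text_scores, page_scores, text_weight=0.5):
    """Blend two signals into one ranking number per document."""
    zt = z_scores(text_scores)
    zp = z_scores(page_scores)
    return [text_weight * t + (1 - text_weight) * p for t, p in zip(zt, zp)]
```

Unlike min-max scaling, which only aligns the ranges, this also accounts for how spread out each signal is. A heavily skewed signal such as link popularity is then less likely to swamp the text score.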

Relevance

  • More data always beats smarter algorithms

    • Adding position information in the index greatly increases quality

    • Adding stemming saw a CTR rise of 10%

    • Adding anchors (and page rank) distinguished Google

    • Adding origin of anchors (hosts) is a much better measure of independent votes

    • Using demand-side popularity (Alexa, Quantcast) complements web popularity
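The point about anchor origins can be illustrated by counting distinct linking hosts rather than raw links. This is a small sketch under assumptions: the `anchor_votes` helper and the example URLs are made up for illustration, and a production system would likely also group hosts under the same owner.

```python
from urllib.parse import urlsplit

def anchor_votes(inlinks):
    """Count raw inlinks and distinct linking hosts for each target URL.

    inlinks is an iterable of (source_url, target_url) pairs.
    Returns {target_url: (raw_link_count, distinct_host_count)}.
    """
    raw = {}
    hosts = {}
    for source, target in inlinks:
        raw[target] = raw.get(target, 0) + 1
        hosts.setdefault(target, set()).add(urlsplit(source).hostname)
    return {target: (raw[target], len(hosts[target])) for target in raw}

inlinks = [
    ("http://a.example/p1", "http://target.example/"),
    ("http://a.example/p2", "http://target.example/"),
    ("http://a.example/p3", "http://target.example/"),
    ("http://b.example/x", "http://other.example/"),
    ("http://c.example/y", "http://other.example/"),
]
votes = anchor_votes(inlinks)
```

In this data `http://target.example/` has more raw links, but all of them come from one host. `http://other.example/` has fewer links but two independent hosts vouching for it.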

Relevance

  • What is a news story?

    • Cluster news articles

    • Use size of cluster as a measure of popularity

    • How does one do this efficiently?

      • Needs to be online since interests/queries are ad hoc

      • Need to combine some offline preclustering and online methods
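A rough sketch of the online half of this idea: a single-pass clusterer that assigns each incoming headline to the most similar existing cluster, or starts a new one, and then ranks clusters by size. Jaccard similarity on word sets, the `threshold` value, and the function names are assumptions chosen to keep the example short. The slides do not describe the actual clustering method, and the offline preclustering step is not shown.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_headlines(headlines, threshold=0.4):
    """Single-pass clustering; returns clusters sorted by size, largest first."""
    clusters = []  # each cluster: {"words": set of words, "members": list of headlines}
    for headline in headlines:
        words = set(headline.lower().split())
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(words, cluster["words"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"words": set(words), "members": [headline]})
        else:
            best["members"].append(headline)
            best["words"] |= words   # the cluster vocabulary grows as stories join
    return sorted(clusters, key=lambda c: len(c["members"]), reverse=True)

stories = cluster_headlines([
    "India wins cricket test series",
    "India wins test series against Australia",
    "Cricket India wins series",
    "Startup raises venture funding",
])
```

The size of each cluster, `len(cluster["members"])`, then serves as the popularity measure. Because each headline is compared only against cluster summaries, new articles can be added as they arrive, at the cost of results that depend on arrival order.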

Summary

  • Consumer:

    • Internet has come a long way in terms of getting information to people

    • Utopian goal of a smart, chatty expert still far away – is a great first step

    • Need good tools to keep on top of the information explosion – personalized newspaper ( is our first stab at this..

  • Technology:

    • Need to deal with large volume of data

    • Efficient Data Analysis and Annotation (e.g., Categorization)

    • Humming next-gen database system that grows incrementally, is immune to failures, and is expressive for non-programmers