Finding information on the web
Finding Information on the web. Srinivasan Seshadri CTO Kosmix. Early Internet (1992 – 1994). Mozilla Browser People linked to others home pages and other interesting pages People really browsed. INTERNET (1995 – 2002). Search - Altavista, Lycos Google

Presentation Transcript

Finding Information on the web

Srinivasan Seshadri

CTO Kosmix


Early Internet (1992 – 1994)

  • Mozilla Browser

  • People linked to others' home pages and other interesting pages

  • People really browsed


INTERNET (1995 – 2002)

  • Search - Altavista, Lycos

  • Google

    • Used Hyperlink Graph Structure to Rank Results
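
As a rough illustration of ranking by hyperlink graph structure, here is a minimal power-iteration sketch in the spirit of PageRank. The toy graph, the damping factor of 0.85, and the name `rank_pages` are assumptions made for this example; they are not details from the talk.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by link structure using simple power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the same small base score.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical toy web: three pages link to "hub", which links back to "a".
toy_links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
toy_scores = rank_pages(toy_links)
```

In this toy graph "hub" ends up with the highest score because three pages link to it, and "a" outranks "b" and "c" because the highly ranked hub links to it.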


Internet Now

  • Kosmix bringing back joys of browsing and exploring

  • 360 degree view of any topic

    • Topic Home page (why not a topic )

  • Top Informational Sites for a topic and a preview (snippets) are the results!


INFORMATION TYPES

  • Factual Information (Wiki etc.)

  • Videos

  • Images

  • Forum Discussions

  • Questions and Answers

  • News

  • Blogs

  • Structured Information


FUTURE OF SEARCH

  • First step towards providing multiple pivot points for a topic or search

  • Need to make this conversational and stateful, like talking to an expert on the topic


Transient Intent and Persistent Intent

  • TRANSIENT INTENT

    • Searching for a needle in the haystack

    • Exploring the haystack for a topic

  • PERSISTENT INTENT

    • Interested in the topic for a long time

      • Carnatic Music, Indian Cricket, Internet Industry, Venture Capital


INFORMATION

Deliver information to the consumer

what they want

when they want

how they want

where they want


PERSONALIZED NEWSPAPER

  • My World is Changing

  • Cannot keep track of it

  • Can my world come to me?


MEDIA INDUSTRY AND INTERNET

  • Huge pressure on newspapers

    • Ad spending moving online

  • More and more content online

    • Reputed journalists have their own blogs

  • Content production, aggregation, and distribution are becoming disaggregated

  • Vanilla online newspaper does not exploit what the internet enables

    • Ability to personalize to nano interests

    • Publish a personalized newspaper for everyone, any time


Key Technology Ingredients

  • Cloud Computing

  • Categorization

  • Relevance


Cloud Computing at Kosmix

  • Storage:

    • Biggest productivity boost at Kosmix in the first year

      • Getting machines to be remotely rebooted!

    • KFS (Kosmix File System) further lowered the time to make data accessible after machine failures

  • Computation:

    • Long Running Computations need to be broken into small restartable/replayable components
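
One way to picture "small restartable components" is a job that records its progress after every chunk, so a restart resumes where the last run stopped. This is a minimal sketch under assumed names (`run_in_chunks`, `flaky_sum`, a JSON checkpoint file); it is not a description of how Kosmix or KFS actually did it.

```python
import json
import os
import tempfile


def run_in_chunks(items, chunk_size, process_chunk, checkpoint_path):
    """Process items chunk by chunk, recording progress after each chunk.

    If the job dies and is restarted with the same checkpoint_path,
    chunks that already finished are skipped instead of being recomputed.
    """
    state = {"next_index": 0, "results": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            state = json.load(handle)

    while state["next_index"] < len(items):
        start = state["next_index"]
        chunk = items[start:start + chunk_size]
        state["results"].append(process_chunk(chunk))
        state["next_index"] = start + len(chunk)
        # Write to a temporary file, then rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        temp_path = checkpoint_path + ".tmp"
        with open(temp_path, "w") as handle:
            json.dump(state, handle)
        os.replace(temp_path, checkpoint_path)
    return state["results"]


# Hypothetical demo: the second chunk fails once, then the job is rerun.
calls = []


def flaky_sum(chunk):
    calls.append(list(chunk))
    if len(calls) == 2:
        raise RuntimeError("simulated machine failure")
    return sum(chunk)


checkpoint = os.path.join(tempfile.mkdtemp(), "progress.json")
data = [1, 2, 3, 4, 5, 6]
try:
    run_in_chunks(data, 2, flaky_sum, checkpoint)
except RuntimeError:
    pass  # the first run dies part-way through
totals = run_in_chunks(data, 2, flaky_sum, checkpoint)
```

The rerun repeats only the chunk that failed; the first chunk's result is read back from the checkpoint rather than recomputed.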


Cloud Computing at Kosmix

  • Computation Templates:

    • Most of the computation could be expressed as some variant of a single table scan plus an aggregate operation (group by), which Google calls MapReduce

    • MapReduce is not friendly enough for non-programmers

    • SQL is not powerful enough in many situations

    • Need a nice scripting language
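
The "single table scan plus group by" template can be sketched in a few lines. The helper name `scan_and_aggregate` and the sample rows are assumptions for illustration; real MapReduce systems distribute the same two steps across many machines.

```python
from collections import defaultdict


def scan_and_aggregate(rows, map_fn, reduce_fn):
    """One pass over rows (the scan), then one aggregate per key (the group by)."""
    groups = defaultdict(list)
    for row in rows:
        # map_fn turns one row into zero or more (key, value) pairs.
        for key, value in map_fn(row):
            groups[key].append(value)
    # reduce_fn collapses all values that share a key into one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical crawl table: (url, category) rows.
crawl_rows = [
    ("http://example.com/a", "health"),
    ("http://example.com/b", "travel"),
    ("http://example.org/c", "health"),
]

# Equivalent in spirit to: SELECT category, COUNT(*) ... GROUP BY category
pages_per_category = scan_and_aggregate(
    crawl_rows,
    map_fn=lambda row: [(row[1], 1)],
    reduce_fn=lambda key, values: sum(values),
)
```

Here `pages_per_category` comes out as `{"health": 2, "travel": 1}`. Because `map_fn` and `reduce_fn` are arbitrary functions, the template can express more than the SQL query it resembles.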


Opportunity?

  • Many companies are trying to provide interesting web services

    • The web is a gold mine of information that companies can use

    • Impractical for each company to build a huge web-scale support system (crawling, indexing, KFS, MapReduce, etc.)

    • Further, most companies want slivers of the web (typically category-based slivers: health forums, travel news sites, etc.)

    • The web and all the information derived from it is perhaps the biggest database. Can someone make this accessible and easy to use (using some pay-as-you-go model), or is there perhaps some non-profit (academia?) angle here?


Categorization

  • Concept Space: space in which all connections are made within Kosmix

  • Documents, Queries, External Modules, Advertisements, People are all mapped to points in this space and matched

    • Internet Industry, Venture Capital documents need to be mapped to these categories even if they don’t contain the original words


Kategorization at Kosmix

  • Leverage human curated sources

    • Wiki corpus is a major source of knowledge

  • Huge Automatically Curated Taxonomy

    • 6 million concepts

  • Building a Concept Graph with relationship labels where possible

  • Use a web index to match short pieces of texts with concepts and use taxonomy to refine the matches


Relevance

  • Need to combine multiple signals into one number to enable ranking

    • Say Query Relevance Score and Page Relevance Score (text score and page rank)

    • Signals need to be made comparable

    • Normalization alone (making ranges the same) is not enough

    • Need to reconcile different distributions

    • For example, compare deviations from the mean
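
One common reading of "deviations from the mean" is the z-score: express each signal as a number of standard deviations from its own mean before blending. The sketch below assumes that reading; the equal weighting, the sample scores, and the function names are illustrative and not from the talk.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # All values identical: no value deviates from the mean.
        return [0.0 for _ in values]
    return [(value - center) / spread for value in values]


def combine_signals(text_scores, page_ranks, text_weight=0.5):
    """Blend two signals after putting both on a deviations-from-mean scale."""
    text_z = z_scores(text_scores)
    rank_z = z_scores(page_ranks)
    return [
        text_weight * t + (1.0 - text_weight) * r
        for t, r in zip(text_z, rank_z)
    ]


# Hypothetical scores for three documents; the raw ranges differ wildly.
text_scores = [12.0, 15.0, 9.0]
page_ranks = [0.0001, 0.0002, 0.0009]
combined = combine_signals(text_scores, page_ranks)
best = max(range(len(combined)), key=combined.__getitem__)
```

Adding the raw numbers would let the text score swamp the page rank entirely. After conversion each signal contributes in proportion to how unusual a document is on that signal.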


Relevance

  • More data always beats smarter algorithms

    • Adding position information to the index greatly increases quality

    • Adding stemming raised CTR by 10%

    • Adding anchors (and page rank) distinguished Google

    • Adding origin of anchors (hosts) is a much better measure of independent votes

    • Using demand-side popularity (Alexa, Quantcast) complements web popularity
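
To illustrate why the origin of anchors matters, the sketch below counts distinct linking hosts alongside raw inbound links. The function name `count_votes` and the sample URLs are assumptions for this example.

```python
from urllib.parse import urlparse


def count_votes(anchor_sources):
    """Compare the raw inbound-link count with the count of distinct linking hosts."""
    hosts = {urlparse(url).hostname for url in anchor_sources}
    hosts.discard(None)  # ignore sources with no parseable host
    return {"links": len(anchor_sources), "hosts": len(hosts)}


# Hypothetical inbound links to one page; three come from the same site.
inbound = [
    "http://blog.example.com/post1",
    "http://blog.example.com/post2",
    "http://blog.example.com/post3",
    "http://news.example.org/story",
]
votes = count_votes(inbound)
```

Four links point at the page, but only two hosts stand behind them, so the host count is the closer estimate of how many independent sources vouch for it.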


RELEVANCE

  • What is a news story?

    • Cluster news articles

    • Use size of cluster as a measure of popularity

    • How does one do this efficiently?

      • Needs to be online since interests/queries are ad hoc

      • Need to combine some offline preclustering and online methods
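
A minimal sketch of the clustering idea, assuming a single-pass scheme with word-overlap (Jaccard) similarity: each incoming headline joins the first cluster whose seed headline is similar enough, and cluster size stands in for popularity. The threshold, the sample headlines, and the function names are illustrative; the talk does not say which clustering method was used, and this sketch leaves out the offline preclustering step.

```python
def jaccard(a, b):
    """Overlap between two word sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_headlines(headlines, threshold=0.4):
    """Single-pass clustering: each headline joins the first close-enough cluster."""
    clusters = []  # each cluster: {"seed": word set of its first headline, "members": list}
    for headline in headlines:
        words = set(headline.lower().split())
        for cluster in clusters:
            if jaccard(words, cluster["seed"]) >= threshold:
                cluster["members"].append(headline)
                break
        else:
            # No existing cluster was close enough, so start a new story.
            clusters.append({"seed": words, "members": [headline]})
    # Larger clusters are treated as more popular stories.
    return sorted(clusters, key=lambda c: len(c["members"]), reverse=True)


# Hypothetical headlines arriving one at a time.
story_clusters = cluster_headlines([
    "india wins cricket series against australia",
    "india wins cricket series in australia",
    "venture capital funding slows in 2008",
    "cricket series won by india in australia",
])
```

Because each headline is handled as it arrives, the same loop can run online; an offline pass could supply the initial seeds.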


Summary

  • Consumer:

    • Internet has come a long way in terms of getting information to people

    • Utopian goal of a smart, chatty expert still far away – kosmix.com is a great first step

    • Need good tools to keep on top of the information explosion – personalized newspaper (meehive.com) is our first stab at this

  • Technology:

    • Need to deal with large volume of data

    • Efficient Data Analysis and Annotation (e.g., Categorization)

    • Humming Next Gen Database System that grows incrementally, is immune to failures, and is expressive for non-programmers

