
Finding Information on the web

Srinivasan Seshadri

CTO Kosmix

Early Internet (1992 – 1994)
  • Mozilla Browser
  • People linked to others' home pages and other interesting pages
  • People really browsed
Internet (1995 – 2002)
  • Search - Altavista, Lycos
  • Google
    • Used Hyperlink Graph Structure to Rank Results
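As a rough illustration of ranking by hyperlink graph structure, the following is a minimal textbook-style PageRank sketch using power iteration. The function name, the damping value of 0.85, the iteration count, and the toy graph are assumptions for illustration; this is not a description of Google's production ranking.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to 1).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy graph (made up): three pages link to "c", and "c" links back to "a".
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))
```

In this toy graph the page with the most inbound links ends up with the highest rank, which is the intuition behind treating links as votes.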
Internet Now
  • Kosmix bringing back joys of browsing and exploring
  • 360 degree view of any topic
    • Topic Home page (why not a topic )
  • Top Informational Sites for a topic and a preview (snippets) are the results!
Information Types
  • Factual Information (Wiki etc.)
  • Videos
  • Images
  • Forum Discussions
  • Question and Answers
  • News
  • Blogs
  • Structured Information
Future of Search
  • First step towards providing multiple pivot points for a topic or search
  • Need to make this conversational and stateful, like talking to an expert on the topic
Transient Intent and Persistent Intent
    • Searching for a needle in the haystack
    • Exploring the haystack for a topic
    • Interested in the topic for a long time
      • Carnatic Music, Indian Cricket, Internet Industry, Venture Capital

Deliver information to the consumer
  • What they want
  • When they want
  • How they want
  • Where they want

Personalized Newspaper
  • My World is Changing
  • Cannot keep track of it
  • Can my world come to me?
Media Industry and Internet
  • Huge pressure on newspapers
    • Ad spending moving online
  • More and more content online
    • Reputed journalists have their own blogs
  • Content production, aggregation, and distribution are becoming disaggregated
  • A vanilla online newspaper does not exploit what the internet enables
    • Ability to personalize to nano interests
    • Publish a personalized newspaper for everyone, at any time
Key Technology Ingredients
  • Cloud Computing
  • Categorization
  • Relevance
Cloud Computing at Kosmix
  • Storage:
    • Biggest productivity boost at Kosmix in the first year
      • Getting machines to be remotely rebooted!
    • KFS (Kosmix File System) further lowered the time to make data accessible after machine failures
  • Computation:
    • Long Running Computations need to be broken into small restartable/replayable components
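One way to picture "small restartable/replayable components" is a pipeline that records each finished step in a checkpoint file, so a rerun after a failure skips the work already done. This is a minimal sketch under assumed names: `run_pipeline`, the JSON checkpoint layout, and the step functions are all hypothetical and do not describe Kosmix's actual system.

```python
import json
import os

def run_pipeline(steps, state, checkpoint_path):
    """Run named steps in order, skipping any already recorded as done.

    steps: list of (name, function) pairs; each function takes the current
    state and returns the next state (both must be JSON-serializable).
    The checkpoint file layout here is an assumption for illustration.
    """
    done = []
    if os.path.exists(checkpoint_path):
        # Resume: reload the names of finished steps and the saved state.
        with open(checkpoint_path) as f:
            saved = json.load(f)
        done, state = saved["done"], saved["state"]
    for name, fn in steps:
        if name in done:
            continue  # finished in an earlier run; do not redo it
        state = fn(state)
        done.append(name)
        # Write to a temp file and rename, so a crash never leaves a
        # half-written checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"done": done, "state": state}, f)
        os.replace(tmp, checkpoint_path)
    return state
```

If a step raises an exception, the checkpoint still holds the state after the last successful step, so calling `run_pipeline` again with the same path replays only the remaining steps.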
Cloud Computing at Kosmix
  • Computation Templates:
    • Most of the computation could be expressed as some variant of a single table scan plus an aggregate operation (group by), which Google calls MapReduce
    • MapReduce is not friendly enough for non-programmers
    • SQL is not powerful enough in many situations
    • Need a nice scripting language
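The "single table scan plus group by" pattern can be sketched in a few lines. The helper name `scan_group_aggregate` and the sample page table are made up for illustration; real MapReduce systems distribute the scan and the grouping across many machines, which this in-memory sketch does not attempt.

```python
from collections import defaultdict

def scan_group_aggregate(rows, map_fn, reduce_fn):
    """One pass over rows: map each row to (key, value) pairs,
    group the values by key, then aggregate each group."""
    groups = defaultdict(list)
    for row in rows:                      # the table scan
        for key, value in map_fn(row):    # "map": emit key/value pairs
            groups[key].append(value)     # group by key
    # "reduce": collapse each group of values into one result
    return {key: reduce_fn(values) for key, values in groups.items()}

# Hypothetical page table: (url, category, word_count)
pages = [
    ("a.example/1", "health", 120),
    ("b.example/2", "travel", 300),
    ("a.example/3", "health", 80),
]

# Similar in spirit to: SELECT category, SUM(word_count) ... GROUP BY category
totals = scan_group_aggregate(pages, lambda r: [(r[1], r[2])], sum)
print(totals)
```

The map and reduce functions are ordinary callables here, which hints at the gap the slide describes: more flexible than SQL, but still requiring the user to write code.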
  • Many companies are trying to provide interesting web services
    • The web holds a gold mine of information that companies can use
    • It is impractical for each company to build a huge web-scale support system (crawling, indexing, KFS, MapReduce, etc.)
    • Further, most companies want only slivers of the web (typically category-based slivers: health forums, travel news sites, etc.)
    • The web and all the information derived from it is perhaps the biggest database. Can someone make this accessible and easy to use (using some pay-as-you-go model), or is there perhaps a non-profit (academia?) angle here?
  • Concept Space: the space in which all connections are made within Kosmix
  • Documents, queries, external modules, advertisements, and people are all mapped to points in this space and matched
    • Internet Industry, Venture Capital documents need to be mapped to these categories even if they don’t contain the original words
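To make the idea of matching points in a concept space concrete, here is a minimal sketch that represents each item as a sparse vector of concept weights and ranks items by cosine similarity to a query. The weights, item names, and the choice of cosine similarity are all assumptions for illustration; the slides do not say how Kosmix computes or compares these points.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors (dicts)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_matches(query_vec, items, top_k=2):
    """Rank items (name -> concept vector) by similarity to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in items.items()]
    return [name for score, name in sorted(scored, reverse=True)[:top_k]]

# Made-up concept weights; how they are produced is outside this sketch.
items = {
    "doc: term sheet basics": {"Venture Capital": 0.9, "Internet Industry": 0.2},
    "ad: concert tickets": {"Carnatic Music": 1.0},
    "doc: startup funding news": {"Venture Capital": 0.6, "Internet Industry": 0.7},
}
query = {"Venture Capital": 1.0}
print(best_matches(query, items))
```

Because matching happens on concepts rather than words, a document titled "term sheet basics" can match a "Venture Capital" query without containing those words.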
Categorization at Kosmix
  • Leverage human curated sources
    • Wiki corpus is a major source of knowledge
  • Huge Automatically Curated Taxonomy
    • 6 million concepts
  • Building a Concept Graph with relationship labels where possible
  • Use a web index to match short pieces of text with concepts and use the taxonomy to refine the matches
  • Need to combine multiple signals into one number to enable ranking
    • For example, a query relevance score and a page relevance score (text score and PageRank)
    • Signals need to be made comparable
    • Normalization alone (making ranges the same) is not enough
    • Need to reconcile different distributions
    • Deviations from the mean
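One common reading of "deviations from the mean" is to convert each signal to z-scores (distance from the mean in units of standard deviation) before blending. The sketch below takes that reading as an assumption; the function names, the equal weighting, and the sample scores are invented for illustration and are not taken from the slides.

```python
import statistics

def zscores(values):
    """Express each value as its deviation from the mean, in units of
    standard deviation, so signals with different distributions become
    comparable."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]

def combine(text_scores, page_scores, text_weight=0.5):
    """Blend two signals into one ranking number per document."""
    zt, zp = zscores(text_scores), zscores(page_scores)
    return [text_weight * t + (1 - text_weight) * p for t, p in zip(zt, zp)]

# Made-up scores for three documents on very different scales.
text_scores = [12.0, 15.0, 9.0]
page_scores = [0.001, 0.0002, 0.004]
combined = combine(text_scores, page_scores)
# Document indices ordered from best to worst combined score.
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(ranking)
```

Simply rescaling both signals to the same range would keep their different shapes; standardizing against each signal's own mean and spread is one way to reconcile the distributions before adding them.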
  • More data always beats smarter algorithms
    • Adding positions information in the index greatly increases quality
    • Adding stemming saw a CTR rise of 10%
    • Adding anchors (and PageRank) distinguished Google
    • Adding the origin of anchors (hosts) gives a much better measure of independent votes
    • Using demand-side popularity (Alexa, Quantcast) complements web popularity
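The point about anchor origins can be illustrated by counting distinct linking hosts instead of raw links, so that many links from one site count as a single vote. The helper `count_votes` and the sample URLs are hypothetical; this is a sketch of the idea, not the ranking signal any particular engine uses.

```python
from urllib.parse import urlsplit

def count_votes(anchor_sources):
    """Return (raw_anchor_count, distinct_host_count) for one target page.

    anchor_sources: URLs of pages that link to the target.
    """
    hosts = {urlsplit(url).hostname for url in anchor_sources}
    hosts.discard(None)  # ignore URLs with no parseable host
    return len(anchor_sources), len(hosts)

# Made-up inbound links: three come from the same blog.
links = [
    "http://blog.example.com/post/1",
    "http://blog.example.com/post/2",
    "http://blog.example.com/post/3",
    "http://news.example.org/story",
]
print(count_votes(links))
```

Here four anchors collapse to two independent hosts, which is harder to inflate from a single site than the raw anchor count.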
  • What is a news story?
    • Cluster news articles
    • Use size of cluster as a measure of popularity
    • How does one do this efficiently?
      • Needs to be online since interests/queries are ad hoc
      • Need to combine some offline preclustering and online methods
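A minimal sketch of the clustering idea: group headlines whose word sets overlap enough, then use cluster size as a stand-in for story popularity. The greedy single-pass method, the Jaccard measure, the 0.4 threshold, and the sample headlines are all assumptions chosen for brevity. The slides call for a mix of offline preclustering and online methods, which this toy version does not implement.

```python
def jaccard(a, b):
    """Overlap between two word sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_articles(headlines, threshold=0.4):
    """Greedy single-pass clustering: each headline joins the first
    cluster whose seed headline is similar enough, else starts a new one."""
    clusters = []  # list of (seed_word_set, [member headlines])
    for text in headlines:
        words = set(text.lower().split())
        for seed, members in clusters:
            if jaccard(words, seed) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((words, [text]))
    # Larger clusters first: size stands in for story popularity.
    return sorted((members for _, members in clusters), key=len, reverse=True)

# Made-up headlines for illustration.
headlines = [
    "india wins cricket series against australia",
    "startup raises venture capital round",
    "india wins test series against australia",
    "india wins cricket series",
]
for story in cluster_articles(headlines):
    print(len(story), story[0])
```

Because each headline is compared only against existing cluster seeds, new articles can be added as they arrive, at the cost of results that depend on arrival order.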
  • Consumer:
    • Internet has come a long way in terms of getting information to people
    • Utopian goal of a smart, chatty expert still far away – is a great first step
    • Need good tools to keep on top of the information explosion – personalized newspaper ( is our first stab at this..
  • Technology:
    • Need to deal with large volume of data
    • Efficient Data Analysis and Annotation (e.g., Categorization)
    • Humming next-gen database system that grows incrementally, is immune to failures, and is expressive for non-programmers