Data cloud

Data Cloud

Yury Lifshits

Yahoo! Research

My Beliefs

The key challenge in web search is structured search

Part 1: What is structured search?

The key challenge in structured search is collecting data

Part 2: Data distribution & idea of Data Cloud

Part 3: Demo: numeric data distribution

The key challenge in collecting data is incentive design

Part 4: Economics of data distribution

Data = data of entities + data of content


Structured data

Entity unit:

  • Identifier

  • Metadata:

    • Explicit key-value pairs

    • Relational properties

    • Evaluation

Semi-structured data

Content unit:

  • Body: text, video, audio, or image

  • Metadata:

    • Explicit key-value pairs

    • Relational properties

    • Evaluation
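The two unit types above can be sketched as plain record types. This is a minimal illustration, not a schema from the talk: the class names, field names, and example values are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Metadata:
    """Metadata shared by both unit types (field names are illustrative)."""
    properties: dict[str, Any] = field(default_factory=dict)         # explicit key-value pairs
    relations: list[tuple[str, str]] = field(default_factory=list)   # (relation, target identifier)
    evaluation: dict[str, float] = field(default_factory=dict)       # e.g. ratings or votes


@dataclass
class EntityUnit:
    """Structured data: an identifier plus metadata."""
    identifier: str
    metadata: Metadata = field(default_factory=Metadata)


@dataclass
class ContentUnit:
    """Semi-structured data: a body plus metadata."""
    body: str | bytes
    media_type: str  # "text", "video", "audio", or "image"
    metadata: Metadata = field(default_factory=Metadata)


# Hypothetical example: an entity and a piece of content that refers to it.
concert = EntityUnit(
    identifier="event:example-concert",
    metadata=Metadata(
        properties={"city": "SF", "price_usd": 15},
        relations=[("performed_by", "artist:example-band")],
        evaluation={"popularity": 0.8},
    ),
)
review = ContentUnit(
    body="Great show.",
    media_type="text",
    metadata=Metadata(relations=[("about", concert.identifier)]),
)
print(review.metadata.relations)
```

Both unit types carry the same metadata shape, so a search layer could treat key-value pairs, relations, and evaluation uniformly across entities and content.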

Structured Search

Factoid search

  • "What's the value of property X of object Y?"

Entity hubs

  • Domain hubs

Structured object search

  • "All concerts this weekend in SF under $20, sorted by popularity"

  • Time focus

  • Ranking focus

  • Relations focus

Structured content search

  • "All videos with Tom Brady"

  • "All comments and blog posts about Bing"
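As a rough illustration of structured object search, the concert query from this slide can be read as a filter on explicit key-value pairs followed by a ranking on an evaluation score. The records, field names, and dates below are invented for the sketch and do not come from the talk.

```python
from datetime import date

# Hypothetical entity records; every field name and value is made up for illustration.
CONCERTS = [
    {"id": "c1", "city": "SF", "date": date(2009, 6, 13), "price_usd": 15, "popularity": 0.4},
    {"id": "c2", "city": "SF", "date": date(2009, 6, 14), "price_usd": 18, "popularity": 0.9},
    {"id": "c3", "city": "SF", "date": date(2009, 6, 14), "price_usd": 45, "popularity": 0.7},
    {"id": "c4", "city": "LA", "date": date(2009, 6, 13), "price_usd": 10, "popularity": 0.8},
    {"id": "c5", "city": "SF", "date": date(2009, 6, 20), "price_usd": 12, "popularity": 0.6},
]


def search(entities, city, start, end, max_price):
    """Filter on key-value properties, then rank by an evaluation score."""
    hits = [
        e for e in entities
        if e["city"] == city and start <= e["date"] <= end and e["price_usd"] < max_price
    ]
    return sorted(hits, key=lambda e: e["popularity"], reverse=True)


# "All concerts this weekend in SF under $20, sorted by popularity"
results = search(
    CONCERTS,
    city="SF",
    start=date(2009, 6, 13),
    end=date(2009, 6, 14),
    max_price=20,
)
print([e["id"] for e in results])
```

The time, price, and location constraints only work because each concert is stored as an entity with typed properties, which is the point of the structured-data model.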

Yury’s Wishlist

Business-generated data

  • Products, services, news, wishlists, contact data

Reality stream, sensors

  • What has happened, and where

Expert knowledge

  • Glossary, issues, typical solutions, object databases, related objects graph


  • Sport, concerts, education, corporate, community, private

Market graph & signals

  • Like, interested, use, following, want to buy; votes and ratings

Search as a Platform

[Diagram; labels: Query analysis, Post analysis, Classic search, App 1, App 2, App 3, App 4, Structured Data, Web index]

Data Cloud

How to collect all structured data in one place?

Data Producers

  • People: forums, wiki, mail groups, blogs, social networks

  • Enterprises: product profiles, corporate news, professional content

  • Sensors: GPS modules, web cameras, traffic sensors, RFID

  • Transactional data

Data Distributors

A data distributor is any technical solution that accumulates, organizes, and provides access to structured and semi-structured data

Data publisher: the original distributor of some data

Data retailer: a consumer-facing distributor of some data

Data Consumers

  • Humans

    • Email

    • Aggregators: news, friend feeds, RSS readers

    • Search

    • Browsing / random walks

  • Intelligence projects

    • Recommendation systems

    • Trend mining

Data Cloud

Data Cloud is a centralized fully-functional data distribution service

Success metric for data cloud strategy = the total “value” of data on the cloud

To-Cloud Solutions

  • Extraction

    •, “web tables”

  • Semantic markup, data APIs

    • Yahoo! SearchMonkey

  • Feeds

    • Yahoo! Shopping

    •,, Facebook Connect

  • Direct publishing

On-Cloud Solutions

  • Ontology maintenance

    • Freebase

  • Normalization, de-duplication, antispam

  • Named entity recognition, metadata inference, ranking

  • Data recycling (cross-references)

    • Amazon Public Data Sets

    • Viral license

  • Hosted search

    • Yahoo! BOSS

From-Cloud Solutions

  • Search, audience

    • Y! SearchMonkey, Google Base

  • Data API, dump access, update stream

  • Custom notifications


  • Data cloud as a primary backend

  • Access control

    • Ad distribution. (AT&T and Yahoo! Local deal)

Joint work with Paul Tarjan

Webnumbr.com: Import

  • Crawl numbers from the web

    URL + XPath + regex

  • Create “numbr pages”

  • Update their values every hour

  • Keep the history

    Anyone can create a numbr
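The "URL + XPath + regex" recipe can be sketched as follows. This is not Webnumbr's actual code: the `Numbr` record, the function names, and the sample page are hypothetical. Fetching the URL is left out, and the standard library's `xml.etree.ElementTree` is used for the XPath step, which supports only a subset of XPath and requires well-formed markup (real web pages would need a tolerant HTML parser).

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Numbr:
    """Hypothetical record for one tracked number (names are illustrative)."""
    url: str       # where the page lives; fetching is omitted in this sketch
    xpath: str     # which node holds the number
    pattern: str   # regex with one capture group around the digits
    history: list = field(default_factory=list)  # (timestamp, value) pairs


def extract_number(markup: str, xpath: str, pattern: str) -> float:
    """Select a node with a limited XPath, then pull a number out with a regex."""
    root = ET.fromstring(markup)  # requires well-formed markup
    node = root.find(xpath)
    if node is None:
        raise ValueError(f"no node matches {xpath!r}")
    text = "".join(node.itertext())
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"pattern {pattern!r} not found in {text!r}")
    return float(match.group(1).replace(",", ""))


def update(numbr: Numbr, markup: str, timestamp: int) -> None:
    """Append the current value to the history; an hourly job would call this."""
    value = extract_number(markup, numbr.xpath, numbr.pattern)
    numbr.history.append((timestamp, value))


PAGE = "<html><body><div id='stats'><span class='count'>Members: 1,204</span></div></body></html>"
numbr = Numbr(
    url="http://example.com/stats",
    xpath=".//span[@class='count']",
    pattern=r"([\d,]+)",
)
update(numbr, PAGE, timestamp=0)
update(numbr, PAGE.replace("1,204", "1,250"), timestamp=3600)
print(numbr.history)
```

Storing every (timestamp, value) pair rather than only the latest value is what makes the history and graph features on the export side possible.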

Webnumbr.com: Export

  • Embed code

  • Graphs

  • Search & browse

  • RSS

Economics of Data Distribution

Joint work with Ravi Kumar and Andrew Tomkins

Network Effect in Two-Sided Markets

Two-sided market = every product serves consumers of two types, A and B

Cross-side network effect: the more type-A users product X has, the more attractive it is for type-B consumers, and vice versa

Examples: operating systems, credit cards, e-commerce marketplaces

Reference: "Two-sided network effects: A theory of information product design", G. Parker, M.W. Van Alstyne, N. Bulkley, M. Van Alstyne

Basic model

  • Distributors D1, …, Dk

  • Each producer/consumer joins only one distributor

  • Initial shares (p1, c1), …, (pk, ck)

  • A new consumer selects a distributor with probability proportional to pi

  • A new producer selects a distributor with probability proportional to ci
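The arrival process in this model can be simulated in a few lines. The sketch below tracks counts rather than shares and makes one assumption the slide does not state: consumers and producers arrive alternately, one per step. The function names, starting counts, and seed are illustrative.

```python
import random


def simulate(producers, consumers, steps, seed=0):
    """Simulate arrivals under the basic model.

    producers[i] and consumers[i] are the current counts at distributor i.
    A new consumer picks distributor i with probability proportional to
    producers[i]; a new producer picks i with probability proportional to
    consumers[i]. Alternating arrivals is an assumption of this sketch.
    """
    rng = random.Random(seed)
    producers, consumers = list(producers), list(consumers)
    k = len(producers)
    for step in range(steps):
        if step % 2 == 0:  # new consumer, attracted by producers
            i = rng.choices(range(k), weights=producers)[0]
            consumers[i] += 1
        else:              # new producer, attracted by consumers
            i = rng.choices(range(k), weights=consumers)[0]
            producers[i] += 1
    return producers, consumers


def shares(counts):
    """Convert counts to market shares that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


p, c = simulate(producers=[5, 3, 2], consumers=[50, 30, 20], steps=10_000)
print("producer shares:", [round(s, 3) for s in shares(p)])
print("consumer shares:", [round(s, 3) for s in shares(c)])
```

Running it with different seeds gives a feel for how much the final shares depend on early random arrivals.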

Market Shares Dynamics

Theorem 1

Market shares will stabilize

Theorem 2

With a super-linear preference rule, one of the distributors will tip

Theorem 3

With a sub-linear preference rule, market shares will flatten
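The transcript does not give the exact form of the preference rule. One common way to model it, assumed here purely for illustration, is attraction proportional to share raised to a power alpha: alpha = 1 is the proportional rule, alpha > 1 is super-linear, and alpha < 1 is sub-linear. The sketch compares the leader's current share with the probability that the next newcomer joins the leader.

```python
def choice_probabilities(shares, alpha):
    """Probability that a newcomer picks each distributor when attraction is share**alpha.

    The power-law form is an assumption of this sketch, not taken from the talk.
    """
    weights = [s ** alpha for s in shares]
    total = sum(weights)
    return [w / total for w in weights]


current = [0.6, 0.4]  # hypothetical current shares of two distributors
for alpha in (0.5, 1.0, 2.0):
    probs = choice_probabilities(current, alpha)
    print(alpha, [round(p, 3) for p in probs])
```

Under this assumed rule, with alpha = 2 the leader wins newcomers with probability of about 0.69, above its current share of 0.6, so its lead tends to grow. With alpha = 0.5 the probability is about 0.55, below 0.6, so shares drift toward equality. This matches the direction of Theorems 2 and 3 but is only a one-step intuition, not a proof.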

External Factor

Preference rule with external factor:


Theorem 4

Market shares will stabilize at the ratio e1 : e2 : … : ek

Coalition

Data Cloud

Coalitions

Theorem 5

If all market shares are below 1/sqrt(k), a coalition (sharing data) is profitable for all distributors


Coalitions are not monotone

Example: 5 : 4 : 1 : 1

Model Variations

  • Same-side network effect

  • Different p-to-c and c-to-p rules

  • Multi-homing (overlapping audiences)

  • n^2 vs. n log n revenue models

  • Mature market: newcomer rate = departing rate

  • Diverse market (many types of producers and consumers)

  • Entering and departing distributors

  • Directed coalitions

Challenges

Marketing

  • Data demand?

  • Data offerings?

  • Requirements for distribution technology?

Incentive design

  • Incentives for data sharing?

  • Centralized or distributed?

    • For profit or non-profit?

  • Data licensing and ownership?

  • Monetizing data cloud?

More Challenges


  • Data marketplace: open data & data demand

  • Search plugins: related objects, glossaries, object timelines

  • Publishing tools for structured data

  • Data client: structured news, bookmarking, notifications

Tech design:

  • Access management

  • Namespace design

User interface:

  • Structured search UI

  • Discovery UI

Thanks!

Follow my research: