Internet-scale MM retrieval

RNDr. Jakub Lokoč, Ph.D.

Siret Research Group (www.siret.cz)

Department of SW Engineering

Faculty of Mathematics and Physics

Charles University in Prague

Internet-scale MM retrieval

BIG DATA security seminar, Police Academy of the Czech Republic in Prague


What does “internet-scale” mean?

Source: http://royal.pingdom.com, statistics for 2011

  • 2.1 billion – Internet users worldwide

  • 3.146 billion – number of email accounts worldwide

  • 800+ million – number of users on Facebook

  • 555 million – number of websites (+300 million in 2011)

  • 1 trillion – number of video playbacks on YouTube

    • 48 hours – amount of video uploaded to YouTube every minute

  • 100 billion – estimated number of photos on Facebook

    • 4.5 million – number of photos uploaded to Flickr each day

MM data

Many problems to solve…

Searching huge MM collections

  • Text-based techniques

    • Advantage – scalable retrieval by inverted files

    • Problem – missing or misleading annotations

  • Content-based techniques

    • Advantage – no annotation needed, visual similarity

    • Problem – slow retrieval for complex similarity models

  • Hybrid techniques

    • Text-based query + content-based reranking/exploration

    • Content-based query + text-based filtering

    • Adapting content-based data for inverted files
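
The first hybrid variant above (text-based query, then content-based reranking) can be sketched roughly as follows. This is only an illustration: the toy collection, the three-dimensional descriptors and the function name text_then_rerank are assumptions, not part of the presentation.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical toy collection: id -> (annotation keywords, visual descriptor)
COLLECTION = {
    "img1": ({"beach", "sunset"}, (0.9, 0.1, 0.0)),
    "img2": ({"beach", "people"}, (0.2, 0.7, 0.1)),
    "img3": ({"city", "sunset"}, (0.8, 0.1, 0.1)),
    "img4": ({"beach"}, (0.7, 0.2, 0.1)),
}

def text_then_rerank(keyword, example_descriptor):
    """Text-based query, followed by content-based reranking of the candidates."""
    # Step 1: cheap text filter selects a small candidate set.
    candidates = [oid for oid, (tags, _) in COLLECTION.items() if keyword in tags]
    # Step 2: only the candidates are compared visually with the example.
    return sorted(candidates, key=lambda oid: dist(COLLECTION[oid][1], example_descriptor))
```

The expensive visual comparison touches only the objects that passed the keyword filter, which is the reason the hybrid approach scales better than a purely content-based scan.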

Text-based retrieval

  • Document vector model

    • User issues a keyword query (Google, Bing, …)

    • Efficient query evaluation using inverted files

  • Problems

    • Manual annotation is feasible only for small collections

    • Subjectivity of the annotation

    • Homonyms, etc.

  • Automatic annotation

    • Surrounding text + linguistic methods + ontologies

    • Content-based keyword assignment

    • Still a lot of problems to solve…
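
As a rough illustration of query evaluation with inverted files, the following minimal sketch maps each keyword to the set of annotated objects and answers a conjunctive keyword query by intersecting posting sets. The annotations and function names are hypothetical, and real engines add ranking, compression and much more.

```python
from collections import defaultdict

def build_inverted_file(annotations):
    """Map each keyword to the set of object ids annotated with it."""
    index = defaultdict(set)
    for object_id, keywords in annotations.items():
        for keyword in keywords:
            index[keyword.lower()].add(object_id)
    return index

def keyword_query(index, keywords):
    """Return ids of objects annotated with all query keywords (conjunctive query)."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return set()
    # Intersect starting from the shortest posting set to keep the work small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result

# Hypothetical annotations; "jaguar" is a homonym (car vs. animal).
annotations = {
    "img1": ["jaguar", "car"],
    "img2": ["jaguar", "cat", "jungle"],
    "img3": ["car", "street"],
}
index = build_inverted_file(annotations)
```

In this toy data, a query for "jaguar" alone returns both the car and the animal picture, which is the homonym problem listed above; adding a second keyword narrows the result.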

Example – www.google.com

  • Text-based retrieval

Content-based retrieval

  • All objects transformed into a similarity model

    • Objects represented by descriptors (histograms, signatures)

    • Descriptors compared by a distance measure d (Lp, SQFD, EMD)

  • User issues an example object as a query q

  • Objects x sorted according to the visual similarity d(q, x)

  • How to solve the efficiency problem?

    • Hybrid techniques – not the whole DB is searched in the content-based way

    • Distance-based indexes

  • Distributed architectures needed (storage, throughput, …)
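
The query-by-example model above can be sketched as a sequential scan: every object x is ranked by d(q, x) and the k nearest are returned. The sketch assumes short color-histogram-like descriptors and an Lp distance; the object names and values are made up for illustration, and SQFD or EMD would simply replace the distance function.

```python
def lp_distance(x, y, p=2.0):
    """Minkowski (L_p) distance between two equal-length descriptors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def knn_query(database, q, k, p=2.0):
    """Sequential scan: sort all objects x by d(q, x) and keep the k nearest."""
    ranked = sorted(database.items(), key=lambda item: lp_distance(q, item[1], p))
    return [object_id for object_id, _ in ranked[:k]]

# Hypothetical descriptors: (share of red, share of green, share of blue)
database = {
    "red_rose":   (0.8, 0.1, 0.1),
    "green_leaf": (0.1, 0.8, 0.1),
    "blue_sky":   (0.1, 0.2, 0.7),
    "red_car":    (0.7, 0.2, 0.1),
}
```

The scan evaluates the distance for every object in the database, so its cost grows linearly with the collection size; this is the efficiency problem that the hybrid techniques and distance-based indexes are meant to address.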

[Figure: query object, feature extraction, similarity evaluation]

Example – www.google.com

  • Hybrid techniques – reranking, page 1

Example – www.google.com

  • Hybrid techniques – reranking, page 2

Example – siret.ms.mff.cuni.cz/sir

  • Hybrid techniques – exploration

J. Lokoč, T. Grošup, T. Skopal: Image Exploration using Online Feature Extraction and Reranking. ICMR, 2012, Hong Kong, China, ACM

J. Lokoč, T. Grošup, T. Skopal: SIR: The Smart Image Retrieval Engine. SISAP, 2012, Toronto, Canada, Springer

Distance-based indexing

  • MM objects organized into clusters according to their similarity

  • Effectiveness depends on the similarity model
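
One simple way a distance-based index can avoid distance computations is pivot filtering with the triangle inequality. The sketch below is a generic illustration of that idea, not the hierarchical cluster structures of the works cited on this slide; the pivots, the data and the function names are assumptions.

```python
from math import dist

def build_pivot_table(objects, pivots):
    """Precompute d(x, p) for every object x and pivot p (done once, at index time)."""
    return {oid: [dist(x, p) for p in pivots] for oid, x in objects.items()}

def range_query(objects, pivots, table, q, radius):
    """Return (ids with d(q, x) <= radius, number of object distances computed)."""
    q_to_pivots = [dist(q, p) for p in pivots]
    result, computed = [], 0
    for oid, x in objects.items():
        # Triangle inequality: |d(q, p) - d(x, p)| <= d(q, x) for any pivot p.
        lower_bound = max(abs(dq - dx) for dq, dx in zip(q_to_pivots, table[oid]))
        if lower_bound > radius:
            continue  # x cannot qualify, so the expensive distance is skipped
        computed += 1
        if dist(q, x) <= radius:
            result.append(oid)
    return result, computed

# Hypothetical 2-D data; Euclidean distance stands in for an expensive measure.
objects = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 5.0), "d": (6.0, 5.0)}
pivots = [(0.0, 0.0), (6.0, 6.0)]
table = build_pivot_table(objects, pivots)
```

For the query (0.5, 0.0) with radius 1.0, the two distant objects are discarded using only precomputed pivot distances. The filter is correct for any metric, but how much it prunes depends on the similarity model and the data.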

P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006

J. Lokoč, P. Čech, J. Novák, T. Skopal: Cut-region: A Compact Building Block For Hierarchical Metric Indexing. SISAP, 2012, Toronto, Canada, Springer

D. Novak, M. Batko, P. Zezula: Metric Index: An efficient and scalable solution for precise and approximate similarity search. Information Systems, 2011, Elsevier

Example - Mufin

  • Content-based search in 100 million Flickr images

Example - Mufin

  • MPEG-7 descriptors used – efficient, but effective?

Distance-based indexing

  • Effective measure

    • Often complex and expensive

  • Efficiency

    • Depends on the index performance

    • Also depends on the “indexability” of the data

Distance-based indexing

  • Indexability depends on the distance distribution of the distance space used

E. Chavez, G. Navarro, R. Baeza-Yates, J. L. Marroquin: Searching in Metric Spaces. ACM Computing Surveys, 2001
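
The survey by Chavez et al. quantifies this with the intrinsic dimensionality ρ = μ² / (2σ²), where μ and σ² are the mean and the variance of the distance distribution; a high ρ (distances concentrated around the mean) indicates poor indexability. The following sketch estimates ρ from a sample. The uniformly random vectors and the sample sizes are assumptions chosen only to show the trend.

```python
import random
from itertools import combinations
from math import dist
from statistics import mean, pvariance

def intrinsic_dimensionality(sample, distance=dist):
    """Estimate rho = mu^2 / (2 * sigma^2) from all pairwise distances in a sample."""
    distances = [distance(x, y) for x, y in combinations(sample, 2)]
    return mean(distances) ** 2 / (2 * pvariance(distances))

def random_vectors(n, dim, rng):
    """Hypothetical data: n points drawn uniformly from the unit cube."""
    return [tuple(rng.random() for _ in range(dim)) for _ in range(n)]

rng = random.Random(0)  # fixed seed so that the run is repeatable
rho_low = intrinsic_dimensionality(random_vectors(200, 2, rng))
rho_high = intrinsic_dimensionality(random_vectors(200, 32, rng))
```

With these settings the 32-dimensional sample yields a much larger ρ than the 2-dimensional one, since its pairwise distances are nearly all alike and lower bounds such as pivot filtering can discard very little.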

Facing bad indexability

  • Centralized computing

    • Approximate search

    • Parallel processing

  • Distributed computing

    • Peer-to-peer architecture

Approximate search

  • Based on various ideas

    • Early termination once the results are good enough

    • Reducing the query radius

    • Stopping when a time limit elapses

    • Accessing only a given percentage of the DB

    • Also distance modifications

  • However, for fast retrieval, the quality deteriorates rapidly
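
The "percentage of the DB" idea can be sketched as follows: objects are visited in a cheap, promising-first order and the search terminates once a fixed budget of distance computations is spent. The single pivot, the ordering heuristic and the names are assumptions made for illustration; the cited book describes the actual strategies.

```python
from math import dist

def approximate_knn(objects, pivot, q, k, fraction):
    """Approximate kNN that evaluates d(q, x) for only a fraction of the database."""
    dq = dist(q, pivot)
    # Cheap ordering: objects whose pivot distance is close to d(q, pivot) go first.
    # In a real index the values d(x, pivot) would be precomputed, not evaluated here.
    order = sorted(objects, key=lambda oid: abs(dist(objects[oid], pivot) - dq))
    budget = max(k, int(len(order) * fraction))
    visited = order[:budget]  # early termination after the budget is spent
    return sorted(visited, key=lambda oid: dist(objects[oid], q))[:k]

# Hypothetical data: ten points on a line.
objects = {f"o{i}": (float(i), 0.0) for i in range(10)}
pivot = (0.0, 0.0)
```

With fraction set to 1.0 the search is exact. Smaller fractions save work, but any true neighbor that falls outside the visited prefix is missed, which is the quality loss noted in the last bullet.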

P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006

Parallel processing

  • Multi-core CPUs cheap and available

  • Intel Xeon Phi coprocessor

  • GPU cards with thousands of cores

  • Amdahl's and Gustafson's laws
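
For reference, the two laws mentioned in the last bullet can be written as small functions. Here p denotes the parallelizable fraction of the work and n the number of workers (cores); the function names are chosen for this sketch.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Fixed problem size: S = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

def gustafson_speedup(parallel_fraction, workers):
    """Problem size scaled with the number of workers: S = (1 - p) + p * n."""
    return (1.0 - parallel_fraction) + parallel_fraction * workers
```

Under Amdahl's law, a task that is 95% parallel can never run more than 20 times faster, however many cores are added. Gustafson's law is more optimistic because it assumes the workload grows with the hardware, which often fits similarity search over ever larger collections.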

Distributed indexes

  • Peer-to-peer architecture

    • Chord protocol (efficient routing)

  • M-Chord, M-Index

    • Map objects to the real domain R

    • Use the Chord protocol for object distribution

    • A similarity query is translated into interval queries; the partial results are merged

D. Novak, P. Zezula: M-Chord: a scalable distributed similarity search structure. InfoScale, 2006, ACM

D. Novak, M. Batko, P. Zezula: Large-scale similarity data management with distributed Metric Index. Information Processing & Management
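
The three steps above (map to a real-valued key, distribute key ranges, answer a query by interval queries and merge) can be sketched with an iDistance-style mapping. This is a simplified single-process illustration under assumed names and constants; the exact mappings and routing used by M-Chord and M-Index are described in the cited papers and may differ.

```python
from math import dist

C = 100.0  # assumed constant separating cluster key ranges; must exceed any distance

def nearest_pivot(x, pivots):
    return min(range(len(pivots)), key=lambda i: dist(x, pivots[i]))

def key_of(x, pivots):
    """iDistance-style key: cluster offset plus distance to the cluster's pivot."""
    i = nearest_pivot(x, pivots)
    return i * C + dist(x, pivots[i])

def peer_of(key, num_peers, num_pivots):
    """Assign equal-width key intervals to peers (a stand-in for Chord's key ring)."""
    width = num_pivots * C / num_peers
    return min(int(key // width), num_peers - 1)

def query_intervals(q, radius, pivots):
    """A range query (q, radius) becomes one key interval per cluster."""
    return [(i * C + max(0.0, dist(q, p) - radius), i * C + dist(q, p) + radius)
            for i, p in enumerate(pivots)]

def range_query(objects, pivots, q, radius):
    keys = {oid: key_of(x, pivots) for oid, x in objects.items()}
    hits = set()
    for low, high in query_intervals(q, radius, pivots):
        # Each interval would be routed to the peers responsible for it;
        # here the candidates are checked locally and the partial results merged.
        for oid, key in keys.items():
            if low <= key <= high and dist(objects[oid], q) <= radius:
                hits.add(oid)
    return sorted(hits)

# Hypothetical data and pivots.
pivots = [(0.0, 0.0), (10.0, 0.0)]
objects = {"a": (1.0, 0.0), "b": (4.0, 0.0), "c": (6.0, 0.0), "d": (9.0, 0.0)}
```

The intervals are safe because, by the triangle inequality, any object within the query radius has a pivot distance within that radius of d(q, pivot). Objects inside an interval are only candidates, so each one is still verified with the real distance before the results are merged.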

And all together

Thanks for your attention …

… any questions?
