RNDr. Jakub Lokoč, Ph.D.

Siret Research Group (www.siret.cz)

Department of Software Engineering

Faculty of Mathematics and Physics

Charles University in Prague

Internet-scale MM retrieval

Security Seminar BIG DATA, Police Academy of the Czech Republic in Prague

What does “internet-scale” mean?

Statistics for 2011 (source: http://royal.pingdom.com)

  • 2.1 billion – Internet users worldwide
  • 3.146 billion – number of email accounts worldwide
  • 800+ million – number of users on Facebook
  • 555 million – number of websites (+300 million in 2011)
  • 1 trillion – number of video playbacks on YouTube
    • 48 hours – amount of video uploaded to YouTube every minute
  • 100 billion – estimated number of photos on Facebook
    • 4.5 million – number of photos uploaded to Flickr each day

MM data

Many problems to solve…

Searching huge MM collections
  • Text-based techniques
    • Advantage – scalable retrieval by inverted files
    • Problem – missing or misguiding annotations
  • Content-based techniques
    • Advantage – no annotation needed, visual similarity
    • Problem – slow retrieval for complex similarity models
  • Hybrid techniques
    • Text-based query + content-based reranking/exploration
    • Content-based query + text-based filtering
    • Adapting content-based data for inverted files
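
A minimal sketch of the first hybrid idea above (text-based query followed by content-based reranking). The collection layout, the function name `text_then_rerank`, and the use of an L1 distance on toy histograms are assumptions made for illustration, not something the slides specify.

```python
def text_then_rerank(query_words, query_descriptor, collection, top_k=3):
    """collection: list of (object_id, annotation_words, descriptor) tuples (hypothetical layout)."""
    wanted = set(query_words)

    # Step 1 (text-based): keep only objects annotated with all query words.
    candidates = [item for item in collection if wanted <= set(item[1])]

    # Step 2 (content-based): rerank the small candidate set by L1 distance
    # between the query descriptor and each candidate descriptor.
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    candidates.sort(key=lambda item: l1(query_descriptor, item[2]))
    return [item[0] for item in candidates[:top_k]]


# Toy data: three-bin colour histograms with keyword annotations.
collection = [
    ("img1", ["beach", "sunset"], [0.9, 0.1, 0.0]),
    ("img2", ["beach", "people"], [0.2, 0.3, 0.5]),
    ("img3", ["city", "sunset"], [0.8, 0.2, 0.0]),
    ("img4", ["beach"], [0.5, 0.3, 0.2]),
]
result = text_then_rerank(["beach"], [0.8, 0.2, 0.0], collection)
print(result)
```

Only the objects that survive the keyword filter are compared by content, which is why the expensive distance is evaluated on a fraction of the database.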

Text-based retrieval
  • Document vector model
    • User issues a keyword query (Google, Bing, …)
    • Efficient query evaluation using inverted files
  • Problems
    • Manual annotation only for small data
    • Subjectivity of the annotation
    • Homonyms, etc.
  • Automatic annotation
    • Surrounding text + linguistic methods + ontologies
    • Content-based keyword assignment
    • Still a lot of problems to solve…
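
The inverted-file evaluation mentioned on this slide can be sketched as follows. This is a toy, in-memory version under assumed names (`build_inverted_file`, `keyword_query`) and invented annotations; it uses plain conjunctive matching rather than a weighted document vector model.

```python
from collections import defaultdict


def build_inverted_file(annotations):
    """Map each keyword to the set of object ids annotated with it."""
    index = defaultdict(set)
    for obj_id, words in annotations.items():
        for word in words:
            index[word.lower()].add(obj_id)
    return index


def keyword_query(index, keywords):
    """Return ids of objects annotated with every query keyword."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return set()
    # Intersect the shortest posting lists first to keep intermediate sets small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Invented annotations; "jaguar" is a homonym (animal vs. car).
annotations = {
    "img1": ["jaguar", "jungle"],
    "img2": ["jaguar", "car"],
    "img3": ["car", "street"],
}
index = build_inverted_file(annotations)
print(keyword_query(index, ["jaguar"]))
print(keyword_query(index, ["jaguar", "car"]))
```

The single-keyword query returns both the animal and the car, which is the homonym problem listed above; only an extra keyword disambiguates it.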

Example – www.google.com
  • Text-based retrieval


Content-based retrieval
  • All objects transformed into a similarity model
    • Objects represented by descriptors (histograms, signatures)
    • Descriptors compared by a distance measure d (Lp, SQFD, EMD)
  • User issues an example object as a query q
  • Objects x sorted according to the visual similarity d(q, x)
  • How to solve the efficiency problem?
    • Hybrid techniques – not the whole DB is searched in the content-based way
    • Distance-based indexes
  • Distributed architectures needed (storage, throughput, …)
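
A minimal sketch of the ranking step described above: every object is represented by a descriptor, and the database is sorted by d(q, x). The Minkowski (Lp) distance is used because it is the simplest of the measures named on the slide; the toy histograms and the names `minkowski` and `rank_by_similarity` are assumptions for illustration.

```python
def minkowski(a, b, p=2):
    """Lp distance between two equally long descriptors (p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def rank_by_similarity(query, database, distance=minkowski, k=2):
    """Sequentially scan the database and return the ids of the k closest objects."""
    scored = sorted(database.items(), key=lambda kv: distance(query, kv[1]))
    return [obj_id for obj_id, _ in scored[:k]]


# Toy three-bin histograms standing in for extracted descriptors.
database = {
    "a": [0.5, 0.5, 0.0],
    "b": [0.0, 0.1, 0.9],
    "c": [0.7, 0.2, 0.1],
}
q = [0.6, 0.4, 0.0]
top = rank_by_similarity(q, database)
print(top)
```

The scan evaluates the distance once per database object, so its cost grows linearly with the collection; this is the efficiency problem that the indexes and hybrid techniques try to avoid.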

[Figure: feature extraction for the query object and for the database objects, followed by similarity evaluation]

Example – www.google.com
  • Hybrid techniques – reranking, page 1

Example – www.google.com
  • Hybrid techniques – reranking, page 2

Example – siret.ms.mff.cuni.cz/sir
  • Hybrid techniques – exploration

J. Lokoč, T. Grošup, T. Skopal: Image Exploration using Online Feature Extraction and Reranking. ICMR, 2012, Hong Kong, China, ACM

J. Lokoč, T. Grošup, T. Skopal: SIR: The Smart Image Retrieval Engine. SISAP, 2012, Toronto, Canada, Springer

Distance-based indexing
  • MM objects organized into clusters according to their similarity
  • Effectiveness depends on the similarity model
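
One common way to exploit such clusters, shown here only as a hedged sketch, is ball-style pruning with the triangle inequality: each cluster keeps a pivot and a covering radius, and a range query skips every cluster that cannot contain a result. The slides do not say which index structure is meant, so the layout and the names `build_clusters` and `range_query` are assumptions; the distance must be a metric for the pruning to be correct.

```python
def l2(a, b):
    """Euclidean distance, used here as an example metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def build_clusters(objects, pivots, d=l2):
    """Assign each object to its nearest pivot and track the covering radius."""
    clusters = [{"pivot": p, "radius": 0.0, "members": []} for p in pivots]
    for obj in objects:
        dists = [d(obj, p) for p in pivots]
        i = dists.index(min(dists))
        clusters[i]["members"].append(obj)
        clusters[i]["radius"] = max(clusters[i]["radius"], dists[i])
    return clusters


def range_query(clusters, q, r, d=l2):
    """Return objects within distance r of q and the number of distance computations."""
    hits, computed = [], 0
    for c in clusters:
        dqp = d(q, c["pivot"])
        computed += 1
        # Triangle inequality: d(q, x) >= d(q, pivot) - d(pivot, x) >= dqp - radius.
        if dqp - c["radius"] > r:
            continue  # no member of this cluster can lie within r of q
        for obj in c["members"]:
            computed += 1
            if d(q, obj) <= r:
                hits.append(obj)
    return hits, computed


objects = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
clusters = build_clusters(objects, pivots=[(0, 0), (10, 10)])
hits, computed = range_query(clusters, q=(0.5, 0), r=1.0)
print(hits, computed)
```

In this toy run the whole second cluster is skipped after a single distance computation to its pivot. How often that happens in practice depends on the similarity model and on how well the data clusters.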

P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006

J. Lokoč, P. Čech, J. Novák, T. Skopal: Cut-region: A Compact Building Block For Hierarchical Metric Indexing. SISAP, 2012, Toronto, Canada, Springer

D. Novak, M. Batko, P. Zezula: Metric Index: An efficient and scalable solution for precise and approximate similarity search. Information Systems, 2011, Elsevier

Example - Mufin
  • Content-based search in 100 million Flickr images

Example - Mufin
  • MPEG-7 descriptors used – efficient, but effective?

Distance-based indexing
  • Effective measure
    • Often complex and expensive
  • Efficiency
    • Depends on the index performance
    • Depends also on the data “indexability”

Distance-based indexing
  • Indexability depends on the distance distribution of the used distance space

E. Chavez, G. Navarro, R. Baeza-Yates, J. L. Marroquin: Searching in Metric Spaces. ACM Computing Surveys, 2001

Facing bad indexability
  • Centralized computing
    • Approximate search
    • Parallel processing
  • Distributed computing
    • Peer-to-peer architecture

Approximate search
  • Based on various ideas
    • Early termination once the result is good enough
    • Reducing the query radius
    • Stopping when a time limit elapses
    • Accessing only a given % of the DB
    • Also distance modifications
  • However, for fast retrieval, the quality deteriorates rapidly
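
A minimal sketch of one of the ideas above, accessing only a given fraction of the DB: clusters are visited in order of the query's distance to their pivots, and the search stops once the access budget is used up. The cluster layout, the parameter `max_fraction`, and the function name `approximate_knn` are assumptions for illustration.

```python
import heapq


def l2(a, b):
    """Euclidean distance, used here as an example metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def approximate_knn(clusters, q, k, max_fraction, d=l2):
    """clusters: list of (pivot, members). Visit at most max_fraction of all objects."""
    total = sum(len(members) for _, members in clusters)
    budget = max(1, int(total * max_fraction))
    candidates = []
    # Most promising clusters first: those whose pivot is closest to the query.
    for pivot, members in sorted(clusters, key=lambda c: d(q, c[0])):
        for obj in members:
            if len(candidates) >= budget:
                break
            candidates.append((d(q, obj), obj))
        if len(candidates) >= budget:
            break  # early termination: budget exhausted
    return [obj for _, obj in heapq.nsmallest(k, candidates)]


clusters = [
    ((0, 0), [(0, 0), (1, 1), (0, 1)]),
    ((5, 5), [(3, 3), (5, 5), (6, 5)]),
    ((10, 10), [(10, 10), (11, 10), (10, 11)]),
]
q = (2.4, 2.4)
approx = approximate_knn(clusters, q, k=1, max_fraction=0.34)
exact = approximate_knn(clusters, q, k=1, max_fraction=1.0)
print(approx, exact)
```

With a budget of about a third of the data, only the first cluster is scanned and the true nearest neighbour, which sits in the second cluster, is missed. This is the speed versus quality trade-off noted in the last bullet.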

P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006

Parallel processing
  • Multi-core CPUs cheap and available
  • Intel Xeon Phi coprocessor
  • GPU cards with thousands of cores
  • Amdahl's and Gustafson's laws
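
The two laws named above can be written as short functions. Here `p` is the parallelizable fraction of the work and `n` the number of processing units. Amdahl's law assumes a fixed problem size, while Gustafson's law assumes the problem grows with `n` (with `p` measured on the parallel run). The function names are chosen for illustration.

```python
def amdahl_speedup(p, n):
    """Fixed-size speedup: the serial fraction (1 - p) bounds the result by 1 / (1 - p)."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(p, n):
    """Scaled speedup: the parallel part grows with n while the serial part stays constant."""
    return (1.0 - p) + p * n


for n in (8, 1000):
    print(n, amdahl_speedup(0.95, n), gustafson_speedup(0.95, n))
```

With 5% serial work, Amdahl's law caps the speedup below 20 no matter how many cores are added. Gustafson's law is the more relevant view when extra cores are used to search a larger collection in the same time.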

Distributed indexes
  • Peer-to-peer architecture
    • Chord protocol (efficient routing)
  • M-Chord, M-Index
    • Map objects to the real domain R
    • Use the Chord protocol for object distribution
    • A query is translated into interval queries; the results are merged
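
A single-machine sketch of the mapping idea above, assuming an iDistance-style key: each object x gets the key i·c + d(p_i, x), where p_i is its nearest pivot and c is a constant larger than any distance. A range query then becomes one key interval per pivot. The Chord routing itself is not modelled; a sorted list stands in for the key space that would be split among peers, and the names `to_key`, `build_index`, and `range_query` are hypothetical.

```python
import bisect


def l2(a, b):
    """Euclidean distance, used here as an example metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def to_key(obj, pivots, c, d=l2):
    """Map an object to a real-valued key: cluster offset plus distance to nearest pivot."""
    dists = [d(obj, p) for p in pivots]
    i = dists.index(min(dists))
    return i * c + dists[i]


def build_index(objects, pivots, c, d=l2):
    """Sorted (key, object) pairs; in a P2P setting, key ranges would belong to peers."""
    return sorted((to_key(o, pivots, c, d), o) for o in objects)


def range_query(index, pivots, c, q, r, d=l2):
    """Run one interval query per pivot, then merge and verify the candidates."""
    keys = [k for k, _ in index]
    hits = set()
    for i, p in enumerate(pivots):
        dq = d(q, p)
        # Any result x in cluster i satisfies |d(p_i, x) - d(p_i, q)| <= r.
        lo = i * c + max(0.0, dq - r)
        hi = i * c + min(c, dq + r)
        for _, obj in index[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]:
            if d(q, obj) <= r:  # filter out false positives from the interval
                hits.add(obj)
    return sorted(hits)


objects = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (5, 4)]
pivots = [(0, 0), (10, 10)]
index = build_index(objects, pivots, c=100.0)
result = range_query(index, pivots, 100.0, q=(0.5, 0.0), r=1.0)
print(result)
```

Because keys are plain real numbers, each interval can be served by whichever peers own that part of the key space, and the partial results are merged at the end.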

D. Novak, P. Zezula: M-Chord: a scalable distributed similarity search structure. InfoScale, 2006, ACM

D. Novak, M. Batko, P. Zezula: Large-scale similarity data management with distributed Metric Index. Information Processing & Management

And all together

Thanks for your attention …

… any questions?
