Query Processing in a Mediator System for Data and Multimedia

Query Processing in a Mediator System for Data and Multimedia D. Beneventano1, C. Gennaro2, M. Mordacchini2, R. Carlos Nana Mbinkeu1 1DII - Università di Modena e Reggio Emilia, via Vignolese 905, Modena, Italy 2ISTI – CNR, via Moruzzi 1, Pisa, Italy

Outline • Motivation • The system and scenario overview • Querying an ontology of data and multimedia sources • mapping • Queryunfoldingfor multimedia conditions • ranking • Conclusion and future work

Motivation • We proposed a method for building a populated domain ontology representative of a set of web data sources. • The method exploits the capabilities of a mediator system (MOMIS) to create an integrated view of a set of data sources, • i.e. a domain ontology schema, and a set of annotations linking data to the integrated view. • We extend that approach with multimedia sources, thus obtaining a methodology for building and querying an ontology representing data and multimedia sources. • There are several use cases where applications interact with ontologies of data and multimedia sources. • Multimedia and data sources are usually represented with different models. No standard for representing at the same time data and multimedia sources has been adopted by large communities. • Different languages and different interfaces for querying “traditional” and “multimedia” data sources have been developed. The formers rely on expressive languages allowing expressing selection clauses, the latters typically implement similarity search techniques for retrieving multimedia documents similar to the ones provided by the user.

Managing a Semantic Peer: MOMIS + MILOS • provides a unified access to different data sources referring to the same domain by means of a Semantic Peer Data Ontology (SPDO) of the data i.e. a common representation of all the data sources belonging to the peer. MOMIS (Mediator envirOnment for Multiple Information Sources) is a framework to perform information extraction and integration of heterogeneous, structured and semistructured, data sources • MILOS is a general purpose Multimedia Content Management System • Manages and serves any multimedia documents • Manages any metadata of documents NeP4B Semantic Peer

Data and Multimedia Sources (DMSs) • Data and Multimedia Source (DMS) is an object oriented database of metadata objects describing a collection of multimedia documents (such as images, videos, etc.) represented with a schema defined in ODLI3 • The DMS schema includes , in general, a set of standard attributes declared using standard predefined ODLI3 types, such as string, double, integer, etc, supporting selection predicates typical of structured and semi-structured data, such as =, <, >, . . . • And multimedia attributes, LMS includes another set of special attributes, declared by means of special predefined classes in ODLI3 which support similarity based searches (Full text search, image similarity, geographical search, etc.)

A sample scenario

Quering DMSs • A DMS Mican be queried using an extension of standard SQL-like syntax SELECT clause. The WHERE clause consists of a conjunctive combination of predicates on the single standard attributes of Mi, as in the following: • ORDER BY + LIMIT K, specify in practice a top-k similarity query SELECTMi.Ak,…, Mi.Sl,… FROMMi WHEREMi.Ax op1 val1 ANDMi.Ay op2 val2 ... ORDER BY Mi.Sw(Q1), Mi.Sz(Q2),… LIMITK

Quering DMSs interface city() { // standard attributes attribute string Name; attribute string Zip; attribute string Country; attribute integer Surface; attribute integer Population; // similarity attributes attribute Image Photo; attribute Text Description; attribute GeoCoordGeoPosition, } // query example SELECT Name FROM city WHERE Country = "Italy“ ORDER BY Photo("http://www.flickr.com/32e324e.jpg"), GeoCoord(40.25, 14.32), Description("sea mozzarella pizza") LIMIT 100 This query tries to find among all Italian cities the ones that best match the image given as example, the textual description, and are nearest as possible to the geographical point of location 40.25N, 14.32E.

DMS: Assumptions • Since we would like to build a general purpose framework, we make the following assumptions: • The way by which the returned objects are ordered is not known (black box); • The DMS does not return scores associated with the objects indicating the relevance of them with respect to the query; • If no ORDER BY clause is specified, DMS will return the records sorted in random order.

Representing the SPDO • We build a conceptualization of a set of DMSs, composed of global classes and global attributes and mappings between the SPDO and the DMS schemata,

Mapping • The query is defined in a semiautomatic way as follows: • A Mapping Table (MT) is specified for each global class G, whose columns represent the n local classes M1,… ,Mn belonging to G and whose rows representthe h global attributes of G. Multimedia attributes can be mapped only onto Global multimedia attributes of the same type. • Join Conditions are defined between pairs of local classes belonging to G and allow the system to identify instances of the same real-world object in different sources.

Example of mapping

Mapping • Resolution Functions are introduced to solve data conflicts of local attribute values associated to the same real-world object. In our framework we consider and implement some of such resolution functions, in particular, the PREFERRED function, which takes the value of a preferred source and the RANDOM function, which takes a random value. • For what concern the multimedia attributes, we introduce a new resolution function, called MOST_SIMILAR, which returns the multimedia objects most similar to the one expressed in the query (if any).

Query the SPDO • Given a global class G with m attributes of which k multimedia attributes, denoted by G.S1,…,G.Sk(as photo and description in the class Hotel) and h standard attributes, denoted by G.A1,…,G.Ah, a query on G (global query) is a conjunctive query, expressed in a simple abstract SQL-like syntax as: SELECTG.Al,…,G.Sj FROMG WHEREG.Ax op1 val1 ANDG.Ay op2 val2 ... ORDER BY G.Sw(Q1), …, G.Sz(Q2) LIMIT K

Query unfolding • To answer a global query on G, the query must be rewritten as an equivalent set of queries (local queries) expressed on the local classes L(G) belonging to G. • the query rewriting is performed by means of query unfolding, which consists of the following four steps: • Computation of Local Query conditions • Computation of Residual Conditions • Fusion of local answers • Application of the Residual Condition

Query Fusion: Ranking • Why? • Modern multimedia content managers typically return multimedia objects (i.e., which support similarity) in decreasing order of relevance, that is, so that the “best” answers are on the top; • we want to preserve this knowledge at global level; • However, since we cannot exploit scores we use the rank as indicator of the relevance of the record returned.

Ranking the results • our problem falls into the category of the partial rank aggregation problems, in which we merge top-k lists rather than fully ranked lists, • We use a simple but yet effective aggregation function for ordinal ranks is the median function: • The score of an object its median position in all the returned lists. • The median function is demonstrated by Fagin et al., to be near-optimal, even for top-k or partial lists. • The algorithm MEDRANK is based on median rank aggregation

The MedRank algorithm • Access the rankings sequentially • when an element has appeared in more than half of the rankings, output it in the aggregated ranking

Example • We would like to found image about the “Arch of Triumph of Rome by night”. • and we assume to have two DMSs containing images of monuments in the world, the first DMS1 with geographical coordinates search capabilities, and the second one DMS2 with image similarity search capabilities;

Example

SELECT… FROM DMS1 WHERE subject=“Monument” ORDER BY GeoCoord(41°53'43.68"N, 12°28'56.34"E ) STOP AFTER 5 ORDER BY Image(URL), GeoCoord(41°53'43.68"N, 12°28'56.34"E Unfortunately if I just for geo coordinates giving the coordinates of Rome as input I found a lot of images of the Colosseum Dist = 1km Dist = 1km DMS1 Dist = 1km Dist = 1km Dist = 2km Roma. Palazzo della Civiltà del Lavoro. EUR

SELECT… FROM DMS2 WHERE type=“Monument” ORDER BY Img(URL), STOP AFTER 5 ORDER BY Image(URL) And if I just search for similarity an image of the “Arch of Triumph of Rome by night” I found a lot of images about the Arch of Triumph of Paris, which is very similar but more famous. DMS2 Roma. Palazzo della Civiltà del Lavoro. EUR

SELECT… FROM WorldMonuments WHERE Subject=“Monument” ORDER BY Img(URL), GeoCoord(41°53'43.68"N, 12°28'56.34"E ) STOP AFTER 5 First element retrieved ORDER BY Image(URL) ORDER BY Image(URL), GeoCoord(41°53'43.68"N, 12°28'56.34"E Dist = 1km Dist = 1km MS1 MS2 Dist = 1km Dist = 1km Dist = 2km Roma. Palazzo della Civiltà del Lavoro. EUR

Conclusion and future work • We presented a methodology implemented in a tool that allows a user to create and query an integrated view of data and multimedia sources. • Future work will be devoted to experiment the tool in real scenarios. In particular, our tool will be exploited for integrating business catalogs related to the area of “tiles”. • We think that such data may provide useful test cases because of the need of connecting data about the features of the tiles with their images.

The end

Building the Data Ontology: MOMIS MOMIS* (Mediator envirOnment for Multiple Information Sources) is a framework to perform information extraction and integration of heterogeneous, structured and semistructured, data sources Semantic Integration of Information • A common data model ODLI3 (derived from ODL-ODMG and I3) & mapped into OLCD description logics Tool-supported techniques to construct the Global Virtual View (GVV) • Local sources wrapping • Local Schema Annotation w.r.t. a common lexical ontology (WordNet) • Semi-automatic discovery of relationships between local schemata • Clustering techniques to build the GVV & mappings between the GVV and local schemata (Mapping Table) • automatic GVV Annotation w.r.t. a common lexical ontology & OWL exportation Global Query Management • Including services and multimedia data sources * D. Beneventano, S. Bergamaschi, F. Guerra, M. Vincini: "Synthesizing an Integrated Ontology ", IEEE Internet Computing Magazine, September-October 2003,42-51. S. Bergamaschi, S. Castano, M. Vincini "Semantic Integration of Semistructured and Structured Data Sources", SIGMOD Record Special Issue on Semantic Interoperability in Global Information, Vol. 28, No. 1, March 1999.

MOMIS architecture COMMON THESAURUSGENERATION GVV GENERATION WRAPPING GLOBAL CLASSES SYNSET# SCHEMA DERIVEDRELATIONSHIPS ODLI3LOCAL SCHEMA 1 SYNSET4 LEXICON DERIVEDRELATIONSHIPS SYNSET1 SYNSET2 Common Thesaurus … … ODLI3 LOCAL SCHEMA N USER SUPPLIEDRELATIONSHIPS MAPPING TABLES INFERRED RELATIONSHIPS MANUALANNOTATION SEMI-AUTOMATIC ANNOTATION

Mapping definition in MOMIS • Mappings among a Global Class G of the GVV and its local classes are represented by a Mapping Table • Global-as-View (GAV) mappings: for each global class G a view VG over the local classes of G is defined by a Full-Join Merge Operator: • Outer Join : to include into the result all tuples of all local sources • Merge : to perform data reconciliation (Resolution functions)

Building the Mappings: an example Join Conditions Join Attribute Resolution Functions avg(L1,L2) Full Join Merge Mapping Table of the global Class Hotel = {L1.resort, L2.hotel} Data Conversion Functions DollarEuro(mean_price) Select name, avg(T_L1.price_avg, T_L2.mean_price) as price, T_L1.Stars, … Full Join from T_L1 outer join T_L2 on (T_L1.Name = T_L2.denomination)

Global Query Management • The querying problem: How to answer queries expressedon the GS (global queries)? • In a Virtual Data Integration system, data reside at the data sources then the query processing is based on Query rewriting : to rewrite a global query as an equivalent set of queries expressed on the local schemata data sources (local queries). • GAV approach: query rewriting is performed by unfolding, i.e. by expanding a global query on G according to the view associated to G • Query Optimization Techniques for the Full-Join Merge Operator • Motivation : • full outer join queries are very expensive, especially in a distributed environment • only limited optimization is performed on full outer join

An example of Full-Join Merge Optmization LQ2= SELECT * FROM L2 WHERE location LIKE "%Modena%" LQ1= SELECT * FROM L1 WHERE City LIKE "%Modena%" INNER JOIN RIGHT JOIN LQ1 FULL JOIN LQ2 Apply resolution functions: price =AVG() Apply residual constraints : price < 200 Result SELECT * FROM G WHERE city LIKE "%Modena%" AND price < 200 AND stars = 4 AND free_wifi = true AND free_wifi = true AND stars = 4

MILOS XML Search Engine: Structure search Fielded search Full text search Multimedia search Schema independent XQuery support(SOAP Web Service) MultiMedia doc. serv.:Allows homoneous acces to heterogeneous media (SOAP Web Service) Metadata Editor: Visual Basic (SOAP Comm.) Retrieval Interface: JSP(SOAP Comm.) Metadata independence: The schema seen in the interface logic can be different of the one(s) used in the repository Repository Metadata Integrator:Access to documents Access to metadata Metadata indepence (SOAP Web Service)

MILOS (2) • The MILOS system is based on a three–tier distributed architecture: • Client tier This is the top most level of the system. It contains client application that interacts with MILOS and that displays results to user applications. • Business logic It manages query processing by integrating and aligning information stored in the databases. It performs reconciliation of retrieved data by managing ranking. • Data tier It is composed of the Large Object Database, that physically stores multimedia documents managed by the system and the metadata database, where all metadata associated with the multimedia items are stored. • Multimedia metadata are represented in the data tier in XML formats. MILOS adopts a native XML database, which supports XML query language standards and offers advanced search and indexing functionality on arbitrary XML documents. • MILOS XML database provides full–text search, automatic classification, and feature similarity search functionalities. • the Large Object Database permits clients of MILOS to deal with multimedia in an uniform way.

The MedRank algorithm • Whenever there are multiple multimedia attributes strange side effects can affect the precision of the answer. • Example: • Suppose we have two image database consisting of monument images. • MS1: provides image similarity and geografic coordinates • MS2: provides only image similarity • The query consists of a sample image and a point coordinates

SELECT… FROM WorldMonuments ORDER BY Image(URL), GeoCoord(41°53'43.68"N, 12°28'56.34"E ) STOP AFTER 5 First element retrieved ORDER BY Image(URL), GeoCoord(41°53'43.68"N, 12°28'56.34"E ORDER BY Image(URL) Dist = 1km Dist = 1km MS1 MS2 Dist = 1km Dist = 1km Dist = 2km Roma. Palazzo della Civiltà del Lavoro. EUR

DMS: Assumptions • The rationale of the above assumptions is that our aim is to work in a general environment with heterogeneous DMSs for which we do not have any knowledge of their scoring functions. • The motivation is that the final scores themselves are often the result of the contributions of the scores of each attribute. A scoring function is therefore usually defined as an aggregation over partial heterogeneous scores (e.g., the relevance for text-based IR with keyword queries, or similarity degrees for color and texture of images in a multimedia database). • Even in the simpler case of single multimedia attributes the knowledge of the scores become meaningless outside the context in which they are evaluated. As an example consider the TF * IDF scoring function used by normaltext search engines. The score of a document depends upon the collection statistics and search engines could use different scoring algorithms. • However, the above assumptions of considering a local DMS as a black box that does not return any score associated to result elements, do not presume that local DMSs do not use internally scoring functions for combing different multimedia attributes . • Typically modern multimedia systems use fuzzy logic to aggregate scores of different multimedia attributes that are graded in the interval [0,1]. Classical examples of thesefunctions are the min and mean functions.

Computation of Local Query conditions • Each atomic predicate Piand similarity predicate in the global query are rewritten into corresponding constraints supported by the local classes. • For example, the constraints stars = 3 is translated into a constrain Stars = 3 considering the local class resort and is not translated into any constraint considering the local class hotel.

Computation of Residual Conditions • Conditions on not homogeneous standard attributes cannot be translated into local conditions: they are considered as residual and have to be solved at the global level.

Computation of Residual Conditions • for multimedia attribute we use the MOST_SIMILAR. For example, suppose we are searching for images similar to one specified in the query by means of ’ORDER BY’ clause. If we retrieve two or more multimedia objects with one or more corresponding images, MOST_SIMILAR function will simply select the image that is more similar to the query image. • However since we do not know scores, how do we evaluate similarity?

Computation of Residual Conditions • Rank Based Similarity: • we simply exploit the rank of the objects in the returned list as indicator of similarity between the attributes values belonging to the objects. • This aspect is related with the problem of the fusion

Fusion of local answers • For each local source involved in the global query, a local query is generated and executed on the local sources. The local answers are fused into the global answer on the basis of the mapping query qGdefined for G, i.e. by using the Full Outerjoin-merge (FOJ) operation. • Computation of the full outer join of local answers (FOJ). The result of this operation is ordered on the basis of the multimedia attributes specified in the query, this aspect is deeply examined in the next Slide. • Application of the Resolution Functions : for each attribute GA of the global query the related Resolution Function is applied to FOJ

Ranking the results • In principle, if we had ALL the (fused) records of the result set we can exploit an optimal rank aggregation method based on a distance measure to quantify the disagreements among different rankings. • In this respect the overall ranking is the one that has minimum distance to the different rankings obtained from different sources. • Several different distance measures are available in literature. However, the difficult of solving the problem of distance-based rank aggregation is related to the choice of the distance measure and its corresponding complexity that can be even NP-Hard in some cases (see Kendall distance). • However, fortunately, our case falls into this category of the partial rank aggregation problems, in which we measures the distance between only the top-k lists rather than fully ranked lists.

Example1 A: ( 1 , 2 , 3 ) B: ( 1 , 1 , 2 ) C: ( 3 , 3 , 4 ) D: ( 3 , 4 , 4 ) 1 http://www.cs.helsinki.fi/u/tsaparas/InformationNetworks/lectures/lecture10.ppt

Combining rankings • In many cases the scores are not known • e.g. meta-search engines – scores are proprietary information • … or we do not know how they were obtained • one search engine returns score 10, the other 100. What does this mean? • … or the scores are incompatible • apples and oranges: does it make sense to combine price with distance? • In this cases we can only work with the rankings

Query Processing in a Mediator System for Data and Multimedia