Case Study: BibFinder



  1. Case Study: BibFinder • BibFinder: a popular CS bibliographic mediator • Integrates 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore, ScienceDirect, Network Bibliography, CSB, CiteSeer • More than 58,000 real user queries collected • Mediated schema relation in BibFinder: paper(title, author, conference/journal, year) • Primary key: title + author + year • Focus on selection queries, e.g. Q(title, author, year) :- paper(title, author, conference/journal, year), conference=SIGMOD

  2. Selecting top-K sources for a given query • Given a query Q and sources S1…Sn, we need the coverage and overlap statistics of each source Si w.r.t. Q • P(S|Q) is the coverage of S (the probability that a random tuple belonging to Q is exported by source S) • P({S1..Sj}|Q) is the overlap between S1..Sj w.r.t. Q (the probability that a random tuple belonging to Q is exported by all of the sources S1..Sj) • Given the coverage and overlap statistics, we can pick the top-K sources that will give the maximal number of tuples for Q (illustrated below)
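
As a concrete illustration of these statistics (a minimal sketch; the three-source subset and all probability values are made up, not mined from BibFinder), the coverage and overlap for a single query Q can be kept as a map from source sets to probabilities:

```python
# Hypothetical coverage/overlap statistics for one query Q.
# The value for a singleton {S} is the coverage P(S|Q); the value
# for a larger set {S1..Sj} is the overlap P({S1..Sj}|Q), i.e. the
# probability that a random tuple of Q is exported by ALL of S1..Sj.
stats = {
    frozenset({"DBLP"}): 0.60,
    frozenset({"CSB"}): 0.45,
    frozenset({"CiteSeer"}): 0.30,
    frozenset({"DBLP", "CSB"}): 0.25,
    frozenset({"DBLP", "CiteSeer"}): 0.20,
    frozenset({"CSB", "CiteSeer"}): 0.08,
    frozenset({"DBLP", "CSB", "CiteSeer"}): 0.05,
}
```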

  3. Computing the effective coverage provided by a set of sources Suppose we call 3 sources S1, S2, S3 to answer a query Q. The effective coverage we get is P(S1∪S2∪S3|Q). To compute this union, we need the intersection (overlap) statistics in addition to the coverage statistics. Given these, we can pick the optimal 3 sources for answering Q by considering all 3-sized subsets of the source set S1…Sn and picking the subset with the highest coverage.
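
A minimal sketch of both steps, reusing the illustrative `stats` map above: the union coverage comes from inclusion-exclusion over the overlap statistics, and the optimal K-sized subset is found by exhaustive enumeration.

```python
from itertools import combinations

def union_coverage(sources, stats):
    """P(S1 u ... u Sk | Q) via inclusion-exclusion over overlap stats."""
    total = 0.0
    for r in range(1, len(sources) + 1):
        sign = 1.0 if r % 2 == 1 else -1.0
        for subset in combinations(sources, r):
            # Missing entries are treated as zero overlap (e.g. pruned).
            total += sign * stats.get(frozenset(subset), 0.0)
    return total

def best_k_exhaustive(all_sources, k, stats):
    """Consider every k-sized subset; return the one with highest coverage."""
    return max(combinations(all_sources, k),
               key=lambda subset: union_coverage(subset, stats))

# E.g. with the stats above, the best pair is DBLP + CSB:
# P(DBLP u CSB | Q) = 0.60 + 0.45 - 0.25 = 0.80
```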

  4. Selecting top-K sources: the greedy way Selecting the optimal K sources is hard in general. One way to reduce cost is to select sources greedily, one after another. For example, to select 3 sources, we pick the first source Si as the source with the highest P(Si|Q) value. To pick the jth source, we compute the residual coverage of each remaining source given the j-1 sources already picked (the residual coverage computation requires overlap statistics). For example, picking a third source in the context of sources S1 and S2 requires us to calculate: P(S3 ∧ ¬(S1∨S2) | Q) = P(S3|Q) − P(S1∩S3|Q) − P(S2∩S3|Q) + P(S1∩S2∩S3|Q)
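
A sketch of the greedy loop (reusing `union_coverage` from the previous sketch): the residual coverage of a candidate S given the already-picked set P is P(P ∪ {S}|Q) − P(P|Q), which expands into exactly the kind of expression above.

```python
def greedy_top_k(all_sources, k, stats):
    """Greedily pick k sources, each maximizing residual coverage."""
    picked = []
    for _ in range(k):
        remaining = [s for s in all_sources if s not in picked]
        # residual(S) = P(picked u {S} | Q) - P(picked | Q)
        best = max(remaining,
                   key=lambda s: union_coverage(picked + [s], stats)
                                 - union_coverage(picked, stats))
        picked.append(best)
    return picked
```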

  5. What good is a high coverage source that is off-line? • Sources vary significantly in their response times • The response time depends both on the source itself and on the query asked of it • Specifically, which fields are bound in the selection query can make a difference • It is hard enough to get a high-coverage or a low-response-time plan; now we have to combine them • Question: how do we define an optimal plan in the context of both coverage/overlap and response-time requirements?

  6. Response time can depend on the query type [Charts: range queries on year; effect of binding the author field.] Response times can also depend on the time of day and the day of the week.

  7. Multi-objective Query Optimization • Queries need to be optimized jointly for both high coverage and low response time • Staged optimization (optimizing for one objective first, then the other) won't quite work • An idea: make the source selection depend on both (residual) coverage and response time, as in the sketch below
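
One plausible realization (a sketch; the linear weighting and the normalization are assumptions here, not necessarily BibFinder's actual cost model) scores each candidate source by combining its residual coverage with its normalized response time, reusing `union_coverage` from the earlier sketch:

```python
def joint_utility(source, picked, stats, response_time, w=0.7):
    """Trade off residual coverage (maximize) against response time
    (minimize). `response_time` maps each source to an average latency;
    w is an illustrative tradeoff weight, not BibFinder's."""
    residual = (union_coverage(picked + [source], stats)
                - union_coverage(picked, stats))
    norm_rt = response_time[source] / max(response_time.values())
    return w * residual - (1 - w) * norm_rt
```

Selecting greedily by this utility instead of by residual coverage alone makes the source order sensitive to both objectives at once, which is the point of avoiding staged optimization.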

  8. Results on BibFinder

  9. Challenges • Sources are incomplete and partially overlapping • Calling every possible source is inefficient and impolite • Coverage and overlap statistics are needed to figure out which sources are most relevant for every possible query • We introduce a frequency-based approach for mining these statistics

  10. Outline • Motivation • BibFinder/StatMiner Architecture • StatMiner Approach • Automatically learning AV hierarchies • Discovering frequent query classes • Learning coverage and overlap statistics • Using coverage and overlap statistics • StatMiner evaluation with BibFinder • Related Work • Conclusion

  11. Challenges of gathering coverage and overlap statistics • It's impractical to assume that the sources will export such statistics, because the sources are autonomous • It's impractical to learn and store the statistics for every query: doing so would necessitate roughly N · 2^n different statistics, where N is the number of possible queries and n is the number of sources • It's impractical to assume knowledge of the entire query population a priori • We introduce StatMiner, a threshold-based hierarchical mining approach: • Store statistics w.r.t. query classes rather than individual queries • Keep more accurate statistics for more frequently asked queries • Handle the efficiency/accuracy tradeoff by adjusting the thresholds (see the sketch below)
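
A sketch of the class-based storage idea (the class names, hierarchy, and single-ancestor fallback are simplifications; StatMiner's actual mapping can combine several close classes): statistics are stored only for classes that were asked frequently enough, and a new query falls back to the nearest ancestor class that kept statistics.

```python
# Hypothetical query-class hierarchy: each class points to its parent.
parent = {
    "SIGMOD&2004": "DB&2004",
    "DB&2004": "DB",
    "DB": "ROOT",
}
# Coverage statistics kept only for frequent classes (values invented).
class_stats = {
    "DB&2004": {"DBLP": 0.65, "CSB": 0.40},
    "ROOT": {"DBLP": 0.50, "CSB": 0.35},
}

def stats_for(query_class):
    """Walk up the hierarchy to the nearest class with stored stats."""
    c = query_class
    while c not in class_stats and c != "ROOT":
        c = parent.get(c, "ROOT")
    return class_stats.get(c, {})

# stats_for("SIGMOD&2004") falls back to the "DB&2004" statistics.
```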

  12. BibFinder/StatMiner

  13. Query List

  14. AV Hierarchies and Query Classes

  15. StatMiner

  16. Using Coverage and Overlap Statistics to Rank Sources

  17. Outline • Motivation • BibFinder/StatMiner Architecture • StatMiner Approach • Automatically learning AV hierarchies • Discovering frequent query classes • Learning coverage and overlap statistics • Using coverage and overlap statistics • StatMiner evaluation with BibFinder • Related Work • Conclusion

  18. BibFinder/StatMiner Evaluation • Experimental setup with BibFinder: • Mediator relation: paper(title, author, conference/journal, year) • 25,000 real user queries are used; among them, 4,500 queries are randomly chosen as test queries • AV hierarchies for all four attributes are learned automatically • 8,000 distinct values in author, 1,200 frequently asked keyword itemsets in title, 600 distinct values in conference/journal, and 95 distinct values in year

  19. Learned Conference Hierarchy

  20. Space Consumption for Different minfreq and minoverlap • We use a threshold on the support of a class, called minfreq, to identify frequent classes • We use a minimum support threshold, minoverlap, to prune overlap statistics for uncorrelated source sets (a pruning sketch follows) • As we increase either of these two thresholds, the memory consumption drops, especially at first
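
A sketch of the two pruning steps (the data layout is an assumption; the thresholds are the minfreq and minoverlap from this slide):

```python
def prune(class_freq, overlap_stats, minfreq, minoverlap):
    """Drop infrequent classes entirely; inside each kept class, drop
    overlap entries for weakly correlated source sets."""
    kept = {}
    for cls, freq in class_freq.items():
        if freq < minfreq:
            continue  # infrequent class: store no statistics for it
        kept[cls] = {
            srcs: p for srcs, p in overlap_stats[cls].items()
            # Singleton sets are coverage statistics; always keep them.
            if len(srcs) == 1 or p >= minoverlap
        }
    return kept
```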

  21. Accuracy of the Learned Statistics • The absolute error shows no dramatic increase as the thresholds grow • Keeping very detailed overlap statistics does not necessarily increase accuracy, yet requires much more space. For example: minfreq=0.13 and minoverlap=0.1 versus minfreq=0.33 and minoverlap=0

  22. Plan Precision • Here we observe the average precision of the top-2 source plans • Plans using our learned statistics have high precision compared to random selection, and the precision decreases very slowly as we raise the minfreq and minoverlap thresholds

  23. Plan Precision on Controlled Sources We observe the plan precision of top-5 source plans (25 simulated sources in total). Greedy selection does produce better plans. See Section 3.8 and Section 3.9 for details.

  24. Number of Distinct Results • Here we observe the average number of distinct results of the top-2 source plans • Our method gets 50 distinct answers on average, while random selection gets only about 30

  25. Applications • Path selection in bioinformatics [LNRV03] • More and more bioinformatics sources are available on the Internet • Thousands of paths exist for answering users' queries • Path coverage and overlap statistics are needed • Text database selection in information retrieval • StatMiner can provide a better way of learning and storing representatives of the databases • Main ideas • Maintain a query list and discover frequently asked keyword sets • Learn a keyword-set hierarchy based on the distance between their statistics • Learn and store coverage (document frequency) for frequently asked keyword-set classes • Map a new query to a set of close classes and use their statistics to estimate statistics for the query • Advantages • Handles multi-word terms and scales well
