
Similarity Search for Web Services


Presentation Transcript


  1. Similarity Search for Web Services Xin (Luna) Dong, Alon Halevy, Jayant Madhavan, Ema Nemes, Jun Zhang University of Washington

  2. Web Service Search • Web services are becoming popular within organizations and on the web • The growing number of web services raises the problem of web-service search • First-generation web-service search engines perform keyword search over web-service descriptions • Examples: BindingPoint, Grand Central, Web Service List, Salcentral, Web Service of the Day, Remote Methods, etc.

  3. Keyword Search does not Capture the Underlying Semantics [screenshot: keyword search for "zip"]

  4. Keyword Search does not Capture the Underlying Semantics [screenshot: 50 results]

  5. Keyword Search does not Capture the Underlying Semantics [screenshot: keyword search for "zipcode"]

  6. Keyword Search does not Capture the Underlying Semantics [screenshot: 18 results]

  7. Keyword Search does not Accurately Specify Users’ Information Needs

  8. Keyword Search does not Accurately Specify Users’ Information Needs

  9. Users Need to Drill Down to Find the Desired Operations: choose a web service

  10. Users Need to Drill Down to Find the Desired Operations: choose an operation

  11. Users Need to Drill Down to Find the Desired Operations: enter the input parameters

  12. Users Need to Drill Down to Find the Desired Operations: results (output)

  13. How to Improve Web Service Search? • Offer users more flexibility by providing similar operations • Base the similarity comparison on the underlying semantics

  14. 1) Provide Similar WS Operations • Op1: GetTemperature • Input: Zip, Authorization • Output: Return • Op2: WeatherFetcher • Input: PostCode • Output: TemperatureF, WindChill, Humidity Similar Operations → Select the most appropriate one

  15. 2) Provide Operations with Similar Inputs/Outputs • Op1: GetTemperature • Input: Zip, Authorization • Output: Return • Op2: WeatherFetcher • Input: PostCode • Output: TemperatureF, WindChill, Humidity • Op3: LocalTimeByZipcode • Input: Zipcode • Output: LocalTimeByZipCodeResult • Op4: ZipCodeToCityState • Input: ZipCode • Output: City, State Similar Inputs → Aggregate the results of the operations

  16. 3) Provide Composable WS Operations • Op1: GetTemperature • Input: Zip, Authorization • Output: Return • Op2: WeatherFetcher • Input: PostCode • Output: TemperatureF, WindChill, Humidity • Op3: LocalTimeByZipcode • Input: Zipcode • Output: LocalTimeByZipCodeResult • Op4: ZipCodeToCityState • Input: ZipCode • Output: City, State • Op5: CityStateToZipCode • Input: City, State • Output: ZipCode Input of Op2 is similar to Output of Op5 → Compose web-service operations
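
Slide 16's composition criterion can be phrased as a simple predicate. A minimal sketch, assuming operations are represented as dicts with "inputs"/"outputs" lists and that some input/output similarity function is available (names and the threshold are mine, not the system's API):

```python
def composable(producer, consumer, io_similarity, threshold=0.7):
    """Composability check behind slide 16: the producer's output should be
    similar to the consumer's input. io_similarity is any function that
    compares two parameter-name lists (one is sketched after slide 28);
    the threshold value is an illustrative assumption."""
    return io_similarity(producer["outputs"], consumer["inputs"]) >= threshold
```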

  17. Searching with Woogle [screenshot: search results with links for Similar Operations, Similar Inputs/Outputs, and operations Composable with the Input/Output]

  18. Searching with Woogle [screenshot: a sample list of similar operations; users can jump from operation to operation]

  19. Elementary Problems • Two elementary problems: • Operation matching: Given a web-service operation, return a list of similar operations • Input/output matching: Given the input/output of a web-service operation, return a list of web-service operations with similar inputs/outputs • Goal: • High recall: Return potentially similar operations • Good ranking: Rank closer operations higher

  20. Can We Apply Previous Work? • Software component matching • Requires knowledge of the implementation – we only know the interface • Schema matching • Similarity at a different granularity • Web services are more loosely related • Text document matching • TF/IDF: term frequency analysis • E.g. Google
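
Slide 20's last bullet refers to TF/IDF text matching as the standard document-similarity baseline. For concreteness, a minimal sketch of such a baseline over web-service description texts; the tokenizer and weighting details are generic assumptions, not the exact baseline used in the talk's evaluation:

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Naive word tokenizer; the actual preprocessing is not specified in the slides.
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(descriptions):
    """Build one sparse TF/IDF vector (dict) per textual description."""
    term_counts = [Counter(tokenize(d)) for d in descriptions]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    n = len(descriptions)
    return [{t: (1 + math.log(c)) * math.log(n / doc_freq[t])
             for t, c in counts.items()}
            for counts in term_counts]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Slides 21-26 explain why this baseline falls short for web services: the descriptions are short, and the structure of operations is ignored.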

  21. Why Text Matching Does not Apply? • Web page: often long text; web service: very brief description → Lack of information

  22. Web Services Have Very Brief Descriptions

  23. Why Text Matching Does not Apply? • Web page: often long text; web service: very brief description → Lack of information • Web page: mainly plain text; web service: more complex structure → Finding term frequency is not enough

  24. Operations Have More Complex Structures • Op1: GetTemperature • Input: Zip, Authorization • Output: Return • Op2: WeatherFetcher • Input: PostCode • Output: TemperatureF, WindChill, Humidity • Op3: LocalTimeByZipcode • Input: Zipcode • Output: LocalTimeByZipCodeResult • Op4: ZipCodeToCityState • Input: ZipCode • Output: City, State • Op5: CityStateToZipCode • Input: City, State • Output: ZipCode Similar use of words, but opposite functionality

  25. Our Solution Part 1: Exploit Structure [diagram: each web-service description in the Web Service Corpus is split into the operation name and description, the input parameter names, and the output parameter names, which together feed the Operation Similarity computation]
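
The slide's diagram only says that the three kinds of evidence feed the operation-similarity computation. A minimal sketch of one way to combine them, assuming an operation is a dict with "desc" (word list), "inputs", and "outputs"; the Jaccard measure and the weights are illustrative assumptions, not Woogle's actual formula:

```python
def jaccard(a, b):
    """Overlap between two bags of words or parameter names."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def operation_similarity(op1, op2, w_desc=0.4, w_in=0.3, w_out=0.3):
    """Weighted combination of the three structural components of an operation."""
    return (w_desc * jaccard(op1["desc"], op2["desc"])
            + w_in * jaccard(op1["inputs"], op2["inputs"])
            + w_out * jaccard(op1["outputs"], op2["outputs"]))
```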

  26. Why Text Matching Does not Apply? • Web page: often long text; web service: very brief description → Lack of information • Web page: mainly plain text; web service: more complex structure → Finding term frequency is not enough • Operation and parameter names are highly varied → Finding word usage patterns is hard

  27. Parameter Names Are Highly Varied • Op1: GetTemperature • Input: Zip, Authorization • Output: Return • Op2: WeatherFetcher • Input: PostCode • Output: TemperatureF, WindChill, Humidity • Op3: LocalTimeByZipcode • Input: Zipcode • Output: LocalTimeByZipCodeResult • Op4: ZipCodeToCityState • Input: ZipCode • Output: City, State • Op5: CityStateToZipCode • Input: City, State • Output: ZipCode

  28. Our Solution Part 2: Cluster Parameters into Concepts [diagram: parameter names across the Web Service Corpus are clustered into Concepts; Operation Similarity now uses the operation name and description, the input parameter names & concepts, and the output parameter names & concepts]
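
Once parameter names are clustered (slides 30-36), inputs and outputs can be compared at the concept level in addition to the raw-name level. A sketch, assuming a simple concept_of lookup table and equal weighting of the two levels; note the real system clusters the terms inside parameter names rather than whole names:

```python
def to_concepts(param_names, concept_of):
    """Map each parameter name to its concept id; names that were not
    clustered fall back to themselves. (Simplification: whole names are
    mapped, whereas the system works on the terms inside the names.)"""
    return {concept_of.get(p.lower(), p.lower()) for p in param_names}

def io_similarity(params1, params2, concept_of):
    """Compare two inputs (or outputs) on both parameter names and concepts."""
    jaccard = lambda a, b: len(a & b) / len(a | b) if a | b else 0.0
    names1, names2 = {p.lower() for p in params1}, {p.lower() for p in params2}
    concepts1 = to_concepts(params1, concept_of)
    concepts2 = to_concepts(params2, concept_of)
    return 0.5 * jaccard(names1, names2) + 0.5 * jaccard(concepts1, concepts2)
```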

  29. Outline • Overview • Clustering parameter names • Experimental evaluation • Conclusions and ongoing work

  30. Clustering Parameter Names • Heuristic: Parameter terms tend to express the same concept if they occur together often • Strategy: Cluster parameter terms into concepts based on their co-occurrences • Given terms p and q, similarity from p to q: • Sim(p → q) = P(q | p) • Directional: e.g. Sim(zip → code) > Sim(code → zip) (ZipCode vs. TeamCode, ProxyCode, BarCode, etc.) • Term p is close to q: • Sim(p → q) > threshold; e.g. city is close to state.
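
A minimal sketch of estimating the directional similarity Sim(p → q) = P(q | p) from co-occurrence counts over all inputs and outputs in the corpus (function and variable names are mine):

```python
from collections import Counter
from itertools import permutations

def directional_similarity(parameter_lists):
    """parameter_lists: one list of terms per input or output in the corpus,
    e.g. [["zip"], ["city", "state"], ["zip", "code"], ...].
    Returns sim[(p, q)] = P(q | p), the fraction of p's occurrences that are
    accompanied by q."""
    occurrences = Counter()    # in how many inputs/outputs each term appears
    pair_counts = Counter()    # in how many inputs/outputs an ordered pair co-occurs
    for terms in parameter_lists:
        unique = set(terms)
        occurrences.update(unique)
        for p, q in permutations(unique, 2):
            pair_counts[(p, q)] += 1
    return {(p, q): count / occurrences[p]
            for (p, q), count in pair_counts.items()}
```

On realistic data, sim[("zip", "code")] comes out higher than sim[("code", "zip")], since "code" also co-occurs with team, proxy, bar, and so on.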

  31. Criteria for an Ideal Clustering • High cohesion and low correlation • Cohesion measures the intra-cluster term similarity • Correlation measures the inter-cluster term similarity • Cohesion/correlation score = cohesion / correlation, i.e. the ratio of average intra-cluster term similarity to average inter-cluster term similarity; the higher the score, the better the clustering
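
Under the assumption that cohesion and correlation are the average Sim over intra-cluster and inter-cluster term pairs respectively (the slide does not spell out the averaging), the score can be computed as follows:

```python
def clustering_score(clusters, sim):
    """clusters: list of sets of terms; sim: dict {(p, q): Sim(p -> q)}.
    Returns cohesion / correlation; one plausible instantiation of slide 31."""
    intra, inter = [], []
    for i, cluster_i in enumerate(clusters):
        for j, cluster_j in enumerate(clusters):
            for p in cluster_i:
                for q in cluster_j:
                    if p != q:
                        (intra if i == j else inter).append(sim.get((p, q), 0.0))
    cohesion = sum(intra) / len(intra) if intra else 0.0
    correlation = sum(inter) / len(inter) if inter else 0.0
    return cohesion / correlation if correlation else float("inf")
```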

  32. Clustering Algorithm (I) • Algorithm – a series of refinements of the classic agglomerative clustering • Basic agglomerative clustering: merge clusters I and J if some term i in I is close to some term j in J

  33. Clustering Algorithm (II) • Problem: {temperature, windchill} + {zip} => {temperature, windchill, zip} • Solution: • Cohesion condition: each term in the result cluster is close to most (e.g. half) of the other terms in the cluster • Refined Algorithm: merge clusters I and J only if the result cluster satisfies the cohesion condition

  34. Clustering Algorithm (III) • Problem: {code, zip} + {city, state, street} should yield {code} + {zip, city, state, street} • Solution: split before merge [diagram: cluster I is split into a sub-cluster I' and the remainder I-I', and J into J' and J-J'; only the matching sub-clusters are merged]

  35. Clustering Algorithm (IV) • Problem: {city, state, street} + {zip, code} => {city, state, street, zip, code} • Solution: • Noise terms: terms for which most (e.g. half) of the occurrences are not accompanied by the other terms in the concept • After a pass of splitting and merging, remove noise terms.

  36. Clustering Algorithm (V) • Problems: • The cohesion condition is too strict for large concepts • The terms taken off during splitting lose the chance to merge with other terms • Solution: Run the algorithm iteratively do { refined agglomerative clustering (a pass of splitting and merging); remove noise terms; replace each term with its concept; } while (new merges occurred)
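
Pulling slides 32-36 together, a compact sketch of the iterative, refined agglomerative clustering. The closeness threshold, the "close to at least half of the other terms" cohesion test, and the merge order are assumptions; splitting before merging and noise-term removal are only indicated by comments:

```python
def cluster_terms(terms, sim, threshold=0.8, cohesion_fraction=0.5):
    """terms: parameter terms; sim: dict {(p, q): Sim(p -> q)}."""
    close = lambda p, q: sim.get((p, q), 0.0) > threshold
    clusters = [{t} for t in terms]
    merged = True
    while merged:  # iterate until a full pass yields no new merges (slide 36)
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i] is None or clusters[j] is None:
                    continue
                candidate = clusters[i] | clusters[j]
                # Cohesion condition (slide 33): every term must be close to
                # at least cohesion_fraction of the other terms in the cluster.
                if all(sum(close(p, q) for q in candidate if q != p)
                       >= cohesion_fraction * (len(candidate) - 1)
                       for p in candidate):
                    clusters[i], clusters[j] = candidate, None
                    merged = True
        clusters = [c for c in clusters if c is not None]
        # Splitting before merging (slide 34) and noise-term removal (slide 35)
        # would also happen in each pass; they are omitted in this sketch.
    return clusters
```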

  37. Outline • Overview • Clustering parameter names • Experimental evaluation • Conclusions and ongoing work

  38. Experiment Data and Clustering Results • Data set: • 790 web services (431 are active) • 1574 distinct operations • 3148 inputs/outputs • Clustering results: • 1599 parameter terms • 623 concepts • 441 single-term concepts (54 frequent terms and 387 infrequent terms) • 182 multi-term concepts (59 concepts with more than 5 terms)

  39. Example Clusters • (temperature, heatindex, icon, chance, precipe, uv, like, temprature, dew, feel, weather, wind, humid, visible, pressure, condition, windchill, dewpoint, moonset, sunrise, moonrise, sunset, heat, precipit, extend, forecast, china, local, update) • (entere, enter, pitcher, situation, overall, hit, double, strike, stolen, ball, rb, homerun, triple, caught, steal, pct, op, slug, player, bat, season, stats, position, experience, throw, players, draft, experier, birth, modifier) • (state, city) • (zip) • (code)

  40.–41. Example Clusters (repeat the clusters shown on slide 39)

  42. Measuring Top-K Precision • Benchmark • 25 web-service operations • From several domains • With different input/output sizes and description sizes • Manually label whether the top hits are similar • Measure • Top-k precision: precision for the top-k hits
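
For reference, the top-k precision measure from slide 42 as a small sketch (names are mine):

```python
def top_k_precision(ranked_hits, labeled_similar, k):
    """Fraction of the top-k returned operations manually labeled as similar."""
    return sum(1 for hit in ranked_hits[:k] if hit in labeled_similar) / k if k else 0.0
```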

  43. Top-k Precision for Operation Matching [chart comparing Woogle, a variant that ignores structure, and text matching on descriptions]

  44. Top-k Precision for Input/output Matching

  45. Measuring Precision and Recall • Benchmark: • 8 web-service operations and 15 inputs/outputs • From 6 domains • With different popularity • Inputs/outputs convey different numbers of concepts, and concepts have varied popularity • Manually label similar operations and inputs/outputs. • Measure: R-P (Recall-Precision) curve
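
Similarly, the recall-precision points behind the R-P curves of slide 45 can be computed as follows (a sketch, assuming at least one manually labeled relevant item per query):

```python
def recall_precision_points(ranked_hits, labeled_similar):
    """After each returned hit, record (recall, precision) so far."""
    points, found = [], 0
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit in labeled_similar:
            found += 1
        points.append((found / len(labeled_similar), found / rank))
    return points
```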

  46. Impact of Multiple Sources of Evidence in Operation Matching [chart comparing Woogle without clustering, text matching on descriptions, and a variant that ignores structure]

  47. Impact of Parameter Clustering in Input/output Matching [chart comparing Woogle, comparing only concepts, and comparing only parameter names]

  48. Conclusions • Defined primitives for web-service search • Algorithms for similarity search on web-service operations • Exploit structure information • Cluster parameter names into concepts based on their co-occurrences • Experiments show that the algorithm obtains high recall and precision.

  49. Ongoing Work I – Template Search on Operations • Input: city, state • Output: weather • Description: forecast in the next nine days

  50. Ongoing Work I – Template Search on Operations [screenshot: search result GetWeatherByCityState]
