Semantically Enabled Search
Supervisor: Victor Shafran
Students: Anna Burla, Lev Haikin

Goal
• Create a semantically enabled search capability for web service compositions over a large repository of WSDLs
• Research a new approach that combines semantic web technologies with "regular" search engines
• Design and implement a useful approximation of the goal that reveals a group of web services answering a given query

Scope Reminder
[Diagram: a repository of WSDLs, labeled "Inputs of WSDLs" and "Outputs of WSDLs"]

Milestones (as shown at kickoff)
• Preparation & installation
  • Set up development environment
  • Learn the tools
• Design and implement indexer using Lucene Indexer
  • By phrases as well as single words
  • Adding synonyms and normalizing words
• Define and implement the semantic and ranked search engine on top of the RDF & Indexer
  • Query with only one condition on WSDL
  • Query built of input/output types
  • Define ranking algorithm
• Bonus: Query for web service compositions
  • Integrate WSDL single chain searcher with single WSDL search by keywords

Methodology
• Two sequential modules for extracting message elements from a WSDL file
  • Hierarchical structure of elements extractor:
    • Per WSDL:
      • Extract schemas (using XSLT)
      • Unite the extracted schemas (build a uniting schema with imports)
      • Create a directory with the WSDL's name
      • Create input and output files containing element hierarchy trees by traversing the united schema (an example will follow)
    • Process duration: less than 30 min.
    • The benefit: using XSOM, which has a standard interface for parsing XSD schemas
  • Flattener:
    • Dumps flat paths out of the hierarchical structure
    • Two versions of paths: separated and un-separated
    • Separation by (upper case) words
• A module for extracting data from HTML
  • Reusing components from the previous semester
  • Currently used for enriching the scorer and the UI
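The Flattener step can be illustrated with a minimal sketch in Python. It assumes the element hierarchy is available as a nested dictionary, which is an illustrative stand-in for the files the extractor writes; the function names, the sample tree, and the space-joined path format are hypothetical and not taken from the project.

```python
import re

def split_words(name):
    """Split an element name at upper-case word boundaries (acronyms stay whole)."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)

def flatten(tree, prefix=()):
    """Yield every root-to-leaf path of a nested dict as a tuple of element names."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from flatten(children, path)
        else:
            yield path

def render(path, separated):
    """Render a path in the separated (word-split) or un-separated form."""
    if separated:
        return " ".join(word for name in path for word in split_words(name))
    return " ".join(path)

# Hypothetical hierarchy tree for one message of one WSDL.
tree = {
    "SalesOrder": {
        "BuyerParty": {"InternalID": {}},
        "Item": {"ProductID": {}},
    }
}

for path in flatten(tree):
    print(render(path, separated=False), "|", render(path, separated=True))
```

Keeping both renderings side by side mirrors the slide: the separated form suits keyword indexing, while the un-separated form preserves the original element names for later query compilation.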

Methodology
• Implementation of the indexer with Lucene Indexer
  • Indexes the separated version
  • Indexes Sandbox files
  • Uses a custom Analyzer (synonyms [customer=buyer], filters, standard, with an option for stemming)
• Implementation of search for WSDLs by input/output/free-text keywords on top of the Lucene index
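To show the effect of synonym normalization at index time, here is a small Python sketch of an inverted index. It does not use Lucene; the synonym table, the crude plural stripping that stands in for stemming, and the document names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical synonym table, following the slide's customer=buyer example.
SYNONYMS = {"buyer": "customer"}

def analyze(text, stem=False):
    """Lower-case, optionally strip a plural 's', then map synonyms to one canonical term."""
    tokens = []
    for token in text.lower().split():
        if stem and token.endswith("s") and len(token) > 3:
            token = token[:-1]  # crude stand-in for a real stemmer
        tokens.append(SYNONYMS.get(token, token))
    return tokens

def build_index(docs):
    """Map every analyzed token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the documents that contain every analyzed query token (AND semantics)."""
    postings = [index.get(token, set()) for token in analyze(query)]
    return set.intersection(*postings) if postings else set()

# Hypothetical separated paths, one per WSDL.
docs = {
    "OrderA.wsdl": "Sales Order Buyer Party Internal ID",
    "OrderB.wsdl": "Customer ID",
}
index = build_index(docs)
print(sorted(search(index, "customer id")))
```

Because the same analysis runs on both documents and queries, a search for "customer id" also reaches the WSDL that only mentions a buyer.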

Methodology
• Design and implementation of a configurable tool for revealing and analyzing single chains by first and last WSDL
• Extended graph builder: wsdl -> query -> wsdl
  • Query is compiled out of un-separated output elements by the configurable smart query compiler:
    • Filtering common elements (e.g. MessageHeader, Log..)
    • Eliminating noisy words (e.g. standard, internal..)
    • Leaves only robot fields (e.g. ends with ‘ID’)
    • Follows longest match rule (increases match integrity)
  • Query is fed into Lucene using proximity search
  • Potential for useful statistical data
  • Stores indexes instead of strings (scalability reasons)
  • Cached
    • For quick search
    • Analyze different scoring algorithms in real time
• Direct children graph builder, on top of the Extended graph:
  • wsdl => wsdl
  • Every edge consists of a list of queries that connect the WSDLs
• Statistical data containers
  • ‘Found by Count’ hash
    • wsdl => list of all queries that revealed the WSDL
  • ‘Hits map’ hash
    • query => number of WSDLs that it has found
  • Mean and Max calculator (by ‘Found by Count’)
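The relationship between the extended graph, the direct children graph, and the statistical containers can be sketched as follows. This is an illustrative Python sketch, not the project's code: the extended graph is assumed to be a nested dictionary keyed by strings (the slide says the real tool stores indexes instead of strings), and all WSDL and query names are made up.

```python
from collections import defaultdict
from statistics import mean

def direct_children(extended):
    """Collapse wsdl -> query -> wsdl into wsdl -> wsdl, keeping the connecting queries per edge."""
    graph = defaultdict(lambda: defaultdict(list))
    for src, queries in extended.items():
        for query, found in queries.items():
            for dst in found:
                graph[src][dst].append(query)
    return graph

def found_by(extended):
    """'Found by Count' container: wsdl -> list of all queries that revealed it."""
    result = defaultdict(list)
    for queries in extended.values():
        for query, found in queries.items():
            for dst in found:
                result[dst].append(query)
    return result

def hits_map(extended):
    """'Hits map' container: query -> number of distinct WSDLs it has found."""
    hits = defaultdict(set)
    for queries in extended.values():
        for query, found in queries.items():
            hits[query].update(found)
    return {query: len(wsdls) for query, wsdls in hits.items()}

def mean_and_max(found_by_map):
    """Mean and maximum number of revealing queries per WSDL."""
    counts = [len(queries) for queries in found_by_map.values()]
    return mean(counts), max(counts)

# Hypothetical extended graph.
extended = {
    "A": {"material id": ["B", "C"], "order id": ["B"]},
    "B": {"confirmation id": ["C", "D"]},
}
print(dict(direct_children(extended)["A"]))
print(hits_map(extended))
print(mean_and_max(found_by(extended)))
```

A query with an unusually high hit count in the hits map is a natural candidate for noise filtering, which is how the chain lookup on the next slide uses it.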

Methodology
• Recursive procedure for chain lookup between start WSDLs and end WSDLs
  • Search is done over the direct children graph
  • Extension for multiple start nodes and end nodes by adding virtual source and destination nodes to the graph
  • Max search depth is defined as an input of the procedure
  • Filters out WSDLs whose number of hits is too high (by hits map), and filters based on system type (WSDL attributes)
• Score algorithm uses statistical data containers, the direct children graph and WSDL attributes (application component)
  • Might use keywords for improving scoring (as hints)
• Crucial performance improvement using a pruning algorithm (from to depth*n on average)
• Using cycle detector
• Prints chains sorted by score, with connection info
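A depth-limited recursive lookup with virtual source and destination nodes can be sketched in Python as below. This is a simplified illustration under stated assumptions: the graph is a plain adjacency dictionary, `max_depth` is taken to mean the maximum number of WSDLs in a chain, the hits-map and system-type filters and the pruning algorithm are left out, and sorting by chain length is only a placeholder for the real score.

```python
def find_chains(graph, starts, ends, max_depth):
    """Return all cycle-free chains from any start WSDL to any end WSDL."""
    source, sink = object(), object()  # virtual nodes, distinct from any WSDL name
    g = {node: list(children) for node, children in graph.items()}
    g[source] = list(starts)           # virtual source points at every start node
    for end in set(ends):
        g[end] = g.get(end, []) + [sink]  # every end node points at the virtual sink
    chains = []

    def walk(path):
        for child in g.get(path[-1], []):
            if child is sink:
                chains.append(path[1:])  # drop the virtual source
            elif child not in path and len(path) - 1 < max_depth:
                walk(path + [child])     # 'child not in path' is the cycle detector

    walk([source])
    return sorted(chains, key=len)       # placeholder ordering, not the real score

# Hypothetical direct children graph containing a cycle A -> B -> A.
graph = {"A": ["B", "C"], "B": ["C", "A"], "C": ["D"], "D": []}
for chain in find_chains(graph, ["A"], ["D"], max_depth=4):
    print(" -> ".join(chain))
```

Adding the two virtual nodes keeps the recursion identical for one or many start and end WSDLs, since the search always runs from a single source to a single sink.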

Methodology
• Testing
  • Automatic testing over a golden set
  • Easy to configure and run
  • Outputs a summary as well as detailed output
  • A snapshot of a test XML file:

Methodology
• UI
  • Web-based UI
  • Find WSDLs by input/output/free text
  • Find WSDL compositions
  • Show relevant information for results with convenient comparison options
  • Visualize the results graphically

Achievements
• We introduced a framework that enables finding WSDL compositions by input and output keywords
  • Based on heuristics
  • Configurable
    • Scorer
    • SmartQueryCompiler
  • Ad-hoc semantic search
  • With testing over a variable golden set
  • A UI for the user and the researcher

Conclusions
• Conceptual conclusions:
  • The WSDL world of SAP has a complex structure
    • Lots of similar WSDLs
    • No standard for element types across WSDLs
    • Elements that should be matched are not necessarily syntactically or semantically similar (SalesOrderBuyerPartyInternalID vs. CustomerID)
    • No restrictions on elements in the XSD schemas, and no priority for elements
  • All these reasons essentially forced us to do proximity search and to focus on building useful data-collecting procedures with analysis capabilities
• Technical conclusions:
  • A set of heuristics for the matcher
    • For example: if the last element equals 'ID' or 'UUID', then glue it to the previous element, separate it by uppercase, and wrap it with a "second order proximity search syntax".
  • Noise is the problem!
    • Scoring can help
  • We presented easily configurable and testable mechanisms for improving the integrity
    • For example: the extended graph presents potential for calculating lots of meaningful statistical data, which can be used to calculate the score or to define roles for different WSDLs
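The ID/UUID heuristic mentioned above can be illustrated with a short Python sketch. It is one possible reading of the rule, not the project's SmartQueryCompiler: the function names and the slop value are hypothetical, only the last path element is used, and the output is an ordinary Lucene-style proximity query (`"terms"~N`) rather than the "second order" syntax, which the slides do not define.

```python
import re

ID_SUFFIXES = {"ID", "UUID"}

def split_words(name):
    """Split an element name at upper-case word boundaries (acronyms stay whole)."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)

def compile_query(path, slop=2):
    """Build a proximity query from the last element of an un-separated path.

    If the last element is a bare 'ID' or 'UUID', it is first glued to the
    previous element so that the identifier keeps its owning entity.
    """
    elements = list(path)
    if len(elements) >= 2 and elements[-1] in ID_SUFFIXES:
        elements[-2:] = [elements[-2] + elements[-1]]
    words = split_words(elements[-1])
    return '"{}"~{}'.format(" ".join(words), slop)

print(compile_query(["ServiceConfirmation", "ServiceExecutionOrder", "ID"]))
print(compile_query(["SalesOrder", "BuyerParty", "InternalID"]))
```

Gluing matters because a bare `ID` on its own would match almost every WSDL, which is exactly the kind of noise the conclusions warn about.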

Demonstration
• Story: A technician who repaired damaged equipment wants to report his work and to confirm the service order operation in the ERP system. The only information that he has is 'material id' and he wants to find the service execution order. At the end he wants to confirm the service operation and to receive the service confirmation and service execution order ID to save in his records.
  • Start:
    • Input keywords: "material id"
    • Free text: find AND "service execution order"
  • End:
    • Output keywords: "service confirmation" AND "Service Execution Order ID"
• Demonstration of web-based UI
  • Search for WSDLs + explanation of results
  • Search for WSDL compositions + explanation of results + visualization
  • Automatic testing and summary
• Demonstrate the start of documentation of the project
  • User manual
  • Developer's manual
  • Inline documentation of the code: JavaDoc

Test Summary

FlexViz snapshot

UI Backup
