
Framework for annotation and composition of web space analysis tools



  1. Framework for annotation and composition of web space analysis tools Vojtěch Svátek Department of Information and Knowledge Engineering University of Economics, Prague, Czech Republic

  2. Precursor KEG talks related to the Rainbow project • 6 November 2002 “Projekt Rainbow” (Svátek) • overview of tools developed in the project to date • 10 March 2005 “Machine Learning and the Semantic Web” (Labský) • focus on information extraction using a statistical technique • Current topic • focus on integration of different techniques and tools V. Svátek, KEG 29.9.2005

  3. Related publications (see http://rainbow.vse.cz) • Svátek et al., ISM 2002 • Labský and Svátek, DATESO 2003 • Svátek and Vacura, WWW 2003 (Poster Track) • Svátek, Labský and Vacura, EKAW 2004 • Svátek, ten Teije and Vacura, Znalosti 2005 • Svátek and Vacura, RAWS 2005

  4. Agenda • Overview of web space analysis landscape (5’) • TODD annotation framework and Rainbow collection of web space ontologies (15’) • Rainbow collection of problem solving methods and re-description of real applications (10’) • Parametric design as starting point for automated composition of classification services (10’) • Simulation of service composition and execution (15’) • Rainbow and the semantic web service cycle (5’) • Future work (5’)

  5. Web space analysis communities • Classification of documents (or images) • Information extraction from text • Web graph analysis • Web document or image retrieval • Clustering of documents or other web objects • Discovery of associations among web pages... • Natural language understanding on the web • ... and many others

  6. Web space as subject of analysis • Brings unprecedented heterogeneity to the art of data analysis • free text, structured tables and lists, hyperlink topologies, images, meta-data, URL conventions... • different data types and representations potentially provide complementary as well as supplementary information • Most web analysis methods reduce this heterogeneity in one or more aspects • this inevitably leads to information loss • Is it possible to analyse the web ‘in virgin state’? • only solution: a combination of multiple methods based on different principles

  7. Method combination: technically as web service composition... • Building sophisticated monolithic applications would be impractical; they would be hard to understand and difficult to maintain... • A better solution seems to be to implement individual tools as web services and to combine them using state-of-the-art web service composition techniques

  8. Agenda • Overview of web space analysis landscape (5’) • TODD annotation framework and Rainbow collection of web space ontologies (15’) • Rainbow collection of problem solving methods and re-description of real applications (10’) • Parametric design as starting point for automated composition of classification services (10’) • Simulation of service composition and execution (15’) • Rainbow and the semantic web service cycle (5’) • Future work (5’)

  9. TODD framework for service annotation • Four dimensions of web space analysis methods/tools: • Abstract task accomplished by the tool • Type/class and identity of web objects that appear on I/O • Type/representation of underlying data (also called ‘web view’) • Problem domain • Presumably covers everything that can be said about an arbitrary method/tool
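
As a rough illustration of the four dimensions, a tool's annotation could be held in a small record like the one below. This is a sketch in Python, not the project's actual representation: the class name `ToddAnnotation`, its field names and the `matches` helper are assumptions made for illustration. The sample values are copied verbatim from the service description example on slide 45.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToddAnnotation:
    """Hypothetical record holding the four TODD dimensions of one tool."""
    task: str          # abstract task, e.g. "cla" for classification
    input_type: str    # type of the web object on input
    output_class: str  # type/class of the web object on output
    data: str          # data representation ('web view') the tool works on
    domain: str        # problem domain

    def matches(self, task: str, input_type: str, output_class: str) -> bool:
        """True if the tool performs `task` from `input_type` to `output_class`."""
        return (self.task, self.input_type, self.output_class) == (
            task, input_type, output_class)


# Values as they appear on slide 45 (meta(cla_por_html, ...)).
cla_por_html = ToddAnnotation(
    task="cla",
    input_type="document",
    output_class="pornoContentPage",
    data="url",
    domain="pornography",
)

print(cla_por_html.matches("cla", "document", "pornoContentPage"))
```

Keeping the object dimension as two fields (input and output) mirrors the way the slides describe objects "that appear on I/O"; a fuller model would also carry object identity.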

  10. ‘Task’ dimension • Classification • Retrieval • Extraction • Clustering • Association discovery (and other inductive tasks)

  11. Task: Classification • Classify: assign a web object to one of the predefined classes • table look-up (lowest level of task hierarchy) • use information from related objects: adjacent, sub- and super-objects etc. (higher levels) • Examples: • a page is a hub, product catalogue, form-based page… • a link is upward, outward, dictionary... • an HTML structure is a menu, product catalogue item...

  12. Task: Retrieval • Retrieve: find (locations of) objects satisfying given conditions • syntactical retrieval - find objects of a certain type, in a certain relation to a given object (lowest level of task hierarchy) • semantic retrieval - objects also have to belong to certain classes (higher levels) • direct retrieval • index-based retrieval • Examples: • find all pages belonging to the same website as a given start-up page • find all outward links in a page • find all phrases describing the company history in a page

  13. Task: Extraction • Extract: access some (textual) content within a web object • ‘dump’ the whole content of a web object (lowest level of task hierarchy) • extract specific information from a larger object (higher levels), by either explicit or implicit decomposition into sub-objects • Examples: • extract the sentence starting at position <XPointerExp> • extract the name of the company from the homepage • extract the values of ‘keywords’ in META tags • extract the codes and prices of products from a catalogue

  14. ‘Object’ dimension • Identity of objects: variables that allow available outputs to be bound to required inputs and vice versa • Types of objects: defined in the Upper Web Ontology (UWO) • are assumed to be known a priori for a given object • (‘Semantic’) Classes of objects: defined in more specific ontologies subordinated to UWO

  15. Upper Web Ontology

  16. ‘Data’ dimension • Captures the representation of the web space that is used by the given tool as input data • text in sentences; HTML code; URL strings; images; explicit metadata; link topology... • Correlated but not identical with the ‘object’ dimension (the same object can be represented in different ways!) • Also called ‘web view’; each view can define its own taxonomy of object classes • An FCA-based method for integration of view-specific taxonomies has been developed (Labský & Svátek, 2003)

  17. Fragments of ‘view’-specific ontologies (HTML vs. links)

  18. Merging class taxonomies

  19. ‘Domain’ dimension • Generic web space analysis tools • Tools specialised in web pornography • Tools specialised in company sites offering any products/services • Tools specialised in company sites offering bicycle products

  20. Agenda • Overview of web space analysis landscape (5’) • TODD annotation framework and Rainbow collection of web space ontologies (15’) • Rainbow collection of problem solving methods and re-description of real applications (10’) • Parametric design as starting point for automated composition of classification services (10’) • Simulation of service composition and execution (15’) • Rainbow and the semantic web service cycle (5’) • Future work (5’)

  21. Knowledge modelling and PSMs • Original concept of ‘knowledge-level model’ formulated by Newell (1982) • Capture the conceptual nature of a knowledge-based system independent of implementation and data representation • From 1985 on: models of various AI reasoning tasks as well as methods to solve them • called PSMs, for Problem Solving Methods • Collected into libraries (e.g. in CommonKADS) • dichotomy: system analysis (classification, diagnosis, assessment, monitoring...) vs. system synthesis (design, configuration, planning, scheduling...)

  22. Tentative PSMs for web analysis • Classification task • Look-up based Classification • Compact Classification • Structural Classification • Extraction task • Overall Extraction • Compact Extraction • Structural Extraction • Retrieval Task • Direct Retrieval • Index-based Retrieval

  23. CommonKADS inference model of Direct Retrieval

  24. CommonKADS inference model of Index-based Retrieval

  25. Describing web space analysis applications with PSMs • Relevant for those exploiting multiple views of the web • two varieties of Rainbow-based architecture for acquisition of bicycle product (and associated) information • multi-way recognition of web pornography • three projects by other groups • First in ad hoc, Prolog-like pseudo-code • to demonstrate adequacy of the TODD framework and the PSMs for pre-existing applications • not meant to be operational

  26. Example description: name collection application by Armadillo (Ciravegna 2003)

          ExtS(DC, DocCollection, _, CSDept, [names]) :-
              RetD(P1, Phrase, text, General, [P1 part-of DC, PotentPName(P1)]),
              % named entity recognition for person names
              ClaC(P1, Phrase, text, General, [PName,@other]),
              % use of public search tools over papers and homepages
              RetI(P2, Phrase, freq, Biblio, [P1 part-of P2, PaperCitation(P2)]),
              RetI(D, Document, freq, General, [P1 part-of D, D part-of DC, PHomepage(D)]),
              RetD(DF1, DocFragment, freq, General, [Heading(DF1), DF1 part-of D, P1 part-of DF1]),
              ExtO(P1, Phrase, text, General, [names]),
              % co-occurrence-based extraction
              RetD(DF2, DocFragment, html, General, [ListItem(DF2), DF2 part-of DC, P1 part-of DF2]),
              RetD(DF3, DocFragment, html, General, [ListItem(DF3), (DF3 below DF2; DF2 below DF3)]),
              ExtS(DF3, DocFragment, text, General, [names]),
              RetD(DF4, DocFragment, html, General, [TableField(DF4), DF4 part-of DC, P1 part-of DF4]),
              RetD(Q, DocFragment, html, General, [TableField(DF5), (DF5 below DF4; DF4 below DF5)]),
              ExtS(DF5, DocFragment, text, General, [names]),
              % extraction from links
              RetD(DF5, DocFragment, html, General, [IntraSiteLinkElement(DF5), DF5 part-of DC]),
              ExtS(DF5, DocFragment, text, General, [names]),
              ...

          % extraction of potential person names from document fragments
          ExtS(DF, DocFragment, text, General, [names]) :-
              RetD(P, Phrase, text, General, [DF contains P, PotentialPersonName(P)]),
              ExtO(P, Phrase, text, General, [names]).

  27. Towards generation of service compositions • Next step: attempt to generate the control code for a composed service automatically, and then execute it over a collection of available tools! • Would in fact be web service composition if used in a real environment and with real services... • So far only simulated experiments; application: pornography recognition (cf. PhD thesis by Vacura)

  28. Web service composition • Most popular approaches to WS composition (also called configuration, choreography etc.) • manual composition in workflow-inspired languages such as BPEL4WS: actually “programming in the large” → popular with industry • fully automated (‘semantic’) composition based on pre-/post-condition reasoning, e.g. OWL-S ... actually AI planning → popular with academics • Here: middle-way approach based on automated filling (and possibly folding/unfolding) of templates • Also sort of ‘semantic’ but less ambitious

  29. Agenda • Overview of web space analysis landscape (5’) • TODD annotation framework and Rainbow collection of web space ontologies (15’) • Rainbow collection of problem solving methods and re-description of real applications (10’) • Parametric design as starting point for automated composition of classification services (10’) • Simulation of service composition and execution (15’) • Rainbow and the semantic web service cycle (5’) • Future work (5’)

  30. Parametric design (PD) • A reasoning task well-examined by the knowledge modelling community • Setting values to a set of parameters in a template, while considering constraints and preferences • Classical problem solving method (PSM) for PD: Propose - Critique - Modify (PCM) • Propose an initial configuration • Verify the required properties (if satisfied then Stop) • (Critique:) Analyse reasons for failure in the Verify step • (Modify:) Change the values • return to Verify step
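
The Propose - Critique - Modify cycle listed above can be sketched as a small generic loop. This is a minimal illustration in Python under stated assumptions, not the method's reference implementation: the function `propose_critique_modify`, the round limit, and the toy rectangle-sizing problem (parameters `width` and `height`, constraints "width at most 4" and "area at least 12") are all invented for the example.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]


def propose_critique_modify(
    propose: Callable[[], Config],
    verify: Callable[[Config], bool],
    critique: Callable[[Config], List[str]],
    modify: Callable[[Config, List[str]], Config],
    max_rounds: int = 20,  # assumed safety limit, not part of PCM itself
) -> Config:
    """Run Propose once, then Verify / Critique / Modify until Verify succeeds."""
    config = propose()
    for _ in range(max_rounds):
        if verify(config):                 # Verify: required properties hold, stop
            return config
        reasons = critique(config)         # Critique: why did Verify fail?
        config = modify(config, reasons)   # Modify: change parameter values
    raise RuntimeError("no configuration satisfied the required properties")


# --- toy problem (hypothetical) ---
def propose() -> Config:
    return {"width": 1, "height": 1}


def critique(config: Config) -> List[str]:
    reasons = []
    if config["width"] > 4:
        reasons.append("too_wide")
    if config["width"] * config["height"] < 12:
        reasons.append("area_too_small")
    return reasons


def verify(config: Config) -> bool:
    return not critique(config)


def modify(config: Config, reasons: List[str]) -> Config:
    new = dict(config)
    if "too_wide" in reasons:
        new["width"] -= 1
    elif "area_too_small" in reasons:
        # Grow the width first; once it reaches its limit, grow the height.
        if new["width"] < 4:
            new["width"] += 1
        else:
            new["height"] += 1
    return new


result = propose_critique_modify(propose, verify, critique, modify)
print(result)
```

Passing the four steps in as functions keeps the loop independent of the domain, which is the property the later slides rely on when services are filled in as parameter values.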

  31. PD as model of WS composition • If a suitable template can be identified, WS composition will merely amount to filling in (concrete or abstract) services as “values” for “parameters”, in the PCM cycle • The filling can be carried out based on either: • generic pre-/post-condition reasoning over complete functional descriptions of individual services • dedicated broker equipped with a method-specific knowledge base (“PCM knowledge”) • The latter option examined by ten Teije (2004): experiments with using the classification PSM as template to be filled in

  32. Filling in a classification template • A large proportion of composite web services to date indeed have the nature of classification (e.g. credit assignment applications) • Although multiple PSMs for classification have been identified in the literature, they can be combined into a single template (Motta & Lu 2000) • Testing domain: assignment of reviewers to conference papers, according to paper topics • PCM cycle carried out in several iterations • changes in criteria fulfilment recorded during the updates of the template (re-setting of parameter values)

  33. Classification template [Figure: diagram of the classification template. Labels: Observations, Knowledge, Check, Legal Observations, MicroMatch, Scored Observations, Aggregate, Aggregated Scores, Admissibility, Candidate Solutions, Selection, Solutions]

  34. Examples of “broker” knowledge • Propose knowledge for the Admissibility parameter: if many {feature,value} pairs are irrelevant then do not use strong-coverage • Modify knowledge for the Admissibility parameter: if the solution set has to be increased (reduced) in size, then the value for the Admissibility parameter has to be moved down (up) in the following partial ordering: weak-coverage ≺ strong-coverage ≺ strong-explanative
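
The Modify rule for the Admissibility parameter can be sketched as a lookup in an ordered list. The sketch assumes that the three values are listed on the slide from lowest to highest and that the ordering can be treated as a simple chain; the names `ADMISSIBILITY_ORDER` and `modify_admissibility`, and the choice to stay at the end value when no further move is possible, are assumptions made for illustration.

```python
# Values of the Admissibility parameter, assumed to be ordered from lowest to highest.
ADMISSIBILITY_ORDER = ["weak-coverage", "strong-coverage", "strong-explanative"]


def modify_admissibility(current: str, need_more_solutions: bool) -> str:
    """Move one step down to enlarge the solution set, one step up to reduce it.

    At either end of the ordering the value is left unchanged (an assumption;
    the slide does not say what the broker does in that case).
    """
    index = ADMISSIBILITY_ORDER.index(current)
    step = -1 if need_more_solutions else 1
    new_index = min(max(index + step, 0), len(ADMISSIBILITY_ORDER) - 1)
    return ADMISSIBILITY_ORDER[new_index]


print(modify_admissibility("strong-coverage", need_more_solutions=True))
print(modify_admissibility("strong-coverage", need_more_solutions=False))
```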

  35. Web analysis services’ specifics • Large number of diverse tools potentially available, so manual programming would be cumbersome • Can be tested in a real environment with low or zero cost (unlike e.g. business IS or medical applications), so experiments with automated composition might be relatively ambitious • Abstract descriptions of analysis services can (in addition to task characteristics) explicitly include characteristics related to analysed data (e.g. those enforced by mark-up languages such as HTML)

  36. Service combination in Rainbow • Currently, the only option used for building more complex applications is conventional programming • control routines for the bicycle application, by O. Šváb (in Java), which call individual web services and integrate the results • Some more flexible solution is needed; ideally, it should be capable of including an unforeseen component (with appropriate semantic description): use of PSMs and ontologies?

  37. Can the parametric design scenario be applied here? • Web analysis PSMs can be viewed as templates • Filling unforeseen services into slots rather than just choosing among the known values of parameters! • To capture the connectivity and multiple views over the web space, the object-feature-value view (from traditional knowledge engineering) does not suffice: an object-relation-object view is more appropriate! • Structural classification/extraction is thus recursive, i.e. slots could be replaced with further templates!

  38. Possible adaptations of PD • Simple: pre-set template versions for degrees of recursion and combinations with non-recursive versions • Advanced: in addition to setting values for attributes, the Propose and Modify steps could also fold/unfold slots to/from templates

  39. Some tentative broker knowledge • Templates with lower number of distinct objects and non-recursive templates should be preferred • Look-up classification should be preferred to compact classification • Default partial ordering of data types with respect to object classification, for Document object: frequency > URL > topology free text > metadata • URL-based or topology-based classification should never be used alone • Default partial ordering of types of relations (@rel): part-of > is-part > adjacent

  40. Agenda • Overview of web space analysis landscape (5’) • TODD annotation framework and Rainbow collection of web space ontologies (15’) • Rainbow collection of problem solving methods and re-description of real applications (10’) • Parametric design as starting point for automated composition of classification services (10’) • Simulation of service composition and execution (15’) • Rainbow and the semantic web service cycle (5’) • Future work (5’)

  41. Limits of the current Prolog simulation • Data: Prolog facts with high level of abstraction, instead of real data • Only the classification task covered; only binary classification (with certainty factor) • Only six service ‘mock-ups’ implemented so far • though writing a new one is a matter of 20-30 minutes • Only the initial ‘Propose’ step implemented • Multiple fixed templates rather than un/folding • Broker knowledge not yet implemented • just blind search with checking ‘service signatures’

  42. Example of ‘data’

          page(p36).                    % image page with 1 picture
          url_of(u36,p36).
          url_terms(u36,[teen,sex]).    % terms in URL
          part(p36,s3).
          linkto(p31,p36).
          textprop(p36,0.0).            % proportion of text on page
          part(f361,p36).
          html_frag(f361).              % fragment of HTML code
          part(i3611,f361).
          image(i3611).
          body_color(i3611,0.4).        % proportion of body color

  43. Classification template example

          templ(sc1,
                s(cla,0,0,Tp1,Tp2),                           % template header
                [s(cla,0,0,Tp3,Tp4)],                         % template body with one slot
                [subclasseq(Tp3,Tp1),subclasseq(Tp4,Tp2)]).   % template constraints

      Annotations on the slide: • cla: task type: classification • 0,0: input object same as output object • subclasseq(Tp3,Tp1): type of input object of ‘lower-level’ service at most as general as type of input object of ‘higher-level’ service • This is the simplest template, with one slot only • More complex ones have to deal e.g. with aggregation or transformation of certainty factors
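
To make the two `subclasseq` constraints of template `sc1` concrete, the check can be sketched as a walk up a class taxonomy. This is an illustrative Python sketch, not the Prolog simulation itself: the taxonomy fragment in `PARENT`, the class names `html_document` and `contentPage`, and the helper `fits_sc1` are hypothetical and do not come from the Rainbow ontologies.

```python
# Hypothetical taxonomy fragment, child -> parent (invented for this example).
PARENT = {
    "html_document": "document",
    "pornoContentPage": "contentPage",
}


def subclasseq(sub: str, sup: str) -> bool:
    """True if `sub` equals `sup` or is a (transitive) subclass of it."""
    current = sub
    while current is not None:
        if current == sup:
            return True
        current = PARENT.get(current)
    return False


def fits_sc1(header: tuple, slot: tuple) -> bool:
    """Check the sc1 constraints: subclasseq(Tp3,Tp1) and subclasseq(Tp4,Tp2).

    `header` is (Tp1, Tp2) of the requested service, `slot` is (Tp3, Tp4)
    of the service proposed for the single slot.
    """
    tp1, tp2 = header
    tp3, tp4 = slot
    return subclasseq(tp3, tp1) and subclasseq(tp4, tp2)


print(fits_sc1(("document", "contentPage"), ("html_document", "pornoContentPage")))
print(fits_sc1(("html_document", "contentPage"), ("document", "pornoContentPage")))
```

The second call fails because the slot service's input type is more general than the header's, which is exactly what the first constraint rules out.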

  44. Classification template example

          templ(sc5,
                s(cla,0,0,Tp1,Tp2),
                [s(cla,0,0,Tp3,Tp4),
                 s(ret,0,1,Tp5,Tp6),
                 s(cla,1,1,Tp7,Tp8),
                 s(tsf,ref(3,1),0,Tp8,Tp4),
                 s(agr,[ref(1,0),ref(4,0)],0,Tp4,Tp4)],
                [subclasseq(Tp3,Tp1),
                 subclasseq(Tp5,Tp1),
                 rel(part,Tp6,Tp5),
                 subclasseq(Tp6,Tp7),
                 subclasseq(Tp4,Tp2)]).

  45. Service description example

          meta(cla_por_html,                         % service identifier
               s(cla,document,pornoContentPage),     % task, input object type, output object type (class)
               url,                                  % data type/representation
               pornography,                          % problem domain
               4).                                   % time cost

  46. Sample simulation run

          ?- propose(cla, doc_coll, porno_coll).
          Number of solutions: 2

          Template: sc3
          Configuration:
            s(ret, 0, 1, doc_coll, localhub, ret_localhub)
            s(cla, 1, 1, document, pornoContentPage, cla_por_html)
            s(tsf, ref(2, 1), 0, pornoContentPage, porno_coll, tsf_porno1)
          Time cost: 15

          Template: sc3
          Configuration:
            s(ret, 0, 1, doc_coll, localhub, ret_localhub)
            s(cla, 1, 1, document, pornoContentPage, cla_por_url)
            s(tsf, ref(2, 1), 0, pornoContentPage, porno_coll, tsf_porno1)
          Time cost: 13
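
The kind of blind search with signature checking that produces such a run can be sketched as follows. This Python sketch is a simplification of the Prolog simulation, not a port of it: it treats the template as a linear chain of slots and ignores the object indices and `ref(...)` arguments. The service names and type names come from the run above and the time cost of 4 for `cla_por_html` from slide 45; the other three time costs and the assumption that `localhub` is a subclass of `document` are invented so that the totals match the slide.

```python
from itertools import product

# name -> (task, input type, output type, time cost)
SERVICES = {
    "ret_localhub": ("ret", "doc_coll", "localhub", 6),                # cost assumed
    "cla_por_html": ("cla", "document", "pornoContentPage", 4),        # cost from slide 45
    "cla_por_url":  ("cla", "document", "pornoContentPage", 2),        # cost assumed
    "tsf_porno1":   ("tsf", "pornoContentPage", "porno_coll", 5),      # cost assumed
}

PARENT = {"localhub": "document"}  # assumed taxonomy fragment, child -> parent


def subclasseq(sub, sup):
    """True if `sub` equals `sup` or is a (transitive) subclass of it."""
    while sub is not None:
        if sub == sup:
            return True
        sub = PARENT.get(sub)
    return False


def propose(slot_tasks, in_type, out_type):
    """Enumerate all ways to fill the slots whose signatures chain together."""
    if not slot_tasks:
        return []
    candidates = [
        [name for name, sig in SERVICES.items() if sig[0] == task]
        for task in slot_tasks
    ]
    solutions = []
    for names in product(*candidates):
        sigs = [SERVICES[name] for name in names]
        # Each slot's output must be acceptable as the next slot's input.
        chained = all(subclasseq(a[2], b[1]) for a, b in zip(sigs, sigs[1:]))
        if (chained
                and subclasseq(in_type, sigs[0][1])
                and subclasseq(sigs[-1][2], out_type)):
            solutions.append((names, sum(sig[3] for sig in sigs)))
    return solutions


# Slot tasks of a three-step template in the style of sc3: retrieve, classify, transform.
for names, cost in propose(["ret", "cla", "tsf"], "doc_coll", "porno_coll"):
    print(names, "time cost:", cost)
```

A broker with preference knowledge would rank or prune these candidates instead of enumerating them all; here the two configurations differ only in which classifier fills the middle slot.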

  47. Agenda • Overview of web space analysis landscape (5’) • TODD annotation framework and Rainbow collection of web space ontologies (15’) • Rainbow collection of problem solving methods and re-description of real applications (10’) • Parametric design as starting point for automated composition of classification services (10’) • Simulation of service composition and execution (15’) • Rainbow and the semantic web service cycle (5’) • Future work (5’)

  48. Coverage of semantic web service cycle • Service annotation with semantic description • Here: TODD framework and ontologies • Service discovery in open and heterogeneous space • Here: not addressed (we rely on a single annotation model and centralised ontology), hence this is not an ‘upper semantic web’ application! • Service composition (‘choreography’) • Here: main focus; template-based (PSM) approach • Composed service execution (‘orchestration’) • Here: extremely simplified

  49. Agenda • Overview of web space analysis landscape (5’) • TODD annotation framework and Rainbow collection of web space ontologies (15’) • Rainbow collection of problem solving methods and re-description of real applications (10’) • Parametric design as starting point for automated composition of classification services (10’) • Simulation of service composition and execution (15’) • Rainbow and the semantic web service cycle (5’) • Future work (5’)

  50. Ongoing and future work • Implementation of broker knowledge base • Further elaboration of prototype broker: beyond initial template filling: ‘Critique’ and ‘Modify’ phases? • Capture the possible structure of templates (initial proposal as well as modification) with a grammar? • Iterative template refinement with verification on data • Enrichment of the collection of analysis components (by Rainbow team as well as third party) • Implementation of full-fledged broker
