XML, RDF and Advanced Search (Semantic Web – Web3.0)

Thanks to Jim Hendler, Carl Lagoze, Jayavel Shanmugasundaram, Sara Cohen, Jonathan Mamou, Yaron Kanza, Mark Sapossnek, Yehoshua Sagiv, Frank van Harmelen

What we have covered • What is IR • Evaluation • Tokenization and properties of text • Web crawling • Query models • Vector methods • Measures of similarity • Indexing • Inverted files • Basics of internet and web • Spam and SEO • Search engine design • Google and Link Analysis • This week: metadata, XML, RDF; advanced search, Semantic Web

The importance of data and their rules • Tim Berners-Lee • Inventor of the World Wide Web • Founder of the W3C • Presentation at TED

“Metadata is data about data” Metadata and Markup Languages • Why is metadata important? • Makes data easier to search • It’s the foundation of the semantic web (Web 3.0) • Metadata is often written in XML

Metadata is semi-structured data conforming to commonly agreed upon models, providing operational interoperability in a heterogeneous environment

What is metadata? Some simple definitions • ‘Structured data about data’. • Dublin Core Metadata Initiative FAQ, 2005 • http://dublincore.org/resources/faq/ • Machine-understandable information about Web resources or other things. • Tim Berners-Lee, W3C, 1997 • http://www.w3.org/DesignIssues/Metadata

"Web resources or other things" • Metadata might be "about"… anything! • HTML documents • digital images • databases • books • museum objects • archival records • metadata records • Web sites • collections • services • physical places • people • organizations • “works” • formats • concepts • events

What might metadata "say"? What is this called? What is this about? Who made this? When was this made? Where do I get (a copy of) this? When does this expire? What format does this use? Who is this intended for? What does this cost? Can I copy this? Can I modify this? What are the component parts of this? What else refers to this? What did "users" think of this? (etc!)

What operations/functions? • resource disclosure & discovery • resource retrieval, use • resource management, including preservation • verification of authenticity • intellectual property rights management • commerce • content-rating • authentication and authorization • personalization and localization of services • (etc!)

What operations/functions? • Different functions for different metadata • Metadata (and metadata standards) sometimes classified according to function • Descriptive: primarily for discovery, retrieval • Administrative: primarily for management • Structural: relationships between component parts of resources • Contextual: relationships between resources • No “one size fits all solution”!

Metadata of a report? • What metadata would you associate with a report or memo?

Metadata importance • “data about data” is about as good as the definition gets... • As a data resource grows, metadata becomes more important • Lack of metadata has different consequences • documentation: metadata can be regenerated automatically, or by hand • datasets, pictures: once lost, can be impossible to regenerate

Types of Metadata • Descriptive • Discovery / description of objects • Title, author, abstract, etc. • Structural • Storage & presentation of objects • 1 pdf file, 1 ppt file, 1 LaTeX file, etc. • Administrative • Managing and preservation of objects • Access control lists, terms and conditions, format descriptions, “meta-metadata” LOC - Library of Congress
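The three types above can be pictured as separate groups of fields on one record. The following is only a minimal sketch: the class name `MetadataRecord`, its field values, and the `matches` helper are hypothetical and are not taken from any metadata standard.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical record that groups fields by metadata type."""
    descriptive: dict = field(default_factory=dict)      # discovery / description
    structural: dict = field(default_factory=dict)       # storage & presentation
    administrative: dict = field(default_factory=dict)   # management & preservation

    def search_text(self) -> str:
        # In this sketch only descriptive metadata is used for discovery.
        return " ".join(str(v) for v in self.descriptive.values()).lower()


# Illustrative values only (assumed, not from the slides).
report = MetadataRecord(
    descriptive={"title": "Quarterly Report", "author": "A. Author",
                 "abstract": "Sales summary"},
    structural={"files": ["report.pdf", "slides.ppt", "report.tex"]},
    administrative={"access": ["staff"], "format": "application/pdf"},
)


def matches(record: MetadataRecord, term: str) -> bool:
    """Return True if the term occurs in the descriptive metadata."""
    return term.lower() in record.search_text()
```

Here `matches(report, "quarterly")` finds the record through its title, while a term that appears only in the administrative group (such as the access list) does not, which mirrors the split between discovery and management functions.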

Which View is Correct? figure 1 from: http://www.dlib.org/dlib/january01/lagoze/01lagoze.html

Approaches to Metadata • from Ng, Park and Burnett, 1997 (also JASIS, 50(13)) http://www.scils.rutgers.edu/~sypark/asis.html • library science: bibliographic control • “organizing the physical containers of information, by means of bibliographical description, subject analysis, and classification notation construction, so that the container can be efficiently described, identified, located and retrieved” • computer and information science: data management • “not only to store, access and utilize data effectively, but also to provide data security, data sharing, and data integrity” • Domains/areas (chemistry, physics, ..) define their own

Metadata Formats and Implementation • Use markup languages • Interoperable • Extensible • Robust • Permits advanced search features • When online, the beginning of a semantic web!

What is a markup language? • Textual (i.e. person readable) language where significant elements are indicated by markers • <TITLE>XML</TITLE> • Examples are RTF, HTML, XML, TEX etc. • Easy to process and can be manipulated by a variety of application programs

Standard Generalized Markup Language (SGML) • Based on GML (generalized markup language), developed by IBM in the 1960s • An international standard (ISO 8879:1986) that defines how descriptive markup should be embedded in a document • Can define any document format of any complexity • Enables extensibility, structure and validation • Too many optional features for the Web • Gave birth to the extensible markup language (XML), a W3C recommendation in 1998

The Purpose of SGML • SGML is designed to make your information last longer than the systems that created it. Such longevity also implies immunity to short-term changes -- such as a change from one application program to another -- so SGML is also inherently designed for re-purposing and portability.

What is SGML? • SGML (and its derivatives, HTML and XML) are ASCII character based representations of electronic data • Remember, it's all bits; meaning is derived from how they are organized… • Think of SGML docs as strings that must be parsed: a web browser parses an HTML doc and uses the markup codes to display the data it contains • Since it's all ASCII, these docs can also be handled by non-parsing tools (such as vi, emacs, perl, etc.)
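The difference between parsing a marked-up document and treating it as a plain string can be sketched as follows. This uses Python's standard `html.parser` module; the `HeadingCollector` class and the shortened `DOC` string are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Shortened, illustrative document (assumed for this sketch).
DOC = ("<body><h1>XML and Information Retrieval</h1>"
       "<p>The workshop was held on 28 July 2000.</p></body>")


class HeadingCollector(HTMLParser):
    """Collects the text found inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings.append(data.strip())


# Parsing: the markup tells the program which text is the heading.
collector = HeadingCollector()
collector.feed(DOC)
collector.close()

# Non-parsing: the document is just a string, as it would be for a text editor.
contains_date = "28 July 2000" in DOC
```

The parser can answer "what is the heading?", whereas the string search can only answer "does this text occur anywhere?".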

SGMLXMLHTML SGML is the “mother tongue” – but is overkill for most common applications. XML is an abbreviated version of SGML • easier to define own document types • easier for programmers to write programs to handle documents (and data) • omits all the options (and most of more complex and less-used parts) of SGML) • HTML is just one of many SGML or XML “applications” – most frequently used on the Web

SGML Components • SGML documents have three parts: • Declaration: specifies which characters and delimiters may appear in the application • DTD (document type definition) / style sheet: defines the syntax of markup constructs • Document instance: actual text (with the tags) of the document • More information can be found at: http://www.W3.Org/markup/SGML

World Wide Web Consortium (W3C)

What is XML? • XML – eXtensible Markup Language • designed to improve the functionality of the Web by providing more flexible and adaptable information identification • “extensible” because it is not a fixed format like HTML • a language for describing other languages (a meta-language) • design your own customised markup language

The HTML World
<body>
  <h1> XML and Information Retrieval: A SIGIR 2000 Workshop </h1>
  <p> The workshop was held on 28 July 2000. The editors of the workshop were David Carmel, Yoelle Maarek, and Aya Soffer </p>
  <h2> XQL and Proximal Nodes </h2>
  <p> The paper was authored by Ricardo Baeza-Yates and Gonzalo Navarro. The abstract of this paper is given below. </p>
  <p> We consider the recently proposed language … </p>
  <p> The paper references the following papers: <a href="http://www.acm.org/www8/paper/xmlql"> … </a> … </p>
  …

The XML World
<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <subsection name="Related Work"> The XQL language … </subsection>
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …
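Because the XML version names each piece of information, a program can pull out the date, the paper and its authors directly. A minimal sketch using Python's standard `xml.etree.ElementTree` module follows; the document is a shortened, closed-off version of the workshop example (the elided parts are simply left out, not filled in).

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the workshop example.
WORKSHOP_XML = """
<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <editors>David Carmel, Yoelle Maarek, Aya Soffer</editors>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <author>Gonzalo Navarro</author>
    </paper>
  </proceedings>
</workshop>
""".strip()

root = ET.fromstring(WORKSHOP_XML)

# Attributes and elements are addressed by name, not by position in the text.
workshop_date = root.get("date")
paper = root.find("proceedings/paper")
paper_title = paper.findtext("title")
authors = [a.text for a in paper.findall("author")]
```

With the HTML version the same facts are buried in paragraph text ("The paper was authored by …"), so a program would need text heuristics to recover them.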

XML • XML is written in SGML – the Standard Generalized Markup Language, an international standard (ISO 8879) • XML = very simple dialect of SGML • goal = enable generic SGML to be served, received and processed on the Web in ways not possible with HTML

Why use XML? • XML is not just for Web pages • Data management: • store any kind of structured document • enclose/encapsulate information in order to pass it between different computing systems that are otherwise unable to communicate

Key feature of XML An application is free to use XML tagged data in many different ways, e.g. • produce an image • generate a formatted text listing • display the XML document’s markup in pretty colors • restructure the data into a format for storing in a database, transmission over a network, input to another program.

XML Software? • Many programs are already “XML ready”. • xml.coverpages.org covers news of new additions to XML • Find Penn State pages with XML

How do I run or execute an XML file? • You can’t, and you don’t! • XML is not a programming language • XML is a markup specification language • XML files are just data (Unicode text) waiting for a program to do something with them • XML files can be viewed with an XML editor or XML-compatible browser

Things to Remember • XML does not replace HTML – it provides an alternative which allows you to define your own set of markup elements to a published standard: • <?xml version="1.0" standalone="yes"?> • <conversation> • <greeting>Hello, world!</greeting> • <response>Stop the planet, I want to get off!</response> • </conversation>
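Since an XML file is only data, it is the consuming program that decides what to do with it. As a small sketch, the conversation document above is turned into two different outputs using Python's standard `xml.etree.ElementTree`; the function names `as_listing` and `as_record` are hypothetical.

```python
import xml.etree.ElementTree as ET

CONVERSATION = """<?xml version="1.0" standalone="yes"?>
<conversation>
  <greeting>Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>"""


def as_listing(xml_text):
    """One use of the data: a formatted text listing, one line per element."""
    root = ET.fromstring(xml_text)
    return [f"{child.tag}: {child.text}" for child in root]


def as_record(xml_text):
    """Another use of the same data: a dict ready to store or transmit."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


listing = as_listing(CONVERSATION)
record = as_record(CONVERSATION)
```

The same tagged data feeds both functions; nothing in the XML itself says which presentation is intended.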

Things to Remember • All parts of an XML document are case sEnSiTiVe • Element type names are case sensitive, so <BODY> …</body> is out. • Attribute names are case sensitive … • <PIC width="7cm"/> and • <PIC WIDTH="6cm"/> • describe different attributes, not just different values for the attribute “PIC width”.
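Case sensitivity can be checked with any conforming XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The start and end tags differ in case, so they do not match.
mismatched = is_well_formed("<BODY>text</body>")
matched = is_well_formed("<BODY>text</BODY>")

# width and WIDTH are two distinct attributes and may appear side by side.
pic = ET.fromstring('<PIC width="7cm" WIDTH="6cm"/>')
pic_attributes = dict(pic.attrib)
```

The parser rejects the first document outright, and it keeps `width` and `WIDTH` as separate entries rather than treating one as overwriting the other.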

What is XQuery? • XQuery is the language for querying XML data • The best way to explain XQuery is to say that XQuery is to XML what SQL is to database tables. • XQuery uses XPath expressions to extract XML data. • XPath is a language for finding information in an XML document. • XPath is used to navigate through elements and attributes in an XML document. • XQuery is defined by the W3C. • XQuery is supported by all the major database engines (IBM, Oracle, Microsoft, etc.) • XQuery 1.0 W3C Recommendation
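Python's standard library does not include an XQuery processor, so the sketch below illustrates only the XPath side: `xml.etree.ElementTree` supports a limited subset of XPath for navigating elements and attributes. The second paper in the sample document is hypothetical and is there only to give the path expressions something to filter out.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring("""
<proceedings>
  <paper id="1">
    <title>XQL and Proximal Nodes</title>
    <author>Ricardo Baeza-Yates</author>
    <author>Gonzalo Navarro</author>
  </paper>
  <paper id="2">
    <title>Hypothetical Second Paper</title>
    <author>A. Author</author>
  </paper>
</proceedings>
""".strip())

# Select by attribute value, then step down to the author children.
authors_of_paper_1 = [a.text for a in DOC.findall("./paper[@id='1']/author")]

# Select every title anywhere below the root.
titles = [t.text for t in DOC.findall(".//title")]

# Select the paper whose title child has exactly this text.
by_title = DOC.find("./paper[title='XQL and Proximal Nodes']")
```

A full XQuery engine adds constructs such as FLWOR expressions on top of path expressions like these; those are outside what this sketch shows.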

Motivation for XML Search • It is becoming increasingly popular to publish data on the Web in the form of XML documents. • XML on the web? • Current search engines, which are an indispensable tool for finding HTML documents, have two main drawbacks when it comes to searching for XML documents. • It is not possible to pose queries that explicitly refer to XML tags. • Search engines return references (i.e. links) to documents and not specific fragments thereof. This is problematic, since large XML documents may contain thousands of elements storing many pieces of information that are not necessarily related to each other.

How would we check to see how much XML is out there?

Problems with XQuery • A query language for XML, such as XQuery, can be used to extract data from XML documents. • However, such a query language is not an alternative to an XML search engine for several reasons. • The syntax of XQuery is more complicated than the syntax of a standard search query. Hence, it is not appropriate for a naive user. • Extensive knowledge of the document structure is required in order to correctly formulate a query. Thus, queries must be formulated on a per document basis. • XQuery lacks any mechanism for ranking answers. • Solution - XML Search engine

XML Search Tool Design Features? • A simple syntax that can be used by naive users • Search results should include XML fragments and not necessarily full documents • The XML fragments in an answer should be semantically related • For example, a paper and an author should be in an answer only if the paper was written by this author • Search results should be ranked • Search results should be returned in “reasonable” time
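These design features can be illustrated with a toy keyword search over XML fragments. This is only a sketch under simple assumptions, not how any particular XML search engine works: each `<paper>` element is treated as one fragment (so a title and its authors stay together), the score is a plain count of query-term occurrences, and the second paper and the names `tokens` and `search` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

CORPUS = ET.fromstring("""
<proceedings>
  <paper id="1">
    <title>XQL and Proximal Nodes</title>
    <author>Ricardo Baeza-Yates</author>
    <author>Gonzalo Navarro</author>
  </paper>
  <paper id="2">
    <title>A Hypothetical Paper on XML Ranking</title>
    <author>A. Author</author>
  </paper>
</proceedings>
""".strip())


def tokens(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(root, query, fragment_tag="paper"):
    """Return (score, fragment) pairs, best first, for a keyword query."""
    terms = tokens(query)
    results = []
    for fragment in root.iter(fragment_tag):
        words = tokens(" ".join(fragment.itertext()))
        score = sum(words.count(term) for term in terms)
        if score > 0:
            results.append((score, fragment))
    # Ranking: fragments matching more query terms come first.
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


ranked_ids = [frag.get("id") for _, frag in search(CORPUS, "xml ranking nodes")]
single_hit = [frag.get("id") for _, frag in search(CORPUS, "xql navarro")]
```

The user types plain keywords, the answers are fragments rather than whole documents, and a paper is returned together with its own authors only, which covers the first four bullets in a very simplified way.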

XML Search Engines • Summary of XML engines • Open source ones starting to emerge • Or just use web search engine with filetype:xml • Try Google • Many for commercial use and some in design • Active research area • Web XML is a step in the direction of the semantic web!

XML for Search Engines - Sitemaps • The Sitemaps protocol allows a website to inform search engines about URLs on the site that are available for crawling. • A Sitemap is an XML file that lists the URLs for a site. • It includes additional information about each URL: when it was last updated, how often it changes, and how important it is relative to other URLs on the site • This allows search engines to crawl the site more intelligently. • Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol. • Sitemaps are particularly beneficial on websites where: • some areas of the website are not available through the browsable interface • rich Ajax, Silverlight, or Flash content is used that is not normally processed by search engines • the site is very large or has many pages that are isolated or not well linked together • the website has few external links
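A Sitemap file can be produced with a few lines of code. The sketch below builds one with Python's standard `xml.etree.ElementTree`, using the element names of the Sitemaps protocol (`urlset`, `url`, `loc`, `lastmod`, `changefreq`, `priority`); the function name `build_sitemap` and the example.com URLs are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build a Sitemap document from a list of page dicts.

    Each dict needs a "loc" key; "lastmod", "changefreq" and "priority"
    are optional, as in the protocol.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        for key in ("lastmod", "changefreq", "priority"):
            if key in page:
                ET.SubElement(url, key).text = str(page[key])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages for an example site.
sitemap_xml = build_sitemap([
    {"loc": "https://example.com/", "lastmod": "2024-01-01",
     "changefreq": "daily", "priority": 1.0},
    {"loc": "https://example.com/about"},
])
```

The optional fields carry exactly the hints listed above: last update, change frequency, and relative importance within the site.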

Open Source XML Search Engine

XML Schema for Book
<xs:element name="Book">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Title" type="xs:string"/>
      <xs:element name="Authors">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Author" type="xs:string" maxOccurs="5"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="Date" type="xs:gYear"/>
      <xs:element name="Publisher" minOccurs="0">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="Springer"/>
            <xs:enumeration value="MIT Press"/>
            <xs:enumeration value="Harvard Press"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Equivalent JSON Schema
{
  "$schema": "http://json-schema.org/draft-04/schema",
  "type": "object",
  "properties": {
    "Book": {
      "type": "object",
      "properties": {
        "Title": {"type": "string"},
        "Authors": {"type": "array", "minItems": 1, "maxItems": 5, "items": {"type": "string"}},
        "Date": {"type": "string", "pattern": "^[0-9]{4}$"},
        "Publisher": {"type": "string", "enum": ["Springer", "MIT Press", "Harvard Press"]}
      },
      "required": ["Title", "Authors", "Date"],
      "additionalProperties": false
    }
  },
  "required": ["Book"],
  "additionalProperties": false
}
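To see what the two schemas constrain, the sketch below converts a Book document from XML into the JSON shape and checks it by hand against the main rules (required fields, 1 to 5 authors, four-digit year, publisher enumeration). This is a hand-written check using only the Python standard library, not a schema validator, and it does not cover `additionalProperties`; the function names and the sample book are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

PUBLISHERS = {"Springer", "MIT Press", "Harvard Press"}


def book_to_dict(xml_text):
    """Convert a <Book> XML document into the JSON-style structure."""
    book = ET.fromstring(xml_text)
    data = {
        "Title": book.findtext("Title"),
        "Authors": [a.text for a in book.findall("Authors/Author")],
        "Date": book.findtext("Date"),
    }
    publisher = book.findtext("Publisher")
    if publisher is not None:  # Publisher is optional in both schemas
        data["Publisher"] = publisher
    return {"Book": data}


def check_book(doc):
    """Return a list of constraint violations (empty if none were found)."""
    errors = []
    book = doc.get("Book", {})
    if not isinstance(book.get("Title"), str):
        errors.append("Title must be a string")
    authors = book.get("Authors")
    if not isinstance(authors, list) or not 1 <= len(authors) <= 5:
        errors.append("Authors must list 1 to 5 names")
    date = book.get("Date")
    if not isinstance(date, str) or not re.fullmatch(r"[0-9]{4}", date):
        errors.append("Date must be a four-digit year")
    if "Publisher" in book and book["Publisher"] not in PUBLISHERS:
        errors.append("Publisher not in the enumeration")
    return errors


# Hypothetical instance documents.
good = book_to_dict(
    "<Book><Title>Hypothetical Title</Title>"
    "<Authors><Author>A. Author</Author></Authors>"
    "<Date>1999</Date><Publisher>Springer</Publisher></Book>"
)
bad = {"Book": {"Title": "Hypothetical Title", "Authors": [],
                "Date": "99", "Publisher": "Unknown Press"}}

good_errors = check_book(good)
bad_errors = check_book(bad)
```

In practice a dedicated XML Schema or JSON Schema validator would be used instead; the point here is only that both schemas express the same handful of rules.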
