This chapter explores key aspects of text and multimedia languages in information retrieval. It defines a document as a unit of information and discusses its syntax, structure, and semantics. The chapter emphasizes the importance of metadata, including descriptive and semantic metadata, in enhancing information retrieval. It contrasts XML with RDF, highlighting the graph structure of RDF that allows for richer data representation. The chapter also covers how RDF can facilitate connections between data points, enabling automated processing and retrieval of information.
Modern Information Retrieval Chap. 06: Text and Multimedia Languages and Properties (Introduction, Metadata and Text) 6.1, 6.2, 6.3
Introduction • Text • main form of communicating knowledge. • Document • loosely defined, denotes a single unit of information. • can be any physical unit • a file • an email • a Web page
Introduction • Document • Syntax and structure • Semantics • Information about itself
Introduction • Document Syntax • Implicit, or expressed in a language (e.g., TeX) • Powerful languages: easier to parse, difficult to convert to other formats. • Open languages are better (interchange) • Semantics of texts in natural language are not easy for a computer to understand • Trend: languages that provide information on structure, format and semantics while being readable by humans and computers
Introduction • New applications are pushing for formats in which information can be represented independently of style. • Style: defined by the author, but the reader may decide part of it • Style can include treatment of other media
Metadata • “Data about the data” • e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc. • Descriptive Metadata • Author, source, length • Dublin Core Metadata Element Set • Semantic Metadata • Characterizes the subject matter within the document contents • MEDLINE
Metadata • Metadata information on Web documents • cataloging, content rating, property rights, digital signatures • New standard: Resource Description Framework • description of Web resources to facilitate automated processing of information • nodes and attached attribute/value pairs • Metadescription of non-textual objects • keywords can be used to search the objects
RDF Model • A model is a collection of statements • Statement := (predicate, subject, object) • Predicate is a resource • Subject is a resource • Object is either a resource or a literal • [Figure: statement diagram labeled Subject, Predicate, Object]
RDF model and natural language • Subject. In grammar, this is the noun or noun phrase that is the doer of the action. In the sentence “The company sells batteries,” the subject is “the company.” • Predicate. In grammar, this is the part of a sentence that modifies the subject and includes the verb phrase. In our sentence, the predicate is the phrase “sells.” • Object. In grammar, this is a noun that is acted upon by the verb. In our sentence, the object is the noun “batteries.”
XML vs. RDF • RDF is not just an XML dialect. • XML: • Has a tree structure data model. • Only nodes are labeled. • RDF: • Has a graph structure data model. • Both edges (properties) and nodes (subjects/objects) are labeled.
Linking Statements • The subject of one statement can be the object of another • Such collections of statements form a directed, labeled graph • [Figure: graph with nodes Ganji, CE, Sharif and http://ce.sharif.edu, and edges studentOF, departmentOF, hasHomePage]
RDF Graph: ‘anonymous’ nodes • [Figure: RDF graph; labels: Person, PersonName, Literal, Person12345, person.name, first, last, value; literals: Jonathan, Borden]
How can RDF be implemented • Usually RDF/XML syntax • However other notations are possible • e.g. Notation3: • Buddy Belden owns a business. • The business has a Web site accessible at http://www.c2i2.com/~budstv. • Buddy is the father of Lynne. • <#Buddy> <#owns> <#business>. • <#business> <#has-website> <http://www.c2i2.com/~budstv>. • <#Buddy> <#father-of> <#Lynne>.
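As a minimal sketch of the statements above, the three Notation3 triples can be held as plain Python tuples and queried by pattern. The `match` helper and the use of `None` as a wildcard are illustrative assumptions, not part of any RDF toolkit.

```python
# Each statement is a (subject, predicate, object) tuple, mirroring the N3 lines.
triples = [
    ("#Buddy", "#owns", "#business"),
    ("#business", "#has-website", "http://www.c2i2.com/~budstv"),
    ("#Buddy", "#father-of", "#Lynne"),
]

def match(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None is a wildcard (hypothetical helper)."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Everything stated about Buddy.
buddy_facts = match(triples, s="#Buddy")

# Follow a link: the object of one statement is the subject of another.
business = match(triples, s="#Buddy", p="#owns")[0][2]
website = match(triples, s=business, p="#has-website")[0][2]
```

A list of tuples is enough to show the data model; a real store would add indexing and URI resolution.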
Converting N3 to RDF • Jena toolkit can do such conversion
XML Syntax for RDF • RDF has an XML syntax that has a specific meaning: • Every Description element describes a resource • Every attribute or nested element inside a Description is a property of that resource • We can refer to resources by using URIs <rdf:Description about="some.uri/person/ganji"> <studentOf resource="some.uri/Sharif/CE"/> </rdf:Description> <rdf:Description about="some.uri/Sharif/CE"> <hasHomePage>http://ce.sharif.edu</hasHomePage> <departmentOf resource="some.uri/~Sharif"/> </rdf:Description>
RDF type • RDF predefined property • Its value: a resource that represents a category or class • Its subject: an instance of that category or class • prefix ex: URI: http://www.example.org/terms
Containers • Containers are collections • they allow grouping of resources (or literal values) • It is possible to make statements about the container (as a whole) or about its members individually • It is also possible to create collections based on URI patterns • for example, all files in a particular web site
RDF containers • Bag: (A resource having type rdf:Bag) • Represents an unordered list of resources or literals • Duplicate values are permitted • Sequence: (A resource having type rdf:Seq) • Represents an ordered list of resources or literals • Duplicate values are permitted • Alternatives: (A resource having type rdf:Alt) • Represents a group of resources or literals that are alternatives
Sequence example • [Figure: the resource http://www.w3.org/TR/REC-rdf-syntax has a dc:Creator property whose value is a node of rdf:Type rdf:Seq with members rdf:_1 “Ora Lassila” and rdf:_2 “Ralph Swick”]
RDF Schema (RDFS) • RDF gives a formalism for metadata annotation, and a way to write it down in XML, but it does not give any special meaning to vocabulary such as subClassOf or type • RDF Schema allows you to define vocabulary terms and the relations between those terms • it gives “extra meaning” to particular RDF predicates and resources • this “extra meaning”, or semantics, specifies how a term should be interpreted
Core Classes & Properties • Core Classes: rdfs:Resource, rdfs:Literal, rdf:XMLLiteral, rdfs:Class, rdf:Property • Core Properties: rdf:type, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range, rdfs:label, rdfs:comment
RDFS Examples <Person,type,Class> <hasColleague,type,Property> <Professor,subClassOf,Person> <Carole,type,Professor> <hasColleague,range,Person> <hasColleague,domain,Person>
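The “extra meaning” RDFS attaches to `subClassOf` and `type` can be sketched as a small forward-chaining loop over the example triples. This is only an illustration covering two rules (subclass transitivity and type inheritance); the `rdfs_closure` helper is hypothetical, and domain/range inference is deliberately left out.

```python
# Triples from the slide, written as (subject, predicate, object).
triples = {
    ("Person", "type", "Class"),
    ("hasColleague", "type", "Property"),
    ("Professor", "subClassOf", "Person"),
    ("Carole", "type", "Professor"),
    ("hasColleague", "range", "Person"),
    ("hasColleague", "domain", "Person"),
}

def rdfs_closure(facts):
    """Apply two RDFS-style rules until no new triples appear (illustrative helper)."""
    facts = set(facts)
    while True:
        new = set()
        for (a, p, b) in facts:
            if p != "subClassOf":
                continue
            for (c, q, d) in facts:
                # Rule 1: subClassOf is transitive.
                if q == "subClassOf" and c == b:
                    new.add((a, "subClassOf", d))
                # Rule 2: an instance of a subclass is an instance of the superclass.
                if q == "type" and d == a:
                    new.add((c, "type", b))
        if new <= facts:
            return facts
        facts |= new

inferred = rdfs_closure(triples)
carole_is_person = ("Carole", "type", "Person") in inferred
```

Here the only new statement is that Carole is a Person, which follows from her being a Professor.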
RDF/RDFS “Liberality” • No distinction between classes and instances (individuals) <Species,type,Class> <Lion,type,Species> <Leo,type,Lion> • Properties can themselves have properties <hasDaughter,subPropertyOf,hasChild> <hasDaughter,type,familyProperty> • No distinction between language constructors and ontology vocabulary, so constructors can be applied to themselves/each other <type,range,Class> <Property,type,Class> <type,subPropertyOf,subClassOf>
Problems with RDFS • RDFS too weak to describe resources in sufficient detail • No localised range and domain constraints • Can’t say that the range of hasChild is person when applied to persons and elephant when applied to elephants • No existence/cardinality constraints • Can’t say that all instances of person have a mother that is also a person, or that persons have exactly 2 parents • No transitive, inverse or symmetrical properties • Can’t say that isPartOf is a transitive property, that hasPart is the inverse of isPartOf or that touches is symmetrical • … • Difficult to provide reasoning support • No “native” reasoners for non-standard semantics • May be possible to reason via FO axiomatisation
RDF(S) tools • Read RDF data • Parsers: Jena, Redland, SWI-Prolog • Validators: W3C RDF validation service • Editors: IsaViz, RDF Author, RDFEd, InferEd • Store RDF data (XML format, triples or relational/OO DB) • Sesame, RSSDB, RDFLib • Use RDF data (applications, RSS news, etc.) • Manipulate RDF data (inference, query, etc.) • Jena RDQL, etc. • Example: SELECT ?person, ?knows WHERE (?x <http://xmlns.com/foaf/0.1/knows> ?z), (?x <http://xmlns.com/foaf/0.1/name> ?person), (?z <http://xmlns.com/foaf/0.1/name> ?knows)
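To make the query example concrete, here is a rough Python sketch of the join it describes: find pairs of names where one resource knows another. The data, the short predicate names `knows` and `name` (standing in for the full property URIs), and the `select_person_knows` helper are all assumptions for illustration; this is not how Jena evaluates RDQL.

```python
# Hypothetical data: (subject, predicate, object) triples.
triples = [
    ("_:a", "name", "Alice"),
    ("_:b", "name", "Bob"),
    ("_:c", "name", "Carol"),
    ("_:a", "knows", "_:b"),
    ("_:b", "knows", "_:c"),
]

def select_person_knows(store):
    """Join the three patterns (?x knows ?z), (?x name ?person), (?z name ?knows)."""
    # Simplification: assumes each resource has at most one name.
    names = {s: o for (s, p, o) in store if p == "name"}
    results = []
    for (x, p, z) in store:
        if p == "knows" and x in names and z in names:
            results.append((names[x], names[z]))
    return results

rows = select_person_knows(triples)
```

Each result row binds `?person` and `?knows`, the two variables named in the SELECT clause.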
RDF Validators • RDF Validation Service • http://www.w3.org/RDF/Validator/ • In general all the RDF parsers do some kind of validation
References • RDF Resource Guide: • http://www.ilrt.bris.ac.uk/discovery/rdf/resources/ • http://www.w3.org/RDF • http://www.w3.org/RDF/Validator/
Text • Text coding in bits • EBCDIC, ASCII • Initially, 7 bits. Later, 8 bits • Unicode • 16 bits, to accommodate oriental languages
Text • Formats • No single format exists • IR system should retrieve information from different formats • Past: IR systems convert the documents • Today: IR systems use filters
Text • Formats • Formats for document interchange (RTF) • Formats for displaying (PDF, PostScript) • Formats for encoding email (MIME) • Compressed files • uuencode/uudecode, binhex
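Text • Information Theory • Amount of information is related to the distribution of symbols in the document. • Entropy: $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$, where $\sigma$ is the number of symbols in the alphabet and $p_i$ is the probability of the i-th symbol • Definition of entropy depends on the probabilities of each symbol. • Text models are used to obtain those probabilities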
Text • Example - Entropy • 001001011011
Text • Example - Entropy • 111111111111
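The two example strings can be checked with a short sketch. It assumes the simplest text model, in which each symbol's probability is its relative frequency in the string itself; the `entropy` function name is illustrative.

```python
import math
from collections import Counter

def entropy(text):
    """Shannon entropy in bits per symbol, using observed symbol frequencies as probabilities."""
    counts = Counter(text)
    total = len(text)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

e_mixed = entropy("001001011011")   # six 0s and six 1s
e_const = entropy("111111111111")   # a single repeated symbol
```

Under this model the first string carries 1 bit per symbol (both symbols equally likely) and the second carries 0 bits, since its only symbol is certain.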
Text • Modeling Natural Language • Symbols: separate words or belong to words • Symbols are not uniformly distributed • binomial model • Dependency on previous symbols • k-order Markovian model • We can take words as symbols
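Text • Modeling Natural Language • Word distribution inside documents • Zipf's Law: the i-th most frequent word appears $1/i^{\theta}$ times as often as the most frequent word, hence in a text of $n$ words with a vocabulary of $V$ words the i-th most frequent word appears $n / (i^{\theta} H_V(\theta))$ times, where $H_V(\theta) = \sum_{j=1}^{V} 1/j^{\theta}$ • Real data fits better with $\theta$ between 1.5 and 2.0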
Text • Modeling Natural Language • Example - word distribution (Zipf’s Law) • V=1000, $\theta$ = 2 • most frequent word: n=300 • 2nd most frequent: n=76 • 3rd most frequent: n=33 • 4th most frequent: n=19
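A small sketch of the Zipf computation: with exponent $\theta$, the i-th most frequent word is expected about $n / (i^{\theta} H_V(\theta))$ times, where $H_V(\theta)$ is the sum of $1/j^{\theta}$ over the vocabulary. The slide gives V and the exponent but not the text length, so `n=500` below is purely an assumption, as is the function name.

```python
def zipf_frequencies(n, vocab_size, theta, top=4):
    """Expected occurrences of the `top` most frequent words under Zipf's law."""
    # Harmonic number of order theta: H_V(theta) = sum_{j=1..V} 1 / j**theta
    h = sum(1.0 / j ** theta for j in range(1, vocab_size + 1))
    return [n / (i ** theta * h) for i in range(1, top + 1)]

# n = 500 is an assumed text length; the slide states only V and the exponent.
freqs = zipf_frequencies(n=500, vocab_size=1000, theta=2)
```

Whatever n is, the ratios are fixed by the exponent: with an exponent of 2 the second word is expected a quarter as often as the first, the third a ninth, and the fourth a sixteenth.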
Text • Modeling Natural Language • Number of distinct words • Heaps’ Law: $V = K n^{\beta} = O(n^{\beta})$ • Set of different words is bounded by a constant, but the limit is too high
Text • Modeling Natural Language • Heaps’ Law example • K between 10 and 100, $\beta$ is less than 1 • example: n=400000, $\beta$ = 0.5 • K=25, V=15811 • K=35, V=22135
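The example figures can be reproduced with a one-line sketch of Heaps' law, taking the vocabulary size as K times n raised to an exponent of 0.5. The function name is illustrative, and truncating to an integer is an assumption made here to match how the slide appears to report V.

```python
def heaps_vocabulary(n, k, beta):
    """Vocabulary size predicted by Heaps' law, V = K * n**beta."""
    return k * n ** beta

# Text of 400000 words, exponent 0.5, for the two K values on the slide.
v25 = int(heaps_vocabulary(400_000, 25, 0.5))
v35 = int(heaps_vocabulary(400_000, 35, 0.5))
```

Because the exponent is below 1, doubling the text size grows the vocabulary by much less than a factor of two.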
Text • Modeling Natural Language • Length of the words • defines total space needed for vocabulary • Heaps’ Law: length increases logarithmically with text size. • In practice, a finite-state model is used • space has p=0.2 • space cannot appear twice in succession • there are 26 letters
Text • Similarity Models • Distance Function • Should be symmetric and satisfy triangle inequality • Hamming Distance • number of positions that have different characters • example: reverse vs. receive
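A minimal sketch of Hamming distance on the slide's pair of words. Raising an error for strings of different length is a choice made here, since the definition only applies to equal-length strings.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

d = hamming("reverse", "receive")
```

The two words differ at the third, fifth and sixth positions, so the distance is 3.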
Text • Similarity Models • Edit (Levenshtein) Distance • minimum number of operations needed to make strings equal • example: survey vs. surgery • superior for modeling syntactic errors • extensions: weights, transpositions, etc.
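Edit distance is usually computed by dynamic programming. The sketch below uses the standard row-by-row formulation with unit costs for insertion, deletion and substitution; it does not cover the weighted or transposition extensions.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]

d = edit_distance("survey", "surgery")
```

For the slide's pair the distance is 2: substitute "v" with "g", then insert "r".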
Text • Similarity Models • Longest Common Subsequence (LCS) • example: survey vs. surgery, LCS: surey • Documents: lines as symbols (diff in Unix) • time consuming
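The LCS example can be reproduced with the textbook quadratic dynamic program, sketched below. It fills a table of LCS lengths and then walks back through it to recover one longest common subsequence; the quadratic table is also why the slide calls the method time consuming.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b."""
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover the subsequence.
    out = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

common = lcs("survey", "surgery")
```

The same routine works on lists of lines instead of strings of characters if the final join is replaced by returning the reversed list, which is the idea behind line-based diff.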
Conclusions • Text is the main form of communicating knowledge. • Documents have syntax, structure and semantics • Metadata: information about data • Formats of text • Modeling Natural Language • Entropy • Distribution of symbols • Similarity