Modern Information Retrieval

Modern Information Retrieval Chap. 06: Text and Multimedia Languages and Properties (Introduction, Metadata and Text) 6.1, 6.2, 6.3

Introduction • Text • main form of communicating knowledge. • Document • loosely defined, denote a single unit of information. • can be any physical unit • a file • an email • a Web Page

Introduction • Document • Syntax and structure • Semantics • Information about itself

Introduction • Document Syntax • Implicit, or expressed in a language (e.g., TeX) • Powerful languages: easier to parse, difficult to convert to other formats. • Open languages are better (interchange) • Semantics of texts in natural language are not easy for a computer to understand • Trend: languages that provide information on structure, format and semantics while being readable by humans and computers

Introduction • New applications are pushing for formats in which information can be represented independently of style. • Style: defined by the author, but the reader may decide part of it • Style can include treatment of other media

Metadata • “Data about the data” • e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc. • Descriptive Metadata • Author, source, length • Dublin Core Metadata Element Set • Semantic Metadata • Characterizes the subject matter within the document contents • MEDLINE

Metadata • Metadata information on Web documents • cataloging, content rating, property rights, digital signatures • New standard: Resource Description Framework • description of Web resources to facilitate automated processing of information • nodes and attached attribute/value pairs • Metadescription of non-textual objects • keywords can be used to search the objects

RDF Model • A model is a collection of statements • Statement := (predicate, subject, object) • Predicate is a resource • Subject is a resource • Object is either a resource or a literal • Figure: a statement drawn as a Subject node linked to an Object node by a Predicate edge

Example shown in triples view

RDF model and natural language • Subject. In grammar, this is the noun or noun phrase that is the doer of the action. In the sentence “The company sells batteries,” the subject is “the company.” • Predicate. In grammar, this is the part of a sentence that modifies the subject and includes the verb phrase. In our sentence, the predicate is the phrase “sells.” • Object. In grammar, this is a noun that is acted upon by the verb. In our sentence, the object is the noun “batteries.”

XML vs. RDF • RDF is not just an XML dialect. • XML: • Has a tree structure data model. • Only nodes are labeled. • RDF: • Has a graph structure data model. • Both edges (properties) and nodes (subjects/objects) are labeled.

Linking Statements • The subject of one statement can be the object of another • Such collections of statements form a directed, labeled graph • Figure: Ganji studentOf CE; CE departmentOf Sharif; CE hasHomePage http://ce.sharif.edu
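
The linking idea can be sketched with plain Python tuples. This is only an illustration, not an RDF library: statements use the (predicate, subject, object) order of the RDF Model slide, the resource names are taken from the figure labels, and the helper `objects_of` is a hypothetical name introduced here.

```python
# Each statement is a (predicate, subject, object) tuple.
# Names come from the slide's figure; the data structure is an assumption.
statements = [
    ("studentOf", "Ganji", "CE"),
    ("departmentOf", "CE", "Sharif"),
    ("hasHomePage", "CE", "http://ce.sharif.edu"),
]


def objects_of(subject, predicate):
    """Return every object reached from `subject` by an edge labeled `predicate`."""
    return [o for (p, s, o) in statements if s == subject and p == predicate]


# "CE" is the object of the first statement and the subject of the other two,
# so following edges walks the directed, labeled graph.
department = objects_of("Ganji", "studentOf")[0]
home_page = objects_of(department, "hasHomePage")[0]
print(department, home_page)
```

Because the object of one tuple reappears as the subject of another, no extra structure is needed to represent the graph: the list of statements is the graph.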

RDF Graph: ‘anonymous’ nodes • Figure (node types Person, PersonName, Literal): Person12345, person.name, first with value Jonathan, last with value Borden

How can RDF be implemented • Usually RDF/XML syntax • However other notations are possible • e.g. Notation3: • Buddy Belden owns a business. • The business has a Web site accessible at http://www.c2i2.com/~budstv. • Buddy is the father of Lynne. • <#Buddy> <#owns> <#business>. • <#business> <#has-website> <http://www.c2i2.com/~budstv>. • <#Buddy> <#father-of> <#Lynne>.
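
As a rough sketch of how the three Notation3 statements above map to triples, the following reads only the simplest statement form (three `<...>` terms followed by a period). It is not a real N3 parser, and `parse_simple_n3` is a hypothetical helper. Note that N3 writes subject, predicate, object in that order.

```python
import re

N3_TEXT = """
<#Buddy> <#owns> <#business>.
<#business> <#has-website> <http://www.c2i2.com/~budstv>.
<#Buddy> <#father-of> <#Lynne>.
"""

# Matches only the simplest statement form: three <...> terms and a final period.
TRIPLE = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.")


def parse_simple_n3(text):
    """Return a (subject, predicate, object) tuple for each matching statement."""
    return [match.groups() for match in TRIPLE.finditer(text)]


triples = parse_simple_n3(N3_TEXT)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

A toolkit such as Jena handles the full syntax (prefixes, literals, blank nodes); the regular expression here covers none of that.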

Converting N3 to RDF • Jena toolkit can do such conversion

XML Syntax for RDF • RDF has an XML syntax that has a specific meaning: • Every Description element describes a resource • Every attribute or nested element inside a Description is a property of that resource • We can refer to resources by using URIs <rdf:Description about="some.uri/person/ganji"> <studentOf resource="some.uri/Sharif/CE"/> </rdf:Description> <rdf:Description about="some.uri/Sharif/CE"> <hasHomePage>http://ce.sharif.edu</hasHomePage> <departmentOf resource="some.uri/~Sharif"/> </rdf:Description>

RDF type • RDF predefined property • Its value: a resource that represents a category or class • Its subject: an instance of that category or class • The example figure uses the prefix ex: for the URI http://www.example.org/terms

Containers • Containers are collections • they allow grouping of resources (or literal values) • It is possible to make statements about the container (as a whole) or about its members individually • It is also possible to create collections based on URI patterns • for example, all files in a particular web site

RDF containers • Bag: (A resource having type rdf:Bag) • Represents an unordered list of resources or literals • Duplicate values are permitted • Sequence: (A resource having type rdf:Seq) • Represents an ordered list of resources or literals • Duplicate values are permitted • Alternatives: (A resource having type rdf:Alt) • Represents a group of resources or literals that are alternatives

Sequence example • Figure: http://www.w3.org/TR/REC-rdf-syntax has dc:Creator pointing to a node of rdf:type rdf:Seq, whose members are rdf:_1 “Ora Lassila” and rdf:_2 “Ralph Swick”

Bag example

RDF Schema (RDFS) • RDF gives a formalism for metadata annotation, and a way to write it down in XML, but it does not give any special meaning to vocabulary such as subClassOf or type • RDF Schema allows you to define vocabulary terms and the relations between those terms • it gives “extra meaning” to particular RDF predicates and resources • this “extra meaning”, or semantics, specifies how a term should be interpreted

Core Classes & Properties • Core Classes: rdfs:Resource, rdfs:Literal, rdf:XMLLiteral, rdfs:Class, rdf:Property • Core Properties: rdf:type, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range, rdfs:label, rdfs:comment

RDFS Examples <Person,type,Class> <hasColleague,type,Property> <Professor,subClassOf,Person> <Carole,type,Professor> <hasColleague,range,Person> <hasColleague,domain,Person>
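
To illustrate the “extra meaning” RDFS attaches to subClassOf, here is a minimal sketch of one inference rule applied to the example triples above: if x has type C and C is a subClassOf D, then x also has type D. The function `infer_types` is hypothetical and covers only this single rule, not the full RDFS semantics.

```python
# The slide's example triples, written as (subject, property, value).
triples = {
    ("Person", "type", "Class"),
    ("hasColleague", "type", "Property"),
    ("Professor", "subClassOf", "Person"),
    ("Carole", "type", "Professor"),
    ("hasColleague", "range", "Person"),
    ("hasColleague", "domain", "Person"),
}


def infer_types(facts):
    """Apply (x type C) and (C subClassOf D) => (x type D) until nothing new appears."""
    facts = set(facts)
    while True:
        new = {
            (x, "type", d)
            for (x, p1, c) in facts if p1 == "type"
            for (c2, p2, d) in facts if p2 == "subClassOf" and c2 == c
        }
        if new <= facts:
            return facts
        facts |= new


closure = infer_types(triples)
print(("Carole", "type", "Person") in closure)
```

The triple stating that Carole is a Person is not in the input; it only appears once the subClassOf rule is applied.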

RDF/RDFS “Liberality” • No distinction between classes and instances (individuals) <Species,type,Class> <Lion,type,Species> <Leo,type,Lion> • Properties can themselves have properties <hasDaughter,subPropertyOf,hasChild> <hasDaughter,type,familyProperty> • No distinction between language constructors and ontology vocabulary, so constructors can be applied to themselves/each other <type,range,Class> <Property,type,Class> <type,subPropertyOf,subClassOf>

Problems with RDFS • RDFS too weak to describe resources in sufficient detail • No localised range and domain constraints • Can’t say that the range of hasChild is person when applied to persons and elephant when applied to elephants • No existence/cardinality constraints • Can’t say that all instances of person have a mother that is also a person, or that persons have exactly 2 parents • No transitive, inverse or symmetrical properties • Can’t say that isPartOf is a transitive property, that hasPart is the inverse of isPartOf or that touches is symmetrical • … • Difficult to provide reasoning support • No “native” reasoners for non-standard semantics • May be possible to reason via FO axiomatisation

RDF(S) tools • Read RDF data • Parsers: Jena, Redland, SWI-Prolog • Validators: W3C RDF validation service • Editors: IsaViz, RDF Author, RDFEd, InferEd • Store RDF data (XML format, triples or relational/OO DB) • Sesame, RSSDB, RDFLib • Use RDF data (applications, RSS news, etc.) • Manipulate RDF data (inference, query, etc.) • Jena RDQL, etc. • Example: SELECT ?person, ?knows WHERE (?x <http://xmlns.com/foap/knows> ?z), (?x <http://xmlns.com/foap/name> ?person), (?z <http://xmlns.com/foap/name> ?knows)

RDF Validators • RDF Validation Service • http://www.w3.org/RDF/Validator/ • In general all the RDF parsers do some kind of validation

References • RDF Resource Guide: • http://www.ilrt.bris.ac.uk/discovery/rdf/resources/ • http://www.w3.org/RDF • http://www.w3.org/RDF/Validator/

Text • Text coding in bits • EBCDIC, ASCII • Initially, 7 bits. Later, 8 bits • Unicode • 16 bits, to accommodate oriental languages

Text • Formats • No single format exists • IR system should retrieve information from different formats • Past: IR systems convert the documents • Today: IR systems use filters

Text • Formats • Formats for document interchange (RTF) • Formats for displaying (PDF, PostScript) • Formats for encoding email (MIME) • Compressed files • uuencode/uudecode, binhex

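Text • Information Theory • Amount of information is related to the distribution of symbols in the document. • Entropy: $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$, where $p_i$ is the probability of symbol $i$ and $\sigma$ is the alphabet size • Definition of entropy depends on the probabilities of each symbol. • Text models are used to obtain those probabilities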

Text • Example - Entropy • 001001011011

Text • Example - Entropy • 111111111111
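
The computed values for the two example strings did not survive in the transcript. A minimal sketch of the calculation follows, assuming a zero-order model in which each symbol's probability is its relative frequency in the string itself, and the usual definition of entropy as the sum of -p_i log2 p_i over the symbols. The function name `entropy` is chosen here for illustration.

```python
import math
from collections import Counter


def entropy(text):
    """Zero-order entropy in bits per symbol: sum of -p_i * log2(p_i)."""
    counts = Counter(text)
    total = len(text)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


print(entropy("001001011011"))  # six 0s and six 1s: both symbols equally likely
print(entropy("111111111111"))  # a single symbol: nothing to predict
```

The first string needs one full bit per symbol under this model, while the second carries no information at all, which is the contrast the two slides set up.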

Text • Modeling Natural Language • Symbols: separate words or belong to words • Symbols are not uniformly distributed • binomial model • Dependency on previous symbols • k-order Markov model • We can take words as symbols

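Text • Modeling Natural Language • Word distribution inside documents • Zipf’s Law: the i-th most frequent word appears $1/i^{\theta}$ times as often as the most frequent word; hence, in a text of $n$ words with a vocabulary of $V$ words, the i-th most frequent word appears $n / (i^{\theta} H_V(\theta))$ times, where $H_V(\theta) = \sum_{j=1}^{V} 1/j^{\theta}$ • Real data fits better with $\theta$ between 1.5 and 2.0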

Text • Modeling Natural Language • Example - word distribution (Zipf’s Law) • V=1000, θ = 2 • most frequent word: n=300 • 2nd most frequent: n=76 • 3rd most frequent: n=33 • 4th most frequent: n=19
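
A small sketch of the generalized Zipf formula, under which the i-th most frequent word is expected n / (i^θ · H_V(θ)) times in a text of n words, with H_V(θ) the sum of 1/j^θ for j from 1 to V. The slide does not state the text size, so `TOTAL_WORDS` below is an assumed value chosen only for illustration; the output is of the same order as the slide's figures, not a reproduction of them.

```python
def zipf_counts(n, vocabulary_size, theta, ranks):
    """Expected occurrences of the i-th most frequent word: n / (i**theta * H_V(theta))."""
    harmonic = sum(1.0 / j**theta for j in range(1, vocabulary_size + 1))
    return [n / (i**theta * harmonic) for i in ranks]


# Assumed text size; the slide gives only V=1000 and theta=2.
TOTAL_WORDS = 500

counts = zipf_counts(TOTAL_WORDS, vocabulary_size=1000, theta=2, ranks=[1, 2, 3, 4])
print([round(c) for c in counts])
```

With θ = 2 the second-ranked word is a quarter as frequent as the first and the fourth a sixteenth, which is the steep drop the slide's numbers show.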

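Text • Modeling Natural Language • Number of distinct words • Heaps’ Law: a text of $n$ words has a vocabulary of size $V = K n^{\beta} = O(n^{\beta})$ • The set of different words in a language is bounded by a constant, but the limit is too high to be reached in practice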

Text • Modeling Natural Language • Heaps’ Law example • K between 10 and 100, β is less than 1 • example: n=400000, β = 0.5 • K=25, V=15811 • K=35, V=22135
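
The two example values can be checked with a short sketch, assuming the usual form of Heaps' law, V = K · n^β, with β = 0.5 as in the example. The function name `heaps_vocabulary` is illustrative; the slide's figures correspond to the results truncated to whole words.

```python
def heaps_vocabulary(n, k, beta):
    """Heaps' law estimate of the vocabulary size: V = K * n**beta."""
    return k * n**beta


for k in (25, 35):
    print(k, int(heaps_vocabulary(400_000, k, 0.5)))
```

Even with 400,000 words of text, the estimated vocabulary stays in the tens of thousands, which is the sublinear growth the law describes.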

Text • Modeling Natural Language • Length of the words • defines total space needed for vocabulary • Heaps’ Law: length increases logarithmically with text size. • In practice, a finite-state model is used • space has p=0.2 • space cannot appear twice in a row • there are 26 letters

Text • Similarity Models • Distance Function • Should be symmetric and satisfy triangle inequality • Hamming Distance • number of positions that have different characters • Example: reverse vs. receive
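
A minimal sketch of the Hamming distance for the slide's word pair. The function name is illustrative, and the length check reflects the fact that the measure is defined only for strings of equal length.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("reverse", "receive"))
```

The two words differ at the third, fifth and sixth positions (v/c, r/i, s/v), so the distance is 3.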

Text • Similarity Models • Edit (Levenshtein) Distance • minimum number of operations needed to make strings equal • Example: survey vs. surgery • superior for modeling syntactic errors • extensions: weights, transpositions, etc.
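
The edit distance can be computed with the standard dynamic-programming recurrence. The sketch below assumes unit cost for each insertion, deletion and substitution (none of the weighted or transposition extensions), and keeps only two rows of the table at a time. The function name is illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(edit_distance("survey", "surgery"))
```

For the slide's pair, two operations suffice: substitute v with g, then insert r.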

Text • Similarity Models • Longest Common Subsequence (LCS) • Example: survey vs. surgery, LCS: surey • Documents: lines as symbols (diff in Unix) • time consuming
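
A sketch of the textbook dynamic-programming solution for the longest common subsequence, which takes time proportional to the product of the two lengths (hence "time consuming" on long inputs). It works on strings here; the function name `lcs` is illustrative, and this is not how any particular diff tool is implemented.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the subsequence.
    result = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(result))


print(lcs("survey", "surgery"))
```

The same table-filling idea applies when the symbols are whole lines rather than characters, which is the document-comparison case the slide mentions.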

Conclusions • Text is the main form of communicating knowledge. • Documents have syntax, structure and semantics • Metadata: information about data • Formats of text • Modeling Natural Language • Entropy • Distribution of symbols • Similarity
