Content Types: Text and Metadata

Introduction
  • Text documents come in many forms
    • Article (news, conference, journal, etc.)
    • Email, memo, …
    • Book, manual, manuscript, transcript, …
    • Any part of one of the above
  • Syntax can express
    • Structure
    • Presentation style
    • Semantics (e.g. software code)
Metadata
  • Metadata – data about data
  • Descriptive metadata
    • External to meaning of document
    • Author, publication date, document source, document length, document genre, file type, bits per second, frame rate, etc.
  • Semantic metadata
    • Characterizes semantic content of document
    • Library of Congress (LoC) subject headings, keywords, subject headings from ontologies (e.g. MeSH), etc.
Metadata Formats
  • Machine Readable Cataloging Record (MARC)
    • Used by most libraries
    • Fields include title, author, etc.
  • Resource Description Framework (RDF)
    • Used for Web resources
    • Node and attribute / value pairs
    • Node ID is any Uniform Resource Identifier (URI), which could be a URL
Metadata Sets
  • Dublin Core Metadata Elements
    • Contributor – entities contributing to the content
    • Coverage – extent or scope of content (spatial area, temporal period, …)
    • Creator – entity primarily responsible for making the content
    • Date – date associated with event (e.g. publication) for resource
    • Description – abstract, table of contents, …
    • Format – media (file) type, dimensions (size, duration), hardware needed
    • Identifier – unique identifier
    • Language – language of content
    • Publisher – entity responsible for making resource available
    • Relation – reference to related resource(s)
    • Rights – information about rights held in/over resource
    • Source – resource from which content is derived
    • Subject – keywords, key phrases, classification code, etc.
    • Title – name of the resource
    • Type – nature or genre of content
Text Formats
  • Coding schemes
    • EBCDIC (8 bit, one of the first coding schemes)
    • ASCII (initially 7 bit, extended to 8 bit)
    • Unicode (originally 16 bit, for large alphabets)
  • Additional Formats
    • RTF (format-oriented document exchange)
    • PDF and PostScript (display-oriented representation)
    • Multipurpose Internet Mail Extensions (MIME) (multiple character sets, languages, media)
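
To make the difference between coding schemes concrete, the sketch below encodes the same characters under several of them using Python's built-in codecs. The choice of `cp037` as the EBCDIC code page and the sample string are assumptions made for illustration only.

```python
# Encode the same characters under several coding schemes.
# "cp037" is one EBCDIC code page that ships with Python (an assumed choice).
text = "Aé"

ascii_bytes = "A".encode("ascii")        # ASCII: one byte, high bit clear
ebcdic_bytes = "A".encode("cp037")       # EBCDIC: one byte, different value
utf8_bytes = text.encode("utf-8")        # Unicode as UTF-8: 1 byte for A, 2 for é
utf16_bytes = text.encode("utf-16-be")   # Unicode as UTF-16: 2 bytes per character here

print(ascii_bytes.hex(), ebcdic_bytes.hex(), utf8_bytes.hex(), utf16_bytes.hex())
```

The same letter maps to different byte values in ASCII and EBCDIC, which is why a document's coding scheme has to be known before its text can be read.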
Information Theory
  • How can we predict information value of components of a document?
  • Entropy – attempts to model information content (information uncertainty)
  • $E = -\sum_{i} p_i \log_2 p_i$, summed over all symbols $i$ in the alphabet
    • $p_i$ is the probability of symbol $i$ (symbol frequency over number of symbols)
  • Need a text model for real language
  • Also important for compression as E acts as a limit of how much a text can be compressed.
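
The entropy formula can be sketched in a few lines. This version assumes the simplest possible text model, in which each symbol's probability is just its relative frequency in the text itself; the function name `entropy` is chosen here for illustration.

```python
import math
from collections import Counter


def entropy(text: str) -> float:
    """Entropy in bits per symbol, using symbol frequencies as probabilities."""
    counts = Counter(text)
    total = len(text)
    # Sum -p_i * log2(p_i) over every distinct symbol in the text.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


print(entropy("abab"))  # two equally likely symbols -> 1.0 bit per symbol
print(entropy("abcd"))  # four equally likely symbols -> 2.0 bits per symbol
```

A text made of one repeated symbol has entropy 0, which matches the intuition that it carries no information and compresses almost completely.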
Modeling Character Strings
  • Symbols in natural language (NL) are not evenly distributed
    • Some symbols are not part of words (often used for syntax)
    • Symbols in words are not evenly distributed
  • Models
    • Binomial model uses distribution of symbols in language
      • But previous symbols influence probabilities of later symbols
      • (what letter will appear after a q?)
    • Finite context or Markovian models used for this dependency
      • k-order where k is the number of previous characters taken into account by the model
      • Thus, the binomial model is a 0-order model
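
A k-order model can be sketched by counting which character follows each k-character context. The sample sentence and the function names `train` and `probability` are illustrative assumptions; a real model would be estimated from a large corpus.

```python
from collections import Counter, defaultdict


def train(text: str, k: int) -> dict:
    """Count which character follows each k-character context."""
    model = defaultdict(Counter)
    for i in range(k, len(text)):
        model[text[i - k:i]][text[i]] += 1
    return model


def probability(model: dict, context: str, symbol: str) -> float:
    """Estimated probability of `symbol` given the preceding `context`."""
    counts = model.get(context)
    if not counts:
        return 0.0
    return counts[symbol] / sum(counts.values())


sample = "the quick queen quietly quit the quiz"  # made-up sample text
order1 = train(sample, 1)   # 1-order: one previous character as context
order0 = train(sample, 0)   # 0-order: empty context, plain symbol frequencies

print(probability(order1, "q", "u"))  # every q in the sample is followed by u
print(probability(order0, "", "u"))   # frequency of u ignoring context
```

With k = 0 the context is always the empty string, so the model reduces to the plain symbol distribution described above.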
Word Distribution in Documents
  • How frequent are words within documents?
  • Zipf’s Law
    • Frequency of the i-th most frequent word is $1/i^{\theta}$ times the frequency of the most frequent word
    • The value of $\theta$ depends on the text (a value of 1 gives a logarithmic distribution)
    • Values of $\theta$ between 1.5 and 2.0 best model real texts
  • In practice, a few hundred words make up 50% of most texts
    • Frequent words provide less information
    • Thus, many search strategies involve ignoring stopwords (a, an, the, is, of, by, …)
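
The sketch below ranks words by frequency and compares each observed count with the Zipf prediction. The toy sentence and the choice of theta = 1.0 are assumptions for illustration; a text this short cannot be expected to fit the law well.

```python
from collections import Counter


def zipf_prediction(top_frequency: float, rank: int, theta: float) -> float:
    """Predicted frequency of the word at `rank` under Zipf's law."""
    return top_frequency / rank ** theta


def ranked_frequencies(text: str) -> list:
    """(word, count) pairs, most frequent first."""
    return Counter(text.lower().split()).most_common()


text = "the cat and the dog and the bird saw the cat"  # made-up sample
ranked = ranked_frequencies(text)
top = ranked[0][1]

for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq, round(zipf_prediction(top, rank, 1.0), 2))
```

Even in this tiny sample the most frequent word is a stopword, which is the pattern that motivates dropping stopwords during search.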
Word Distribution in Collections
  • Simplest to assume uniform distribution of words in documents
    • But not true
  • Better models built on negative binomial distributions or Poisson distributions
Vocabulary Size for Documents and Collections
  • Heaps' Law
    • Vocabulary size (V) grows with number of words (n)
      • $V = K n^b$
      • Experimentally,
        • K is between 10 and 100
        • b is between 0.4 and 0.6
    • So vocabulary grows roughly in proportion to the square root of the size of the document or collection in words
    • Works best for large documents & collections
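
As a rough sketch, the predicted vocabulary size can be computed directly and compared with growth measured on a real word list. The default values K = 50 and b = 0.5 are assumptions picked from inside the ranges quoted above, and both function names are illustrative.

```python
def heaps_vocabulary(n: int, k: float = 50.0, b: float = 0.5) -> float:
    """Predicted vocabulary size for a text of n words (K and b are assumed)."""
    return k * n ** b


def observed_growth(words: list, step: int) -> list:
    """(words read, distinct words seen) sampled every `step` words."""
    seen = set()
    points = []
    for i, word in enumerate(words, start=1):
        seen.add(word)
        if i % step == 0:
            points.append((i, len(seen)))
    return points


print(heaps_vocabulary(10_000))       # 50 * sqrt(10000) = 5000.0
print(heaps_vocabulary(1_000_000))    # 100x more text, only 10x more vocabulary
print(observed_growth("a b a c a b d".split(), 2))
```

Fitting K and b to the points returned by `observed_growth` on a large collection is one way to check how well the law holds for that collection.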
String Similarity Models
  • Similarity is measured by a distance function
  • Hamming distance – number of positions at which two strings of equal length differ
  • Levenshtein distance – minimum number of insertions, deletions, and substitutions needed to make strings equal
    • color to colour is 1
    • survey to surgery is 2
  • Can be extended to documents
    • UNIX diff treats each line as a character
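
Both distances can be sketched briefly. The Levenshtein version below uses the standard dynamic-programming recurrence, keeping only one previous row; it is one common way to compute the distance, not the only one.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b) -> int:
    """Minimum insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete item_a
                current[j - 1] + 1,                   # insert item_b
                previous[j - 1] + (item_a != item_b)  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("color", "colour"))    # 1
print(levenshtein("survey", "surgery"))  # 2
```

Because `levenshtein` only compares items for equality, passing two lists of lines instead of two strings gives a line-level distance in the spirit of diff.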
Introduction
  • Markup languages use extra textual syntax to encode:
    • Formatting / display information
    • Structure information
    • Descriptive metadata
    • Semantic metadata
  • Marks are often called tags
    • The act of adding markup is called tagging
    • Most markup languages use initial and ending tags surrounding the marked text
Standard Generalized Markup Language (SGML)
  • Metalanguage for markup.
    • Includes rules for defining markup language
    • Use of SGML includes
      • Description of structure of markup
      • Text marked with tags
  • Document Type Definition (DTD)
    • Describes and names tags and how they are related
    • Comments used to express interpretation of tags (meaning, presentation, …)
SGML DTD Example
  • <!-- SGML DTD for electronic messages -->
  • <!ELEMENT e-mail - - (prolog, contents) >
  • <!ELEMENT prolog - - (sender, address+, subject?, Cc*) >
  • <!ELEMENT (sender | address | subject | Cc) - O (#PCDATA) >
  • <!ELEMENT contents - - (par | image | audio)+ >
  • <!ELEMENT par - O (ref | #PCDATA)+ >
  • <!ELEMENT ref - O EMPTY >
  • <!ELEMENT (image | audio) - - (#NDATA) >
  • <!ATTLIST e-mail
    • id ID #REQUIRED
    • date_sent DATE #REQUIRED
    • status (secret | public) public >
  • <!ATTLIST ref
    • id IDREF #REQUIRED >
  • <!ATTLIST (image | audio)
    • id ID #REQUIRED >
SGML Example
  • <!DOCTYPE e-mail SYSTEM "e-mail.dtd">
  • <e-mail id=94108rby date_sent=02101998>
  • <prolog>
  • <sender> Pablo Neruda</sender>
  • <address> Federico Garcia Lorca</address>
  • <address> Ernest Hemingway</address>
  • <subject> Picture of my house in Isla
  • <Cc> Gabriel Garcia Marquez</Cc>
  • </prolog>
  • <contents>
  • <par>
  • Here are two photos. One is of the view (photo <ref idref=F2>).
  • </par>
  • <image id=F1> "photo1.gif" </image>
  • <image id=F2> "photo2.jpg" </image>
  • </contents>
  • </e-mail>
SGML Characteristics
  • A DTD makes it possible to check whether a given document is valid, i.e. conforms to the declared structure.
  • SGML generally does not specify presentation/appearance.
  • Output specification standards:
    • DSSSL (Document Style Semantics and Specification Language)
    • FOSI (Formatting Output Specification Instance)
HyperText Markup Language (HTML)
  • Based on SGML
    • HTML DTD not explicitly referenced by documents
  • HTML documents can have documents embedded within them
    • Images or audio
    • Frames with other HTML documents
  • When programs are included, it is referred to as Dynamic HTML
  • Strict HTML includes only non-presentational markup.
    • Cascading Style Sheets (CSS) used to define presentation
  • In reality, presentational and structural markup are blended by HTML authoring applications.
(Original) HTML Limitations
  • In contrast to SGML:
    • Users cannot specify their own tags or attributes.
    • No support for nested structures that can represent database schemas or object-oriented hierarchies.
    • No support for validation of document by consuming applications.
eXtensible Markup Language (XML)
  • XML is a simplified subset of SGML
    • XML is a meta-language
    • XML designed for semantic markup that is both human and machine readable
    • No DTD is required
    • All tags must be closed
  • Extensible Stylesheet Language (XSL)
    • XML equivalent of CSS
    • Can be used to convert XML into HTML and CSS
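
The sketch below shows that an XML document can be parsed without any DTD, and that a missing closing tag is rejected. It uses Python's standard `xml.etree.ElementTree` module; the document is a cut-down XML rendering of the e-mail example, written here for illustration, and `is_well_formed` is a helper name chosen for this sketch.

```python
import xml.etree.ElementTree as ET

# A small XML version of the e-mail example (quoted attributes, all tags closed).
document = """<e-mail id="94108rby">
  <prolog>
    <sender>Pablo Neruda</sender>
    <address>Federico Garcia Lorca</address>
  </prolog>
</e-mail>"""

root = ET.fromstring(document)  # no DTD needed to parse
print(root.tag, root.attrib["id"], root.find("prolog/sender").text)


def is_well_formed(xml_text: str) -> bool:
    """True if the parser accepts the text as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<par>closed</par>"))     # accepted
print(is_well_formed("<par><ref>open</par>"))  # rejected: <ref> is never closed
```

This checks only that the markup is syntactically sound; checking a document against a DTD or schema is a separate step that this module does not perform.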
Multimedia
  • Lots of data file formats for non-textual data
    • Images
      • BMP, GIF, JPEG (JPG), TIFF
    • Audio
      • AU, MIDI, WAVE, MP3
    • Video
      • MPEG, AVI, QuickTime
    • Graphics / Virtual Environments
      • CGM, VRML, OpenGL
Audio and Video
  • Data files often have:
    • Header
      • Indicates time granularity, number of channels, bits per channel
      • Somewhat like a DTD
    • Data
      • The signal
  • Data may be compressed
    • Data may be in frequency domain rather than time domain
    • Data may be encoded as sequence of differences between consecutive time segments.
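
Encoding a signal as differences between consecutive samples can be sketched as follows. The sample values are made up, and real audio and video codecs combine this idea with quantization and entropy coding that are not shown here.

```python
from itertools import accumulate


def delta_encode(samples: list) -> list:
    """Keep the first sample, then store each change from the previous one."""
    if not samples:
        return []
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]


def delta_decode(deltas: list) -> list:
    """Rebuild the samples by running a cumulative sum over the deltas."""
    return list(accumulate(deltas))


signal = [100, 102, 105, 105, 103]  # hypothetical sample values
encoded = delta_encode(signal)
print(encoded)                      # [100, 2, 3, 0, -2]
print(delta_decode(encoded) == signal)
```

When a signal changes slowly, the differences are small numbers clustered around zero, which a later compression stage can store in fewer bits than the raw samples.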