WMES3103: Information Retrieval


TEXT AND MULTIMEDIA LANGUAGES AND PROPERTIES

INTRODUCTION





  • Text - main form of communicating data and information

  • Text also supplemented with multimedia elements - to make the contents of an IRS more attractive and interactive

  • A website that combines text and multimedia tends to attract more visitors than one that is text-only

  • In an IRS, text and multimedia are represented using special languages


  • New concept on information – metadata

  • Information about the organization of the data, the data domains, and the relationship between them

  • Data about data

  • Two types – descriptive and semantic

  • Descriptive metadata – metadata that describes a document or a unit of information

  • Commonly used Metadata :

    • Authors

    • Date of publication

    • Source of publication

    • Length of document

    • Type of document


  • Semantic metadata – characterizes the subject matter that can be obtained from the contents of the document:

    • Subject headings

    • Keywords

    • LC code
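
The two kinds of metadata listed above can be pictured as fields of a single record. The following is a minimal sketch in Python, not a standard schema: the class name `DocumentMetadata`, the helper `matches_keyword`, the field names and all sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentMetadata:
    # Descriptive metadata: facts about the document itself.
    authors: List[str]
    publication_date: str
    source: str
    length_pages: int
    document_type: str
    # Semantic metadata: what the document is about.
    subject_headings: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    lc_code: str = ""


def matches_keyword(meta: DocumentMetadata, term: str) -> bool:
    """Return True if the term is one of the keywords (case-insensitive)."""
    return term.lower() in (k.lower() for k in meta.keywords)


# Hypothetical sample record; every value here is a placeholder.
record = DocumentMetadata(
    authors=["A. Author"],
    publication_date="2001-01-01",
    source="Example Journal",
    length_pages=12,
    document_type="article",
    subject_headings=["Information retrieval"],
    keywords=["metadata", "indexing"],
    lc_code="Z699",  # placeholder classification code
)

print(matches_keyword(record, "Metadata"))
```

Keeping the descriptive and semantic fields in one record lets a retrieval system filter on either kind, for example by document type or by keyword.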


  • With computers, we need to code text into binary digits

  • First coding schemes – EBCDIC and ASCII; ASCII uses 7 bits to code each symbol (EBCDIC uses 8)

  • Later, ASCII was extended to 8 bits to accommodate other languages, accents and diacritical marks

  • Languages with large character sets (e.g. Chinese, Japanese, Korean) – Unicode, originally a 16-bit code
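
The difference between these coding schemes can be checked directly. The sketch below uses Python's built-in codecs; the helper name `bits_needed` is an assumption for this example, and `latin-1` and `utf-16-be` stand in for "8-bit extended ASCII" and "16-bit Unicode" respectively.

```python
def bits_needed(text: str, encoding: str) -> int:
    """Number of bits used to store `text` in the given encoding."""
    return 8 * len(text.encode(encoding))


def fits_in_ascii(text: str) -> bool:
    """True if every character has a code below 128 (the 7-bit ASCII range)."""
    return all(ord(ch) < 128 for ch in text)


e_acute = "\u00e9"   # Latin small letter e with acute accent
cjk_char = "\u4e2d"  # a common Chinese character

print(fits_in_ascii("A"))                  # plain ASCII letter
print(fits_in_ascii(e_acute))              # accented letter is outside ASCII
print(bits_needed(e_acute, "latin-1"))     # one byte in the 8-bit extension
print(bits_needed(cjk_char, "utf-16-be"))  # one 16-bit code unit
```

Note that a 7-bit ASCII character is still stored in a full byte in practice, so `bits_needed("A", "ascii")` reports 8.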



  • No one single format for a text document

  • A good IRS should be able to retrieve information from any format

  • Initially, an IRS would convert each document to an internal format, but this had many disadvantages

  • Now, many new formats have been developed for document interchange


  • RTF – Rich Text Format for word processing

  • PDF – Portable Document Format for displaying and printing documents

  • PostScript – powerful programming language for drawing

  • MIME – Multipurpose Internet Mail Extensions, used to encode e-mail

  • Files are compressed – Compress (Unix), ARJ (PCs), ZIP

  • Convert binary files to ASCII text – uuencode/uudecode, binhex
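
Compression and binary-to-ASCII conversion, as listed above, can both be tried with Python's standard library. This is only a sketch under assumptions: `zlib` (DEFLATE, the method also used inside ZIP archives) stands in for the compression tools named above, and Base64 is shown as the binary-to-text encoding commonly used with MIME e-mail. The sample data is arbitrary.

```python
import base64
import binascii
import zlib

data = bytes(range(32)) * 4  # 128 bytes of arbitrary binary data

# Compression: the round trip must give back the original bytes.
compressed = zlib.compress(data)
restored = zlib.decompress(compressed)

# uuencode works line by line; each line carries at most 45 input bytes.
chunk = data[:45]
uu_line = binascii.b2a_uu(chunk)     # printable ASCII bytes
uu_decoded = binascii.a2b_uu(uu_line)

# Base64: another way to turn binary data into ASCII text.
b64 = base64.b64encode(data)
b64_decoded = base64.b64decode(b64)

print(uu_line.decode("ascii").rstrip())
print(b64.decode("ascii"))
```

Both encodings make the data larger, which is why files are usually compressed first and converted to ASCII text afterwards.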


  • Markup = extra textual syntax that can be used to describe formatting actions, structure information, text semantics, attributes, etc.

  • Formal markup languages are more structured

  • Marks = tags - initial and ending tag surrounding the marked text

  • Standard metalanguage = SGML

  • New metalanguage for the Web = XML (eXtensible Markup Language) = subset of SGML

  • Most popular markup language used for the Web = HTML (HyperText Markup Language)
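
To illustrate how start and end tags surround the marked text, here is a small sketch that parses an XML fragment with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`document`, `title`, `author`, `keyword`, `type`) and the contents are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: each piece of text sits between a start tag
# such as <title> and the matching end tag </title>.
source = """
<document type="article">
  <title>Text and Multimedia Languages</title>
  <author>A. Author</author>
  <keyword>metadata</keyword>
  <keyword>markup</keyword>
</document>
""".strip()

root = ET.fromstring(source)

title = root.find("title").text                       # text inside <title>
keywords = [k.text for k in root.findall("keyword")]  # all <keyword> elements
doc_type = root.get("type")                           # attribute on the root tag

print(title)
print(keywords)
print(doc_type)
```

Because the markup describes structure rather than appearance, a retrieval system can index the title, author and keywords separately.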


  • Applications that handle different types of digital data originating from distinct types of media

  • Text, sound, images, video

  • Digital data distinct and different in volume, format, and processing requirements

  • Different types of formats necessary for storing each type of media


  • Different formats used commonly on the Web and in digital libraries

    • Images

    • Audio

    • Moving Images

    • Textual Images

    • Graphics and Virtual Reality


  • XBM, BMP, PCX – direct representation of a bit-mapped (or pixel-based) image

  • GIF (Graphics Interchange Format) – includes compression and is good for black-and-white images or images with a small number of colours or gray levels (up to 256)

  • JPEG (Joint Photographic Experts Group) – includes compression

  • TIFF (Tagged Image File Format) – used to exchange different documents between different applications and different computer platforms

  • TGA (Truevision Targa image file) – associated with video game boards

  • Various other image formats
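
Since each image format stores its data differently, a system usually has to identify the format before it can process a file. One common approach, sketched below, is to look at the first few bytes of the file. The function name `guess_image_format` is an assumption for this example, and only a few of the formats above are covered.

```python
from typing import Optional

# Leading "magic" bytes of a few common image formats.
SIGNATURES = [
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
    (b"BM", "BMP"),
]


def guess_image_format(header: bytes) -> Optional[str]:
    """Return a format name based on the first bytes of a file, or None."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return None


print(guess_image_format(b"GIF89a\x01\x00\x01\x00"))
print(guess_image_format(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
print(guess_image_format(b"plain text"))
```

In practice the header would be read from the file, for example with `open(path, "rb").read(8)`, where `path` is whatever file is being examined.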


  • Audio must be digitized before storage

  • AU, MIDI (standard format to interchange music between electronic instruments and computers), WAVE – for small pieces of digital audio

  • Audio libraries – RealAudio or CD formats

  • Animation or moving pictures

    • MPEG (Moving Picture Experts Group) – related to JPEG

    • Others – AVI, FLI, QuickTime
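
Digitizing audio means sampling the signal at regular intervals and storing each sample as a number. The sketch below generates a pure tone instead of recording real sound, then stores it in WAVE format in memory using Python's standard `wave` module. The sample rate, frequency, duration and the function name `digitize_tone` are all assumed values chosen for illustration.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000  # samples per second (assumed)
FREQUENCY = 440.0   # tone frequency in Hz (assumed)
DURATION = 0.5      # length of the tone in seconds (assumed)


def digitize_tone() -> bytes:
    """Sample a sine wave and return it as the bytes of a WAVE file."""
    n_samples = int(SAMPLE_RATE * DURATION)
    frames = bytearray()
    for i in range(n_samples):
        value = math.sin(2 * math.pi * FREQUENCY * i / SAMPLE_RATE)
        # Quantize each sample to a signed 16-bit integer (little-endian).
        frames += struct.pack("<h", int(value * 32767))

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(bytes(frames))
    return buffer.getvalue()


wav_bytes = digitize_tone()
print(len(wav_bytes))
```

A higher sample rate or more bits per sample gives better quality but a larger file, which is one reason audio formats differ so much in volume.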


  • Textual images – images that contain mainly typed or typeset text

  • Obtained by scanning the documents

  • For archival purposes

  • Saved as images but with further compression

  • Textual and non-textual parts are stored and compressed separately and, when needed, can be combined and displayed together


  • 3-dimensional graphics found on Web

  • CGM (Computer Graphics Metafile) standard

  • Metafile = collection of elements

  • CGM standard specifies which elements are allowed to occur in which positions in a metafile

  • VRML (Virtual Reality Modeling Language) – file format for describing interactive 3D objects and worlds - universal interchange format for 3D graphics and multimedia - can be used for various applications


  • HyTime = Hypermedia/Time-based Structuring Language – standard defined for multimedia document markup

  • An SGML architecture that specifies the generic hypermedia structure of documents
