
Text Languages

J. H. Wang, Mar. 4, 2008

Presentation Transcript


  1. Text Languages J. H. Wang Mar. 4, 2008

  2. The Retrieval Process • [Figure: diagram of the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database. Flows: user need, logical view, user feedback, query, retrieved docs, ranked docs. Numeric labels beside the components: 4, 10; 6, 7; 5; 8; 8; 2.]

  3. Text Languages (Ch. 6) • Metadata • Text • Markup Languages • Multimedia

  4. Introduction • Text • Main form of communicating knowledge • Document • Loosely defined; denotes a single unit of information • Can be any physical unit • a file • an email • a Web page

  5. Introduction • Document • Syntax and structure • Semantics • Information about itself (metadata)

  6. Introduction • Document syntax • Implicit, or expressed in a language (e.g., TeX) • Powerful languages: easier to parse, difficult to convert to other formats • Open languages are better (interchange) • Semantics of texts in natural language are not easy for a computer to understand • Trend: languages that provide information on structure, format, and semantics and are readable by both humans and computers (e.g., SGML)

  7. Introduction • New applications are pushing for formats in which information can be represented independently of style • Style: defined by the author, but the reader may decide part of it • Style can include treatment of other media

  8. Metadata • “Data about the data” • e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc. • Descriptive Metadata • Author, source, length • Dublin Core Metadata Element Set • Semantic Metadata • Characterizes the subject matter within the document contents • MEDLINE

  9. Metadata • MARC (Machine Readable Cataloging Record)
       100 0020 1 $aHagler, Ronald.
       245 0074 14$aThe bibliographic...
       250 0012 $a3rd. Ed.
       260 0052 $aChicago :$bALA, $c1997

  10. Metadata • Metadata information on Web documents • Cataloging, content rating, property rights, digital signatures • New standard: Resource Description Framework (RDF) • Description of Web resources to facilitate automated processing of information • Nodes and attached attribute/value pairs • Metadescription of non-textual objects • Keywords can be used to search the objects

  11. Metadata • RDF Example
        <RDF:RDF>
          <RDF:Description RDF:HREF = "page.html">
            <DC:Creator> John Smith </DC:Creator>
            <DC:Title> John’s Home Page </DC:Title>
          </RDF:Description>
        </RDF:RDF>
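
The RDF example can be read mechanically once the prefixes are bound to namespaces. The following is a minimal sketch, not the slide author's code: the `xmlns` declarations are an assumption added so that a standard XML parser accepts the fragment (the slide omits them), and the namespace URIs are the commonly used RDF and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

# Assumption: the slide omits namespace declarations; these URIs are the
# commonly used RDF and Dublin Core namespaces, added so the parser accepts it.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

doc = f"""<RDF:RDF xmlns:RDF="{RDF_NS}" xmlns:DC="{DC_NS}">
  <RDF:Description RDF:HREF="page.html">
    <DC:Creator> John Smith </DC:Creator>
    <DC:Title> John's Home Page </DC:Title>
  </RDF:Description>
</RDF:RDF>"""

root = ET.fromstring(doc)

# ElementTree addresses namespaced names as "{namespace-uri}local-name".
description = root.find(f"{{{RDF_NS}}}Description")
resource = description.get(f"{{{RDF_NS}}}HREF")
creator = description.find(f"{{{DC_NS}}}Creator").text.strip()
title = description.find(f"{{{DC_NS}}}Title").text.strip()

print(resource, "|", creator, "|", title)
```

The description node plus its attribute/value pairs (creator, title) is what a crawler or cataloging tool would extract for automated processing.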

  12. Metadata • RDF Schema Example

  13. Text • Text coding in bits • EBCDIC, ASCII • Initially, 7 bits. Later, 8 bits • Unicode • 16 bits, to accommodate oriental languages
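
The difference between the codings can be seen by encoding a few characters. This is a small sketch under stated assumptions: Latin-1 stands in for an 8-bit code (the slide does not name one), and the sample characters are chosen only for illustration.

```python
# 7-bit ASCII: every code value is below 128 and fits in one byte.
ascii_bytes = "A".encode("ascii")

# Assumption: Latin-1 is used here as a representative 8-bit code.
latin1_bytes = "é".encode("latin-1")

# A character outside the 8-bit range needs a 16-bit code unit in UTF-16.
utf16_bytes = "語".encode("utf-16-be")

# An accented letter cannot be represented in 7-bit ASCII at all.
try:
    "é".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(len(ascii_bytes), len(latin1_bytes), len(utf16_bytes), fits_in_ascii)
```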

  14. Text • Formats • No single format exists • IR system should retrieve information from different formats • Past: IR systems convert the documents • Today: IR systems use filters

  15. Text • Formats • Formats for document interchange (RTF) • Formats for displaying (PDF, PostScript) • Formats for encoding email (MIME) • Compressed files • Uuencode/uudecode, binhex

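  16. Text • Information Theory • Amount of information is related to the distribution of symbols in the document • Entropy: $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$, where $p_i$ is the probability of symbol $i$ and $\sigma$ is the alphabet size • Definition of entropy depends on the probabilities of each symbol • Text models are used to obtain those probabilities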

  17. Text • Example – Entropy • 001001011011

  18. Text • Example – Entropy • 111111111111
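
The worked values for the two example strings did not survive in the transcript. A minimal sketch of how they could be computed, assuming the standard Shannon entropy $E = -\sum_i p_i \log_2 p_i$ with each $p_i$ estimated from the symbol's relative frequency in the string itself (a zero-order model); the function name `entropy` is illustrative.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Zero-order entropy in bits per symbol, from observed relative frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# Six 0s and six 1s: both symbols are equally likely.
e_mixed = entropy("001001011011")

# A single repeated symbol carries no information.
e_constant = entropy("111111111111")

print(e_mixed, e_constant)
```

Under this model the first string needs one bit per symbol and the second needs none. A higher-order model that looks at preceding symbols could assign the first string a lower entropy.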

  19. Text • Modeling Natural Language • Symbols: separate words or belong to words • Symbols are not uniformly distributed • binomial model • Dependency on previous symbols • k-order Markovian model • We can take words as symbols
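
A k-order Markovian model estimates the probability of a symbol from the k symbols that precede it. The sketch below is one simple way to estimate such a model by counting; the function names and the toy input are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def markov_counts(text, k):
    """Count how often each symbol follows each context of k preceding symbols."""
    counts = defaultdict(Counter)
    for i in range(k, len(text)):
        counts[text[i - k:i]][text[i]] += 1
    return counts

def conditional_probability(counts, context, symbol):
    """Estimate P(symbol | context) from the counts; 0.0 for an unseen context."""
    total = sum(counts[context].values())
    return counts[context][symbol] / total if total else 0.0

model = markov_counts("abababab", k=1)
p_b_after_a = conditional_probability(model, "a", "b")

print(dict(model["a"]), p_b_after_a)
```

To use words as symbols, the same counting can be applied to a sequence of tokens with tuples of k words as contexts.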

  20. Text • Modeling Natural Language • Word distribution inside documents • Zipf’s Law: the i-th most frequent word appears $1/i^{\theta}$ times as often as the most frequent word • Real data fits better with θ between 1.5 and 2.0

  21. Text • Modeling Natural Language • Example – word distribution (Zipf’s Law) • V=1000, θ = 2 • Most frequent word: n=300 • 2nd most frequent: n=76 • 3rd most frequent: n=33 • 4th most frequent: n=19
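
The example figures can be approximated directly from Zipf's Law. This is a hedged sketch that assumes the count of the i-th most frequent word is the top count divided by $i^{\theta}$, with $\theta = 2$ and a top count of 300 as on the slide; the function name is illustrative.

```python
def zipf_counts(top_count, theta, ranks):
    """Expected count of the i-th most frequent word: top_count / i**theta."""
    return [top_count / (i ** theta) for i in range(1, ranks + 1)]

counts = zipf_counts(top_count=300, theta=2, ranks=4)

for rank, count in enumerate(counts, start=1):
    print(rank, round(count))
```

This yields roughly 300, 75, 33, and 19. The slide lists 76 for the second word, so its figures were presumably produced with slightly different rounding or normalization.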

  22. Text • Modeling Natural Language • Skewed distribution – stopwords • Distribution of words in the documents • binomial distribution • Poisson distribution

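  23. Text • Modeling Natural Language • Number of distinct words (vocabulary) • Heaps’ Law: $V = K n^{\beta}$, where $n$ is the text size in words and $V$ is the vocabulary size • The set of different words is bounded by a constant, but the limit is too high to be useful in practice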

  24. Text • Modeling Natural Language • Heaps’ Law example • K between 10 and 100, β is less than 1 • Example: n=400000, β = 0.5 • K=25, V=15811 • K=35, V=22135
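
The two vocabulary sizes in the example follow from Heaps' Law in the form $V = K n^{\beta}$. A minimal sketch, with an illustrative function name and the slide's parameter values:

```python
def heaps_vocabulary(n, K, beta):
    """Vocabulary size predicted by Heaps' Law: V = K * n**beta."""
    return K * n ** beta

v_25 = heaps_vocabulary(400_000, K=25, beta=0.5)
v_35 = heaps_vocabulary(400_000, K=35, beta=0.5)

# The slide's figures correspond to the integer part of each prediction.
print(int(v_25), int(v_35))
```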

  25. Text • Modeling Natural Language • Length of the words • defines total space needed for vocabulary • Heaps’ Law: length increases logarithmically with text size • In practice, a finite-state model is used • Space has p=0.2 • Space cannot appear twice in a row • There are 26 letters

  26. Text • Similarity Models • Distance Function • Should be symmetric and satisfy triangle inequality • Hamming Distance • Number of positions that have different characters • Example: reverse vs. receive
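
Hamming distance is defined only for strings of equal length. A short sketch applied to the slide's word pair (the function name is illustrative):

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))

distance = hamming("reverse", "receive")
print(distance)
```

The two words differ at the third, fifth, and sixth positions, giving a distance of 3.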

  27. Text • Similarity Models • Edit (Levenshtein) Distance • Minimum number of operations needed to make strings equal • Example: survey vs. surgery • Superior for modeling syntactic errors • Extensions: weights, transpositions, etc.
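
Edit distance is usually computed with dynamic programming. The sketch below assumes the basic unweighted variant with insertions, deletions, and substitutions each costing 1 (no transpositions or weights), and keeps only two rows of the table.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

distance = levenshtein("survey", "surgery")
print(distance)
```

For the slide's pair the result is 2: substitute v with g, then insert r.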

  28. Text • Similarity Models • Longest Common Subsequence (LCS) • Example: survey vs. surgery, LCS: surey • Documents: lines as symbols (diff in Unix) • time consuming • similar lines • Fingerprints • Visual tools
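
The LCS in the example can be recovered with the classic dynamic-programming table followed by a backward trace. This is a minimal quadratic-time sketch, not the algorithm used by any particular diff tool.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # table[i][j] = length of the LCS of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back from the bottom-right corner to collect the matched symbols.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

common = lcs("survey", "surgery")
print(common)
```

Comparing documents line by line follows the same recurrence with lines in place of characters, which is why it becomes time consuming for large files.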

  29. Markup Languages • Markup: formatting actions, structure information, text semantics, attributes, … • Tags: formatting commands • SGML: standard metalanguage for markup • XML: a subset of SGML • HTML: an instance of SGML

  30. SGML • Standard Generalized Markup Language (ISO 8879) • A description of the document structure • The text marked with tags which describe the structure • DTD (Document Type Definition) • Does not define the semantics (meaning, presentation, and behavior) • Tags: denoted by angle brackets (<tag>)

  31. Output specifications are often added to SGML documents • DSSSL (Document Style Semantics and Specification Language), FOSI (Formatting Output Specification Instance)

  32. HTML • HyperText Markup Language • Created in 1992, version 4.0 in 1997 • CSS (Cascading Style Sheets) were introduced in 1997 to create visual effects • SGML: generic; it’s possible to define your own formats, handle large and complex documents, and manage large information repositories • not needed for Web applications

  33. XML • eXtensible Markup Language • It allows human-readable semantic markup that is also machine-readable • It enables automatic authoring, parsing, and processing of networked data • XSL (Extensible Stylesheet Language) • XML counterpart of CSS • XLL (Extensible Linking Language) • Defines different types of links • Recent uses: MathML, SMIL, RDF, …

  34. Multimedia • Images • Bit-mapped: XBM, BMP, PCX, PNG • Compressed: GIF, JPEG, TIFF • Audio • AU, MIDI, WAVE • Video • MPEG, AVI, …

  35. Summary • Text is the main form of communicating knowledge • Documents have syntax, structure and semantics • Metadata: information about data • Formats of text • Modeling Natural Language • Entropy • Distribution of symbols • Similarity
