Chapter 6: Text and Multimedia Languages and Properties • Hsin-Hsi Chen, Department of Computer Science and Information Engineering, National Taiwan University • Part of the following material is selected from Dr. Kuang-hua Chen's talk on XML and RDF (Department of Library and Information Science, National Taiwan University)
What is a document? • Document: a single unit of information • A complete logical unit: research paper, book, manual • Part of a larger text: paragraph, passage, an entry in a dictionary, … • A physical unit: file, email, Web page
Characteristics of a document [diagram] • Document = Text + Structure + Other Media • Syntax: expresses structure, presentation style, or even external actions; implicit, or expressed in a language • Semantics • Presentation style: how a document is displayed or printed • Roles shown in the diagram: Author, Creator
Metadata (Chinese renderings of the term: 元資料, 超資料, 中介資料, 中間資料, 後設資料, 詮釋資料) • Definition • Data about the data, e.g., the schema in a DBMS • Describes other information based on some rules or policies • Types • Descriptive metadata: metadata that is external to the meaning of the document, e.g., Dublin Core • Semantic metadata: metadata that can be found within the document's content, e.g., Library of Congress subject codes
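Dublin Core • Metadata Element Set (15 elements) • Subject: the topic of the resource, i.e., keywords or phrases that describe its subject or content, including controlled vocabularies or classification schemes • Title: the name given to the resource by its creator or publisher • Creator: the person, organization, or institution that created the content of the resource • Description: a textual description of the resource's content, such as the abstract of a document or a summary of an image resource • Publisher: the organization that makes the resource available, e.g., a publishing house, university department, group, or organization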
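Dublin Core (Continued) • Contributor: other persons or organizations that contributed to creating the resource, e.g., editors, translators, or illustrators • Date: the date on which the resource was published • Type: the kind of resource, e.g., home page, novel, poem, technical report, dictionary • Format: the file format of the resource, e.g., text/html, ASCII, or a JPEG image file • Identifier: a string or number that uniquely identifies the resource, e.g., a URL or URN for a network resource, an ISBN, or another formal name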
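Dublin Core (Continued) • Relation: the relationship to other resources, e.g., the series the resource belongs to or other relationships • Source: where the work is derived from • Language: the language of the resource's content • Coverage: the temporal and spatial characteristics of the resource • Rights: the copyright statement and the rules governing rights management and use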
Example: an artifact record <?xml version="1.0"?><dc-record> <type>器物</type> <format>銅、琺瑯</format> <format>掐絲</format> <title>景泰掐絲琺瑯番蓮紋盒</title> <title>cloisonnie box with lotus-spray decoration</title> <description>1400/1500</description> <description>銅胎,蓋與器身鑄成浮雕式八瓣蓮花形</description> <description>高63.cm 口徑12.4cm 重634.6克</description> <description>陳夏生,明清琺瑯器展覽圖錄。台北:國立故宮博物院,民88年2月。</description>
Example: an artifact record (continued) <subject>景泰掐絲琺瑯番蓮紋盒</subject> <subject>日常生活</subject> <subject>容器</subject> <subject>銅、琺瑯</subject> <subject>掐絲</subject> <subject>地區(社的座落位置)(r) place</subject> <date>1400/1500</date> <coverage>地區(社的座落位置)(r) place</coverage> <rights>臺灣,故宮</rights> </dc-record>
Example: an ink-on-paper painting record <?xml version="1.0"?><dc-record> <type>紙本水墨</type> <type>原件</type> <title>古木流泉</title> <description>全文</description> <description>紙本水墨</description> <description>30*48.7</description> <description>蓼塘。楊世家藏。神。品。項元汴印。項子京家珍藏。項墨林鑑賞章。墨林秘玩。?李項氏士家寶玩。張澤之。柯亭文房之印。乾隆御覽之寶。石渠寶笈。重華宮鑑藏寶。樂善堂圖書記。</description>
Example: an ink-on-paper painting record (continued) <description>1127/1189</description> <description>國立故宮博物院編輯委員會,宋代書畫冊頁名品特展。台北:國立故宮博物院,民84年9月。</description> <subject>風景</subject> <creator>馬和之</creator> <date>1127/1189</date> <language>zh</language> <rights>臺灣,故宮</rights> </dc-record>
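The flat `<dc-record>` layout above repeats elements such as `<type>` and `<description>`, so a reader of such a record has to collect values per element name. A minimal Python sketch, assuming a shortened, hypothetical record in the same layout (the `RECORD` string and the `dc_fields` helper are illustrative, not part of any Dublin Core tool):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, shortened record in the same flat <dc-record> layout as the slides.
RECORD = """<?xml version="1.0"?>
<dc-record>
  <type>紙本水墨</type>
  <type>原件</type>
  <title>古木流泉</title>
  <creator>馬和之</creator>
  <date>1127/1189</date>
  <language>zh</language>
</dc-record>"""


def dc_fields(xml_text):
    """Group element text by element name; every element may repeat."""
    fields = defaultdict(list)
    for child in ET.fromstring(xml_text):
        fields[child.tag].append((child.text or "").strip())
    return dict(fields)


fields = dc_fields(RECORD)
print(fields["type"])   # every <type> value, in document order
print(fields["title"])
```

Keeping each field as a list, even when only one value is present, avoids special-casing repeated elements.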
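MARC • Machine-Readable Cataloging record • The most widely used format for library records • An example (NTU Library) • Title: 公共藝術年鑑 Public art in Taiwan eng 何政廣 總編輯 (editor-in-chief) • Imprint: 臺北市 行政院文化建設委員會 民88- • Imprint: 1999. • Physical description: 冊 彩圖 29公分 (volumes, color illustrations, 29 cm) • Note: 據民87年書目資料著錄 (described from the ROC year 87 bibliographic data) • Chinese subject heading: csh 公共藝術 -- 年鑑 (public art -- yearbooks) • Added author: 何 政廣 • Control number: 100982322. • ISBN: 957-02-4468-2 平裝 (paperback) NT$500. • LC card number: cw 88008821.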
Web Metadata • Purposes • Cataloging (e.g., BibTeX) • Content rating: protect children from reading some types of documents • Intellectual property rights • Digital signatures (for authentication) • Privacy levels • Applications to electronic commerce • … • RDF (Resource Description Framework)
RDF • description of nodes and attached attribute/value pairs • nodes: any Web resource • attributes: properties of nodes • values: text strings or other nodes (Web resources or metadata instances)
RDF basic model • Statement: Resource --Property--> Value • Equivalently: Subject --Predicate--> Object
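Example 1 • Resource: 機器貓小叮噹 (the robot cat Doraemon) • Property 作者 (author): 籐子不二雄 (Fujiko Fujio) • Property 型態 (type): 漫畫 (comic)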
RDF structured model • Resource --Property--> Resource • The value of a property can itself be a resource with its own Property --> value pairs
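Example 2 • Resource: 機器貓小叮噹 (the robot cat Doraemon) • Property 作者 (author): an intermediate node (Dummy) • Dummy --姓名 (name)--> 籐子不二雄 (Fujiko Fujio) • Dummy --電子郵件 (e-mail)--> Tenzi@ac.jp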
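Example 2 can be written down as three subject/predicate/object statements. The following Python sketch is only an illustration of the model: the `Statement` type, the `_:author` label for the intermediate ("Dummy") node, and the helper names are assumptions, and the property names 作者 (author), 姓名 (name), and 電子郵件 (e-mail) are taken from the diagram.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str


BOOK = "機器貓小叮噹"
AUTHOR_NODE = "_:author"  # hypothetical label for the intermediate ("Dummy") node

statements = [
    Statement(BOOK, "作者", AUTHOR_NODE),
    Statement(AUTHOR_NODE, "姓名", "籐子不二雄"),
    Statement(AUTHOR_NODE, "電子郵件", "Tenzi@ac.jp"),
]


def values(subject, predicate):
    """All objects attached to `subject` through `predicate`."""
    return [s.obj for s in statements
            if s.subject == subject and s.predicate == predicate]


def author_names(resource):
    """Follow 作者 to the intermediate node, then read its 姓名 values."""
    return [name
            for node in values(resource, "作者")
            for name in values(node, "姓名")]


print(author_names(BOOK))
```

Because the author is a node rather than a plain string, further properties can be attached to it without touching the statement about the book.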
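Name Space • Provides a mechanism for using controlled vocabularies defined by other organizations • Provides a mechanism for authorities to define their own controlled vocabularies • Example: <RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax/" xmlns:dc="http://purl.org/dc/elements/1.0/"> • xmlns declares a name space; the dc prefix refers to Dublin Core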
DC in RDF • A Resource can carry any of the 15 Dublin Core properties: dc:title, dc:creator, dc:subject, dc:description, dc:publisher, dc:contributor, dc:date, dc:type, dc:format, dc:identifier, dc:source, dc:language, dc:relation, dc:coverage, dc:rights
A DC Example in RDF • Graph: http://x.html --dc:creator--> Kevin Chen • <RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#" xmlns:dc="http://purl.org/dc/elements/1.0/"> <Description about="http://x.html"> <dc:creator> Kevin Chen </dc:creator> </Description> </RDF>
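As a rough illustration of how the serialized form maps back to statements, the sketch below reads the example with Python's standard `xml.etree.ElementTree` and lists one (resource, property, value) triple per child of `Description`. It assumes straight quotes in the XML and the two name space URIs shown on the slide; the `triples` helper is hypothetical and is not an RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/TR/WD-rdf-syntax#"
DC_NS = "http://purl.org/dc/elements/1.0/"

DOC = f"""<RDF xmlns="{RDF_NS}" xmlns:dc="{DC_NS}">
  <Description about="http://x.html">
    <dc:creator> Kevin Chen </dc:creator>
  </Description>
</RDF>"""


def triples(xml_text):
    """Yield (resource, property, value) for each child of a Description."""
    root = ET.fromstring(xml_text)
    result = []
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        resource = desc.get("about")
        for prop in desc:
            # ElementTree expands prefixes, so prop.tag is "{namespace}localname".
            result.append((resource, prop.tag, (prop.text or "").strip()))
    return result


for triple in triples(DOC):
    print(triple)
```

The expanded tag name shows what the name space mechanism buys: `dc:creator` is identified by the Dublin Core URI, not by the prefix the author happened to choose.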
RDF syntax • Graph: http://www.lis.ntu.edu.tw/~khchen/ --dc:title--> "The Magic Shelter"; --dc:creator--> "Kuang-hua Chen" • <RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#" xmlns:dc="http://purl.org/dc/elements/1.0/"> <Description about="http://www.lis.ntu.edu.tw/~khchen/"> <dc:Title> The Magic Shelter </dc:Title> <dc:Creator> Kuang-hua Chen </dc:Creator> </Description> </RDF>
Text • Formats • Basic form • ASCII, … • Document interchange • Rich Text Format (RTF): used by word processors • Portable Document Format (PDF) and PostScript: used for displaying or printing documents • MIME (Multipurpose Internet Mail Extensions): supports multiple character sets, multiple languages, and multiple media
Text (Continued) • Compression • compress (Unix) • ARJ (PCs) • ZIP (gzip in Unix and WinZip in Windows)
Information Theory • Entropy • Measures information content or information uncertainty: $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$, where $\sigma$ is the number of symbols in the alphabet and $p_i$ is the probability of symbol $i$
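A small Python sketch of the entropy measure, assuming the standard Shannon definition $E = -\sum_i p_i \log_2 p_i$ and estimating each $p_i$ as the relative frequency of symbol $i$ in the text itself (the function name `entropy` is illustrative):

```python
import math
from collections import Counter


def entropy(text):
    """Entropy in bits per symbol, with p_i estimated from symbol counts."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(entropy("aaaa"))  # one symbol only: no uncertainty
print(entropy("abab"))  # two equally likely symbols
print(entropy("abcd"))  # four equally likely symbols
```

The more evenly the symbols are distributed, the higher the entropy, and the less the text can be compressed by a symbol-wise code.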
Modeling Natural Language • Issue 1: how a word is formed • Symbols: those that separate words and those that belong to words • Vowels are more frequent than most consonants • Binomial model (0-order Markov model): each symbol is generated with a certain probability • k-order Markov model: the probability of a symbol depends on the previous k symbols • Extension: how a sentence is formed • e.g., a 5-order Markov model built from the Bible • Finite-state model (regular languages) • Grammar model (context-free and other languages)
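A k-order Markov model can be estimated by counting which symbol follows each k-symbol context. The Python sketch below works on characters and is only an illustration; `train` and `probability` are hypothetical names, and with k = 0 the context is empty, which reduces to the model in which each symbol has a fixed probability.

```python
from collections import Counter, defaultdict


def train(text, k):
    """Count, for every k-symbol context, which symbol follows it."""
    model = defaultdict(Counter)
    for i in range(len(text) - k):
        model[text[i:i + k]][text[i + k]] += 1
    return model


def probability(model, context, symbol):
    """Estimated P(symbol | context); 0.0 for a context never seen."""
    seen = model.get(context)
    if not seen:
        return 0.0
    return seen[symbol] / sum(seen.values())


order1 = train("abab", 1)
print(probability(order1, "a", "b"))  # in "abab", "a" is always followed by "b"

order0 = train("aab", 0)
print(probability(order0, "", "a"))   # plain relative frequency of "a"
```

Larger k captures more context but needs far more text, because the number of possible contexts grows exponentially with k.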
Modeling Natural Language (Continued) • Issue 2: how different words are distributed inside each document • Zipf's law • The frequency of the i-th most frequent word is $1/i^{\theta}$ times that of the most frequent word • In a text of $n$ words with a vocabulary of $V$ words, the i-th most frequent word appears $n / (i^{\theta} H_V(\theta))$ times, where $H_V(\theta) = \sum_{j=1}^{V} 1/j^{\theta}$ is the harmonic number of order $\theta$ of $V$ • $\theta$ = 1.5~2.0
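The Zipf estimate can be checked numerically. This Python sketch assumes the generalized form with exponent θ, in which the i-th most frequent word appears n / (i^θ · H_V(θ)) times and H_V(θ) is the sum of 1/j^θ for j = 1..V; the function names and the sample values n = 1000, V = 50, θ = 1.5 are illustrative only.

```python
def harmonic(V, theta):
    """H_V(theta): sum of 1 / j**theta for j = 1..V."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))


def zipf_count(i, n, V, theta):
    """Expected occurrences of the i-th most frequent word."""
    return n / (i ** theta * harmonic(V, theta))


n, V, theta = 1000, 50, 1.5
counts = [zipf_count(i, n, V, theta) for i in range(1, V + 1)]
print(counts[:3])    # the head of the distribution dominates
print(sum(counts))   # the expected counts add up to n, up to rounding
```

Dividing by H_V(θ) is what makes the expected counts sum to the text length n.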
[Figure: word frequency F plotted over words sorted by frequency, and vocabulary size V plotted over text size] • A few hundred words account for about 50% of the text • Words that are too frequent (stopwords) can be disregarded
Modeling Natural Language (Continued) • Issue 3: the distribution of words in the documents of a collection • Negative binomial distribution • The fraction of documents containing a word $k$ times is $F(k) = \binom{\alpha + k - 1}{k} p^{k} (1 + p)^{-\alpha - k}$, where $p$ and $\alpha$ depend on the word and the document collection • e.g., $p = 9.24$ and $\alpha = 0.42$ for the word "said" in the Brown corpus
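A numerical sketch of this distribution in Python, assuming the negative binomial form F(k) = C(α + k − 1, k) · p^k · (1 + p)^(−α − k) with the two parameters quoted for "said" (p = 9.24 and the second parameter, written α here, 0.42). Since α is not an integer, the binomial coefficient is evaluated through the log-gamma function; `fraction_with_k` is a hypothetical helper name.

```python
import math


def fraction_with_k(k, p, alpha):
    """F(k): fraction of documents that contain the word exactly k times."""
    log_f = (math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
             + k * math.log(p) - (alpha + k) * math.log(1 + p))
    return math.exp(log_f)


P, ALPHA = 9.24, 0.42  # values quoted for "said" in the Brown corpus

for k in range(4):
    print(k, fraction_with_k(k, P, ALPHA))

# Sanity check: the fractions over all k should add up to 1.
total = sum(fraction_with_k(k, P, ALPHA) for k in range(2000))
print(total)
```

Working in log space keeps the computation stable for large k, where the gamma function itself would overflow.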
Modeling Natural Language (Continued) • Issue 4: number of distinct words in a document (document vocabulary) • Heaps' law • The vocabulary of a text of $n$ words has size $V = K n^{\beta}$, where $K$ and $\beta$ depend on the particular text • $K$: between 10 and 100 • $\beta$: a positive value less than 1 (e.g., $0.4 < \beta < 0.6$)
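Heaps' law can be compared against a measured vocabulary curve. The Python sketch below assumes the form V = K · n^β; `heaps_vocabulary` evaluates the prediction and `vocabulary_growth` measures the actual number of distinct words after each token. Both names, the sample sentence, and the sample parameters K = 10, β = 0.5 are illustrative.

```python
def heaps_vocabulary(n, K, beta):
    """Vocabulary size predicted by Heaps' law for a text of n words."""
    return K * n ** beta


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token of the text."""
    seen, sizes = set(), []
    for token in tokens:
        seen.add(token)
        sizes.append(len(seen))
    return sizes


print(heaps_vocabulary(10_000, K=10, beta=0.5))
print(vocabulary_growth("a rose is a rose is a rose".split()))
```

On a real collection, K and β would be fitted to the measured curve, for instance by a straight-line fit of log V against log n.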
Modeling Natural Language(Continued) • Issue 5: average length of words • Heaps’ law • The length of the words in the vocabulary increases logarithmically with the text size
Similarity Model • Distance function • Symmetric: distance(a,b) = distance(b,a) • Triangle inequality: distance(a,c) ≤ distance(a,b) + distance(b,c) • Measures • Edit distance: the minimum number of character insertions, deletions, and substitutions needed to make two strings equal, e.g., Edit-distance(color, colour) = 1, Edit-distance(survey, surgery) = 2 • Longest common subsequence (LCS): only deletion is allowed, e.g., LCS(survey, surgery) = surey (the non-common characters are deleted) • Longest common sequence of lines between two files, e.g., the diff command in Unix
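Both measures are usually computed with dynamic programming. The Python sketch below is one straightforward way to do it, not necessarily the formulation intended by the slides; it reproduces the three examples given above.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def lcs(a, b):
    """One longest common subsequence of a and b."""
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(edit_distance("color", "colour"))
print(edit_distance("survey", "surgery"))
print(lcs("survey", "surgery"))
```

Both functions take time proportional to the product of the two string lengths; `edit_distance` keeps only two rows of the table at a time.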
Markup Languages • Definition • Textual syntax that describes formatting actions, structure information, text semantics, attributes, etc. • Types • Procedural Markup • Descriptive Markup
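Features of descriptive markup • Separates the content of a document from its presentation format • Marks up the semantic structure of the document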
SGML (Standard Generalized Markup Language) • A standard established by ISO in 1986: ISO 8879 • A descriptive markup language • A meta-language • HTML is an application of SGML
Features of SGML • Flexibility: can describe any information structure and any complex document • Non-proprietary, platform-independent, and system-independent: facilitates document interchange and long-term preservation • Information re-usability
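Components of an SGML document • SGML declaration • Specifies the character set used by the document and specific optional features • DTD (Document Type Definition) • Defines the elements a document contains • Defines the content and attributes of the elements • ... • DI (Document Instance) • The marked-up document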
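SGML Declaration • Specifies the character set used by the SGML document and specific optional features • The SGML declaration may be omitted, in which case the document uses SGML's default character set and feature settings • <!SGML "ISO 8879-1986" ...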
Example: document structure of an email • Email is the root element • Its child elements are From, Date, To, Subject, and Body
An SGML DTD for Email • <!-- Elements Min Content --> <!-- ----------- ----- ---------------------------------- --> <!ELEMENT Email -- (From,Date,To+,Subject, Body?)> <!ELEMENT From -O (#PCDATA)> <!ELEMENT Date -O (#PCDATA)> <!ELEMENT To -- (#PCDATA)> <!ELEMENT Subject -O (#PCDATA)> <!ELEMENT Body -- (#PCDATA)> <!-- End of Email DTD --> • <!-- ... -->: comment • Min column: the starting and ending tags are compulsory (-) or optional (O) • ,: concatenation • |: logical or • ?: 0 or 1 occurrence • *: 0 or more occurrences • +: 1 or more occurrences • #PCDATA: parsed character data (plain text) • NDATA: binary data • EMPTY: no content
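The content model of the Email element behaves like a regular expression over the names of its child elements. As a rough Python illustration (not an SGML validator; the pattern and the `matches_email_model` helper are assumptions made for this sketch), the model (From,Date,To+,Subject,Body?) can be checked like this:

```python
import re

# (From,Date,To+,Subject,Body?) rewritten over space-terminated child names:
# "," becomes plain sequence, "+" one or more, "?" zero or one.
EMAIL_MODEL = re.compile(r"From Date (?:To )+Subject (?:Body )?")


def matches_email_model(children):
    """True if the sequence of child element names fits the content model."""
    sequence = "".join(name + " " for name in children)
    return EMAIL_MODEL.fullmatch(sequence) is not None


print(matches_email_model(["From", "Date", "To", "To", "Subject", "Body"]))
print(matches_email_model(["From", "Date", "Subject"]))  # no To element
```

Terminating each name with a space keeps element names from running into each other, so a name such as "To" cannot match inside a longer name.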
An SGML DI for Email DTD • <!DOCTYPE Email SYSTEM "c:\temp\email.dtd"> <Email> <From>Joe <Date>1999-7-14 AM 09:20 <To>Jay</To> <To>Jennifer</To> <Subject>Learning XML <Body>XML is going to shine on the Web, so start learning it now! …</Body> </Email> • SYSTEM: user defined (vs. PUBLIC) • The ending tag is optional for From, Date, and Subject
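SGML, DTDs, Document Instances, and Presentation Instances • SGML is used to define many DTDs • Each DTD has many document instances (DIs) • Each DI can be rendered as several presentation instances: print version, Braille version, hypertext version • Style specifications: DSSSL (Document Style Semantics and Specification Language) and FOSI (Formatting Output Specification Instance)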
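Limitations on the development of SGML • SGML applications are hard to develop • SGML documents are hard to distribute on the Web • Lack of vendor support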
HTML (Hypertext Markup Language) • An application of SGML: • HTML 2.0 DTD • HTML 3.2 DTD • HTML 4.0 DTD • The current standard data format for authoring Web pages • Simple and easy to learn • Portable • Can combine hyperlinks and multimedia • Most HTML instances do not explicitly make reference to the DTD
Features of HTML • The HTML DTD is designed mainly to meet the needs of online display • HTML has built-in styles • HTML uses SGML's markup minimization feature • HTML does not adopt SGML's hyperlinking mechanism
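Limitations of HTML • Structural limitations • Limited information reuse • Limited data interchange • Limited automatic document processing • Cannot support more precise queries • The HTML extensions released by different vendors are incompatible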
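XML (eXtensible Markup Language) • W3C Recommendation, 10 February 1998 • XML 1.0 • Backed by major vendors: Microsoft, Netscape, Sun, ... • XML is SGML-- rather than HTML++ • Takes the strengths of SGML to make up for the weaknesses of HTML • Lets users define their own tags as needed • Can be delivered over the Web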