Notes about my studies of Information Engineering and Natural Language Processing

By Changhua Yang, 04/09, 2003. Outline: Knowledge Management; information from images; Classification Problem; SVM; a Chinese product, Chinese Character Genes (漢字基因); SVM Tool Demo. Knowledge Epoch: Data? Information? Knowledge?

Presentation Transcript


  1. Notes about my studies of Information Engineering and Natural Language Processing, by Changhua Yang, 04/09, 2003.

  2. Outline • Knowledge Management • Information from images • Classification Problem • SVM • A Chinese product • Chinese Character Genes (漢字基因) • SVM Tool Demo

  3. Knowledge Epoch • Data? • Information? • Knowledge? • Data Processing->Information Engineering->Knowledge Management

  4. Data: a compressed JPEG file; a bitmap of {(x, y, color)} tuples • Metadata: data describing data • Information: • A dog (a spitz, 狐狸狗) on grassland • Knowledge: • Daytime photograph • An easy case for outlining the objects

  5. Problem • From Data to Information • A Search (Match) problem: relevance • A Classification problem • A Decision problem [Figure: plot with axes Feature X and Feature Y]

  6. Problem Conversion • Question: is this a dog? • Form a temporary classifier {dog, !dog} from Knowledge • Question: is there a dog in this image? • Phase 1: Identify all objects • Phase 2: For each object, determine {dog, !dog} • Question: is there a spitz (狐狸狗) in this image? • Option 1: a classifier for {spitz, !spitz} built from the training sets of all objects • Option 2: a classifier built from the training set of all dogs only

  7. Shallow Semantic Parsing using Support Vector Machines Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James H. Martin, & Dan Jurafsky HLT-NAACL 2004

  8. Using PropBank • PropBank (Kingsbury et al., 2002) • a 300k-word corpus • Wall Street Journal (WSJ) part of the Penn Treebank (Marcus et al., 1994) (hand-corrected parses) • Predicate-argument relations are marked for part of the verbs • The arguments of a verb are labeled ARG0 to ARG5 • ARG0 is the PROTO-AGENT (usually the subject) • ARG1 is the PROTO-PATIENT (usually its direct object) • PropBank attempts to treat semantically related verbs consistently • In addition to these CORE ARGUMENTS, additional ADJUNCTIVE ARGUMENTS, referred to as ARGMs, are marked • Some examples are ARGM-LOC, for locatives, and ARGM-TMP, for temporals

  9. Problem Description: Shallow Semantic Parsing • Argument Identification • The process of identifying parsed constituents in the sentence that represent semantic arguments of a given predicate • Argument Classification • Given constituents known to represent arguments of a predicate, assign the appropriate argument labels to them • Argument Identification and Classification • A combination of the above two tasks

  10. Baseline Features • Predicate • Path NP↑S↓VP↓VBD • Phrase Type (NP, PP, S) • Position • Voice • Head Word • Syntactic head • Sub-categorization VP->VBD-PP
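The Path feature on the slide above (NP↑S↓VP↓VBD) can be illustrated with a minimal sketch. It assumes a toy parse tree for a sentence such as "He ate pancakes"; the `Node` class and the `path_feature` function are hypothetical names introduced here for illustration and are not taken from the paper's implementation.

```python
class Node:
    """A parse-tree node that knows its parent (hypothetical helper)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def path_feature(constituent, predicate):
    """Labels going up from the constituent to the lowest common
    ancestor, then down to the predicate."""
    up = ancestors(constituent)
    down = ancestors(predicate)
    down_ids = {id(n) for n in down}
    # First node on the upward chain that also dominates the predicate.
    lca_index = next(i for i, n in enumerate(up) if id(n) in down_ids)
    lca = up[lca_index]
    up_labels = [n.label for n in up[:lca_index + 1]]
    j = next(i for i, n in enumerate(down) if n is lca)
    down_labels = [n.label for n in reversed(down[:j])]
    return "↑".join(up_labels) + "".join("↓" + lab for lab in down_labels)


# Toy tree: (S (NP He) (VP (VBD ate) (NP pancakes)))
vbd = Node("VBD")
np_object = Node("NP")
np_subject = Node("NP")
vp = Node("VP", [vbd, np_object])
s = Node("S", [np_subject, vp])

subject_path = path_feature(np_subject, vbd)
object_path = path_feature(np_object, vbd)
```

In this toy tree the subject NP yields the path shown on the slide, while the object NP yields a shorter path through VP, which is what lets a classifier tell the two positions apart.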

  11. Classifier and Implementation • SVM: binary classifiers • One vs ALL (OVA) formalism • Training n binary classifiers for an n-class problem • Converted multi-class problem • 80% of the nodes have NULL labels • A binary NULL vs NON-NULL classifier • Remaining data used for training OVA classifiers • Tools • TinySVM • YamCha
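The two-stage scheme on the slide above (a NULL vs NON-NULL filter, then One vs ALL on the remaining nodes) can be sketched as follows. This is only an illustration under stated assumptions: a simple perceptron stands in for the SVM binary learner, which is not what TinySVM or YamCha do, and the three-dimensional feature vectors and all function names are invented for the example.

```python
def train_perceptron(X, y, epochs=20):
    """Stand-in binary learner (NOT an SVM). Labels y are +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
    return w, b


def score(model, x):
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_two_stage(X, labels):
    # Stage 1: one binary classifier, NULL (-1) vs NON-NULL (+1), on all nodes.
    null_model = train_perceptron(
        X, [-1 if lab == "NULL" else +1 for lab in labels])
    # Stage 2: One vs ALL, trained only on the NON-NULL nodes.
    kept = [(x, lab) for x, lab in zip(X, labels) if lab != "NULL"]
    classes = sorted({lab for _, lab in kept})
    ova = {
        c: train_perceptron([x for x, _ in kept],
                            [+1 if lab == c else -1 for _, lab in kept])
        for c in classes
    }
    return null_model, ova


def predict(models, x):
    null_model, ova = models
    if score(null_model, x) <= 0:
        return "NULL"
    # Pick the class whose binary classifier is most confident.
    return max(ova, key=lambda c: score(ova[c], x))


# Invented toy features: [before_verb, after_verb, other]
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
labels = ["ARG0", "ARG1", "NULL", "NULL", "ARG0", "ARG1"]
models = train_two_stage(X, labels)
```

Filtering NULL nodes first means the n One-vs-ALL classifiers are trained on a much smaller set, which matters when most nodes carry the NULL label.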

  12. New Features • NE • Headword POS • Verb Clustering • Partial Path • Verb Sense Info • Head of PP • First and Last W/P • Ordinal position • Tree Distance • Relative Features • Temporal cue words • Dynamic class context

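  13. Technology • Culturecom (香港文化傳信), led by Chu Bong-Foo (朱邦復), inventor of the Cangjie Chinese input method, is working with IBM to develop V-Dragon (飛龍), an embedded Chinese-language processor. The hope is that pairing it with the Linux operating system will cut the price of a personal computer to one third of the current level and break the Wintel architecture of Intel and Microsoft. • Culturecom's V-Dragon is a Chinese-language CPU (central processing unit) with 32,000 built-in Chinese characters, and it runs the Midori Linux operating system.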

  14. UCLA Report Confirms Culturecom Processor for Chinese Character Generation • The SCS 1610 can generate about 32,000 characters in three fonts and sizes ranging from 11x11 to 127x127 pixels. • The display quality of the characters is optimized aesthetically for the sizes generated. • The code and data for the generation algorithm and the character representations occupy no more than 256 KB. • The speed of character generation is good. • This technology is the most effective solution for Chinese and other non-phonetic scripts.

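  15. Uses a Chinese-language CPU and provides a fully Chinese environment. All Chinese glyphs are generated by the CPU as vectors; multiple typefaces can be produced and freely scaled up or down, with no need for Mask-ROM to store fonts. English is fully supported as well.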

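  16. Chinese Character Genes (漢字基因) (1/2) • Chinese characters • 90% are phono-semantic compounds (形聲字) • Beyond the phonetic component, phono-semantic compounds also have a "phonetic loan" (假借) function • That is, the character head (字首) indicates the category, and the character body (字身) can serve as the definition • For a character lookup method, understanding of the character's meaning is the first requirement • The radical (字根) concept gives rise to a "vector glyph generator" • The concept of a Chinese character is found to have six functions: character code (字碼), order (字序), shape (字形), discrimination (字辨), sound (字音), and meaning (字義)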

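  17. Chinese Character Genes (漢字基因) (2/2) • Character code: the 25 Cangjie codes • Character order: sorted by the 24 Cangjie character letters • Character shape • 9 vector stroke shapes and 64 radicals, used by the font library to compose characters • Occupies only 160 KB of system space and can compose nearly ten million glyphs; uses stepless scaling and can apply any known typeface variation; as for composition speed, on a p450 with 16*16 glyphs it can generate and display 46,000 characters per second • Character discrimination: 73 (9+64) classes of glyph gene features, converted into character codes • Character sound: a waveform-tracking method based on the "phono-semantic" (形聲) category of the Six Scripts (六書) • Character meaning: 512 meaning genes • 1/3 from the Song-dynasty Confucians' "substance, function, cause, effect" (體用因果) • 1/4 from "common-sense definitions" (常識定義)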

  18. TinySVM • Supports standard C-SVR and C-SVM • Uses a sparse vector representation • Can handle tens of thousands of training examples and hundreds of thousands of feature dimensions • Fast optimization algorithms stemming from SVM_light

  19. +1 1:0.5 2:0.5
      +1 1:1 2:1
      +1 1:2 2:2
      +1 1:3 2:2
      +1 1:4 2:2
      -1 1:2 2:1
      -1 1:2 2:1.5
      -1 1:3 2:1.5
      -1 1:4 2:1.5
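Each example on the slide above is a class label followed by sparse `index:value` pairs. As a minimal sketch of how such lines can be read, the following Python parses them into a label and a dictionary of feature values; the function name `parse_example` is a hypothetical helper, not part of TinySVM.

```python
SLIDE_DATA = """\
+1 1:0.5 2:0.5
+1 1:1 2:1
+1 1:2 2:2
+1 1:3 2:2
+1 1:4 2:2
-1 1:2 2:1
-1 1:2 2:1.5
-1 1:3 2:1.5
-1 1:4 2:1.5
"""


def parse_example(line):
    """Split one 'label index:value ...' line into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return int(label), features


examples = [parse_example(line) for line in SLIDE_DATA.splitlines() if line.strip()]
positives = [feats for label, feats in examples if label == 1]
negatives = [feats for label, feats in examples if label == -1]
```

Storing only the non-zero features in a dictionary mirrors the sparse vector representation the tool uses, which keeps memory proportional to the features actually present.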

  20. Steps • Define the feature space • Get feature values from the [training|testing] set • Create a model from the feature values of the training set • svm_learn -t 1 -d 2 -c 1 news.train news_model • Verify the testing set with its feature values • svm_classify -V news.test news_model
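The first two steps above (define a feature space, then get feature values) can be sketched as a small script that turns an article into one line of the sparse training format. The cue words below are purely hypothetical; the slides do not list the features actually used, and `extract_features` and `to_svm_line` are names invented for this illustration.

```python
# Hypothetical feature space: relative frequency of a few cue words.
CUE_WORDS = ["growth", "reform", "scandal", "deficit"]


def extract_features(text):
    """Return one value per cue word: its share of the article's tokens."""
    tokens = text.lower().split()
    total = max(len(tokens), 1)  # avoid dividing by zero on empty text
    return [tokens.count(word) / total for word in CUE_WORDS]


def to_svm_line(label, values):
    """Format as 'label index:value ...' with 1-based indices, skipping zeros."""
    pairs = [f"{i}:{v:g}" for i, v in enumerate(values, start=1) if v != 0]
    return " ".join([f"{label:+d}"] + pairs)


article = "Reform brings growth growth"
line = to_svm_line(+1, extract_features(article))
```

Writing one such line per annotated article would produce a file in the shape of `news.train`, which the `svm_learn` command on the slide then takes as input.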

  21. My Trial • 13 features are defined • Training set: 4 articles • 2 are annotated positive (to the advantage of the government) • 2 are annotated negative • 2 test articles