
Kyoungryol Kim

Extracting Schedule Information from Korean Email



  1. Extracting Schedule Information from Korean Email (Kyoungryol Kim)

  2. Table of Contents • Introduction • Methods and Experiments • Discussion • Schedule

  3. Introduction

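  4. Goal
     • To automatically extract schedule information (meeting location and speaker) from email.
     • Example email to extract from:
       안녕하세요, 금주 수요일 오후 2시~4시에, 1층 세미나실에서 세미나를 진행합니다. CI LAB과 TC LAB 이 공동으로 주관하는 세미나이며, 지도교수님께서 참석하실 예정입니다. 석사과정학생들은 꼭 참석바랍니다. 발표자는 김 아나톨리, 박광희학생이니 준비해주십시오. 문의사항은 박상원 학생에게 문의바랍니다. 감사합니다.
     • English translation: "Hello, we will hold a seminar this Wednesday from 2 to 4 p.m. in the seminar room on the 1st floor. The seminar is jointly organized by CI LAB and TC LAB, and the advising professor will attend. Master's students must attend. The presenters are Kim Anatoly (김 아나톨리) and Park Kwang-hee (박광희), so please prepare. Please direct any questions to Park Sang-won (박상원). Thank you."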

  5. Methods and Experiments

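  6. Proposed Architecture
     • Pipeline: INPUT TEXT → Tokenization → NER (Boundary Detection → NE-type Classification) → Relation-type Classification → Template Generation → OUTPUT
     • Input text: 안녕하세요, 다음주 랩미팅 공지입니다. 7월 19일 목요일 오후 3시에 29동 106호 시청각실에서 합니다. 이번주 발표자는 김지영, 김도희, 조지윤 입니다. (English: "Hello, this is the announcement for next week's lab meeting. It will be held on Thursday, July 19 at 3 p.m. in the audiovisual room, Room 106, Building 29. This week's presenters are Kim Ji-young, Kim Do-hee, and Cho Ji-yoon.")
     • Tokenization: 3 시 에 29 동 106 호 시청각실 에서 합니다 . 이번주 발표자 는 김지영 , 김도희 , 조지윤 입니다
     • Boundary Detection: 3 시 에 29/B 동/I 106/I 호/I 시청각실/I 에서 합니다 . 이번주 발표자 는 김지영/B , 김도희/B , 조지윤/B 입니다
     • NE-type Classification: 29 동 106 호 시청각실 → Location; 김지영, 김도희, 조지윤 → Person
     • Relation-type Classification: isHeldAt → Location; hasSpeaker → Person (one relation for each of the three speakers)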
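The step from boundary tags to entity spans can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function name `iob2_to_spans` is hypothetical, and the tokens and tags are copied from the slide's example sentence.

```python
def iob2_to_spans(tokens, tags):
    """Group tokens tagged B/I into entity strings; O tokens are skipped.

    An I tag that does not follow a B or I tag is treated like O,
    because IOB2 requires every entity to start with B.
    """
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                spans.append(current)
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return [" ".join(span) for span in spans]


# Tokens and boundary tags from the slide's example sentence.
tokens = ["3", "시", "에", "29", "동", "106", "호", "시청각실", "에서", "합니다", ".",
          "이번주", "발표자", "는", "김지영", ",", "김도희", ",", "조지윤", "입니다"]
tags = ["O", "O", "O", "B", "I", "I", "I", "I", "O", "O", "O",
        "O", "O", "O", "B", "O", "B", "O", "B", "O"]

spans = iob2_to_spans(tokens, tags)
print(spans)
```

The resulting spans would then go to NE-type classification (Location or Person) and relation-type classification (isHeldAt or hasSpeaker).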

  7. Baseline system
     • [Min et al. 2005] Information Extraction Using Context and Position
     • Corpus: 245 meeting announcement emails
     • Target: Attendee, Meeting Location, Time, Date
     • Performance (F-measure): Attendee 36%, Meeting Location 57%, Time 92.5%, Date 91%
     • Method
       • Sentence to LSP
       • NE Recognition: ME, NN, Pattern-selection
       • Instance Disambiguation: ML (Naive Bayes), Score calculation

  8. Reference for NER tagging
     • [Lee et al. 2006] Fine-grained Named Entity Recognition using Conditional Random Fields for Question Answering
     • Performance: Precision 85.8%, Recall 81.1%, F1 83.4%
     • Boundary tags: IOB2 model (B-I-O)
     • NE classes: 147 types
     • Domain of corpus: Encyclopedia documents (Training: 8,037 docs, Test: 100 docs)
     • Features
       • Lexical feature: -2, -1, 0, 1, 2
       • Suffix: -2, -1, 0, 1, 2
       • POS tag: -2, -1, 0, 1, 2
       • POS tag + length
       • Position of morpheme in Eojeol (Start / Center / End)
       • NE dictionary (true or false) + length
       • NE dictionary feature (index) + length
       • 15 regular expressions: [A-Z]*, [0-9]*, [0-9][0-9], [0-9][0-9][0-9][0-9], [A-Za-z0-9]*, ...
     • Two-stage model: Boundary Detection (CRFs), 3 classes; NE-type Classification (ME), 147 classes
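As a quick consistency check on the reported figures, F1 is the harmonic mean of precision and recall. The helper name `f1_score` below is illustrative only; the numbers are the ones quoted on the slide.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported for [Lee et al. 2006].
f1 = f1_score(85.8, 81.1)
print(round(f1, 1))
```

Rounded to one decimal place, the result agrees with the F1 of 83.4% stated on the slide.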

  9. NER - Boundary Detection
     • Boundary tagset: IOB2
     • Features
       • Linguistic: {-2,-1,0,1,2} POS-level word, {-2,-1,0,1,2} POS tag, POS tag + length of the word
       • Orthographic: 18 types of the word (isKorean, isAlpha, isAlnum, 2DigitNum, ...)
       • Gazetteer
         • Person/Location Pronoun dictionary (ETRI 99)
         • From training corpus: heading words, surrounding words, NE words
         • External resources
           • Person: Chosun/Joins.com Person DB (64,042)
           • Location: Nate Local DB 35,335; Sigaji.com 8,193; Ofood 43,390; BusStop 19,431; Address, B/D 23,365; Subway 1,288; Hotel (Auction accommodation, hotelnjoy) 884; Country/Place name 11,946; School (Elementary~University) 21,957
       • Syntactic
         • Position of the POS-level word in the chunk (relative: S/C/E, absolute)
         • Position of the chunk in the sentence (relative: S/SC/CE/E, absolute)
         • Position of the sentence in the document (relative: S/SC/CE/E, absolute)
       • TF-IDF
     • Example (Boundary Detection): 3 시 에 29/B 동/I 106/I 호/I 시청각실/I 에서 합니다 . 이번주 발표자 는 김지영/B , 김도희/B , 조지윤/B 입니다
     • Example (NE-type Classification): 29 동 106 호 시청각실 → Location; 김지영, 김도희, 조지윤 → Person
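The windowed linguistic features and a few of the orthographic features might be computed per token roughly as follows. This is a sketch under stated assumptions: the function names, the `<PAD>` marker, the feature keys, and the POS labels (`NUM`, `NOUN`) are all hypothetical, and only four of the 18 orthographic types are shown because the slide names only those.

```python
import re

# Hangul syllables block (U+AC00..U+D7A3); assumed definition of "isKorean".
HANGUL = re.compile(r"[\uac00-\ud7a3]+")


def orthographic_features(word):
    """Four of the orthographic word types named on the slide."""
    return {
        "isKorean": bool(HANGUL.fullmatch(word)),
        "isAlpha": word.isascii() and word.isalpha(),
        "isAlnum": word.isascii() and word.isalnum(),
        "2DigitNum": bool(re.fullmatch(r"[0-9][0-9]", word)),
    }


def window_features(words, pos_tags, i, size=2):
    """Words and POS tags at offsets -size..+size around position i."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        inside = 0 <= j < len(words)
        feats[f"w[{offset}]"] = words[j] if inside else "<PAD>"
        feats[f"pos[{offset}]"] = pos_tags[j] if inside else "<PAD>"
    # POS tag combined with the length of the current word.
    feats["pos+len"] = f"{pos_tags[i]}_{len(words[i])}"
    feats.update(orthographic_features(words[i]))
    return feats


# Hypothetical POS labels for part of the slide's example sentence.
words = ["29", "동", "106", "호", "시청각실"]
pos_tags = ["NUM", "NOUN", "NUM", "NOUN", "NOUN"]
feats = window_features(words, pos_tags, 0)
print(feats)
```

Feature dictionaries of this shape are what CRF toolkits commonly accept per token; gazetteer, syntactic-position, and TF-IDF features would be added as further keys.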

  10. External Resources (1)
     • Location
       • Shop Name (80,436)
         • Nate Local DB (3~10 chars.) (http://localinfo.nate.com)
         • Sigaji.com Shop DB (3~10 chars.) (http://sigaji.com/location/)
         • oFood (http://ofood.co.kr)
       • Hotel Name (884)
         • Auction Accommodation (http://accommodations.auction.co.kr)
         • Hotelnjoy (http://www.hotelnjoy.com)
       • Public Transportation (20,719)
         • Subway stations
         • Bus-stop names
       • Address (from Zipcode DB) (23,365)
         • Si/do, Gu/gun, Dong/myun/ri, B/D names

  11. External Resources (2)
     • Person
       • Chosun Person DB, Joins Person DB: 64,042 people
       • Name combination feature built from the collected person DB
         • Assume the length of a name is 3 characters
         • Distinct characters per position: 1st char 177, 2nd char 351, 3rd char 475
         • Possible combinations: 177 × 351 × 475 = 29,510,325
         • e.g.) 갈 + 영 + 남 = 갈영남
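One way to read the name combination feature is as a per-position character lookup: a three-character token is a candidate name if each of its characters has been seen at that position in the person DB. The sketch below assumes that reading. The function names and the four-name sample list are hypothetical stand-ins for the 64,042-entry DB.

```python
def build_position_sets(names):
    """Collect the characters seen at positions 1, 2 and 3 of 3-character names."""
    sets = [set(), set(), set()]
    for name in names:
        if len(name) == 3:
            for position, char in enumerate(name):
                sets[position].add(char)
    return sets


def is_plausible_name(word, position_sets):
    """True if every character of a 3-character word was seen at its position."""
    return len(word) == 3 and all(
        char in position_sets[position] for position, char in enumerate(word)
    )


# Tiny hypothetical sample standing in for the collected person DB.
sample_names = ["갈영남", "김지영", "김도희", "조지윤"]
position_sets = build_position_sets(sample_names)

# 김영윤 is not in the sample, but each character occurs at its position.
print(is_plausible_name("김영윤", position_sets))
# 영 never occurs as a first character in the sample.
print(is_plausible_name("영김윤", position_sets))

# Size of the combination space for the per-position counts on the slide.
possible_combinations = 177 * 351 * 475
print(possible_combinations)
```

The feature generalizes beyond the names actually listed in the DB, at the cost of also accepting character combinations that are not real names.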

  12. Experiment: NER - Boundary Detection
     • Corpus: 948 emails containing 'Person' or 'Location' entities
     • CRFs model, 10-fold cross-validation, exact matching
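The evaluation setup can be illustrated with a short sketch. Under exact matching, a predicted entity counts as correct only if both of its boundaries equal those of a gold entity. The function names, the unshuffled fold assignment, and the example spans are assumptions for illustration; they are not the presenter's scripts or results.

```python
def exact_match_scores(gold_spans, predicted_spans):
    """Precision, recall and F1 where a span is correct only if it matches exactly."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def k_fold_indices(n_items, k=10):
    """Assign item indices to k folds round-robin (no shuffling, for simplicity)."""
    return [list(range(fold, n_items, k)) for fold in range(k)]


# Hypothetical (start, end) token spans for one email; end is exclusive.
gold = [(3, 8), (14, 15), (16, 17), (18, 19)]
predicted = [(3, 8), (14, 15), (16, 17), (4, 8)]  # (4, 8) misses the left boundary
precision, recall, f1 = exact_match_scores(gold, predicted)
print(precision, recall, f1)

# 948 emails split into 10 folds; each fold serves once as the test set.
folds = k_fold_indices(948, k=10)
print([len(fold) for fold in folds])
```

A partially overlapping span such as `(4, 8)` earns no credit here, which makes exact matching a strict measure for multi-token locations like building and room names.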

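  13. Discussion
     • Refining the NE dictionary is important.
     • Discover an appropriate feature set from the collected DB.
     • Find more available databases.
     • Data refinement: split compound words using the words in the DB
       • 한국방송공사 + 대전방송총사 (Korean Broadcasting System + its Daejeon broadcasting station)
       • 한국방송통신대학교 + 광주전남지역대학 (Korea National Open University + its Gwangju-Jeonnam regional campus)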
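Dictionary-based compound splitting could look like the following sketch, which tries the longest dictionary prefix first and backtracks if the remainder cannot be covered. The function names are hypothetical, the dictionary holds only the four parts shown on the slide (spelled as on the slide), and the unsplit compound inputs are assumed to be the plain concatenations of those parts.

```python
def _segment(word, dictionary):
    """Return dictionary words that exactly cover `word`, or None if impossible."""
    if not word:
        return []
    # Prefer the longest matching prefix, backtracking when the rest fails.
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in dictionary:
            rest = _segment(word[end:], dictionary)
            if rest is not None:
                return [prefix] + rest
    return None


def split_compound(word, dictionary):
    """Split a compound into dictionary words; leave it whole if no split exists."""
    parts = _segment(word, dictionary)
    return parts if parts is not None else [word]


# Parts copied from the slide; a real run would use the collected DB.
dictionary = {"한국방송공사", "대전방송총사", "한국방송통신대학교", "광주전남지역대학"}

print(split_compound("한국방송공사대전방송총사", dictionary))
print(split_compound("한국방송통신대학교광주전남지역대학", dictionary))
print(split_compound("시청각실", dictionary))  # not coverable, returned unchanged
```

Preferring the longest prefix keeps a compound that is itself a dictionary entry intact, so whether such entries should be split further is a separate refinement decision.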

  14. Schedule Plan
     • ~March 18:
       • Finish implementing the NER module with NE-type classification.
       • Evaluate performance against Dr. Lee's NER on our corpus.
     • ~March 25:
       • Finish implementing the relation-type extraction module.
     • ~March 31:
       • System refinement.
       • Start writing the paper.
