
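Research on Natural Language Processing and Understanding
Jun'ichi Tsujii
Department of Information Science, Graduate School of Science, The University of Tokyo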

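Project Objectives
1. Academic objectives
   • Integrating structural language processing with probabilistic and statistical language processing
     • Contributions to engineering from theory-driven approaches
   • Language processing and knowledge processing
2. Social impact
   • Language processing for the network era
     • Knowledge acquisition from text, information retrieval, dialogue systems
3. International dissemination
   • Active international collaboration
     • Focused international workshops with concrete goals
     • Building close research cooperation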

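Grammar Frameworks Justified by Theoretical Linguistics
• Grammar frameworks based on typed feature structures
• Processing efficiency
• Robustness
• Bias in grammar descriptions:
  • Application to real-world text
  • Systematic extension of the grammars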

Processing Efficiency
• Abstract Machine for Unification (T. Makino et al.): Prolog with Typed Feature Structures (LiLFeS). Coling 98, JNE-00
• CFG Approximation (K. Torisawa et al.): Multi-staged Parsing (TNT). Coling 98, JNE-00
• Preventing Combinatorial Explosion (Y. Miyao): Packing of FSs. ACL 99

Abstract Machine (Carpenter and Qu, 1995)
Abstract machine code of a TFS:
  PUSH FIRST
  ADDNEW list
  UNIFYVAR 1
  POP
  PUSH REST
  UNIFYVAR 1
  POP
TFS data on memory (three snapshots; each cell is address, tag, value):
  1 STR nelist   2 VAR bot    3 PTR 4   4 STR nelist   5 VAR foo   6 VAR list
  1 STR nelist   2 VAR list   3 PTR 4   4 STR nelist   5 VAR foo   6 VAR list
  1 STR nelist   2 PTR 4      3 PTR 4   4 STR nelist   5 VAR foo   6 VAR list
[Figure: graph views of the corresponding nelist structures, with FIRST and REST arcs leading to bot, list, foo and nelist nodes]
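The memory snapshots above can be read as a heap of tagged cells. The following is a minimal sketch of how a pointer chain in such a heap might be resolved. It assumes, in the style of WAM-like machines, that STR marks a node of a given type, VAR an unbound variable of a given type, and PTR a reference to another address; the `deref` helper and the dictionary layout are illustrative and not taken from the slide.

```python
# Hypothetical heap mirroring the last snapshot on the slide.
# Each cell is (tag, value). Assumed meaning of the tags:
#   STR = node of the given type, VAR = unbound variable of the given type,
#   PTR = reference to the cell at the given address.
heap = {
    1: ("STR", "nelist"),
    2: ("PTR", 4),
    3: ("PTR", 4),
    4: ("STR", "nelist"),
    5: ("VAR", "foo"),
    6: ("VAR", "list"),
}


def deref(heap, addr):
    """Follow PTR cells until a non-PTR cell is reached; return its address."""
    seen = set()
    while heap[addr][0] == "PTR":
        if addr in seen:
            raise ValueError("cyclic pointer chain")
        seen.add(addr)
        addr = heap[addr][1]
    return addr


target = deref(heap, 2)
print(target, heap[target])
```

Under these assumptions, cells 2 and 3 both resolve to cell 4, which is one way to express that two features share a single value after unification.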

LiLFeS: Performance (2/2)
Comparison with other inference engines for typed feature structures
[Figure: bar chart with a "FASTER" arrow, comparing LiLFeS (native code compiler), LiLFeS (byte code emulator), ProFIT on the SICStus emulator, and ALE 3.1 on the SICStus emulator]
Platform: Intel Pentium II 400 MHz
Grammar: a small grammar distributed with ALE

Filtering with CFG (1/5)
• Two-phase parsing
• Approximate the HPSG with a CFG while keeping important constraints.
• The obtained CFG might over-generate, but it can be used for filtering.
• Rewriting in a CFG is far less expensive than applying rule schemata, principles and so on.
[Figure: the HPSG feature structures are compiled into a CFG; input sentences go through the built-in CFG parser and then LiLFeS unification parsing, which outputs complete parse trees]
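The two-phase idea can be illustrated with a small sketch: a cheap CFG recognizer runs first, and the expensive parser is invoked only for input that the CFG accepts. This is a simplification made for illustration. The toy grammar, the `cfg_accepts` CKY recognizer and the `full_parser` callback are all hypothetical, and the system on the slide presumably reuses the CFG parse results rather than only a yes/no decision.

```python
from itertools import product

# Toy grammar in Chomsky normal form, standing in for the compiled CFG
# (hypothetical rules, not derived from any HPSG).
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
LEXICON = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}


def cfg_accepts(words, start="S"):
    """CKY recognition: True if the toy CFG derives the word sequence."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][k] holds the nonterminals that span words[i:k].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):
                for left, right in product(chart[i][j], chart[j][k]):
                    chart[i][k] |= BINARY.get((left, right), set())
    return start in chart[0][n]


def parse_two_phase(words, full_parser):
    """Call the expensive parser only when the CFG filter accepts the input."""
    if not cfg_accepts(words):
        return None  # rejected cheaply, without any unification
    return full_parser(words)
```

Rejecting at the filter is safe only if the CFG over-generates relative to the full grammar, which is the property the slide relies on.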

Evaluation of HPSG Parsers (DFKI, Stanford, U-Tokyo)
Processing time per sentence (sec); Sun UltraSparc, 336 MHz, 6 GB main memory

Grammar   Corpus (average length in words)   Naïve parser   TNT parser   LKB parser (Stanford, DFKI)
LinGO     csli (5.8)                         0.68           0.12         0.23
LinGO     aged (8.4)                         1.72           0.31         0.61
LinGO     blend (11)                         14.71          1.90         3.10
XHPSG     ATIS (7.42)                        14.27          0.30         -
SLUNG     EDR (20.5)                         0.88           0.38         -

Packed Feature Structure
• Each dependency function for one of the input feature structures
[Figure: a set of feature structures of type verb, with features VMODE (indicative, past_part), PASSIVE (true, false) and TENSE (past, tense), shown next to the equivalent packed feature structure with tags 1, 2, 3]

Experimental Results (1)
• Execution time for unification
• Packing achieved a considerable speed-up in unification

Test data   # of LEs   Unpacked (msec.)   Packed (msec.)   Improvement
credited    37         36.5               5.7              6.4
walked      79         77.2               9.2              8.4
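As a quick check, the Improvement column appears to be the ratio of unpacked to packed time. The sketch below recomputes it from the figures on the slide; the `improvement` helper is an illustrative name, and reading the column as a simple ratio is an assumption.

```python
# (test data, # of LEs, unpacked msec, packed msec), copied from the slide.
rows = [("credited", 37, 36.5, 5.7), ("walked", 79, 77.2, 9.2)]


def improvement(unpacked_ms, packed_ms):
    """Speed-up factor, assumed to be unpacked time divided by packed time."""
    return unpacked_ms / packed_ms


for name, n_les, unpacked, packed in rows:
    print(f"{name}: {improvement(unpacked, packed):.1f}x")
```

Rounded to one decimal place, the ratios reproduce the 6.4 and 8.4 reported on the slide.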

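Building Large-Scale Grammars
English grammars
• With Stanford University and DFKI: the LinGO grammar (HPSG)
• With the University of Pennsylvania: conversion of the XTAG grammar
  • Conversion involving manual work (XHPSG)
  • Automatic conversion between the two grammar frameworks
Japanese grammars
• SLUNG: an underspecified Japanese grammar
• KNP: dependency analysis, a highly robust Japanese grammar (Kyoto University)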

Overview of GENIA Project
[Figure: ① a researcher with a question → ② query → ③ GENIA → ④ information extracted → ⑤ answer to the question]
• Information Retrieval
• Information Extraction
  • Pre-processing
  • Named entity
  • Template element
  • Scenario template
• Learning
• Resources: Corpora, Terminology, Databases, Ontology, Thesaurus, WWW Links

CSNDB (National Institute of Health Sciences)
• A data- and knowledge-base for signaling pathways of human cells.
• It compiles information on biological molecules, sequences, structures, functions, and the biological reactions that transfer cellular signals.
• Signaling pathways are compiled as binary relationships between biomolecules and represented by automatically drawn graphs.
• CSNDB is built on ACEDB and the inference engine CLIPS, and is linked to TRANSFAC.
• The final goal is to build a computerized model of various biological phenomena.

Example 3 (excerpted from [Takai98])
• Signal_Reaction: "Ah receptor + HSP90"
• Component: "Ah receptor", "HSP90"
• Effect: "activation dissociation"
• Interaction: "PAS domain" "of Ah receptor"
• Activity: "inactivation of Ah receptor"
• Reference: [Powell-Coffman_1998]
• A Polymerization Reaction

Syntax/Semantics
An active phorbol ester must therefore, presumably by activation of protein kinase C, cause dissociation of a cytoplasmic complex of NF-kappa B and I kappa B by modifying I kappa B.
E1: An active phorbol ester activates protein kinase C.
E2: The active phorbol ester modifies I kappa B.
E3: It dissociates a cytoplasmic complex of NF-kappa B and I kappa B.
Part-Whole

Language and Knowledge Processing: Toward Understanding

Event Ontology
• substance ACTIVATE substance
• substance ACTIVATE protein
• protein ACTIVATE pathway
• PHOSPHORYLATE
• INHIBIT
• REGULATE
[Figure: event instances REACTION1 to REACTION5, each listing attribute1, attribute2, ...]

Example of NE Annotation
UI - 85146267
TI - Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" mt="SV" subclass="family_or_group" unsure="Class" cmt="">aldosterone binding sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.
AB - <NE ti="4" class="protein" nm="Aldosterone binding sites" mt="SV" subclass="family_or_group" unsure="Class" cmt="">Aldosterone binding sites</NE ti="4"> in <NE ti="1" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="1"> were characterized after separation of cells from blood by a Percoll gradient. After washing and resuspension in <NE ti="5" class="other_organic_compounds" nm="RPMI-1640 medium" mt="SV" unsure="OK" cmt="">RPMI-1640 medium</NE ti="5">, cells were incubated at 37 degrees C for 1 h with different concentrations of <NE ti="6" class="other_organic_compounds" nm="[3H]aldosterone" mt="SV" unsure="OK" cmt="">[3H]aldosterone</NE ti="6"> plus a 100-fold concentration of <NE ti="7" class="other_organic_compounds" nm="RU-26988" mt="SV" unsure="OK" cmt="">RU-26988 </NE ti="7">(<NE ti="17" class="other_organic_compounds" nm="11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one" mt="SV" unsure="OK" cmt="">11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one</NE ti="17">), with or without an excess of unlabeled <NE ti="8" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="8">. <NE ti="9" class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK" cmt="">Aldosterone</NE ti="9"> binds to a single class of <NE ti="10" class="protein" nm="receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">receptors</NE ti="10"> with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity data show a hierarchy of affinity of <NE ti="11" class="other_organic_compounds" nm="desoxycorticosterone" mt="SV" unsure="OK" cmt="">desoxycorticosterone</NE ti="11"> = <NE ti="12" class="other_organic_compounds" nm="corticosterone" mt="SV" unsure="OK" cmt="">corticosterone</NE ti="12"> = <NE ti="13" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="13"> greater than <NE ti="14" class="other_organic_compounds" nm="hydrocortisone" mt="SV" unsure="OK" cmt="">hydrocortisone</NE ti="14"> greater than <NE ti="15" class="other_organic_compounds" nm="dexamethasone" mt="SV" unsure="OK" cmt="">dexamethasone</NE ti="15">. The results indicate that <NE ti="17" class="cell_type" nm="mononuclear leukocyte" mt="SV" unsure="OK" cmt="">mononuclear leukocytes</NE ti="17"> could be useful for studying the physiological significance of these <NE ti="16" class="protein" nm="mineralocorticoid receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">mineralocorticoid receptors</NE ti="16"> and their regulation in humans.
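Because the closing tags in this annotation carry an attribute (`</NE ti="3">`), the text is not well-formed XML and a standard XML parser would reject it. The following is a minimal sketch of how the annotations might instead be read with regular expressions. The `extract_entities` function and its patterns are illustrative assumptions, not part of the GENIA tooling, and they assume straight double quotes and no nested NE tags.

```python
import re
from collections import Counter

# Opening tag with attributes, the annotated text, and a closing tag that
# may itself carry attributes. Nested NE tags are not handled.
NE_PATTERN = re.compile(r'<NE\s+([^>]*)>(.*?)</NE\b[^>]*>', re.DOTALL)
ATTR_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def extract_entities(text):
    """Return one dict per NE tag: its attributes plus the surface text."""
    entities = []
    for attrs, surface in NE_PATTERN.findall(text):
        record = dict(ATTR_PATTERN.findall(attrs))
        record["text"] = surface.strip()
        entities.append(record)
    return entities


# The TI line of the example above, split only to keep the lines short.
sample = (
    'TI - Characterization of <NE ti="3" class="protein" '
    'nm="aldosterone binding site" mt="SV" subclass="family_or_group" '
    'unsure="Class" cmt="">aldosterone binding sites</NE ti="3"> in '
    'circulating <NE ti="2" class="cell_type" '
    'nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">'
    'human mononuclear leukocytes</NE ti="2">.'
)

entities = extract_entities(sample)
class_counts = Counter(entity["class"] for entity in entities)
print(class_counts)
```

Counting the `class` attribute in this way is one route to a frequency distribution over semantic classes for a tagged corpus.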

TIMS – Tag Information Management System –
• Will/TreeEdit: XML tree viewer / XML tree editor
• LiLFeS/XHPSG: HPSG-based syntactic/semantic parser
• JTAG: manual tagging aid interface
• VTAG: automatic tagging workbench
• TIMS: XML document management, XML data mining
• XML database
[Figure: the components exchange XML data with TIMS, which sits on top of the XML database]

Tagging of 400 Abstracts
• Sentences: approx. 4,000
• Words: approx. 100,000
• Tagged items
  • Total: approx. 12,000
  • SOURCE: 3,123
  • DNA: 945
  • RNA: 100
  • PROTEIN: 2,639
  • Other: 5,180

36 semantic subclasses

Frequency distribution of CLASS

Verbs That Occur Frequently in Abstracts
• Occurrence counts in 925 items from CSNDB (NIHS), excluding "have" and "be"

show
• NP show that-clause
  • researcher show conclusion
  • experiment show conclusion
• NP show NP
  • structure show property
• NP be shown to-infinitive
  • substance be shown to reaction
• it is shown that-clause
  • it is shown conclusion

inhibit
• NP inhibit NP
  • substance inhibit reaction
  • substance inhibit pathway
  • substance inhibit substance
  • substance inhibit source
  • reaction inhibit substance
  • reaction inhibit reaction
  • structure inhibit pathway

Syntactic and Semantic Patterns of Frequent Verbs
• How many kinds of dictionary entries are needed?

Example semantic representations of "indicate" (LiLFeS)
semantic_primitive(Tnx0Vnx1, indicate, SYNSEM\LOCAL\CONX\IND\(transitive & ARG1\chem_struct & ARG2\mechanism)).
semantic_primitive(Tnx0Vnx1, indicate, SYNSEM\LOCAL\CONX\IND\(transitive & ARG1\research & ARG2\$OBJ)).
semantic_primitive(Tnx0Vs1, indicate, SYNSEM\LOCAL\CONX\IND\(transitive & ARG1\research & ARG2\$OBJ)).
Example sentences:
• the structure indicates mechanisms
• these findings indicate an unexpected role of …
• the data indicate that …
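To make the role of these entries concrete, here is a minimal sketch of selecting the entries whose semantic-class constraints fit a given subject and object. The three entries are transcribed from the slide. The noun-to-class lexicon, the flat class matching (no type hierarchy) and the `matching_frames` function are hypothetical simplifications, not how LiLFeS resolves typed feature structures.

```python
# Entries transcribed from the slide as (template, verb, ARG1 class, ARG2 class).
# None stands for the unconstrained variable $OBJ.
FRAMES = [
    ("Tnx0Vnx1", "indicate", "chem_struct", "mechanism"),
    ("Tnx0Vnx1", "indicate", "research", None),
    ("Tnx0Vs1", "indicate", "research", None),
]

# Hypothetical noun-to-class lexicon, for illustration only.
NOUN_CLASS = {
    "structure": "chem_struct",
    "mechanisms": "mechanism",
    "findings": "research",
    "data": "research",
    "role": "property",
}


def matching_frames(template, verb, subj, obj=None):
    """Return the entries compatible with the subject and object classes."""
    subj_class = NOUN_CLASS.get(subj)
    obj_class = NOUN_CLASS.get(obj)
    return [
        frame
        for frame in FRAMES
        if frame[0] == template
        and frame[1] == verb
        and frame[2] == subj_class
        and (frame[3] is None or frame[3] == obj_class)
    ]


print(matching_frames("Tnx0Vnx1", "indicate", "structure", "mechanisms"))
print(matching_frames("Tnx0Vnx1", "indicate", "findings", "role"))
print(matching_frames("Tnx0Vs1", "indicate", "data"))
```

Each of the three example sentences on the slide selects exactly one entry under this toy lexicon, while a combination such as "structure" with "role" selects none.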

Experiment (A. Yakushiji et al., PSB 2001)
• 180 sentences from abstracts in MEDLINE
• XHPSG: an HPSG-like grammar translated from the XTAG grammar of U-Penn (Y. Tateishi, TAG+ workshop 98)
• Terms (compound nouns) are chunked beforehand.
• Average parse time per sentence: 2.7 sec with a naïve parser (the multi-stage parser has since improved this by a factor of 10)

Argument Frame Extractor
133 argument structures, marked by a domain specialist, in 97 of the 180 sentences
68%
• Extracted uniquely: 31
• Extracted with ambiguity: 32
• Extractable from pp's: 26
Parsing failures
• Not extractable: 27
• Memory limitation, etc.: 17

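KNP (accuracy: approx. 90%)
Example sentence: 企業や金融機関に不良債権の早急な処理を促し、特に金融機関には「この過程で従来のような横並びの決算や配当が維持されるのではなく、経営格差を顕在化させる覚悟を求めたい」としている。
English gloss: "It urges companies and financial institutions to dispose of bad loans promptly, and says of financial institutions in particular: 'In this process, rather than maintaining the lockstep settlements of accounts and dividends of the past, we want them to be prepared to let differences in management show.'"
[Figure: KNP dependency analysis of the sentence, with PARA marking the parallel structures]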

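System Overview
User: "How do I send mail with Mew?"
• User interface (WWW browser)
• Mail sending module
• Input analysis module
• Knowledge database
• Dialogue management module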

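Evaluation of Dialogue Data
• Success: 38%
• Failure (knowledge): approx. 30%, decreasing trend
• Failure (dialogue management): approx. 5%, increasing trend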

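Research Results (Tokyo Institute of Technology)
• Improving recall
  • Query expansion using multiple thesauri (1998-1999)
  • Cluster-based information retrieval (1997)
  • Large-scale text clustering (1996-1997)
• Improving precision
  • Information retrieval using case frames (1996)
  • Refinement and selective use of index terms (1999-2000)
• Balancing recall and precision
  • Multi-stage retrieval model (1999-2000)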

Scenario (Tokyo Institute of Technology)
[Figure: Query → Query Expansion (using Thesauri) → Expanded Query → Initial Retrieval (over the Document Collection) → Intermediate Result → Index Term Refinement → Revised Query → Second Retrieval → Final Result]
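The flow of this scenario can be sketched in a few lines. Everything concrete below is an assumption made for illustration: the toy thesaurus and documents, term-overlap scoring, and in particular the `refine` step, which here simply keeps the expanded terms found in the top-ranked intermediate result. The slide does not say how index terms are actually refined.

```python
# Hypothetical toy data; the real system works on large collections.
THESAURUS = {"car": {"automobile"}, "fast": {"quick"}}
DOCS = {
    "d1": "a quick automobile review",
    "d2": "fast food recipes",
    "d3": "gardening tips",
}


def expand(query_terms, thesaurus):
    """Query expansion: add thesaurus entries for each query term."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= thesaurus.get(term, set())
    return expanded


def retrieve(terms, docs):
    """Rank documents by number of matching terms, dropping zero scores."""
    scored = []
    for doc_id, text in docs.items():
        score = len(terms & set(text.split()))
        if score:
            scored.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


def refine(terms, ranked, docs, top_k=1):
    """Assumed refinement: keep terms that occur in the top_k results."""
    vocabulary = set()
    for doc_id in ranked[:top_k]:
        vocabulary |= set(docs[doc_id].split())
    return terms & vocabulary


def multi_stage_search(query_terms, thesaurus, docs):
    expanded = expand(set(query_terms), thesaurus)   # expanded query
    intermediate = retrieve(expanded, docs)          # initial retrieval
    revised = refine(expanded, intermediate, docs)   # revised query
    return retrieve(revised, docs) if revised else intermediate


print(multi_stage_search(["fast", "car"], THESAURUS, DOCS))
```

In this toy run, expansion pulls in both d1 and d2 at the first stage, and the revised query then drops d2, which shows the intended interplay of a recall-oriented first stage and a precision-oriented second stage.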

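Workshops
• Year 1: closed workshop to launch the project
• Year 2: focus on applications such as IR (co-sponsored with Hitachi Advanced Research Laboratory)
• Year 3: the relationship between theory and applications (co-sponsored with Hitachi Central Research Laboratory)
• Year 4: Parsing Strategy (Germany) (co-sponsored with DFKI and Stanford University) (journal special issue, book from CSLI)
Tutorials
• NLP for Bio-Informatics: jointly with the Eureka Group (PSB2001)
• Jointly with Eureka and TIDES (ISMB2001)
Joint research
• Stanford, DFKI, Pennsylvania, UMIST, University of Rome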

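Future Research Topics
1. Structural processing and probabilistic processing
   • Processing in a rich probability space that includes the semantic space
2. Theoretical foundations for interconversion and equivalence between grammar descriptions
   • Sharing of language resources, contributions to theoretical linguistics
3. Databases of large-scale feature structures
   • Interrelation with XML databases
4. Controlled, unsupervised learning mechanisms
   • Identification of semantic classes, grammar learning from data
5. Context processing across texts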
