
NLP Research at Internet Age: An Overview of NLP at Microsoft Research Asia

Ming Zhou

Manager of Natural Language Group

Microsoft Research Asia

Trends of Internet Services
  • Ecosystems that work with third-party apps
    • Apple Apps, Facebook, Twitter, Baidu, Sina, QQ
  • Real time content collection and search
    • Twitter, Facebook, Del.icio.us, NYT, YouTube
  • Mobile search
    • Contextual intent understanding
    • Towards decision making and action taking
  • Social power
    • Social tags (like) for general search engines
    • Search engines in SNS
    • Social QA
Impact and Challenge to NLP Research
  • Impact
    • Biggest database ever – connects data
    • Biggest social network – connects people
    • Harnessing collective intelligence
    • Contextual information processing: User, user’s social network, location, time
    • Real-time information processing: Collection, index, operation without delay
  • Challenge
    • How can data, people, and contextual information be leveraged to achieve real-time information processing?
Problems of Traditional NLP Approaches (NLP 1.0)
  • Deep work on individual component technologies, which have reached their upper bounds
  • Little consideration of scenarios, user needs, and market needs
  • Serious data sparseness when relying on human annotation
  • Evaluation bottleneck
  • Slow deployment
  • No effective framework for incorporating user feedback
New Strategy for NLP (NLP 2.0)
  • Data collection from the web
  • Domain specific and open-IE
  • Contextual NLP
  • Optimize at the system level, not at the level of individual components
  • Earlier deployment on the Internet
  • Make the best use of social factors
Our Vision and Task

Understand users and documents in any language, for any device and any application

  • Advanced NLP technologies
    • Word breaker, POS tagging, chunking, syntactic parser, semantic role labeling, speller, query suggestion, summarization
    • Chinese, Japanese, English
  • Multi-language information access
    • Statistical machine translation
    • Multi-language search
  • Semantic computing
    • Sentiment analysis, event extraction, ontology learning
    • Understanding query intent and document
    • Contextual NLP

MSRA NLP Research Overview

[Figure: layered overview diagram; labels recovered from the slide]
  • Applications: Chinese IME, English writing wizard, News Search, Comparison Shopping, Japanese IME, Pocket translator, Twitter Search, Chatbot, Query speller, Couplet generation, Resume Routing, General web search
  • Component techs: Text analysis, Machine Translation, Information Extraction, Information Retrieval, Skeleton parser, Meta data extraction, Term extraction, Named entity identification, Annotation tool, POS tagging, Machine learning, SLM
  • Data: NLP (C, J, E), MT (C, J, E), IR and IE (C, J, E)
  • Other labels: Translation evaluation, Paraphrasing, Tran. know. acquisition, Vertical search, WEB mining for MT, Cross language IR, NLP enriched indexing and search, SMT, MRD, Balanced corpus, Query-doc relevance, Bilingual corpus, Parsing lexicon, Tagged corpus, Translation lexicon, Bilingual tagged corpus, Text mining

Research Accomplishments
  • Awards
    • MSRA Best Research Team (2010)
    • Finalist of WSJ Asian Innovation Awards (2010)
    • MS ARD Best Project (Engkoo)
    • MSRA Best Innovation (1998-2008): IME and Chinese couplets
  • Academic impact
    • Best results in NIST 2008 SMT, CWMT 2008, and CWMT 2009
    • Best result in the SIGHAN 2006 bakeoff on Chinese word segmentation
    • Best results in cross-language information retrieval in TREC-9 and NTCIR-III
    • 40 ACL papers, 9 SIGIR papers, and 17 Coling papers (2000-2010)
    • PC Chair, area chair of ACL
  • Collaboration with universities
    • HIT Joint lab on NLP, Speech and Search, Tsinghua Joint lab on Media and Network
    • 400 interns in 12 years
    • Summer schools since 2001
    • PhD supervisors at universities
Summer School on Information Extraction (Harbin, June 2005)

Cheng Niu: Information extraction

Frank Seide: Speech information extraction and search

Hwee Tou Ng: Advanced topics of information extraction

Chin-Yew Lin: Information extraction for automatic summarization

Projects based on NLP 2.0
  • Engkoo: Web-based English learning service
    • Data mining from the web
  • Chinese couplets
    • Incorporate users' contributions into the system's evolution
  • Semantic analysis and search of micro-blogging
    • Move to SNS and mobile

Engkoo

Parallel data mining from the web

Video:

http://video.sina.com.cn/v/b/37417609-1286528122.html

Rapidly Changing Language
  • Approximately 1.5 billion people speak English as a primary, secondary or business language
  • China: the largest “English-speaking” country, with 250 million English learners and USD 60 billion in annual spending
  • Problem: English is a living language that keeps acquiring new words and new meanings

Key Insight:

With billions of translated web pages and sharable repositories of language data growing every day, the Internet holds the sum of human language knowledge

www.engkoo.com

Major Features:
  • Endless Lexicon with Native Definitions
  • Human-Like TTS & Phonetic Search
  • State-of-the-Art Machine Translation (NIST OpenMT Winner)
  • Real-time Interactive Alignment

Microsoft Products:
  • Bing
  • Office
  • MSN


Knowledge Mining Pipeline

  • 1. Idiomatic word usage
    • Verb~Noun (decline~offer)
    • Verb~Adv (greatly~improve)
    • Adj~Noun (arduous~task)
    • Adv~Adj (extremely~bad)
  • 2. Paraphrasing
    • turn_on~light, switch_on~light
    • laborious~task, hard~task
    • deeply~moved, deeply~touched
  • 3. Collocation translations
    • 订~计划, make~plan
    • 订~旅馆, book~room
    • 订~杂志, subscribe to~magazine

  • Tokenizing:
    • he could hardly afford to waste that golden time.
    • 他 无法 浪费 那样 的 好 时光。
  • Skeleton parsing:
    • (Tsub~he~afford) (ModAdv~hardly~afford) (Tobj~waste~afford) (Tobj~time~waste) (AdjAttrib~golden~time)
    • (Tsub~他~浪费) (ModAdv~无法~浪费) (Tobj~浪费~时光) (AdjAttrib~好~时光)
  • Alignment:
    • he (他) could hardly afford to (无法) waste (浪费) that (那样的) golden (好) time (时光)
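One way the parse triples and word alignment above could yield collocation translations is sketched below. This is a minimal illustration, not the Engkoo implementation: the matching rule (same relation label, and both English words aligning onto the two words of a Chinese triple), the per-word alignment dictionary, and the function names `parse_triples` and `collocation_translations` are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (relation, word_a, word_b)


def parse_triples(text: str) -> List[Triple]:
    """Parse '(Rel~a~b) (Rel~c~d)' into a list of (Rel, a, b) tuples."""
    triples = []
    # Insert a space after each ')' so adjacent triples split cleanly.
    for chunk in text.replace(")", ") ").split():
        rel, a, b = chunk.strip("()").split("~")
        triples.append((rel, a, b))
    return triples


def collocation_translations(
    en: List[Triple], zh: List[Triple], align: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Pair English and Chinese triples that share a relation and align word for word.

    Assumed rule: project both English words through the alignment and
    require the result to equal the word set of a Chinese triple.
    """
    pairs = []
    for rel_en, a, b in en:
        projected = {align.get(a), align.get(b)}
        if None in projected or len(projected) < 2:
            continue  # unaligned word, or both words map to the same target
        for rel_zh, c, d in zh:
            if rel_en == rel_zh and projected == {c, d}:
                pairs.append((f"{a}~{b}", f"{c}~{d}"))
    return pairs


en_triples = parse_triples(
    "(Tsub~he~afford) (ModAdv~hardly~afford) (Tobj~waste~afford) "
    "(Tobj~time~waste) (AdjAttrib~golden~time)"
)
zh_triples = parse_triples(
    "(Tsub~他~浪费) (ModAdv~无法~浪费)(Tobj~浪费~时光) (AdjAttrib~好~时光)"
)
# Per-word reading of the slide's alignment (an assumption about its granularity).
alignment = {
    "he": "他", "could": "无法", "hardly": "无法", "afford": "无法", "to": "无法",
    "waste": "浪费", "that": "那样的", "golden": "好", "time": "时光",
}
pairs = collocation_translations(en_triples, zh_triples, alignment)
```

Under these assumptions the sentence pair contributes two candidate collocation translations, one for the verb-object relation and one for the adjective-noun relation. A real system would presumably aggregate such candidates over many sentence pairs before trusting any of them.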

  • 1. Single word
    • “he”, “could”, “hardly”, “afford” etc.
    • “他”, “无法”, “浪费” etc.
  • 2. Single word with its POS
    • “he_Pron”, “could_Verb”, “hardly_Adv” etc.
    • “他_Pron”, “无法_Adv”, “浪费_Verb” etc.
  • 3. Collocation
    • “Tsub~he~afford”, “Tobj~time~waste” etc.
    • “Tsub~他~浪费”, “ModAdv~无法~浪费” etc.
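The three kinds of index unit listed above can be illustrated with a small inverted index. This is a sketch under assumptions: the key formats follow the slide (`word`, `word_POS`, `Rel~a~b`), but the `index_keys` and `build_index` helpers and the in-memory dictionary are hypothetical and say nothing about how the actual index is stored.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]        # (word, POS tag)
Triple = Tuple[str, str, str]  # (relation, word_a, word_b)


def index_keys(tokens: List[Token], triples: List[Triple]) -> List[str]:
    """Produce index keys at the three levels shown on the slide."""
    keys = []
    for word, pos in tokens:
        keys.append(word)                  # level 1: single word
        keys.append(f"{word}_{pos}")       # level 2: single word with its POS
    for rel, a, b in triples:
        keys.append(f"{rel}~{a}~{b}")      # level 3: collocation
    return keys


def build_index(
    sentences: List[Tuple[List[Token], List[Triple]]]
) -> Dict[str, Set[int]]:
    """Map every key to the ids of the sentences that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sent_id, (tokens, triples) in enumerate(sentences):
        for key in index_keys(tokens, triples):
            index[key].add(sent_id)
    return index


# Part of the example sentence from the slide, with its POS tags and triples.
sentence = (
    [("he", "Pron"), ("could", "Verb"), ("hardly", "Adv")],
    [("Tsub", "he", "afford"), ("Tobj", "time", "waste")],
)
index = build_index([sentence])
```

Looking up `index["hardly_Adv"]` or `index["Tobj~time~waste"]` then returns the ids of matching sentences, so a query can be answered at whichever level of detail it specifies.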

Parallel Sentence:
  • He could hardly afford to waste that golden time.
  • 他无法浪费那样的好时光。

[Figure: pipeline diagram; labels recovered from the slide: Web Mining, Mined Data, Linguistic Parsing, Parsed Data, Knowledge Mining, Linguistic Knowledge, Multi-level Indexing, Indexed Data, Machine Translation Model, Paraphrasing Model]


Chinese Couplets

Incorporate users' contributions into the system's evolution


Demo

Chinese Couplets (http://duilian.msra.cn)

http://video.sina.com.cn/v/b/10937201-1452530713.html

FS and SS Share the Same Style (FS: first sentence; SS: second sentence)

Repetition of pronunciations (音韵联)

风 (wind) ---- 水 (water)

吹 (blow) ---- 使 (make)

荞 (buckwheat) ---- 舟 (ship)

动 (wave) ---- 流 (go)

桥 (bridge) ---- 洲 (island)

未 (not) ---- 不 (not)

动 (wave) ---- 流 (go)

FS and SS Share the Same Style

Decomposition of characters (拆字联)

有 (have) ---- 缺 (lack)

子 (son) ---- 鱼 (fish)

有 (have) ---- 缺 (lack)

女 (daughter) ---- 羊 (mutton)

方 (so) ---- 敢 (dare)

称 (call) ---- 叫 (call)

好 (good) ---- 鲜 (fresh)

女 + 子 = 好; 鱼 + 羊 = 鲜

FS and SS Share the Same Style

Person name (人名联) and palindrome (回文联)

板桥 (Banqiao) ---- 东坡 (Dongpo)

造 (produce) ---- 居 (live)

桥 (bridge) ---- 坡 (slope)

板 (board) ---- 东 (east)

  • Banqiao (板桥) and Dongpo (东坡) are famous men of letters
  • Each sentence reads the same from top to bottom as from bottom to top
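SS Generation Process

FS: 海 阔 凭 鱼 跃 (sea wide allow fish jump)

[Figure: generation pipeline with the stages SMT decoding, Linguistic filtering, and Reranking]

  • Candidate words shown in the figure: hill, high, insect, fly, bird, dance, permit, sky, deep, tiger, tweedle, depend
  • Candidate phrases shown in the figure: 天高 (sky high), 山高 (hill high), 鸟飞 (bird fly), 虎啸 (tiger roar)
  • Candidate lists shown in the figure:
    • 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山深任鸟飞, 天高任花香, 天高任鸟舞, 山高任花香, ……
    • 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山高靠虎啸, 山高任虎啸, 山深任鸟飞, 天高任花香, ……
    • 天高任鸟飞, 山高任鸟飞, 天高任鸟鸣, 天高任鸟舞, 山深任鸟飞, 山高任花香, 天高任花香, ……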

SS Generation Approach

  • A multi-phase SMT approach
    • Phase 1: a phrase-based log-linear model
    • Phase 2: some linguistic filters
    • Phase 3: a Ranking SVM
  • Flow: FS input → phrase-based log-linear model → N-best candidates → linguistic filters → Ranking SVM model → SS output
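The filter and rerank phases can be sketched as follows. This is a minimal illustration under stated assumptions, not the MSRA system: the slide does not say which linguistic filters are used, so the three checks here (equal length, matching character-repetition pattern, no character reused in the same position) are assumptions; the linear scoring function stands in for a trained Ranking SVM; and the N-best list, feature names, and weights are made up.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, Dict[str, float]]  # (second sentence, feature values)


def repetition_pattern(sentence: str) -> Tuple[int, ...]:
    """Replace each character by the position of its first occurrence."""
    first_seen: Dict[str, int] = {}
    return tuple(first_seen.setdefault(ch, i) for i, ch in enumerate(sentence))


def passes_filters(fs: str, ss: str) -> bool:
    """Assumed linguistic filters for a candidate second sentence."""
    return (
        len(fs) == len(ss)
        and repetition_pattern(fs) == repetition_pattern(ss)
        and all(a != b for a, b in zip(fs, ss))
    )


def rerank(candidates: List[Candidate], weights: Dict[str, float]) -> List[str]:
    """Order candidates by a linear score, standing in for a Ranking SVM."""
    def score(features: Dict[str, float]) -> float:
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
    ranked = sorted(candidates, key=lambda c: score(c[1]), reverse=True)
    return [text for text, _ in ranked]


def generate_ss(fs: str, n_best: List[Candidate], weights: Dict[str, float]) -> List[str]:
    """Phases 2 and 3: filter the decoder's N-best list, then rerank it."""
    kept = [(text, feats) for text, feats in n_best if passes_filters(fs, text)]
    return rerank(kept, weights)


fs = "千江有水千江月"
# Hypothetical N-best list with made-up translation/language model log scores.
n_best = [
    ("万里无云千里星", {"tm": -1.5, "lm": -2.5}),  # breaks the repetition pattern
    ("万山无云万山风", {"tm": -2.5, "lm": -3.5}),
    ("万里无云万里星", {"tm": -2.0, "lm": -3.0}),
]
result = generate_ss(fs, n_best, {"tm": 1.0, "lm": 0.5})
```

In this toy run the first candidate has the best decoder scores but is rejected because it does not repeat its first two characters where the first sentence does, which is the kind of error a filter phase is meant to catch before reranking.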

Great Examples
  • FS:月落乌啼霜满天
  • SS:风吹雁过雨连宵
  • FS:千江有水千江月
  • SS:万里无云万里星
  • FS:秦淮河桨声灯影
  • SS:松花江水色月光
  • FS:此木为柴山山出 (此+木=柴;山+山=出)
  • SS:白水作泉日日昌 (白+水=泉;日+日=昌)
User log for Model Enhancement
  • Motivation
    • Training data is inadequate
    • The user log, by contrast, is large (60k/m), growing, and diverse
  • What we log
    • User inputs
    • User finalized couplets
      • Second sentences selected out of the candidates provided by our system
      • User modified second sentences
User Log Analysis
  • Data Source
    • Log from http://couplet.msra.cn
  • Time period
    • Aug. 31-Oct. 9, 2006
New Framework with Log Data

[Figure: framework diagram; labels recovered from the slide: First sentence input, Source-Channel model, Translation model, Language model, Training data, Log data, N-best candidates, Mutual information, Re-ranking, Second sentence output, User operation. The Translation model, Language model, and Mutual information labels each appear twice.]
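As an illustration of how log data could feed the mutual-information component used in re-ranking, the sketch below estimates pointwise mutual information between position-aligned characters of first and second sentences. Several things here are assumptions rather than details from the slides: that mutual information is computed over position-aligned character pairs, that training data and user-log couplets are simply pooled, and the helper names `train_pmi` and `mi_score`. The two couplets are taken from the examples earlier in the deck; which one plays the role of log data is arbitrary.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Couplet = Tuple[str, str]  # (first sentence, second sentence)


def train_pmi(couplets: Iterable[Couplet]) -> Dict[Tuple[str, str], float]:
    """Estimate PMI(a, b) = log(p(a, b) / (p(a) * p(b))) over aligned characters."""
    joint: Counter = Counter()
    left: Counter = Counter()
    right: Counter = Counter()
    total = 0
    for fs, ss in couplets:
        for a, b in zip(fs, ss):
            joint[(a, b)] += 1
            left[a] += 1
            right[b] += 1
            total += 1
    return {
        (a, b): math.log(n * total / (left[a] * right[b]))
        for (a, b), n in joint.items()
    }


def mi_score(fs: str, ss: str, pmi: Dict[Tuple[str, str], float]) -> float:
    """Sum PMI over aligned positions; unseen pairs contribute nothing."""
    return sum(pmi.get((a, b), 0.0) for a, b in zip(fs, ss))


training_data = [("千江有水千江月", "万里无云万里星")]
log_data = [("月落乌啼霜满天", "风吹雁过雨连宵")]  # stands in for user-finalized couplets
pmi = train_pmi(training_data + log_data)  # pooling is an assumption

good = mi_score("千江有水千江月", "万里无云万里星", pmi)
bad = mi_score("千江有水千江月", "风吹雁过雨连宵", pmi)
```

Here `good` is larger than `bad` because every aligned character pair in the first candidate was observed in the pooled data, while none of the pairs in the second was. As more user-finalized couplets arrive, re-estimating the table is one simple way the log could keep improving the ranking.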


Twitter Search

Moving to the social Internet and mobile


[Figure: Twitter search architecture; labels recovered from the slide]
  • A collection of tweets: Tweets; Tweets Cluster; Statistical Relationship Learning; News & Images Link Extraction; Community Extraction; User Influence Measure; Multi-level Indexing; Hot tag and topic Extraction; Popular Tweet Extraction; Semantic Search; Noise Filtering; Top video, music, and artists Extraction
  • Individual tweet: Semantic Role Labeling; Sentiment Analysis; NE Recognition; Dependency Parsing; Co-reference; Sentence Boundary Detection; Text Normalization; Classification
  • Raw Data

Conclusion
  • Internet trends and their impact on NLP
  • NLP 2.0 strategy
  • Web data mining: Engkoo
  • User power: Couplets
  • SNS and mobile: Twitter search