Computational Linguistics at OSU

Chris Brew

Linguistics, Cognitive Science and CSE

The Ohio State University

Who am I?
  • Chris Brew, Associate Professor
  • Full-time in NLP since about 1984.
    • B.Sc Chemistry (Bristol)
    • Masters and Ph.D (Sussex)
      • NLP done in a Psychology department!
    • Research positions at Sussex, Edinburgh and in industry (Sharp)
    • Faculty in Linguistics at Ohio State since 2000
    • Joint appointment in CSE
What I’ve done
  • Parsing and Dialogue
  • Machine Translation (teaching class now)
  • XML and corpus annotation
  • Learning word meanings from large datasets
  • Sound/Meaning relations
  • Other stuff…
  • Linguistics is the scientific study of language and communication.
  • Linguists run experiments, do surveys, build simulations, do proofs.
  • Linguistics at OSU is:
    • In the top 10 nationally
    • Diverse and open-minded
Strengths of Linguistics at OSU
  • Syntax, Semantics, Pragmatics
  • Phonetics: the study of how people make and perceive the sounds of language
  • Psycholinguistics: the study of how people process sounds, words, sentences, intonation
  • Sociolinguistics: the study of how society and social situations change the way we speak.
  • Computational Linguistics and NLP
Computational Linguistics at OSU
  • 3 faculty members and 20 students based in Linguistics (Oxley Hall)
    • Detmar Meurers (Parsing, Corpus Annotation, Computer-aided Language Learning)
    • Chris Brew (Statistical NLP)
    • Michael White (Natural Language Generation)
  • Close ties with Drs. Byron and Fosler-Lussier in CSE.
  • We are willing and able to advise or co-advise on research, and have projects that cross the departmental boundaries.
Computational Linguistics
  • Data Intensive Linguistics: using large datasets to answer questions about language
    • How do children learn language?
    • How do technical terms get their meanings?
    • Why do people have so little difficulty understanding what each other are saying?
    • How are words stored in the brain?
Computational Linguistics
  • Machine understanding: building machines that read, write, converse using natural language.
  • Several well-known subtasks
    • Tokenization:
    • Parsing: building syntax trees
    • Building meaning representations (MR)
    • Generating language from MR
Computational Linguistics
  • NLP: building systems that do useful or interesting things with language
    • Summarization
    • Machine Translation
    • Question Answering
    • Document Understanding
Relation to CSE
  • Challenging problems in working with large datasets.
    • Document classification is large along three dimensions
      • Large number of available predictive features (10^4 different words in typical collections)
      • Many instances (1000s or millions of sentences)
      • Many possible outputs (e.g. classify against the 100s of labels in the DMOZ hierarchy)
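
To make the three dimensions concrete, here is a minimal sketch that measures them on a toy collection. The documents, the labels, and the helper names `bag_of_words` and `problem_size` are invented for illustration; real collections are several orders of magnitude larger on every axis.

```python
from collections import Counter

# Toy, made-up collection of (text, label) pairs.
DOCS = [
    ("cheap flights to paris", "Travel"),
    ("paris hotels and flights", "Travel"),
    ("protein folding in yeast", "Science"),
    ("elegant evening wear sale", "Shopping"),
]


def bag_of_words(text):
    """Map a document to word counts (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


def problem_size(docs):
    """Return (number of features, number of instances, number of labels)."""
    vocabulary = {word for text, _ in docs for word in bag_of_words(text)}
    labels = {label for _, label in docs}
    return len(vocabulary), len(docs), len(labels)


print(problem_size(DOCS))
```

Each distinct word is one predictive feature, so the feature count grows with the vocabulary rather than with any fixed schema.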
Relation to CSE
  • Consumer of CS tools
    • Tokenization, Parsing
      • Could use lex and yacc (javacc/antlr), but beware ambiguity
      • Many special purpose parsers, taggers, chunkers that use machine learning to achieve robustness
    • Machine understanding
      • AI-complete
      • Prolog and other PL innovations caused by NL research
Why the world cares
  • 1700 biology papers per day. Nobody can keep up. (UNDERSTAND/SUMMARIZE)
  • Ad placement in search engines. Perhaps you can spot a search for flights to Paris and place a successful sidebar ad for expensive and elegant evening wear. (INTENT)
  • Automated essay grading. (CLASSIFICATION)
  • Too many emails to monitor. Spooks can’t keep up.
  • Especially in Arabic
There is demand…
  • “Develop language-independent algorithms, techniques, and methodologies to support rapid development of the basic resources … for any arbitrary language with a written form. Corpus-based unsupervised and lightly-supervised methods are acceptable, as are lightweight elicitation methodologies from untrained native speakers or other generally available (in the US) informants.”
    • Source: Research on English and Foreign Language EXploitation (REFLEX), Broad Agency Announcement (BAA), BAA 04-01-FH, 15 March 2004
Current work
  • NSF Career project
    • Key idea: dimensionality reduction for linguistic data.
    • Hypothesis: neighborhood structure is more important and cognitively salient than (for example) preserving detail of long-distance relationships
    • Compare: min-cut, LLE, SNE, LSI
Paul Davis
  • Statistical Machine Translation
  • Is there a simple and flexible architecture for Statistical MT?
  • Why: current systems are all built on an IBM design.
    • they all mess up
    • they all mess up in much the same way
    • Alternatives are needed.
  • Graduated 2002: now at Motorola Research
Martin Jansche
  • Learning String-to-String Transductions (mostly for text-to-speech)
  • Bucks -> /b u k z/
  • Why: People were doing lots of this, but the theory, the evaluation criteria and the quality of the resulting systems left much to be desired.
  • Graduated 2003: now at Columbia Center for Machine Learning as research faculty
Nathan Vaillette
  • Formally verified string-to-string transductions.
  • Rule: aa -> b
  • Input aaacaa. What is the output?
  • bbcb ?
  • bacb ?
  • abcb ?
  • Why: rules like these are used a lot, but no convincing account of exactly what they mean.
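
The three candidate outputs can each be produced by a different, equally plausible reading of the rule. The sketch below is only an illustration of that ambiguity, not the formal model described on the next slide; the function names and the three application regimes are assumptions chosen to reproduce the answers listed above.

```python
def rewrite_left_to_right(text, lhs="aa", rhs="b"):
    """Scan left to right, replacing each non-overlapping match of lhs."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith(lhs, i):
            out.append(rhs)
            i += len(lhs)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def rewrite_right_to_left(text, lhs="aa", rhs="b"):
    """Same scan, but starting from the right end of the string."""
    mirrored = rewrite_left_to_right(text[::-1], lhs[::-1], rhs[::-1])
    return mirrored[::-1]


def rewrite_simultaneous(text, lhs="aa", rhs="b"):
    """Every match start (overlaps included) emits rhs.

    Characters not covered by any match are copied through unchanged.
    """
    starts = {i for i in range(len(text) - len(lhs) + 1)
              if text.startswith(lhs, i)}
    covered = {j for i in starts for j in range(i, i + len(lhs))}
    out = []
    for i, ch in enumerate(text):
        if i in starts:
            out.append(rhs)
        elif i not in covered:
            out.append(ch)
    return "".join(out)


for rewrite in (rewrite_simultaneous, rewrite_left_to_right, rewrite_right_to_left):
    print(rewrite.__name__, rewrite("aaacaa"))
```

On the input aaacaa the simultaneous reading gives bbcb, the left-to-right reading gives bacb, and the right-to-left reading gives abcb, so the rule alone does not determine the output.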
… Nathan Vaillette
  • Used technology from hardware verification (!) to build and implement formal model of string rewriting process.
  • First ever implementation of this widely used component for which the specification is clear and the correspondence between specification and implementation provably correct.
  • Graduated 2003: now teaching AI at Hampshire College
Sabine Schulte im Walde
  • Inducing German Verb Classes from Corpus Data.
  • Why: build better dictionaries automatically
  • Why: difficult large dataset
  • Technology: k-means, spectral clustering
  • Graduated 2003 from University of Stuttgart: Language Technology Manager with Duden dictionaries, then research staff at University of Saarbrücken
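
For readers unfamiliar with k-means, the following is a minimal sketch of the algorithm on made-up data. The two-dimensional feature vectors stand in for per-verb corpus statistics; the numbers, the starting centroids, and the function name `kmeans` are all hypothetical, and the actual thesis work used much richer features and larger data.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of floats, with caller-supplied start centroids."""
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        assignment = [
            min(
                range(len(centroids)),
                key=lambda k: sum(
                    (p - c) ** 2 for p, c in zip(point, centroids[k])
                ),
            )
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignment) if a == k]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids


# Hypothetical verbs described by two invented proportions.
verb_features = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
clusters, _ = kmeans(verb_features, centroids=[(1.0, 0.0), (0.0, 1.0)])
print(clusters)
```

Passing the starting centroids explicitly keeps the sketch deterministic; in practice the result depends on initialization, which is one reason the dataset is described as difficult.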
Kyuchul Yoon
  • Grapheme to Phoneme conversion for Korean
  • Why: words of foreign origin need special treatment, existing machine learning approaches are too knowledge-free
  • Graduated 2005: now at Pusan University
Anna Feldman
  • Using Czech language resources to bootstrap resources for Russian
    • Why: Czech and Russian are supposed to be related, but can we use this fact technologically?
    • Yes. Works, but not perfectly.
  • Same thing, for Spanish and Portuguese
Anton Rytting
  • Computational and experimental studies of spoken language, emphasis on word segmentation strategies that might be useful to infants
  • Why: infants should be able to learn any language.
Medical Informatics (very new)
  • Collaboration with John Pestian, Cincinnati Children's Hospital Medical Center
  • Why: doctors provide discharge summaries (i.e. text), we want information (mundanely: ICD-9 terms as billing codes)
  • How: neural networks, careful encoding of domain knowledge. Tuning of ICD-9 to include/exclude terms that do/don't occur in radiology summaries
What I’d like to do more of
  • Very large scale work
  • Unsupervised and lightly supervised learning
  • Cute applications of machine learning
  • Distributed and parallel NLP
What I am looking for?
  • People who can take an idea about learning from data and turn it into a Master’s thesis. Especially people who have side expertise in an application area, such as medicine, biology, business, lion-taming.
  • Might have funding for the right person, though Linguistics Ph.D students take precedence.
What I am looking for?
  • People with very good communication and programming skills who could collaborate with a Linguistics student to make something better than either could alone. Cognitive Science summer fellowships.
  • Interesting new problems that can be learned from data.