
Computational Linguistics




  1. Computational Linguistics
     WTLAB (Web Technology Laboratory)
     Mohsen Kamyar

  2. Computer Science
     • In recent years the main “Data Source” has been the “World Wide Web” and other sources of text data.
     • Data generation is autonomous.
     • We can’t force people to use a specific format for data.
     • People want to present data in as few words as possible.
     • We will see structures that are illegal under the grammar of the language, or that are not even words.
     • We will see rapid language changes, so we can’t use static models of language.

  3. Computer Science (cont.)
     • In this view, the precision of language processing is computed from the frequency of words (in Linguistics, by contrast, we deal with distinct words).
     • Some examples of such applications:
       • American governmental programs:
         • Total Information Awareness (TIA), during 2003
         • Computer-Assisted Passenger Prescreening System (CAPPS II), until 2004, which assigns a color to each passenger
         • Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE), during 2004-2006, as a component of a program with a $47 million budget

  4. Computer Science (cont.)
     • Multistate Anti-Terrorism Information Exchange (MATRIX), until 2005
     • Many software vendors (based on 2008 reports):
       • Angoss Software, Infor CRM Epiphany, Kxen, Portrait Software, SAS, SPSS, ThinkAnalytics, Unica, Viscovery, …
     • However, some applications are closer to Linguistics:
       • Machine Translation
       • Human-Computer interaction applications
       • Text to Speech
       • Text Simplification

  5. Data Mining
     • From the “Text Data” point of view, Data Mining has three main steps:
       • Pre-processing
         • Preparing a representation of the data that is suitable for the next steps.
       • Data Mining
         • Determining the relevance of the data from the following views:
         • Classification: arranging the data into predefined groups.
         • Clustering: arranging the data into groups that are not predefined and have to be discovered.

  6. Data Mining (cont.)
     • Regression: finding an equation that describes the data model.
     • Association Rule Learning: finding relations between concepts or main objects in the data model.
     • Interpreting the results
     • We can guess that the research areas shared by “Computer Science” and “Linguistics” in this process lie in steps 1 and 3, pre-processing and interpreting the results (mainly step 1).
     • An example highlights this.
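
To make "finding relations between concepts" a little more concrete, here is a minimal sketch of association rule learning over sets of terms. It uses the usual support and confidence measures; the toy `documents`, the thresholds, and the helper names (`support`, `confidence`, `pair_rules`) are assumptions made for illustration and do not come from the slides.

```python
from itertools import combinations

# Hypothetical toy data: each "transaction" is the set of terms in one document.
documents = [
    {"search", "engine", "index"},
    {"search", "engine", "ranking"},
    {"search", "index"},
    {"engine", "ranking"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from co-occurrence counts."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (x, y, confidence) for every rule x -> y between two single terms."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        # Skip pairs that co-occur too rarely to be interesting.
        if support({a, b}, transactions) < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            c = confidence({x}, {y}, transactions)
            if c >= min_confidence:
                rules.append((x, y, c))
    return rules
```

Calling `pair_rules(documents)` lists the term pairs that pass both thresholds. Real association rule miners also consider larger itemsets, which this sketch leaves out.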

  7. Search Engine
     [Architecture diagram; component labels: Web, Web Cache, Crawler, URL Queue, Indexer, Stemmer, WordNet, Indexes, Ranking]
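
As a rough illustration of the Indexer and Indexes components named on this slide, the sketch below builds an inverted index (term to set of pages) and answers a query by intersecting the sets. The slide does not say how the indexer works, so the tokenizer, the function names, and the example URLs are all assumptions.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def build_index(pages):
    """pages: {url: text}. Returns an inverted index {term: set of urls}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search(index, query):
    """Return the urls that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical pages, standing in for what a crawler would fetch.
pages = {
    "a.example/1": "The crawler fetches pages from the Web.",
    "a.example/2": "The indexer builds indexes for the search engine.",
    "a.example/3": "A web cache stores pages fetched by the crawler.",
}
index = build_index(pages)
```

Here `search(index, "web crawler")` returns the pages that mention both terms. A stemmer would normally run inside `tokenize`, and ranking would order the returned set.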

  8. Search Engine (cont.)
     • The search engine is the most popular application, the most important example of data mining in use, one of the high technologies, and so on.
     • In pre-processing, search engines have the following tasks that focus on linguistic aspects of the data:
       • Computing the importance factor of a word in a document
         • Frequency
         • TF-IDF (Vector Space Model)
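
The two importance factors above can be sketched as follows. This uses one common TF-IDF variant, with term frequency as count divided by document length and inverse document frequency as log(N / df); the slide does not say which variant is meant, so the exact weighting and the function name `tf_idf` are assumptions. Documents are assumed to be non-empty lists of tokens.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative frequency in this document times rarity across documents.
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document gets weight 0 under this variant, which is why plain frequency alone tends to overrate very common words.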

  9. Search Engine (cont.)
     • Stemming
       • There are two main categories of approaches: dictionary-based and non-dictionary-based.
       • Using the tag of a word in a sentence for stemming
     • Related words (works such as WordNet)
       • Synonyms: same meaning
       • Hypernyms and hyponyms: general concepts and sub-concepts
       • Homonyms: same spelling but different meaning
       • Acronyms: abbreviations
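
The two categories of stemming approaches can be contrasted in a few lines. This is a toy sketch, not a real stemming algorithm: `LEMMA_DICT`, `SUFFIXES`, and the minimum stem length are hypothetical choices made for illustration, and a real rule-based stemmer uses many more rules.

```python
# Hypothetical lookup table for the dictionary-based approach.
LEMMA_DICT = {"ran": "run", "mice": "mouse", "indexes": "index"}

# Suffix rules for the non-dictionary approach, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def rule_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def stem(word):
    """Try the dictionary first, then fall back to the suffix rules."""
    word = word.lower()
    return LEMMA_DICT.get(word) or rule_stem(word)
```

The dictionary handles irregular forms that no suffix rule can reach, while the rules cover words the dictionary has never seen. The rules are crude, for example `stem("engines")` yields `engin`, which is acceptable for indexing as long as queries are stemmed the same way.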

  10. Semantic Search Engine
     • In a “Semantic Search Engine” the main differences are as follows:
       • Indexing is based not on words but on an “Ontology”
         • Ontology Extraction
         • Latent Semantic Indexing
       • Ranking is based not on “Web Links” but on “Similarity Between Pages”.
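
One simple way to read “Similarity Between Pages” is cosine similarity between page vectors. The sketch below uses raw term-count vectors as a stand-in; it does not implement ontology extraction or Latent Semantic Indexing, and the function names and example pages are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_text, pages):
    """pages: {name: text}. Returns page names, most similar to the query first."""
    query_vec = Counter(query_text.split())
    scored = [(cosine(query_vec, Counter(text.split())), name)
              for name, text in pages.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical pages used only to exercise the functions.
example_pages = {
    "p1": "semantic search engine ontology",
    "p2": "football match results",
    "p3": "search engine ranking",
}
```

With these pages, `rank_by_similarity("semantic search engine", example_pages)` orders the pages by how much vocabulary they share with the query. In a semantic engine the vectors would live in a concept space rather than a word space, but the ranking step would look much the same.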
