Computational Linguistics

Computational Linguistics
WTLAB (Web Technology Laboratory)
Mohsen Kamyar

Computer Science
• In recent years the main “Data Source” has been the “World Wide Web” and other sources of text data.
• Data is generated autonomously.
• We can’t force people to use a specific data format.
• People want to present data in as few words as possible.
• We will see structures that violate the grammar of the language, and even tokens that are not words at all.
• We will see rapid language change, so we can’t use static models of language.

Computer Science (cont.)
• In this view, the precision of language processing is computed from word frequencies (in Linguistics, by contrast, we work with distinct words).
• Some examples of such applications:
• American governmental programs:
  • Total Information Awareness (TIA), during 2003
  • Computer-Assisted Passenger Prescreening System (CAPPS II), until 2004, which assigns a color to each passenger
  • Analysis, Dissemination, Visualization, Insight, Semantic Enhancement (ADVISE), during 2004-2006, as a component of a program with a $47 million budget

Computer Science (cont.)
• Multistate Anti-Terrorism Information Exchange (MATRIX), until 2005
• Many software vendors (based on 2008 reports):
  • Angoss Software, Infor CRM Epiphany, Kxen, Portrait Software, SAS, SPSS, ThinkAnalytics, Unica, Viscovery, …
• However, some applications are closer to Linguistics:
  • Machine Translation
  • Human-Computer Interaction applications
  • Text to Speech
  • Text Simplification

Data Mining
• From the “Text Data” point of view, Data Mining has three main steps:
• Pre-processing
  • Preparing a representation of the data that is suitable for the next steps.
• Data Mining
  • Identifying relationships in the data from the following points of view:
  • Classification: arranging the data into predefined groups.
  • Clustering: arranging the data into groups that are not predefined and must be discovered.

Data Mining (cont.)
• Regression: finding an equation that describes the data model.
• Association Rule Learning: finding relations between concepts or main objects in the data model.
• Interpreting the results
• We can guess that the research areas shared by “Computer Science” and “Linguistics” in this process are steps 1 and 3, pre-processing and interpreting the results (mainly step 1).
• The search engine example that follows highlights this.
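The difference between classification and clustering can be sketched in a few lines of Python. This is only an illustrative sketch on one-dimensional data (for example, document lengths in words); the function names, the group labels, and the crude k-means initialisation are all assumptions made for this example, not part of the original slides.

```python
from statistics import mean

def nearest(value, centers):
    """Return the index of the center closest to value."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))

def classify(value, labeled_centers):
    """Classification: the groups are predefined; pick the closest one."""
    labels = list(labeled_centers)
    return labels[nearest(value, [labeled_centers[l] for l in labels])]

def cluster(values, k, rounds=10):
    """Clustering: discover k groups with a tiny 1-D k-means (rounds >= 1)."""
    # Crude initialisation (an assumption): evenly spaced sorted values.
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            groups[nearest(v, centers)].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return groups

# Predefined groups (hypothetical labels and centers).
print(classify(18, {"short": 15, "long": 200}))
# Groups that have to be found from the data itself.
print(cluster([12, 15, 14, 210, 220, 205], k=2))
```

In `classify` the groups are given in advance, while `cluster` receives only the raw values and the number of groups to look for.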

Search Engine
[Figure: search engine architecture. Components shown: Web, Web Cache, Crawler, URL Queue, Indexer, Stemmer, WordNet, Indexes, Ranking.]

Search Engine (cont.)
• The search engine is the most popular application of data mining and its most important example, and it counts among today’s high technologies.
• In pre-processing, search engines perform the following tasks, which focus on the linguistic aspects of the data:
• Computing the importance factor of a word in a document
  • Frequency
  • TF-IDF (Vector Space Model)
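As a rough illustration of the two importance measures above, the following Python sketch computes TF-IDF weights for a small set of documents. It assumes whitespace tokenisation and one common weighting variant (relative term frequency times the natural logarithm of N divided by the document frequency); real search engines use many variations, and the function name and sample sentences are invented for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(number of documents / documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the crawler fetches the web",
    "the indexer builds the index",
    "the stemmer reduces words",
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A word such as "the" is the most frequent token in every document, yet it receives a weight of zero because it occurs in all of them. This is why TF-IDF is usually a better importance factor than raw frequency.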

Search Engine (cont.)
• Stemming
  • There are two main categories of approaches: dictionary-based and non-dictionary-based.
  • The tag assigned to a word within its sentence can be used to guide stemming.
• Related words (resources such as WordNet)
  • Synonyms: same meaning
  • Hypernyms and hyponyms: general concepts and sub-concepts
  • Homonyms: same spelling but different meaning
  • Acronyms: abbreviations
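The two categories of stemming approaches can be contrasted with a small Python sketch. The lookup table and the suffix list below are hypothetical toy data chosen for illustration; practical stemmers (for example the Porter stemmer) apply far more elaborate rules.

```python
# Hypothetical lookup table for the dictionary-based approach.
LEMMA_DICT = {"ran": "run", "mice": "mouse", "better": "good"}

# Hypothetical suffix list for the non-dictionary-based approach, longest first.
SUFFIXES = ("ing", "ed", "es", "s")

def rule_stem(word, min_stem=3):
    """Non-dictionary-based: strip the first matching suffix,
    as long as at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def stem(word):
    """Dictionary lookup first, suffix stripping as a fallback."""
    word = word.lower()
    return LEMMA_DICT.get(word, rule_stem(word))

for w in ["Indexes", "crawling", "ranked", "mice", "was"]:
    print(w, "->", stem(w))
```

The dictionary handles irregular forms such as "mice" that no suffix rule can reach, while the rules cover words the dictionary has never seen. Combining the two, as `stem` does here, is one possible design and not the only one.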

Semantic Search Engine
• A “Semantic Search Engine” differs mainly in the following ways:
• Indexing is based not on words but on an “Ontology”
  • Ontology Extraction
  • Latent Semantic Indexing
• Ranking is based not on “Web Links” but on “Similarity Between Pages”.
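Ranking by “Similarity Between Pages” can be sketched with cosine similarity between page vectors. The Python example below uses plain word counts as a stand-in for the page representation; that is an assumption made to keep the example short, since a semantic search engine would compare ontology concepts or Latent Semantic Indexing vectors instead. The function names and sample pages are invented for illustration.

```python
import math
from collections import Counter

def cosine_similarity(page_a, page_b):
    """Cosine of the angle between the word-count vectors of two pages."""
    a = Counter(page_a.lower().split())
    b = Counter(page_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page is similar to nothing
    return dot / (norm_a * norm_b)

def rank_by_similarity(reference_page, pages):
    """Order pages from most to least similar to the reference page."""
    return sorted(pages,
                  key=lambda p: cosine_similarity(reference_page, p),
                  reverse=True)

pages = ["web link ranking", "semantic search with ontology"]
print(rank_by_similarity("semantic search engine", pages))
```

The ranking here depends only on the content of the pages, not on the links between them.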
