
Text segmentation

Text segmentation. Amany AlKhayat. Before any real processing is done, text needs to be segmented at least into linguistic units such as words, punctuation, and numbers. This process is called tokenization, and the segmented units are called word tokens. Ex: In addition, she was there.


Presentation Transcript


  1. Text segmentation Amany AlKhayat

  2. Before any real processing is done, text needs to be segmented at least into linguistic units such as words, punctuation, and numbers. • This process is called tokenization, and the segmented units are called word tokens. • Ex: In addition, she was there. • After segmentation: In addition , she was there .
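The segmentation on this slide can be sketched with a single regular expression. This is a minimal illustration, not the author's tokenizer; the names `TOKEN_PATTERN` and `tokenize` are assumptions made for the example, and the rule for numbers (digits optionally joined by "." or ",") is a simplification.

```python
import re

# Assumed pattern: a number (digits, optionally joined by "." or ","),
# or a run of word characters, or any single non-space punctuation mark.
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def tokenize(text):
    """Return the word, number, and punctuation tokens found in text."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("In addition, she was there.")
# Joining with spaces reproduces the "after segmentation" form on the slide.
print(" ".join(tokens))
```

Real tokenizers also have to deal with abbreviations, hyphenated words, and clitics such as "don't", which this sketch splits naively.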

  3. Tokenization • Tokenization and sentence splitting can be described as ‘low-level’ segmentation, which is performed at the initial stage of text processing. These tasks are typically handled by regular expressions written in Perl or any other programming language.

  4. Tokenization II • High-level text segmentation can be intra-sentential or inter-sentential. • Intra-sentential segmentation involves the segmentation of linguistic groups such as named entities and noun groups. • Inter-sentential segmentation involves grouping sentences and paragraphs into discourse topics, which are also called text tiles.

  5. Word segmentation • Word tokens are the individual occurrences of words in a text, so the same word can be counted many times. • Word types are the distinct words of the vocabulary. • Ex. Shakespeare’s works contain more than 800,000 word tokens but only about 31,000 word types.
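The token/type distinction can be made concrete with a short count. This is a sketch under simple assumptions: words are runs of word characters, and case is folded so that "The" and "the" count as one type. The function name `count_tokens_and_types` is made up for the example.

```python
import re
from collections import Counter


def count_tokens_and_types(text):
    """Return (number of word tokens, number of word types) for text."""
    tokens = re.findall(r"\w+", text.lower())  # every occurrence is a token
    types = Counter(tokens)                    # distinct words are the types
    return len(tokens), len(types)


n_tokens, n_types = count_tokens_and_types("The cat saw the dog and the bird")
print(n_tokens, n_types)  # "the" occurs three times but is a single type
```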

  6. Tokenizing sentences • Tokenizing sentences by inserting white space is cumbersome. Moreover, once the white space has been added, the original text cannot be restored. • SGML or XML markup is a cleaner strategy for tokenization, because the text can easily be reverted to its original form. • Ex. <w c=w>it</w> <w c=w>is</w> <w c=w>here</w><w c=p>.</w>
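A minimal sketch of the markup strategy: each token is wrapped in a `<w>` element while the white space between tokens is left untouched, so removing the tags restores the original text. The helper names `mark_up` and `strip_markup` are hypothetical, the attribute is quoted (`c="w"`) to keep the output well-formed XML, and only the two classes from the slide (w for word, p for punctuation) are used.

```python
import re
from xml.sax.saxutils import escape, unescape

TOKEN = re.compile(r"\w+|[^\w\s]")


def mark_up(text):
    """Wrap each token in a <w> element, keeping the original spacing."""
    parts = []
    pos = 0
    for match in TOKEN.finditer(text):
        parts.append(escape(text[pos:match.start()]))  # white space, verbatim
        token = match.group()
        cls = "w" if re.match(r"\w", token) else "p"   # word or punctuation
        parts.append('<w c="%s">%s</w>' % (cls, escape(token)))
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


def strip_markup(marked):
    """Remove the <w> tags and undo escaping to recover the original text."""
    return unescape(re.sub(r"</?w[^>]*>", "", marked))


original = "it is here."
marked = mark_up(original)
print(marked)
print(strip_markup(marked) == original)
```

Because the spacing is stored outside the tags, the round trip is lossless, which is the advantage over tokenizing by inserting spaces.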

  7. Sentence segmentation • Important for many text processing applications: syntactic parsing, information extraction, text alignment, machine translation, etc.

  8. Accurate splitting, known as sentence boundary disambiguation (SBD), requires analysis of the local context around periods and other punctuation marks. • Compare: • He stopped to see Dr. White. • He stopped at Meadows Dr. White Falcon was still open. • Which period is sentence-internal and which one is sentence-terminal?

  9. Simplest algorithm for sentence boundary disambiguation • ‘period-space-capital letter’ • It marks as a sentence boundary every period, exclamation mark, and question mark that is followed by a space and a capital letter. • Regex: • [.?!][ ()”]+[A-Z]
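The ‘period-space-capital letter’ rule can be tried out directly. This is a sketch, not a production splitter: the function name `split_sentences` is an assumption, the straight double quote is accepted alongside the curly one, and the capital letter is checked with a lookahead so that it stays with the following sentence.

```python
import re

# The slide's pattern: end punctuation, then spaces/brackets/quotes,
# then a capital letter (checked but not consumed).
BOUNDARY = re.compile(r'([.?!])[ ()"”]+(?=[A-Z])')


def split_sentences(text):
    """Split text wherever the period-space-capital rule fires."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end(1)  # cut right after the punctuation mark
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("It rained. We stayed in."))
# The rule also fires after the abbreviation "Dr.", which is why SBD
# needs more context than this pattern provides.
print(split_sentences("He stopped to see Dr. White. He left."))
```

The second call shows the weakness discussed on the previous slide: the period in "Dr." is sentence-internal, but the rule treats it as terminal.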

  10. Part of speech tagging • Criteria: • 1- syntactic distribution • 2- syntactic function • 3- morphological and syntactic classes that different parts of speech can be assigned to.

  11. Applications • Preprocessors • Large tagged text corpora (see Mark Davies Corpus) • Information technology applications: text indexing and retrieval (nouns and adjectives are better candidates for good indexing than adverbs, verbs, and pronouns)
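The indexing point can be illustrated with a small filter over text that has already been tagged. This is only a sketch: the tags are assigned by hand and assumed to follow the Penn Treebank convention (NN* for nouns, JJ* for adjectives), and the name `index_terms` is made up for the example.

```python
# Tag prefixes treated as good index candidates (assumed Penn Treebank style).
INDEX_TAG_PREFIXES = ("NN", "JJ")


def index_terms(tagged_tokens):
    """Keep only nouns and adjectives from a list of (word, tag) pairs."""
    return [word.lower() for word, tag in tagged_tokens
            if tag.startswith(INDEX_TAG_PREFIXES)]


# Hand-tagged example sentence: "She quickly read the long report"
tagged = [("She", "PRP"), ("quickly", "RB"), ("read", "VBD"),
          ("the", "DT"), ("long", "JJ"), ("report", "NN")]
print(index_terms(tagged))
```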

  12. Parsing • See the Stanford University parser online (http://nlp.stanford.edu:8080/parser/index.jsp) • Parsing uses a grammar to assign a syntactic analysis to a string of words. • Shallow parsing: partitioning the input into chunks and identifying the headword of each chunk.
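Shallow parsing can be sketched as a noun-group chunker over tagged input. The assumptions are loud ones: the input is already tagged by hand with Penn Treebank style tags, a chunk is an optional determiner, any adjectives, and one or more nouns, and the last noun is taken as the headword. The name `noun_chunks` is hypothetical.

```python
def noun_chunks(tagged_tokens):
    """Return (chunk words, headword) pairs for simple noun groups."""
    chunks = []
    current = []

    def flush():
        # Trim anything after the last noun, then record the chunk.
        last_noun = -1
        for i, (_, tag) in enumerate(current):
            if tag.startswith("NN"):
                last_noun = i
        if last_noun >= 0:
            words = [w for w, _ in current[:last_noun + 1]]
            chunks.append((words, current[last_noun][0]))
        current.clear()

    for word, tag in tagged_tokens:
        if tag == "DT":
            flush()                      # a determiner opens a new chunk
            current.append((word, tag))
        elif tag == "JJ" or tag.startswith("NN"):
            current.append((word, tag))
        else:
            flush()                      # any other tag closes the chunk
    flush()
    return chunks


tagged = [("the", "DT"), ("old", "JJ"), ("dog", "NN"),
          ("chased", "VBD"), ("a", "DT"), ("cat", "NN")]
for words, head in noun_chunks(tagged):
    print(words, "head:", head)
```

Unlike a full parse, nothing here relates the chunks to each other; the output is a flat partition with one headword per chunk.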

  13. Dependency parsing

  14. Context-free parsing (CFP) • Context-free grammars are important in linguistics for describing the structure of sentences and words in natural language, and in computer science for describing the structure of programming languages and other formal languages. (Wikipedia)
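As an illustration of parsing with a context-free grammar, the sketch below runs the CYK recognition algorithm over a toy grammar in Chomsky normal form. The grammar, the lexicon, and the function name `recognizes` are all invented for the example; the slides do not name a particular algorithm.

```python
from itertools import product

# Toy grammar in Chomsky normal form (an assumption for this example):
#   S -> NP VP,  NP -> Det N,  VP -> V NP
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICON = {
    "the": {"Det"}, "a": {"Det"},
    "dog": {"N"}, "cat": {"N"},
    "saw": {"V"},
}


def recognizes(words, start="S"):
    """Return True if the toy grammar derives the word sequence (CYK)."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two halves
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n - 1]


print(recognizes("the dog saw a cat".split()))
print(recognizes("dog the saw".split()))
```

This only answers whether a sentence is grammatical under the toy grammar; recovering the parse tree would require storing back-pointers in the table.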

  15. Thank you
