
Processing of Large Document Collections 1

Helena Ahonen-Myka

University of Helsinki


Organization of the course

  • Classes: 17.9., 22.10., 23.10., 26.11.

    • lectures (Helena Ahonen-Myka): 10-12,13-15

    • exercise sessions (Lili Aunimo): 15-17

    • required presence: 75%

  • Exercises are given (and returned) each week

    • required: 75%

  • Exam: 4.12. at 16-20, Auditorio

  • Points: Exam 30 pts, exercises 30 pts



  • 17.9. Character sets, preprocessing of text, text categorization

  • 22.10. Text summarization

  • 23.10. Text compression

  • 26.11. … to be announced…

  • self-study: basic transformations for text data, using linguistic tools, etc.


In this part...

  • Character sets

  • preprocessing of text

  • text categorization


1. Character sets

  • Abstract character vs. its graphical representation

  • abstract characters are grouped into alphabets

    • each alphabet forms the basis of the written form of a certain language or a set of languages


Character sets

  • For instance

    • for English:

      • uppercase letters A-Z

      • lowercase letters a-z

      • punctuation marks

      • digits 0-9

      • common symbols: +, =

    • ideographic symbols of Chinese and Japanese

    • phonetic letters of Western languages


Character sets

  • To represent text digitally, we need a mapping between (abstract) characters and values stored digitally (integers)

  • this mapping is a character set

  • the domain of the character set is called a character repertoire (= the alphabet for which the mapping is defined)


Character sets

  • For each character in the character repertoire, the character set defines a code value in the set of code points

  • in English:

    • 26 letters in both lower- and uppercase

    • ten digits + some punctuation marks

  • in Russian: cyrillic letters

  • both could use the same set of code points (if not a bilingual document)

  • in Japanese: could be over 6000 characters
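
As a small illustration of such a mapping (using Python's built-in ord/chr, i.e. Unicode code values; the characters chosen are arbitrary examples):

```python
# Map abstract characters to integer code values and back.
for ch in ["A", "я", "漢"]:           # Latin, Cyrillic and CJK examples
    code_value = ord(ch)              # character -> code value
    assert chr(code_value) == ch      # code value -> character
    print(repr(ch), "->", code_value)
```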


Character sets

  • The mere existence of a character set supports operations like editing and searching of text

  • usually character sets have some structure

    • e.g. integers within a small range

    • all lower-case (resp. upper-case) letters have code values that are consecutive integers (simplifies sorting etc.)

Character sets: standards

  • Character sets can be arbitrary, but in practice standardization is needed for interoperability (between computers, programs,...)

  • early standards were designed for English only, or for a small group of languages at a time


Character sets: standards


  • ISO-8859 (e.g. ISO Latin1)

  • Unicode

  • UTF-8, UTF-16



  • American Standard Code for Information Interchange

  • A seven bit code -> 128 code points

  • actually 95 printable characters only

    • code points 0-31 and 127 are assigned to control characters (mostly outdated)

  • ISO 646 (1972) version of ASCII incorporated several national variants (accented letters and currency symbols)



  • With 7 bits, the set of code points is too small for anything else than American English

  • solution:

    • 8 bits brings more code points (256)

    • ASCII character repertoire is mapped to the values 0-127

    • additional symbols are mapped to other values


Extended ASCII

  • Problem:

    • different manufacturers each developed their own 8-bit extensions to ASCII

      • different character repertoires -> translation between them is not always possible

    • also, 256 code values are not enough to represent all the alphabets -> different variants for different languages


ISO 8859

  • Standardization of 8-bit character sets

  • In the 80's, the multipart standard ISO 8859 was produced

  • defines a collection of 8-bit character sets, each designed for a group of languages

  • the first part: ISO 8859-1 (ISO Latin1)

    • covers most Western European languages

    • 0-127: identical to ASCII, 128-159 (mostly) unused, 96 code values for accented letters and symbols
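
For illustration, the same accented character encoded in ISO 8859-1 versus plain 7-bit ASCII (a minimal Python sketch; the example character is arbitrary):

```python
ch = "é"                                  # U+00E9, in the Latin1 repertoire
print(list(ch.encode("latin-1")))         # [233] - a single byte, code value 233
try:
    ch.encode("ascii")
except UnicodeEncodeError:
    print("not representable in 7-bit ASCII")
```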



  • 256 code points are not enough

    • for ideographically represented languages (Chinese, Japanese…)

    • for simultaneous use of several languages

  • solution: more than one byte for each code value

  • a 16-bit character set has 65,536 code points



  • 16-bit character set, i.e. 65,536 code points

  • not sufficient for all the characters required for Chinese, Japanese, and Korean scripts in distinct positions

    • CJK-consolidation: characters of these scripts are given the same value if they look the same



  • Code values for all the characters used to write contemporary ’major’ languages

    • also the classical forms of some languages

    • Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan

    • Chinese, Japanese, and Korean ideograms, and the Japanese and Korean phonetic and syllabic scripts



  • punctuation marks

  • technical and mathematical symbols

  • arrows

  • dingbats (pointing hands, stars, …)

  • both accented letters and separate diacritical marks (accents, tildes…) are included, with a mechanism for building composite characters

    • can also create problems: two characters that look the same may have different code values

    • ->normalization may be necessary



  • Code values for nearly 39,000 symbols are provided

  • some part is reserved for an expansion method (see later)

  • 6,400 code points are reserved for private use

    • they will never be assigned to any character by the standard, so they will not conflict with the standard


Unicode: encodings

  • Encoding is a mapping that transforms a code value into a sequence of bytes for storage and transmission

  • identity mapping for an 8-bit code?

    • it may be necessary to encode 8-bit characters as sequences of 7-bit (ASCII) characters

    • e.g. Quoted-Printable (QP)

      • code values 128-255 as a sequence of 3 bytes

      • byte 1: the ASCII code for ’=’, bytes 2 & 3: the hexadecimal digits of the value

      • 233 -> E9 -> =E9
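
A minimal sketch of the QP idea for a single 8-bit value (real Quoted-Printable also escapes '=' itself and has line-length rules, which are omitted here):

```python
def qp_encode_byte(value: int) -> str:
    """Encode one 8-bit code value as ASCII: '=' followed by two hex digits."""
    if value < 128:
        return chr(value)          # plain ASCII values pass through unchanged
    return "=%02X" % value         # e.g. 233 -> "=E9"

print(qp_encode_byte(233))         # =E9
```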


Unicode: encodings

  • UTF-8

    • ASCII code values are likely to be more common in most text than any other values

      • in UTF-8 encoding, ASCII characters are sent as themselves (high-order bit 0)

      • other characters (two-byte code values) are encoded using up to six bytes (high-order bit set to 1)
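
For illustration, Python's built-in UTF-8 codec shows the variable-length byte sequences (ASCII stays one byte; the example characters are arbitrary):

```python
for ch in ["A", "é", "€", "𝄞"]:            # 1-, 2-, 3- and 4-byte cases
    data = ch.encode("utf-8")
    print(ch, [hex(b) for b in data])
```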


Unicode: encodings

  • UTF-16: expansion method

    • two 16-bit values are combined into a 32-bit value -> over a million additional characters available
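
A small illustration of the expansion method: a character outside the 16-bit range becomes two 16-bit values (a surrogate pair) in UTF-16 (the big-endian codec is used here only to make the byte order explicit):

```python
ch = "𝄞"                                   # U+1D11E, beyond the 16-bit range
data = ch.encode("utf-16-be")
units = [data[i:i + 2].hex() for i in range(0, len(data), 2)]
print(units)                               # ['d834', 'dd1e'] - two 16-bit surrogate values
```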


2. Preprocessing of text

  • Text cannot be directly interpreted by many document processing applications

  • an indexing procedure is needed

    • mapping of a text into a compact representation of its content

  • which are the meaningful units of text?

  • how should these units be combined?

    • the combination is usually not considered ”important”


Vector model

  • A document is usually represented as a vector of term weights

  • the vector has as many dimensions as there are terms (or features) in the whole collection of documents

  • the weight represents how much the term contributes to the semantics of the document
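
A minimal bag-of-words sketch: each document becomes a vector with one dimension per term of the collection (binary weights here just to keep it short):

```python
docs = ["the cat sat on the mat", "the dog sat"]
vocabulary = sorted({w for d in docs for w in d.split()})   # all terms in the collection

def to_vector(text):
    words = set(text.split())
    return [1 if term in words else 0 for term in vocabulary]

print(vocabulary)
print([to_vector(d) for d in docs])
```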


Vector model

  • Different approaches:

    • different ways to understand what a term is

    • different ways to compute term weights



  • Words

    • typical choice

    • set of words, bag of words

  • phrases

    • syntactical phrases

    • statistical phrases

    • usefulness not yet known?



  • Some parts of the text are not considered as terms

    • very common words (function words):

      • articles, prepositions, conjunctions

    • numerals

  • these words are pruned

    • stopword list

  • other preprocessing possible

    • stemming, base words
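
A rough preprocessing sketch along these lines (the stopword list and the crude suffix stripping are only illustrative, not a real stemmer):

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "is"}    # illustrative function words

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())                  # drop digits and punctuation
    tokens = [t for t in tokens if t not in STOPWORDS]            # prune stopwords
    return [t[:-1] if t.endswith("s") else t for t in tokens]     # very crude "stemming"

print(preprocess("The cats sat on 2 mats."))                      # ['cat', 'sat', 'mat']
```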


Weights of terms

  • Weights usually range between 0 and 1

  • binary weights may be used

    • 1 denotes presence, 0 absence of the term in the document

  • often the tfidf function is used

    • higher weight, if the term occurs often in the document

    • lower weight, if the term occurs in many documents
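
A minimal sketch of one common tfidf variant (exact formulas differ between systems):

```python
import math

def tfidf(term, doc, docs):
    tf = doc.count(term)                                   # higher if the term occurs often in the document
    df = sum(1 for d in docs if term in d)                 # number of documents containing the term
    return tf * math.log(len(docs) / df) if df else 0.0    # lower if the term occurs in many documents

collection = [["wheat", "farm"], ["wheat", "export"], ["stock", "market"]]
print(tfidf("wheat", collection[0], collection))           # ~0.405; would be 0 for a term in every document
```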



  • Either the full text of the document or selected parts of it are indexed

  • e.g. in a patent categorization application

    • title, abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention

  • some parts may be considered more important

    • e.g. higher weight for the terms in the title


Dimensionality reduction

  • Many algorithms cannot handle high dimensionality of the term space (= large number of terms)

  • usually dimensionality reduction is applied

  • dimensionality reduction also reduces overfitting

    • a classifier that overfits the training data is good at re-classifying the training data but worse at classifying previously unseen data


Dimensionality reduction

  • Local dimensionality reduction

    • for each category, a reduced set of terms is chosen for classification under that category

    • hence, different subsets are used when working with different categories

  • global dimensionality reduction

    • a reduced set of terms is chosen for the classification under all categories


Dimensionality reduction

  • Dimensionality reduction by term selection

    • the terms of the reduced term set are a subset of the original term set

  • Dimensionality reduction by term extraction

    • the terms are not of the same type as the terms in the original term set, but are obtained by combinations or transformations of the original ones


Dimensionality reduction by term selection

  • Goal: select terms that, when used for document indexing, yield the highest effectiveness in the given application

  • wrapper approach

    • the reduced set of terms is found iteratively and tested with the application

  • filtering approach

    • keep the terms that receive the highest score according to a function that measures the ”importance” of the term for the task


Dimensionality reduction by term selection

  • Many functions available

    • document frequency: keep the high frequency terms

      • stopwords have been already removed

      • 50% of the words occur only once in the document collection

      • e.g. remove all terms occurring in at most 3 documents
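
A sketch of the document-frequency filter described above (the threshold of 3 documents is the example value from the slide):

```python
from collections import Counter

def select_terms(docs, max_rare_df=3):
    """Drop terms that occur in at most max_rare_df documents; keep the rest."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term for term, count in df.items() if count > max_rare_df}
```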


Dimensionality reduction by term selection

  • Information-theoretic term selection functions, e.g.

    • chi-square

    • information gain

    • mutual information

    • odds ratio

    • relevancy score


Dimensionality reduction by term extraction

  • Term extraction attempts to generate, from the original term set, a set of ”synthetic” terms that maximize effectiveness

  • due to polysemy, homonymy, and synonymy, the original terms may not be optimal dimensions for document content representation


Dimensionality reduction by term extraction

  • Term clustering

    • tries to group words with a high degree of pairwise semantic relatedness

    • groups (or their centroids) may be used as dimensions

  • latent semantic indexing

    • compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence
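
A sketch of latent semantic indexing as a truncated SVD of the term-document matrix, using numpy (the toy matrix and the choice of k are illustrative):

```python
import numpy as np

# Rows = terms, columns = documents; entries are term weights.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

k = 2                                       # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # documents in the k-dimensional latent space
print(doc_vectors.shape)                    # (3, 2)
```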


3. Text categorization

  • Text classification, topic classification/spotting/detection

  • problem setting:

    • assume: a predefined set of categories, a set of documents

    • label each document with one (or more) categories


Text categorization

  • Two major approaches:

    • knowledge engineering -> end of 80’s

      • manually defined set of rules encoding expert knowledge on how to classify documents under the given categories

    • machine learning, 90’s ->

      • an automatic text classifier is built by learning, from a set of preclassified documents, the characteristics of the categories


Text categorization

  • Let

    • D: a domain of documents

    • C = {c1, …, c|C|} : a set of predefined categories

    • T = true, F = false

  • The task is to approximate the unknown target function Φ′: D x C -> {T,F} by means of a function Φ: D x C -> {T,F}, such that the two functions ”coincide as much as possible”

  • function Φ′: how documents should be classified

  • function Φ: the classifier (hypothesis, model…)


We assume...

  • Categories are just symbolic labels

    • no additional knowledge of their meaning is available

  • No knowledge outside of the documents is available

    • all decisions have to be made on the basis of the knowledge extracted from the documents

    • metadata, e.g., publication date, document type, source etc. is not used


-> general methods

  • Methods do not depend on any application-dependent knowledge

    • in operational applications all kind of knowledge can be used

  • content-based decisions are necessarily subjective

    • it is often difficult to measure the effectiveness of the classifiers

    • even human classifiers do not always agree


Single-label vs. multi-label

  • Single-label text categorization

    • exactly 1 category must be assigned to each dj ∈ D

  • Multi-label text categorization

    • any number of categories may be assigned to the same dj ∈ D

  • Special case of single-label: binary

    • each dj must be assigned either to category ci or to its complement ¬ ci


Single-label, multi-label

  • The binary case (and, hence, the single-label case) is more general than the multi-label

    • an algorithm for binary classification can also be used for multi-label classification

    • the converse is not true
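
A sketch of why the binary case suffices: a multi-label classifier can be assembled from |C| independent binary classifiers (the toy keyword-spotting classifiers are purely illustrative):

```python
def multilabel_classify(document, binary_classifiers):
    """binary_classifiers: mapping from category to a function(document) -> True/False."""
    return {c for c, clf in binary_classifiers.items() if clf(document)}

classifiers = {
    "wheat":  lambda d: "wheat" in d,
    "sports": lambda d: "football" in d,
}
print(multilabel_classify("wheat exports rose", classifiers))    # {'wheat'}
```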


Category-pivoted vs. document-pivoted

  • Two different ways for using a text classifier

  • given a document, we want to find all the categories under which it should be filed -> document-pivoted categorization (DPC)

  • given a category, we want to find all the documents that should be filed under it -> category-pivoted categorization (CPC)


Category-pivoted vs. document-pivoted

  • The distinction is important, since the sets C and D might not be available in their entirety right from the start

  • DPC: suitable when documents become available at different moments in time, e.g. filtering e-mail

  • CPC: suitable when new categories are added after some documents have already been classified (and have to be reclassified)


Category-pivoted vs. document-pivoted

  • Some algorithms may apply to one style and not the other, but most techniques are capable of working in either mode


Hard categorization vs. ranking categorization

  • Hard categorization

    • the classifier answers T or F

  • Ranking categorization

    • given a document, the classifier might rank the categories according to their estimated appropriateness to the document

    • respectively, given a category, the classifier might rank the documents


Applications of text categorization

  • Automatic indexing for Boolean information retrieval systems

  • document organization

  • text filtering

  • word sense disambiguation

  • hierarchical categorization of Web pages


Automatic indexing for Boolean IR systems

  • In an information retrieval system, each document is assigned one or more keywords or keyphrases describing its content

    • keywords belong to a finite set called controlled dictionary

  • TC problem: the entries in a controlled dictionary are viewed as categories

    • k1 x  k2 keywords are assigned to each document

    • document-pivoted TC


Document organization

  • Indexing with a controlled vocabulary is an instance of the general problem of document base organization

  • e.g. a newspaper office has to classify the incoming ”classified” ads under categories such as Personals, Cars for Sale, Real Estate etc.

  • organization of patents, filing of newspaper articles...


Text filtering

  • Classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer

  • e.g. newsfeed

    • producer: news agency; consumer: newspaper

    • the filtering system should block the delivery of documents the consumer is likely not interested in


Word sense disambiguation

  • Given the occurrence in a text of an ambiguous word, find the sense of this particular word occurrence

  • E.g.

    • Bank of England

    • the bank of river Thames

    • ”Last week I borrowed some money from the bank.”


Word sense disambiguation

  • Indexing by word senses rather than by words

  • text categorization

    • documents: word occurrence contexts

    • categories: word senses

  • also resolving other natural language ambiguities

    • context-sensitive spelling correction, part of speech tagging, prepositional phrase attachment, word choice selection in machine translation


Hierarchical categorization of Web pages

  • E.g. Yahoo!-like hierarchical Web catalogues

  • typically, each category should be populated by ”a few” documents

  • new categories are added, obsolete ones removed

  • usage of link structure in classification

  • usage of the hierarchical structure


Knowledge engineering approach

  • In the 80's: knowledge engineering techniques

    • manually building expert systems capable of making text categorization decisions

    • expert system: consists of a set of rules

      • wheat & farm -> wheat

      • wheat & commodity -> wheat

      • bushels & export -> wheat

      • wheat & winter & ~soft -> wheat
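
One way to express such a hand-written rule set as code (these are just the WHEAT example rules from the slide):

```python
def wheat_rules(words):
    """Manually defined rules: file the document under WHEAT if any rule fires."""
    return (
        {"wheat", "farm"} <= words or
        {"wheat", "commodity"} <= words or
        {"bushels", "export"} <= words or
        ("wheat" in words and "winter" in words and "soft" not in words)
    )

print(wheat_rules({"wheat", "winter", "harvest"}))    # True
```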


Knowledge engineering approach

  • Drawback: rules must be manually defined by a knowledge engineer with the aid of a domain expert

    • any update necessitates again human intervention

    • totally domain dependent

    • -> expensive and slow process


Machine learning approach

  • A general inductive process (learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci or ¬ci by a domain expert

  • from these characteristics the learner gleans the characteristics that a new unseen document should have in order to be classified under ci

  • supervised learning (= supervised by the knowledge of the training documents)


Machine learning approach

  • The learner is domain independent

    • usually available ’off-the-shelf’

  • the inductive process is easily repeated, if the set of categories changes

  • manually classified documents often already available

    • manual process may exist

  • if not, it is still easier to manually classify a set of documents than to build and tune a set of rules


Training set, test set, validation set

  • Initial corpus of manually classified documents

    • let dj belong to the initial corpus

    • for each pair <dj, ci> it is known if dj should be filed under ci

  • positive examples, negative examples of a category


Training set, test set, validation set

  • The initial corpus is divided into two sets

    • a training (and validation) set

    • a test set

  • the training set is used to build the classifier

  • the test set is used for testing the effectiveness of the classifiers

    • each document is fed to the classifier and the decision is compared to the manual category


Training set, test set, validation set

  • The documents in the test set are not used in the construction of the classifier

  • alternative: k-fold cross-validation

    • k different classifiers are built by partitioning the initial corpus into k disjoint sets and then iteratively applying the train-and-test approach, where k-1 sets form the training set and the remaining set is used as the test set

    • individual results are then averaged
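
A sketch of k-fold cross-validation (train_and_test is a hypothetical function that builds a classifier on the training part and returns its effectiveness on the test part):

```python
def cross_validate(corpus, k, train_and_test):
    """Partition the corpus into k disjoint folds; each fold serves as the test set once."""
    folds = [corpus[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        scores.append(train_and_test(train, test))
    return sum(scores) / k              # average the individual results
```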


Training set, test set, validation set

  • Training set can be split to two parts

  • one part is used for optimising parameters

    • test which values of parameters yield the best effectiveness

  • test set and validation set must be kept separate


Inductive construction of classifiers

  • A ranking classifier for a category ci

    • definition of a function that, given a document, returns a categorization status value for it, i.e. a number between 0 and 1

    • documents are ranked according to their categorization status value


Inductive construction of classifiers

  • A hard classifier for a category

    • definition of a function that returns true or false, or

    • definition of a function that returns a value between 0 and 1, followed by a definition of a threshold

      • if the value is higher than the threshold -> true

      • otherwise -> false
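
A sketch of turning a ranking classifier into a hard one by thresholding (csv_function and the threshold value are hypothetical placeholders):

```python
def hard_classifier(csv_function, threshold=0.5):
    """Wrap a categorization status value function (0..1) into a True/False decision."""
    return lambda document: csv_function(document) >= threshold
```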



  • Probabilistic classifiers (Naïve Bayes)

  • decision tree classifiers

  • decision rule classifiers

  • regression methods

  • on-line methods

  • neural networks

  • example-based classifiers (k-NN)

  • support vector machines


Rocchio method

  • Linear classifier method

  • for each category, an explicit profile (or prototypical document) is constructed

    • benefit: profile is understandable even for humans


Rocchio method

  • A classifier is a vector of the same dimension as the documents

  • weights: computed from the term weights of the positive and negative training examples of the category (standard Rocchio formula; see the sketch below)

  • classifying: cosine similarity of the category vector and the document vector
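
The weight formula dropped from the slide is, in the standard Rocchio formulation, a weighted difference of the average term weights in the positive and the negative examples of the category; a minimal sketch under that assumption (beta and gamma are the usual control parameters, and the toy vectors are illustrative):

```python
import math

def rocchio_profile(pos, neg, beta=16.0, gamma=4.0):
    """pos, neg: document vectors (lists of term weights) for and against the category."""
    profile = []
    for k in range(len(pos[0])):
        p = sum(d[k] for d in pos) / len(pos)
        n = sum(d[k] for d in neg) / len(neg) if neg else 0.0
        profile.append(beta * p - gamma * n)
    return profile

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

profile = rocchio_profile(pos=[[1, 0, 1]], neg=[[0, 1, 0]])
print(cosine([1, 0, 0], profile))       # similarity of a new document to the category profile
```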
