Processing of Large Document Collections 1

Helena Ahonen-Myka

University of Helsinki

Organization of the course
  • Classes: 17.9., 22.10., 23.10., 26.11.
    • lectures (Helena Ahonen-Myka): 10-12,13-15
    • exercise sessions (Lili Aunimo): 15-17
    • required presence: 75%
  • Exercises are given (and returned) each week
    • required: 75%
  • Exam: 4.12. at 16-20, Auditorio
  • Points: Exam 30 pts, exercises 30 pts
  • 17.9. Character sets, preprocessing of text, text categorization
  • 22.10. Text summarization
  • 23.10. Text compression
  • 26.11. … to be announced…
  • self-study: basic transformations for text data, using linguistic tools, etc.
In this part...
  • Character sets
  • preprocessing of text
  • text categorization
1. Character sets
  • Abstract character vs. its graphical representation
  • abstract characters are grouped into alphabets
    • each alphabet forms the basis of the written form of a certain language or a set of languages
Character sets
  • For instance
    • for English:
      • uppercase letters A-Z
      • lowercase letters a-z
      • punctuation marks
      • digits 0-9
      • common symbols: +, =
    • ideographic symbols of Chinese and Japanese
    • phonetic letters of Western languages
Character sets
  • To represent text digitally, we need a mapping between (abstract) characters and values stored digitally (integers)
  • this mapping is a character set
  • the domain of the character set is called a character repertoire (= the alphabet for which the mapping is defined)
Character sets
  • For each character in the character repertoire, the character set defines a code value in the set of code points
  • in English:
    • 26 letters in both lower- and uppercase
    • ten digits + some punctuation marks
  • in Russian: cyrillic letters
  • both could use the same set of code points (if not a bilingual document)
  • in Japanese: could be over 6000 characters
Character sets
  • The mere existence of a character set supports operations like editing and searching of text
  • usually character sets have some structure
    • e.g. integers within a small range
    • all lower-case (resp. upper-case) letters have code values that are consecutive integers (simplifies sorting etc.)
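The structure described above can be checked directly. The following is a minimal sketch using Python's built-in `ord`, which returns the Unicode code value of a character (identical to the ASCII value for these letters). The helper name `code_values` is hypothetical and chosen only for this illustration.

```python
import string

def code_values(chars):
    """Map each character to its integer code value."""
    return [ord(ch) for ch in chars]

lower = code_values(string.ascii_lowercase)
upper = code_values(string.ascii_uppercase)

# Consecutive integers: each letter's code value is one more than the previous one.
lower_is_consecutive = all(b - a == 1 for a, b in zip(lower, lower[1:]))
upper_is_consecutive = all(b - a == 1 for a, b in zip(upper, upper[1:]))

# Because of this, sorting by code value gives alphabetical order within one case.
sorted_words = sorted(["pear", "apple", "fig"])

print(lower[0], lower[-1])
print(lower_is_consecutive, upper_is_consecutive)
print(sorted_words)
```

Note that sorting by code value places all upper-case letters before all lower-case ones, so mixed-case text still needs extra care.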
Character sets: standards
  • Character sets can be arbitrary, but in practice standardization is needed for interoperability (between computers, programs,...)
  • early standards were designed for English only, or for a small group of languages at a time
Character sets: standards
  • ISO-8859 (e.g. ISO Latin1)
  • Unicode
  • UTF-8, UTF-16
ASCII
  • American Standard Code for Information Interchange
  • A seven-bit code -> 128 code points
  • actually 95 printable characters only
    • code points 0-31 and 127 are assigned to control characters (mostly outdated)
  • ISO 646 (1972) version of ASCII incorporated several national variants (accented letters and currency symbols)
  • With 7 bits, the set of code points is too small for anything other than American English
  • solution:
    • 8 bits brings more code points (256)
    • ASCII character repertoire is mapped to the values 0-127
    • additional symbols are mapped to other values
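As a concrete instance of such an 8-bit extension, the sketch below uses ISO 8859-1 (Latin-1), which Python exposes as the `latin-1` codec. The choice of Latin-1 and the helper name `fits_in_ascii` are assumptions made only for illustration.

```python
text = "café"

# ISO 8859-1 (Latin-1) serves here as one example of an 8-bit character set.
latin1_bytes = text.encode("latin-1")

# The ASCII part of the repertoire keeps its 7-bit values (0-127).
ascii_part = "caf".encode("ascii")
ascii_values_unchanged = latin1_bytes[:3] == ascii_part

# The accented letter is mapped to a value above 127.
e_acute_value = latin1_bytes[3]

def fits_in_ascii(s):
    """Return True if every character of s has a 7-bit ASCII code value."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

print(list(latin1_bytes))
print(fits_in_ascii("cafe"), fits_in_ascii(text))
```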
Extended ASCII
  • Problem:
    • different manufacturers each developed their own 8-bit extensions to ASCII
      • different character repertoires -> translation between them is not always possible
    • also, 256 code values are not enough to represent all the alphabets -> different variants for different languages
ISO 8859
  • Standardization of 8-bit character sets
  • In the 1980s, the multipart standard ISO 8859 was produced
  • defines a collection of 8-bit character sets, each designed for a group of languages
  • the first part: ISO 8859-1 (ISO Latin1)
    • covers most Western European languages
    • 0-127: identical to ASCII, 128-159: (mostly) unused, 160-255: 96 code values for accented letters and symbols
  • 256 code points are not enough
    • for ideographically represented languages (Chinese, Japanese…)
    • for simultaneous use of several languages
  • solution: more than one byte for each code value
  • a 16-bit character set has 65,536 code points
Unicode
  • 16-bit character set, i.e. 65,536 code points
  • not sufficient for all the characters required for Chinese, Japanese, and Korean scripts in distinct positions
    • CJK-consolidation: characters of these scripts are given the same value if they look the same
  • Code values for all the characters used to write contemporary ’major’ languages
    • also the classical forms of some languages
    • Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan
    • Chinese, Japanese, and Korean ideograms, and the Japanese and Korean phonetic and syllabic scripts
  • punctuation marks
  • technical and mathematical symbols
  • arrows
  • dingbats (pointing hands, stars, …)
  • both accented letters and separate diacritical marks (accents, tildes…) are included, with a mechanism for building composite characters
    • can also create problems: two characters that look the same may have different code values
    • ->normalization may be necessary
  • Code values for nearly 39,000 symbols are provided
  • some part is reserved for an expansion method (see later)
  • 6,400 code points are reserved for private use
    • they will never be assigned to any character by the standard, so they will not conflict with the standard
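The remark above about composite characters and normalization can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch; the example character (é) and the variable names are chosen only for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # é stored as a single accented letter
combined = "e\u0301"     # e followed by a combining acute accent

# The two strings look the same on screen but have different code values.
looks_same_but_differs = precomposed != combined

# NFC composes the pair into the single code value; NFD decomposes it again.
nfc_equal = unicodedata.normalize("NFC", combined) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == combined

print(len(precomposed), len(combined))
print(looks_same_but_differs, nfc_equal, nfd_equal)
```

Normalizing both the documents and the queries to the same form before searching or comparing avoids missing such visually identical strings.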
Unicode: encodings
  • Encoding is a mapping that transforms a code value into a sequence of bytes for storage and transmission
  • identity mapping for an 8-bit code?
    • it may be necessary to encode 8-bit characters as sequences of 7-bit (ASCII) characters
    • e.g. Quoted-Printable (QP)
      • code values 128-255 as a sequence of 3 bytes
      • 1: ASCII code for ’=’, 2 & 3: hexadecimal digits of the value
      • 233 -> E9 -> =E9
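The Quoted-Printable rule above can be sketched in a few lines. This is a simplified illustration, not a complete implementation: the function name `qp_encode` is hypothetical, and the line-length limits and control-character handling of real Quoted-Printable are left out (Python's standard `quopri` module covers the full encoding).

```python
def qp_encode(data: bytes) -> str:
    """Encode bytes so that values 128-255 become '=' plus two hex digits.

    Simplified sketch: besides the 8-bit values, only the '=' character
    itself is escaped, so that the output stays unambiguous.
    """
    out = []
    for value in data:
        if value >= 128 or value == ord("="):
            out.append("=%02X" % value)
        else:
            out.append(chr(value))
    return "".join(out)

# Code value 233 (é in ISO 8859-1) becomes the three ASCII characters =E9.
encoded = qp_encode("café".encode("latin-1"))
print(encoded)
```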
Unicode: encodings
  • UTF-8
    • ASCII code values are likely to be more common in most text than any other values
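      • in UTF-8 encoding ASCII characters are sent as themselves (high-order bit 0)
      • other characters (two-byte code values) are encoded as multi-byte sequences in which every byte has the high-order bit set to 1 (up to six bytes in the original design, at most four in current UTF-8)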
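A quick way to see this behaviour is to inspect the bytes produced by Python's built-in `utf-8` codec. The helper name `utf8_summary` and the sample characters are assumptions made for this sketch.

```python
def utf8_summary(ch):
    """Return the UTF-8 bytes of one character and the high-order bit of each byte."""
    encoded = ch.encode("utf-8")
    high_bits = [byte >> 7 for byte in encoded]
    return encoded, high_bits

ascii_bytes, ascii_bits = utf8_summary("A")   # ASCII: one byte, high bit 0
latin_bytes, latin_bits = utf8_summary("é")   # two bytes, high bits 1
euro_bytes, euro_bits = utf8_summary("€")     # three bytes, high bits 1

for ch in "Aé€":
    print(ch, utf8_summary(ch))
```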
Unicode: encodings
  • UTF-16: expansion method
    • two 16-bit values (a surrogate pair) are combined to address code values beyond the 16-bit range -> about a million additional characters available
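The expansion method can be sketched as follows, assuming the surrogate ranges defined by the Unicode standard (0xD800-0xDBFF for the first value, 0xDC00-0xDFFF for the second). The function name `to_surrogate_pair` and the sample code value are hypothetical choices for this illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code value above 0xFFFF into two 16-bit values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)

# Cross-check against Python's own UTF-16 encoder (big-endian, no byte order mark).
expected = chr(0x1F600).encode("utf-16-be")
matches = expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")

print(hex(high), hex(low), matches)
```

Each of the two values carries 10 bits of payload, so the pair addresses 2^20 = 1,048,576 additional code values, which is the "million characters" mentioned above.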
2. Preprocessing of text
  • Text cannot be directly interpreted by many document processing applications
  • an indexing procedure is needed
    • mapping of a text into a compact representation of its content
  • which are the meaningful units of text?
  • how should these units be combined?
    • the way the units are combined is usually treated as not ”important”
Vector model
  • A document is usually represented as a vector of term weights
  • the vector has as many dimensions as there are terms (or features) in the whole collection of documents
  • the weight represents how much the term contributes to the semantics of the document
Vector model
  • Different approaches:
    • different ways to understand what a term is
    • different ways to compute term weights
  • Words
    • typical choice
    • set of words, bag of words
  • phrases
    • syntactical phrases
    • statistical phrases
    • usefulness not yet known?
  • Part of the text is not considered as terms
    • very common words (function words):
      • articles, prepositions, conjunctions
    • numerals
  • these words are pruned
    • stopword list
  • other preprocessing possible
    • stemming, base words
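The pruning steps above can be sketched as a small indexing function. This is only an illustration: the stopword list is a tiny hypothetical sample, the tokenizer handles plain ASCII words only, and stemming is not included.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is"}

def index_terms(text):
    """Lowercase the text, split it into words, and prune stopwords and numerals."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and not t.isdigit()]

def bag_of_words(text):
    """Bag of words: each remaining term with its number of occurrences."""
    return Counter(index_terms(text))

def set_of_words(text):
    """Set of words: only presence or absence of each term is recorded."""
    return set(index_terms(text))

doc = "The processing of 2 large collections and the compression of text"
terms = index_terms(doc)
bag = bag_of_words(doc)
print(terms)
```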
Weights of terms
  • Weights usually range between 0 and 1
  • binary weights may be used
    • 1 denotes presence, 0 absence of the term in the document
  • often the tf-idf function is used
    • higher weight, if the term occurs often in the document
    • lower weight, if the term occurs in many documents
  • Either the full text of the document or selected parts of it are indexed
  • e.g. in a patent categorization application
    • title, abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention
  • some parts may be considered more important
    • e.g. higher weight for the terms in the title
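The slides do not give a formula for tf-idf. One common variant, assumed here, is tfidf(t, d) = tf(t, d) · log(N / df(t)), where tf(t, d) is the number of occurrences of term t in document d, df(t) is the number of documents containing t, and N is the number of documents. Dividing each vector by its length (cosine normalization, also an assumption) keeps the weights between 0 and 1. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of term lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Cosine normalization keeps every weight between 0 and 1.
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors

docs = [
    ["text", "compression", "compression"],
    ["text", "summarization"],
    ["text", "categorization", "categorization", "categorization"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In this toy collection "text" occurs in every document, so its weight is 0 everywhere, while "compression" occurs often in one document and nowhere else, so it receives the highest weight there.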
Dimensionality reduction
  • Many algorithms cannot handle high dimensionality of the term space (= large number of terms)
  • usually dimensionality reduction is applied
  • dimensionality reduction also reduces overfitting
    • a classifier that overfits the training data is good at re-classifying the training data but worse at classifying previously unseen data
Dimensionality reduction
  • Local dimensionality reduction
    • for each category, a reduced set of terms is chosen for classification under that category
    • hence, different subsets are used when working with different categories
  • global dimensionality reduction
    • a reduced set of terms is chosen for the classification under all categories
Dimensionality reduction
  • Dimensionality reduction by term selection
    • the terms of the reduced term set are a subset of the original term set
  • Dimensionality reduction by term extraction
    • the terms are not of the same type as the terms in the original term set, but are obtained by combinations and transformations of the original ones
Dimensionality reduction by term selection
  • Goal: select the terms that, when used for document indexing, yield the highest effectiveness in the given application
  • wrapper approach
    • the reduced set of terms is found iteratively and tested with the application
  • filtering approach
    • keep the terms that receive the highest score according to a function that measures the ”importance” of the term for the task
Dimensionality reduction by term selection
  • Many functions available
    • document frequency: keep the high frequency terms
      • stopwords have already been removed
      • 50% of the words occur only once in the document collection
      • e.g. remove all terms occurring in at most 3 documents
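Term selection by document frequency can be sketched as below. The function name `select_by_document_frequency`, the parameter `min_df`, and the toy documents are assumptions made for this illustration.

```python
from collections import Counter

def select_by_document_frequency(docs, min_df):
    """Keep the terms that occur in at least min_df documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term for term, df in doc_freq.items() if df >= min_df}

docs = [
    ["wheat", "farm", "export"],
    ["wheat", "bushels", "export"],
    ["wheat", "winter"],
    ["export", "commodity", "wheat"],
]

# "Remove all terms occurring in at most 3 documents" corresponds to min_df = 4.
kept = select_by_document_frequency(docs, min_df=4)
print(kept)
```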
Dimensionality reduction by term selection
  • Information-theoretic term selection functions, e.g.
    • chi-square
    • information gain
    • mutual information
    • odds ratio
    • relevancy score
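As an example of one of these functions, the sketch below computes chi-square for a single term and category. The slides give no formula, so the usual form for a 2×2 contingency table is assumed: with a = documents in the category that contain the term, b = documents outside the category that contain it, c = documents in the category without it, d = documents outside the category without it, and N = a + b + c + d, the score is N(ad − cb)² / ((a + c)(b + d)(a + b)(c + d)). The function name `chi_square` is hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a category from a 2x2 contingency table.

    a: documents in the category containing the term
    b: documents outside the category containing the term
    c: documents in the category not containing the term
    d: documents outside the category not containing the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator

# A term that occurs in exactly the documents of the category scores high ...
perfect = chi_square(10, 0, 0, 10)
# ... while a term spread evenly over both groups scores zero.
independent = chi_square(5, 5, 5, 5)
print(perfect, independent)
```

In the filtering approach, such a score is computed for every term and only the highest-scoring terms are kept.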
Dimensionality reduction by term extraction
  • Term extraction attempts to generate, from the original term set, a set of ”synthetic” terms that maximize effectiveness
  • due to polysemy, homonymy, and synonymy, the original terms may not be optimal dimensions for document content representation
Dimensionality reduction by term extraction
  • Term clustering
    • tries to group words with a high degree of pairwise semantic relatedness
    • groups (or their centroids) may be used as dimensions
  • latent semantic indexing
    • compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence
3. Text categorization
  • Text classification, topic classification/spotting/detection
  • problem setting:
    • assume: a predefined set of categories, a set of documents
    • label each document with one (or more) categories
Text categorization
  • Two major approaches:
    • knowledge engineering (until the end of the 1980s)
      • manually defined set of rules encoding expert knowledge on how to classify documents under the given categories
    • machine learning (from the 1990s onward)
      • an automatic text classifier is built by learning, from a set of preclassified documents, the characteristics of the categories
Text categorization
  • Let
    • D: a domain of documents
    • C = {c1, …, c|C|} : a set of predefined categories
    • T = true, F = false
  • The task is to approximate the unknown target function Φ′: D × C -> {T, F} by means of a function Φ: D × C -> {T, F}, such that the two functions ”coincide as much as possible”
  • function Φ′: how documents should be classified
  • function Φ: classifier (hypothesis, model…)
We assume...
  • Categories are just symbolic labels
    • no additional knowledge of their meaning is available
  • No knowledge outside of the documents is available
    • all decisions have to be made on the basis of the knowledge extracted from the documents
    • metadata, e.g., publication date, document type, source etc. is not used
-> general methods
  • Methods do not depend on any application-dependent knowledge
    • in operational applications all kinds of knowledge can be used
  • content-based decisions are necessarily subjective
    • it is often difficult to measure the effectiveness of the classifiers
    • even human classifiers do not always agree
Single-label vs. multi-label
  • Single-label text categorization
    • exactly 1 category must be assigned to each dj ∈ D
  • Multi-label text categorization
    • any number of categories may be assigned to the same dj ∈ D
  • Special case of single-label: binary
    • each dj must be assigned either to category ci or to its complement ¬ ci
Single-label, multi-label
  • The binary case (and, hence, the single-label case) is more general than the multi-label case
    • an algorithm for binary classification can also be used for multi-label classification
    • the converse is not true
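The first point can be sketched as follows: a multi-label problem over |C| categories is treated as |C| independent binary problems, one per category. The function name `multi_label_classify` and the keyword-based toy classifiers are hypothetical and stand in for any binary classification algorithm.

```python
def multi_label_classify(document, binary_classifiers):
    """Run one independent binary classifier per category.

    binary_classifiers maps a category name to a function that returns
    True if the document belongs to that category and False otherwise.
    """
    return {c for c, decide in binary_classifiers.items() if decide(document)}

# Hypothetical keyword-based binary classifiers, for illustration only.
classifiers = {
    "wheat": lambda doc: "wheat" in doc,
    "trade": lambda doc: "export" in doc or "import" in doc,
    "sports": lambda doc: "football" in doc,
}

labels = multi_label_classify({"wheat", "export", "bushels"}, classifiers)
print(sorted(labels))
```

This decomposition assumes that the categories can be decided independently of each other.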
Category-pivoted vs. document-pivoted
  • Two different ways for using a text classifier
  • given a document, we want to find all the categories, under which it should be filed -> document-pivoted categorization (DPC)
  • given a category, we want to find all the documents that should be filed under it -> category-pivoted categorization (CPC)
Category-pivoted vs. document-pivoted
  • The distinction is important, since the sets C and D might not be available in their entirety right from the start
  • DPC: suitable when documents become available at different moments in time, e.g. filtering e-mail
  • CPC: suitable when new categories are added after some documents have already been classified (and have to be reclassified)
Category-pivoted vs. document-pivoted
  • Some algorithms may apply to one style and not the other, but most techniques are capable of working in either mode
Hard-categorization vs. ranking categorization
  • Hard categorization
    • the classifier answers T or F
  • Ranking categorization
    • given a document, the classifier might rank the categories according to their estimated appropriateness to the document
    • respectively, given a category, the classifier might rank the documents
Applications of text categorization
  • Automatic indexing for Boolean information retrieval systems
  • document organization
  • text filtering
  • word sense disambiguation
  • hierarchical categorization of Web pages
Automatic indexing for Boolean IR systems
  • In an information retrieval system, each document is assigned one or more keywords or keyphrases describing its content
    • keywords belong to a finite set called controlled dictionary
  • TC problem: the entries in a controlled dictionary are viewed as categories
    • k1 x  k2 keywords are assigned to each document
    • document-pivoted TC
Document organization
  • Indexing with a controlled vocabulary is an instance of the general problem of document base organization
  • e.g. a newspaper office has to classify the incoming ”classified” ads under categories such as Personals, Cars for Sale, Real Estate etc.
  • organization of patents, filing of newspaper articles...
Text filtering
  • Classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer
  • e.g. newsfeed
    • producer: news agency; consumer: newspaper
    • the filtering system should block the delivery of documents the consumer is likely not interested in
Word sense disambiguation
  • Given the occurrence in a text of an ambiguous word, find the sense of this particular word occurrence
  • E.g.
    • Bank of England
    • the bank of the river Thames
    • ”Last week I borrowed some money from the bank.”
Word sense disambiguation
  • Indexing by word senses rather than by words
  • text categorization
    • documents: word occurrence contexts
    • categories: word senses
  • also resolving other natural language ambiguities
    • context-sensitive spelling correction, part of speech tagging, prepositional phrase attachment, word choice selection in machine translation
Hierarchical categorization of Web pages
  • E.g. Yahoo-like hierarchical web catalogues
  • typically, each category should be populated by ”a few” documents
  • new categories are added, obsolete ones removed
  • usage of link structure in classification
  • usage of the hierarchical structure
Knowledge engineering approach
  • In the 1980s: knowledge engineering techniques
    • manually building expert systems capable of making text categorization decisions
    • expert system: consists of a set of rules
      • wheat & farm -> wheat
      • wheat & commodity -> wheat
      • bushels & export -> wheat
      • wheat & winter & ~soft -> wheat
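The four rules above can be written down directly as data plus a small matcher. This is a minimal sketch of such a rule set, not a reconstruction of any particular expert system; the names `WHEAT_RULES` and `is_wheat` are hypothetical.

```python
# Each rule is (required terms, forbidden terms); a matching rule assigns "wheat".
WHEAT_RULES = [
    ({"wheat", "farm"}, set()),
    ({"wheat", "commodity"}, set()),
    ({"bushels", "export"}, set()),
    ({"wheat", "winter"}, {"soft"}),   # wheat & winter & ~soft
]

def is_wheat(document_terms):
    """Return True if at least one rule fires for the given document terms."""
    terms = set(document_terms)
    return any(required <= terms and not (forbidden & terms)
               for required, forbidden in WHEAT_RULES)

print(is_wheat(["wheat", "farm", "prices"]))
print(is_wheat(["wheat", "winter", "soft"]))
```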
Knowledge engineering approach
  • Drawback: rules must be manually defined by a knowledge engineer with the aid of a domain expert
    • any update again requires human intervention
    • totally domain dependent
    • -> expensive and slow process
Machine learning approach
  • A general inductive process (learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci or ¬ci by a domain expert
  • from these characteristics the learner gleans the characteristics that a new unseen document should have in order to be classified under ci
  • supervised learning (= supervised by the knowledge of the training documents)
Machine learning approach
  • The learner is domain independent
    • usually available ’off-the-shelf’
  • the inductive process is easily repeated, if the set of categories changes
  • manually classified documents often already available
    • manual process may exist
  • if not, it is still easier to manually classify a set of documents than to build and tune a set of rules
Training set, test set, validation set
  • Initial corpus of manually classified documents
    • let dj belong to the initial corpus
    • for each pair <dj, ci> it is known if dj should be filed under ci
  • positive examples, negative examples of a category
Training set, test set, validation set
  • The initial corpus is divided into two sets
    • a training (and validation) set
    • a test set
  • the training set is used to build the classifier
  • the test set is used for testing the effectiveness of the classifiers
    • each document is fed to the classifier and the decision is compared to the manual category
Training set, test set, validation set
  • The documents in the test set are not used in the construction of the classifier
  • alternative: k-fold cross-validation
    • k different classifiers are built by partitioning the initial corpus into k disjoint sets and then iteratively applying the train-and-test approach, where each time k-1 sets form the training set and the remaining set is used as the test set
    • individual results are then averaged
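The k-fold procedure can be sketched as follows. The function names, the majority-label learner, and the toy corpus are all assumptions made for illustration; any learner and any effectiveness measure could be plugged in through `train` and `evaluate`.

```python
def k_fold_splits(corpus, k):
    """Yield (training_set, test_set) pairs for k-fold cross-validation."""
    folds = [corpus[i::k] for i in range(k)]   # k disjoint sets
    for i in range(k):
        test_set = folds[i]
        training_set = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield training_set, test_set

def cross_validate(corpus, k, train, evaluate):
    """Build k classifiers, test each on its held-out set, and average the results."""
    scores = []
    for training_set, test_set in k_fold_splits(corpus, k):
        classifier = train(training_set)
        scores.append(evaluate(classifier, test_set))
    return sum(scores) / len(scores)

# Hypothetical learner: always predicts the majority label of its training set.
def train_majority(training_set):
    labels = [label for _, label in training_set]
    return max(set(labels), key=labels.count)

def accuracy(majority_label, test_set):
    return sum(label == majority_label for _, label in test_set) / len(test_set)

# Toy corpus of (document id, label) pairs.
corpus = [("d%d" % i, i % 3 == 0) for i in range(12)]
average_accuracy = cross_validate(corpus, k=4, train=train_majority, evaluate=accuracy)
print(average_accuracy)
```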
Training set, test set, validation set
  • The training set can be split into two parts
  • one part (the validation set) is used for optimising parameters
    • test which values of parameters yield the best effectiveness
  • test set and validation set must be kept separate
Inductive construction of classifiers
  • A ranking classifier for a category ci
    • definition of a function that, given a document, returns a categorization status value for it, i.e. a number between 0 and 1
    • documents are ranked according to their categorization status value
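A ranking classifier of this kind can be sketched in a few lines. The scoring function `wheat_csv` below is a hypothetical stand-in for a learned model; it simply returns the fraction of words in the document that belong to a small hand-picked word list.

```python
def rank_documents(documents, csv):
    """Rank documents for one category by categorization status value.

    csv is a function mapping a document to a number between 0 and 1.
    """
    return sorted(documents, key=csv, reverse=True)

# Hypothetical scoring function: fraction of "wheat-related" words in the document.
WHEAT_WORDS = {"wheat", "bushels", "farm"}

def wheat_csv(document):
    words = document.split()
    return sum(w in WHEAT_WORDS for w in words) / len(words) if words else 0.0

docs = ["wheat farm export", "football match", "wheat bushels farm"]
ranking = rank_documents(docs, wheat_csv)
print(ranking)
```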
Inductive construction of classifiers
  • A hard classifier for a category
    • definition of a function that returns true or false, or
    • definition of a function that returns a value between 0 and 1, followed by a definition of a threshold
      • if the value is higher than the threshold -> true
      • otherwise -> false
  • Probabilistic classifiers (Naïve Bayes)
  • decision tree classifiers
  • decision rule classifiers
  • regression methods
  • on-line methods
  • neural networks
  • example-based classifiers (k-NN)
  • support vector machines
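Whichever of these learners produces the scores, the threshold-based definition of a hard classifier given above can be sketched as a thin wrapper around a scoring function. The names `make_hard_classifier` and `toy_csv`, and the threshold value 0.25, are assumptions made for illustration.

```python
def make_hard_classifier(csv, threshold):
    """Turn a scoring function (values between 0 and 1) into a True/False classifier."""
    def classify(document):
        # Higher than the threshold -> True, otherwise -> False.
        return csv(document) > threshold
    return classify

# Hypothetical scoring function standing in for any learned model.
def toy_csv(document):
    words = document.split()
    return words.count("wheat") / len(words) if words else 0.0

is_wheat = make_hard_classifier(toy_csv, threshold=0.25)
print(is_wheat("wheat wheat export"), is_wheat("wheat and other grains"))
```

In practice the threshold is one of the parameters that would be tuned on a validation set.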
Rocchio method
  • Linear classifier method
  • for each category, an explicit profile (or prototypical document) is constructed
    • benefit: profile is understandable even for humans
Rocchio method
  • A classifier is a vector of the same dimension as the documents
  • weights:
  • classifying: cosine similarity of the category vector and the document vector
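The weight formula is missing from this transcript. The sketch below therefore assumes the formulation commonly given for Rocchio in the text categorization literature: the weight of a term in the category profile is beta times its average weight in the positive examples minus gamma times its average weight in the negative examples. The function names, the parameter values beta = 16 and gamma = 4, and the toy vectors are assumptions, not taken from the slides.

```python
import math

def rocchio_profile(positives, negatives, beta=16.0, gamma=4.0):
    """Build a category profile from {term: weight} document vectors."""
    terms = {t for doc in positives + negatives for t in doc}
    profile = {}
    for t in terms:
        pos = sum(doc.get(t, 0.0) for doc in positives) / len(positives) if positives else 0.0
        neg = sum(doc.get(t, 0.0) for doc in negatives) / len(negatives) if negatives else 0.0
        profile[t] = beta * pos - gamma * neg
    return profile

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

positives = [{"wheat": 1.0, "farm": 0.5}, {"wheat": 0.8, "export": 0.4}]
negatives = [{"football": 1.0}, {"export": 0.6, "oil": 0.9}]
profile = rocchio_profile(positives, negatives)

wheat_doc = {"wheat": 0.9, "farm": 0.3}
sports_doc = {"football": 1.0}
print(cosine_similarity(profile, wheat_doc), cosine_similarity(profile, sports_doc))
```

Because the profile is itself a vector of term weights, its largest entries can be read directly as the terms that characterize the category.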