Problem Statement

This research project has developed five methods for unstructured text processing; however, the best combination of methods for extracting homogeneous POS clusters has not yet been determined.

Governing Propositions
• Splitting (S): assign words to multiple clusters by splitting Centers by associative and non-associative Surrounds.
  Objective: to place multi-POS words into every correct POS (sub)cluster.
• Bidirectional Clustering: cluster Surrounds before Centers.
  Objective: to reduce feature dimensionality and improve the precision and recall of POS clustering by aggregating a large number of raw Surrounds features by similar Centers.

Governing Propositions
• Chunking: extract multi-word Centers using high-frequency Surrounds.
  Objective: to find multi-word Centers using previously learned Surrounds.
• Distance Function (D): use either correlation only, or correlation together with correlation confidence, as the measure of cluster similarity.
  Objective: to measure the distance between Surrounds or between Centers based on co-occurrence counts.

Governing Propositions
• Expression start tag (E): use Surrounds where the ‘fore’ = <expression start>.
  Objective: to enable extracting features from the first word of an expression.
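The role of the start tag can be illustrated with a minimal sketch of Center-Surrounds counting. It assumes a Surround is a (fore, aft) pair made of the words immediately before and after a one-word Center. The name `aft`, the `<expression end>` tag, and the function `count_center_surrounds` are hypothetical and are not taken from the project.

```python
from collections import Counter

START = "<expression start>"
END = "<expression end>"  # assumed counterpart of the start tag


def count_center_surrounds(expressions, use_start_tag=True):
    """Count (center, (fore, aft)) co-occurrences over tokenized expressions."""
    counts = Counter()
    for tokens in expressions:
        padded = [START] + list(tokens) + [END]
        for i in range(1, len(padded) - 1):
            fore, center, aft = padded[i - 1], padded[i], padded[i + 1]
            if fore == START and not use_start_tag:
                continue  # E = No: the first word contributes no feature
            counts[(center, (fore, aft))] += 1
    return counts


corpus = [["the", "dog", "runs"], ["the", "cat", "runs"]]
with_tag = count_center_surrounds(corpus, use_start_tag=True)
without_tag = count_center_surrounds(corpus, use_start_tag=False)
```

With the tag enabled, the first word "the" gains the features (<expression start>, "dog") and (<expression start>, "cat"); with the tag disabled, it has no features at all.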

Performance Criteria
Achieve no less than 70% correctness in assigning POS clusters.

Task Objective
Evaluate clustering performance for various combinations of methods using the Children’s Corpus, and determine which combination is most suitable for continued research.

Splitting Goals and Hypotheses
G1: Determine whether POS splitting can accurately place multi-POS words into the correct clusters.
H1.1: For any pair of Centers, each Surrounds feature can be classified as associative or non-associative using occurrence-count correlation.
H1.2: By incrementally associating only a Center’s associative features with clusters, a multi-POS Center can be assigned to multiple subclusters, each representing Centers of a homogeneous POS.

Bidirectional Clustering Goals and Hypotheses
G2: Determine whether clustering Surrounds before Centers (bidirectional clustering) produces a better clustering of Centers by POS than clustering Centers only (unidirectional clustering).
H2.1: Clustering Surrounds aggregates Surrounds that capture Centers of the same POS.
H2.2: Compared with unidirectional clustering, bidirectional clustering increases the distance between clusters and decreases the distance within clusters.

Chunking Goals and Hypotheses
G3: Determine whether Surrounds with high occurrence counts can be used effectively to extract multi-word entities (chunks).
H3.1: Surrounds that are effective in extracting a single POS for one-word chunks can also be effective for extracting multi-word chunks (induction).
H3.2: High-occurrence Surrounds indicate a single POS with statistical significance.

Distance Function Goals and Hypotheses
G4: Determine which product-moment correlation distance function yields the most accurate POS clustering.
D1 (correlation only): cluster the pair $X, Y$ given by $\arg\max_{X,Y} \{\, \mathrm{pearson}(X,Y) \,\}$.
D2 (correlation + confidence): cluster the pair $X, Y$ given by $\arg\max_{X,Y} \{\, \mathrm{pearson}(X,Y) : \mathrm{confidence}(\mathrm{pearson}(X,Y),\ |X \cap Y|) > \mathrm{conf\_threshold} \,\}$.
H4.1: Pearson product-moment correlation yields an improved POS clustering when a correlation confidence threshold is applied.
Open question: should Brown clustering be applied as an alternative policy?
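The difference between D1 and D2 can be sketched as follows. The slide does not define the confidence function, so this sketch assumes one: one minus the two-sided p-value of the Fisher z-approximation, with the sample size taken as the number of features that are non-zero in both vectors. The function names and the example vectors are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def confidence(r, n):
    """Assumed confidence measure: 1 - two-sided p-value (Fisher z-approximation)."""
    if n <= 3:
        return 0.0  # too few shared features to say anything
    r = max(min(r, 0.999999), -0.999999)
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erf(abs(z) / math.sqrt(2))


def best_pair(vectors, conf_threshold=None):
    """Return (name_a, name_b, r) for the pair to merge; D1 if no threshold, else D2."""
    best = None
    names = sorted(vectors)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            x, y = vectors[a], vectors[b]
            r = pearson(x, y)
            if conf_threshold is not None:
                shared = sum(1 for p, q in zip(x, y) if p and q)
                if confidence(r, shared) <= conf_threshold:
                    continue  # D2: skip pairs whose correlation is not trusted
            if best is None or r > best[2]:
                best = (a, b, r)
    return best


vectors = {
    "a": [8, 0, 9, 0, 0, 0, 0, 0],
    "b": [4, 0, 5, 0, 0, 0, 0, 0],
    "c": [5, 1, 4, 2, 6, 3, 7, 2],
    "d": [4, 2, 5, 1, 6, 2, 7, 3],
}
d1_choice = best_pair(vectors)
d2_choice = best_pair(vectors, conf_threshold=0.9)
```

In this toy data, D1 selects the sparse pair (a, b), whose high correlation rests on only two shared features, while D2 rejects it and selects (c, d), which share all eight features.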

Double-Zero Features Definition
Double-Zero Feature: a feature $f_i$ is a Double-Zero Feature for a pair $A$, $B$ if $f_{i,A} = f_{i,B} = 0$.
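The definition can be made concrete with a short sketch. The slide does not say how Double-Zero Features are handled; excluding them before comparing two count vectors is one plausible use and is an assumption here, as is the function name `drop_double_zeros`.

```python
def drop_double_zeros(a, b):
    """Keep only features that are non-zero in at least one of the two vectors."""
    kept = [(x, y) for x, y in zip(a, b) if not (x == 0 and y == 0)]
    return [x for x, _ in kept], [y for _, y in kept]


# Feature 1 (index 1) is zero in both vectors, so it is a Double-Zero Feature.
a_kept, b_kept = drop_double_zeros([3, 0, 0, 2], [1, 0, 4, 0])
```

Only the feature that is zero in both vectors is removed; features that are zero in just one of them are kept because they still carry information about the difference between the two.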

Start Tag Goals and Hypotheses
G5: Determine whether clusters influenced by Surrounds features having fore = <expression start> (Start Surrounds) produce less consistent POS clusters than clusters dominated by Surrounds where fore is a distinct word.
H5.1: Start Surrounds tend to influence a small proportion of the final clusters.
H5.2: The information loss associated with a start tag produces inconsistencies in POS clustering.

Governing Propositions
• A corpus with short sentences, simple grammar, and limited vocabulary will produce the best results, but the results will inform further research with more advanced linguistic constructs.
• Not all discovered Surrounds can or should be used for this experiment. The experiment will be restricted to the 2,000 Surrounds with the highest occurrence counts.
• Not all words in the Lexicon can or should be clustered by POS. The experiment will be restricted to the 1,000 Centers with the highest occurrence counts.
• The clustering algorithm will be hierarchical and agglomerative, using a correlation-based distance metric.
• For each experiment, only one co-occurrence matrix will be applied; there is no incremental enhancement of the co-occurrence counts.
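A minimal sketch of hierarchical agglomerative clustering with a correlation-based measure is given here under stated assumptions: the most correlated pair is merged at each step, a merged cluster is represented by the sum of its members' count vectors, and merging stops once the best correlation drops below a minimum value (0.8, matching the Center minimum correlation value listed under Factors). The merge representation, the stopping rule, and the toy counts are assumptions, not the project's implementation.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def agglomerate(vectors, min_r=0.8):
    """Greedily merge the most correlated pair until no pair reaches min_r."""
    clusters = {(name,): list(vec) for name, vec in vectors.items()}
    while len(clusters) > 1:
        keys = sorted(clusters)
        best_r, (a, b) = max(
            ((pearson(clusters[a], clusters[b]), (a, b))
             for i, a in enumerate(keys) for b in keys[i + 1:]),
            key=lambda item: item[0],
        )
        if best_r < min_r:
            break
        # Assumed merge rule: add the co-occurrence counts of both clusters.
        merged = [p + q for p, q in zip(clusters.pop(a), clusters.pop(b))]
        clusters[tuple(sorted(a + b))] = merged
    return sorted(clusters)


# Toy Center vectors over four hypothetical Surrounds features.
centers = {
    "dog": [5, 4, 0, 0],
    "cat": [4, 5, 0, 1],
    "runs": [0, 0, 6, 5],
    "jumps": [0, 1, 5, 6],
}
result = agglomerate(centers)
```

On this toy input the noun-like and verb-like Centers end up in two separate clusters, because the correlation between the two merged groups is strongly negative and falls below the threshold.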

Performance Metrics
• Cluster Correctness:
• Cluster POS: the most commonly occurring POS in the cluster (winner takes all)
• Correct: number of Centers in the cluster whose POS matches the Cluster POS
• Incorrect: number of Centers in the cluster whose POS differs from the Cluster POS
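The winner-takes-all scoring can be sketched as follows. The sketch assumes each Center has a single reference POS tag and that Cluster Correctness is the ratio Correct / (Correct + Incorrect); the slide does not state the formula, so both the ratio and the function name are assumptions.

```python
from collections import Counter


def cluster_correctness(cluster, gold_pos):
    """Return (cluster POS, correct, incorrect, correctness ratio) for one cluster."""
    tags = [gold_pos[center] for center in cluster]
    # Winner takes all: the most frequent tag becomes the cluster's POS.
    cluster_pos, correct = Counter(tags).most_common(1)[0]
    incorrect = len(tags) - correct
    return cluster_pos, correct, incorrect, correct / len(tags)


gold = {"dog": "NOUN", "cat": "NOUN", "run": "VERB"}
pos, correct, incorrect, ratio = cluster_correctness(["dog", "cat", "run"], gold)
```

Under this reading, a cluster would meet the stated performance criterion when its ratio is at least 0.70.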

Solution Pipeline
[Figure: pipeline diagram. Labels: Word Lex, Corpus, Chunking, N-gram extraction, Parsing, Centers, E=T, Surrounds, Center-Surrounds Counting, Co-Occurrence Matrix, S=T (twice), Surrounds Clustering, Center Clustering, Center Clusters, D=1 (twice), Results (POS, …), Evaluation]

Factors
• Splitting: {Yes, No}
  Minimum split ratio: ξ = 0.3
  Tolerance angle: ϑ = 30 degrees
• Bidirectional Clustering: {Yes, No}
  Surrounds minimum confidence value: ψ = 0.9
• Chunking: {Yes, No}
  Max Center word length: v < 4
  Number of reference Surrounds: τ = 200
• Distance Function: {Correlation Value, Correlation Value + Confidence}
  Center minimum correlation value: r = 0.8
  Center minimum correlation confidence: K = 0.9
  Brown Clustering???
• Start Tag Use: {Yes, No}
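Five two-level factors give 2^5 = 32 combinations, which matches the 32 experiment configurations. A minimal sketch of enumerating them as a full factorial design follows; the key names and the enumeration order are assumptions and need not match the numbering used in the configuration tables.

```python
from itertools import product

# Hypothetical key names for the five two-level factors.
FACTORS = {
    "splitting": (False, True),
    "bidirectional": (False, True),
    "chunking": (False, True),
    "distance": ("correlation", "correlation+confidence"),
    "start_tag": (False, True),
}


def experiment_configurations():
    """Enumerate every combination of factor levels (full factorial design)."""
    names = list(FACTORS)
    return [dict(zip(names, levels)) for levels in product(*FACTORS.values())]


configs = experiment_configurations()
```

Each entry of `configs` is a dictionary with one level per factor, so a single experiment run can be driven directly from one entry.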

Experiment Configurations (1-16)

Experiment Configurations (17-32)

Corpus

Final Clustering Feature Counts

General Pipeline
[Figure: pipeline diagram. Labels: Word Lex, Corpus, Chunking, N-gram extraction, Parsing, Centers, E=T, Surrounds, Center-Surrounds Counting, Co-Occurrence Matrix, S=T (twice), Surrounds Clustering, Center Clustering, Center Clusters, D=1 (twice), POS Grade, Evaluation]
