
A Parallel Implementation of Conditional Random Fields



  1. A Parallel Implementation of Conditional Random Fields • This was an AUSS/NIP project for the grant Developing an Entity Extractor for the Scalable Constructing of Semantically Rich Socio-Technical Network Data by Jana Diesner of UIUC. • Mostly I worked with Brent Fegley, her research assistant. • This is machine learning, and thus NIP.

  2. Motivations • For them: some of the problems they want to run take weeks with the serial code. • For us: machine learning is an obvious candidate for HPC.

  3. What is CRF? • The original paper: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data by Lafferty, McCallum, and Pereira (2001) • Goal: (for example) add part-of-speech labels to text • Idea: use a hidden Markov process based on previous labels. Optimize the transition weights in the model to get the predicted parts of speech as close to the correct values as possible, for example by maximizing the entropy of the solution.
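
For reference, this is the linear-chain CRF model from the Lafferty, McCallum, and Pereira paper in its standard notation (x is the token sequence, y the label sequence, f_k the features, lambda_k the weights to be optimized); it is the textbook form, not anything specific to our code:

    p_\lambda(\mathbf{y} \mid \mathbf{x})
        = \frac{1}{Z(\mathbf{x})}
          \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
    \qquad
    Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)

Training picks the weights lambda to make the labeled training data as likely as possible under this model; that per-sequence sum (and its gradient) is what gets evaluated at every optimization step.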

  4. Diesner specifics • The goal for Diesner's group is to identify entities within unlabeled text • 95 training examples; roughly 2500 labeled paragraphs from WSJ • Two schemes: 'BOUNDARY' (5 labels) and 'CATEGORY' (95 labels) (much longer running) • Example of the labeled input: <ENAMEX TYPE="ORGANIZATION:CORPORATION">Intel Corp.</ENAMEX> reported a <NUMEX TYPE="PERCENT">50 %</NUMEX> drop in <TIMEX TYPE="DATE:DATE">third-quarter</TIMEX> net income , partly because of a <NUMEX TYPE="CARDINAL">one</NUMEX>-time charge for discontinued operations . The big semiconductor and computer <ENAMEX TYPE="ORG_DESC:CORPORATION">maker</ENAMEX> , said it had net of <NUMEX TYPE="MONEY">$ 72 million</NUMEX> , or <NUMEX TYPE="MONEY">38 cents</NUMEX> , down <NUMEX TYPE="PERCENT">50 %</NUMEX> from <NUMEX TYPE="MONEY">$ 142.7 million</NUMEX> , or <NUMEX TYPE="MONEY">78 cents</NUMEX> a share . The lower net included a charge of <NUMEX TYPE="MONEY">$ 35 million</NUMEX> , equal to <NUMEX TYPE="MONEY">12 cents</NUMEX> a share on an after-tax basis , for the cost of abandoning a computer-systems joint venture with <ENAMEX TYPE="ORGANIZATION:CORPORATION">Siemens AG</ENAMEX> of <ENAMEX TYPE="GPE:COUNTRY">West Germany</ENAMEX> . Earning also fell from the <TIMEX TYPE="DATE:DATE">year-ago</TIMEX> period because of slowing microchip demand .

  5. Sarawagi Implementation of CRF Diesner et al. use a Java implementation developed by Sunita Sarawagi of IITB around 2006 • DataIter iterates over DataSequence instances • so a DataSequence is some labeled text • FeatureGenerator takes a DataSequence and provides a set of features • e.g. text tags or parts of speech • CRF.Trainer optimizes the weights
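
To make the division of labor concrete, here is a stripped-down sketch of the roles listed above. The interface names follow the slide, but the method signatures are illustrative simplifications, not Sarawagi's actual API:

    // Simplified sketch of the roles above; the real iitb.CRF interfaces differ
    // in detail, so treat these signatures as illustrative only.
    interface DataSequence {               // one labeled text segment
        int length();                      // number of tokens
        Object token(int i);               // observed token at position i
        int label(int i);                  // gold label at position i
        void setLabel(int i, int label);   // where predicted labels get written
    }

    interface DataIter {                   // walks over the training sequences
        void startScan();
        boolean hasNext();
        DataSequence next();
    }

    interface FeatureGenerator {           // turns (sequence, position) into features
        void startScanFeaturesAt(DataSequence seq, int pos);
        boolean hasNext();
        int nextFeatureId();               // e.g. "token is capitalized and previous label is ORG"
    }

The trainer then only deals in feature ids and their weights, which is what keeps the optimization code independent of the text-specific details.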

  6. Illinois application • RoundRobinEstimator takes each set of 4 of the 5 training sets, trains the CRF, and tests the prediction against the remaining set. • Two cases: BOUNDARY (5 labels) and CATEGORY (95 labels) • Typical test case: 10 optimization steps for BOUNDARY, 3 for CATEGORY • I worked directly from their SVN repo for source version control, using Eclipse and Maven
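
The round-robin evaluation is ordinary hold-one-out cross-validation over the 5 sets. A minimal self-contained sketch of the loop (Fold, Model, train, and evaluate are placeholders here, not the actual RoundRobinEstimator classes):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the round-robin loop over 5 training sets.
    // Fold, Model, train, and evaluate stand in for the project's real classes.
    public class RoundRobinSketch {
        static class Fold {}                              // one of the 5 training sets
        static class Model {}                             // a trained CRF
        static Model train(List<Fold> folds) { return new Model(); }
        static double evaluate(Model m, Fold test) { return 0.0; }

        public static void main(String[] args) {
            List<Fold> folds = new ArrayList<>();
            for (int i = 0; i < 5; i++) folds.add(new Fold());

            for (int held = 0; held < folds.size(); held++) {
                List<Fold> trainSet = new ArrayList<>(folds);
                Fold testSet = trainSet.remove(held);     // hold one set out
                Model crf = train(trainSet);              // train on the other 4
                System.out.printf("fold %d: %.3f%n", held, evaluate(crf, testSet));
            }
        }
    }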

  7. Serial Code Architecture Basically the problem is gradient descent in many dimensions. Uses LBFGS • A Java port of the good old Fortran code • Has internal state, which frustrated Fegley's threading efforts • Calls alternate between LBFGS (setting the next test point) and computeFunctionGradient (evaluating the sum)
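
The control flow on this slide boils down to a simple alternation. In the sketch below, computeFunctionGradient and lbfgsStep are stand-ins for the Trainer method and the Java LBFGS port's entry point; only the shape of the loop is meant to match the real code:

    // Schematic of the serial training loop: the optimizer proposes the next
    // weight vector, computeFunctionGradient evaluates the objective and fills
    // in the gradient.  Both bodies are stubbed; only the alternation matters.
    public class SerialLoopSketch {
        static double computeFunctionGradient(double[] lambda, double[] grad) {
            // Sum the log-likelihood terms over all training sequences and
            // accumulate their gradient contributions into grad (stubbed out).
            java.util.Arrays.fill(grad, 0.0);
            return 0.0;
        }

        static void lbfgsStep(double[] lambda, double value, double[] grad) {
            // Stand-in for the LBFGS update: move lambda using value and grad,
            // keeping internal history (the state that frustrates threading).
        }

        public static void main(String[] args) {
            double[] lambda = new double[1000];       // feature weights
            double[] grad = new double[lambda.length];
            for (int iter = 0; iter < 10; iter++) {   // e.g. 10 steps for BOUNDARY
                double value = computeFunctionGradient(lambda, grad);
                lbfgsStep(lambda, value, grad);       // sets the next test point
            }
        }
    }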

  8. (Some of) Java's Support for Parallel Threads A java.util.concurrent.ExecutorService maintains a pool of threads that take tasks from a parallel queue and return values via futures. We can create one using newFixedThreadPool(). The threads and tasks have to be customized. (Slide diagram: tasks enter through a parallel queue to the pool; each task's result comes back through a promise/future queue.)
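
Here is the standard-library pattern in miniature (plain java.util.concurrent usage, not code from the project):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Minimal ExecutorService demo: submit Callables to a fixed thread pool and
    // collect their results through Futures.
    public class PoolDemo {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);    // 4 worker threads

            List<Future<Double>> results = new ArrayList<>();
            for (int i = 0; i < 16; i++) {
                final int task = i;
                Callable<Double> job = () -> (double) task * task;     // each task returns a value
                results.add(pool.submit(job));
            }

            double sum = 0.0;
            for (Future<Double> f : results) {
                sum += f.get();                    // blocks until that task has finished
            }
            System.out.println("sum = " + sum);
            pool.shutdown();
        }
    }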

  9. Parallel Code Architecture Introduce a new derived class: ParallelTrainer extends Trainer In ParallelTrainer.computeFunctionGradient(), training example terms are evaluated across threads. Scalar values get returned as Futures. The gradient vector must be returned via the calling parameter! • Make a new Thread class that accumulates gradient terms over its lifetime • Merge those values at the end of the iteration
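
Condensed into one method, the pattern looks roughly like the sketch below. The names and the block-wise split of examples are illustrative, not the ParallelTrainer source, and the slide's custom Thread class that accumulates over its lifetime is replaced here by a per-task local array for brevity:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Illustrative sketch (not the ParallelTrainer source): split the
    // training-example terms across tasks, return scalar values via Futures,
    // and merge per-task gradient arrays into the caller-supplied grad vector.
    public class ParallelGradientSketch {
        static final int NUM_EXAMPLES = 2500;
        static final int NUM_WEIGHTS = 1000;

        // Stub for one training example's contribution to the objective/gradient.
        static double exampleTerm(int example, double[] lambda, double[] localGrad) {
            return 0.0;
        }

        static double computeFunctionGradient(double[] lambda, double[] grad,
                                              ExecutorService pool, int nTasks)
                throws Exception {
            List<Future<Double>> parts = new ArrayList<>();
            List<double[]> localGrads = new ArrayList<>();

            for (int t = 0; t < nTasks; t++) {
                final int lo = t * NUM_EXAMPLES / nTasks;
                final int hi = (t + 1) * NUM_EXAMPLES / nTasks;
                final double[] localGrad = new double[NUM_WEIGHTS];   // private to this task
                localGrads.add(localGrad);
                parts.add(pool.submit(() -> {
                    double v = 0.0;
                    for (int ex = lo; ex < hi; ex++) {
                        v += exampleTerm(ex, lambda, localGrad);      // accumulate locally
                    }
                    return v;                                         // scalar comes back as a Future
                }));
            }

            double value = 0.0;
            for (Future<Double> f : parts) {
                value += f.get();
            }
            Arrays.fill(grad, 0.0);
            for (double[] lg : localGrads) {                          // merge at end of iteration
                for (int k = 0; k < grad.length; k++) grad[k] += lg[k];
            }
            return value;                                             // gradient returned via parameter
        }

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(8);
            double[] lambda = new double[NUM_WEIGHTS];
            double[] grad = new double[NUM_WEIGHTS];
            System.out.println(computeFunctionGradient(lambda, grad, pool, 8));
            pool.shutdown();
        }
    }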

  10. Results Breakdown of time vs. task for the BOUNDARY and CATEGORY problems

  11. Results (2) Parallel speed-up vs. thread count for the two tasks. The BOUNDARIES task rapidly exhausts scalability, but the CATEGORIES task is still improving at 64 threads. Training time vs. thread count for CATEGORIES. The blue line is for 16 cores/32 hyperthreads.

  12. Drawbacks The big issue: Sarawagi's API does not provide the FeatureGenerator until it is time to start training (very flexible). Internal state from Trainer is shared across the package. This makes it difficult to create the threads early; they would have to have their internals replaced every iteration anyway. Thus we create threads late, starting fresh every iteration. Very inefficient, but the overhead is tiny for realistic cases.

  13. Where are we now? • I'm working with a new set of grad students to understand some variability. • Rounding error causes drift in the optimization trajectory. • The need to return the gradient vector makes a fully deterministic version very expensive. • We've also shared ParallelTrainer with a group at CMU that uses Sarawagi's CRF implementation.
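
The drift comes down to floating-point addition not being associative: merge the per-thread partial sums in a different order and the low-order bits of the result change, which is enough to send the optimizer down a slightly different trajectory. A tiny illustration (not project code):

    // Floating-point addition is not associative, so the order in which partial
    // sums are combined (which varies with thread scheduling) changes the result.
    public class NonAssociative {
        public static void main(String[] args) {
            double a = 1.0e16, b = -1.0e16, c = 1.0;
            System.out.println((a + b) + c);   // prints 1.0
            System.out.println(a + (b + c));   // prints 0.0
        }
    }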
