Multiple Alignment of Citation Sentences with Conditional Random Fields and Posterior Decoding

Ariel Schwartz, Anna Divoli, and Marti Hearst, University of California, Berkeley. Supported in part by NSF DBI 0317510.

Bioscience literature • Rich, complex and fast growing. • Online full text enables new forms of automatic document analysis, including caption search and citation sentence analysis. • Citances • Nearly every statement in a bioscience journal article is backed up by a citation. • It is common for papers to be cited 30-100 times. • The text around the citation tends to state biological facts from the target paper. • We term these citation sentences, or citances. • Different citances state similar facts in different ways.

Papers are cited for some fact(s) … … until it is the case that many important facts in the field can be found in citation sentences alone!

Using citances • Potential applications of citances • creation of training and testing data for semantic analysis, • synonym set creation, • database curation, • document summarization, • and information retrieval generally. Nakov, Schwartz and Hearst. Citances: Citation Sentences for Semantic Analysis of Bioscience Text, in the SIGIR'04 Workshop on Search and Discovery in Bioinformatics. • All these applications require citance word alignments. • Align together concepts that are semantically related in the context of the target paper. • Related concepts can be expressed in several different ways in the citances. • We focus here on the multiple citance alignment (MCA) problem.

Example of unaligned citances • “In response to genotoxic stress, Chk1 and Chk2 phosphorylate Cdc25A on N-terminal sites and target it rapidly for ubiquitin-dependent degradation (Mailand et al, 2000, 2002; Molinari et al, 2000; Falck et al, 2001; Shimuta et al, 2002; Busino et al, 2003), which is thought to be central to the S and G2 cell cycle checkpoints (Bartek and Lukas, 2003; Donzelli and Draetta, 2003).” • “Given that Chk1 promotes Cdc25A turnover in response to DNA damage in vivo (Falck et al. 2001; Sorensen et al. 2003) and that Chk1 is required for Cdc25A ubiquitination by SCFβ-TRCP in vitro, we explored the role of Cdc25A phosphorylation in the ubiquitination process.” • “Since activated phosphorylated Chk2-T68 is involved in phosphorylation and degradation of Cdc25A (Falck et al., 2001, Falck et al., 2002; Bartek and Lukas, 2003), we also examined the levels of Cdc25A in 2fTGH and U3A cells exposed to γ-IR.”

Goal: Align similar concepts • C1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A N terminal sites target rapidly ubiquitin dependent degradation thought central S G2 cell cycle checkpoints • C2: Given Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1 required Cdc25A ubiquitination SCF beta TRCP vitro explored role Cdc25A phosphorylation ubiquitination process • C3: activated phosphorylated Chk2 T68 involved phosphorylation degradation Cdc25A examined levels Cdc25A 2fTGH U3A cells exposed gamma IR

Multiple citance alignment (MCA) • Goal: Partition the citances’ words/phrases into equivalence classes based on “semantic homology”. • Orthographic similarity is important but does not always entail semantic homology: • “phosphorylate” ≈ “phosphorylation” • “cell cycle” ≉ “U3A cells” • “genotoxic stress” ≈ “DNA damage” • Related problems: • Multiple sequence alignment (MSA) in genomics. • Pairwise word alignment in statistical machine translation (SMT).

Formal definition of MCA • A pairwise citance alignment of citances $C_i$ and $C_j$ is an equivalence relation $\approx_{ij}$. • $c_{ik} \approx_{ij} c_{jl}$ means that the kth word in the ith citance is aligned to the lth word in the jth citance. • Multiple citance alignment (MCA) is an equivalence relation $\sim$, defined as the transitive closure of the union of all pairwise citance alignments: $\sim \;=\; \left(\bigcup_{i,j} \approx_{ij}\right)^{+}$ • The transitive closure ensures that the equivalence classes (colors) are consistent across all pairwise citance alignments.
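The transitive closure in this definition can be computed with a union-find structure. The following is a minimal sketch, not the authors' implementation; the function name `multiple_alignment` and the representation of a word position as a `(citance_index, word_index)` tuple are assumptions made for illustration.

```python
from collections import defaultdict

def multiple_alignment(pairwise_links):
    """Return equivalence classes: the transitive closure of all pairwise links.

    Each link joins two word positions, written as (citance_index, word_index).
    This representation is an assumption of the sketch.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Union the two endpoints of every pairwise link.
    for a, b in pairwise_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Group every seen position under its representative.
    classes = defaultdict(set)
    for x in list(parent):
        classes[find(x)].add(x)
    return [frozenset(c) for c in classes.values()]

# Hypothetical links: C1 word 5 ~ C2 word 17, and C2 word 17 ~ C3 word 5.
links = [((0, 5), (1, 17)), ((1, 17), (2, 5))]
classes = multiple_alignment(links)
```

Because the closure is transitive, the two links above place all three positions in one class even though C1 and C3 were never linked directly.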

Algorithm outline • We developed an MCA algorithm based on: • Extension to our posterior decoding algorithm for MSA (AMAP, Schwartz and Pachter ECCB 2006). • Modified version of the SMT pairwise word alignment model of Blunsom & Cohn (ACL 2006) for posterior probabilities calculation.

Algorithm outline (pipeline diagram) • Unaligned citances to a target paper → Feature extraction → Posterior probabilities calculation (CRF) → Expected utility maximization (posterior decoding), using the utility function → Multiple citance alignment (MCA)

Utility function for MCA • Requirements for a good utility function: • Correlated to the accuracy measure used for evaluation. • Easily decomposable, for direct optimization using posterior-decoding. • Metric-based (optional): • Captures intuitive notion of distance. • Triangle inequality provides bounds on the search space. • AER and F-measure do not satisfy these criteria.

Alignment Metric Accuracy (AMA) • We extend AMA (Schwartz et al 2006), a utility function for one-to-one MSA, to many-to-many MCA. • Intuitively, $U_{\mathrm{AMA}}$ measures the average word-level agreement between the predicted and reference MCAs. • $U_{\mathrm{set\_agreement}}$ is a “score” assigned to each word position based on the overlap between the sets of word positions the two alignments align to it. • Can use Dice, Jaccard, or Hamming, for example. • We use the Braun-Blanquet coefficient.
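The word-level agreement idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' exact utility: an alignment is represented as a dictionary from a word position to the set of positions aligned to it, the Braun-Blanquet coefficient is taken as |A ∩ B| / max(|A|, |B|), and two empty sets (both alignments leave the word unaligned) are assumed to score 1. The names `braun_blanquet` and `ama` are hypothetical.

```python
def braun_blanquet(a, b):
    """Braun-Blanquet coefficient of two sets: |A & B| / max(|A|, |B|)."""
    if not a and not b:
        # Assumption: agreeing that a word is unaligned counts as full agreement.
        return 1.0
    return len(a & b) / max(len(a), len(b))

def ama(predicted, reference, positions):
    """Average per-word set agreement between two alignments.

    predicted / reference: dict mapping a word position to the set of
    positions aligned to it (missing key means unaligned).
    """
    total = sum(
        braun_blanquet(predicted.get(p, set()), reference.get(p, set()))
        for p in positions
    )
    return total / len(positions)

# Toy example with four word positions (labels are made up).
positions = ["a1", "a2", "b1", "b2"]
reference = {"a1": {"b1"}, "b1": {"a1"}}
predicted = {"a1": {"b1", "b2"}, "b1": {"a1"}, "b2": {"a1"}}
score = ama(predicted, reference, positions)  # (0.5 + 1 + 1 + 0) / 4
```

Each word contributes a value between 0 and 1, so one spurious link lowers the score of every word it touches without affecting the others.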

Example of AMA for MCA • Every word gets a score between 0 and 1 based on level of agreement with the reference alignment. • AMA is the average word score. • In this example, AMA = 13.83 / 20 = 0.692. • Sum of pairs is used for multiple alignments.

Controlling the recall/precision tradeoff • In addition, two free parameters (match-factor and gap-factor) are added in order to provide control of the recall/precision tradeoff. • The result is the following utility function:

Motivation for using a CRF model • Small annotated sets for training, development, and testing. • Main challenge is to perform well on unseen words. • Requires a discriminative model that • uses different overlapping features, • incorporates contextual information, • allows for computation of posterior probabilities.

CRF-based SMT word alignment • Blunsom and Cohn (ACL 2006) developed a CRF-based pairwise word alignment model for SMT. • Directional model – every source word can be mapped to zero or one target words. • Using Viterbi decoding. • Features are functions of the implied source-target word-pairs. • We modified the program to support MCA: • Compute the directional marginal posterior probabilities using the forward-backward algorithm. • Modified features. • Implementation of a posterior-decoding algorithm for MCA instead of the Viterbi decoding for pairwise SMT word alignment.
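As a reminder of what the forward-backward step computes, here is a generic linear-chain sketch in plain Python. It is not the Blunsom and Cohn code: the function name `forward_backward`, the score tables `node` and `trans`, and the reading of a "state" as the target word (or null) chosen for each source word are assumptions of this illustration. No rescaling is applied, so it is only suitable for short sequences.

```python
def forward_backward(node, trans):
    """Marginal posteriors for a linear chain with nonnegative scores.

    node[t][s]  : score of state s at position t
    trans[r][s] : score of moving from state r to state s
    Returns marginals[t][s] = P(state at position t is s | whole sequence).
    """
    T, S = len(node), len(node[0])
    fwd = [[0.0] * S for _ in range(T)]
    bwd = [[1.0] * S for _ in range(T)]

    # Forward pass: total score of all prefixes ending in state s at t.
    fwd[0] = list(node[0])
    for t in range(1, T):
        for s in range(S):
            fwd[t][s] = node[t][s] * sum(
                fwd[t - 1][r] * trans[r][s] for r in range(S)
            )

    # Backward pass: total score of all suffixes following state r at t.
    for t in range(T - 2, -1, -1):
        for r in range(S):
            bwd[t][r] = sum(
                trans[r][s] * node[t + 1][s] * bwd[t + 1][s] for s in range(S)
            )

    z = sum(fwd[T - 1])  # partition function
    return [[fwd[t][s] * bwd[t][s] / z for s in range(S)] for t in range(T)]

# Two positions, two states, made-up scores.
marginals = forward_backward(node=[[1, 2], [3, 1]], trans=[[2, 1], [1, 3]])
```

Unlike Viterbi, which returns only the single best path, these marginals give a probability for every source-target pair, which is what posterior decoding needs.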

Posterior decoding algorithm for MCA • For every pair of citances, compute the directional posterior probabilities using a CRF. • For every target word w, compute the combination of source words that maximizes the expected utility of w. • The (undirected) multiple word alignment is produced by taking the transitive closure of the union of the individual words' optimal alignments.
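The per-word decision can be illustrated with a deliberately simplified rule. This sketch is an assumption, not the utility used in the talk: it lets a target word align to at most one source word (the actual algorithm considers combinations of source words), and it compares the match-factor-weighted posterior of the best source word against the gap-factor-weighted posterior of staying unaligned. The function name `decode_word` and the example probabilities are hypothetical.

```python
def decode_word(posteriors, gap_posterior, match_factor=1.0, gap_factor=1.0):
    """Pick the best source word for one target word, or None for a gap.

    posteriors    : dict source_word -> posterior probability of aligning to it
    gap_posterior : posterior probability that the target word is unaligned
    Simplifying assumption: at most one source word is chosen.
    """
    if not posteriors:
        return None
    best = max(posteriors, key=posteriors.get)
    # Align only when the weighted match beats the weighted gap.
    if match_factor * posteriors[best] > gap_factor * gap_posterior:
        return best
    return None

# Made-up posteriors for one target word.
post = {"Chk1": 0.30, "promotes": 0.05}
conservative = decode_word(post, gap_posterior=0.65)
permissive = decode_word(post, gap_posterior=0.65, match_factor=1.2, gap_factor=0.1)
```

With both factors at 1 the word stays unaligned, while raising the match-factor and lowering the gap-factor makes the same posteriors produce a link, which is the mechanism behind trading precision for recall.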

Decoding Example Target C1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A C1: response genotoxic stress Chk1 Chk2 phosphorylate Cdc25A C1: response genotoxic stress Chk1Chk2 phosphorylate Cdc25A Source C2: Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1 C2:Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1 C3: Chk2 T68 involved phosphorylation degradation Cdc25A C3:Chk2 T68 involved phosphorylation degradation Cdc25A Later on in the decoding process … Source C1: response genotoxic stress Chk1Chk2 phosphorylate Cdc25A C1: response genotoxic stress Chk1Chk2 phosphorylate Cdc25A C2:Chk1 promotes Cdc25A turnover response DNA damage vivo Chk1 Target C3:Chk2 T68 involved phosphorylation degradation Cdc25A C3:Chk2 T68 involved phosphorylation degradation Cdc25A

Data sets • 3 sets of citances annotated by a PhD with biological training: • Training set - 4 groups, 10 citances each (180 pairs). • Development set – 51 citances (1275 pairs). • Test set – 45 citances (990 pairs). • Feature engineering using the training and development sets. • Final results based on a model trained on training and development sets combined, and tested on the test set. • Baseline – using only normalized edit distance with a simple cutoff.
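The pair counts on this slide are consistent with counting unordered citance pairs within each group, i.e. n choose 2 per group. The short check below assumes that reading (pairs are never formed across groups); `pair_count` is a hypothetical helper name.

```python
from math import comb

def pair_count(group_sizes):
    """Number of unordered citance pairs, pairing only within each group."""
    return sum(comb(n, 2) for n in group_sizes)

training_pairs = pair_count([10, 10, 10, 10])  # 4 groups of 10 citances
development_pairs = pair_count([51])
test_pairs = pair_count([45])
```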

Features for MCA • Orthographic features • exact string match, • normalized edit distance, • prefix, suffix match, • word lengths, • capitalization. • Local contextual features • distance between target words of adjacent source words, • word-specific tendency to align like the previous/next word, • transition to, from, and between (un)aligned words. • Biological ontology based features • Medical Subject Headings (MeSH), • Gene synonyms (Entrez Gene, UniProt, OMIM). • Lexical features • WordNet similarity (Lin, 1998)
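A few of the orthographic features can be sketched as below. This is illustrative only: the feature names, the choice to normalize the Levenshtein distance by the longer word's length, and the capitalization test are assumptions, and the real feature set is richer than this.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, using one rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def common_prefix_len(a, b):
    """Length of the shared leading run of characters."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def orthographic_features(w1, w2):
    """Hypothetical feature dictionary for one word pair."""
    longest = max(len(w1), len(w2)) or 1  # avoid dividing by zero
    return {
        "exact_match": w1 == w2,
        # Assumption: normalize by the length of the longer word.
        "norm_edit_distance": edit_distance(w1, w2) / longest,
        "prefix_len": common_prefix_len(w1, w2),
        "suffix_len": common_prefix_len(w1[::-1], w2[::-1]),
        "len_diff": abs(len(w1) - len(w2)),
        "same_capitalization": w1[:1].isupper() == w2[:1].isupper(),
    }

features = orthographic_features("phosphorylate", "phosphorylation")
```

A baseline of the kind described on the data sets slide would threshold `norm_edit_distance` alone; the CRF instead combines it with the other overlapping features.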

Results on pairwise alignments • Unlike Viterbi decoding, posterior-decoding (PD) enables refined control of the recall/precision tradeoff. • Viterbi_Union (0.531 recall at 0.913 precision) is comparable to PD with match-factor and gap-factor both set to 1 (0.540 recall at 0.909 precision). • However, PD allows recall to be increased significantly by increasing the match-factor and decreasing the gap-factor (0.636 recall at 0.517 precision for match-factor = 1.2 and gap-factor = 0.1, or 0.742 recall at 0.198 precision for match-factor = 1.5 and gap-factor = 0.05).

Results on MCA • The two curves (baseline and CRF+PD) overlap in the range between 0.52 and 0.55 recall (0.84 and 0.9 precision). Orthographic similarity is the dominant feature in this range. • Unlike the baseline, the CRF+PD system keeps improving recall without a sharp drop in precision, up to 0.636 recall at 0.748 precision. This is due to the incorporation of multiple overlapping features. • The CRF+PD system also achieves better precision than the baseline (0.982 precision at 0.381 recall vs. 0.937 precision at 0.346 recall).
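For reference, precision, recall and F-measure over alignment links can be computed as in the sketch below. It assumes links are undirected pairs of word positions and uses the balanced F-measure (harmonic mean); the helper names `link`, `f_measure` and `precision_recall_f` and the toy data are hypothetical, and the evaluation script used for the reported numbers may differ.

```python
def link(a, b):
    """An undirected alignment link between two word positions."""
    return frozenset((a, b))

def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f(predicted, reference):
    """Score a set of predicted links against a set of reference links."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall, f_measure(precision, recall)

# Toy data: one of two predicted links is correct, out of three reference links.
predicted = {link("a", "b"), link("c", "d")}
reference = {link("b", "a"), link("e", "f"), link("g", "h")}
p, r, f = precision_recall_f(predicted, reference)
```

The same `f_measure` helper can be applied to any operating point on the curves, for example `f_measure(0.748, 0.636)` for the highest-recall point quoted for CRF+PD.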

Error analysis • Performed error analysis on the MCA with the best F-measure (0.690). • Out of 1400 unique errors, 1194 (85.3%) are false negatives and 206 (14.7%) are false positives. • Most errors are due to misalignment of • subtypes (cdc, cdc6, cdc25A), • opposites (phosphorylated and unphosphorylated), • and complex entities (cell cycle and cell line). • Many FN errors come from failing to align entities in just 4 equivalence classes (e.g., 97 FN in the class of motif, site and domain). • Other types of errors: • not aligning plural and singular forms of the same entities, • aligning only part of multi-word entities, • and incorrectly aligning orthographically similar entities.

Contributions • Defined the MCA problem. • Developed a posterior-decoding algorithm for MCA. • Advantages of posterior-decoding over Viterbi: • Directly optimizes the expected (metric-based) utility. • Control of the recall/precision tradeoff. • Developed AMA for MCA • A metric-based accuracy measure for MCA. • Balances recall and precision in one measure. • The expected AMA can be optimized directly with posterior-decoding (unlike AER or F-measure). • Can also be used for SMT alignments.
