LREC 2008 (Marrakech, Morocco)

The ACL ARC (Anthology Reference Corpus): A Reference Dataset for Bibliographic Research in Computational Linguistics

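Steven Bird [1], Robert Dale [2], Bonnie Dorr [3], Bryan Gibson [4], Mark T. Joseph [4], Min-Yen Kan [5], Dongwon Lee [6], Brett Powley [2], Dragomir R. Radev [4], Yee Fan Tan [5]

[Affiliations 1 to 6: shown as logos on the slide, not recoverable from the transcript]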


http://acl.ldc.upenn.edu (ca. 2002)


http://www.aclweb.org/anthology-new


[Screenshot: http://www.aclweb.org/anthology-new/P/P05, with callouts for Booktitle, Basic Metadata, Some Detailed Metadata, and PDF file]

Hot NLP Problems
  • Graphical Methods for NLP
    • Social network analysis
  • Text categorization
    • Sentence / Citation function
  • Sequence Labeling
    • Reference string parsing
  • Bayesian Models
    • Topic Models
  • Summarization
    • Survey Paper Generation
The Anthology as Corpus

Why use newswire?

  • Because our funding agencies want it

Let's build a corpus from our own publications!

  • Test domain adaptation techniques
  • Characterize what’s special about scientific discourse
  • Help ourselves and others understand our research better

Start with the largest freely available NLP research archive

The Anthology Reference Corpus

Scholars have already been using scientific articles as input

But datasets largely disparate

Results not comparable

Goal: unify such work by agreeing to work on a central dataset (à la TREC)

The ACL ARC

Consists of most articles available as of February 2006 that have extractable text

Papers                                   10,921
Total references                        152,546
References to articles inside ACL ARC    38,767 (25.4%)
References to articles outside ACL ARC  113,779 (74.6%)
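
The percentages above follow directly from the raw counts. A minimal sketch of the arithmetic, using only the figures on the slide (the variable and function names are illustrative):

```python
# Counts copied from the slide; names are illustrative.
papers = 10_921
total_refs = 152_546
inside_arc = 38_767
outside_arc = 113_779

# The two reference categories partition the total.
assert inside_arc + outside_arc == total_refs


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


inside_pct = share(inside_arc, total_refs)
outside_pct = share(outside_arc, total_refs)
refs_per_paper = total_refs / papers

print(f"inside: {inside_pct}%  outside: {outside_pct}%")
print(f"average references per paper: {refs_per_paper:.1f}")
```

Roughly three out of four references point outside the corpus, which is worth keeping in mind for any citation analysis restricted to ARC-internal links.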

What’s included now

Version 2008 03 25:

  • PDFs for all 10,921 articles
  • <Title, Author, Booktitle> metadata tuples
  • Noisy text output extracted from the PDFs
    • Using a non-OCR-based extractor (PDFBox)
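
One way to picture the <Title, Author, Booktitle> tuples is as a small record type. The sketch below is an assumption about how a user might load them, not the corpus's actual file format; the field names, the helper `group_by_booktitle`, and the sample values are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class PaperMetadata(NamedTuple):
    """One <Title, Author, Booktitle> tuple (field names are assumptions)."""
    title: str
    authors: Tuple[str, ...]
    booktitle: str


def group_by_booktitle(records: List[PaperMetadata]) -> Dict[str, List[PaperMetadata]]:
    """Group records by the venue string they were published under."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.booktitle].append(record)
    return dict(grouped)


# Invented sample records, not actual ARC entries.
records = [
    PaperMetadata("A made-up title", ("A. Author", "B. Writer"),
                  "Proceedings of an Example Workshop"),
    PaperMetadata("Another made-up title", ("C. Scholar",),
                  "Proceedings of an Example Workshop"),
    PaperMetadata("A third made-up title", ("D. Reader",), "Example Journal"),
]

by_venue = group_by_booktitle(records)
print({venue: len(papers) for venue, papers in by_venue.items()})
```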
The road ahead

Achieving the goals of the Linked Anthology Proposal

  • Improve data quality
  • Establish subsets for smaller experiments
  • Build and release open-source tools
  • Enlarge coverage of newer materials
  • Release major revisions (infrequently)
Future data (near-term)

Inter-document

  • Manually cleaned citation graph from the ACL Anthology Network

Intra-document

  • Citation to reference string matching

Document

  • Automatic keyphrase generation
  • OCR-based extracted text output (much cleaner)
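
To make the intra-document task concrete, here is a minimal sketch of citation-to-reference-string matching. It assumes parenthetical author-year citations such as "(Author et al., 2004)" and matches on surname plus year only; the regular expression, the function names, and the sample text are illustrative assumptions, not the method used for the ACL ARC.

```python
import re
from typing import List, Optional, Tuple

# Matches "(Surname, 2001)", "(Surname et al., 2001)" and "(Surname and Other, 2001)".
CITATION_RE = re.compile(
    r"\(([A-Z][A-Za-z\-]+)(?: et al\.| and [A-Z][A-Za-z\-]+)?,? (\d{4})\)"
)


def find_citations(text: str) -> List[Tuple[str, str]]:
    """Return (first-author surname, year) pairs for parenthetical citations."""
    return [(m.group(1), m.group(2)) for m in CITATION_RE.finditer(text)]


def match_reference(citation: Tuple[str, str], references: List[str]) -> Optional[int]:
    """Return the index of the single reference string containing both the
    surname and the year, or None if there is no match or it is ambiguous."""
    surname, year = citation
    hits = [i for i, ref in enumerate(references) if surname in ref and year in ref]
    return hits[0] if len(hits) == 1 else None


# Invented body text and reference list for illustration.
body = "Earlier work (Author and Writer, 2001) was extended (Scholar et al., 2004)."
references = [
    "A. Author and B. Writer. 2001. A made-up title. In Proceedings of an Example Workshop.",
    "C. Scholar, D. Reader, and E. Student. 2004. Another made-up title. Example Journal.",
]

citations = find_citations(body)
links = [match_reference(c, references) for c in citations]
print(list(zip(citations, links)))
```

Returning None on ambiguity is a deliberate choice: in noisy extracted text, a wrong link is usually more harmful than a missing one.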


http://belobog.si.umich.edu/clair/anthology/index.cgi

Tools in development by partners

Automatic Reference Segmentation:

  • ParsCit: open-source reference string parser; also presented at LREC 2008
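
For readers unfamiliar with the task, reference segmentation means splitting a raw reference string into fields. The sketch below is a naive regular-expression baseline that assumes the rigid layout "Authors. Year. Title. Venue."; it is not how ParsCit works, and the pattern, the function name `parse_reference`, and the sample string are all illustrative assumptions.

```python
import re
from typing import Dict, Optional

# Assumed layout: "Authors. Year. Title. Venue." with a four-digit year.
REFERENCE_RE = re.compile(
    r"^(?P<authors>.+?)\.\s+"
    r"(?P<year>(?:19|20)\d{2})\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<venue>.+?)\.?$"
)


def parse_reference(ref: str) -> Optional[Dict[str, str]]:
    """Split a reference string into fields, or return None if the
    string does not follow the assumed layout."""
    match = REFERENCE_RE.match(ref.strip())
    return match.groupdict() if match else None


# Invented reference string for illustration.
fields = parse_reference(
    "A. Author and B. Writer. 2001. A made-up title. "
    "In Proceedings of an Example Workshop."
)
print(fields)
```

A fixed pattern like this breaks as soon as the citation style varies, which is one reason the task is listed under sequence labeling earlier in the talk.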

Automatic Survey Article Generation

  • iOpener: summarization of articles at different expertise levels

Automatic Reference-Article matching

  • Record Linkage: using web data to match articles

Citation Function Classification

  • What’s the purpose of a citation?

Next big application

  • Your work here: please join us. This should be a community-wide effort

Thank you!

http://acl-arc.comp.nus.edu.sg/

Web: “acl arc” home page

dAnth: Digital Anthologies mailing list