
Document Type Recognition and Content Summarization

William Underwood. Persistent Archives Testbed Working Meeting, SDSC, La Jolla, CA, Feb 17-18, 2005.


Presentation Transcript


  1. Document Type Recognition and Content Summarization. William Underwood. Persistent Archives Testbed Working Meeting, SDSC, La Jolla, CA, Feb 17-18, 2005.

  2. Overview • Information Extraction • Machine learning and recognition of document types • Content Extraction • Summarization (Folder titles and Content Notes) • FOIA Review

  3. Access Restriction Checker

  4. Information Extraction • Information extraction (IE) is a procedure that selects, extracts, and combines data from text to produce structured information. • The Named Entity (NE) task is to identify all named persons, organizations, locations, dates, times, monetary amounts, and percentages in text.
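As a rough illustration of the NE task, the following Python sketch tags dates, monetary amounts, and percentages with regular expressions. The patterns, the `find_entities` helper, and the sample sentence are all assumptions made for this example and are not taken from the presentation; recognizing persons, organizations, and locations would need gazetteers and context rules that are beyond this sketch.

```python
import re

# Hypothetical patterns for illustration only; they cover three of the
# entity types named on the slide (dates, monetary amounts, percentages).
PATTERNS = {
    "date": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
    "money": re.compile(r"\$\d[\d,]*(?:\.\d+)?(?: (?:million|billion))?"),
    "percent": re.compile(r"\b\d+(?:\.\d+)?(?:%| percent\b)"),
}


def find_entities(text):
    """Return (type, matched text) pairs ordered by position in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    return [(label, matched) for _, label, matched in sorted(hits)]


# Made-up sentence, not a quotation from any record.
sample = "On March 27, 1990 the committee approved $2.5 million, a 12 percent increase."
entities = find_entities(sample)
print(entities)
```

The output is a flat list of typed strings, which is the kind of structured information that later template-filling steps could consume.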

  5. Letter From George Bush to Ronald Reagan

  6. Named Entity Recognition

  7. Content Extraction Tasks • The Template Element (TE) task is to fill in templates about persons and organizations from an automatic analysis of text. • The Scenario Template (ST) task is to fill in templates about events and their participants (persons, organizations, etc.) from an automatic analysis of text.

  8. Content Extraction Applied to Recognizing Request for Confidential Advice

  9. Content Extraction and Access Restriction Rules
     Action: Request
     Agent: Person
     Job_Title: President
     Object: Analysis of the War Powers Resolution
     Patient: C Boyden Gray
     Job_Title: Counsel to the President
     Presidential_Advisor: C Boyden Gray
     If Document(X), and Action(X) = Request, and Agent(X) = Y, and (Job_Title(Y) = President, or Presidential_Advisor(Y)), and Patient(X) = Z, and Presidential_Advisor(Z), and Object(X) = Information, then Access_Restriction(X) = a(5).
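The rule on this slide can be read as a predicate over a filled template. The sketch below is one possible encoding in Python, not the author's implementation: the `Person` and `ActionTemplate` classes and the `object_type` field are hypothetical, and treating "Analysis of the War Powers Resolution" as an object of type "Information" is an assumption needed to make the rule's last condition checkable.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    job_title: str = ""
    presidential_advisor: bool = False


@dataclass
class ActionTemplate:
    action: str
    agent: Person
    patient: Person
    object_type: str       # assumed category of the Object slot
    object_text: str = ""  # the extracted phrase itself


def access_restriction(doc):
    """Return "a(5)" when every condition of the slide's rule holds, else None."""
    agent_ok = doc.agent.job_title == "President" or doc.agent.presidential_advisor
    if (doc.action == "Request"
            and agent_ok
            and doc.patient.presidential_advisor
            and doc.object_type == "Information"):
        return "a(5)"
    return None


# Template values from the slide; the agent's name is not given there.
request = ActionTemplate(
    action="Request",
    agent=Person(name="(unnamed)", job_title="President"),
    patient=Person("C Boyden Gray", "Counsel to the President", True),
    object_type="Information",  # assumption, see the note above
    object_text="Analysis of the War Powers Resolution",
)
restriction = access_restriction(request)
print(restriction)
```

Keeping the rule as a plain function over a template makes each condition easy to audit against the written rule, at the cost of hard-coding one rule per function.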

  10. Some Document Types in Bush Presidential Electronic Records • Agenda • Biographical Information • Briefing Memo • Decision Memo • Executive Order • Information Memo • White House Letter • List of Candidates for Appointment to Federal Office • Mailing List • Minutes of Meeting • Nomination for Appointment to Federal Office • Press Release • Resume • Schedule • Telephone Call Recommendation

  11. Document Type Recognition • Convert document format to ASCII or HTML • Use Information Extraction Technology to Mark Up Different Document Types • Machine Learning of Document Type through Grammatical Inference • Evaluate Performance • Use for Recognizing Document Types of other Records

  12. Annotated White House Correspondence <date>March 27, 1990</date> <greeting>Dear</greeting><person>Mr. Allen</person> <p>Thank you very much for your letter of <date>March 15, 1990</date> which stated your concerns and suggestions regarding the Americans with Disabilities Act.</p> <p>In order to fulfill <person>President Bush's</person> campaign promise of bringing Americans with handicaps into the mainstream of American life, the Bush Administration supports the objectives of the A.D.A.</p> <p>As you may know, the bill is still in <organization>House Committee</organization> for consideration and change. You can be sure that your thoughts have been fully noted and are appreciated.</p> <formula of respect>Sincerely,</formula of respect> <person>Doug Wead</person> <job title>Special Assistant to the President for Public Liaison</job title> <address><person>Ray Allen</person>, <job title>President</job title> <organization>American Cultural Traditions</organization> <postal address>P.O. Box 1895</postal address> <location>Washington, D.C.</location> <zipcode>20013</zipcode></address>
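A layout description of a letter depends only on the order of its outermost elements, so one plausible preprocessing step is to reduce an annotated document to that sequence. The Python sketch below does this with a depth counter. The `top_level_elements` helper is hypothetical, it assumes open and close tags are balanced, and it uses a regular expression rather than an XML parser because element names such as "formula of respect" contain spaces. The sample is an abbreviated version of the letter on this slide.

```python
import re

# Matches an opening or closing tag and captures the "/" (if any) and the name.
TAG = re.compile(r"<(/?)([^<>]+)>")


def top_level_elements(annotated):
    """List the names of elements that are not nested inside another element."""
    names, depth = [], 0
    for closing, name in TAG.findall(annotated):
        if closing:
            depth -= 1
        else:
            if depth == 0:
                names.append(name)
            depth += 1
    return names


letter = (
    "<date>March 27, 1990</date> <greeting>Dear</greeting><person>Mr. Allen</person> "
    "<p>Thank you for your letter of <date>March 15, 1990</date>.</p> "
    "<formula of respect>Sincerely,</formula of respect> "
    "<person>Doug Wead</person> "
    "<job title>Special Assistant to the President for Public Liaison</job title> "
    "<address><person>Ray Allen</person>, <job title>President</job title></address>"
)
sequence = top_level_elements(letter)
print(sequence)
```

Nested annotations, such as the date inside the first paragraph or the person inside the address, are skipped because they describe content rather than layout.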

  13. Regular Grammar for the Layout of White House Correspondence
      Letter → <date></date> A
      A → <greeting></greeting> B
      B → <p></p> B
      B → <p></p> C
      C → <formula of respect></formula of respect> D
      D → <person></person> E
      E → <job title></job title> F
      F → <address></address>
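A regular grammar of this shape corresponds to a finite automaton: each production reads one element and moves to the next non-terminal. The Python sketch below encodes the productions on this slide as a transition table and simulates them nondeterministically, since B has two productions that both start with a paragraph. The `GRAMMAR` table layout and the `matches_layout` function are illustrative assumptions, not the grammatical-inference system described in the talk, and the element sequences are made up for the example.

```python
# Productions of the form  NonTerminal -> element NextNonTerminal.
# None as the next non-terminal means the derivation ends there.
GRAMMAR = {
    "Letter": [("date", "A")],
    "A": [("greeting", "B")],
    "B": [("p", "B"), ("p", "C")],
    "C": [("formula of respect", "D")],
    "D": [("person", "E")],
    "E": [("job title", "F")],
    "F": [("address", None)],
}


def matches_layout(elements, start="Letter"):
    """Simulate the grammar as a nondeterministic finite automaton."""
    states = {start}
    for element in elements:
        states = {
            nxt
            for state in states if state is not None
            for terminal, nxt in GRAMMAR[state]
            if terminal == element
        }
        if not states:
            return False  # no production can read this element
    return None in states  # accept only if some derivation has finished


good = ["date", "greeting", "p", "p", "p",
        "formula of respect", "person", "job title", "address"]
bad = ["date", "p", "address"]
print(matches_layout(good), matches_layout(bad))
```

Tracking a set of states avoids backtracking over the choice between B → p B and B → p C, so any number of paragraphs (at least one) is accepted.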

  14. Scope and Content Note for John Sununu’s Files: These files contain correspondence from senior level staff in the Executive Office of the President, and from every member of the Cabinet. The material covers issues that faced the Bush Administration from 1989 to 1990, including abortion / fetal research, the Exxon Valdez oil spill, the savings and loan industry, the Clean Air Act, the White House Conference on Global Climate Change, relations with China following the student demonstrations in Tiananmen Square, the National Drug Control Strategy, the 1990 Bipartisan Budget Agreement, the spotted owl issue, the Americans with Disabilities Act, and the nomination of Supreme Court Justice David Souter. The files include correspondence, routine reports, press releases, press clippings, papers produced by organizations outside the Administration, and speech drafts.

  15. Relationship to Persistent Archives Testbed • Information extraction, document type learning and recognition, and series summarization will be provided as Archival Services within the NARA Persistent Archives Prototype, and could be provided within the PAT.

  16. Additional Information • http://perpos.gtri.gatech.edu • Archival Processing Tools: User Manual • An Analysis of the Knowledge Required to Perform FOIA and PRA Review, PERPOS Technical Report ITTL/CSITD 04-1, Mar 2004 • PERPOS: Results of Laboratory Experiments and Use by Archivists, Nov 2003 • Recognizing Named Entities in Presidential Electronic Records, PERPOS Technical Report ITTL/CSITD 04-4, June 2004
