
Computer-Assisted Email Appraisal: RATOM SAA Research Forum Archives 2019

This presentation discusses the development of software for selection and appraisal of email collections, focusing on the Review, Appraisal, and Triage of Mail (RATOM) project. It explores the challenges of processing large PST email corpora and the use of machine learning techniques for identifying candidate materials for retention. It also discusses development efforts related to parsing and processing the PST, OST, and mbox email formats.



Presentation Transcript


  1. Computer-Assisted Appraisal of Email: RATOM SAA Research Forum Archives 2019 Austin, TX August 2, 2019 Christopher (Cal) Lee University of North Carolina at Chapel Hill School of Information and Library Science

  2. Motivation – Selection/Appraisal • Despite progress on various technologies to support data management and digital preservation, relatively little progress on software support for the core activities of selection and appraisal • Selection/appraisal decisions are based on various patterns • When patterns can be identified algorithmically, software can assist the process • LAMs frequently want to take actions that reflect contextual relationships • Timeline representations and visualizations can also provide useful, high-level views of materials

  3. Motivation - Email • 48 years of email creation • Hundreds of billions of messages generated every day • Most has little long-term retention value, but some absolutely does • Despite presence of numerous other modalities, email still deeply embedded in activities, serving as massive source of evidence and information • Often found in collections and acquisitions with other types of materials http://hci.stanford.edu/~jheer/projects/enron/v1/

  4. Review, Appraisal and Triage of Mail (RATOM) • Funded by Andrew W. Mellon Foundation (2019-2020) • Developing and repurposing software (including NLP and machine learning) for selection/appraisal in BitCurator environment with hooks and enhancements to TOMES output • Support iterative processing - information discovered at various points in the processing workflow can support further selection, redaction or description actions • Mapping of timestamp, entity, sensitive features and other elements across the tools Ray Tomlinson https://upload.wikimedia.org/wikipedia/commons/0/01/Ray_Tomlinson_%28cropped%29.jpg

  5. NC DAR and UNC SILS Personnel: Sangeeta Desai (DAR Technical Lead), Anusha Suresh (Project Manager), Antoine De Torcy (Software Engineer), Jamie Patrick-Burns (Investigator), Cal Lee (PI), Kam Woods (Technical Lead), Camille Tyndall Watson (Co-PI)

  6. Scope • Development of an integrated Python library to simplify parsing and processing of PST, OST, and mbox email formats • Utilities / wrappers to support entity identification and export entity reports in a format suitable for conducting automated and human-directed redaction actions at scale • Utilities to apply machine learning techniques (by training on annotated message collections and/or unsupervised) to recognize candidate materials for retention • Development of an interface allowing processing archivists to browse email collections and mark messages as suitable for retention *RATOM is in early development. This presentation examines some development efforts related to the first two goals.

  7. Processing large PST email corpora

  8. Why?

  9. The Personal Storage Table (PST) format is proprietary, complicated, monolithic, inefficient, and insecure. Microsoft started warning enterprise orgs to avoid PST use in the early 2000s. https://github.com/libyal/documentation/blob/master/PFF%20Forensics%20-%20analyzing%20the%20horrible%20reference%20file%20format.pdf https://techtalk.gfi.com/why-is-using-psts-in-2016-a-bad-idea/

  10. Lots of collecting orgs have unprocessed PSTs.

  11. Open source forensic libraries (libpff and libuna) and NLP libraries (spaCy), along with some multiprocessing code in Python 3, allow us to quickly process email from various sources - PST, OST, MBOX, and EML, among others - and generate feature files that are verifiable, reproducible, and reusable.
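
To make the general approach concrete, below is a minimal sketch that reads messages from an mbox file with Python's standard mailbox module and extracts named entities with spaCy. It is illustrative only: it does not use libratom or libpff, and the file name and the en_core_web_sm model are assumptions.

# Minimal sketch: extract named-entity feature rows from an mbox file with spaCy.
# Illustrative only; this is not libratom code. Assumes en_core_web_sm is installed
# (python -m spacy download en_core_web_sm) and "sample.mbox" is a placeholder path.
import mailbox

import spacy

nlp = spacy.load("en_core_web_sm")

def message_text(msg):
    # Best-effort plain-text body of an email.message.Message.
    if msg.is_multipart():
        parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]
    else:
        parts = [msg]
    chunks = []
    for part in parts:
        payload = part.get_payload(decode=True)
        if payload:
            chunks.append(payload.decode(part.get_content_charset() or "utf-8", errors="replace"))
    return "\n".join(chunks)

for key, msg in mailbox.mbox("sample.mbox").items():
    doc = nlp(message_text(msg))
    for ent in doc.ents:
        # One feature row per entity: text, label, source file, message key
        print(f"{ent.text} | {ent.label_} | sample.mbox | {key}")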

  12. To support future ML tasks, we need a tool that: • Can reliably and efficiently extract both simple features and those that require language parsing • Supports easy swap-in of different models for features that require training for identification (e.g., specific entities) • Scales effectively when working with large (1TB+) collections • Exports data that supports reuse and comparison
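
As an illustration of the scaling and export requirements above, the sketch below distributes spaCy entity extraction across worker processes with Python's multiprocessing module and writes one CSV row per entity for later reuse and comparison. It is not libratom's implementation; the model name, helper functions, and output path are placeholders.

# Illustrative sketch, not libratom code: parallel entity extraction plus CSV export.
import csv
import multiprocessing as mp

import spacy

MODEL_NAME = "en_core_web_sm"  # any installed spaCy model could be swapped in here
_nlp = None

def _init_worker():
    # Load the model once per worker process rather than once per message.
    global _nlp
    _nlp = spacy.load(MODEL_NAME)

def _extract(args):
    source, message_id, text = args
    return [(ent.text, ent.label_, source, message_id) for ent in _nlp(text).ents]

def export_entities(messages, out_path="entities.csv", processes=4):
    # messages: iterable of (source, message_id, body_text) tuples
    with mp.Pool(processes=processes, initializer=_init_worker) as pool, \
            open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["entity", "label", "source", "message_id"])
        for rows in pool.imap_unordered(_extract, messages):
            writer.writerows(rows)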

  13. libratom Available on GitHub at: https://github.com/libratom/libratom Releases on PyPI at: https://pypi.org/project/libratom/ libratom provides a core library to assist in reading PST, OST, MBOX, and EML sources, along with a CLI. More coming soon!

  14. Sample rows from the generated entity table:
883381 | the Department of Environmental Protection | ORG | david_delainey_000_1_2.pst | 2325380
883382 | Cellucci | PERSON | david_delainey_000_1_2.pst | 2325380
883383 | the United States | GPE | david_delainey_000_1_2.pst | 2325380
883384 | five | CARDINAL | david_delainey_000_1_2.pst | 2325380
883385 | six | CARDINAL | david_delainey_000_1_2.pst | 2325380
883386 | Jane Swift | PERSON | david_delainey_000_1_2.pst | 2325380
883387 | the Department of Environmental Protection | ORG | david_delainey_000_1_2.pst | 2325380
883388 | six | CARDINAL | david_delainey_000_1_2.pst | 2325380
883389 | the next few months | DATE | david_delainey_000_1_2.pst | 2325380
883390 | 1.5 | CARDINAL | david_delainey_000_1_2.pst | 2325380
883391 | 3 pounds | QUANTITY | david_delainey_000_1_2.pst | 2325380
883392 | megawatt-hour | TIME | david_delainey_000_1_2.pst | 2325380
883393 | five | CARDINAL | david_delainey_000_1_2.pst | 2325380
883394 | Sithe Energies, Inc. | ORG | david_delainey_000_1_2.pst | 2325380
Model: spaCy en_core_web_sm, trained on OntoNotes 5 (stats reported for raw text, no gold reference). With the current command-line interface, different models (including user-trained models) can be loaded on demand for specific tasks / languages.
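
At the library level, swapping in a different model (including a user-trained pipeline) amounts to passing a different name or path to spacy.load, as in the brief sketch below. The custom model path is hypothetical, and the ratom command-line options for model selection are not shown here.

# Sketch of model swap-in at the spaCy level; paths/names are placeholders.
import spacy

nlp = spacy.load("en_core_web_sm")               # packaged English model
# nlp = spacy.load("/path/to/user_trained_ner")  # directory produced by `spacy train`

for ent in nlp("Jane Swift met with the Department of Environmental Protection.").ents:
    print(ent.text, ent.label_)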

  15. Processing Governor Jim Hunt Email (1997-2001) Approximately 2.5GB, with 41 files, containing 77,818 messages. Scanning the directory structure of the PST files for the complete corpus with the ratom tool requires ~8 seconds on a 16-core machine. Performing entity extraction (via spaCy using the en_core_web_sm model) requires ~15 minutes on a 16-core machine. Memory usage is bounded for the spaCy configuration and number of processes. For 32 processes, accessible memory is ~1.6GB/process, resident memory is ~500MB/process on average. Generates a 64MB sqlite3 file containing 1,374,086 entity instances.

  16. Processing Governor Jeb Bush Email Approximately 7.2GB, with 11 files, containing 251,509 messages. Scanning the directory structure of the PST files for the complete corpus with the ratom tool requires ~3 minutes on a 16-core machine. Performing entity extraction (via spaCy using the en_core_web_sm model) requires ~1 hr 15 minutes on a 16-core machine. Memory usage is bounded for the spaCy configuration and number of processes. For 32 processes, accessible memory is ~1.6GB/process, resident memory is ~500MB/process on average. Generates a 330MB sqlite3 file containing 7,655,587 entity instances.

  17. Processing EDRM v1.3 Enron (Redacted) Email Approximately 54GB, with 191 files, containing 758,341 messages. Scanning the directory structure of the PST files for the complete corpus with the ratom tool requires ~4 minutes on a 16-core machine. Performing entity extraction (via spaCy using the en_core_web_sm model) requires ~3 hrs on a 16-core machine. Memory usage is bounded for the spaCy configuration and number of processes. For 32 processes, accessible memory is ~1.6GB/process, resident memory is ~500MB/process on average. Generates a 991MB sqlite3 file containing 18,557,722 entity instances.
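
Because each of the runs above produces a sqlite3 file of entity instances, results for the Hunt, Bush, and Enron corpora can be compared with simple SQL queries. The sketch below tallies entities by type; the table and column names ("entities", "label") are assumptions for illustration and may not match the schema actually written by the ratom tool.

# Hypothetical query sketch; the real schema produced by ratom may differ.
import sqlite3

conn = sqlite3.connect("enron_entities.sqlite3")  # placeholder filename
for label, count in conn.execute(
    "SELECT label, COUNT(*) AS n FROM entities GROUP BY label ORDER BY n DESC"
):
    print(f"{label:12} {count}")
conn.close()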

  18. Releases and Code Quality The libratom library and supporting utilities are written in Python 3 and are currently in early development. We've engineered the repository to automatically build and test every commit with Travis CI, and to validate code quality and coverage with Codacy and Codecov. Development branches that pass these tests are merged into our master repository. Both development and tagged releases are automatically built and pushed to the Python Package Index (PyPI). Releases (development snapshots and major/minor tags) are available directly from PyPI as soon as they appear on GitHub. https://pypi.org/project/libratom/

  19. ml4arc - Machine Learning, Deep Learning, and Natural Language Processing Applications in Archives (July 26, 2019) http://ratom.web.unc.edu/ml4arc/ml4arc/

  20. Project info, news, and blog posts: https://ratom.web.unc.edu/ Core library: https://github.com/libratom/libratom Demonstration notebooks: https://github.com/libratom/ratom-notebooks @RATOM_Project
