
Computer-Assisted Email Appraisal: RATOM SAA Research Forum Archives 2019

This presentation discusses the development of software for selection and appraisal of email collections, focusing on the Review, Appraisal, and Triage of Mail (RATOM) project. It explores the challenges of processing large PST email corpora and the use of machine learning techniques for identifying candidate materials for retention. It also discusses development efforts related to parsing and processing the PST, OST, and mbox email formats.



Presentation Transcript


  1. Computer-Assisted Appraisal of Email: RATOM SAA Research Forum Archives 2019 Austin, TX August 2, 2019 Christopher (Cal) Lee University of North Carolina at Chapel Hill School of Information and Library Science

  2. Motivation – Selection/Appraisal • Despite progress on various technologies to support data management and digital preservation, relatively little progress on software support for the core activities of selection and appraisal • Selection/appraisal decisions are based on various patterns • When patterns can be identified algorithmically, software can assist the process • LAMs frequently want to take actions that reflect contextual relationships • Timeline representations and visualizations can also provide useful, high-level views of materials

  3. Motivation - Email • 48 years of email creation • Hundreds of billions of messages generated every day • Most has little long-term retention value, but some absolutely does • Despite presence of numerous other modalities, email still deeply embedded in activities, serving as massive source of evidence and information • Often found in collections and acquisitions with other types of materials http://hci.stanford.edu/~jheer/projects/enron/v1/

  4. Review, Appraisal and Triage of Mail (RATOM) • Funded by Andrew W. Mellon Foundation (2019-2020) • Developing and repurposing software (including NLP and machine learning) for selection/appraisal in BitCurator environment with hooks and enhancements to TOMES output • Support iterative processing - information discovered at various points in the processing workflow can support further selection, redaction or description actions • Mapping of timestamp, entity, sensitive features and other elements across the tools Ray Tomlinson https://upload.wikimedia.org/wikipedia/commons/0/01/Ray_Tomlinson_%28cropped%29.jpg

  5. NC DAR and UNC SILS Personnel: Sangeeta Desai (DAR Technical Lead), Anusha Suresh (Project Manager), Antoine De Torcy (Software Engineer), Jamie Patrick-Burns (Investigator), Cal Lee (PI), Kam Woods (Technical Lead), Camille Tyndall Watson (Co-PI)

  6. Scope • Development of an integrated Python library to simplify parsing and processing of PST, OST, and mbox email formats • Utilities / wrappers to support entity identification and export entity reports in a format suitable for conducting automated and human-directed redaction actions at scale • Utilities to apply machine learning techniques (by training on annotated message collections and/or unsupervised) to recognize candidate materials for retention • Development of an interface allowing processing archivists to browse email collections and mark messages as suitable for retention *RATOM is in early development. This presentation examines some development efforts related to the first two goals.

  7. Processing large PST email corpora

  8. Why?

  9. The Personal Storage Table (PST) format is proprietary, complicated, monolithic, inefficient, and insecure. Microsoft started warning enterprise orgs to avoid PST use in the early 2000s. https://github.com/libyal/documentation/blob/master/PFF%20Forensics%20-%20analyzing%20the%20horrible%20reference%20file%20format.pdf https://techtalk.gfi.com/why-is-using-psts-in-2016-a-bad-idea/

  10. Lots of collecting orgs have unprocessed PSTs.

  11. Open source forensic libraries (libpff and libuna) and NLP libraries (spaCy), along with some multiprocessing code in Python 3, allow us to quickly process email from various sources - PST, OST, MBOX, and EML, among others - and generate feature files that are verifiable, reproducible, and reusable.
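
To make the general approach concrete, below is a minimal sketch that reads messages from an mbox file with Python's standard mailbox module and extracts named entities with spaCy. It is illustrative only: it does not use libratom or libpff, and the file name and the en_core_web_sm model are assumptions.

# Minimal sketch: extract named-entity feature rows from an mbox file with spaCy.
# Illustrative only; this is not libratom code. Assumes en_core_web_sm is installed
# (python -m spacy download en_core_web_sm) and "sample.mbox" is a placeholder path.
import mailbox

import spacy

nlp = spacy.load("en_core_web_sm")

def message_text(msg):
    # Best-effort plain-text body of an email.message.Message.
    if msg.is_multipart():
        parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]
    else:
        parts = [msg]
    chunks = []
    for part in parts:
        payload = part.get_payload(decode=True)
        if payload:
            chunks.append(payload.decode(part.get_content_charset() or "utf-8", errors="replace"))
    return "\n".join(chunks)

for key, msg in mailbox.mbox("sample.mbox").items():
    doc = nlp(message_text(msg))
    for ent in doc.ents:
        # One feature row per entity: text, label, source file, message key
        print(f"{ent.text} | {ent.label_} | sample.mbox | {key}")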

  12. To support future ML tasks, we need a tool that: • Can reliably and efficiently extract both simple features and those that require language parsing • Supports easy swap-in of different models for features that require training for identification (e.g., specific entities) • Scales effectively when working with large (1TB+) collections • Exports data that supports reuse and comparison
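
As an illustration of the scaling and export requirements above, the sketch below distributes spaCy entity extraction across worker processes with Python's multiprocessing module and writes one CSV row per entity for later reuse and comparison. It is not libratom's implementation; the model name, helper functions, and output path are placeholders.

# Illustrative sketch, not libratom code: parallel entity extraction plus CSV export.
import csv
import multiprocessing as mp

import spacy

MODEL_NAME = "en_core_web_sm"  # any installed spaCy model could be swapped in here
_nlp = None

def _init_worker():
    # Load the model once per worker process rather than once per message.
    global _nlp
    _nlp = spacy.load(MODEL_NAME)

def _extract(args):
    source, message_id, text = args
    return [(ent.text, ent.label_, source, message_id) for ent in _nlp(text).ents]

def export_entities(messages, out_path="entities.csv", processes=4):
    # messages: iterable of (source, message_id, body_text) tuples
    with mp.Pool(processes=processes, initializer=_init_worker) as pool, \
            open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["entity", "label", "source", "message_id"])
        for rows in pool.imap_unordered(_extract, messages):
            writer.writerows(rows)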

  13. libratom Available on GitHub at: https://github.com/libratom/libratom Releases on PyPI at: https://pypi.org/project/libratom/ libratom provides a core library to assist in reading PST, OST, MBOX, and EML sources, along with a CLI. More coming soon!

  14. Sample rows from the generated entity table:
883381 | the Department of Environmental Protection | ORG | david_delainey_000_1_2.pst | 2325380
883382 | Cellucci | PERSON | david_delainey_000_1_2.pst | 2325380
883383 | the United States | GPE | david_delainey_000_1_2.pst | 2325380
883384 | five | CARDINAL | david_delainey_000_1_2.pst | 2325380
883385 | six | CARDINAL | david_delainey_000_1_2.pst | 2325380
883386 | Jane Swift | PERSON | david_delainey_000_1_2.pst | 2325380
883387 | the Department of Environmental Protection | ORG | david_delainey_000_1_2.pst | 2325380
883388 | six | CARDINAL | david_delainey_000_1_2.pst | 2325380
883389 | the next few months | DATE | david_delainey_000_1_2.pst | 2325380
883390 | 1.5 | CARDINAL | david_delainey_000_1_2.pst | 2325380
883391 | 3 pounds | QUANTITY | david_delainey_000_1_2.pst | 2325380
883392 | megawatt-hour | TIME | david_delainey_000_1_2.pst | 2325380
883393 | five | CARDINAL | david_delainey_000_1_2.pst | 2325380
883394 | Sithe Energies, Inc. | ORG | david_delainey_000_1_2.pst | 2325380
Model: spaCy en_core_web_sm, trained on OntoNotes 5 (stats reported for raw text, no gold reference). With the current command-line interface, different models (including user-trained models) can be loaded on demand for specific tasks / languages.
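
At the library level, swapping in a different model (including a user-trained pipeline) amounts to passing a different name or path to spacy.load, as in the brief sketch below. The custom model path is hypothetical, and the ratom command-line options for model selection are not shown here.

# Sketch of model swap-in at the spaCy level; paths/names are placeholders.
import spacy

nlp = spacy.load("en_core_web_sm")               # packaged English model
# nlp = spacy.load("/path/to/user_trained_ner")  # directory produced by `spacy train`

for ent in nlp("Jane Swift met with the Department of Environmental Protection.").ents:
    print(ent.text, ent.label_)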

  15. Processing Governor Jim Hunt Email (1997-2001) Approximately 2.5GB, with 41 files, containing 77,818 messages. Scanning the directory structure of the PST files for the complete corpus with the ratom tool requires ~8 seconds on a 16-core machine. Performing entity extraction (via spaCy using the en_core_web_sm model) requires ~15 minutes on a 16-core machine. Memory usage is bounded for the spaCy configuration and number of processes. For 32 processes, accessible memory is ~1.6GB/process, resident memory is ~500MB/process on average. Generates a 64MB sqlite3 file containing 1,374,086 entity instances.

  16. Processing Governor Jeb Bush Email Approximately 7.2GB, with 11 files, containing 251,509 messages. Scanning the directory structure of the PST files for the complete corpus with the ratom tool requires ~3 minutes on a 16-core machine. Performing entity extraction (via spaCy using the en_core_web_sm model) requires ~1 hr 15 minutes on a 16-core machine. Memory usage is bounded for the spaCy configuration and number of processes. For 32 processes, accessible memory is ~1.6GB/process, resident memory is ~500MB/process on average. Generates a 330MB sqlite3 file containing 7,655,587 entity instances.

  17. Processing EDRM v1.3 Enron (Redacted) Email Approximately 54GB, with 191 files, containing 758,341 messages. Scanning the directory structure of the PST files for the complete corpus with the ratom tool requires ~4 minutes on a 16-core machine. Performing entity extraction (via spaCy using the en_core_web_sm model) requires ~3 hrs on a 16-core machine. Memory usage is bounded for the spaCy configuration and number of processes. For 32 processes, accessible memory is ~1.6GB/process, resident memory is ~500MB/process on average. Generates a 991MB sqlite3 file containing 18,557,722 entity instances.
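
Because each of the runs above produces a sqlite3 file of entity instances, results for the Hunt, Bush, and Enron corpora can be compared with simple SQL queries. The sketch below tallies entities by type; the table and column names ("entities", "label") are assumptions for illustration and may not match the schema actually written by the ratom tool.

# Hypothetical query sketch; the real schema produced by ratom may differ.
import sqlite3

conn = sqlite3.connect("enron_entities.sqlite3")  # placeholder filename
for label, count in conn.execute(
    "SELECT label, COUNT(*) AS n FROM entities GROUP BY label ORDER BY n DESC"
):
    print(f"{label:12} {count}")
conn.close()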

  18. Releases and Code Quality The libratom library and supporting utilities are written in Python 3 and are currently in early development. We've engineered the repository to automatically build and test every commit with Travis CI, and to validate code quality and coverage with Codacy and Codecov. Development branches that pass these tests are merged into our master repository. Both development and tagged releases are automatically built and pushed to the Python Package Index (PyPI). Releases (development snapshots and major/minor tags) are available directly from PyPI as soon as they appear on GitHub. https://pypi.org/project/libratom/

  19. ml4arc - Machine Learning, Deep Learning, and Natural Language Processing Applications in Archives (July 26, 2019) http://ratom.web.unc.edu/ml4arc/ml4arc/

  20. Project info, news, and blog posts: https://ratom.web.unc.edu/ Core library: https://github.com/libratom/libratom Demonstration notebooks: https://github.com/libratom/ratom-notebooks @RATOM_Project
