
USING NLP TO MAKE UNSTRUCTURED DATA HIGHLY ACCESSIBLE

Machine learning at Aker BP. William Naylor, Masa Nekic, Peder Aursand, Vidar Hjemmeland Brekke. 20th September, 2019.


Presentation Transcript


  1. USING NLP TO MAKE UNSTRUCTURED DATA HIGHLY ACCESSIBLE Machine learning at Aker BP William Naylor, Masa Nekic, Peder Aursand, Vidar Hjemmeland Brekke 20th September, 2019

  2. Document search engine customised for oil and gas documents PrettyPoly • Polygon search • Geotagging • Advanced query builder • Collaboration/sharing • Admin panel • Document engine • Sensitive content filtering • Document tagging

  3. PrettyPoly’s document engine: PDF, Word, Excel, ..... — all docs converted to JSON, enriched with doc type, language, keywords, and ML tags used for filtering

  4. The data: extremely varied text documents • Around 3 million documents • Currently 20 classes (with labelled data) • Between 60 and 2,000 examples per class • Large ‘Undefined’ class, with some ‘known unknowns’

  5. User feedback: Additional data

  6. Demands on ML classifier. Needs to handle: • an undefined class (open set) • growing numbers of classes • growing numbers of examples per class (great) • multiple languages • extremely varied texts • long texts • illogical sentence structure (imagine the text from an Excel spreadsheet) • varied class importance

  7. ML classification: 20-class open set classification • Load JSON content • Apply preprocessing (regex, stemming) • TF-IDF encoding • Train a simple ‘yes/no’ classifier for EACH CLASS (peer review, contracts, mud report, ...) • Loop through the classifiers per category, predicting probability • Pick the highest, or ‘Unknown’ if less than 0.6
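The pipeline on this slide could be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation: the slide only specifies regex/stemming preprocessing, TF-IDF encoding, one yes/no classifier per class, and an ‘Unknown’ fallback below 0.6; the scikit-learn choices, function names, and class names here are assumptions (stemming is omitted for brevity).

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Example class names taken from the slide
classes = ["peer_review", "contracts", "mud_report"]

def preprocess(text):
    # Regex cleanup step from the slide (stemming omitted here)
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

def train(docs, labels):
    # TF-IDF encoding over the preprocessed documents
    vec = TfidfVectorizer(preprocessor=preprocess)
    X = vec.fit_transform(docs)
    # One simple 'yes/no' (binary) classifier per class
    models = {}
    for c in classes:
        y = [1 if label == c else 0 for label in labels]
        models[c] = LogisticRegression(max_iter=1000).fit(X, y)
    return vec, models

def predict(vec, models, doc, threshold=0.6):
    # Loop through the per-category classifiers, predicting probability;
    # pick the highest, or 'Unknown' if it falls below the threshold
    x = vec.transform([doc])
    probs = {c: m.predict_proba(x)[0, 1] for c, m in models.items()}
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else "Unknown"
```

The per-class binary setup is what makes the scheme open-set friendly: a document that no classifier claims confidently falls through to ‘Unknown’, and new classes can be added by training one more binary model without retraining the rest.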

  8. Adding additional fields • Random forest / decision tree / XGBoost all handle any type of feature • For text, keep inputs sparse

  9. Using unlabelled data. A lot of information lies in the unlabelled data, and the labelled data won’t be a representative sample. Idea 1: take (some) random unlabelled data and label it as “Undefined” in training. Results: • no additional data: acc 0.90 (0.91 with ROS) • 1K added: 0.90 with ROS • 5K added: acc 0.87 (0.89 with ROS). Idea 2: train the model initially, predict on unlabelled data, add data with probability over 0.8 to the target class and data with probability under 0.2 to “Undefined”, then retrain. (Haven’t tested this; it does work in many other cases, but we don’t believe it will help with the sampling problem.)
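Idea 2 is a form of self-training with pseudo-labels, and could be sketched as below. The 0.8 / 0.2 thresholds are from the slide; everything else (the classifier choice, the function shape, doing a single pseudo-labelling round) is an illustrative assumption, and the slide notes this idea was not actually tested by the presenters.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(labelled_docs, labels, unlabelled_docs, target_class):
    # Train the model initially on the labelled data
    vec = TfidfVectorizer()
    X = vec.fit_transform(labelled_docs)
    y = np.array([1 if label == target_class else 0 for label in labels])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Predict on the unlabelled data and pseudo-label the confident docs
    probs = clf.predict_proba(vec.transform(unlabelled_docs))[:, 1]
    new_docs, new_y = [], []
    for doc, p in zip(unlabelled_docs, probs):
        if p > 0.8:        # confident positive -> target class
            new_docs.append(doc)
            new_y.append(1)
        elif p < 0.2:      # confident negative -> "Undefined"
            new_docs.append(doc)
            new_y.append(0)

    # Retrain on the augmented training set
    X2 = vec.transform(labelled_docs + new_docs)
    clf2 = LogisticRegression(max_iter=1000).fit(X2, np.concatenate([y, new_y]))
    return vec, clf2
```

A known risk with this scheme, consistent with the slide's scepticism, is that confident mistakes get reinforced on retraining, so it tends not to fix a biased labelled sample.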

  10. Suppressing overfitting. Overfitting can be a major problem: ~4K training examples against ~20K features (words). Loop over models in training and pick the best against a dev set. Frequently a log reg or DT overfits; forcing an RF (lower dev accuracy) gives better test results.
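The model-selection loop described here could look roughly like this. The candidate list matches the model families the slides mention (log reg, decision tree, random forest), but the function shape and scoring are illustrative assumptions, and it selects purely on dev accuracy rather than forcing an RF.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def pick_best(X_train, y_train, X_dev, y_dev):
    # Candidate models to loop over during training
    candidates = [
        LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(random_state=0),
        RandomForestClassifier(n_estimators=100, random_state=0),
    ]
    best, best_acc = None, -1.0
    for model in candidates:
        model.fit(X_train, y_train)
        acc = model.score(X_dev, y_dev)  # accuracy on the held-out dev set
        if acc > best_acc:
            best, best_acc = model, acc
    return best, best_acc
```

Scoring on a held-out dev set rather than training accuracy is what catches the overfit log reg / decision tree cases; the slide's further observation is that even this can be optimistic, and a deliberately more regularised model can win on the final test set.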

  11. Summary and future plans. Topics covered: • document enrichment as a part of PrettyPoly • built an open set classifier for long documents • user feedback as part of the training loop. Not covered (feel free to ask me): • preprocessing • encoding schemes • handling of sensitive classes (contracts). Future ideas / problems: • model evaluation • numbers and Excel spreadsheets • clustering • explicit filtering for some classes (regex rules)
