
Project Omniglean

Project Omniglean. Kenny Trytek, Joe Briggie, Abby Birkett, Derek Woods. Advisor: Simanta Mitra. Client: Matt Good, Kingland Systems. Problem Statement: Large companies have many layers of corporate hierarchy.


Presentation Transcript


  1. Project Omniglean • Team: Kenny Trytek, Joe Briggie, Abby Birkett, Derek Woods • Advisor: Simanta Mitra • Client: Matt Good, Kingland Systems

  2. Problem Statement • Large companies have many layers of corporate hierarchy. • Financial and data records sometimes conflict between the various layers and entities. • Accurate and comprehensive company records are needed for auditing and stock conflict resolution. • There is a need for “Data Mastering”: taking multiple conflicting sources of data and determining the reality of the matter.
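To make the "Data Mastering" idea concrete, here is a minimal Python sketch of one possible reconciliation rule: take the value reported by the most sources and break ties in favor of the earliest source. The slides do not say which rule the project used, so the function name `master_value` and the majority-vote strategy are illustrative assumptions only.

```python
from collections import Counter


def master_value(candidates):
    """Pick one value from several conflicting source values.

    Hypothetical rule (not from the slides): the most frequently
    reported non-empty value wins; ties go to the earliest source.
    """
    # Drop missing or blank values and trim surrounding whitespace.
    cleaned = [c.strip() for c in candidates if c and c.strip()]
    if not cleaned:
        return None
    counts = Counter(cleaned)
    best = max(counts.values())
    # Walk in source order so ties resolve to the earliest source.
    for value in cleaned:
        if counts[value] == best:
            return value
```

A real mastering step would likely also weigh source reliability and normalize spellings before comparing; this sketch only compares exact strings.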

  3. Basic Requirements • System shall autonomously traverse publicly available websites and collect information • System shall store parsed information in a flat file • System shall maintain a normalized database • System shall expose functionality through web services • A single run of the system shall complete execution in less than six hours

  4. Design Decisions • Implementation in C# • ASP.NET GUI with jQuery UI widgets • Operable in a Windows environment (XP or later) Risks • Site data structures or hierarchies can change at any time • Reliance on a third-party PDF text parser, grid control, and AJAX library • Inconsistencies in data

  5. System Diagram [Diagram labels: WWW Data, Scraper Tool, HTML Parser, PDF Parser, Flat File, ETL Tool, “No Conflicts?”, DAL (Create, Read, Update, Delete), Normalized Database, Web Svcs., External Client UI, Kingland Data Analyst UI]

  6. Harvester Module [Diagram labels: World Wide Web, Scraper, Parser (HTML Parser, PDF Parser), Flat File (XML)] • The harvester performs the work of gathering data from the external sites • After the data is scraped and parsed, the harvester constructs XML files for each data source • Finally, the ETL is notified that the data is ready
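The step where the harvester writes one XML file per data source could look roughly like the Python sketch below. The project itself was written in C#, and the slides do not show the file layout, so the element names (`source`, `record`) and the function `build_source_xml` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_source_xml(source_name, records):
    """Serialize parsed records for one data source as an XML string.

    Hypothetical layout: <source name="..."><record><field>...</field></record></source>.
    Each key in a record must be a valid XML element name.
    """
    root = ET.Element("source", {"name": source_name})
    for record in records:
        record_el = ET.SubElement(root, "record")
        for field, value in record.items():
            # ElementTree escapes &, < and > in text when serializing.
            ET.SubElement(record_el, field).text = value
    return ET.tostring(root, encoding="unicode")
```

Letting the XML library handle escaping at write time reduces the amount of malformed markup the downstream ETL step has to repair.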

  7. Harvester Difficulties • Constructing a POST request to retrieve the PDFs required extracting a complex view state • Extracting text from PDFs was difficult • The extracted text was inconsistent • City names were occasionally malformed • Extra formatting characters were present in the extracted text
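The view-state difficulty can be illustrated with a small Python sketch that reads the hidden form fields from a page and echoes them back in a POST body. It assumes the target pages keep their state in hidden `<input>` fields, as ASP.NET Web Forms does with `__VIEWSTATE`; the slides do not name the sites or their framework, and `HiddenFieldParser`, `build_post_body`, and the `docId` field are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_post_body(page_html, extra_fields):
    """Return a URL-encoded POST body: hidden fields plus caller-supplied fields."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    form = dict(parser.fields)
    form.update(extra_fields)
    return urlencode(form)
```

The returned string would be sent as the body of the POST request with the `application/x-www-form-urlencoded` content type.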

  8. ETL (Extract, Transform, Load) [Diagram labels: Flat File (XML), ETL Tool, DAL] • The ETL performs cleanup operations on the data from the harvester • If there are malformed tags or invalid characters, they are escaped here • Maintains an error log • Loads data into the database through the DAL (Data Access Layer)
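A cleanup pass of the kind described on this slide might look like the following Python sketch. It is an assumption about what "cleanup" involves, not the project's C# code: control characters that XML 1.0 forbids are removed (they cannot be escaped), markup characters are escaped, and every removal is written to a log. The name `clean_field` and the logger name `etl` are hypothetical.

```python
import logging
import re
from xml.sax.saxutils import escape

# C0 control characters that XML 1.0 forbids (tab, LF and CR are allowed).
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_field(raw, log=logging.getLogger("etl")):
    """Remove forbidden control characters, trim, and escape &, < and >."""
    stripped = _INVALID_XML_CHARS.sub("", raw)
    if stripped != raw:
        # Record the problem so it appears in the ETL error log.
        log.warning("removed %d invalid character(s)", len(raw) - len(stripped))
    return escape(stripped.strip())
```

For example, a city name extracted from a PDF with a stray form-feed character comes back without it, and an ampersand in a company name becomes `&amp;`.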

  9. ETL Difficulties • Implementing multi-threaded execution for better performance • Dealing with malformed input
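As a rough illustration of multi-threaded ETL execution, the Python sketch below transforms records on a thread pool. The slides do not describe the project's threading design, so the per-record `transform` step, the `run_etl` function, and the default of four workers are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record):
    """Placeholder per-record cleanup: trim whitespace from every value."""
    return {key: value.strip() for key, value in record.items()}


def run_etl(records, workers=4):
    """Transform records on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform, records))
```

Because `pool.map` returns results in input order, the load step can treat the output exactly like the output of a sequential loop.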

  10. DAL (Data Access Layer) [Diagram labels: ETL Tool, User Interface, DAL (Add(), Find(), Update(), Delete()), Database] • Maintains a normalized MySQL database • Provides CRUD operations (Create, Read, Update, Delete) DAL Difficulties • No particular difficulties encountered in database creation
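The four DAL operations named on this slide (Add, Find, Update, Delete) can be sketched in Python as below. To keep the example self-contained it uses an in-memory SQLite database in place of the project's MySQL database, and the `company` table and its columns are hypothetical.

```python
import sqlite3


class CompanyDal:
    """Minimal CRUD wrapper around a single hypothetical 'company' table."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS company ("
            "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
        )

    def add(self, name, city):
        cur = self.conn.execute(
            "INSERT INTO company (name, city) VALUES (?, ?)", (name, city)
        )
        self.conn.commit()
        return cur.lastrowid

    def find(self, company_id):
        # Returns an (id, name, city) tuple, or None if no row matches.
        return self.conn.execute(
            "SELECT id, name, city FROM company WHERE id = ?", (company_id,)
        ).fetchone()

    def update(self, company_id, name, city):
        cur = self.conn.execute(
            "UPDATE company SET name = ?, city = ? WHERE id = ?",
            (name, city, company_id),
        )
        self.conn.commit()
        return cur.rowcount == 1

    def delete(self, company_id):
        cur = self.conn.execute("DELETE FROM company WHERE id = ?", (company_id,))
        self.conn.commit()
        return cur.rowcount == 1
```

Parameterized queries (the `?` placeholders) keep scraped text from being interpreted as SQL, which matters when the input comes from external websites.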

  11. Web Services [Diagram labels: Services: Read(), Write(), Update(), Progress(), Delete()] • Expose the DAL for access from external web apps • Accessed by HTTP GET or POST requests • Return JSON objects containing data Web Services Difficulties • Returning large JSON objects to the UI
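One common way to ease the large-JSON problem is to return results a page at a time. The slides do not say how the project addressed it, so the Python sketch below is only one possible approach; `read_page` and its `offset` and `limit` parameters are hypothetical.

```python
import json


def read_page(rows, offset=0, limit=100):
    """Return one page of rows as a JSON string, or a JSON error object."""
    if offset < 0 or limit <= 0:
        return json.dumps({"error": "offset must be >= 0 and limit must be > 0"})
    page = rows[offset:offset + limit]
    # 'total' lets the caller work out how many more pages to request.
    return json.dumps({"total": len(rows), "offset": offset, "items": page})
```

A UI grid can then request further pages as the user scrolls instead of receiving the whole result set in one response.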

  12. GUI (Graphical User Interface)

  13. GUI Difficulties • Implementing autocomplete functionality for query efficiency • Progress bar updates • Grid configuration and updating • Retrieving large amounts of data from web services
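Server-side autocomplete is often implemented as a prefix lookup over a sorted list. The Python sketch below shows that idea; it is not taken from the project, and the `suggest` function and its requirement that names be stored lowercased and sorted are assumptions.

```python
from bisect import bisect_left


def suggest(sorted_names, prefix, limit=10):
    """Return up to 'limit' names starting with 'prefix'.

    'sorted_names' must be lowercased and sorted ascending.
    """
    prefix = prefix.lower()
    # Binary search for the first name that is >= the prefix.
    start = bisect_left(sorted_names, prefix)
    matches = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix) or len(matches) == limit:
            break
        matches.append(name)
    return matches
```

Capping the number of suggestions keeps each response small, which also helps with the large-response problem listed on the same slide.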

  14. Overall Test Plan • Test each module individually to ensure independent functionality • As modules are completed, test integration pairs to ensure channel adequacy • When all modules are integrated, test the system end-to-end using the web app

  15. Harvester / Parser Test Plan • Ensure harvester can connect to site for scraping and retrieve the appropriate data • Maintain a list of input files that produce specific output after parsing • Define corner cases for sub-function robustness evaluation / testing • Ensure errors are caught and handled appropriately
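The idea of keeping a list of inputs with known expected outputs can be sketched as a small table-driven check in Python. The function under test, `normalize_city`, is a hypothetical stand-in for a parser sub-function (it relates to the malformed city names mentioned under Harvester Difficulties), and the cases are invented examples.

```python
def normalize_city(raw):
    """Hypothetical parser helper: collapse whitespace and title-case a city name."""
    return " ".join(raw.split()).title()


# Each case pairs a raw input with the output the parser is expected to produce.
CASES = [
    ("des  moines", "Des Moines"),
    (" AMES\n", "Ames"),
    ("", ""),  # corner case: empty input
]


def failing_cases(func, cases):
    """Return (input, expected, actual) for every case where func disagrees."""
    return [(inp, exp, func(inp)) for inp, exp in cases if func(inp) != exp]
```

An empty list from `failing_cases(normalize_city, CASES)` means every recorded input still produces its expected output; the same pattern extends to whole input files compared against stored output files.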

  16. ETL Test Plan • Maintain a list of input files that produce specific output after data cleanup • Ensure errors are caught and handled appropriately • Confirm ETL can talk to DAL

  17. DAL Test Plan • Ensure database can have records created, read, updated, and deleted • Define corner cases and error handling for invalid database operations • Create list of operations with expected results

  18. Web Services Test Plan • Call each web service with expected input and check return values • Call web services with invalid input and check return values

  19. Project Future • Database model can be generalized to include any number of data sources • Harvester can be separated from ETL so additional data sources will not require ETL change • Optimization / multithreading of harvester and parser for greater efficiency • User access control features in web application • Two separate GUIs: one for Kingland clients, and one for Kingland data analysts

  20. Questions?
