An Environment for Data Analysis in Biomedical Domain: Information Extraction for Decision Support Systems. Pablo Matos, Leonardo Lombardi, Thiago Pardo, Cristina Ciferri, Marina Vieira, and Ricardo Ciferri; presented by Thiago Pardo. USP NLP Group and UFSCar Database Group, São Carlos, BR.

Presentation Transcript


  1. An Environment for Data Analysis in Biomedical Domain: Information Extraction for Decision Support Systems
       • Pablo Matos, Leonardo Lombardi, Thiago Pardo, Cristina Ciferri, Marina Vieira, and Ricardo Ciferri
       • presented by Thiago Pardo
       • USP NLP Group and UFSCar Database Group, São Carlos, BR

  2. Context and Motivation An Environment for Data Analysis - IEA-AIE2010
       • Many electronic documents report experiments, describing:
         • the treatment adopted
         • patients with some kind of disease
         • the number of patients enrolled in the treatment
         • symptoms and risk factors
         • positive and negative effects
       • There are several transactions and journals
         • e.g., American Journal of Hematology, Blood, and Haematologica

  3. Context and Motivation
       • Nowadays, researchers and doctors are not able to process this huge number of documents

  4. Context and Motivation
       • These documents are in an unstructured format, i.e., plain text, mostly as PDF files
       • It is necessary to transform these data from an unstructured to a structured format in order to submit them to an automatic knowledge discovery process

  5. Goal
       • Development of an environment called IEDSS-Bio for analyzing data in the biomedical domain, specifically Sickle Cell Anemia
       • Support the expert in making decisions by:
         • Extracting relevant information from biomedical documents
         • Storing the information in a data warehouse (DW)
         • Mining interesting knowledge from the DW

  6. Contributions
       • Theoretical:
         • Domain Knowledge
         • Methodology of Information Extraction
       • Practical:
         • Resources: collection of documents, dictionary and rules
         • Tools: Converter, Information Extraction, Data Warehouse, Data Mining systems

  7. The Environment for Data Analysis
       • How many patients had clinical improvement and were treated with the hydroxyurea drug?
       • A significant number of patients under treatment with the hydroxyurea drug tend to have marrow depression.

  8. Converter Module

  9. Converter Module

  10. Information Extraction Module
       • Processed Sections:
         • Abstract, Results and Discussion (class of positive and negative effects)
         • All Sections (class of patient)

  11. Sentence Classification
       • Training: several files of sentences per class
         • Negative Effect: complication sentences
         • Positive Effect: benefit sentences
         • Others: other sentences
       • ML Techniques
       • Test: new text (TXT)
       • Output: set of sentences classified into classes

  12. Identification of Relevant Information
       • Dictionary
       • Biomedical Database

  13. Identification of Relevant Information
       • Rules
       • Identification of Information
       • Pipeline
       • Example of Sentences
       • Relevant Information
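
The dictionary and rule resources named on slides 12 and 13 are not shown in the transcript, so the following is only a minimal sketch of how dictionary lookup and a pattern rule could be combined. The dictionary entries, the regular expression, the function name `identify_information`, and the example sentence are all hypothetical, chosen to echo terms that appear elsewhere in the slides; they are not the authors' resources.

```python
import re

# Hypothetical dictionary: surface form -> class label (illustrative entries only).
DICTIONARY = {
    "hydroxyurea": "treatment",
    "marrow depression": "negative effect",
}

# Hypothetical rule: a number immediately followed by the word "patients".
PATIENT_COUNT_RULE = re.compile(r"\b(\d+)\s+patients\b", re.IGNORECASE)


def identify_information(sentence):
    """Return (label, text) pairs found by dictionary lookup, then by the rule."""
    found = []
    lowered = sentence.lower()
    # Dictionary-based step: simple case-insensitive substring match.
    for term, label in DICTIONARY.items():
        if term in lowered:
            found.append((label, term))
    # Rule-based step: extract the number captured by the pattern.
    for match in PATIENT_COUNT_RULE.finditer(sentence):
        found.append(("number of patients", match.group(1)))
    return found


# Made-up sentence, used only to exercise the sketch.
example = "Marrow depression was observed in 12 patients treated with hydroxyurea."
print(identify_information(example))
```

A real dictionary-based extractor would likely need tokenization and handling of term variants; substring matching is used here only to keep the sketch short.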

  14. Experiments: Sentence Classification
       • How do human beings manually perform the sentence classification?
       • Is it feasible to automate the sentence classification task?
       • What kind of classification algorithm performs better in this task?

  15. Manual Classification by humans? (Question 1)
       • Annotation agreement on 50 sentences
       • Fleiss (1971)
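
Slide 15 cites Fleiss (1971), the paper that introduced the multi-rater agreement statistic now known as Fleiss' kappa. Assuming that statistic is what was computed over the 50 annotated sentences, a minimal sketch of the calculation follows. The function name `fleiss_kappa` and the toy count table are illustrative; the study's actual annotation counts are not in the transcript.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item must have the same number
    of ratings (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    # Note: undefined (division by zero) if every rating falls in one category.
    return (p_bar - p_e) / (1 - p_e)


# Toy table: 3 sentences, 3 raters, 3 classes (made-up counts).
toy = [[2, 1, 0], [0, 3, 0], [1, 0, 2]]
print(round(fleiss_kappa(toy), 4))
```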

  16. Is it feasible to automate this task? (Question 2)
       • Landis and Koch (1977)
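
Landis and Koch (1977) proposed descriptive labels for ranges of the kappa statistic. Assuming that scale is what slide 16 applied to the agreement values (the slide content itself is not preserved), it can be written as a small lookup. The function name `interpret_kappa` is illustrative.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the descriptive label of Landis and Koch (1977)."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


print(interpret_kappa(0.72))
```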

  17. What kind of classification algorithm performs better in this task? (Question 3)
       • Distribution of classes for each sample

  18. Sentence Classification Process: training and testing phase (Question 3)
       • Bag-of-words model
       • AVM configuration:
         • Minimum frequency = 2
         • Attributes: 1- to 3-grams
         • Attribute values: 1 if the n-gram occurs in the sentence (present); 0 otherwise (absent)
       • Not applied: stopword removal and stemming
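
The configuration on slide 18 can be sketched as follows. This is a minimal illustration under stated assumptions: "AVM" is read as an attribute-value matrix, "minimum frequency" is read as the number of sentences containing the n-gram, and tokenization is plain lowercasing plus whitespace splitting. The function names and the toy sentences are hypothetical, and the authors' actual tooling is not described in the transcript.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """All contiguous n-grams of the token list for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_binary_avm(sentences, min_frequency=2):
    """Return (attributes, matrix): one row per sentence, one column per
    retained n-gram, with 1 if the n-gram is present and 0 if absent."""
    # No stopword removal and no stemming, as on the slide.
    per_sentence = [set(ngrams(s.lower().split())) for s in sentences]

    # Assumption: frequency = number of sentences containing the n-gram.
    frequency = Counter()
    for grams in per_sentence:
        frequency.update(grams)

    attributes = sorted(g for g, c in frequency.items() if c >= min_frequency)
    matrix = [[1 if a in grams else 0 for a in attributes] for grams in per_sentence]
    return attributes, matrix


# Made-up sentences, used only to exercise the sketch.
toy_sentences = ["patients had pain", "patients had fever", "no effect"]
attributes, matrix = build_binary_avm(toy_sentences)
print(attributes)
print(matrix)
```

Each row of `matrix`, paired with the sentence's class label, would then be one training or test example for a classifier.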

  19. Evaluation (Question 3)
       • Partitioning method: 10-fold cross-validation
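
For readers unfamiliar with the partitioning method, here is a minimal sketch of 10-fold cross-validation index splitting. The function name `k_fold_indices`, the shuffling seed, and the sample count are illustrative; the slide does not say whether the folds were stratified by class, and this sketch does not stratify.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.
    Every sample appears in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Illustrative sample count only.
for train, test in k_fold_indices(25):
    print(len(train), len(test))
```

The reported accuracy would then be an aggregate over the ten test folds, with the classifier retrained on the corresponding training indices each time.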

  20. Conclusions
       • The proposed environment, IEDSS-Bio (Information Extraction and Decision Support System in Biomedical domain), aims to be a general environment for mining relevant information in the biomedical domain
       • First experiments on sentence classification:
         • a step of the whole process
         • very good results (95.9% accuracy) for papers about Sickle Cell Anemia (SCA)
       • The sentence classification task in the SCA domain is well defined and can be automated

  21. Future Work
       • Investigate the identification of treatment and symptom information in scientific papers
       • Extract the relevant sentence pieces for populating our databases
         • using IE approaches, e.g., rule-based and dictionary-based
       • Investigate the use of parallel processing to optimize the more time-consuming tasks
         • e.g., the application of data mining algorithms and analytical query processing
       • Other biomedical areas may also benefit from our text mining approach

  22. An Environment for Data Analysis in Biomedical Domain: Information Extraction for Decision Support Systems
       • Questions?
       • USP NLP Group and UFSCar Database Group, São Carlos, BR

  23. References
       • ANTHONY, L.; LASHKIA, G. V. Mover: a machine learning tool to assist in the reading and writing of technical papers. IEEE Transactions on Professional Communication, v. 46, n. 3, p. 185-193, 2003.
       • FLEISS, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin, v. 76, n. 5, p. 378-382, 1971.
       • LANDIS, J. R.; KOCH, G. G. The measurement of observer agreement for categorical data. Biometrics, v. 33, n. 1, p. 159-174, 1977.
       • PINTO, A. C. S. et al. Technical Report "Sickle Cell Anemia". São Carlos: Department of Computer Science, Federal University of São Carlos, 2009. p. 16. Available at: <http://sca.dc.ufscar.br/download/files/report.sca.pdf>.
