Explore strategies for choosing the right institution for biomedical research funding through data analysis. The authors applied preprocessing techniques and a bag-of-words approach, generated a TFIDF matrix, and reduced it with SVD for computational efficiency. Grantors were selected with a kNN search followed by a custom scoring algorithm, and the approach was evaluated with leave-one-out cross-validation.
Where Do You Go for Biomedical Funding? Yi Liu, Ahmet Altay
Background • Problem • In biomedical research there are many sources of federal funding. • How to choose the right institution for funding for a given research idea? • Data • Biomedical grant summaries from 20 institutions, covering 1972 to 2009
Pre-Processing • Clean up texts by removing mark-up, meta words, and duplicates • Remove institutions with fewer than 5000 grant records • Bag-of-words approach with a pre-determined dictionary • Removed 319 stop words from text • Used stemming (Porter) to further collapse text • Dictionary size of 83485 with 120636 distinct spellings • Use mgrep to annotate our data with dictionary words
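The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop-word list and dictionary are tiny hypothetical stand-ins, `crude_stem` is a placeholder suffix stripper rather than the Porter algorithm, and the mgrep annotation step is replaced by a simple dictionary lookup.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the slides used 319 stop words and a large
# pre-determined dictionary, neither of which is reproduced here.
STOP_WORDS = {"the", "of", "in", "and", "a", "to", "for", "is", "are"}
DICTIONARY = {"protein", "cell", "signal", "tumor", "gene"}


def crude_stem(token):
    """Strip a couple of common suffixes. NOT Porter stemming, only a placeholder."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, stem, and count dictionary terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = (crude_stem(t) for t in tokens if t not in STOP_WORDS)
    return Counter(s for s in stems if s in DICTIONARY)


counts = bag_of_words("Signaling proteins in tumor cells and the genes of the cell")
print(counts)
```

In a real setting a proper Porter implementation would replace `crude_stem`, and the dictionary matching would be done by the annotation tool.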
Processing • Generate a TFIDF matrix given the dictionary and abstracts • TFIDF matrix is huge (83435 by 561769) • Reduce TFIDF matrix for computational efficiency • Remove dictionary terms with zero counts and abstracts with no dictionary terms • Use SVD to represent the data in a smaller sub-space of the original matrix • Singular values decrease quickly. We used the first 100 eigenvectors without losing much precision.
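To make the processing steps concrete, here is a small sketch using NumPy on a made-up term-by-abstract count matrix. The TFIDF weighting shown (relative term frequency times log inverse document frequency) is an assumption, since the slides do not say which variant was used, and `k = 2` stands in for the 100 dimensions kept in the slides.

```python
import numpy as np

# Toy count matrix: rows are dictionary terms, columns are abstracts.
# All values are invented for illustration.
counts = np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],   # a dictionary term that never occurs
    [0, 2, 2, 1],
], dtype=float)

# Drop unused terms (all-zero rows) and empty abstracts (all-zero columns).
counts = counts[counts.sum(axis=1) > 0]
counts = counts[:, counts.sum(axis=0) > 0]

n_docs = counts.shape[1]
tf = counts / counts.sum(axis=0, keepdims=True)   # term frequency per abstract
df = (counts > 0).sum(axis=1, keepdims=True)      # abstracts containing each term
tfidf = tf * np.log(n_docs / df)                  # assumed TFIDF variant

# Truncated SVD: keep only the k largest singular values and vectors.
k = 2
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
reduced = (np.diag(s[:k]) @ Vt[:k]).T             # one k-dimensional row per abstract

print(s)              # singular values, largest first
print(reduced.shape)
```

A new abstract's TFIDF vector `q` can be mapped into the same sub-space with `q @ U[:, :k]`. For a matrix as large as the one in the slides, a sparse truncated solver would be needed instead of a dense SVD.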
Effect of Using Eigen Sub-space • Tested performance on a smaller data set (400). • Performance of raw TFIDF is similar to eigen sub-space.
Evaluation • For a given test abstract, we used kNN search to find the 100 closest abstracts. • Used a custom scoring algorithm to pick the grantor that best represents the 100 nearest neighbors found • Tested the entire data set using leave-one-out cross-validation
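The evaluation loop can be sketched as below. The slides do not describe the custom scoring algorithm, so this example substitutes a similarity-weighted vote over the nearest neighbors. Cosine similarity as the distance measure, the vectors, and the grantor labels "A" and "B" are likewise assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict_grantor(query, vectors, labels, k, skip=None):
    """Pick the grantor with the largest summed similarity among the k nearest
    neighbors. This weighted vote is a stand-in, not the authors' scoring."""
    sims = [(cosine(query, v), labels[i]) for i, v in enumerate(vectors) if i != skip]
    nearest = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    scores = defaultdict(float)
    for sim, label in nearest:
        scores[label] += sim
    return max(scores, key=scores.get)


def leave_one_out_accuracy(vectors, labels, k):
    """Hold out each abstract in turn and predict its grantor from the rest."""
    hits = sum(
        predict_grantor(v, vectors, labels, k, skip=i) == labels[i]
        for i, v in enumerate(vectors)
    )
    return hits / len(vectors)


# Made-up reduced vectors and hypothetical grantor labels.
vectors = [(1.0, 0.1), (0.9, 0.2), (0.8, 0.0), (0.1, 1.0), (0.2, 0.9), (0.0, 0.8)]
labels = ["A", "A", "A", "B", "B", "B"]

accuracy = leave_one_out_accuracy(vectors, labels, k=2)
guess = predict_grantor((1.0, 0.0), vectors, labels, k=3)
print(accuracy, guess)
```

The slides used 100 neighbors; `k` is kept small here only because the toy data set has six points.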