
Online Active Learning with Imbalanced Classes



  1. Zahra Ferdowsi. Online Active Learning with Imbalanced Classes. October 15th, 2013. Accenture Technology Labs / DePaul University

  2. Do we always have enough labeled data to train the classifier?

  3. Active Learning Scenario • Large number of unlabeled examples • Interactive setting, with domain experts in the process • Limited labeling resources • High labeling costs

  4. Healthcare example: motivation of this study • Inefficiencies in the healthcare insurance process result in large monetary losses affecting corporations and consumers • $91 billion over-spent in the US every year on health administration and insurance (McKinsey study, Nov. 2008) • 131 percent increase in insurance premiums over the past 10 years

  5. Health Insurance Claim Process

  6. Healthcare example • Claim payment errors drive a significant portion of these inefficiencies • Increased administrative costs and service issues for health plans • Overpayment of claims: a direct loss • Underpayment of claims: loss of interest for the insurer, loss of revenue for the provider

  7. Early Rework Detection: how it was done before • 1. Random Audits for Quality Control • Flow: Claims Database → Random Samples → Manual Audits → Auditors • Extremely low hit rates • Long audit times due to fully manual audits

  8. Early Rework Detection: how it was done before • 2. Hypothesis and Rule-Based Audits • Flow: generate expert hypotheses → Database Queries → Claims Database → Hypothesis-based audits → Auditors • Better hit rates, but still a lot of manual effort in discovering, building, updating, executing, and maintaining the hypotheses

  9. Data • Duration: 2 years • Number of claims: 3.5 million • Labeled claims: 121k (49k rework) • Number of features: 16k

  10. Features • Member information • Provider information • Claim header • Contract information, total amount billed, diagnosis code, date of service • Claim line details • amount billed per service, procedure code, counter for the procedure (quantity)

  11. Predictive Modeling • Domain characteristics • High dimensional data • Sparse data • Fast training, updating and scoring required • Ability to generate explanation for domain experts • Classifier: Linear SVMs • Distance from margin is used as the ranking score
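A minimal sketch of this scoring step, assuming scikit-learn's LinearSVC as a stand-in for the LibSVM-trained linear SVM on the slides (the function and variable names here are hypothetical, not from the paper):

    # Illustrative sketch, not the authors' exact pipeline.
    import numpy as np
    from sklearn.svm import LinearSVC

    def rank_claims(X_train, y_train, X_pool):
        """Rank pool examples by signed distance from the SVM hyperplane."""
        clf = LinearSVC()                       # fast to train on sparse, high-dimensional data
        clf.fit(X_train, y_train)
        scores = clf.decision_function(X_pool)  # signed margin distance per example
        return np.argsort(-scores), scores      # highest-scoring (most rework-like) first

The decision_function output doubles as the ranking score shown to auditors and as the raw material for the instance selection strategies on the following slides.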

  12. Well-known Instance Selection Strategies (ISS) • Uncertainty • Distance to the hyperplane (Shen et al., 2004) • Entropy (Settles, 2008) • Clustering • Density (cosine similarity): average similarity to all other cases (Shen et al., 2004) • Hierarchical (Dasgupta, 2008) • k-means using cosine similarity (Zhu et al., 2001)
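Given the decision scores from the sketch above, the margin-based uncertainty strategy reduces to a few lines; select_uncertain is a hypothetical helper, not a name from the paper:

    # Sketch: uncertainty sampling via distance to the hyperplane.
    import numpy as np

    def select_uncertain(scores, n):
        """Pick the n pool examples closest to the hyperplane, i.e. most uncertain."""
        return np.argsort(np.abs(scores))[:n]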

  13. Well-known ISS (cont.) • Hybrid approach: Density × Uncertainty (Zhu et al., 2008; Settles and Craven, 2008) • Query-by-Committee: measuring the level of disagreement among a committee of classifiers (Melville and Mooney, 2004)
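The hybrid strategy can be sketched the same way; the exact weighting in the cited papers differs, so treat the inverse-margin uncertainty term here as an illustrative assumption rather than their formula:

    # Sketch of a Density * Uncertainty hybrid, not the cited papers' exact weighting.
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def select_density_uncertainty(X_pool, scores, n):
        """Prefer uncertain examples that also lie in dense regions of the pool."""
        uncertainty = 1.0 / (np.abs(scores) + 1e-8)       # nearer the margin = more uncertain
        density = cosine_similarity(X_pool).mean(axis=1)  # average similarity to the pool
        return np.argsort(-(uncertainty * density))[:n]   # top-n hybrid scores

The pairwise similarity matrix is O(n^2) in the pool size, so on a pool of millions of claims the density term would typically be computed on a subsample.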

  14. Experimental Setup • Active-learning loop (flowchart, sketched in code below): 1) select n instances randomly from the pool set; 2) remove the selected instances from the pool set; 3) add these instances, with labels, to the training set; 4) train the classifier on the training set; 5) use the classifier to measure precision @ k% on the testing set; 6) if the pool set is not exhausted, select n instances from the pool set using an instance selection strategy and go back to step 2; otherwise, end • 5-fold cross-validation • Evaluation metric: precision at top 1%, 2%, and 5% • Number of instances labeled in each iteration: n = 100 • SVM as the base classifier, using LibSVM
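A runnable outline of that loop, under the same assumptions as the earlier sketches (LinearSVC in place of LibSVM, dense arrays, binary 0/1 labels); select is any function mapping the current classifier and pool to ranked pool indices, so the ISS sketches above can be wrapped to fit:

    # Illustrative outline of the experimental loop, not the authors' exact code.
    import numpy as np
    from sklearn.svm import LinearSVC

    def precision_at_k(clf, X_test, y_test, k_frac=0.01):
        """Precision among the top k% highest-scoring test examples (binary 0/1 labels)."""
        scores = clf.decision_function(X_test)
        k = max(1, int(len(y_test) * k_frac))
        return y_test[np.argsort(-scores)[:k]].mean()

    def active_learning_run(X_lab, y_lab, X_pool, y_pool, X_test, y_test, select, n=100):
        """Repeat train / evaluate / query-n-labels until the pool set is exhausted."""
        curve = []
        while len(y_pool) > 0:
            clf = LinearSVC().fit(X_lab, y_lab)
            curve.append(precision_at_k(clf, X_test, y_test))
            idx = select(clf, X_pool)[: min(n, len(y_pool))]  # the ISS picks the batch
            X_lab = np.vstack([X_lab, X_pool[idx]])           # oracle reveals the labels
            y_lab = np.concatenate([y_lab, y_pool[idx]])
            keep = np.ones(len(y_pool), dtype=bool)
            keep[idx] = False
            X_pool, y_pool = X_pool[keep], y_pool[keep]
        return curve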

  15. How do existing ISS perform? Claims data set

  16. How do existing ISS perform? Claims data set

  17. Experiments on more datasets • KDD Cup 1999 dataset for network intrusion detection; the "probing" intrusion type is used as the positive label • HIVA is a chemoinformatics dataset used to predict which compounds are active against the AIDS HIV infection • ZEBRA is an embryology dataset that provides a feature representation of zebrafish embryo cells to determine whether they are in division (meiosis) or not

  18. How do existing ISS perform? ZEBRA data set

  19. Do existing ISS work? • No ISS is consistently the best across all domains and at all precision levels • Creating a validation set to choose among them is challenging, since labeled data are scarce and expensive to obtain • We propose an unsupervised score that can predict the performance of an ISS without using any additional labeled examples

  20. Proposed Unsupervised Scores • MS on Unlabeled set (MSU): mean score of the top k% instances in the unlabeled set • MS on Labeled set (MSL): mean score of the top k% instances in the labeled set from the previous iteration • MS on All (MSA): mean score of the top k% instances in the combined set (the unlabeled set plus the labeled set from the last iteration)
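A sketch of MSU under the earlier assumptions (decision scores from the linear SVM, with k matching the precision@k evaluation level):

    # Sketch of the MSU score as defined on the slide.
    import numpy as np

    def msu(clf, X_pool, k_frac=0.01):
        """MSU: mean decision score over the top k% of the unlabeled pool."""
        scores = clf.decision_function(X_pool)
        k = max(1, int(len(scores) * k_frac))
        return np.sort(scores)[-k:].mean()

MSL and MSA follow the same pattern, computed over the labeled instances from the previous iteration and over the combined set, respectively.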

  21. Do the new unsupervised scores work? • The graphs show a high correlation between the score and precision (figure: certainty on the Claims data set)

  22. Do they work? • The correlation values are promising

  23. Can we use the unsupervised score to predict the best ISS in each iteration? • The online algorithm has two components: • the unsupervised score (MSU), which tracks the performance of an individual ISS without using any validation set • a simple online algorithm that uses MSU to switch between different strategies • Existing alternatives for comparison: • CEM (Classification Entropy Maximization) as the score • a multi-armed bandit algorithm for switching between ISS
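The slide names the two components but not the precise switching rule, so the following is only a simplified illustration of the idea (query with whichever strategy shows the largest recent MSU gain), not the paper's algorithm:

    # Simplified sketch of MSU-based strategy switching; the actual rule is not given here.
    import numpy as np

    def choose_strategy(msu_history, iteration):
        """msu_history: one list of past MSU values per candidate strategy."""
        n = len(msu_history)
        if iteration < 2 * n:                  # warm-up: observe each strategy twice
            return iteration % n
        gains = [h[-1] - h[-2] for h in msu_history]
        return int(np.argmax(gains))           # exploit the largest recent MSU gain

A multi-armed bandit algorithm such as the one in [4] plays the same role in the existing alternatives mentioned above.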

  24. Online Active Learning

  25. How does the online algorithm work? HIVA data set

  26. Conclusion • We proposed an online algorithm for active learning that switches between different candidate ISS for classification on imbalanced data sets • The online algorithm has two components: • a score, MSU, that tracks the performance of an individual ISS without using any validation set • a simple online algorithm that uses the change in MSU to switch between different strategies • The online approach works better than (or at least on par with) the best individual ISS and achieves 80%-100% of the highest possible precision

  27. Questions

  28. References [1] Active Learning Challenge. [2] KDD Cup 1999. [3] J. Attenberg and F. Provost. Inactive learning?: Difficulties employing active learning in practice. SIGKDD Explorations Newsletter, 12, March 2011. [4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48-77, 2002. [5] Y. Baram, R. El-Yaniv, K. Luz, and M. Warmuth. Online choice of active learning algorithms. Journal of Machine Learning Research, 2004. [6] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines, 2001. [7] P. Donmez and J. G. Carbonell. Active sampling for rank learning via optimizing the area under the ROC curve. In Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval, ECIR '09, pages 78-89, Berlin, Heidelberg, 2009. Springer. [8] P. Donmez, J. G. Carbonell, and P. N. Bennett. Dual strategy active learning. In ECML, 2007.

  29. References (cont.) [9] J. He and J. Carbonell. Nearest-neighbor-based active learning for rare category detection. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, MIT Press, Cambridge, MA, 2008. [10] M. Kumar, R. Ghani, and Z.-S. Mei. Data mining to predict and prevent errors in health insurance claims processing. In KDD '10, New York, NY, USA, 2010. [11] A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of the International Conference on Machine Learning (ICML), pages 359-367. Morgan Kaufmann, 1998. [12] H. T. Nguyen and A. Smeulders. Active learning using pre-clustering. In ICML, 2004. [13] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009. [14] B. Settles and M. Craven. An analysis of active learning strategies for sequence labeling tasks. In EMNLP, 2008. [15] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proceedings of the International Conference on Machine Learning (ICML), pages 999-1006. Morgan Kaufmann, 2000.
