
Rough Set based Decision Tree for Identifying Vulnerable and Food Insecure Households



Presentation Transcript


  1. Rough Set based Decision Tree for Identifying Vulnerable and Food Insecure Households Rajni Jain(1), S. Minz(2) and P. Adhiguru(1); (1) Sr. Scientist, NCAP, Pusa, New Delhi; (2) Associate Professor, Jawaharlal Nehru University

  2. Outline • Problem • Knowledge Discovery Process • Data Mining • Classification Task of Data Mining • Methodology: RDT • Dataset for this Study • Classifier Model • Evaluation

  3. Problem of Food Security • Available funds are most often scarce • The food security programme needs to be targeted at the most vulnerable groups • Exhaustive surveys exclusively for this purpose would be very costly and time consuming • Simple concepts need to be learned to identify target beneficiaries on the basis of morphological household characteristics

  4. Knowledge Discovery in Datasets (figure): Target Data → Pre-processed Data → Transformed Data → Patterns → Knowledge, via the steps Selection, Preprocessing, Transformation, Data Mining and Interpretation • The Selection phase defines the KDD problem by focusing on a subset of data attributes or data samples on which KDD is to be performed • Preprocessing includes removing noise and handling missing data; care must be taken not to induce any unwanted bias • Transformations may combine attributes or discretize continuous attributes • In the Data Mining step many different learning and modelling algorithms are potential candidates

  5. Data Mining Tasks • Classification • Decision Tree • Decision Rule • Summarization • Association rules • Characteristic rules

  6. Classification • Step I: a classification algorithm learns a model (rules/tree/formula) from the training data • Step II: estimate the predictive accuracy of the model; if acceptable, proceed • Step III: apply the classification rules to label the class of new data

  7. Data • Training data: the data used for developing the model • Test data: the data used to estimate the evaluation parameters of the model • New data: condition attributes are known but the decision attribute is not

  8. Basis of Classification Algorithms • Rough Sets • Decision Tree Learning • Statistics • Neural Networks • Genetic Algorithms No single method is suitable for all types of domains

  9. Methodology: Machine Learning • Rough Sets • Decision Tree induction • Rough Set based Decision Tree induction (RDT) • Two phases: • rough sets for dominant attribute selection • J4.8 for decision tree induction

  10. Rough Sets • Proposed around 1980 by Prof. Z. Pawlak, a Polish mathematician • Indiscernible: similar objects (say patients, households, etc.) • Indiscernibility relation

  11. Indiscernibility Relation (contd.) Flu-patients example (table in figure): U/IND(H) = {{1,4,6}, {2,3,5}} and U/IND(F) = {{1,2,3,6}, {4,5}}
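The partitions above can be computed mechanically: group objects whose values agree on the chosen attributes. Below is a minimal sketch; the attribute values in the toy table are assumptions chosen so that the partitions match the ones on the slide (H = headache, F = flu), not the actual survey data.

```python
from collections import defaultdict

# Hypothetical flu table for six patients; values chosen to reproduce
# the slide's partitions, not taken from the real dataset.
table = {
    1: {"H": "yes", "F": "yes"},
    2: {"H": "no",  "F": "yes"},
    3: {"H": "no",  "F": "yes"},
    4: {"H": "yes", "F": "no"},
    5: {"H": "no",  "F": "no"},
    6: {"H": "yes", "F": "yes"},
}

def ind_partition(table, attrs):
    """Group objects that are indiscernible on the given attributes."""
    blocks = defaultdict(list)
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)
        blocks[key].append(obj)
    return sorted(blocks.values())

print(ind_partition(table, ["H"]))  # [[1, 4, 6], [2, 3, 5]]
print(ind_partition(table, ["F"]))  # [[1, 2, 3, 6], [4, 5]]
```

Passing both attributes (`["H", "F"]`) refines the partition further, since objects must then agree on every listed attribute to fall in the same block.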

  12. Lower and Upper Approximation (figure) Let the bigger square represent the domain of the universe; the small squares represent the partitions of the universe for a given set of attributes P, and all objects in a partition are indiscernible. The oval represents the concept X to be defined. The lower approximation of X = {13,14,18,19} (partitions wholly contained in X); the upper approximation of X = {7,8,9,12,13,14,15,17,18,19,20,22,23,24} (partitions overlapping X). In the second square the lower and upper approximations coincide, so X is a crisp set.
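The definitions above translate directly into code: a partition block belongs to the lower approximation if it lies wholly inside the concept, and to the upper approximation if it merely intersects it. A minimal sketch, using a small hypothetical universe rather than the numbered grid in the figure:

```python
def approximations(blocks, concept):
    """Lower approximation: union of blocks fully inside the concept.
    Upper approximation: union of blocks that intersect the concept."""
    concept = set(concept)
    lower, upper = set(), set()
    for block in blocks:
        b = set(block)
        if b <= concept:
            lower |= b
        if b & concept:
            upper |= b
    return lower, upper

# Toy universe of six objects split into three indiscernibility blocks.
blocks = [{1, 2}, {3, 4}, {5, 6}]
lo, up = approximations(blocks, {1, 2, 3})
print(lo, up)  # {1, 2} {1, 2, 3, 4}
```

When the two approximations coincide the concept is crisp (exactly definable by the attributes); otherwise the difference between them is the boundary region.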

  13. Important Terms • Reduct (R): a minimal set of attributes that preserves the IND relation • Decision-relative reduct: a reduct that preserves the classification with respect to the decision attribute • Core (C): the intersection of all reducts • Johnson's method computes a single efficient reduct • A GA-based algorithm computes multiple reducts
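Johnson's method mentioned above is a greedy set-cover heuristic: build the discernibility entries (for each pair of objects with different decisions, the attributes that tell them apart), then repeatedly pick the attribute covering the most remaining entries. A minimal sketch under that reading; the toy table is hypothetical:

```python
from itertools import combinations

def johnson_reduct(table, cond_attrs, decision):
    """Greedy Johnson heuristic for a single decision-relative reduct."""
    objs = list(table)
    # Discernibility entries: attributes distinguishing each pair of
    # objects that carry different decision values.
    entries = []
    for i, j in combinations(objs, 2):
        if table[i][decision] != table[j][decision]:
            diff = {a for a in cond_attrs if table[i][a] != table[j][a]}
            if diff:
                entries.append(diff)
    reduct = []
    while entries:
        # Pick the attribute that covers the most uncovered entries.
        best = max(cond_attrs, key=lambda a: sum(a in e for e in entries))
        reduct.append(best)
        entries = [e for e in entries if best not in e]
    return reduct

# Hypothetical 3-object table: attribute A alone separates the decisions.
toy = {
    1: {"A": 0, "B": 0, "d": 0},
    2: {"A": 1, "B": 0, "d": 1},
    3: {"A": 0, "B": 1, "d": 0},
}
print(johnson_reduct(toy, ["A", "B"], "d"))  # ['A']
```

In practice the slides use Rosetta's implementation of this reducer; the sketch only illustrates the idea on a table small enough to check by hand.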

  14. Architecture of the RDT Model (figure): Data → Reduct Computation Algorithm → Reduct → remove attributes absent in the reduct → Reduced Training Data → ID3 Algorithm → Decision Tree (DT)

  15. Decision Tree (figure): root node CHLD (y/n); the tree further splits on HAGE (young, middle, old, very old) and LAND, with leaves labelled 0 and 1

  16. Dataset • Source: primary survey data of 180 rural households from three villages, collected for a project by Dr. P. Adhiguru at the National Centre for Agricultural Economics and Policy Research (NCAP), India • Three different production systems from Dharampuri district of Tamil Nadu state • Actual food intake was measured by the 24-hour recall method; the corresponding nutrient intake was then worked out

  17. Attributes • Attributes are the variables in the dataset used to describe the objects • An attribute is either qualitative or quantitative • In a classification problem two types of attributes are considered: • condition attributes (independent variables) • class or decision attribute (dependent variable)

  18. Food Groups • Food groups: cereals and millets, pulses, green leafy vegetables, fruits, milk, fats and oils, roots and tubers, sugar • Nutrients: protein, energy, calcium, iron, vitamin A, vitamin C • Energy is used as a proxy for measuring food insecurity of the household

  19. Morphological Attributes HouseHold_Id • 1. Land: whether the household owns land • 2. Hedu: highest education of the household head • 3. Hage: age of the household head • 4. Chld: whether there are children in the family • 5. Flsz: number of members in the family • 6. PrWm: proportion of women to family size • 7. Hstd: whether the household owns a homestead garden • 8. Pear: proportion of earners to family size • PCENER: energy per capita per day in kcal • 9. Decision: derived from PCENER

  20. Average Calorie Intake • In Tamil Nadu, average intake per consumer unit per day = 2347 kcal • In Tamil Nadu, calorie intake of the lowest decile per consumption unit per day = 1551 kcal • For all India, calorie intake of the lowest decile per consumption unit per day = 1954 kcal • To identify the poorest of the poor, the lowest-decile figure was used • If energy < 1500 kcal, the decision attribute is labelled 0 (poorest of the poor, vulnerable to food insecurity); else 1 (not vulnerable to food insecurity)
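The labelling rule above is a one-line threshold on PCENER. A minimal sketch (the function name is illustrative, and the 1500 kcal cutoff is the one stated on the slide):

```python
VULNERABILITY_CUTOFF = 1500  # kcal per capita per day, per the slide

def label_household(pcener):
    """Derive the decision attribute from per-capita energy intake.
    0 = vulnerable to food insecurity, 1 = not vulnerable."""
    return 0 if pcener < VULNERABILITY_CUTOFF else 1

print([label_household(e) for e in (1320, 1500, 2347)])  # [0, 1, 1]
```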

  21. Revisiting the Problem • Available funds are most often scarce • The food security programme needs to be targeted at the most vulnerable groups • Exhaustive surveys exclusively for this purpose would be very costly and time consuming • Simple concepts need to be learned to identify target beneficiaries on the basis of morphological household characteristics

  22. Concepts to be Learned from the Rural Household Dataset • Decision tree: a hierarchical structure with a root node and subtrees as children • Rules: a tree may be mapped to rules by traversing each path from root to leaf
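The tree-to-rules mapping described above is a simple root-to-leaf traversal. A minimal sketch; the tree here is a hypothetical fragment that echoes the deck's attribute names (CHLD, LAND), not the learned model itself:

```python
# A decision tree as nested dicts: an internal node maps one attribute
# to {branch value: subtree}; a leaf is a bare class label.
tree = {"CHLD": {"y": 0, "n": {"LAND": {"0": 1, "1": 0}}}}

def tree_to_rules(node, path=()):
    """Emit one IF-THEN rule per root-to-leaf path."""
    if not isinstance(node, dict):  # leaf: close the rule
        cond = " AND ".join(f"{a}={v}" for a, v in path)
        return [f"IF {cond} THEN class={node}"]
    (attr, branches), = node.items()
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, path + ((attr, value),))
    return rules

for r in tree_to_rules(tree):
    print(r)
# IF CHLD=y THEN class=0
# IF CHLD=n AND LAND=0 THEN class=1
# IF CHLD=n AND LAND=1 THEN class=0
```

The number of rules equals the number of leaves, which is why the slides report both rule count and tree complexity as simplicity measures.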

  23. Software • Rosetta for rough set analysis • Weka for decision tree induction • C++ programs for interfacing between the two packages • Excel for evaluation of the classifiers

  24. Description of Learning Algorithms

  25. DT and corresponding rules

  26. Evaluation • Experiment using 10 fold Cross Validation • Accuracy on Test data (A) • Complexity (S) • Number of Rules (Nr) • Number of attributes (Na) • Cumulative Score (CS)
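The 10-fold cross-validation used above can be sketched without any library: split the objects into ten folds, hold out each fold once, and average the held-out accuracy. The sketch below uses a hypothetical majority-class baseline as the learner, purely so the example is self-contained; the deck's actual learners are the rough-set and J4.8 models.

```python
def cross_val_accuracy(data, labels, train_fn, k=10):
    """Plain k-fold CV: hold out each fold once, average test accuracy.
    train_fn takes [(x, y), ...] and returns a model (a callable)."""
    n = len(data)
    folds = [list(range(i, n, k)) for i in range(k)]
    accs = []
    for fold in folds:
        hold = set(fold)
        train = [(data[i], labels[i]) for i in range(n) if i not in hold]
        model = train_fn(train)
        correct = sum(model(data[i]) == labels[i] for i in fold)
        accs.append(correct / len(fold))
    return sum(accs) / k

def majority_baseline(train):
    """Illustrative learner: always predict the majority training class."""
    ys = [y for _, y in train]
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

# Toy run: 20 objects, 14 labelled 1 and 6 labelled 0.
data = list(range(20))
labels = [1] * 14 + [0] * 6
print(round(cross_val_accuracy(data, labels, majority_baseline, k=10), 2))
```

Reporting accuracy alongside complexity, rule count and attribute count, as the slide does, guards against preferring an accurate but unwieldy model.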

  27. Evaluation of Simplified DT • Accuracy = 73% • Complexity = 43 • Number of rules = 9 • Number of attributes = 4 • 0: poorest and vulnerable to food insecurity • 1: not vulnerable to food insecurity

  28. Comparing Algorithms using CS

  29. Nutrition Dataset

  30. DT (DRJP) - Nutrition Data (figure) Accuracy = 73%, complexity = 43, attributes = 4, rules = 9. The tree splits on CHLD (y/n), HAGE (<40, 40, [41,51), >51), FLSIZE (<4, 4, >4) and PEAR (<45, [45,54), >45), with leaves labelled 0 and 1.

  31. Benefits • Cost Effective • Timely • Simple to understand and implement • No scope for personal Bias

  32. Constraints • Model building requires expertise • Lack of synergy among disciplines • Requires an adequate sample of data • Models are region specific • Mindset towards conventional and traditional techniques

  33. References
  • Adhiguru, P. and C. Ramasamy 2003. Agricultural-based Interventions for Sustainable Nutritional Security. Policy Paper 17. NCAP, New Delhi, India.
  • Han, J. and M. Kamber 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann.
  • Hand, D., Mannila, H. and P. Smyth 2001. Principles of Data Mining. PHI.
  • Minz, S. and R. Jain 2003. Rough Set based Decision Tree Model for Classification. In Proc. of 5th Intl. Conference, DaWaK 03, LNCS 2737.
  • Minz, S. and R. Jain 2005. Refining decision tree classifiers using rough set tools. International Journal of Hybrid Intelligent Systems, 2(2):133-147.
  • Pawlak, Z. 2001. Drawing Conclusions from Data: The Rough Set Way. IJIS 16:3-11.
  • Polkowski, L. and A. Skowron 2001. Rough Sets in Knowledge Discovery 1 and 2. Heidelberg, Germany: Physica-Verlag.
  • Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
  • Rosetta, rough set toolkit, http://www.idi.ntnu.no/~aleks/rosetta/.
  • Witten, I. H. and E. Frank 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
  • Wroblewski, J. 1998. Genetic algorithms in decomposition and classification problems. In: Polkowski, L. and Skowron, A. (eds.), Rough Sets in Knowledge Discovery, Heidelberg, Germany: Physica-Verlag, 472-492.
  • Ziarko, W. 1993. Variable precision rough set model. Journal of Computer and System Sciences 46:39-59.

  34. ?

  35. Thank You
