Advanced Methods and Analysis for the Learning and Social Sciences



  1. Advanced Methods and Analysis for the Learning and Social Sciences PSY505 Spring term, 2012 February 13, 2012

  2. Today’s Class • Classification and Behavior Detection

  3. Prediction • Pretty much what it says • A student is using a tutor right now. Is he gaming the system or not? • A student has used the tutor for the last half hour. How likely is it that she knows the skill in the next step? • A student has completed three years of high school. What will be her score on the college entrance exam?

  4. Two Key Types of Prediction • This slide adapted from slide by Andrew W. Moore, Google http://www.cs.cmu.edu/~awm/tutorials

  5. Classification • There is something you want to predict (“the label”) • The thing you want to predict is categorical • The answer is one of a set of categories, not a number • CORRECT/WRONG (sometimes expressed as 0,1) • HELP REQUEST/WORKED EXAMPLE REQUEST/ATTEMPT TO SOLVE • WILL DROP OUT/WON’T DROP OUT • WILL SELECT PROBLEM A,B,C,D,E,F, or G

  6. Where do those labels come from? • Field observations (take PSY503) • Text replays (take PSY503) • Post-test data (take PSY503) • Tutor performance • Survey data • School records • Where else?

  7. Classification • Associated with each label is a set of “features”, which maybe you can use to predict the label

     Skill          pknow  time  totalactions  right
     ENTERINGGIVEN  0.704  9     1             WRONG
     ENTERINGGIVEN  0.502  10    2             RIGHT
     USEDIFFNUM     0.049  6     1             WRONG
     ENTERINGGIVEN  0.967  7     3             RIGHT
     REMOVECOEFF    0.792  16    1             WRONG
     REMOVECOEFF    0.792  13    2             RIGHT
     USEDIFFNUM     0.073  5     2             RIGHT
     ….

  8. Classification • The basic idea of a classifier is to determine which features, in which combination, can predict the label

     Skill          pknow  time  totalactions  right
     ENTERINGGIVEN  0.704  9     1             WRONG
     ENTERINGGIVEN  0.502  10    2             RIGHT
     USEDIFFNUM     0.049  6     1             WRONG
     ENTERINGGIVEN  0.967  7     3             RIGHT
     REMOVECOEFF    0.792  16    1             WRONG
     REMOVECOEFF    0.792  13    2             RIGHT
     USEDIFFNUM     0.073  5     2             RIGHT
     ….

  9. Classification • Of course, usually there are more than 4 features • And more than 7 actions/data points • These days, 800,000 student actions, and 26 features, would be a medium-sized data set

  10. Classification • One way to classify is with a Decision Tree (like J48)

      PKNOW
      |-- <0.5:  TIME
      |     |-- <6s.:  RIGHT
      |     `-- >=6s.: WRONG
      `-- >=0.5: TOTALACTIONS
            |-- <4:  RIGHT
            `-- >=4: WRONG

  11. Classification • One way to classify is with a Decision Tree (like J48)

      PKNOW
      |-- <0.5:  TIME
      |     |-- <6s.:  RIGHT
      |     `-- >=6s.: WRONG
      `-- >=0.5: TOTALACTIONS
            |-- <4:  RIGHT
            `-- >=4: WRONG

      Skill         pknow  time  totalactions  right
      COMPUTESLOPE  0.544  9     1             ?
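
To make the lookup concrete, here is a minimal Python sketch of the tree on this slide. It assumes the diagram reads as follows: PKNOW below 0.5 leads to the TIME test, PKNOW at or above 0.5 leads to the TOTALACTIONS test, and each test's lower branch ends in RIGHT. The function name `classify` and its argument names are illustrative, not taken from J48 or Weka.

```python
def classify(pknow, time_s, total_actions):
    """Hand-coded version of the slide's tree (branch layout is an assumption)."""
    if pknow < 0.5:
        # Low estimated knowledge: decide on response time in seconds.
        return "RIGHT" if time_s < 6 else "WRONG"
    # Higher estimated knowledge: decide on the number of actions so far.
    return "RIGHT" if total_actions < 4 else "WRONG"


# The unlabeled COMPUTESLOPE row from the slide.
prediction = classify(pknow=0.544, time_s=9, total_actions=1)
print(prediction)
```

Under that reading, the COMPUTESLOPE row follows the >=0.5 branch and then the <4 branch.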

  12. J48/C4.5 • Can handle both numerical and categorical predictor variables • Tries to find optimal split in numerical variable • Repeatedly looks for variable which best splits the data in terms of predictive power for each variable • Later prunes out branches that turned out to have low predictive power
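
The "optimal split in numerical variable" step can be sketched roughly as below. This is a simplified illustration that scores candidate thresholds by plain information gain; C4.5 itself uses a gain-ratio criterion and adds pruning, neither of which is shown. The helper names `entropy` and `best_threshold` are made up for this sketch, and the data are the totalactions and right columns of the seven example rows from slide 7.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical labels."""
    total = len(labels)
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / total
        result -= p * log2(p)
    return result


def best_threshold(values, labels):
    """Return (threshold, gain) for the split `value < threshold` with the highest gain."""
    base = entropy(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Candidate thresholds are midpoints between adjacent distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v < threshold]
        right = [lab for v, lab in zip(values, labels) if v >= threshold]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# totalactions and right columns from the seven example rows.
total_actions = [1, 2, 1, 3, 1, 2, 2]
right = ["WRONG", "RIGHT", "WRONG", "RIGHT", "WRONG", "RIGHT", "RIGHT"]
threshold, gain = best_threshold(total_actions, right)
print(threshold, round(gain, 3))
```

On these seven rows a single cut on totalactions happens to separate the labels completely, which is exactly the kind of pattern that may not hold up on a larger data set.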

  13. Step Regression • Linear regression (discussed in detail in a later class), with a cut-off • Essentially assigns a weight to each parameter, and then computes a numerical value • Then all values below 0.5 are treated as 0, and all values >= 0.5 are treated as 1
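
A minimal sketch of the prediction step, assuming the weights have already been fitted by linear regression. The weights and intercept below are invented for illustration only; they are not fitted to any data from this course.

```python
def step_regression_predict(features, weights, intercept, cutoff=0.5):
    """Weighted sum of the features, then a hard cut-off at `cutoff`."""
    value = intercept + sum(w * x for w, x in zip(weights, features))
    return 1 if value >= cutoff else 0


# Hypothetical weights for (pknow, time, totalactions); illustrative only.
weights = [0.6, -0.01, 0.1]
intercept = 0.1

high = step_regression_predict([0.967, 7, 3], weights, intercept)
low = step_regression_predict([0.049, 6, 1], weights, intercept)
print(high, low)
```

With these made-up weights the first row scores about 0.91 and is treated as 1, while the second scores about 0.17 and is treated as 0.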

  14. And of course… • There are lots of other classification algorithms you can use... • K* (instance-based classification) • JRip (rule-based classification using trees) • PART (rule-based classification using trees) • Neural Network • Logistic Regression • SMO (support vector machine) • In your favorite Machine Learning package

  15. If there’s time at the end of class… • We could go through some of these algorithms

  16. Comments? Questions?

  17. What data set should you generally test on? • A vote… • Raise your hands as many times as you like

  18. What data set should you generally test on? • The data set you trained your classifier on • A data set from a different tutor • Split your data set in half (by students), train on one half, test on the other half • Split your data set in ten (by actions). Train on each set of 9 sets, test on the tenth. Do this ten times. • Votes?

  19. What data set should you generally test on? • The data set you trained your classifier on • A data set from a different tutor • Split your data set in half (by students), train on one half, test on the other half • Split your data set in ten (by actions). Train on each set of 9 sets, test on the tenth. Do this ten times. • What are the benefits and drawbacks of each?

  20. The dangerous one (though still sometimes OK) • The data set you trained your classifier on • If you do this, there is serious danger of over-fitting

  21. The dangerous one (though still sometimes OK) • You have ten thousand data points. • You fit a parameter for each data point. • “If data point 1, RIGHT. If data point 78, WRONG…” • Your accuracy is 100% • Your kappa is 1 • Your model will neither work on new data, nor will it tell you anything.
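
The memorization scenario above can be reproduced in a few lines. This sketch uses a made-up toy data set of six training labels and four new labels, and a "model" that is just a lookup table from data point index to label. The default guess of RIGHT for unseen points is an arbitrary assumption.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def cohens_kappa(actual, predicted):
    """Agreement beyond chance; undefined when expected agreement is 1."""
    n = len(actual)
    observed = accuracy(actual, predicted)
    expected = sum(
        (actual.count(c) / n) * (predicted.count(c) / n)
        for c in set(actual) | set(predicted)
    )
    return (observed - expected) / (1 - expected)


# One "parameter" per training data point: a plain lookup table.
train_labels = ["RIGHT", "WRONG", "WRONG", "RIGHT", "RIGHT", "WRONG"]
memorized = dict(enumerate(train_labels))


def predict(index):
    # Unseen data points fall back to an arbitrary default guess.
    return memorized.get(index, "RIGHT")


train_predictions = [predict(i) for i in range(len(train_labels))]
train_accuracy = accuracy(train_labels, train_predictions)
train_kappa = cohens_kappa(train_labels, train_predictions)

# New data points have indices the table has never seen.
new_labels = ["WRONG", "WRONG", "RIGHT", "WRONG"]
new_predictions = [predict(i + len(train_labels)) for i in range(len(new_labels))]
new_accuracy = accuracy(new_labels, new_predictions)
new_kappa = cohens_kappa(new_labels, new_predictions)

print(train_accuracy, train_kappa, new_accuracy, new_kappa)
```

On the training data the lookup table is perfect; on the new points it can only guess, so its kappa drops to chance level.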

  22. The dangerous one (though still sometimes OK) • The data set you trained your classifier on • When might this one still be OK?

  23. The dangerous one (though still sometimes OK) • The data set you trained your classifier on • When might this one still be OK? • Computing complexity-based goodness metrics such as BIC • Determine maximum possible performance of modeling approach

  24. K-fold cross validation (standard) • Split your data set in ten (by action). Train on each set of 9 sets, test on the tenth. Do this ten times. • What can you infer from this?

  25. K-fold cross validation (standard) • Split your data set in ten (by action). Train on each set of 9 sets, test on the tenth. Do this ten times. • What can you infer from this? • Your detector will work with new data from the same students

  26. K-fold cross validation (standard) • Split your data set in ten (by action). Train on each set of 9 sets, test on the tenth. Do this ten times. • What can you infer from this? • Your detector will work with new data from the same students • How often do we really care about this?
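
The action-level procedure described on these slides can be sketched as follows. This is a plain-Python illustration rather than the routine of any particular package; the function name `k_fold_indices` and the fixed random seed are assumptions made for the example.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists; each action is held out exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled actions into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


splits = list(k_fold_indices(n_items=25, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Because the folds are formed from individual actions, the same student's actions can land in both the training and the test portion, which is why this scheme only supports inferences about new data from the same students.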

  27. K-fold cross validation (student-level) • Split your data set in half (by student), train on one half, test on the other half • What can you infer from this?

  28. K-fold cross validation (student-level) • Split your data set in half (by student), train on one half, test on the other half • What can you infer from this? • Your detector will work with data from new students from the same population (whatever it was) • Possible to do in RapidMiner • Not possible to do in Weka
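
A minimal sketch of the student-level split, assuming each action is stored as a dictionary with a `student` key. That field name, the function name `split_by_student`, and the toy rows are hypothetical.

```python
import random


def split_by_student(rows, seed=0):
    """Put every action of a given student on the same side of the split."""
    students = sorted({row["student"] for row in rows})
    random.Random(seed).shuffle(students)
    half = len(students) // 2
    train_students = set(students[:half])
    train = [row for row in rows if row["student"] in train_students]
    test = [row for row in rows if row["student"] not in train_students]
    return train, test


# Toy data: four hypothetical students with three actions each.
rows = [{"student": s, "action": a} for s in ["s1", "s2", "s3", "s4"] for a in range(3)]
train, test = split_by_student(rows)
print(len(train), len(test))
```

The key design point is that the shuffle happens over student identifiers rather than over actions, so no student contributes to both halves.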

  29. K-fold or leave-one-out • Really not clear which one is best (as discussed in previous lecture) • Certain kinds of re-sampling/bootstrapping/etc. are easier to do with k-fold cross-validation

  30. A data set from a different tutor • The most stringent test • When your model succeeds at this test, you know you have a good/general model • When it fails, it’s sometimes hard to know why

  31. An interesting alternative • Leave-out-one-tutor-cross-validation (cf. Baker, Corbett, & Koedinger, 2006) • Train on data from 3 or more tutors • Test on data from a different tutor • (Repeat for all possible combinations) • Good for giving a picture of how well your model will perform in new lessons
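
The simplest form of this idea holds out one tutor at a time and trains on the rest, sketched below. The `tutor` field, the function name `leave_one_tutor_out`, and the four tutor names are hypothetical, and this sketch does not enumerate the other train/test combinations the slide mentions.

```python
def leave_one_tutor_out(rows):
    """Yield (held_out_tutor, train_rows, test_rows) once per tutor."""
    tutors = sorted({row["tutor"] for row in rows})
    for held_out in tutors:
        train = [row for row in rows if row["tutor"] != held_out]
        test = [row for row in rows if row["tutor"] == held_out]
        yield held_out, train, test


# Toy data: four hypothetical tutors with two actions each.
rows = [{"tutor": t, "action": a} for t in ["A", "B", "C", "D"] for a in range(2)]
folds = list(leave_one_tutor_out(rows))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

With four tutors, each round trains on three of them and tests on the fourth.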

  32. Worth noting • If you want to know if your model will work on new populations • Cross-validate at the population level rather than the student level

  33. Comments? Questions?

  34. Homework 3 • Let’s look at some of the homework 3 solutions • Please comment on what’s right and wrong, what’s clever, etc. • We’ll look at the approaches, the goodness, the final models

  35. Homework 3 • Now let’s take the best homework • Any other ideas for how to come up with a better model? • Let’s try them!

  36. Feature Engineering • There are lots of fancy algorithms • But typically your detector is no better than your features • Features that have good construct validity are more likely to produce a good model • Particularly nice example of this in Sao Pedro et al. (under review) • In the next assignment, you’ll create your own features to try to produce a better model

  37. Assignment 4 • Let’s review Assignment 4

  38. Comments? Questions?

  39. Next Class • Wednesday, February 15 • 3pm-5pm • AK232 • Feature engineering and feature distillation • SPECIAL GUEST LECTURER: SUJITH GOWDA • Assignments Due: 4. Feature Engineering

  40. The End

  41. Bonus Slides • If there’s time

  42. BKT with Multiple Skills

  43. Conjunctive Model (Pardos et al., 2008) • The probability a student can answer an item with skills A and B is • P(CORR|A^B) = P(CORR|A) * P(CORR|B) • But how should credit or blame be assigned to the various skills?
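
The product rule on this slide is easy to check numerically. The sketch below only computes the predicted probability of a correct answer; it does not address the credit and blame question. The function name and the example probabilities are made up for illustration.

```python
from math import prod


def p_correct_conjunctive(p_correct_by_skill):
    """P(CORR | all skills) as the product of the per-skill P(CORR | skill)."""
    return prod(p_correct_by_skill)


# Hypothetical values: P(CORR|A) = 0.9 and P(CORR|B) = 0.8.
p_both = p_correct_conjunctive([0.9, 0.8])
print(round(p_both, 2))
```

Since every factor is at most 1, an item that needs more skills can never be predicted as easier than its hardest single skill under this model.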

  44. Koedinger et al.’s (2011) Conjunctive Model • Equations for 2 skills

  45. Koedinger et al.’s (2011) Conjunctive Model • Generalized equations

  46. Koedinger et al.’s (2011) Conjunctive Model • Handles case where multiple skills apply to an item better than classical BKT

  47. Other BKT Extensions? • Additional parameters? • Additional states?

  48. Many others • Compensatory Multiple Skills (Pardos et al., 2008) • Clustered Skills (Ritter et al., 2009)
