Tuned Models of Peer Assessment in MOOCs



Tuned Models of Peer Assessment in MOOCs

Chris Piech, Jonathan Huang (Stanford)

Zhenghao Chen, Chuong Do, Andrew Ng, Daphne Koller (Coursera)



A variety of assessments

How can we efficiently grade 10,000 students?



The Assessment Spectrum

The spectrum runs from short responses to long responses: multiple choice, coding assignments, proofs, essay questions.

Short-response end: easy to automate, but limited ability to ask expressive questions or require creativity.

Long-response end: hard to grade automatically, but can assign complex assignments and provide complex feedback.



Stanford/Coursera’s HCI course

Video lectures + embedded questions, weekly quizzes, open-ended assignments



Student work

Slide credit: Chinmay Kulkarni


Calibrated peer assessment

1) Calibration (staff-graded)

2) Assess 5 Peers

3) Self-Assess

[Russell, ’05, Kulkarni et al., ‘13]

Similar process also used in Mathematical Thinking, Programming Python, Listening to World Music, Fantasy and Science Fiction, Sociology, Social network analysis ....

Slide credit: Chinmay Kulkarni (http://hci.stanford.edu/research/assess/)

Image credit: Debbie Morrison (http://onlinelearninginsights.wordpress.com/)
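Step 2 of this workflow hands each student five peers to assess. As a minimal sketch (the round-robin allocation scheme and the function name are mine, not Coursera's actual assignment logic), one simple way to allocate graders so that every submission gets exactly five assessments and nobody grades their own work:

    import random

    def assign_peer_graders(student_ids, k=5, seed=0):
        """Illustrative allocation: shuffle the students, then have each student
        grade the next k submissions around the shuffled circle. Every submission
        gets exactly k graders, every student grades exactly k peers, and nobody
        grades their own work (assuming more than k students)."""
        rng = random.Random(seed)
        order = list(student_ids)
        rng.shuffle(order)
        n = len(order)
        to_grade = {}
        for i, v in enumerate(order):
            # student v grades the next k submissions in the shuffled order
            to_grade[v] = [order[(i + j) % n] for j in range(1, k + 1)]
        return to_grade

    # Example: 8 students, each grading 5 peers.
    print(assign_peer_graders(["s%d" % i for i in range(8)]))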


Largest peer grading network to date

  • 77 “ground truth” submissions graded by everyone (staff included)

HCI 1, Homework #5



How well did peer grading do?

[Histogram of peer-grading accuracy over ~1400 students, showing the fraction of grades within 5pp and within 10pp of the staff ground truth; the black region marks grades farther off, i.e., much room for improvement.]

Up to 20% of students get a grade over 10% away from ground truth!



Peer Grading Desiderata

  • Statistical model for estimating and correcting for grader reliability/bias

  • A simple method for reducing grader workload

  • Scalable estimation algorithm that easily handles MOOC-sized courses

Our work:

  • Highly reliable/accurate assessment

  • Reduced workload for both students and course staff

  • Scalability (to, say, tens of thousands of students)



How to decide if a grader is good

Who should we trust?

[Diagram: a bipartite graph linking graders to submissions, with each peer grade (100%, 100%, 50%, 55%, 56%, 30%, 54%) attached to an edge.]

Idea: look at the other submissions graded by these graders!

Need to reason with all submissions and peer grades jointly!



Statistical Formalization

[Graphical model: each submission has a true score; each student/grader has a bias (average grade inflation/deflation) and a reliability (grading variance); each observed score is generated from the submission's true score and the grader's bias and reliability.]



Model PG1

Modeling grader bias and reliability.

Variables: the true score of student u, the grader reliability of student v, the grader bias of student v, and student v’s assessment of student u (observed).

Related models in the literature:

  • Crowdsourcing [Whitehill et al. (‘09), Bachrach et al. (‘12), Kamar et al. (‘12)]

  • Anthropology [Batchelder & Romney (‘88)]

  • Peer Assessment [Goldin & Ashley (‘11), Goldin (‘12)]
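A minimal sketch of a PG1-style generative process: the variable roles follow the slide (true score, grader bias, grader reliability, observed score), but the Gaussian/gamma priors, hyperparameter values, and function name are illustrative assumptions, not the paper's tuned values.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative hyperparameters (assumptions for this sketch)
    mu0, gamma0 = 70.0, 1.0 / 25.0   # prior mean and precision of true scores
    eta0 = 1.0 / 9.0                 # prior precision of grader bias
    alpha0, beta0 = 3.0, 3.0         # gamma prior on grader reliability (grading precision)

    def simulate_pg1(num_students=1000, graders_per_submission=5):
        """Sample from a PG1-style model: each observed grade is the submission's
        true score plus the grader's bias plus noise whose variance is the inverse
        of that grader's reliability."""
        true_score = rng.normal(mu0, np.sqrt(1.0 / gamma0), num_students)
        bias = rng.normal(0.0, np.sqrt(1.0 / eta0), num_students)
        reliability = rng.gamma(alpha0, 1.0 / beta0, num_students)

        observations = []  # tuples (grader v, submission u, observed score z)
        for u in range(num_students):
            graders = rng.choice(np.delete(np.arange(num_students), u),
                                 size=graders_per_submission, replace=False)
            for v in graders:
                z = rng.normal(true_score[u] + bias[v], np.sqrt(1.0 / reliability[v]))
                observations.append((int(v), u, z))
        return true_score, bias, reliability, observations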



Correlating bias variables across assignments

[Scatter plot of grader biases estimated from assignment T against biases at assignment T+1 (Bias on Assn 4 vs. Bias on Assn 5), with fitted line y = 0.48*x + 0.16: biases are correlated across assignments.]
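One way such a plot could be produced, sketched under the assumption that per-grader biases are estimated naively as mean residuals (the model's posterior estimates could be used instead); the helper names are hypothetical.

    import numpy as np

    def estimate_bias(observations, predicted_scores):
        """Naive per-grader bias estimate for one assignment: the mean of
        (observed grade - predicted score) over the submissions that grader scored.
        observations: list of (grader v, submission u, observed score z)."""
        totals, counts = {}, {}
        for v, u, z in observations:
            totals[v] = totals.get(v, 0.0) + (z - predicted_scores[u])
            counts[v] = counts.get(v, 0) + 1
        return {v: totals[v] / counts[v] for v in totals}

    def bias_regression(bias_t, bias_t_plus_1):
        """Regress bias at assignment T+1 on bias at assignment T for graders who
        appear in both (the slide reports roughly y = 0.48*x + 0.16)."""
        common = sorted(set(bias_t) & set(bias_t_plus_1))
        x = np.array([bias_t[v] for v in common])
        y = np.array([bias_t_plus_1[v] for v in common])
        slope, intercept = np.polyfit(x, y, 1)
        return slope, intercept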



Model PG2

Temporal coherence: grader bias at homework T depends on bias at T-1.

As in PG1, the model has the true score of student u, the grader reliability of student v, the grader bias of student v, and student v’s assessment of student u (observed).
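A tiny sketch of the temporal-coherence idea, with made-up noise scales: each homework's bias is drawn around the previous homework's bias rather than independently around zero each week.

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_bias_chain(num_assignments, sigma_init=3.0, sigma_step=2.0):
        """PG2-style temporal coherence (noise scales here are illustrative):
        a grader's bias at homework T is centered on their bias at homework T-1."""
        bias = [rng.normal(0.0, sigma_init)]
        for _ in range(1, num_assignments):
            bias.append(rng.normal(bias[-1], sigma_step))
        return np.array(bias)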



Model PG3

Coupled grader score and reliability: your reliability as a grader depends on your ability!

As in PG1, the model includes the true score of student u, the grader bias of student v, and student v’s assessment of student u (observed).

Approximate inference: Gibbs sampling (EM and variational methods were also implemented for a subset of the models). Running time: ~5 minutes for HCI 1. ** PG3 cannot be Gibbs sampled in “closed form”.
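The slide names Gibbs sampling as the inference method. Below is a minimal sketch of what a Gibbs sweep could look like for the simpler PG1-style model above, using standard conjugate Gaussian/gamma updates; the hyperparameters, initialization, and observation format are the same illustrative assumptions as before, not the paper's implementation.

    import numpy as np

    def gibbs_pg1(observations, num_students, iters=500,
                  mu0=70.0, gamma0=1/25, eta0=1/9, alpha0=3.0, beta0=3.0, seed=0):
        """Gibbs sampler sketch for a PG1-style model.
        observations: list of (grader v, submission u, observed score z).
        Returns posterior-mean true-score estimates (averaged over the second half)."""
        rng = np.random.default_rng(seed)
        s = np.full(num_students, float(mu0))   # true scores
        b = np.zeros(num_students)              # grader biases
        tau = np.ones(num_students)             # grader reliabilities (precisions)
        by_sub, by_grader = {}, {}
        for v, u, z in observations:
            by_sub.setdefault(u, []).append((v, z))
            by_grader.setdefault(v, []).append((u, z))
        s_sum = np.zeros(num_students)
        for it in range(iters):
            # Resample each submission's true score given its peer grades
            for u, edges in by_sub.items():
                prec = gamma0 + sum(tau[v] for v, _ in edges)
                mean = (gamma0 * mu0 + sum(tau[v] * (z - b[v]) for v, z in edges)) / prec
                s[u] = rng.normal(mean, np.sqrt(1.0 / prec))
            # Resample each grader's bias and reliability given the scores they gave
            for v, edges in by_grader.items():
                prec = eta0 + tau[v] * len(edges)
                mean = tau[v] * sum(z - s[u] for u, z in edges) / prec
                b[v] = rng.normal(mean, np.sqrt(1.0 / prec))
                resid = sum((z - s[u] - b[v]) ** 2 for u, z in edges)
                tau[v] = rng.gamma(alpha0 + 0.5 * len(edges),
                                   1.0 / (beta0 + 0.5 * resid))
            if it >= iters // 2:
                s_sum += s
        return s_sum / (iters - iters // 2)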



Incentives

Scoring rules can impact student behavior

Model PG3 gives high-scoring graders more “sway” in computing a submission’s final score.

Improves prediction accuracy

Model PG3 gives higher homework scores to students who are accurate graders!

Encourages students to grade better

See [Dasgupta & Ghosh, ‘13] for a theoretical look at this problem



Prediction Accuracy

  • 33% reduction in RMSE

  • Only 3% of submissions land farther than 10% from ground truth

[Histograms comparing baseline (median) prediction accuracy with Model PG3 prediction accuracy.]
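For concreteness, the two quantities being compared: the median-of-peer-grades baseline is named on the slide, while the helper functions themselves are illustrative.

    import numpy as np

    def rmse(predicted, ground_truth):
        """Root-mean-squared error between predicted and ground-truth grades."""
        predicted, ground_truth = np.asarray(predicted), np.asarray(ground_truth)
        return np.sqrt(np.mean((predicted - ground_truth) ** 2))

    def median_baseline(peer_grades_per_submission):
        """Baseline predictor: the median of each submission's peer grades."""
        return [float(np.median(grades)) for grades in peer_grades_per_submission]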



Prediction Accuracy, All models

Despite an improved rubric in HCI2, the simplest model (PG1 with just bias) outperforms baseline grading on all metrics.

Just modeling bias (constant reliability) captures ~95% of the improvement in RMSE

An improved rubric made baseline grading in HCI 2 more accurate than in HCI 1.

PG3 typically outperforms the other models.

[Results shown for both HCI 1 and HCI 2.]



Meaningful Confidence Estimates

When our model is 90% confident that its prediction is within K% of the true grade, then in our experiments it is indeed within K% more than 90% of the time (i.e., our model is conservative).

We can use confidence estimates to tell when a submission needs to be seen by more graders!

Experiments where confidence fell between 0.90 and 0.95
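One hedged sketch of how posterior samples could drive this decision; the threshold, function name, and the use of the posterior mean as the prediction are assumptions for illustration, not the authors' exact procedure.

    import numpy as np

    def needs_more_graders(posterior_samples, k=10.0, target_confidence=0.9):
        """Given posterior samples of one submission's true score, estimate the
        probability that the posterior-mean prediction is within k points of the
        true score; flag the submission for extra graders if that confidence is
        below the target."""
        samples = np.asarray(posterior_samples)
        prediction = samples.mean()
        confidence = np.mean(np.abs(samples - prediction) <= k)
        return confidence < target_confidence, confidence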



How many graders do you need?

Some submissions need more graders!

Some grader assignments can be reallocated!

Note: This is quite an overconservative estimate (as in the last slide)



Understanding graders in the context of the MOOC

Question: What factors influence how well a student will grade?

Better scoring graders grade better

[Plots of the mean and standard deviation of peer grades, marking the “hardest” and “easiest” submissions to grade.]



Residual given grader and gradee scores

The worst students tend to inflate the best submissions; the best students tend to downgrade the worst submissions.

[Plot of grading residual (# standard deviations from mean) against grader grade (z-score) and gradee grade (z-score), showing the regions of grade inflation and grade deflation.]



How much time should you spend on grading?

“sweet spot of grading”: ~ 20 minutes



What your peers say about you!

[Example peer feedback for the best submissions and the worst submissions.]



Commenting styles in HCI

Students have more to say about weaknesses than strong points

On average, comments vary from neutral to positive, with few highly negative comments.

[Plots of sentiment polarity and grading residual (z-score) against feedback length (words).]



Student engagement and peer grading

Task: predict whether a student will complete the last homework.

[ROC curves (True Positive Rate vs. False Positive Rate) for classifiers using all features, just grade, just bias, and just reliability; best AUC = 0.97605.]
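A plausible setup for this prediction task, assuming a logistic-regression classifier over per-student grade, bias, and reliability features; the slides do not specify the classifier, and the data below are synthetic stand-ins rather than real course data.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    # Hypothetical per-student features from the peer-grading model:
    # [estimated grade, estimated bias, estimated reliability] (random stand-ins).
    X = rng.normal(size=(1000, 3))
    # Synthetic "completed the last homework" label, loosely tied to the grade feature.
    y = (X[:, 0] + 0.3 * rng.normal(size=1000)) > 0

    clf = LogisticRegression().fit(X, y)
    print("AUC:", roc_auc_score(y, clf.predict_proba(X)[:, 1]))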



Takeaways

Peer grading is an easy and practical way to grade open-ended assignments at scale

Real-world deployment: our system was used in HCI 3!

Reasoning jointly over all submissions and accounting for bias/reliability can significantly improve current peer grading in MOOCs

Grading performance can tell us about other learning factors such as student engagement or performance



The End



Gradient descent for linear regression

~40,000 submissions

