Problem Description “Using Machine Learning to Make Money at Horse Races”

Example entries (two horses from race 372966):

Pos | Draw | Btn | Horse               | Wgt  | Jockey    | Trainer      | Age | SP    | Raceid
1   | 4    |     | Timocracy           | 10-0 | S Drowne  | A B Haynes   | 5   | 4/9 f | 372966
2   | 5    | 1¾  | Bussell Along (IRE) | 9-3  | S Sanders | Stef Higgins | 4   | 8/1   | 372966

Comments:
• Timocracy: led after 1f, ridden 2f out, stayed on well and in command final furlong; opened 4/5, touched 4/5; £800-£1100 £400-£550 £400-£650 (x3) £400-£750 (x4) £500-£1000 (x4) £300-£600 £200-£400 (x2)
• Bussell Along (IRE): held up early, headway on outside to chase leaders 4f out, effort and hung left from 2f out, went 2nd 1f out, no chance with winner; opened 10/1, touched 10/1

Task: Given a training set, learn a function which predicts the winner/selects a horse to bet $10 on from a given set of entries.

Performance Measures:
• Accuracy in picking the winner of a race (simple version)
• Return of placing a $10 bet on a horse in the race (advanced version; solves the “real problem” of trying to make money on the track)

Links:
• http://www.racingpost.com
• http://www.drf.com/
• http://www.shrp.com/
• http://socialmediaseo.net/2010/05/01/kentucky-derby-2010/
• http://www.equibase.com/premium/eqbPDFChartPlus.cfm?RACE=11&BorP=P&TID=CD&CTRY=USA&DT=05/01/2010&DAY=D&STYLE=EQB
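The two performance measures can be sketched in a few lines. This is only an illustration, assuming the SP column holds fractional odds such as "4/9 f" or "8/1" and that a winning $10 bet at odds a/b yields a net profit of $10·a/b; the function names (`parse_sp`, `profit`, `evaluate`) are hypothetical, and other SP notations are not handled.

```python
from fractions import Fraction

def parse_sp(sp):
    """Parse a fractional starting price such as '4/9 f' or '8/1'.
    Assumption: the first token is always 'a/b'; a trailing marker
    such as 'f' (favourite) is ignored."""
    token = sp.split()[0]
    num, den = token.split("/")
    return Fraction(int(num), int(den))

def profit(sp, won, stake=10):
    """Net profit of one bet: stake * odds if the horse won, else -stake."""
    return stake * parse_sp(sp) if won else Fraction(-stake)

def evaluate(picks, stake=10):
    """picks: list of (sp, won) pairs, one pick per race.
    Returns (accuracy, total net return)."""
    wins = sum(1 for _, won in picks if won)
    total = sum((profit(sp, won, stake) for sp, won in picks), Fraction(0))
    return wins / len(picks), total

# Two hypothetical picks using the odds from the example entries.
accuracy, total = evaluate([("4/9 f", True), ("8/1", False)])
print(accuracy, float(total))
```

Using `Fraction` keeps odds such as 4/9 exact until the final report is printed.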

Problem Description 2
• This is an individual project.
• In general, the problem is a ranking problem; one approach is to learn a function that assigns a score to each horse in a race and pick the horse with the highest score. But it can also be viewed as a classification or prediction problem.
• The datasets will be “very basic”, containing only a few attributes, but you are allowed to create additional attributes by computing statistics from the datasets or by extracting information from other sources (e.g. the percentage of races won by a jockey).
• Basically, the project tries to predict the future. Likely we will use races of a single race track; you are given a true temporal sequence of races: DS1 (races in Jan./Feb.), DS2 (races in March/April), …, DS6 (the true test set; you are not allowed to peek into this one; only Chun-sheng has access to this dataset), which serve as training sets, validation sets, test sets, and sources of new feature generation in the project.
• Students have freedom in what approaches to use; there are many of them. Ad hoc approaches are welcome; likely every student will use a different approach, and some will be quite complicated while others use simpler approaches.
• The goal is to get something running; students who use a well-tuned simple approach will get a better grade than students who use a very complicated, sophisticated approach which does not run at all.
• Deliverables: You will demo your system, write a medium-sized report, and Chun-sheng will test your system with a test set of his own.
• You are allowed to use any software/tool in the project; you just have to mention what you used in your report.
• In general, the submission deadline is Wed., March 23, 11 p.m., but the idea is that you spend at most 5 weeks on the project!
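As an illustration of creating an additional attribute, the sketch below computes the percentage of races won by each jockey from an earlier dataset in the temporal sequence. It assumes each entry is a dictionary with "Jockey" and "Pos" keys and that the winner has Pos "1"; the function name `win_rates` is hypothetical. The three rows are taken from the April 2010 example entries.

```python
from collections import defaultdict

def win_rates(rows, key="Jockey"):
    """Fraction of entries won, grouped by the given attribute.
    Assumption: the winner of a race has Pos == "1"."""
    rides = defaultdict(int)
    wins = defaultdict(int)
    for row in rows:
        rides[row[key]] += 1
        if row["Pos"] == "1":
            wins[row[key]] += 1
    return {name: wins[name] / rides[name] for name in rides}

# Entries from an earlier dataset (e.g. DS1); only the fields needed here.
earlier_rows = [
    {"Jockey": "G Baker", "Pos": "1"},
    {"Jockey": "V Slattery", "Pos": "3"},
    {"Jockey": "K Fallon", "Pos": "2"},
]

rates = win_rates(earlier_rows)
print(rates)
```

The same function works for trainers by passing `key="Trainer"`, provided the rows carry that field.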

Data (Wolverhampton (UK))
• http://maps.google.com/maps?rlz=1T4ADRA_enUS403US403&um=1&ie=UTF-8&q=wolverhampton+racecourse&fb=1&gl=us&hq=wolverhampton+racecourse&cid=0,0,4040866765308641574&ei=T0RdTfnEFYP98AaOqoGMCw&sa=X&oi=local_result&ct=image&resnum=2&ved=0CC4QnwIwAQ
• http://www.racingpost.com/horses2/cards/meeting_popup.sd?crs_id=513&action_date=2011-02-17&selected_tab=COURSE_MAP
• http://www.wolverhampton-racecourse.co.uk/
• http://www.racingpost.com/horses2/cards/meeting_popup.sd?crs_id=513&action_date=2011-02-17&selected_tab=MEETING_INFO
• There is a chance that we will still change the race track, but the dataset formats likely will not change.
• Data:
  • RaceID: ID number identifying the race
  • Pos.: Finish position of the horse
  • Draw: The start stall that a horse has been allocated
  • Dist.: The distance a horse has finished behind the winner
  • Horse: The name of the horse
  • Wt: The weight that the horse carries
  • Jockey: The name of the jockey
  • Trainer: The name of the trainer
  • Age: The age of the horse
  • SP: The official starting price of the horse (optional)
  • RaceHeader: Metadata of the race
  • RaceDetail: Metadata of the race
  • Winning time: The finish time of the winner
  • Date: The date of the race
• 3 Example Entries of the “April 2010” File (http://www2.cs.uh.edu/~lyons19/Horse/training/April.txt):
"414","1","1","","Little Edward","9-9","G Baker","R J Hodges","12","5/6 f","Wolverhampton 14:30 - Result","£2600 added, 3yo plus, 5f 20y, Class 6, £1774 penalty, 7 ran","","Monday April 26 2010"
"414","3","4","1/2","Hinton Admiral","9-5","V Slattery","M S Tuck","6","8/1","Wolverhampton 14:30 - Result","£2600 added, 3yo plus, 5f 20y, Class 6, £1774 penalty, 7 ran","","Monday April 26 2010"
"446","2","1","2","Mejd (IRE)","9-3","K Fallon","M R Channon","3","2/1","Wolverhampton 20:30 - Result","£3600 added, 3yo only, 1m 141y, Class 5, £2457 penalty, 4 ran","Winning Time: 1m 52.11s","Saturday April 10 2010"
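A minimal sketch of reading entries in this format with Python's standard `csv` module follows. It assumes the file has no header row, uses plain double quotes, and lists the fields in the order given above; the field names in `FIELDS` and the helper names are hypothetical. The weight conversion further assumes that "9-9" means stones and pounds (1 stone = 14 lb), which the slides do not state.

```python
import csv
import io

# Assumed column order, following the field list on the data slide.
FIELDS = ["RaceID", "Pos", "Draw", "Dist", "Horse", "Wt", "Jockey", "Trainer",
          "Age", "SP", "RaceHeader", "RaceDetail", "WinningTime", "Date"]

SAMPLE = '''"414","1","1","","Little Edward","9-9","G Baker","R J Hodges","12","5/6 f","Wolverhampton 14:30 - Result","£2600 added, 3yo plus, 5f 20y, Class 6, £1774 penalty, 7 ran","","Monday April 26 2010"
"414","3","4","1/2","Hinton Admiral","9-5","V Slattery","M S Tuck","6","8/1","Wolverhampton 14:30 - Result","£2600 added, 3yo plus, 5f 20y, Class 6, £1774 penalty, 7 ran","","Monday April 26 2010"
"446","2","1","2","Mejd (IRE)","9-3","K Fallon","M R Channon","3","2/1","Wolverhampton 20:30 - Result","£3600 added, 3yo only, 1m 141y, Class 5, £2457 penalty, 4 ran","Winning Time: 1m 52.11s","Saturday April 10 2010"'''

def read_entries(text):
    """Parse quoted, comma-separated entries into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), fieldnames=FIELDS))

def weight_lb(wt):
    """Convert a weight such as '9-9' to pounds (assumption: stones-pounds)."""
    stones, pounds = wt.split("-")
    return int(stones) * 14 + int(pounds)

rows = read_entries(SAMPLE)
for row in rows:
    print(row["RaceID"], row["Pos"], row["Horse"], weight_lb(row["Wt"]), row["SP"])
```

The `csv` module handles the commas inside the quoted RaceDetail field, which a plain `split(",")` would break apart.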

Problem Description 3
• More on alternative approaches:
  • Treat it as a regression/prediction problem; e.g. assign 1/n to the n-th finisher in the race (or use the percentage of the prize money allocated to the place the horse took in the race over the total prize money); then learn a prediction function f. If h1,…,hn are the horses entered into a race, use a decision-making policy that uses f(h1),…,f(hn) and possibly the prior odds o(h1),…,o(hn); e.g. bet on the horse i which has the maximum value for: o(hi)/(f(hi)/Σj f(hj)), where the sum runs over the horses h1,…,hn in the race.
  • Treat the problem as a classification problem, in which the classification algorithm picks the horse to win/the horse to bet on.
  • Treat the problem as a ranking problem. You can learn a similar function as in the first approach, except that the objective function minimizes the number of rank violations (e.g. http://olivier.chapelle.cc/pub/err.pdf describes such a function; see also http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html, which gives the code for a ranking support vector machine). This approach is also connected to a Yahoo! contest, “Learning to Rank”: http://learningtorankchallenge.yahoo.com/index.php
• Honesty Rule: Datasets are temporally ordered; D1<…<DN (‘<’ means prior to). When learning from dataset Di, you are only allowed to use knowledge from datasets D1,…,Di but not from later datasets Di+1,…,DN.
• When using prior statistics, you will have to deal with a cold start problem; there will be new horses, jockeys,…; usually, you should initialize those using average values or other values, but not 0.
• Your approach should focus on horses which win, and not on horses that place second and third a lot, due to the performance measures used in the project (see first slide).
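Two of the points above lend themselves to a short sketch: the 1/n regression target and the cold start default. The code assumes prior statistics are kept in a dictionary keyed by name and uses the average of the known values as the default for unseen horses or jockeys; all function names are hypothetical, and the betting policy shown simply picks the highest score rather than the odds-weighted variant.

```python
def finish_target(pos):
    """Regression target for the horse finishing in position pos: 1/pos."""
    return 1.0 / pos

def lookup_with_default(stats, name):
    """Return the prior statistic for name; unseen names get the average
    of the known values instead of 0 (cold start handling)."""
    if name in stats:
        return stats[name]
    return sum(stats.values()) / len(stats) if stats else 0.0

def pick_horse(entries, score):
    """Pick the entry with the highest predicted score f(h)."""
    return max(entries, key=score)

# Hypothetical prior win rates learned from earlier datasets only.
jockey_rate = {"Jockey A": 0.2, "Jockey B": 0.4}
entries = ["Jockey A", "Jockey B", "Jockey C"]  # "Jockey C" is new

best = pick_horse(entries, lambda j: lookup_with_default(jockey_rate, j))
print(finish_target(2), lookup_with_default(jockey_rate, "Jockey C"), best)
```

Computing `jockey_rate` only from datasets that precede the one being predicted keeps the sketch consistent with the honesty rule.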

Problem Description 4
• You can use any tools and software packages, such as WEKA or regression packages.
• Your system should have 3 (2) modules:
  • A Preprocessing module that formats the dataset and adds additional information to the dataset (optional)
  • A Learning module which takes a dataset and creates a model (and also reports some training statistics)
  • A Testing module that uses a model, picks horses, and creates a detailed report of using the model for the races in the test set; e.g.:
    Race 337: Bet on “Sally” and lost $10
    Race 338: Bet on “Enforcer” and lost $10
    Race 339: Bet on “Caregiver” and won $5 (odds were 1/2)
    Race 340: Bet on “Trailer” and won $30 (odds were 3/1)
    Race 341: Bet on “Lateentry” and lost $10
    Total: Won $5 total, $1 per race
  Comment: There should be a way to deliver your model for testing (e.g. to Chun-sheng).
• I suggest you build the non-horse-picking parts of the system first (e.g. use choosing the horse with the second-highest odds as your initial horse picking function), and then focus on learning “good” horse picking functions for the project.
• There is a transfer learning aspect of the project; e.g. you could use the model for other race tracks in the UK and US.
• We might allow the use of some basic statistics for jockeys and trainers (but not for horses) for the Wolverhampton Race Track, such as: http://www.racingpost.com/horses2/cards/meeting_popup.sd?crs_id=513&action_date=2011-02-17&selected_tab=MEETING_INFO
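The report produced by the testing module could be generated roughly as follows. This is a sketch only: the function name `report`, the tuple layout, and the output formatting are assumptions, and the odds given for the losing bets are hypothetical because the slide does not list them.

```python
from fractions import Fraction

def report(bets, stake=10):
    """bets: list of (race_id, horse, odds, won) with odds as 'a/b' strings.
    Returns the report lines and the total net profit."""
    lines, total = [], Fraction(0)
    for race_id, horse, odds, won in bets:
        if won:
            num, den = odds.split("/")
            gain = stake * Fraction(int(num), int(den))
            lines.append(f'Race {race_id}: Bet on "{horse}" and won '
                         f'${float(gain):.2f} (odds were {odds})')
        else:
            gain = Fraction(-stake)
            lines.append(f'Race {race_id}: Bet on "{horse}" and lost ${stake}')
        total += gain
    per_race = total / len(bets) if bets else Fraction(0)
    lines.append(f"Total: ${float(total):.2f} total, "
                 f"${float(per_race):.2f} per race")
    return lines, total

# The five races from the example report; odds for lost bets are made up.
bets = [
    (337, "Sally", "5/1", False),
    (338, "Enforcer", "2/1", False),
    (339, "Caregiver", "1/2", True),
    (340, "Trailer", "3/1", True),
    (341, "Lateentry", "7/2", False),
]

lines, total = report(bets)
print("\n".join(lines))
```

Keeping the report logic separate from the horse picking function makes it easy to swap in a learned model later without touching the evaluation code.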
