Event History Models: Why R? Why SabreR? Rob Crouchley

Contents • Some science • Performance of the available tools for multilevel models • Breaking the technological barrier to adoption (sabreR) • Demo • Performance of parallel sabreR • Conclusions

Some Science: BHPS Data (small dataset) • Sample of males who were employed and earning a wage at some point over the period 1991-2003 (13 years) • Gives a total of 5130 individuals, each with a sequence of responses that occurred somewhere in the 1991-2003 interval • At the 1st sample point of the survey (1991) there were 2316 individuals, of whom 945 had some form of training in the previous 12 months and 106 had been promoted in the previous 12 months • The mean of the log of their weekly wage was 5.65 (Sterling)

What is the Effect of Training & Promotion on Wages? Suppose we want to disentangle the dependencies between: • Promotion (P=1,0) in the last 12 months (latent var P*) • On the job training (T=1,0) in the last 12 months (latent var T*) • Current wages (W)

Correlated Random Effects Model
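The equations for this slide are not in the text. As a rough sketch only, a correlated random effects model for the three responses named above could be written as follows. The covariate vector x_it, the coefficients beta and delta, and the choice to let T and P enter the wage equation are illustrative assumptions, not the specification actually fitted.

```latex
% Illustrative sketch; x_{it}, \beta, \delta and the form of the wage equation are assumptions.
\begin{align*}
T^{*}_{it} &= x_{it}'\beta_{T} + u_{Ti} + \varepsilon_{Tit}, & T_{it} &= \mathbf{1}(T^{*}_{it} > 0),\\
P^{*}_{it} &= x_{it}'\beta_{P} + u_{Pi} + \varepsilon_{Pit}, & P_{it} &= \mathbf{1}(P^{*}_{it} > 0),\\
W_{it} &= x_{it}'\beta_{W} + \delta_{T} T_{it} + \delta_{P} P_{it} + u_{Wi} + \varepsilon_{Wit},\\
(u_{Ti}, u_{Pi}, u_{Wi})' &\sim N(0, \Sigma), & \Sigma &\ \text{unrestricted (correlated random effects)}.
\end{align*}
```

The off-diagonal elements of Sigma are what tie the three equations together: if they are non-zero, fitting each response separately ignores the shared individual-level heterogeneity.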

Commercial Software for MGLMMs • Stata: http://www.stata.com/ Standard/Adaptive Quadrature, Newton Raphson. See also Stata MP • SAS PROC NLMIXED: http://www.sas.com/ Standard/Adaptive Quadrature and Taylor/Laplace expansions, Quasi Newton. See also SAS PROC MPCONNECT and SAS Grid computing • Limdep: http://www.limdep.com/ Quadrature, Quasi Newton

MGLMMs: Other Systems • MLwiN: http://www.cmm.bristol.ac.uk/ Laplace approximation and IRLS (also MCMC) • Gllamm (Stata prog): http://www.gllamm.org/ Standard/Adaptive Quadrature, Newton Raphson • aML: http://www.applied-ml.com/

Packages at http://cran.r-project.org/ for GLMMs and MGLMMs • lmer (http://cran.r-project.org/web/packages/lme4/index.html) Laplace approximation, penalized iteratively reweighted least squares • npmlreg (http://cran.r-project.org/web/packages/npmlreg/index.html) Quadrature and NPML, EM algorithm

Why Quadrature?
• PQL: Parameter estimates tend to be biased for binary dependent variables with small cluster sizes and high intraclass correlations (e.g. Rodriguez and Goldman, 1995, 2001)
• PQL: does not involve a likelihood, which prohibits the use of likelihood based inference
• Laplace Approximation: The 6th order expansion (Raudenbush et al., 2000) worked as well as 7-point AQ in simulations of a two-level binary dependent variable model
• The precision of Gaussian quadrature (GQ) and adaptive quadrature (AQ) can be increased by simply using more quadrature points
• We cannot increase the degree of the Taylor or Laplace expansion beyond the 2, 4 or 6 terms allowed for
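The point about adding quadrature points can be illustrated with a small base-R sketch. It approximates the marginal likelihood of one cluster in a two-level random-intercept logit model by ordinary (non-adaptive) Gauss-Hermite quadrature. The function names gauss_hermite and cluster_lik, and the toy data, are hypothetical and chosen for illustration; this is not how Sabre or sabreR implement quadrature.

```r
# Gauss-Hermite nodes and weights for the weight function exp(-x^2),
# obtained from the eigen-decomposition of the Jacobi matrix (Golub-Welsch).
gauss_hermite <- function(n) {
  stopifnot(n >= 2)
  i <- seq_len(n - 1)
  J <- matrix(0, n, n)
  off <- sqrt(i / 2)                 # off-diagonal recurrence coefficients
  J[cbind(i, i + 1)] <- off
  J[cbind(i + 1, i)] <- off
  e <- eigen(J, symmetric = TRUE)
  ord <- order(e$values)
  list(nodes = e$values[ord],
       weights = sqrt(pi) * e$vectors[1, ord]^2)
}

# Marginal likelihood of one cluster with binary responses y, fixed-effect
# linear predictor eta and a N(0, sigma^2) random intercept (assumed model).
cluster_lik <- function(y, eta, sigma, n_points) {
  gh <- gauss_hermite(n_points)
  u <- sqrt(2) * sigma * gh$nodes    # change of variable to N(0, sigma^2)
  cond <- vapply(u, function(uq) prod(dbinom(y, 1, plogis(eta + uq))),
                 numeric(1))
  sum(gh$weights * cond) / sqrt(pi)
}

# Toy cluster: the approximation settles down as points are added.
y <- c(1, 0, 1, 1)
eta <- c(0.2, -0.1, 0.4, 0.0)
for (q in c(2, 4, 8, 16, 32)) {
  cat(q, "points:", cluster_lik(y, eta, sigma = 2, n_points = q), "\n")
}
```

Accuracy here is controlled by the single argument n_points, which is the contrast being drawn with fixed-order Taylor or Laplace expansions.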

Simulation Based Methods
• Computer intensive alternatives to GQ and AQ include simulation based approaches such as Markov Chain Monte Carlo (MCMC) (e.g. Gelman et al., 2003) and maximum simulated likelihood (MSL) (Hajivassiliou and Ruud, 1994)
• The hierarchical structure of multilevel models lends itself naturally to MCMC using, for instance, Gibbs sampling. If vague priors are specified, the method essentially yields maximum likelihood estimates
• Unfortunately, a problem with MCMC is how to ensure that a truly stationary distribution has been obtained for MGLMMs, especially when we have a lot of structural and incidental parameters

In tests, serial Sabre outperforms other software
• lmer: GQ and AQ not yet implemented; REML and ML give the Laplace approximation answer
• npmlreg: GQ times, as AQ is not available
• Sabre used the Portland Group PGF90 7.1-6 compiler with -FAST (Level 2 optimization)
• Times are system times (very close to real time in all figures), with very little variation between runs
• R and gllamm are interpreted code; SAS?

Other Sabre comparisons – very small to small sized data sets: • MLwiN (MCMC, IGLS) is 2-25 x slower in univariate 2-level models • For others see the Sabre site • http://sabre.lancs.ac.uk/

Changes in Substantive Findings Between Models

Breaking the technological barrier to adoption Previously • 2X harder to use the NGS than use your local HPC (private computing facility) Now • It is easier to use the NGS (public computing facility) than it is to use your local HPC

Enabling Technology for grid computing All you need is: • An internet connection • The installation of our multiR or sabreR packages for R • A certificate to identify the client to the host -- typically a grid certificate

Also • Users do not need to install or be familiar with Globus, VDT, gsissh, gsiscp, grid-ftp, grid-proxy tools or any other grid related software. • There is very little difference between using the Sabre library from within R on the desktop, and using Sabre for statistical modelling on the grid from within R.

Desktop Vs Grid on the Windows desktop

Serial sabreR

sabre.model.1 <- sabre(proximity ~ factor(time) - 1, case = teacher,
                       first.family = "gaussian", first.mass = 64,
                       first.scale = 0.5)
# display results
sabre.model.1

Parallel sabreR

# load previously saved grid session object
load(file = "ncess.demo.session.R")
sabre.model.2 <- sabre(proximity ~ factor(time) - 1, case = teacher,
                       first.family = "gaussian", first.mass = 64,
                       first.scale = 0.5, session = ncess.demo.session,
                       description = "here ya go !!")
# recover the results and display them
sabre.results(ncess.demo.session, sabre.model.2)

Demo • rob_sabrer_edit2.mov

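Master-Slave (Distributed Memory) Model for MPI as used by Sabre on the NW-Grid
• Slave processes: each computes Li, Hi and di for its own block of cases and forms the partial sums S
• Slave a: Li, Hi, di for i=1,...,1000, giving Sa
• Slave b: Li, Hi, di for i=1001,...,2000, giving Sb
• Slave c: Li, Hi, di for i=2001,...,3000, giving Sc
• Slave d: Li, Hi, di for i=3001,...,4000, giving Sd
• Master process: forms Sa + Sb + Sc + Sd for L, H and d, etc., then takes the Newton-Raphson (NR) step
• There is no commercial software on the NGS or NW-GRID (licensing and cost issues)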
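The accumulate-then-update pattern on this slide can be sketched in a few lines of base R. This is a serial illustration only: lapply stands in for the MPI slave processes, a plain logistic regression stands in for the multilevel likelihood, and the names block_contrib and newton_raphson are hypothetical. It is not Sabre's code.

```r
# One block's contributions to the log-likelihood (L), score (d) and
# Hessian (H) of a logistic regression (a stand-in for the real model).
block_contrib <- function(X, y, beta) {
  p <- plogis(drop(X %*% beta))
  list(L = sum(dbinom(y, 1, p, log = TRUE)),
       d = drop(crossprod(X, y - p)),
       H = -crossprod(X * (p * (1 - p)), X))
}

# "Master": split the cases into blocks, collect the partial sums from each
# "slave" (here just lapply), add them up and take a Newton-Raphson step.
newton_raphson <- function(X, y, n_blocks = 4, tol = 1e-8, max_iter = 25) {
  n <- nrow(X)
  blocks <- split(seq_len(n), cut(seq_len(n), n_blocks, labels = FALSE))
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    parts <- lapply(blocks, function(idx) {
      block_contrib(X[idx, , drop = FALSE], y[idx], beta)
    })
    L <- Reduce(`+`, lapply(parts, `[[`, "L"))   # Sa + Sb + Sc + Sd for L
    d <- Reduce(`+`, lapply(parts, `[[`, "d"))   # ... for d
    H <- Reduce(`+`, lapply(parts, `[[`, "H"))   # ... for H
    step <- solve(H, d)
    beta <- beta - step                          # NR update
    if (max(abs(step)) < tol) break
  }
  # loglik is the value at the beta used for the last evaluation
  list(beta = beta, loglik = L, iterations = iter)
}

# Simulated example with 4000 cases in 4 blocks of 1000, as on the slide.
set.seed(1)
X <- cbind(1, rnorm(4000))
y <- rbinom(4000, 1, plogis(0.5 - X[, 2]))
fit <- newton_raphson(X, y)
print(fit$beta)
```

Because L, d and H are sums over cases, the blocks can be evaluated independently and only the small partial sums need to travel back to the master, which is why the scheme suits distributed memory.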

Performance of Parallel Sabre • Relative performance of Parallel Sabre compared to serial Sabre (=100) on example datasets • In the wage example 5 days becomes 2.75 hours on 48 processors
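The wage example figures translate into a speedup and a parallel efficiency as follows. The only assumption is that "5 days" means 5 x 24 hours of run time.

```r
serial_hours   <- 5 * 24    # assumed: 5 full days of serial run time
parallel_hours <- 2.75
processors     <- 48

speedup       <- serial_hours / parallel_hours        # about 43.6
efficiency    <- speedup / processors                 # about 0.91
relative_time <- 100 * parallel_hours / serial_hours  # about 2.3 (serial = 100)

cat("speedup:", round(speedup, 1),
    " efficiency:", round(efficiency, 2),
    " relative time:", round(relative_time, 1), "\n")
```

On these numbers each of the 48 processors is doing useful work roughly 91% of the time.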

Why R? • Commercial tools (Stata, SAS) are of limited use on a public grid, e.g. Stata MP cannot have multiple data sets in memory and neither system provides access to its source code • There are no plans to install them on the UK National Grid Service (NGS) because of cost/licensing issues • R is an effective, efficient and easy to use tool for statistical modelling • Many existing tried and tested statistical methods already available for R can easily be modified to exploit the benefits of grid computing • Work flows to support the modelling process are simple to create • R is easy to install on most popular operating systems (Windows, Unix, OSX) and can be used directly from a USB memory stick • R includes a programming environment which, when used in conjunction with our multiR and sabreR packages, automatically provides a data centric scripting tool for grid computing • There are no licensing issues

Conclusions • This approach makes all the grid middleware invisible and thus removes the biggest barrier to take up • This approach can provide researchers with more sophisticated statistical modelling tools, help increase their understanding of complex processes, and thus help them to undertake more effective research • Social researchers do not need to let their large scale science agenda using GLMs be set by the developments of the big statistics software houses, like SAS, Stata etc.
