HPC in linguistic research
Presentation Transcript

  1. HPC in linguistic research • Andrew Meade, University of Reading • a.meade@reading.ac.uk

  2. HPC use in linguistic research • Linguistic and biological models • Phylogenies • Linguistic data • Models of evolution • Parallelism • Scaling • Results • Ongoing work • Key challenges

  3. Linguistic and biological systems

  4. Inferring evolutionary histories from linguistic data • Evolutionary histories (phylogenies) • Tools for understanding evolution • Depict relationships between languages • Identify groups that share a common ancestor • Calculate the timing of events • Account for lack of independence in the data • Inferred from data taken from different languages • Using an explicit statistical model of evolution • The problem is NP-hard; the number of possible trees grows as a double factorial • Search methods: Markov chain Monte Carlo, heuristic search, hill climbing • Product of Data + Model
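The double-factorial growth mentioned on this slide can be made concrete. The following is a minimal sketch, assuming the count meant is the number of rooted, bifurcating trees on n labelled languages, which is (2n - 3)!!. The function name is illustrative and does not come from the presentation.

```python
def num_rooted_trees(n_taxa: int) -> int:
    """Number of rooted bifurcating trees on n labelled taxa: (2n - 3)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product for n <= 2).
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 10, 16, 87):
        trees = num_rooted_trees(n)
        # Print the number of decimal digits rather than the full integer.
        print(f"{n:3d} taxa: about 10^{len(str(trees)) - 1} possible trees")
```

Even at a few dozen languages the count is far beyond exhaustive enumeration, which is why the slide lists sampling and heuristic search methods.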

  5. [Figure labels: Greek, Indo-Iranian, Slavic, Celtic, Germanic, Romance]

  6. The Data • Swadesh list (Morris Swadesh, 1940 onwards) • 200 meanings, present in (almost) all languages • Chosen to be stable, slowly evolving and resistant to borrowing • Somewhat of a language “gene”

  7. Cognate classes • Words with a common evolutionary ancestry and meaning • “Fish” class: English Fish, Danish Fisk, Dutch Visch, plus 34 other languages • “Ryba” class: Czech Ryba, Russian Ryba, Bulgarian Riba, plus 23 other languages

  8. Data coding, Cognates • Cognates: words and meanings that are derived from a common ancestor • Languages evolve by a process of descent with modification
     Words for “When” (1 cognate) and “Water” (3 cognates):
       Language   When     Water
       English    when     water
       German     wann     wasser
       French     quand    eau
       Italian    quando   acqua
       Greek      qote     nero
       Hittite    kuwapi   watar
     Binary coding, one column per cognate:
       Language   When   Water
       English    1      1 0 0
       German     1      1 0 0
       French     1      0 1 0
       Italian    1      0 1 0
       Greek      1      0 0 1
       Hittite    1      1 0 0
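The coding step on this slide can be sketched in a few lines. This is an illustration only: the class labels 1, 2, 3 are arbitrary names for the cognate sets, and the function and variable names are assumptions, not taken from the presentation.

```python
# Cognate class of each language's word, per meaning, following the slide's
# "when" (one class) and "water" (three classes) example.
cognate_classes = {
    "English": {"when": 1, "water": 1},
    "German":  {"when": 1, "water": 1},
    "French":  {"when": 1, "water": 2},
    "Italian": {"when": 1, "water": 2},
    "Greek":   {"when": 1, "water": 3},
    "Hittite": {"when": 1, "water": 1},
}


def binary_code(classes, meanings):
    """Return (columns, matrix): one 0/1 column per (meaning, cognate class)."""
    columns = []
    for meaning in meanings:
        seen = sorted({row[meaning] for row in classes.values()})
        columns.extend((meaning, c) for c in seen)
    matrix = {
        lang: [1 if row[meaning] == c else 0 for meaning, c in columns]
        for lang, row in classes.items()
    }
    return columns, matrix


columns, matrix = binary_code(cognate_classes, ["when", "water"])

if __name__ == "__main__":
    for lang, row in matrix.items():
        print(f"{lang:8s}", *row)
```

Each row of the resulting matrix is one language and each column is one cognate class, which is the form the two-state evolutionary model operates on.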

  9. Continuous-time Markov Model • Two states: 0 (non-cognate) and 1 (cognate) • Q01: rate at which cognates are gained (0 to 1) • Q10: rate at which cognates are lost (1 to 0)
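For a two-state continuous-time Markov model, the probability of moving between states along a branch of length t has a well-known closed form. The sketch below assumes that standard form; the function name and the example rates are hypothetical and are not values from the presentation.

```python
import math


def transition_probs(q01, q10, t):
    """2x2 matrix P where P[i][j] is the probability of state j after time t,
    given state i at the start (0 = non-cognate, 1 = cognate)."""
    total = q01 + q10
    pi0, pi1 = q10 / total, q01 / total      # long-run state frequencies
    decay = math.exp(-total * t)
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],   # starting in state 0
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],   # starting in state 1
    ]


if __name__ == "__main__":
    # Hypothetical rates: cognates are lost faster than they are gained.
    for t in (0.0, 0.5, 5.0):
        print(t, transition_probs(q01=0.2, q10=0.5, t=t))
```

At t = 0 the matrix is the identity, and for long branches both rows approach the long-run frequencies, so information about the starting state is lost.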

  10. The Likelihood Model • Calculates the probability of a tree (T), given the data (D) and model of evolution (M) • Serves as the fitness / evaluation function • Accounts for > 99% of the run time • [Equation annotations: product over the model, 1-12 categories; product over the data, 200-100,000 sites]
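To show where the run time goes, here is a minimal sketch of a tree likelihood for binary cognate data. It assumes Felsenstein's pruning algorithm, a single model category, and long-run state frequencies at the root; the presentation does not say which algorithm its code uses. The four-language tree, branch lengths and rates are hypothetical. The quantity computed per site is the probability of the data given the tree and model, which a search evaluates for every candidate tree.

```python
import math


def transition_probs(q01, q10, t):
    """Two-state transition matrix over a branch of length t."""
    total = q01 + q10
    pi0, pi1 = q10 / total, q01 / total
    decay = math.exp(-total * t)
    return [[pi0 + pi1 * decay, pi1 * (1.0 - decay)],
            [pi0 * (1.0 - decay), pi1 + pi0 * decay]]


def partial_likelihoods(node, site, q01, q10):
    """Probability of the data below `node`, for each state (0, 1) at `node`."""
    if isinstance(node, str):                      # leaf: state is observed
        return [1.0 if site[node] == s else 0.0 for s in (0, 1)]
    left, t_left, right, t_right = node
    result = [1.0, 1.0]
    for child, t in ((left, t_left), (right, t_right)):
        p = transition_probs(q01, q10, t)
        below = partial_likelihoods(child, site, q01, q10)
        for s in (0, 1):
            result[s] *= p[s][0] * below[0] + p[s][1] * below[1]
    return result


def log_likelihood(tree, sites, q01, q10):
    """Sum of per-site log-likelihoods (a product over sites on the raw scale)."""
    total = q01 + q10
    root_freqs = [q10 / total, q01 / total]
    logl = 0.0
    for site in sites:
        part = partial_likelihoods(tree, site, q01, q10)
        logl += math.log(root_freqs[0] * part[0] + root_freqs[1] * part[1])
    return logl


# Hypothetical tree: node = (left, left branch length, right, right branch length).
tree = (("English", 0.1, "German", 0.1), 0.2,
        ("French", 0.15, "Italian", 0.15), 0.2)

# The four binary columns from the "when" / "water" example, for these languages.
sites = [
    {"English": 1, "German": 1, "French": 1, "Italian": 1},
    {"English": 1, "German": 1, "French": 0, "Italian": 0},
    {"English": 0, "German": 0, "French": 1, "Italian": 1},
    {"English": 0, "German": 0, "French": 0, "Italian": 0},
]

if __name__ == "__main__":
    print(log_likelihood(tree, sites, q01=0.3, q10=0.6))
```

The work is proportional to the number of sites times the number of tree nodes times the number of model categories, and it is repeated for every tree the search proposes.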

  11. Levels of parallelism • Data: analysis of multiple datasets (3-5) • Model: test a range of models (10-20) • Run: stochastic process, multiple runs (5-10) • The levels above are trivially parallel • Code: an individual run can still take years

  12. The problem • 2003 – 16 taxa, 125 sites, 1 x model • 2005 – 87 taxa, 2450 sites, 4 x model • 2007 – 400 taxa, 34,440 sites, 100 x model • Complexity 700,000x, 5-6 orders of magnitude • 4.8 years per run, typically 5 publication quality runs + 10 model tests • 4.8 years < attention span of academics • results are required in days
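The 700,000x figure can be checked with simple arithmetic. The sketch below assumes that problem size is measured as taxa × sites × model factor; the slide does not state the measure, but this product reproduces the quoted number.

```python
import math

# (taxa, sites, model factor) per year, as listed on the slide.
problems = {
    2003: (16, 125, 1),
    2005: (87, 2450, 4),
    2007: (400, 34_440, 100),
}


def size(taxa, sites, model):
    """Assumed size measure: product of the three quantities."""
    return taxa * sites * model


ratio = size(*problems[2007]) / size(*problems[2003])

if __name__ == "__main__":
    print(f"2007 vs 2003: {ratio:,.0f}x, about 10^{math.log10(ratio):.1f}")
```

Under this assumption the ratio is 688,800, which rounds to the slide's 700,000x and sits between five and six orders of magnitude.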

  13. Parallel method 1: Distribute the data (MPI) • [Figure: the data matrix of languages by cognates, divided among Core 1, Core 2 and Core 3]
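Distributing the data works because the log-likelihood is a sum over sites, so each core can score its own block of sites and the partial sums are then added. The sketch below shows only that decomposition. It uses Python threads as a stand-in for MPI ranks (it is not MPI, and threads in CPython give no real speed-up for this kind of work), and the per-site score is a placeholder rather than a tree likelihood. All names are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def site_log_likelihood(site):
    """Placeholder per-site score; a real run would evaluate the tree here."""
    return math.log((sum(site) + 1) / (len(site) + 2))


def split_sites(sites, n_workers):
    """Cut sites into n_workers contiguous blocks differing in size by at most one."""
    base, extra = divmod(len(sites), n_workers)
    blocks, start = [], 0
    for w in range(n_workers):
        stop = start + base + (1 if w < extra else 0)
        blocks.append(sites[start:stop])
        start = stop
    return blocks


def partial_sum(block):
    """Work done by one worker: score every site in its block."""
    return sum(site_log_likelihood(s) for s in block)


def distributed_log_likelihood(sites, n_workers=3):
    blocks = split_sites(sites, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Summing the partial results plays the role of an MPI reduction.
        return sum(pool.map(partial_sum, blocks))


# Columns of the "when" / "water" coding, in the order English, German,
# French, Italian, Greek, Hittite.
sites = [
    (1, 1, 1, 1, 1, 1),
    (1, 1, 0, 0, 0, 1),
    (0, 0, 1, 1, 0, 0),
    (0, 0, 0, 0, 1, 0),
]

if __name__ == "__main__":
    print(distributed_log_likelihood(sites, n_workers=3))
```

The only communication needed per tree evaluation is one number from each worker, which is why this scheme suits distributed memory.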

  14. Parallel method 2: Distribute the model (OpenMP) • [Figure: four passes over the data (Pass 1 to Pass 4), one on each of Core 1 to Core 4]

  15. Distribute the data and the model (MPI + OpenMP) • [Figure: four passes over the data (Pass 1 to Pass 4), spread across Core 1 to Core 8]

  16. [Figure: run time in seconds (log10 scale) against number of cores]

  17. [Figure: efficiency against number of cores]

  18. Results • Runtime reduced from 4.8 years to • Good scaling, but not sustainable • HPC has allowed for the accurate analysis of large complex data sets with statistically justifiable models.

  19. Current work • Phoneme data • Modelling sound utterances • Better resolution than cognacy data • Relevant linguistic patterns are emerging • 120 phonemes, 2 cognacy judgments • Another 3 orders of magnitude of complexity • Accelerator implementation (CUDA / OpenCL)

  20. Scalable computing • Last 10 years: a 5-6 orders of magnitude increase in complexity • Reasonably scalable code, but a redesign is needed • Need to change the how, not the what • What – statistical framework, realistic models • How – algorithm, language, parallelisation method, hardware • Scalable algorithms

  21. Convergence • [Figure labels: Parallel, Burn-in, Serial]

  22. Parallel sampling using multiple chains
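One reading of the last two slides is that the burn-in of a Markov chain is inherently serial, while the sampling that follows can be shared among independent chains. The sketch below illustrates that idea with a random-walk Metropolis sampler on a toy one-dimensional target standing in for a tree posterior. The target, step size, chain count and all names are assumptions for illustration.

```python
import math
import random


def log_target(x):
    """Toy target: standard normal log-density, up to a constant."""
    return -0.5 * x * x


def run_chain(seed, n_samples, burn_in, step=1.0):
    """Random-walk Metropolis; returns only the samples drawn after burn-in."""
    rng = random.Random(seed)
    x = rng.uniform(-10.0, 10.0)          # deliberately poor starting point
    kept = []
    for i in range(burn_in + n_samples):
        proposal = x + rng.gauss(0.0, step)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < log_target(proposal) - log_target(x):
            x = proposal
        if i >= burn_in:
            kept.append(x)
    return kept


def sample_with_chains(n_chains, total_samples, burn_in):
    """Split the sampling budget across chains; every chain pays its own burn-in."""
    per_chain = total_samples // n_chains
    pooled = []
    for seed in range(n_chains):          # iterations are independent of each other
        pooled.extend(run_chain(seed, per_chain, burn_in))
    return pooled


samples = sample_with_chains(n_chains=4, total_samples=4000, burn_in=500)

if __name__ == "__main__":
    mean = sum(samples) / len(samples)
    print(f"{len(samples)} pooled samples, mean {mean:.3f}")
```

Because the chains never communicate, each could run on its own core; the cost is that the burn-in is repeated once per chain, so the gain shrinks as burn-in comes to dominate the run.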

  23. Key challenges • Computing is a rate-limiting step • Treading water / drowning • Widening gap between computing power and data and model complexity • Data set size and model complexity are restricted • 20-30 year old methods, which are less accurate and non-statistical, are returning • Connecting researchers with results, not HPC • HPC is a nuisance in science • Steep learning curve • High cost: hardware, running costs and personnel • Access and flexibility • Not a one-off activity: thousands of data sets are produced each year, 3000+ published in 2011

  24. Acknowledgments Mark Pagel