
Deep Learning and HPC

Adam Coates

Visiting Scholar at IU Informatics

Post-doc at Stanford CS


What do we want computers to do with our data?

  • Images/video: label them (“Motorcycle”), suggest tags, image search, …

  • Audio: speech recognition, music classification, speaker identification, …

  • Text: web search, anti-spam, machine translation, …


Computer vision is hard!

[Slide: a grid of photos, each labeled “Motorcycle”, showing wide variation in pose, lighting, occlusion, and background.]


What do we want computers to do with our data?

  • Images/video: label them (“Motorcycle”), suggest tags, image search, …

  • Audio: speech recognition, music classification, speaker identification, …

  • Text: web search, anti-spam, machine translation, …

Machine learning performs well on many of these problems, but applying it is a lot of work.

What is it about machine learning that makes it so hard to use?



Why is this hard?

You see this: [a photo of a motorcycle]

But the camera sees this: [a grid of raw pixel intensity values]


Machine learning and feature representations

[Slide sequence: a raw image is fed as input to a learning algorithm. Plotting two raw pixel intensities (pixel 1 vs. pixel 2) for motorbike and “non”-motorbike training images shows the two classes thoroughly intermingled; raw pixels are a poor representation for learning.]


What we want

[Slide: the raw image is first mapped to a feature representation, e.g. “Does it have handlebars? Wheels?”, and the features are fed to the learning algorithm. In feature space (wheels vs. handlebars), motorbikes and “non”-motorbikes become cleanly separable.]


How is computer perception done?

  • Images/video: Image → Vision features → Detection

  • Audio: Audio → Audio features → Speaker ID

  • Text: Text → Text features → Text classification, machine translation, information retrieval, …

Coming up with features is difficult, time-consuming, and requires expert knowledge.

When working on applications of learning, we spend a lot of time tuning the features.


Deep Learning

  • Find algorithms that can learn representations/features from data.

    • Deep neural networks.

    • “Unsupervised feature learning”

      • Learn representations without knowing the task.


Deep Learning

  • Build multi-stage pipelines from simple pieces.

    • Classic system: a deep neural net.

    • Generally: compositions of differentiable functions.

  • Optimize the weights inside the network to give correct answers on the training data.

[Diagram: an image flows through the network’s stages and comes out labeled “Motorcycle”.]
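
Since the pipeline is described as a composition of differentiable functions, a minimal sketch may help make that concrete. This is an illustration only, not the Stanford system’s code; the layer sizes and names are invented:

```python
import numpy as np

# A deep network is literally a composition of simple differentiable pieces:
# f(x) = softmax(W2 @ relu(W1 @ x + b1) + b2)
rng = np.random.default_rng(0)
W1, b1 = 0.01 * rng.standard_normal((64, 784)), np.zeros(64)   # stage 1 weights
W2, b2 = 0.01 * rng.standard_normal((10, 64)), np.zeros(10)    # stage 2 weights

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

def network(x):
    h = relu(W1 @ x + b1)              # stage 1: learned feature representation
    return softmax(W2 @ h + b2)        # stage 2: classifier over those features

x = rng.standard_normal(784)           # a toy input "image"
print(network(x).argmax())             # index of the predicted label, e.g. "Motorcycle"
```

Because every stage is differentiable, the chain rule gives gradients of the loss with respect to every weight, which is what the training loop on the next slide uses.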


Basic algorithmic components

  • In a loop over the entire training set (see the sketch below):

    • Evaluate the deep network.

      • Usually process a batch of training examples (e.g., 100) at once.

    • Compute the gradient of the loss function w.r.t. the parameters.

      • Sum up gradients over the batch of examples.

    • Update the trainable parameters using the gradient.
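
A minimal, runnable version of this loop, with logistic regression standing in for the deep network (all names and sizes here are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))          # toy training set: 1000 examples
w_true = rng.standard_normal(50)
y = (X @ w_true > 0).astype(float)           # toy binary labels

w = np.zeros(50)                             # trainable parameters
lr, batch = 0.1, 100                         # step size; batches of 100 examples

for epoch in range(20):                      # loop over the entire training set
    for i in range(0, len(X), batch):
        xb, yb = X[i:i+batch], y[i:i+batch]  # one batch of examples
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))  # evaluate the "network"
        grad = xb.T @ (p - yb)               # loss gradient, summed over the batch
        w -= (lr / batch) * grad             # update parameters using the gradient
```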


Scaling Up Deep Learning at Stanford

  • Most DL networks are built on a few primitives.

    • Mostly large dense matrix/vector operations.

    • A few “block” matrices for widely-used cases.

    • Communication is hidden in distributed arrays.

  • Most operations are hardware-friendly.

    • Not far from sgemm throughput.

    • Relatively low communication / IO needs.

  • But it is hard to avoid doing many iterations.

    • Have to focus on making each loop iteration very fast.
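
To make “mostly large dense matrix operations” concrete: one fully-connected layer applied to a whole batch is a single large GEMM, which is why well-written DL code can run close to sgemm throughput. A tiny illustrative sketch (sizes invented):

```python
import numpy as np

batch, n_in, n_out = 100, 4096, 4096
X = np.float32(np.random.randn(batch, n_in))   # a batch of 100 inputs
W = np.float32(np.random.randn(n_in, n_out))   # layer weights
b = np.zeros(n_out, dtype=np.float32)

# The entire layer, for the entire batch, is one dense matrix multiply
# (an sgemm in BLAS terms) plus a cheap elementwise nonlinearity.
H = np.maximum(0.0, X @ W + b)                 # (100 x 4096) @ (4096 x 4096)
```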


Scaling Up Deep Learning at Stanford

  • In-house MPI+CUDA infrastructure.

    • Up to 11.2B-parameter networks.

    • Typical experiment: ~14M images (ImageNet).

[Coates et al., ICML 2013]


Scaling Up Deep Learning at Stanford

  • Duplicated “Google Brain” with 3 machines.

    • Compared to 1000+ machines for the original.

    • Unsupervised learning from 10M YouTube frames.

  • Largest artificial neural nets ever trained.

    • 6.5x larger than the previous system.

… but what should we do with it!?

Surprisingly hard to find a problem big enough that such models matter!

[Coates et al., ICML 2013]


Applications

  • Building universal representations.

    • “One neural net to rule them all.”

[Diagram: a single shared representation feeds many tasks: object recognition, localization, tagging, depth estimation.]

[E.g., Collobert et al., 2011]
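
A minimal sketch of the shared-representation idea: one trunk computes features once, and small per-task heads consume them. Everything here (names, sizes) is my illustration, not the talk’s code:

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = 0.01 * rng.standard_normal((256, 784))   # the shared trunk, trained once

def shared_representation(x):
    # Every task consumes this same output; only the heads differ.
    return np.maximum(0.0, W_shared @ x)

W_recog = 0.01 * rng.standard_normal((1000, 256))   # object-recognition head
W_depth = 0.01 * rng.standard_normal((1, 256))      # depth-estimation head

x = rng.standard_normal(784)                        # a toy input image
h = shared_representation(x)                        # computed once...
class_scores = W_recog @ h                          # ...then reused by task 1
depth_estimate = W_depth @ h                        # ...and by task 2
```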


Applications

  • Autonomous driving.

    • 1 year × 1 Hz ≈ 30M frames (365 days × 86,400 s/day ≈ 31.5M seconds).

      • [You actually have to drive for 1 year!]

    • Can we train from a few hundred 1080p frames per second?


Applications: why these?

  • High impact.

    • Universal representations: many applications, each with diffuse value.

    • Driving: a single application with high value.

  • Train once, deploy everywhere.

    • Training is hard and expensive.

    • Deploying is easy and cheap.

    • A supercomputer can generate an artifact that gets re-used by others.


Things that work

  • Find the common cases; tightly optimize them.

    • Surprisingly few core pieces: e.g., ~10.

  • Distributed arrays.

    • Massive time-saver; easy to think about.

    • Easy to save and restore from Lustre.

    • Load shards and sanity-check them in Matlab.

  • High-level language bindings.

    • Low-level code in C++/CUDA (JIT-compiled).


Challenges

  • Experiment turn-around time is still long.

    • Maybe 3-5 experiments running at once.

    • Weeks for big models / big datasets.

  • Productivity is still much lower than, e.g., Matlab.

    • Lack of strong tools at every level except the lowest.

  • Many DL hackers are not systems hackers.

  • Lots of hard-won lessons are trapped in our group.


Laundry list from Stanford infrastructure

  • Job control and scripting is painful.

    • Zombie processes.

    • PBS/Torque mostly works.

  • JIT compilation

    • JIT-compile C/C++ code.

      • Flexible enough to do many things.

      • Easier to use the CUDA runtime, templatizing, etc.

        • Avoids the Driver API, which is much less convenient.

      • Easier to link with high-level languages.

    • Needs to be thread-savvy.

      • Caching of compiled modules.

      • Avoiding deadlocks or locking problems in the cache(s).

    • Ideally invisible to users.

      • But the first use of a kernel is really slow.

  • Debugging

    • Unclear what to do here. Support for common tools? NVTX, VampirTrace, …?

  • Distributed arrays (see the sketch after this list)

    • The Stanford implementation is rough; we should have pursued a more standard approach.

    • Like MATLAB’s codistributed arrays or ScaLAPACK-style arrays:

      • A multi-dimensional array with a “distributor” that maps indices to ranks.

      • Support to re-distribute an array.

      • Support to save/load arrays even when the process grid changes.

      • Distribution-aware implementations of most functionality.

  • Execution structure

    • Imperative programming is just easier (especially with students + scientists).

      • DAGs, etc. are static and difficult to alter. This works OK for us, but causes many headaches.

      • CUDA streams+events semantics is really nice.

        • It solves the same problem: hide massive parallelism from the caller.

        • But it allows arbitrary scheduling on the fly, and the behavior as viewed by the host is easy to understand.

      • If you want custom functionality, you just have to write the parallel code.

        • In CUDA, you have to write the kernel.

        • For ScaLAPACK, you had to write code on top of BLACS.

    • The single-rank case should look like the 100-rank case.

      • Students can prototype single-rank; it is easier to think about.

  • IO tools

    • We spend a lot of time writing file loaders.

      • Application-specific, but with lots of boilerplate.

        • Many common cases in ML, e.g. a list of samples, where each sample = video, image, string, or vector.

      • Currently difficult to handle distributed saving/loading of large arrays of data.
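
A minimal sketch of the “distributor” idea from the distributed-arrays bullet above, assuming a 1-D block distribution over rows. The class and method names are my invention, not Stanford’s, MATLAB’s, or ScaLAPACK’s API, and a single process stands in for all ranks:

```python
import numpy as np

class BlockDistributor:
    """Maps a global row index to (rank, local index) for a 1-D block layout."""
    def __init__(self, n_rows, n_ranks):
        self.block = (n_rows + n_ranks - 1) // n_ranks  # rows assigned per rank

    def owner(self, i):
        return i // self.block                          # which rank holds global row i

    def local_index(self, i):
        return i % self.block                           # row offset within that rank's shard

class DistributedArray:
    """A global 2-D array stored as per-rank shards; here one process plays every rank."""
    def __init__(self, n_rows, n_cols, n_ranks):
        self.dist = BlockDistributor(n_rows, n_ranks)
        # For simplicity every shard is allocated at full block size.
        self.shards = [np.zeros((self.dist.block, n_cols)) for _ in range(n_ranks)]

    def __setitem__(self, idx, value):
        i, j = idx
        self.shards[self.dist.owner(i)][self.dist.local_index(i), j] = value

    def __getitem__(self, idx):
        i, j = idx
        return self.shards[self.dist.owner(i)][self.dist.local_index(i), j]

# Usage: global indexing hides which rank owns which rows.
A = DistributedArray(n_rows=1000, n_cols=8, n_ranks=4)
A[700, 3] = 1.0                       # lands in rank 2's shard (block size 250)
print(A[700, 3], A.dist.owner(700))   # 1.0 2
```

Re-distribution, save/load across changing process grids, and distribution-aware operations would all build on the same distributor mapping.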

