If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens? -Seymour Cray
It's hard for me to call GPU computing "the future" when in many fields it has already become the present. This article will not present anything that the rest of the "massive computing" world has not known for some time, nor am I the only one who sees GPUs as a tool to be leveraged in our field. However, the atmospheric sciences are certainly lagging behind in leveraging what is an incredibly powerful revolution in parallel computing.
Why should I write GPU code?
Some fields, such as the machine learning and AI communities, have learned to leverage the massive parallel computing power of Graphics Processing Units (GPUs) to speed up their data processing. NVIDIA even sells its own deep learning machine in a box that speeds up deep learning problems by several orders of magnitude. But how is this useful, and why should we care? Well, for starters, anything that can take a 700+ hour problem and turn it into a 7 hour problem is worth looking into if you like your code completing faster and using fewer resources to do so. Without getting into the nitty-gritty details of how it works, the principle is demonstrated in the quote above by Seymour Cray: the GPU approach takes the 1024 chickens, while the CPU approach takes the oxen. The CPU in my computer has 4 cores, while the graphics card I'm currently using has 1024 cores. Each GPU core is slower than a CPU core, but when a problem splits into many independent pieces, it's a lot faster to run it over 1024 cores than over 4!
Up next in the "why you should care" arena is that Numerical Weather Prediction is already advancing in the GPU direction. NVIDIA is collaborating with NCAR, NASA, and NOAA, among others, to make existing prediction systems faster by leveraging the GPU architecture already present on most supercomputing platforms. In the image above you can see the progress that's been made by the folks downstairs from me in SSEC in converting WRF physics schemes to GPU code! Performance increases pretty much justify themselves, but faster models that use fewer resources are 1) cheaper to run and 2) able to run at higher resolution for whatever it is we're simulating. Other efforts from NCAR to leverage the GPU can be found here and here.
Using a GPU to integrate fluid parcel trajectories!
The only reason I did any research into this and began learning how to write GPU code is the massive scale of data analysis we need to do on our CM1 simulations. My adviser (Dr. Leigh Orf) has figured out a way to save our 30 m isotropic tornado simulation at every model time step, which in our case is every 1/6 of a second. This essentially preserves the simulation in amber, so we can do whatever we want with it without interpolating between time steps, including computing fluid parcel trajectories. However, when doing parcel trajectory analysis, we needed to meet the following criteria:
- We wanted to integrate without re-running all of CM1
- We wanted to be able to change where we're releasing parcels without re-running the model
- We wanted to be able to release massive numbers of parcels
- We wanted it to be fast. No waiting around for a few days for it to complete on Blue Waters.
Parcel trajectories are an embarrassingly parallel problem. Each fluid parcel is independent of its neighbors, and relies only on the U, V, and W wind components at the parcel's location. This means that for a given time step, the integration can be parallelized over the number of parcels. And that's exactly what I did - I wrote GPU code to parallelize the parcel integration at a given time step.
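To make the independence concrete, here is a minimal sketch in plain Python rather than GPU code. It is not my actual implementation: the forward Euler update, the function names, and the analytic wind field standing in for the saved CM1 winds are all assumptions made for illustration. The point is only that the update for one parcel never reads another parcel's state, so every parcel in a time step could be handed to its own GPU thread.

```python
# Hypothetical stand-in for looking up saved CM1 winds at a parcel position.
def wind_at(x, y, z):
    """Return (u, v, w) in m/s: solid-body rotation plus a steady updraft."""
    omega = 0.01  # rotation rate in 1/s (illustrative value)
    return (-omega * y, omega * x, 5.0)


def step_parcel(pos, dt):
    """Advance one parcel by one time step (forward Euler, an assumption)."""
    x, y, z = pos
    u, v, w = wind_at(x, y, z)
    return (x + u * dt, y + v * dt, z + w * dt)


def step_all(parcels, dt):
    """Advance every parcel; no parcel reads another parcel's state."""
    # On a GPU, each iteration of this loop would run as its own thread.
    return [step_parcel(p, dt) for p in parcels]


def integrate(parcels, dt, n_steps):
    """Time steps run in sequence; the parcels within a step do not need to."""
    for _ in range(n_steps):
        parcels = step_all(parcels, dt)
    return parcels


dt = 1.0 / 6.0  # model time step from the text, in seconds
seeds = [(100.0, 0.0, 50.0), (0.0, 200.0, 50.0)]
final = integrate(seeds, dt, n_steps=6)  # one second of model time
```

Only the loop over time steps has to stay sequential. The loop inside `step_all` is the part that maps onto the 1024 chickens.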
The results: I was able to integrate 1,000,000 parcels forward in time over 5,400 time steps in 19 seconds. The only reason I couldn't do more is that my desktop computer ran out of the 32 GB of RAM I have. While this timing does not include I/O calls to the disk, it still implies that one could easily drop a quarter of a million fluid parcels in the model domain, integrate them forward, and have the computation done before your bathroom break is over. Don't like where you dropped those parcels? Run it again and wait a minute or two. Done. By the way, this is using a single NVIDIA GTX 960 GPU, and it isn't even sweating.
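For a rough sense of scale, the numbers above can be turned into a throughput figure. This is back-of-the-envelope arithmetic only, and it assumes the 5,400 steps were taken at the 1/6 second model time step mentioned earlier; the variable names are just for illustration.

```python
n_parcels = 1_000_000   # parcels integrated
n_steps = 5_400         # time steps per parcel
wall_seconds = 19.0     # reported compute time, excluding disk I/O
dt = 1.0 / 6.0          # assumed: the model time step, in seconds

parcel_updates = n_parcels * n_steps               # total position updates
updates_per_second = parcel_updates / wall_seconds
model_minutes = n_steps * dt / 60.0                # simulated time covered

print(f"{parcel_updates:.2e} parcel updates")
print(f"{updates_per_second:.2e} updates per second")
print(f"{model_minutes:.0f} minutes of model time")
```

That works out to 5.4 billion parcel updates, or roughly 280 million per second, covering 15 minutes of simulated time.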
Imagine leveraging that power to actually run a cloud model on, say, 10, 100, or 10,000 GPUs. What if Warn-on-Forecast became a genuine possibility because GPU optimizations made it possible to run a 250 member, 1 km convective ensemble and have it finish by lunchtime? Personally, I think it's incredibly exciting to think about the possibilities. However, I want to stress that it isn't as easy as flipping a switch and all of a sudden your existing code is parallelized on the GPU. To truly get the most bang for your buck, you have to think in a different programming model and essentially start from scratch, which is why we don't have these things yet. It takes time to build, test, debug, and optimize. But with some patience and a willingness to learn a new skill, magical things can happen. It only took me about a week to learn and write my code. If this is something you're interested in learning, check out this beginner guide from NVIDIA. Additionally, this video lecture series from Udacity on parallel GPU computing really helped me solidify these concepts and the programming model.