
Thanks, NVIDIA

Andrew and I both received a note like this from NVIDIA:

We have reviewed your NVIDIA GPU Grant Request and are happy support your work with the donation of (1) Titan Xp to support your research.


In case other people are interested, NVIDIA’s GPU grant program provides ways for faculty or research scientists to request GPUs; they also have graduate fellowships and larger programs.

Stan on the GPU

The pull requests are stacked up and being reviewed and integrated into the testing framework as I write this. Stan 2.19 (or 2.20 if we get out a quick 2.19 in the next month) will have OpenCL-based GPU support for double-precision matrix operations like multiplication and Cholesky decomposition. And the GPU speedups are stackable with the multi-core MPI speedups that just came out in CmdStan 2.18 (RStan and PyStan 2.18 are in process and will be out soon).

Plot of GPU timing

Figure 1. The plot shows the latest performance figures for Cholesky factorization; the X-axis is the matrix dimensionality and the Y-axis the speedup vs. the regular Cholesky factorization. I’m afraid I don’t know which CPU/GPU combo this was tested on.
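
For readers who want a concrete picture of the operation being timed, here is a minimal pure-Python sketch of Cholesky factorization. It is only an illustration, not Stan's implementation (which is C++ with OpenCL kernels); the function name `cholesky` and the 3×3 test matrix are made up for this example. The triple-nested work (two loops plus an inner sum) is why the cost grows roughly cubically with matrix dimension, and why large matrices are the ones worth shipping to a GPU.

```python
import math


def cholesky(a):
    """Return lower-triangular L such that L * L^T equals a.

    `a` is a symmetric positive-definite matrix given as a list of rows.
    Illustrative only: a real implementation would use a tuned library.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the already-computed parts of rows i and j.
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


# Small symmetric positive-definite example (hypothetical data).
A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
for row in L:
    print(row)
```

A speedup figure like the one plotted is then just the CPU time for this operation divided by the GPU time at each matrix size.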

Academic hardware grants

I’ve spent my academic career coasting on donated hardware, back when hardware was a bigger deal. It started at Edinburgh in the mid-80s with a Sun Workstation donated to our department. LaTeX on the big screen was just game changing compared with working on a terminal and then printing the PostScript. Then we got Dandelions from Xerox (crazy Lisp machines with a do-what-I-mean command line), and it continued with really great HP Unix workstations at Carnegie Mellon that had crazy high-res CRT monitors for the late ’80s. Then I went into industry, where we had to pay for hardware. Now that I’m back in academia, I’m glad to see there are still hardware grant programs.

Stan development is global

We’re also psyched that so much core Stan development is coming from outside of Columbia. For the core GPU developers, Steve Bronder is at Capital One and Erik Štrumbelj and Rok Češnovar are at the University of Ljubljana. Erik’s the one who tipped me off about the NVIDIA GPU Grant program.

Daniel Lee is also helping out with the builds and testing and coding standards, and he’s at Generable. Sean Talts is also working on the integration here at Columbia; he played a key design role in the recent MPI launch, which was largely coded by Sebastian Weber at Novartis in Basel.


  1. Carlos Ungil says:

    > Sebastian Weber at Novartis in Zurich.

    Do you mean Novartis in Basel?

  2. kaslin says:

    well, that Nvidia hardware retails for $1365 each — not a big deal for a college-professor income, but a polite thank-you to Nvidia is fine.

    Big retail discounts on computer hardware/software are very common generally for college staff & students… more than a few significant freebies available too from commercial, institutional & government sources.

    What would be the “ideal” retail computer hardware for a high level statistics professor/researcher … and what is the approximate price ?

    • I’m happy to thank anyone who donates $2500 worth of hardware we can use. It’s $2500 we can spend on grad students, postdocs, travel, etc.

      Yes, we can get things like Photoshop through Columbia at very low prices and they bought a site license for MS Office (grr, arg). Our discount on Macs is low, but the discount on AppleCare is good.

      Depends on your price range and how much you want to use it. Does this hypothetical person work at a university that has a decent compute cluster with GPU-enabled nodes? If so, and it’s cheaper than AWS (Amazon’s cloud business), use that. If not, use AWS if you need really high bandwidth briefly. The config Rok used to test with a good CPU and Tesla GPU is about $2/hour.

      If you want to run big jobs or a few jobs locally that need to scale, but not to hundreds of computers or terabytes of memory, then you want a high-core, high-memory Linux machine (say 32 or 64 cores with at least 4 GB RAM per core); that’s on the order of $10K. Then if you have big matrix ops, throw in one or two of those Tesla P100 GPUs for another $6K each.

      But if you’re like me, you can get away with doing almost all of your work on a MacBook until you come to fit really big models or need to do a grid search over some configuration parameter for testing. Then you can get a grad student or postdoc (too bad nobody else will ever get Matt Hoffman as a postdoc) to help run things on a cluster.

  3. Emil Begtrup-Bright says:

    Nice thank you for the tip! What a nice program they have. I’ll give it a shot, could def use a new GPU

  4. Zach Smith says:

    Is there any indication at this point of performance differences in cards built for double precision vs single precision, i.e., Tesla/Quadro vs GeForce? It seems this double-precision Cholesky decomposition would benefit much more from the scientific GPUs.

    How much of the sampling process can be moved to the GPU?

    Exciting news.

    • The card makes a huge difference. See Rok’s latest post, which compares a Tesla V100 (retail about US$6K) and Titan Xp (retail about US$1500). The Tesla is way faster. The graph in this post includes derivatives, which wind up being much better accelerated than just the matrix ops. That still holds, but isn’t shown in Rok’s diagram, which is double-precision only without derivatives.

      We’re making the log density and gradient evaluations faster. That’s where over 95% of the time is spent. When the sampler becomes a bottleneck, we’ll work on that. We’ll also be working on parallelizing some of the lower level functions, like vectorized log densities (and by we, I mean Sebastian Weber’s already working on it and posting profiling info on the Discourse groups).

  5. Anoneuoid says:

    > Stan 2.19 (or 2.20 if we get out a quick 2.19 in the next month) will have OpenCL-based GPU support

    Isn’t there some issue with NVIDIA’s OpenCL support since it competes with their CUDA “framework”, so if going with NVIDIA you would be better off with a CUDA solution? Eg, from the most recent release notes:

    > Note that there is no change in the OpenCL version (1.2) supported by NVIDIA.

    Some discussion about lack of OpenCL 2 support and OpenCL 10-20% slower on NVIDIA cards:

    I know little about it (when/why exactly is this a problem, etc) so would be very interested in any details or insight.

    • My guess is that NVIDIA cares more about selling hardware and CUDA is a means to that end. I’m sure they don’t mind when big projects commit 100% to CUDA.

      With Stan, we went in heavily favoring the open-source option. We also wanted it to run for cards other than NVIDIA.

      Rok and Erik have talked about also supporting a CUDA kernel to get some speed improvement. Our GPU support is very configurable.

      • Stephen Bronder says:

        Apologies for the late reply.

        As Bob said, we favored first building out the OpenCL version because it’s open-source and can run on non-Nvidia GPUs. In addition, OpenCL code can compile to run on CPUs and other devices*. At this time we are focusing on GPUs, but the same kernels can be used later to parallelize on CPUs**.

        We’ve discussed making CUDA available as well in the future, though when I hear the 10-20% slower thing I usually see that as sort of half truth, half myth. CUDA only compiles for Nvidia GPUs because they do very specific optimizations for each type of device (i.e., a 1080 Ti will have different optimized assembly compared to a V100). Justin Lebar in his CppCon 2016 talk speculated that Nvidia actually hand optimizes the assembly for particular operations on a per-device basis. Which operations get that hand-optimized speedup? That part of their compiler is proprietary so,


        OpenCL does pay a penalty for being more general, but I think the speedup differences vary quite a lot. Some specific operations and devices will see a gap like that; however, I’ve also heard stories of switching to OpenCL from CUDA and not seeing any performance difference. Once we start working on the CUDA kernels we can empirically test for our case whether CUDA provides that sort of speed improvement.

        It’s frustrating that Nvidia does not currently support OpenCL 2.x, though not terrible. OpenCL 1.2 gives us what we need. If in the future Nvidia gives support (or we experiment with OpenCL 2.x and it allows us to do a lot more) we can make a version of our current work that utilizes 2.x.

        *Intel has an OpenCL SDK for compiling to FPGAs. The hardware nerd in me thinks that’s pretty neat and a fun idea, but I don’t think that’s practical or reasonable for a normal user of Stan to care about. Besides some weird experimental branch of Stan I’ll make in the future, I don’t see us prioritizing FPGA support.

        ** We actually did support CPUs in the beginning, but stopped because we want to focus on GPUs. In the future I want to make a smart load balancer for CPUs/GPUs so we can know which device a routine should be sent to.
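
The buy-versus-rent advice in the reply to comment 2 can be turned into a rough break-even calculation. The sketch below uses only the ballpark prices quoted there ($2/hour for the cloud configuration, about $10K for a high-core machine, about $6K per P100). It assumes, purely for illustration, that the cloud configuration and the purchased machine are interchangeable, and it ignores power, administration, and depreciation. The function name `break_even_hours` is made up for this example.

```python
def break_even_hours(purchase_cost, hourly_rate):
    """Hours of cloud rental that cost as much as buying the hardware."""
    return purchase_cost / hourly_rate


# Ballpark figures quoted in the thread (US dollars).
MACHINE = 10_000      # high-core, high-memory Linux box
GPU = 6_000           # one Tesla P100
CLOUD_RATE = 2.0      # per hour for a good CPU plus a Tesla GPU

one_gpu = break_even_hours(MACHINE + GPU, CLOUD_RATE)
two_gpus = break_even_hours(MACHINE + 2 * GPU, CLOUD_RATE)

print(one_gpu)    # 8000.0 hours
print(two_gpus)   # 11000.0 hours
print(two_gpus / (24 * 365))  # years of round-the-clock use
```

Under these assumptions the two-GPU machine pays for itself only after about 11,000 rented hours, roughly 15 months of round-the-clock use, which is consistent with the advice to rent when the heavy compute need is brief.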
