Cluster nodes for RELION-GPU

In our opinion, the best solution by far is to use rackmount GPU cluster nodes where you can get 10Gbit or Infiniband connectivity and have the fans spin at 100% 24/7 without disturbing anybody.

Unfortunately, there is one complication: when people first started using GPUs for computing, the power and cooling requirements were a big challenge for vendors. To optimize things, NVIDIA (and AMD, Intel) created special passively cooled cards for servers, i.e. the GPU is cooled by the server’s fans instead of its own. To save space, the professional Tesla (as well as Quadro) cards also have their power connector on the short side of the card, and most server vendors have designed their chassis around this. In other words, in most cluster nodes there is no space for a power connector placed on the top (long) edge of the card – which means consumer cards will not fit physically in most rackmounted servers.

There are a couple of ways out of this. First, you can get workstations (e.g. Dell T630 or Supermicro SYS-7048GR-TR) that can be turned into 4U rackmounted servers. These will work great, but might take a bit of extra space in your racks. Second, some vendors might be able to sell you servers with extra-tall lids, which effectively turn a 4U server into a 4.5U server so you can fit the GeForce power connectors. This is most common for 8-way GPU systems, and since RELION-2 still does some things on the CPU we still prefer quad-GPU systems in terms of value for money and versatility.

However… the nicest solution is that there are now a couple of vendors that sell 1U rackmount nodes specifically designed to accommodate GeForce cards too. In particular, we love Supermicro’s SYS-1028GQ-TRT.
You might not find a lot of information about this on vendor web pages, but that is mostly because neither Supermicro nor NVIDIA officially certify this combination (they still work great together). The short story:

  • 1U rackmount server, 2kW power supply
  • Room for four GPUs; three in the front and a final one at the back.
  • We got the version with 10Gbit ethernet built-in on the motherboard, and even with four GPUs there is room for one small PCIe card that you use e.g. for Infiniband.
  • Drawback: there is only room for two small 2.5″ drives in the machine.
  • There is room between the GPUs to fit power connectors on the side.
  • You need to get PCIe power cables to fit the GPU you want to use. We went with 980Ti cables, since those cards need 8+6 pins. For the 1080 cards we only need 8 pins, but when a future 1080Ti appears we will probably need the extra six again.

To actually mount consumer cards in the SYS-1028GQ-TRT you will need to unscrew the mounting bracket (i.e., the metal piece around the display connectors). You won’t need it in the node anyway. We actually removed the backplates too, but that is probably not necessary.

[Image: a GeForce 980 card mounted in the 1U node]

There is a whole row of high-power cooling fans just in front of the GPUs. At some point we intend to test whether we get even better cooling by disassembling the GPUs to disable their built-in fans, but for now this is good enough!

Even with the fast dual 10Gbit connection to our file server, we decided to add a small SSD for caching. Here’s the setup we got for roughly $4000 per node (May 2016, GPU cost not included):

  • Supermicro SYS-1028GQ-TRT, i.e. the built-in 10Gbit ethernet option
  • Dual Xeon E5-2620v4 CPUs (8 cores each).
  • 128GB DDR4 memory
  • Cables for 980Ti GPUs
  • A small 32GB Supermicro SuperDOM to boot the OS (disk-on-module, essentially a small SSD that you plug directly into the motherboard so you don’t waste one of the only two 2.5″ SATA bays).
  • A separate 512GB Samsung 850Pro SSD for caching.

It might seem stupid to use two separate SSDs. However, when we use SSDs for caching and write several terabytes per day, even the mid-range/semi-professional ones can wear out. By putting the OS on a separate small module we will almost never write to that disk, which means it will not wear out. If/when the cache-only disk wears out, you can just throw it away and push in a new one in a matter of minutes.
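If you want to keep an eye on how quickly a cache SSD is wearing, the smartctl tool (from smartmontools) reports the vendor wear attributes. Below is a minimal Python sketch under a few assumptions of ours: smartmontools is installed, the cache SSD sits at /dev/sdb (adjust to your machine), and the attribute names listed are the ones common Samsung/Intel/Crucial consumer SSDs use – other vendors may spell them differently:

    #!/usr/bin/env python
    # Print SSD wear-related SMART attributes via smartctl (smartmontools).
    # Assumes the cache SSD is /dev/sdb -- adjust to your own machine.
    import subprocess

    CACHE_DISK = "/dev/sdb"  # assumption: the cache SSD, not the boot DOM

    # Attribute names differ between vendors; these cover common consumer SSDs.
    WEAR_ATTRIBUTES = ("Wear_Leveling_Count", "Media_Wearout_Indicator",
                       "Total_LBAs_Written", "Percent_Lifetime_Remain")

    out = subprocess.check_output(["smartctl", "-A", CACHE_DISK],
                                  universal_newlines=True)
    for line in out.splitlines():
        if any(attr in line for attr in WEAR_ATTRIBUTES):
            print(line.strip())

Running something like this from cron once a day gives you plenty of warning before the cache disk actually dies.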

If you are not comfortable building things like this yourself, there are some vendors that are happy to sell you complete systems with consumer GPUs, with the cards already burnt in for you. Since RELION-GPU is still very new, you can look for systems intended for molecular dynamics, e.g. from Exxact, which has been catering to that market for years.

Update 2016-10-21: When we first wrote this page, the only cards on the market were the “Founder’s Edition” models. These are still fine, but by now they have mostly been replaced with vendor models and can be hard to get. We’ve been quite happy with the even-cheaper ASUS TURBO-GTX1080 ($650) or ASUS TURBO-GTX1070 ($395). However, here we would recommend being a bit careful! A lot of vendors (including ASUS – avoid their “dual” models) offer overclocked cards that provide even better performance. That can work great when you only put a single card in the machine, but they achieve the improved cooling by using bulkier cooling solutions and by blowing the air transversally instead of longitudinally over the card. For a cluster node where air can only flow longitudinally, this would be a complete disaster! You can identify these cards by name suffixes such as “OC”, by the fact that they have several fans, or by the direction of the fins on their coolers. For a cluster node it is not just a matter of risk – there is no way overclocked cards with transversal air flow will survive long term.

Improving cooling

Fast GPUs will generate a lot of heat, and if they get too hot they will throttle by reducing their clock. You will not notice this unless you are monitoring clocks (which you can do e.g. with the nvidia-smi tool in Linux), but your machine will be slower. The GeForce 1080 cards are more sensitive to this since they draw more power than the 1070 cards.
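For reference, here is a minimal monitoring sketch in Python (wrapping the nvidia-smi tool mentioned above) that polls clocks, temperature and power once per second, so you can see whether the cards hold their clocks under load. The query fields are standard nvidia-smi ones, but exact field availability can vary between driver versions, so treat this as a sketch rather than a polished tool:

    #!/usr/bin/env python
    # Poll nvidia-smi and print per-GPU clocks/temperature/power once per second,
    # so you can spot thermal throttling (clocks dropping as temperature rises).
    import subprocess, time

    QUERY = "index,name,clocks.sm,temperature.gpu,power.draw,utilization.gpu"

    while True:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader"],
            universal_newlines=True)
        print(out.strip())
        print("-" * 60)
        time.sleep(1)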

First, you want high airflow straight through the case. Decide how the air will flow, put a bunch of case fans on the front to push it in, and then have the GPUs and the existing case fans on the back push the air out (check which direction your extra fans blow when mounting them…). As illustrated for the Corsair case we use for our quad-GPU machines, we have the air flow along the blue arrows, and then we try to block the holes at the top (marked with red) to avoid air flowing randomly.

[Image: case cooling modifications – air flow along the blue arrows, blocked top vents marked in red]

The first thing you need for this is a couple of extra fans. Fans with a larger diameter are usually quieter, so pick either two 140mm fans or three 120mm ones. They only cost a couple of dollars, so don’t be a cheapskate; buy quality.

Second, consumer GeForce GTX cards pull in air from the side of the card. This is not a problem when you only have a single card, but with multiple GPUs in the machine each extra card will reduce the air inflow to the previous card a bit. This is a bit worse with the latest-generation cards, since there is usually a fancy black “backplate” mounted on the rear of the card:

[Image: the backplate on the rear of a GeForce GTX card]

If you look carefully you will see that there is a thin line separating the two halves of this backplate, and the entire reason for that is to make it possible to remove the left half in the image above to improve air flow in multi-GPU configurations. Don’t touch the four large screws – they hold the cooler for the GPU! Since we want all the air we can get, we remove both halves of the backplate, leaving the card bare:

[Image: the same card with the backplate completely removed]

This is all that we will do on the physical side. Once we get started with Linux there are a couple more tricks we’ll use. By default the drivers are optimized to make our offices sane places to work, so the fans will usually never max out, but there are ways we can force this to maximize cooling (although you probably don’t want to run it like that constantly if you are working in the same room).
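As a teaser, the fan trick looks roughly like the sketch below: on a workstation with an X session you can switch the driver to manual fan control through nvidia-settings, provided the “Coolbits” option has been enabled in xorg.conf. The attribute names are the ones recent drivers use (older drivers spell them differently), and the GPU/fan indices are assumed to match one-to-one as they do for single-fan reference blowers, so take this as a sketch of the idea rather than a guaranteed recipe:

    #!/usr/bin/env python
    # Sketch: force the GPU fans to a fixed speed through nvidia-settings.
    # Assumes an X session and that Coolbits is enabled in xorg.conf;
    # attribute names may differ between driver versions.
    import subprocess

    NUM_GPUS = 4       # assumption: quad-GPU workstation
    FAN_SPEED = 100    # percent; switch back to automatic control afterwards

    for i in range(NUM_GPUS):
        subprocess.check_call([
            "nvidia-settings",
            "-a", "[gpu:%d]/GPUFanControlState=1" % i,
            "-a", "[fan:%d]/GPUTargetFanSpeed=%d" % (i, FAN_SPEED),
        ])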

A high-end quad-GPU workstation

This machine is very similar to the workstation we used to completely reprocess the 12TB EMPIAR-10061 beta-galactosidase dataset in just 115h (all the way from micrographs to the final 2.2Å structure).

  • You need a case that can fit quad GPUs. For our machines we used the same case as NVIDIA used for their “DIGITS DevBox”, the Corsair Carbide AIR 540 ($129), but you might also be lucky with the Fractal Design XL R2.
  • To push enough air through this case, get a couple of extra 140mm case fans. A silent fan might be $13. Note that it is important to mount them so they push air from the front, over the GPUs, and out the back. To get even better airflow, cover the holes on the top of the case.

[Image: extra case fans mounted in the front of the case]

  • A motherboard with support for four double-wide PCIe x16 GPU slots. We originally got an ASUS X99-E WS board (now $520), but today we would likely try a cheaper Gigabyte X99P-SLI for $249.
    Update: We have had one colleague try the Gigabyte board, and there might be some problems with quad GPUs. Until we know for sure, you might not want to risk it.
  • An 8-core CPU. We used the Core i7-5960X (3GHz, now $1000). Today Intel has released the next-generation CPU, which is ~10% faster, but also more expensive. A new 8-core Core i7-6900K will be $1200. Don’t bother splashing out for the super-expensive 10-core version; it’s not worth it.
  • 64 GB DDR4 memory. Use 4 sticks of 16GB each, so you have the possibility of upgrading to 128 GB with another four later. Any brand will work; right now it’s $229 at Newegg.
  • The power supply (PSU) is probably the most complex item for quad-GPU systems. Right now the fastest consumer cards draw between 150W (GeForce 1070) and 180W (GeForce 1080), but the fastest cards NVIDIA releases (like the 980Ti) can draw up to 250W each.
    • If you want to be able to upgrade to 250W cards later, get a 1500W PSU. We got a Corsair AX1500i ($399 now), but there are also cheaper options that will probably be fine (EnerMax EMR1500EWT, $250). Just make sure it has plenty of PCIe power connectors for the GPUs.
    • For the GeForce 1080 cards you can likely get by with a 1000W PSU, since RELION does not run the GPUs at 100%, but to have a bit of margin for disks and later improvements we would recommend a 1200W PSU, e.g. the Corsair HX1200i for $240 at Newegg.
    • If you will only use the 150W GeForce 1070 cards, a 1000W PSU is definitely enough. The Corsair RM1000X is $159.
  • At the highest end, some NVIDIA GPUs can draw up to 250W per card (e.g. the 980Ti). For those you need a 1500W power supply. However, right now the fastest card you can get is the GeForce GTX 1080 (180W), and the best value for money is probably the GTX 1070 (150W). GPUs evolve fast, so I suspect you will want to upgrade. Based on history, I would expect NVIDIA to release an even faster 250W GeForce GTX 1080Ti in early 2017. If you want to upgrade to that, get the larger PSU already now.
  • A fast SSD to cache data, say Samsung 850 EVO 512GB for $219. 
    UPDATE 2016-09-13: We’ve noticed that, despite its speed, the SSD cache can amazingly enough become the bottleneck in this machine. However, this is easy to work around: instead of a single large SSD, get two smaller ones (e.g. Samsung 850 EVO 256GB) and configure them as RAID0 to double the effective bandwidth (see the sketch after this list).
    You might also want to consider the slightly more expensive longer-life professional drives (e.g. Samsung 850 Pro) if you use the machine a lot and put valuable data like your home directory on the SSD.
    All SSDs will eventually wear out, and the consumer ones wear out faster.
    Some modern motherboards have support for an M.2 slot, but this will use 2-4 PCIe lanes and reduce bandwidth to one of the GPUs – so for quad-GPU machines you want to stay away from it.
  • A larger mechanical hard drive. It’s worth getting a 7200rpm drive for better speed. Newegg has a Toshiba X300 for $140, but any brand will do. If you have lots of data you can add multiple 6TB drives as well as multiple SSDs in these cases.
  • A quiet CPU fan. Not absolutely necessary, but it will make your office a nicer place at very low cost. The Cooler Master Hyper 212 EVO ($29) is what we usually get, simply because we know it fits in our cases.
  • Four GPUs. We initially had GeForce GTX 980Ti cards in this machine, so with a 1500W power supply it can definitely handle four cards each drawing 250W. Today we would get GeForce GTX 1080 cards ($699) to optimize for performance, or GeForce GTX 1070 cards ($449) to optimize for value.
    Update 2016-10-21: The “Founder’s Edition” models are still fine, but by now they have mostly been replaced with vendor models and can be hard to get. We’ve been quite happy with the even-cheaper ASUS TURBO-GTX1080 ($650) or ASUS TURBO-GTX1070 ($395). However, here we would recommend being a bit careful! A lot of vendors (including ASUS – avoid their “dual” models) offer overclocked cards that provide even better performance. That can work great when you only put a single card in the machine, but they achieve the improved cooling by using bulkier cooling solutions and by blowing the air transversally instead of longitudinally over the card. This can be really bad with a quad-GPU config, since you then just rotate air inside the machine instead of blowing it through the machine. You can identify these cards by name suffixes such as “OC”, by the fact that they have several fans, or by the direction of the fins on their coolers. Do yourself a favor and stay away from all such overclocked cards.
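For the RAID0 scratch setup mentioned in the SSD item above, the idea is simply to stripe the two small SSDs into one block device with mdadm and put a filesystem on it. A hedged sketch follows (run as root); the device names /dev/sdb and /dev/sdc and the mount point /scratch are purely illustrative assumptions, so check them against lsblk on your own machine before running anything, since mdadm --create wipes the member disks:

    #!/usr/bin/env python
    # Sketch: stripe two SSDs into a RAID0 scratch volume with mdadm (run as root).
    # /dev/sdb, /dev/sdc and /scratch are illustrative assumptions -- verify with
    # lsblk first, since mdadm --create destroys existing data on those disks!
    import subprocess

    SSD_A, SSD_B = "/dev/sdb", "/dev/sdc"   # the two cache SSDs (assumed names)
    MD_DEVICE = "/dev/md0"
    MOUNT_POINT = "/scratch"

    def run(cmd):
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    run(["mdadm", "--create", MD_DEVICE, "--level=0",
         "--raid-devices=2", SSD_A, SSD_B])
    run(["mkfs.ext4", MD_DEVICE])
    run(["mkdir", "-p", MOUNT_POINT])
    run(["mount", MD_DEVICE, MOUNT_POINT])
    # Add fstab and mdadm.conf entries separately if you want it to persist.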

[Image: the assembled quad-GPU workstation]

IMPORTANT: For a quad-GPU machine it is critical that you go with reference cards (“Founder’s Edition”). We are well aware that you have just seen a lower-priced “gamer’s edition” that scores higher on all the review sites and is said to have better cooling. However, these cards achieve that better cooling by having the air flow the short way across the card instead. This is great when you only have a single card in your case, but in a quad-GPU configuration the hot air would just be blown around inside the case instead of out the back, and the cards will throttle when they get too hot.
Update 2016-09-13: We have been happy with the few ASUS Turbo GTX 1080 ($650) and 1070 ($400) cards we’ve gotten, although we haven’t tested dozens of them. These are the black models with longitudinal cooling. Stay away from any model that says “OC” (overclocked) or the “twin” models with two fans and transversal cooling.

Don’t be stupid and cause yourself a bunch of trouble by trying to save the last $100 – get either the reference model cards, or at least cards that are not overclocked and have the air flowing in the long direction of the card. If the card has some fancy/special cooling solution you might not even be able to mount four of them next to each other.

Mounting all this stuff in the case is straightforward if you have ever built any PC, but there are a couple of tricks that will improve your cooling – which we will cover in the next post.

[Image: the back of the case with all GPUs mounted]

This will get you an insanely fast personal RELION-2 workstation – think of having the equivalent of half a rack of cluster nodes under your desk. With GeForce GTX 1080 cards it will cost about $5000. Since RELION-2 still does not max out the performance of the GPUs (i.e., use them to 100%), the performance difference if you go with 1070 cards instead will only be roughly 10% – and then the cost will be just under $4000 per machine.

However, there is one caveat: since RELION can now be more than an order of magnitude faster, it is also extremely data-hungry. You will need an SSD just to cache particles, and even if your data sits on a fast external drive, that drive will be a bottleneck. Both the low-end and high-end workstations described here have the problem that they only come with 1Gbit network interfaces, and for the high-end workstation there is not even an option of adding a 10Gbit network card – since we just filled up all the slots with GPUs.

There are a couple of ways around this:

  1. Some vendors sell workstations that have room both for four GPUs and some extra cards, for instance the Dell T630 or Supermicro SYS-7048GR-TR. They are typically equipped with dual sockets, which doesn’t hurt, but it will cost you a bit extra.
  2. There might be some single-socket quad-GPU motherboards on the market that have built-in 10Gbit ethernet.
  3. Go with rackmount nodes that have built-in 10Gbit ethernet or infiniband. This will be covered in the next post.


Hardware for a cost-efficient workstation

Update 2016-09-13: This entry originally said “low-end workstation”, but I don’t think that’s a fair description. The machines I describe here are simply wonderful. They provide insanely great value for money, they are extremely quiet, and because they only need room for two GPUs you can get almost any motherboard and case – it’s not at all as important to get exactly the right hardware as with the quad-GPU machines.

What we’ve done in Stockholm is that pretty much everybody in the team working with image reconstruction has a personal desktop like this. Nobody complains about the noise level (in contrast to the quad-GPU boxes…), and they are literally so cheap that we just get more desktops without thinking much when we need them.

Here, you are not very dependent on specific hardware. In principle you can get any modern machine (including pre-built ones from vendors like Dell, HP, Lenovo, etc.) that is equipped with

  • Two unoccupied double-wide PCIe 16x slots (3.0 is better than 2.0)
  • A power supply of at least ~500W

However, since we like value for money we’ll get some parts and build a machine that is probably both quieter and more cost-efficient than what you get pre-built. In this case there is a wide variety of hardware you can use, so the links under each item below are mostly examples – and for the same reason I don’t bother showing images of the specific item. You can definitely get even cheaper items, but first we like quality, and second we try to recommend things that are at least similar to what we have used ourselves. These recommendations are current as of early June 2016:

  • A good case that is reasonably quiet
    We have used e.g. Fractal Design Define R4, currently $79 at Newegg. This case also has external USB3 connectors on the front, which is nice if you need to use external enclosures with micrograph data.
  • A motherboard with support for dual PCIe 16x slots and Socket 1151 CPUs, e.g. with the Z170 chipset. You can either go for low price ($89 right now) or pick whatever motherboard is best-selling at the time; last time we used the ASUS Z170-A that Newegg has for $155. This motherboard even has USB-C connectors, which will be great for fast data transfer from/to external disks.
  • A CPU. We don’t try to save here, but usually buy the fastest 4-core CPU. Currently this is Core i7-6700K for $325.  You can get a cheaper one too, but don’t go too cheap since RELION uses a bit of CPU-power.
  • 64GB of DDR4 memory, configured as 4 sticks of 16GB each. Brand doesn’t matter, so just pick something cheap from any brand. We found 64GB for $229 at Newegg.
  • A power supply. The CPU might use up to 140W, and add 50W for hard drives and other minor stuff. Each NVIDIA GeForce GTX 1070 card will then use at most 150W, which for two cards brings us to roughly 500W. If you want to use GeForce GTX 1080 cards instead you might want to bump this to 600W.
    Update 2016-09-13: Over the summer, NVIDIA released the even faster TITAN X Pascal (beware: there is also an older non-Pascal TITAN X) with 12GB of memory that draws up to 250W. Right now you need to buy them directly from NVIDIA, and while they provide the best absolute performance, the value for money is better with 1080 or 1070 cards. However, if you want to be able to upgrade to the TITAN X Pascal, you should get at least a 700W power supply. Newegg has a Corsair CX500 (500W) for $49.
  • A smaller SSD drive for booting and caching data for RELION. A Samsung 850EVO 256GB is currently $88.
  • A larger mechanical hard drive. It’s worth getting a 7200rpm drive for better speed. Newegg has a Toshiba X300 for $140, but any brand will do.
  • A quiet CPU fan. Not absolutely necessary, but it will make your office a nicer place at very low cost. The Cooler Master Hyper 212 EVO ($29) is what we usually get, simply because we know it fits in our cases.
  • Oh, and you’ll need some GPUs too! Just get one or two of the “Founder’s Edition” NVIDIA GeForce GTX 1070 cards from any vendor (this edition is identical between vendors) for $449.
    Update 2016-10-21: The “Founder’s Edition” models are still fine, but by now they have mostly been replaced with vendor models and can be hard to get. We’ve been quite happy with the even-cheaper ASUS TURBO-GTX1080 ($650) or ASUS TURBO-GTX1070 ($395). However, here we would recommend being a bit careful! A lot of vendors (including ASUS – avoid their “dual” models) offer overclocked cards that provide even better performance. That can work great when you only put a single card in the machine, but they achieve the improved cooling by using bulkier cooling solutions and by blowing the air transversally instead of longitudinally over the card. This can be bad with a multi-GPU config, since you then just rotate air inside the machine instead of blowing it through the machine (although it might work for a dual-GPU setup if your motherboard allows you to leave some space between the two cards). You can identify these cards by name suffixes such as “OC”, by the fact that they have several fans, or by the direction of the fins on their coolers. Do yourself a favor and stay away from all such overclocked cards – it’s not worth risking unstable machines to save $50 or maybe get slightly better performance.

With two GeForce 1070 cards, this machine can be as cheap as $1900, and you can shave off another $100-200 if you go with cheaper components. You will also need a keyboard, mouse, and monitor, but we assume you have that around in the lab.


CUDA & Professional vs. Consumer GPUs

The GPU acceleration in RELION-2 is entirely based on CUDA, which means you need a reasonably modern NVIDIA GPU (compute capability 3.5, for the experts). You will also need a decent amount of memory on the card – say 4GB or more. RELION-GPU uses the most memory during the final autorefinement, and since this always involves two models that can be put on different cards, it is often a better idea to get two mid-range cards instead of a single high-end one, if you need to choose.
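If you are unsure what an existing card offers, one quick way to check compute capability and memory is the pycuda package (our assumption here; NVIDIA’s deviceQuery sample gives the same information). A minimal sketch:

    #!/usr/bin/env python
    # List each NVIDIA GPU with its compute capability and memory, to check it
    # against the rough requirements mentioned above (>= 3.5, >= 4 GB).
    # Assumes the pycuda package is installed on top of the NVIDIA driver/toolkit.
    import pycuda.driver as drv

    drv.init()
    for i in range(drv.Device.count()):
        dev = drv.Device(i)
        major, minor = dev.compute_capability()
        mem_gb = dev.total_memory() / float(1024 ** 3)
        ok = (major, minor) >= (3, 5) and mem_gb >= 4
        print("GPU %d: %s  CC %d.%d  %.1f GB  %s"
              % (i, dev.name(), major, minor, mem_gb,
                 "OK" if ok else "check specs"))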

You also have a choice between professional “Tesla” cards and cheaper consumer “GeForce GTX” cards, which target slightly different audiences. The professional cards have outstanding quality, they have been burnt in by the vendor, they have more memory, and they also provide very good performance for double-precision floating-point calculations (which is important for many scientific domains). For many rack-mounted servers these cards are the only option due to the way the cooling works and how the power cables are connected. Most compute centers love these cards – one reason is that they will always get a replacement instantly even if a card fails after several years, and the drivers are often more conservative. However… you will have to pay a bit for this quality. The advantage is that you will get a card that is certified for your specific server and great support both when you buy and later. For this reason there simply isn’t a whole lot of need to document this – just ask your vendor!

However, the other alternative is consumer hardware. You will not get as much official support here, which is why it is worth documenting. These cards are not lower quality, but they only provide good single-precision performance, they do not have as much memory, and since both hardware and drivers are targeted at a different market (games) you need to burn in new cards yourself. There are only a few rack-mounted servers that accept consumer cards (we’ll describe them in a later post), but the upside is that the cost is only a small fraction. The good news is that we have optimized RELION so you can use consumer cards if you want to – in particular the new 1070 & 1080 cards are outstanding value for money, but stay away from overclocked cards. Those work fine for games that only run at full speed for fractions of a second, but when using the GPU at 100% 24/7/365 and putting several cards next to each other we have had problems. For instance, “game edition” cards that pull in the air sideways will never work in a cluster node! Don’t be too cheap – spend the extra $100 for the “Founder’s Edition” reference models and you will be fine. We’ll start by showing you the low-end options for desktops and gradually move up.
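If you want to do a simple burn-in of a new card yourself, one option (again assuming pycuda plus the CUDA toolkit; dedicated burn-in tools exist as well) is to keep a trivial floating-point kernel busy for a few hours while you watch clocks and temperatures with nvidia-smi, as in the monitoring sketch further up the page. A minimal per-card sketch:

    #!/usr/bin/env python
    # Very simple burn-in sketch: keep one GPU busy with a floating-point kernel
    # for a few hours while monitoring clocks/temperature with nvidia-smi.
    # Run one copy per card by setting CUDA_VISIBLE_DEVICES=0,1,... per process.
    # Assumes pycuda and the CUDA toolkit (nvcc) are installed.
    import time
    import numpy as np
    import pycuda.autoinit          # initializes the first visible GPU
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void fma_loop(float *data, int iters)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        float x = data[idx];
        for (int i = 0; i < iters; ++i)
            x = x * 1.000001f + 1.0e-6f;
        data[idx] = x;
    }
    """)
    fma_loop = mod.get_function("fma_loop")

    n = 256 * 4096                              # one thread per element
    data = np.ones(n, dtype=np.float32)
    gpu_data = drv.mem_alloc(data.nbytes)
    drv.memcpy_htod(gpu_data, data)

    t_end = time.time() + 4 * 3600              # burn for roughly 4 hours
    while time.time() < t_end:
        fma_loop(gpu_data, np.int32(200000),
                 block=(256, 1, 1), grid=(4096, 1))
        drv.Context.synchronize()               # wait so the loop tracks real time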


RELION-GPU

About two years ago, the Department of Biochemistry & Biophysics at Stockholm University decided to join the amazing work so many of you are doing with cryo-EM, and we managed to get funding for a new state-of-the-art microscope facility. Half of the time on this resource will also be made available nationally in Sweden, through the Science for Life Laboratory national infrastructure for molecular life sciences.

Since we have a history of working with simulations and modeling (in particular the GROMACS package), our team naturally started working on the computational needs of this infrastructure – and quickly realized our hardware budget was at least an order of magnitude too small. However, since we have had to spend large efforts enabling our simulation codes to use modern accelerators – in particular graphics processors (GPUs) – we decided to try to contribute to this new community by attempting to crack the computational problem instead of just throwing more money at it.

Since early 2015, Björn Forsberg & Dari Kimanius have worked tremendously hard to create a new GPU-accelerated version of RELION – something we could never have done without close collaboration with Sjors Scheres at the MRC Laboratory of Molecular Biology. This has turned into a great collaboration (and I think it’s a wonderful example of the power of open source software): what was initially just a testing hack has turned into RELION-2.0, which will fully support GPU processing out of the box. There are still some rough edges and parts that have not been accelerated, but the parts that account for by far the largest share of the computational load have been: particle autopicking, 2D and 3D classification, and autorefinement. There is a preprint available at bioRxiv while the work is undergoing peer review.

At this point we have just pushed out the first beta release to a small set of users. We will do our utmost to expand this as quickly as we can (think weeks, at most), but since there are quite a few differences in the code we need to have a smaller set of early test users that help us find things we might have missed during a few months of testing at MRC-LMB and Stockholm University – we simply don’t want this large mass of new code to create bad results for anybody.

So, that brings us to the main part: Lots of you have been asking about what hardware to get for this new code. Since we finally finished the code and an accompanying paper, I’ll add a couple of posts later tonight with suggested hardware ranging from extremely cost-efficient consumer cards you can use either in desktops or servers up to high-end professional cards that your supercomputing center might prefer. The first part is about deciding whether you need consumer or professional cards, the second describes hardware for a low-end workstation (but still dual-card!) optimized for cost, and the third is focused on quad-GPU workstations. There’s now also a post about quad-consumer-GPU rackmount nodes.

In addition to Björn & Dari doing the bulk of GPU programming, we’ve also had great collaborations with Ross Walker at SDSC in the context of MD codes, and in particular to get awesome performance out of consumer NVIDIA hardware – thanks Ross!

Welcome to the SciLifeLab national Swedish cryo-EM facility

Hi; this site is very much a work in progress while we take the new national cryo-EM facility in Stockholm live during summer 2016.

Until that happens, the main reason for coming here is likely that you are interested in what hardware you should get for the new GPU version of the RELION program. Our main reason for developing this acceleration was simply to solve the computational needs of the new infrastructure, and although it is still in beta we suspect a lot of other groups might also be interested in doing cryo-EM processing even for very large projects on single workstations in a matter of days.