Future Storage Systems: Part 6: GPU-accelerated storage

by Dave Graham on November 3, 2008



In developing the Future Storage System series, I have been trying to take part of my excitement for storage technologies and overlay it with systems/platform technology.  Typically, the storage industry lags on the platform development side of the house (mostly out of necessity).  So, part of looking at the Future Storage System was to ensure that the basic design could and should use some of the more current technologies to enable “forward” thinking.  That’s why you see such a heavy emphasis on Torrenza, HyperTransport, and integrated memory controllers.  With the exception of Torrenza, each of these aspects of system design has a rich history.  HyperTransport, arguably, has been an outlier on the bus-technology side, but its capabilities and industry support are unparalleled.  Integrated memory controllers, while “nothing new” (DEC Alpha, anyone?), really came to the fore when AMD introduced them as part of the Athlon 64 series of processors.  Today, I’d like to toss another wrinkle into the “platform meets storage” discussion by including another developing technology: the GPU (Graphics Processing Unit).

Assuming that you made it to the second page, you’re probably scratching your head right now.  “What do graphics have to do with storage?”  In a way, your confusion is merited.  Historically, GPUs have been dedicated to processing the mathematical calculations that ultimately result in the display of images on an attached screen.  What hadn’t been available until recently (~2006-2007) was a way to utilize the GPU and its absolutely astronomical calculation power (and memory bandwidth) for anything other than basic vector calculations.  Starting around 2006-2007, nVidia and ATI made programming “hooks” available for pointing the stream engines at different payloads.  This type of accessibility was most noticeable in the World Community Grid (WCG) project, which used an optimized ATI driver set to crunch protein-folding routines for research purposes.  Mercury Computer Systems, another name in the industry, was utilizing the Sony Cell processor for much the same thing.  nVidia was a little late to the party but brought out its Tesla line of dedicated compute hardware and developed the CUDA (Compute Unified Device Architecture) framework for optimizing workloads for GPU processing.
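To make that concrete, here’s a minimal sketch of what a CUDA program looks like (my own illustrative example, not one of nVidia’s samples): the host copies buffers out to the card, launches a kernel across thousands of lightweight threads, and copies the result back.  The XOR kernel below happens to be the same primitive a RAID engine uses to generate parity.

```cuda
// Minimal CUDA sketch: XOR two buffers on the GPU, one word per thread.
// This is the same primitive a RAID engine uses for parity generation.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void xorBlocks(const unsigned int *a, const unsigned int *b,
                          unsigned int *parity, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        parity[i] = a[i] ^ b[i];                    // one word per thread
}

int main(void)
{
    const int n = 1 << 20;                          // 1M 32-bit words (4 MB)
    size_t bytes = n * sizeof(unsigned int);

    unsigned int *ha = (unsigned int *)malloc(bytes);
    unsigned int *hb = (unsigned int *)malloc(bytes);
    unsigned int *hp = (unsigned int *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = ~i; }

    // Allocate device buffers and push the source data across PCIe.
    unsigned int *da, *db, *dp;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dp, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n words.
    xorBlocks<<<(n + 255) / 256, 256>>>(da, db, dp, n);
    cudaMemcpy(hp, dp, bytes, cudaMemcpyDeviceToHost);

    printf("parity[0] = 0x%08x\n", hp[0]);          // expect 0xffffffff
    cudaFree(da); cudaFree(db); cudaFree(dp);
    free(ha); free(hb); free(hp);
    return 0;
}
```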

The question is, what applicability does this have to the FSS?  If you look back at Part 3a, you’ll note my explanation of Torrenza and dedicated workload processors that weren’t CPUs. To quote:

So, you want dedicated processing for x-type of application?  Install a co-processor into an available 1207 socket in the system.  Systems using Cell processors, for example, have been demonstrated behind doors (not commercially available to the best of my knowledge).  The ultimate goal here would be to allow specialized co-processors for applications (RSA disk encryption) that would be offloaded from the general storage I/O processors.  The application set is really endless.  Want to do data encryption inband or at rest?  Install an RSA encryption co-processor.  Want to do compression or de-dupe?  Install a compliant DSP or co-processor that performs that task.  When we look at the operating system for this Future Storage System, you’ll see even more applicability.

As you can see, there are various types of workloads that benefit from additional processing power. Deduplication, compression, encryption, even LUN virtualization require processing power beyond the standard RAID XOR or parity operations. Think of it this way: if the FSS operating system has the intelligence to shift specialized workloads over to a separate “processing stream,” the overall storage system can scale to even more I/O before additional base computation power is required.  That minimizes overall storage solution cost (not necessarily the FSS cost, mind you) and reduces footprint in the environment.  There would be no more need for external deduplication appliances (sorry, Avamar and Data Domain), no need for external encryption devices (I’d say sorry to NeoScale, but the last time I checked, they weren’t solvent), and certainly no need for compression cards.  Even further, some of the basic pack/unpack duties for replication could be handled out of stream as well.  Imagine the possibilities.
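As an illustration of the kind of work that maps well onto a GPU’s stream engines, here’s a hypothetical first-pass deduplication fingerprinting kernel: one GPU thread hashes one 4 KB block, and the host then compares digests to find duplicate candidates.  A real dedupe engine would use a cryptographic digest like SHA-1 or MD5; the FNV-1a hash below is used only to keep the sketch short, and all of the names are mine, not from any shipping product.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical first-pass fingerprint kernel: one thread hashes one block.
// FNV-1a stands in for the cryptographic digest a real dedupe engine needs.
__global__ void fingerprintBlocks(const unsigned char *data,
                                  unsigned long long *digests,
                                  int numBlocks, int blockBytes)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= numBlocks) return;

    const unsigned char *p = data + (size_t)b * blockBytes;
    unsigned long long h = 14695981039346656037ULL;  // FNV-1a offset basis
    for (int i = 0; i < blockBytes; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;                       // FNV-1a prime
    }
    digests[b] = h;   // host compares digests to find dedupe candidates
}

int main(void)
{
    const int numBlocks = 1024, blockBytes = 4096;   // 4 MB of 4 KB blocks
    size_t dataBytes = (size_t)numBlocks * blockBytes;

    unsigned char *hData = (unsigned char *)malloc(dataBytes);
    for (size_t i = 0; i < dataBytes; i++)
        hData[i] = (unsigned char)(i % 251);         // stand-in payload

    unsigned char *dData;
    unsigned long long *dDigests;
    cudaMalloc((void **)&dData, dataBytes);
    cudaMalloc((void **)&dDigests, numBlocks * sizeof(unsigned long long));
    cudaMemcpy(dData, hData, dataBytes, cudaMemcpyHostToDevice);

    fingerprintBlocks<<<(numBlocks + 127) / 128, 128>>>(dData, dDigests,
                                                        numBlocks, blockBytes);

    unsigned long long hDigests[4];
    cudaMemcpy(hDigests, dDigests, sizeof(hDigests), cudaMemcpyDeviceToHost);
    printf("digest of block 0: %llu\n", hDigests[0]);

    cudaFree(dData); cudaFree(dDigests);
    free(hData);
    return 0;
}
```

The point isn’t the hash itself; it’s that a thousand blocks get fingerprinted in a single kernel launch while the base CPUs keep servicing ordinary I/O.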

Considerations:

As with any other sort of technology being introduced, there are considerations to be made.  First and foremost, how does the GPU interface with the rest of the FSS (physically/electrically)?  Secondly, what accommodations need to be made within the FSS OS to ensure proper pathing?

Physical/Electrical:

There are two major ways the GPU could be integrated within the FSS: PCIe (PCI Express) and HT (HyperTransport).  The advantages of PCIe really come down to ease of integration, both from a design perspective (GPUs already speak PCIe natively) and for overall FSS system integration.  I envision the GPU being mounted either on a pluggable card (similar to the I/O expansion options discussed previously) or, in a Torrenza model, being placed in a CPU socket on the expansion board. The advantage of using HyperTransport would really only be apparent if you were using the HTX standard for the pluggable slots.  The Torrenza approach to GPU integration injects the GPU directly into the overall platform I/O stream but would require socket-level hardware compatibility.

Along with the basic I/O issues, there is also the issue of power and heat.  GPUs have historically been extremely hot and power hungry. Most high-end cards feature large heatsinks with blower fans that direct heat away from the memory banks and the GPU core.  Introducing such a component within the FSS requires careful study of the hardware layout as well as the overall power budget needed for both core operation and GPU integration.  However power-hungry GPUs may be, there have been developments in GPU power management and optimization: the current crop of GPUs can scale their processing speeds to match the incoming workload, lessening power draw and thermal dissipation.

OS Accommodations:

The inclusion of a GPU within a storage system obviously means that some level of OS optimization must be done in order to utilize the GPU’s stream engines for tasks like deduplication.  First, the core OS must classify incoming data and be able to route it either through the base CPU engine or through the GPU.  Second, the OS must still maintain a metadata map of the data and its placement within the system. These are not light tasks, and they must be integrated into the design from the start.
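To sketch what that first routing decision might look like, here’s some hypothetical host-side C++ (the sort of dispatch code that would sit in front of CUDA kernels like the ones above).  Every name and threshold here is illustrative, not drawn from any real storage OS.

```cuda
// Hypothetical FSS dispatch sketch: classify each request and decide
// whether it stays on the CPU path or is routed to the GPU stream.
#include <cstdio>
#include <cstddef>

enum WorkloadClass { WL_PLAIN_IO, WL_DEDUPE, WL_ENCRYPT, WL_COMPRESS };

struct IoRequest {
    WorkloadClass wclass;   // what transform, if any, the data needs
    size_t        bytes;    // payload size
};

// Illustrative policy: bulk transformational work goes to the GPU engine;
// small or untransformed I/O stays on the general-purpose CPUs, since the
// PCIe round trip would swamp any compute win on tiny payloads.
static bool routeToGpu(const IoRequest &req)
{
    if (req.wclass == WL_PLAIN_IO) return false;  // no transform needed
    return req.bytes >= 64 * 1024;                // amortize the copy cost
}

int main()
{
    const IoRequest reqs[] = {
        { WL_PLAIN_IO, 4096 },      // ordinary read/write: CPU path
        { WL_DEDUPE,   1 << 20 },   // 1 MB dedupe chunk: GPU stream
        { WL_ENCRYPT,  512 },       // tiny encrypt: not worth the trip
    };
    for (const IoRequest &r : reqs)
        printf("class=%d bytes=%zu -> %s\n", (int)r.wclass, r.bytes,
               routeToGpu(r) ? "GPU stream" : "CPU path");
    return 0;
}
```

The size threshold is the interesting design choice: because every GPU trip pays a PCIe copy tax, the OS has to batch or bypass small requests, which ties directly back to the metadata-mapping requirement above.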

Closing:

Hopefully, I’ve made a case for the role a GPU can play within a storage system.  It’s not about graphics anymore; it’s about processing power.  The GPU is a powerful tool with real-world applicability to our storage workloads.

cheers,

Dave

  • http://twobombs.blogspot.com Aryan

    I expect GPGPU functions to be integrated into both software and hardware.
    Software at first via specialised apps, later into the OS; hardware more and more into southbridges/hybrid solutions. Ye ole M$ is barking at the tree in this regard, but progress in this field has been so storming that GPGPU integration is only a small problem compared with the huge security risks caused by this massive increase in available processing power; a workload that took a CPU years is reduced to weeks, e.g. bruteforcing strong MD5 hashes. [ now listening GSYBE - Static ] Imagine a marriage between rainbow tables and bruteforce at this level. It takes me seconds to bruteforce any MD5 hash of 8 or fewer characters/numbers…. If M$ keeps on slacking they’ll also miss this train as it streams into the cloud.

  • dave_graham

    Aryan,

    thanks for the comments. When you look at the specialized workloads for which GPGPUs can provide additional processing power, it’s amazing they haven’t been integrated sooner. I downloaded an interesting whitepaper entitled “QP: A Heterogeneous Multi-Accelerator Cluster” from the University of Illinois at Urbana-Champaign that really highlighted the strengths of using both foundational CPUs (for the OS and programmatic interface) and specialized GPGPUs for determined workloads. I think, based on what you’ve written above, that the encryption (data security) vertical is well suited to these devices. From a storage perspective, encrypt/decrypt, deduplication (calculation- and stream-intensive), and even more complex parity workloads will benefit greatly from GPGPUs.

    Thanks again!

    cheers,

    Dave Graham
