Future Storage Systems: Part 6: GPU accelerated storage

by dave on November 3, 2008



In developing the Future Storage System series, I have been trying to take part of my excitement for storage technologies and overlay it with systems/platform technology. Typically, the storage industry lags on the platform development side of the house (mostly out of necessity). So, part of looking at the Future Storage System was to take into consideration that in the basic design, some of the more current technologies could and should be used to enable “forward” thinking. That’s why you see such a heavy emphasis on Torrenza, Hypertransport, and integrated memory controllers. With the exception of Torrenza, each of these aspects of system design has a rich history. Hypertransport, arguably, has been an outlier on the bus technology side, but its capabilities and industry support are hard to match. Integrated memory controllers, while “nothing new” (DEC Alpha, anyone?), really came to the fore when AMD introduced them as part of the Athlon 64 series of processors. Today, I’d like to toss another wrinkle into the “platform meets storage” discussion by including another developing technology: the GPU (Graphics Processing Unit).

Assuming that you made it to the second page, you’re probably scratching your head right now. “What do graphics have to do with storage?” In a way, your confusion is merited. Historically, GPUs have been dedicated to processing mathematical calculations that ultimately result in the display of images on an attached screen. What hasn’t really been available until recently has been a way to utilize the GPU and its absolutely astronomical calculation power (and memory bandwidth) for anything other than graphics-oriented vector calculations. Starting around 2006-2007, nVidia and ATI made available programming “hooks” for using the stream engines for different payloads. This type of accessibility was most noticeable in the Folding@home project, which used an optimized ATI driver set to crunch protein folding routines for research purposes. Mercury Computer Systems, another name in the industry, was utilizing the Cell processor for much the same thing. nVidia was a little late to the party but brought out its Tesla line of computation hardware and developed the CUDA (Compute Unified Device Architecture) framework for optimizing workloads for GPU processing.

The question is, what applicability does this have to the FSS? If you look back at Part 3a, you’ll note my explanation of Torrenza and dedicated workload processors that weren’t CPUs. To quote:

So, you want dedicated processing for x-type of application?  Install a co-processor into an available 1207 socket in the system.  Systems using Cell processors, for example, have been demonstrated behind doors (not commercially available to the best of my knowledge).  The ultimate goal here would be to allow specialized co-processors for applications (RSA disk encryption) that would be offloaded from the general storage I/O processors.  The application set is really endless.  Want to do data encryption inband or at rest?  Install an RSA encryption co-processor.  Want to do compression or de-dupe?  Install a compliant DSP or co-processor that performs that task.  When we look at the operating system for this Future Storage System, you’ll see even more applicability.

As you can see, there are various types of workloads that benefit from additional processing power. Deduplication, compression, encryption, and even LUN virtualization require additional processing power beyond the standard RAID XOR or parity operations. Think of it this way: if the FSS Operating System has the intelligence to shift specialized workloads over to a separate “processing stream,” the overall storage system can scale to even more I/O before additional base computation power is required. Thus, it minimizes overall storage solution cost (not necessarily the FSS cost, mind you) and reduces footprint in the environment. There would be no more need for external deduplication appliances (sorry, Avamar and Data Domain), no need for external encryption devices (I’d say sorry to NeoScale but the last time I checked, they weren’t solvent), and certainly no need for compression cards. Even further, some of the basic pack/unpack duties for replication could be handled out of stream as well. Imagine the possibilities.
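To make the deduplication case a little more concrete, here is a minimal sketch in Python of what that workload boils down to. The fixed 4 KB chunk size, the choice of SHA-256 as the fingerprint, and the function names are all my assumptions for illustration, not anything defined for the FSS, and this version runs on a plain CPU. The thing to notice is that each chunk is fingerprinted independently of every other chunk, which is the kind of repetitive, data-parallel work a stream processor could plausibly take off the main I/O path.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


def dedupe(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and keep each unique chunk once."""
    store = {}   # fingerprint -> chunk bytes (unique chunks only)
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # Each fingerprint depends only on its own chunk, so this loop
        # is the part that could be farmed out to parallel engines.
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data from the unique chunks and the recipe."""
    return b"".join(store[fp] for fp in recipe)
```

Feeding this three identical 4 KB chunks followed by one different chunk leaves two entries in the store and four in the recipe, and `rebuild` returns the original bytes.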

Considerations:

As with any other sort of technology being introduced, there are considerations to be made.  First and foremost, how does the GPU interface with the rest of the FSS (physically/electrically)?  Secondly, what accommodations need to be made within the FSS OS to ensure proper pathing?

Physical/Electrical:

There are two major ways that the GPU could be integrated within the FSS: PCIe (PCI Express) and HT (Hypertransport). The advantages of PCIe really come down to ease of integration, both from a design perspective (the GPUs already have native PCIe communication) and from an overall FSS system integration perspective. In my mind, I envision the GPU being mounted either on a pluggable card (similar to the I/O expansion options discussed previously) or, in a Torrenza model, being placed in a CPU socket on the expansion board. The advantage of using Hypertransport really would only be apparent if you were using the HTX standard for the pluggable slots. The Torrenza approach to GPU integration already injects the GPU into the overall platform I/O stream but would require hardware compatibility.

Along with the basic I/O issues, there is also the issue of power and heat. GPUs have historically been extremely hot and power hungry. Most high-end cards feature large heatsinks with large blower fans that direct heat away from the memory banks and GPU core. The introduction of such a component within the FSS requires careful study of the hardware layout as well as the overall power requirements for both core operation and GPU integration. However power-hungry GPUs may be, there have been developments in GPU power management and optimization. The current crop of GPUs can scale their processing speeds depending on the incoming workload, thus lessening power draw and thermal dissipation.

OS Accommodations:

The inclusion of a GPU within a storage system obviously means that some level of OS optimization must be done in order to utilize the GPU’s stream engines for tasks like deduplication, compression, and encryption. For one, the core OS must recognize incoming data and be able to route it either through the base CPU engine or through the GPU. Secondly, the OS must still maintain a metadata map of the data and its placement within the system. These are not light tasks at all and must be integrated into the design.
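As a rough illustration of those two duties (routing and mapping), here is a small Python sketch. Everything in it is hypothetical: the `Router` class, the workload names, and the idea of recording the engine alongside a location string are my own placeholders, not part of any FSS OS design. It only shows the shape of the decision: specialized workloads go to the GPU when one is present, everything else stays on the base CPU engine, and the map is updated either way.

```python
from dataclasses import dataclass, field

# Hypothetical workload labels, borrowed from the tasks discussed above.
GPU_WORKLOADS = {"dedupe", "compress", "encrypt"}


@dataclass
class Router:
    gpu_available: bool = True
    # block id -> (engine that processed it, where it was placed)
    block_map: dict = field(default_factory=dict)

    def route(self, workload: str) -> str:
        """Pick an engine: GPU for specialized work, CPU for everything else."""
        if self.gpu_available and workload in GPU_WORKLOADS:
            return "gpu"
        return "cpu"

    def write(self, block_id: int, workload: str, location: str) -> str:
        """Route one block and record its placement in the metadata map."""
        engine = self.route(workload)
        self.block_map[block_id] = (engine, location)
        return engine
```

Note the fallback: with `gpu_available` set to false, every workload lands on the CPU, so the system keeps working (just more slowly) if the GPU is absent or pulled for service.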

Closing:

Hopefully, I’ve made a case for the role that a GPU can have within a storage system. It’s not about graphics anymore; it’s about processing power. The GPU is a powerful tool that has real-world effects and applicability to our storage workloads.

cheers,

Dave
