Future Storage Systems: Part 2 – Detailed Node View

by dave on October 8, 2008

So, in my article yesterday, I gave a global view of a very simple storage system for the future. Since I LOVE this type of conjecture and theoretics (is that a word?), I decided to take this a step further and flesh out some of the other intricacies of the design.  Check out the image below and then click through to read the rest.

Fleshing out the Hypertransport Storage System

Starting clockwise from the lower left-hand corner, let's take a deeper look at things.

First off, multi-core processors are here to stay.  It's honestly not even worth evaluating dual-core processors these days, since most server/workstation systems are designed around multiple processor cores (AMD Shanghai/Istanbul and Intel Dunnington, for example).  The obvious requirement at the software level is that the storage system OS is multi-threaded and contains a scheduler that understands that multiple logical cores exist and are usable.  Further, using processor affinity rules (a la VMware's ESX hypervisor layer), you could tie the OS to a specific core and leave the rest of the cores available for specific processes (i.e. RAID XOR or parity calculations, etc.), as in the sketch below.
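To make the affinity idea a bit more concrete, here's a minimal sketch (not any vendor's actual array OS code) of pinning a hypothetical parity-calculation worker thread to its own core, assuming a Linux-style platform with the GNU pthread affinity extensions:

    /*
     * Sketch only: pin a hypothetical parity-calculation worker to
     * logical core 1, leaving core 0 free for the storage OS itself.
     * Assumes Linux with GNU extensions (compile with -pthread).
     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    static void *parity_worker(void *arg)
    {
        (void)arg;
        /* ... RAID XOR / parity calculations would run here ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_attr_t attr;
        cpu_set_t cpus;
        int rc;

        /* Restrict the worker thread to logical core 1 only. */
        CPU_ZERO(&cpus);
        CPU_SET(1, &cpus);

        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

        rc = pthread_create(&tid, &attr, parity_worker, NULL);
        if (rc != 0)
            fprintf(stderr, "pthread_create: %s\n", strerror(rc));
        else
            pthread_join(tid, NULL);

        pthread_attr_destroy(&attr);
        return 0;
    }

In a real array OS the scheduler would own these decisions, of course, but the principle is the same: keep the housekeeping core and the number-crunching cores out of each other's way.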

Second on the list is inter-processor communication.  Historically, this would be called the "system bus," and any I/O traffic to processors and memory would be conducted over this link.  With processors gaining direct access to their own memory banks (see the next point) and the subsequent removal of general memory traffic from the processor bus (outside of requests for memory pages not located directly in the processor's "owned" bank), the traffic across these links would be limited to I/O and processing requests.  Pretty handy.  Using HyperTransport, you could have (as is the case today) a coherent processor link and a non-coherent link (sideband traffic) that would be able to gang/un-gang as needed (a feature of HT 3.0); a quick bandwidth sketch follows below.
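For a sense of scale, here's a back-of-the-envelope calculation assuming HT 3.0-class numbers (up to a 2.6 GHz link clock, double data rate, and 8- or 16-bit link widths); the exact figures obviously depend on how the links are actually clocked and ganged in a shipping system:

    /*
     * Rough HyperTransport link bandwidth, assuming HT 3.0-class
     * numbers: 2.6 GHz link clock, double data rate, 8- or 16-bit
     * link width per direction.
     */
    #include <stdio.h>

    static double ht_gbytes_per_sec(double clock_ghz, int width_bits)
    {
        /* DDR signaling: two transfers per clock, width_bits/8 bytes each. */
        return clock_ghz * 2.0 * (width_bits / 8.0);
    }

    int main(void)
    {
        /* One ganged 16-bit link vs. one of two unganged 8-bit links. */
        printf("16-bit link @ 2.6 GHz: %.1f GB/s per direction\n",
               ht_gbytes_per_sec(2.6, 16));
        printf(" 8-bit link @ 2.6 GHz: %.1f GB/s per direction\n",
               ht_gbytes_per_sec(2.6, 8));
        return 0;
    }

Roughly 10 GB/s per direction on a full-width link is plenty of headroom for I/O and inter-processor requests once general memory traffic is off the bus.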

Thirdly, it's important to realize the role that cache/memory plays in storage systems.  Not only is it important for the array OS (FLARE, DART, ONTAP, etc.) and its various operating needs, it's also used for the read/write cache.  Obviously, the different OSes offer various methods of manipulation to tune reads vs. writes and allocate more to one bucket than the other.  The other necessary requirement for memory has to do with the number of processors in a system.  Based on the design above, you can either tie your processors to a central MCH (Memory Controller Hub) like Intel has historically done (Nehalem changes this) or put the MCHs inside the physical processor die (for example, AMD Opteron processors).  An issue will obviously arise when you go to read memory attached to an adjoining processor, injecting latency and I/O hops, but overall the integrated MCH design reduces memory access latency and allows for additional bandwidth on direct processor accesses.  A lot of this can be mitigated by careful system design, as in the sketch below, but we'll need to approach this more in depth in a later article.
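As a rough illustration of keeping cache memory local to the processor that services it, here's a sketch assuming Linux with a reasonably recent libnuma (link with -lnuma); the buffer size and the "write cache" label are just placeholders, not any array's real cache layout:

    /*
     * Sketch only: allocate a hypothetical write-cache buffer on the
     * NUMA node local to the CPU that will service it, avoiding the
     * extra hop to an adjoining processor's memory bank.
     */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not supported on this system\n");
            return EXIT_FAILURE;
        }

        int cpu  = sched_getcpu();          /* core we're running on     */
        int node = numa_node_of_cpu(cpu);   /* memory node it "owns"     */

        size_t cache_bytes = 64UL * 1024 * 1024;   /* 64 MB, arbitrary   */
        void *write_cache  = numa_alloc_onnode(cache_bytes, node);
        if (write_cache == NULL) {
            fprintf(stderr, "numa_alloc_onnode failed\n");
            return EXIT_FAILURE;
        }

        printf("cpu %d -> node %d: %zu-byte cache allocated locally\n",
               cpu, node, cache_bytes);

        numa_free(write_cache, cache_bytes);
        return EXIT_SUCCESS;
    }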

Fourth on the hit list is the Southbridge and PCIe.  PCIe is usually discussed from a platform perspective as having "lanes": the higher the bandwidth, the more lanes required.  In order to keep the system I/O per node within a serviceable spec, I've purposefully limited the PCIe expansion to three x8 PCIe slots (a quick bandwidth calculation follows below).  To be truly multiprotocol, there needs to be support for the "standard" SAN connectivity types: Fibre Channel, iSCSI, and NAS.  In my mind, I can see where either InfiniBand (doing something like IP over IB or IB-to-FC routing with QLogic SilverStorm or Xsigo routers) or FCoE (Cisco Nexus or Brocade DCX) would provide enough SAN-facing connectivity and bandwidth for customers without completely saturating the system bus.  At the same time, it's important to allow for some level of I/O expandability to avoid array downtime for port upgrades and the like.  This is currently an issue with NetApp, LeftHand, HP, HDS, Sun, 3PAR, Pillar, EqualLogic, et al., and I'd point to the EMC CLARiiON CX4 line as the exemplar of how to handle I/O expansion correctly.  Compellent has a pseudo-interface-updating process documented somewhere, but honestly, it's nowhere near as elegant as the CX4's.
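For a rough sense of the I/O budget those three slots give you, here's a quick calculation assuming PCIe 2.0 rates of roughly 500 MB/s per lane, per direction (5 GT/s with 8b/10b encoding); the real ceiling will be a bit lower once protocol overhead is accounted for:

    /*
     * Rough PCIe budget for the proposed three x8 slots, assuming
     * PCIe 2.0: ~500 MB/s per lane, per direction.
     */
    #include <stdio.h>

    int main(void)
    {
        const double mb_per_lane = 500.0;   /* PCIe 2.0, per direction */
        const int lanes_per_slot = 8;
        const int slots          = 3;

        double per_slot = mb_per_lane * lanes_per_slot;
        double total    = per_slot * slots;

        printf("per x8 slot : %.0f MB/s per direction (~%.0f GB/s)\n",
               per_slot, per_slot / 1000.0);
        printf("3 x8 slots  : %.0f MB/s per direction (~%.0f GB/s)\n",
               total, total / 1000.0);
        return 0;
    }

That works out to roughly 4 GB/s per slot and about 12 GB/s per direction across the node, which is why three x8 slots felt like a sane cap for per-node I/O.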

Since I didn't touch on the Southbridge portion of things, I'd just hastily point out that the AMD/ATI relationship has recently borne significant fruit in the form of new chipsets that will actually give the AMD platform a bit of a boost when it comes to server-class implementations.  Nothing against NVIDIA or Broadcom, per se, but I'm tired of MCP55/HT2000-based systems. 😉

Fifth on the list is the I/O interfaces to the SAN.  As noted, you can see that I've included pretty much every interface on the market as an option.  Fibre Channel over Ethernet, Fibre Channel (discrete), InfiniBand, iSCSI, and IP (GigE or greater) all have their relative merits on the market.  This type of flexibility is a must, especially since some customers will never use certain interface types.  From a high level, I agree with the "multi-protocol" box sentiment that everyone seems to espouse, but there needs to be a certain level of flexibility to remove those protocols that are not needed.  If you look at this platform as a "framework" for any protocol, you could easily add/remove interfaces as needed; a rough sketch of that idea follows below.  Customer wants only NAS/IP connectivity?  The base framework won't change, only the I/O card(s) will.  I personally believe the trend is toward integrated fabrics/protocols like FCoE and IPoIB, but we'll see what the general acceptance is.
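To show what I mean by a protocol "framework," here's a purely hypothetical sketch; every name and function in it is made up for illustration, not a real array API.  The point is that the SAN-facing protocols hang off a common table that can be populated per customer, so the I/O cards and modules change while the base platform doesn't:

    /*
     * Hypothetical sketch of the "framework" idea: SAN-facing
     * protocols register behind a common front-end interface, so
     * adding or removing FC/FCoE/iSCSI/IB/NAS support changes the
     * table (and the cards), not the base platform.
     */
    #include <stdio.h>

    struct san_frontend {
        const char *name;
        int  (*init)(void);
        void (*shutdown)(void);
    };

    static int  iscsi_init(void)     { printf("iSCSI target up\n");   return 0; }
    static void iscsi_shutdown(void) { printf("iSCSI target down\n"); }

    static int  nas_init(void)       { printf("NAS/NFS service up\n");   return 0; }
    static void nas_shutdown(void)   { printf("NAS/NFS service down\n"); }

    int main(void)
    {
        /* Customer wants only NAS/IP connectivity?  Change this table. */
        struct san_frontend enabled[] = {
            { "iscsi", iscsi_init, iscsi_shutdown },
            { "nas",   nas_init,   nas_shutdown   },
        };
        size_t n = sizeof(enabled) / sizeof(enabled[0]);

        for (size_t i = 0; i < n; i++)
            enabled[i].init();
        for (size_t i = 0; i < n; i++)
            enabled[i].shutdown();
        return 0;
    }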

Finally, you need some way to manage the nodes and the OS.  There’s a GigE port per node to handle any management functions that could be needed.

In any case, that’s it for today!  Let me know your feedback and thoughts!

(special thanks to Stephen Todd and Ken Ferson @ EMC for getting me to THINK!)

10/14/08 : (EDITS) : made edits for clarity and appropriate product naming conventions.
