@joeharris76: July 2009

Wednesday, 29 July 2009

Why GPUs matter for DW/BI

I tweeted a few days ago that I wasn't particularly excited about either the Groovy Corp or XtremeData announcements because I think any gains they achieve by using FPGAs will be swept away by GPGPU and related developments. I got a few replies, some asking what GPGPU is and a few dismissing it as irrelevant (vis-à-vis Intel x64 progress). So I want to explain my thoughts on GPGPU, and how it may affect the Database / Business Intelligence / Analytics industry (industries?).

GPGPU stands for "general-purpose computing on graphics processing units". ([dead link]) GPGPU is also referred to as "stream processing" or "stream computing" in some contexts. The idea is that you can offload the processing normally done by the CPU to the computer's graphics card(s).

But why would you want to? Well, GPUs are on a roll. Their performance is growing far faster than CPU performance. I don't want to overload this post with background info, but suffice it to say that GPUs are *incredibly* powerful now and the gap is widening. If you doubt that this is the case, have a look at this article on the Top500 supercomputing site, point 4 specifically. ([dead link])

This is not a novel insight on my part. I've been reading about this trend since at least 2004. There was a memorable post on Coding Horror in 2006 ([dead link]). Nvidia released their C compatibility layer "CUDA" in 2006 ([dead link]) and ATI (now AMD) released their alternative "Stream SDK" in 2007 ([dead link]). More recently the OpenCL project has been established to allow programmers to tap the power of *any* GPU (Nvidia, AMD, etc.) from within high-level languages. This is being driven by Apple, and their next OS X update will delegate many tasks to the GPU using OpenCL.

That's the what.

Some people feel that GPGPU will fail to take hold because Intel will eventually catch up. This is a reasonable point of view and in fact Intel has a project called Larrabee ([dead link]). They are attempting to make a hybrid chip that effectively emulates a GPU within the main processor. It's worth noting that this is very similar to the approach IBM have taken with the Cell chip used in the Playstation3 and many new supercomputers. Intel will be introducing a new set of extensions (like SSE2) that will have to be used to tap into the full functionality. The prototypes that have been demo'ed are significantly slower than current pure GPUs. The point is that Intel are aware of GPGPU and are embracing it. The issue for Intel is that the exponential growth of GPU power looks like it's going to put them on the wrong side of a technology growth curve for once.

Why are GPUs important to databases and analytics?

The multi-core future is here now.

I'm sure you've heard the expression "the future is already here, it's just unevenly distributed". Well, that applies double to GPGPU. We can all see that multi-core chips are where computing is going. The clock speed race ended in 2004. Current high-end CPUs now have 4 cores, 8 cores will arrive next year, and on it goes. GPUs have been pushing this trend for longer and are much further out on this curve. High-end GPUs now contain up to 128 cores and their core count is doubling faster than that of CPUs.

Core scale out is hard.

Utilizing more cores is not straightforward. Current software does not utilize even 2 cores effectively. If you have a huge spreadsheet calculating on your dual core machine you'll notice that it only uses one core. So half the available power of your PC is just sitting there while you're twiddling your thumbs.

Database software has a certain amount of parallelism built in already, particularly the big 3 "enterprise" databases. But the parallel strategies they employ were designed for single-core chips residing in their own sockets and having their own private supply of RAM. Can they use the cores we have right now? Yes, but the future now looks very different: hundreds of cores on a single piece of silicon.

Daniel Abadi's recent post about hadoopDB predicts a "scalability crisis for the parallel database system". His point is that current MPP databases don't scale well past 100 nodes ([dead link]). I'm predicting a similar crisis in scalability for *all database systems* at the CPU level. Strategies for dividing tasks up among 16 or 32 or even 64 processors with their own RAM will grind to a halt when used across 256 (and more) cores on a single chip with a single path to RAM.

Main memory I/O is the new disk I/O.

Disk access has long been our Achilles heel in the database industry. The rule of thumb for improving performance is to minimize the amount of disk I/O that you perform. This weakness has become ever more problematic as disk speeds have increased very, very slowly compared to CPU speed. Curt Monash had a great post about this a while ago ([dead link]).

In our new multi-core world we will have a new problem. Every core we add increases the demand for data going into and out of RAM. Intel have doubled the width of this "pipe" in recent chips but practical considerations will constrain increases in this area in a similar manner to the constraints on disk speed seen in the past.
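To put rough numbers on that squeeze, here is a minimal Python sketch. It assumes a deliberately crude model in which one shared pipe to RAM is divided evenly among the cores; the figures (32 GB/s for the chip, 2 GB/s demanded by one busy core) and the function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def share_per_core(total_gb_per_s, cores):
    """Bandwidth each core gets if the shared pipe is divided evenly."""
    return total_gb_per_s / cores


def cores_kept_busy(total_gb_per_s, demand_gb_per_s, cores):
    """How many cores the pipe can feed at full speed (capped by core count)."""
    return min(cores, int(total_gb_per_s // demand_gb_per_s))


PIPE_GB_PER_S = 32.0   # hypothetical memory bandwidth for the whole chip
DEMAND_GB_PER_S = 2.0  # hypothetical appetite of one busy core

for cores in (4, 16, 64, 256):
    share = share_per_core(PIPE_GB_PER_S, cores)
    busy = cores_kept_busy(PIPE_GB_PER_S, DEMAND_GB_PER_S, cores)
    print(f"{cores:>3} cores: {share:.3f} GB/s each, {busy} cores fully fed")
```

With these made-up numbers, doubling the pipe to 64 GB/s only raises the ceiling from 16 fully fed cores to 32, so a wider pipe buys time rather than removing the bottleneck.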

Databases will have to change.

Future databases will have to be heavily rewritten and probably re-architected to take advantage of multi-core processor improvements. Products that seek to fully utilize many cores will have to be just as parsimonious with RAM access as current generation columnar and "in-memory" databases are with disk. Further, they will have to become just as savvy about parallelizing their work as current MPP databases are, but they will have to co-ordinate this parallelism at 2 levels instead of just 1.

1st: Activity and data must be split and recombined across Servers/Instances (as currently)

2nd: Activity and data must be split and recombined across Cores, which will probably have dedicated RAM "pools".
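The two levels above can be sketched in a few lines of Python. This is only an illustration of the shape of the problem, not how any real database does it: the "servers" are simulated by a plain loop, the "cores" by a thread pool (which in CPython will not give true CPU parallelism), and all function names are hypothetical. The query here is a simple average, computed from partial (sum, count) pairs that are recombined first per server and then across the cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split(rows, parts):
    """Deal rows into `parts` roughly equal slices (round-robin)."""
    return [rows[i::parts] for i in range(parts)]


def core_task(rows):
    """Work done by one core: a partial (sum, count) over its slice."""
    return sum(rows), len(rows)


def combine(partials):
    """Recombine partial (sum, count) pairs into one pair."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count


def server_task(rows, cores):
    """2nd level: split one server's rows across its cores, then recombine."""
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return combine(list(pool.map(core_task, split(rows, cores))))


def cluster_average(rows, servers, cores):
    """1st level: split across servers, then recombine the server results."""
    partials = [server_task(part, cores) for part in split(rows, servers)]
    total, count = combine(partials)
    return total / count


values = list(range(1, 101))
average = cluster_average(values, servers=4, cores=8)
print(average)
```

The same `combine` step is reused at both levels, which only works because sums and counts can be merged in any order. Operations that cannot be merged this way (a median, say) are where co-ordinating two levels gets hard.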

1st movers will gain all the momentum.

So, finally, this is my basic point. There's a new world coming. It has a lot of cores. It will require new approaches. That world is accessible today through GPUs. Database vendors who move in this direction now will gain market share and momentum. Those who think they can wait on the Intel and "traditional" CPUs to "catch up" may live to regret it.

A few parting thoughts…

I said at the start that I feel FPGAs will be swept away. I should make 2 caveats to that. First, I can well imagine a world where FPGAs come to the fore as a means to co-ordinate very large numbers of small simple cores. But I think we're still quite a long way from that time. Second, Netezza use FPGAs in a very specific way between the disk and CPU/RAM. This seems like a grey area to me; however, Vertica are able to achieve very good performance without resorting to such tricks.

Kickfire is a very interesting case as regards GPGPU. They are using a "GPU-like" chip as their workhorse. Justin Swanhart was very insistent that their chip is not a GPU (that is an analogy) and that it is truly a unique chip. For their sake I hope this is marketing spin and the chip is actually 99% standard GPU with small modifications. Otherwise, I can't imagine how a start-up can engage in the core count arms race long term, especially when it sells to the mid-market. Perhaps they have plans to move to a commodity GPU platform.

A very interesting paper was published recently about performing database operations on a GPU. You can find it here ([dead link]). I'd love to know what you think of the ideas presented.

Finally, I want to point out that I'm not a database researcher nor an industry analyst. My opinion is merely that of a casual observer, albeit an observer with a vested interest. I hope you will do me the kindness of pointing out the flaws in my arguments in the comments.

Monday, 6 July 2009

Useful benchmarks vs human nature. A final thought on the TPC-H dust-up.

There was a considerable flap recently on Twitter and in the blogosphere about TPC-H in general. It was all triggered by the new benchmark submitted by ParAccel in the 30TB class. You can relive the gory details on Curt Monash's DBMS2 site here (http://tr.im/rbCe), if you're interested.

I stayed out of the discussion because I'm kind of burned out on benchmarks in general. I got fired up about benchmarks a while ago and even sent an email with some proposals to Curt. He was kind enough to respond and his response can be summed up as "What's in it for the DB vendor?". Great question and, to be honest, not one I could find a good answer for.

For the database buyer, a perfect benchmark tells them which database has the best mix of cost and performance, especially in data warehousing. This is what TPC-H appears to offer (leaving aside the calculation of their metrics). However, a lot of vendors have not submitted a benchmark. It's interesting to note that vendors such as Teradata, Netezza and Vertica are TPC members but have no benchmarks. The question is why not.

For a database vendor, a perfect benchmark is a benchmark that they can win. Curt has referred to Oracle's reputed policy of WAR (win all reviews). This is why their licenses specifically prohibit you from publishing benchmarks. There is simply no upside to being 3rd, 5th or anything but first in a benchmark. If Oracle are participating in a given benchmark, the simple economic reality is that they know they can win it.

This is the very nature of the TPC-H: it is designed to be very elastic and to allow vendors wiggle room so that they can submit winning figures. I'm sure the TPC folks would disagree on principle, but TPC is an industry group made up of vendors. Anything that denies them this wiggle room will either be vetoed or get even less participation than we currently see.

This is a bitter pill to swallow but seems unlikely to change. These days I'm delivering identical solutions across Teradata, Netezza, Oracle and SQL Server. I have some very well formed thoughts on the relative cost and performance of these databases but of course I can't actually publish any data.

By the way, the benchmark I suggested to Curt was about reducing the hardware variables. Get a hardware vendor to stand up a few common configurations (mid-size SMP using a SAN, 12 server cluster using local storage, etc.) at a few storage levels (1TB, 10TB, 100TB) and then test each database using identical hardware. The metrics would be things like max user data, aggregate performance, concurrent simple queries, concurrent complex queries, etc. Basically trying to isolate the performance elements that are driven by the database software and establish some approximate performance boundaries. With many more metrics being produced there can be a lot more winners. Maybe the TPC should look into it…
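As a rough illustration of how quickly that proposal multiplies the number of results, here is a small Python sketch that enumerates the test matrix. The hardware configurations, storage levels and metrics are the ones listed above; the database names `db_a` and `db_b` and the function name `build_matrix` are placeholders, not real products or an existing tool.

```python
from itertools import product

HARDWARE = ["mid-size SMP using a SAN", "12 server cluster using local storage"]
STORAGE_TB = [1, 10, 100]
METRICS = [
    "max user data",
    "aggregate performance",
    "concurrent simple queries",
    "concurrent complex queries",
]


def build_matrix(databases):
    """One entry per (database, hardware, storage level, metric) to be measured."""
    return [
        {"database": db, "hardware": hw, "storage_tb": tb, "metric": metric}
        for db, hw, tb, metric in product(databases, HARDWARE, STORAGE_TB, METRICS)
    ]


runs = build_matrix(["db_a", "db_b"])  # placeholder database names
categories = len(HARDWARE) * len(STORAGE_TB) * len(METRICS)
print(len(runs), "measurements,", categories, "categories a vendor could win")
```

Even this tiny version has 24 separate categories, which is the point: a vendor that cannot top the headline number may still be able to claim a win somewhere in the grid.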

Sunday, 5 July 2009

The future of BI? It has nothing to do with business…

I've been reading and re-reading Stuart Sutherland's excellent book Irrationality for several weeks (review to come - promise). One of the things he talks about is "making the wrong connections". His point is that humans can't reliably evaluate evidence and make connections in their heads. We focus on the elements that are unusual or different and we massively overvalue our initial guesses.

That really resonates with me. After all that's what Business Intelligence is about, right? We provide factual, numeric, and clean data in a format that allows the user to make reasonable, rational decisions. We lambast the BI nay-sayers who operate on "gut instinct" and rightly so. But we leave that hyper-rational approach at the office door and conduct the rest of our lives in our normal irrational way.

In truth we conduct 95% of our working lives that way as well. The minute-to-minute stuff that business is *really* made of is unrecorded, unanalysed and (of course) irrational. All those conversations, relationships, emails, phone calls and meaningful looks are dealt with by instinct.

Outside the office we're seeing an explosion in personal monitoring and self surveillance. Devices like the iPhone can track every interaction, accessories like Nike+ allow us to track every step we take, software like RescueTime continuously monitors our computer usage. Even Facebook is a way to monitor your relationships, something that seemed completely intangible a few years ago. Etc, etc, etc.

This is the future of BI: Rational Augmentation. Using tracking data to make faster, better and more rational decisions about everything in our lives. It's about dealing with huge volumes of hyper-personal data and finding the patterns that matter. It lives outside the office and outside the corporation. It's a dash of text-mining, a pinch of regression, a dollop of aggregation, a spoonful of advanced analytics and a heap of basic statistics.

Many people will feel uncomfortable about this but the young will adopt it without question and those who adopt it will do better. Let's face it, it's a sub-optimal world out there and an edge in rationality could be a very big edge indeed.

As a final thought, this has the makings of a classic innovator's dilemma for the current BI players. Rational Augmentation (I'm loving this phrase but call it what you like…) is going to need to deal with large data volumes very cheaply and very locally. It will probably be service based. It will probably be free for at least some users. But ultimately it will be a huge market, dwarfing the current BI market. The current players may have the skills to take this on but they've been swallowed by the corporate quicksand and they will sit and watch it pass them by. C'est la vie.