@joeharris76: December 2010

Thursday, 30 December 2010

2010 Review: a BI-DW Top 5

This post is written completely 'off the cuff' without any fact checking or referring back to sources. Just sayin'…

Top 5 from 2010

5) Big BI consolidation is finished
  There were no significant acquisitions of "Big BI" vendors in 2010. Since Cognos went to IBM and BO went to SAP, the last remaining member of the old guard is MicroStrategy. (It's interesting to consider why they have not been acquired but that's for another post.) In many ways the very definition of Big BI has shifted to encompass smaller players. Analysts, in particular, need things to talk about and they have effectively elevated a few companies to Big BI status that were previously somewhat ignored, e.g., SAS (as a BI provider), InformationBuilders, Pentaho, Acuate, etc. All of the major conglomerates now have a 'serious' BI element in their offerings and so I don't see further big spending on BI acquisitions in 2011. The only dark horse in this race seems to be HP and it's very unclear what their intentions are, particularly with the rumours of Neoview being cancelled; if HP were to move I see them going for either a few niche players or someone like InformationBuilders with solid software but lacking in name recognition.

4) Analytic database consolidation began
  We've seen an explosion of specialist Analytic databases over the last ~5 years and 2010 saw the start of a consolidation phase amongst these players. The first big acquisition of 2010 was Sybase by SAP; everyone assumed Sybase's IQ product (the original columnar database) was the target but the talk since then has been largely about the Sybase mobile offerings. I suspect both products are of interest to SAP; IQ allows them to move some of their ageing product lines forward and Mobile will be an enabler for taking both SAP and Business Objects to smartphones going forward.
  The banner acquisition was Netezza by IBM. I've long been very critical/sceptical of IBM's claims in the Data Warehouse / Analytic space. Particularly as I've worked with a number of DW's that were taken off DB2 (onto Teradata) but never come across one actively running on DB2. I'm a big Netezza fan so my hope is that they survive the integration and are able to leverage the resources of IBM going forward.
  We also saw Teradata acquiring the dry husk of Kickfire's ill-fated MySQL 'DW appliance'. Kickfire's fundamental technology appeared to be quite good but sadly their market strategy was quite bad. I think this a good sign from Teradata that they are open to external ideas and they see where the market is going. The competition with Netezza seems to have revitalised them and given them a new enemy to focus on. A new version of Teradata database that incorporated some columnar features (and an 'free' performance boost) could be just the ticket to get their very conservative customers migrated onto the latest version.

3) BI vendors started thinking about mobile
  Mobile BI became a 'front of mind' issue in 2010. MicroStrategy has marketed aggressively in this space but other vendors are in the hunt and have more or less complete mobile offerings. Business Objects also made some big noise about mobile but everything seemed to be demos and prototypes. Cognos has had a 'mobile' offering for some time but they remained strangely quiet, my impression is that their mobile offerings are not designed for the iOS/Android touchscreen world.
  Niche vendors have been somewhat quiet on the mobile front, possibly waiting to see how it plays out before investing, with the notable exception of Qlikview who have embraced it with both arms. This is a great strategic move for Qlikview (who IMHO prove the koan that 'strategy trumps product') because newer mobile platforms are being embraced by their mid-market customers far faster than at Global 5000 companies that the Big BI vendors focus on. Other niche and mid-market vendors should take note of this move and get something (anything!) ready as quickly as possible.

2) Hadoop became the one true MapReduce
  I remain somewhat non-plussed by MapReduce personally, however a lot of attention has been lavished on it over the last 2 years and during the course of 2010 the industry has settled on Hadoop as the MapReduce of choice. From Daniel Adabadi's HadoopDB project to Pentaho's extensive Hadoop integration to Aster's "seamless connectivity" with Hadoop to Paraccel's announcement of the same thing coming soon and on and on. The basic story of MapReduce was very sexy but in practice the details turned out to be "a bit more complicated" (as Ben Goldacre [read his book!] would say). It's not clear that Hadoop is the best possible MR implementation but it looks likely to become the SQL of MapReduce. Expect other MapReduce implementations to start talking about Hadoop compatibility ad nauseum.
  All of this casts Cloudera in an interesting light. They are after all "the Hadoop company" according to themselves. It's far too early for a 'good' acquisition in this space however money talks and I wonder if we might see something happen in 2011.

1) The Cloud got real and we all got sick of hearing about it
  I'm not sure whether 2010 was truly the "year of the Cloud" but it certainly was the peak of it's hype cycle. In 2010 the reality of cloud pricing hit home; the short version is that a lot of the fundamental cost of cloud computing is operational and we shouldn't expect to see continuous price/performance gains like we have seen in the hardware world. Savvy observers have noted that the bulk of enterprise IT spending has been non-hardware for a long time but the existence of cloud offerings brings those costs into focus.
  Ultimately, my hope for the Cloud is that it will drive companies toward buying results, e.g., SaaS services that require little-to-no customisation, and away from buying potential, e.g. faster hardware and COTS software that is rarely fit for purpose. The cycle should go something like: "This Cloud stuff seems expensive, how much does it cost us to do the same thing?" > "OMG are you frickin' serious, we really spend that?!" > "Is there anyone out there that can provide the exact same thing for a monthly fee?". Honestly, big companies are incredibly bad at hardware and even worse at software. The Cloud (as provided by Amazon, et al) is IMHO just a half step towards then endpoint which is the use of SaaS offerings for everything.

Wednesday, 15 December 2010

Initial thoughts about ParStream

So here are my thoughts about ParStream based on researching their product on the internet only. I have not used the product, so I am simply assuming it lives up to all claims. As an analytics user and a BI-DW practitioner I sincerely hope that ParStream succeeds.

I'm a GPU believer
I'm a long time believer in the importance of utilising GPU for challenging database problems. I wrote a post in July 2009 about using GPUs for databases and implored database vendors to move in that direction: "Why GPUs matter for DW/BI" (http://joeharris76.blogspot.com/2009/07/why-gpus-matter-for-dwbi.html). Here's the key quote - "There's a new world coming. It has a lot of cores. It will require new approaches. That world is accessible today through GPUs. Database vendors who move in this direction now will gain market share and momentum. Those who think they can wait on Intel and 'traditional' CPUs to 'catch up' may live to regret it."

On the right track
I think ParStream is *fundamentally* on the right track with a GPU accelerated analytic database. The ParStream presentation from Mike Hummel (http://www.youtube.com/watch?v=knicXkXd9hQ) talks about a query that took 12 minutes on Oracle taking just a few *miliseconds* on ParStream. If that is even half right the potential to shake up the industry and radically raise the bar on database performance is very exciting.

Reminiscent of Netezza
I remember the first time I used Netezza back in 2004. I had just taken a new role and my new company had recently installed a first generation Netezza appliance. In my previous job we had an Oracle data warehouse that was updated *weekly* and contained roughly 100 million rows. Queries commonly took *hours* to return. The Netezza machine held just less than 1 *billion* rows. I ran the following query: "SELECT month, COUNT(*), SUM(call_value) FROM cdr GROUP BY month;". It came back in 15 seconds! I was literally blown away.

A fast database changes the game
When you have a very fast analytic databases it totally changes the game. You can ask more questions, ask more complex questions and ask them more often. Analytics requires a lot of trial and error and removing time spent waiting on the database enables a new spectrum of possibilities. For example, Netezza enabled me to reprice _every_ call in our database against _every_ one of our competitors tariffs (i.e. an 'explosive' operation: 50 mil records in => 800 mil records out) and then calculate the best *possible* price for each customer on any tariff. I used that information to benchmark my company on "value for money" and to understand the hidden drivers for customer churn.

ParStream appliance strategy:
So, given that background, let's look at the positioning of ParStream, the potential problems they may face, and the opportunities they need to pursue.

ParStream is not Netezza
I've positively compared ParStream to Netezza above so you might expect me to applaud ParStream for offering an appliance. Sadly not; Netezza's appliance success was due to unique factors that ParStream cannot replicate. Netezza had to use custom hardware because they use a custom FPGA chip. Customers were (and are) nervous about investing heavily in such hardware, however Netezza goes to great lengths to reassure them; providing service guarantees, plenty of spare parts and using commodity components wherever possible (power supplies, disks, host server, etc.). Also we must remember that most customers looking at Netezza were using very large servers (or server clusters) and required *very many* disks to get reasonable I/O performance for their databases. Netezza was actually reducing complexity for those customers.

The world has changed going into 2011
ParStream cannot replicate those market conditions. The world has changed considerably going into 2011 and different factors need to be emphasised. ParStream relies on Nvidia GPUs that are widely available and installed on commodity interconnects (e.g. PCIe). Moreover there are high quality server offerings available in 2 form factors that make the appliance strategy more of a liability than an asset. First, Nvidia (and others) sell 1U rack mounted 'server' that contain 4 GPUs and connect to 'host' server via a PCIe card. Second Supermicro (and others) sell 4U 'super' servers that contain 2 Intel Xeons and 4 GPUs in a pre-integrated package. The ParStream appliance may well be superior to these offerings in some key way however such advantages will be quickly wiped by out as the server manufactures continuously refresh their product line.

Focus on the database software business
ParStream should focus on the database software business where they have a huge advantage not the server business where they have huge disadvantages. You should read this article if you have any further doubts: "The Power of Commodity Hardware" (http://www.svadventure.com/svadventure/2009/01/the-power-of-commodity-hardware.html). Key quotes: "Customers love commodity hardware.", "Competing with HP, IBM, and Dell is dumb.", "Commodity hardware is much more capital efficient". Also consider the fates of Kickfire and Dataupia who floundered on a database appliance strategy, and ParAccel who is going strong after initially offering an appliance and quickly moving to emphasise software-only.

Position GPUs as a new commodity
ParStream must position GPUs and GPU acceleration as a new commodity. Explain that GPUs are an essential part of all serious supercomputers and the technology is being embraced by everyone; Intel with Larabee, AMD with Fusion, etc. Emphasise the option to add 'commodity' 4 GPU pizza boxes servers alongside a customer's existing Xeon/Opteron servers and, using ParStream, make huge performance gains. Talk to Dell customers about using a single Dell PowerEdge C410x GPU chasis (http://www.dell.com/us/en/enterprise/servers/poweredge-c410x/pd.aspx) to accelerate an entire rack of "standard" servers running ParStream. The message must be clear: ParStream runs on commodity hardware; you may not have purchased GPU hardware before but you can get exactly what ParStream needs from your preferred vendor.

One final point here; ParStream needs to make Windows support a priority. This is probably not going to be fun, technically speaking, but Windows support will be important for the markets that ParStream should target (which will have to be another post, sadly).

UPDATE - I followed this post up with:
An overview of the analytic database market, a simple segmentation of the main analytic database vendors, and a summary of the key opportunities I see in the analytic databases market (esp. for ParStream and RainStor)

Thursday, 9 December 2010

Comment regarding Infobright's performance problems

UPDATE: This is a classic case of the comments being better than the post; make sure you read them! In summary, Jeff explained better and a lightbulb went off for me: Infobright is for OLAP in the classical sense with the huge advantage of being managed with a SQL interface. Cool.

I made a comment over on Tom Barber's blog post about a Columnar DB benchmarking exercise: http://pentahomusings.blogspot.com/2010/12/my-very-dodgy-col-store-database.html

Jeff Kibler said...
Tom –

Thanks for diving in! As indicated in your results, I believe your tests cater well to databases designed for star-schemas and full table-scan queries. Because a few of the benchmarked databases are engineered specifically for table scans, I would anticipate their lower query execution time. However, in analytics, companies overwhelmingly use aggregates, especially in ad-hoc fashion. Plus, they often go much higher than 90 gigs.

That said, Infobright caters to the full fledged analytic. As needed by the standard ad-hoc analytic query, Infobright uses software intelligence to drastically reduce the required query I/O. With denormalization and a larger data set, Infobright will show its dominance.

Cheers,

Jeff
Infobright Community Manager
8 December 2010 17:04

Joe Harris said...
Tom,

Awesome work, this is the first benchmark I've seen for VectorWise and it does look very good. Although, I'm actually surprised how close InfiniDB and LucidDB are, based on all the VW hype.

NFS on Dell Equilogic though? I always cringe when I see a database living on a SAN. So much potential for trouble (and really, really slow I/O).

Jeff,

I have to say that your comment is off base. I'm glad that Infobright has a community manager who's speaking for them but this comment is *not* helping.

First, your statement that "in analytics, companies overwhelmingly use aggregates" is plain wrong. We use aggregates as a fallback when absolutely necessary. Aggregates are a maintenance nightmare and introduce a huge "average of an average" issue that is difficult to work around. I'm sure I remember reading some Infobright PR about removing the need for aggregate tables.

Second, you guys have a very real performance problem with certain types of queries that should be straightforward. Just looking at it prima facie it seems that Infobright starts to struggle as soon as we introduce multiple joins and string or range predicates. The irony of the poor Infobright performance is that your compression is so good that the data could *almost* fit in RAM.

What I'd like to see from Infobright is: 1) a recognition of the issue as being real. 2) An explanation of why Infobright is not as fast in these circumstances. 3) An explanation of how to rewrite the queries to get better performance (if possible). 4) A statement about how Infobright is going to address the issues and when.

I like Infobright; I like MySQL; I'm an open source fan; I want you to succeed. The Star Schema Benchmark is not going away, Infobright needs to have a better response to it.

Joe