Tuesday, 29 March 2011

Analytic Database Market Opportunities

In my first post in this series I gave an overview of ParStream and their product. In the second post I gave an overview of the Analytic Database Market from my perspective. In the third post I introduced a simple Analytic Database Market Segmentation.

In this post I will look at the gaps in this market and the new opportunities for ParStream and RainStor to introduce differentiated offerings. First, though, I'll address the positioning of Hadapt.

Hadapt Positioning
Hadapt have recently come out of stealth and will be offering a very fast 'adaptive' version of Hadoop. Hadapt is a reworked and commercialized version of Daniel Abadi's HadoopDB project. You can read Curt Monash's overview for more on that. Basically Hadapt provides a Hadoop compatible interface (buzz phrase alert) and uses standard SQL databases (currently Postgres or VectorWise) underneath instead of HDFS. The unique part of their offering is keeping track of node performance and adapting queries to make the best use of each node. The devil is in the details, of course, and a number of questions remain unanswered: How much of the Hadoop API will be mapped to the database? Will there be a big inflection in performance between logic that maps to the DB and logic that runs in Hadoop? Etc. At a high level Hadapt seems like a very smart play for cloud based Hadoop users. Amazon EC2 instances have notoriously inconsistent I/O performance and a product that works around that should find fertile ground.

RainStor's Current Positioning
RainStor, if you don't know, is a special archival database that features massive compression. They currently sell it as an OLDR solution (Online Data Retention) primarily aimed at companies that have large data volumes and stringent data retention requirements, e.g., anyone in Financial Services. They promise between 95% (20:1) and 97.5% (40:1) compression rates for data that remains fully query-able. Again Curt Monash has the best summary of their offering. I briefly met some RainStor guys a while back and I feel pretty confident that the product delivers what it promises. That said, I have never come across a RainStor client and I talk to lots of Teradata and Netezza types who would be their natural customers. So, though I have no direct knowledge of how they are doing, I suspect that it's been slow going to date and focusing on a different part of the market might be more productive.
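
For anyone who wants to check how compression ratios and percentage savings relate, here is a minimal Python sketch. The function names are made up for illustration and have nothing to do with RainStor's product; the only thing it relies on is that an N:1 ratio leaves 1/N of the original size. On that basis 20:1 is exactly 95% saved and 40:1 is 97.5%, which is often rounded up to 98%.

```python
def savings_from_ratio(ratio: float) -> float:
    """Fraction of space saved by a compression ratio of `ratio`:1."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 - 1.0 / ratio


def ratio_from_savings(savings: float) -> float:
    """Compression ratio (N in N:1) implied by a fractional saving."""
    if not 0 <= savings < 1:
        raise ValueError("savings must be in [0, 1)")
    return 1.0 / (1.0 - savings)


# The two ratios quoted in the text (hypothetical helper, real arithmetic).
for ratio in (20, 40):
    print(f"{ratio}:1 -> {savings_from_ratio(ratio):.1%} saved")
```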

Hyper Compressed Hadoop - RainStor Opportunity
I tweeted a while back that "RainStor needs a MapReduce story like yesterday". I still think that's right, although now I think they need a Hadoop compatible story. To me, RainStor and Hadoop/MapReduce seem like a great fit. Hadoop users value the ability to process large data volumes over simple speed. Sure, they're happy with speed when they can get it but Hadoop is about processing as much data as possible. RainStor massively compresses databases while keeping them online and fully query-able. If RainStor could bring that compression to Hadoop it would be incredibly valuable. Imagine a Hadoop cluster that's maxed out at 200TB of raw data, compressed using splittable LZO to 50TB and replicated on 200TB of disk. If RainStor (replacing HDFS) could compress that same data at 20:1, half their headline rate, that cluster could now scale out to roughly 1,000TB of raw data on the same disk. And many operations in Hadoop are constrained by disk I/O, so if RainStor can operate to some extent on compressed data the cluster might just run faster. Even if it runs slightly slower the potential cost savings are huge (insert your own Amazon EC2 calculation here where you take EC2+S3 spend and divide by 20).
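
The back-of-envelope sum in that example can be sketched in a few lines of Python. The function name is hypothetical, and the four-copy overhead is an assumption: it is simply what 50TB of compressed data occupying 200TB of disk implies, not a documented Hadoop or RainStor setting.

```python
def raw_capacity_tb(disk_tb: float, compression_ratio: float, copies: float) -> float:
    """Raw (uncompressed) data that fits on `disk_tb` of disk.

    compression_ratio: N for N:1 compression.
    copies: how many times each compressed block is stored, including
            replication and any working-space overhead.
    """
    return disk_tb / copies * compression_ratio


DISK_TB = 200   # physical disk in the example cluster
COPIES = 4      # assumption: 50TB compressed sits on 200TB of disk

lzo = raw_capacity_tb(DISK_TB, compression_ratio=4, copies=COPIES)
hyper = raw_capacity_tb(DISK_TB, compression_ratio=20, copies=COPIES)

print(f"LZO at 4:1   -> {lzo:,.0f}TB raw")
print(f"20:1 storage -> {hyper:,.0f}TB raw ({hyper / lzo:.0f}x the LZO figure)")
```

With the same number of copies, the gain is just the ratio of the two compression rates (20 / 4 = 5x). The result scales inversely with the copy count, so halving the copies would double the figure.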

ParStream Opportunities
I see 2 key opportunities for ParStream in the current market. They can co-exist, but may require significant re-engineering of the product. First a bit of background: I'm looking for 'Blue Ocean Strategies' where ParStream can create a temporary monopoly. I'm not considering selling into the 'MPP Upstart' segment because of the large number of current competitors. It's interesting to note, though, that this is where ParStream's current marketing is targeted.

Real-Time Analytic Hadoop - ParStream Opportunity 1
ParStream's first opportunity is to repurpose their technology into a Hadoop compatible offering. Specifically, a 'real-time analytic Hadoop' product that uses GPU acceleration to vastly speed up Hadoop processing and opens up the MapReduce concept for many different and untapped use cases. ParStream claim to have a unique index format and to mix workloads across CPUs and GPUs to minimise response times. It should be possible to use this technology to replace HDFS with their own data layer and indexing. They should also aim to greatly simplify data loading and cluster administration work. Finally, transparent SQL access would be a very handy feature for businesses that want to provide BI directly from their 'analytic Hadoop' infrastructure. In summary: Hadoop's coding flexibility, processing speeds that approach CEP, and Data Warehouse style SQL access for downstream apps.

Target customers: Algorithmic trading companies (as always…), large-scale online ad networks, M2M communications, IP-based Telcos, etc. Generally, businesses with large volumes of data and high inbound data rates who need to make semi-complex decisions quickly and who have a relatively small staff.

Single User Data Warehouses - ParStream Opportunity 2
ParStream's second opportunity is to market ParStream as a single user, desk side data warehouse for analytic professionals, specifically targeting GPU powered workstations (like this one: ~$12k => 4 GPUs [960 cores], 2 Quad core CPUs, 48GB RAM, 3.6TB of fast disk). This version of ParStream must run on Windows (preferably Win7 x64, but Win Server at a minimum). Many IT departments will balk at having a non-Windows workstation out in the office running on the standard LAN. However, they are very used to analysts requesting 'special' powerful hardware. That's why the desk side element is so critical: this strategy is designed to penetrate restrictive centralised IT regimes.

In my experience a handful of users place 90% of the complex query demand on any given data warehouse. They're typically statisticians and operational researchers doing hard boiled analysis and what-if modelling. Many very large businesses have separate SAS environments that this group alone uses, but that's a huge investment that many can't afford. Sophisticated analysts are a scarce and expensive resource and many companies can't fill the vacancies they have. A system that improves analyst productivity and ensures their time is well used will justify a significant premium. It also gives the business an excellent tool for retaining their most valuable 'quants'.

This opportunity avoids the challenges of selling a large scale GPU system into a business that has never purchased one before and avoids the red ocean approach of selling directly into the competitive MPP upstart segment. However, it will be difficult to talk directly to these users inside the larger corporation and, when you convince them they need ParStream, you still have to work up the chain of command to get purchase authority (not the normal direction). On the plus side, though, these users form a fairly tight community and they will market it themselves if it makes their jobs easier.

Target customers: Biotech/Bioscience start-up companies, university researchers, marketing departments or consultancies. Generally, if a business is running their data warehouse on Oracle or SQL Server, there will be an analytic professional who would give anything to have a very fast database all to themselves.

In my next post I will look at why Hadoop is getting so much press, whether the hype is warranted and, generally, the future shape of the thing we currently call the data warehouse.

Friday, 25 March 2011

Analytic Database Market Segmentation

In my first post in this series I gave an overview of ParStream and their product.

In this post I will briefly talk about how vendors are positioned and introduce a simple market segmentation.

Please note that this is not exhaustive; I've left off numerous vendors with perfectly respectable offerings that I didn't feel I could reasonably place.

The chart above gives you my view of the current Analytic Database market and how the various vendors are positioned. The X axis is a log scale going from small data sizes (<100GB) to very large data sizes (~500TB). I have removed the scale because it is based purely on my own impressions of common customer data sizes for each vendor, drawn from published case studies and anecdotal information.

The Y axis is a log scale of the number of employees that a vendor's customers have. Customer headcount is highly variable for a given vendor but nonetheless each vendor seems to find a natural home in businesses of a certain size. Finally, the size of each vendor's bubble represents the approximate $ cost per TB for their products (paid versions in the case of 'open core' vendors). Pricing information is notoriously difficult to come across so again this is very subjective, but I have first hand experience with a number of these so it's not a stab in the dark.

Market Segments
SMP Defenders: Established vendors with large bases operating on SMP platforms
Teradata+Aster: The top of the tree. Big companies, big data, big money.
MPP Upstarts: Appliance – maybe, Columnar – maybe, Parallelism – always.
Open Source Upstarts: Columnar databases, smaller businesses, free to start.
Hadoop+Hive: The standard bearer for MapReduce. Big Data, small staff.

SMP > MPP Inflection
Still with me? Good, let's look at the notable segments of the market. First, there is a clear inflection point between the big single server (SMP) databases and the multi-server parallelised (MPP) databases. This point moves forward a little every year but not enough to keep up with the rising tide of data. For many years Teradata owned the MPP approach and charged a handsome rent. In the previous decade a bevy of new competitors jumped into the space with lower pricing and now the SMP old guard are getting into MPP offerings, e.g., Oracle Exadata and Microsoft SQL Server PDW.

Teradata's Diminished Monopoly
Teradata have not lost their grip on the high end, however. They maintain a near monopoly on data warehouse implementations in the very largest companies with the largest volumes of 'traditional' DW data (customers & transactions). Even Netezza has failed to make a large dent in Teradata's customer base. Perhaps there are instances of Teradata being displaced by Netezza; however, I have never actually heard of one. There are 2 vendors who have a publicised history of being 'co-deployed' with Teradata: Greenplum and Aster Data. Greenplum's performance reputation is mixed and it was acquired last year by EMC. Aster's performance reputation is solid at very large scales and their SQL/MapReduce offering has earned them a lot of attention. It's no surprise that Teradata decided to acquire them.

The DBA Inflection Point
The other inflection point in this market happens when the database becomes complex enough to need full time babysitting, i.e., a Database Administrator. This gets a lot less attention than SMP>MPP because it's very difficult to prove. Nevertheless word gets around fairly quickly about the effort required to keep a given product humming along. It's no surprise that vendors of notoriously fiddly products sell them primarily to large enterprises where the cost of employing a whole collection of DBAs, expensive specimens that they are, is not an issue.

Small, Simple & Open(ish)
Smaller businesses, if they really need an analytic DB, choose products that have a reputation for being usable by technically inclined end users without a DBA. Recent columnar database vendors fall into this end of the spectrum, especially those that target the MySQL installed base. It's not that a DBA is completely unnecessary, simply that you can take a project a long way without one.

MapReduce: Reduced to Hadoop
Finally we have those customers in smaller businesses (or perhaps government or universities) who need to analyse truly vast quantities of data with the minimum amount of resource. In the past it was literally impossible for them to do this; they were forced to rely on gross simplifications. Now, though, we have the MapReduce concept of processing data in fairly simple steps, in parallel across numerous cheap machines. In many ways this is MPP minus the database, sacrificing the convenience of SQL and ACID reliability for pure scale. Hadoop has become the face of MapReduce and is effectively the SQL of MapReduce, creating a common API that alternative approaches can offer to minimise adoption barriers. 'Hadoop compatible' is the Big Data buzz phrase for 2011.
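
To make the 'fairly simple steps' concrete, here is a toy word count written in the MapReduce style. It is plain Python, not the Hadoop API: the function names are invented for illustration, and a thread pool stands in for the cluster of cheap machines. Treat it as a sketch of the idea rather than how Hadoop is actually implemented.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks: list[str], workers: int = 4) -> dict[str, int]:
    # Each chunk is mapped independently, so the work can be spread
    # across workers (threads here, separate machines in a real cluster).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


chunks = ["big data small staff", "big data big money"]
counts = word_count(chunks)
print(counts)
```

Note what is missing: no schema, no SQL, no transactions. Each step only sees its own slice of the data, which is what lets the approach scale out and also why you give up the conveniences of a database.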
