@joeharris76: January 2011

Friday 28 January 2011

HandlerSocket - More grist for the ORM mill

A plugin called HandlerSocket was released last year that allows InnoDB to be used directly, bypassing the MySQL parsing and optimising steps. The genius of HandlerSocket is that the data is still "in" MySQL so you can use the entire MySQL toolchain (monitoring, replication, etc.). You also have your data stored in a highly reliable database, as opposed to some of the horror stories I'm seeing about newer NoSQL products.

The original blog post ( here ) talks about 720,000 qps on an 8 core Xeon with 32GB RAM. Granted, this is all in-memory data we're talking about, but that is a hell of a figure. The author also claims it outperforms Memcached.

Next, Percona added HandlerSocket to their InnoDB fork back in December ( here ) so if you're looking for someone to talk to they may be the best people.

Finally, Ilya Grigorik (way-smart guy from PostRank) blogged about it a couple of weeks ago ( here ) and there's a fairly interesting discussion in the comments comparing this to prepared statements in Oracle.

All of this reinforces my opinion that new generation ORMs are the technology that will finally allow the RDBMS apple cart to tip all the way over. Products like Redis, Riak, CouchDB, etc. are not enough on their own.

The *really* interesting thing about HandlerSocket is that it shows open source databases are perfect fodder for the next wave.

Wednesday 19 January 2011

Analytic Database Market 'Fly Over'

This is a follow up to my previous post where I laid out my initial thoughts about ParStream. This is a very high level 'fly over' view of the analytic database market. I'll follow this up with some thoughts about how ParStream can position themselves in this market.

Powerhouse Vendors

The power players in the Analytic Database market are: Oracle (particularly Exadata), IBM (mostly Netezza, also DB2), and Teradata. Each of these vendors employs a large, very well funded and sophisticated sales force. A new vendor competing against them in accounts will find it very, very hard to win deals. They can easily put more people to work on a bid than a company like ParStream *employs*. If you are tendering for business in a Global 5000 corporation then you should expect to encounter them and you need a strategy for countering their access to the executive boards of these companies (which you will not get). In terms of technology their offerings have become very similar in recent years with all 3 emphasising MPP appliances of one kind or another, however most of the installed base are still using their traditional SMP offerings (Netezza and Teradata excepted).

New MPP niche players

There are a number of recent entrants to the market who also offer MPP technology, particularly: Greenplum, AsterData and ParAccel. All 3 offer software-only MPP databases, although Greenplum's emphasis has shifted slightly since being acquired by EMC. These vendors seem to focus mostly on (or succeed with) customers who have very large data volumes but are small companies in terms of employees. Many of these customers are in the web space. These vendors also have strong stories about supporting MapReduce/Hadoop inside their databases, which also plays to the leanings of web customers. According to testimonials on the vendors' websites, customers seem to choose them because they are very fast and software only.

Microsoft

Microsoft is a unique case. They do not employ a direct sales force (as far as I know) however they have steadily become a major force in enterprise software. Almost all companies run Windows desktops, have at least a few Windows servers and at least a few instances of SQL Server in production. Therefore Microsoft will be considered in virtually every selection process you're involved in. Microsoft have been steadily adding BI-DW features to the SQL Server product line and generally those features are all "free" with a SQL Server license. This doesn't necessarily make SQL Server cheaper but it does make it feel like very good value. Recent improvements include the Parallel Data Warehouse appliance (with HP hardware), columnar indexing for the next release and PowerPivot for local analysis of large data volumes.

Proprietary columnar

Columnar databases have been the hot technology in analytic databases for the last few years. The biggest vendors are Sybase with their very mature IQ product, SAND with an equally mature product and Vertica with their newer (and reportedly much faster) product. These databases can be used in single server (SMP / scale-up) and MPP (multi-server / scale-out) configurations. They appear to be most popular with customers who appreciate the high levels of compression that these databases offer and already have relatively mature star-schema / Kimball style data warehouses in place. In my experience Sybase and SAND are used most in companies where they were introduced by an OEM as part of another product. Vertica is so new that it's not clear who their 'natural' customers are yet.
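To make the compression point a bit more concrete, here is a minimal sketch of why a column layout compresses so well. It uses run-length encoding, which is one common columnar technique; I am not claiming any of the vendors named above work exactly this way, and the sample table and function names are made up for illustration.

```python
from itertools import groupby

# Hypothetical sample data: a tiny fact table with a low-cardinality
# 'country' column followed by an 'amount' column.
rows = [
    ("UK", 10), ("UK", 12), ("UK", 7),
    ("US", 3), ("US", 9),
    ("DE", 4), ("DE", 4), ("DE", 8), ("DE", 1),
]


def to_columns(rows):
    """Pivot row tuples into one list per column (the columnar layout)."""
    return [list(col) for col in zip(*rows)]


def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


country, amount = to_columns(rows)
encoded = rle_encode(country)  # 9 stored values collapse into 3 runs
```

Stored row by row, the repeated country values are interleaved with the amounts and there are no long runs to collapse. Stored as a column, they sit next to each other, which is what makes this kind of encoding effective.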

Open Source columnar

In the open source world there are 2 MySQL storage engines and a standalone product offering columnar databases. The MySQL engine Infobright was the first open source columnar database. It features very high compression and very fast loading however it is not suited for lots of joins and may be better thought of as an OLAP tool managed via SQL. The InfiniDB MySQL engine on the other hand is very good at joins and very good at squeezing all the available performance out of a server, however it does not have any compression currently. Finally there is LucidDB which is a Java based standalone product and has performance characteristics somewhere between the other two. LucidDB features excellent compression, index support and generally good performance but can be slow to load.

Vectorised columnar
There is only one player here: VectorWise. VectorWise is a columnar database (AFAIK) that has been architected from top to bottom to take advantage of the vector pipelines built into all recent CPUs. Vectorisation is a way of running many highly parallel operations through a single CPU. It basically removes all of the waiting and memory shifting that slows a CPU down. Initial testers have been very positive about the performance of VectorWise and had nothing but good things to say. There is also talk of an open source release so they are covering a lot of bases. They also have the advantage of being part of Ingres who may not be the force they once were but have a significant installed base and are well placed to sell VectorWise. They are the biggest direct competitor to ParStream that I can see right now.

Open Source MapReduce/NoSQL
ParStream will also compete with a new breed of open source MapReduce/NoSQL products, most notably Hadoop (and its variants). These products are not databases per se but they have gained a lot of mindshare among developers who need to work with large data volumes. Part of their attraction is their 'cloud friendliness'. They are perfect for the cloud because they have been designed to run on many small servers and to expect that a single server could fail at any time. There is a trade-off to be made and MapReduce products tend to be much more complex to query, however for a technically savvy audience the trade is well worth it.
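To illustrate the "more complex to query" trade-off, here is a rough single-process sketch of the map / shuffle / reduce pattern. This is not Hadoop's API; the function names and the sample log records are hypothetical, and a real job would spread each phase across many servers. The comment at the top shows what the same question would look like in SQL.

```python
from collections import defaultdict

# The SQL equivalent of this whole job would be roughly:
#   SELECT page, COUNT(*) FROM hits GROUP BY page;

# Hypothetical input: web log records as (user, page) pairs.
hits = [
    ("ann", "/home"), ("bob", "/home"), ("ann", "/buy"),
    ("cat", "/home"), ("bob", "/buy"),
]


def map_phase(record):
    """Emit one (key, value) pair per input record."""
    user, page = record
    yield (page, 1)


def shuffle(pairs):
    """Group all emitted values by key (the framework does this in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return (key, sum(values))


def run_job(records):
    """Run map, shuffle and reduce in sequence, in a single process."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


page_counts = run_job(hits)
```

One line of SQL becomes three functions and a driver. In exchange, each phase can be run independently on many small machines, and a failed piece can simply be re-run.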

Next time I'll talk about where I think ParStream need to place themselves to maximise their opportunity.

UPDATE: Actually, in the next post I talk about how analytic database vendors are positioned and introduce a simple market segmentation. A further post about market opportunities will follow.

My take on why businesses have problems with ETL tools

Check out this very nice piece by Rick about the reasons why companies have failed to get the most out of their ETL tools.

My take is from the other side of the fence. As a business user I'm often frustrated by ETL tools and have been known to campaign against them for the following reasons:

> ETL tools have been too focused on Extract-Transform-Load and too little focused on actual data integration. I have complex integration challenges that are not necessarily a good fit for the ETL strategy and sometimes I feel like I'm pushing a square peg into a round hole.

> It's still very challenging to generate reusable logic inside ETL tools and this really should be the easiest thing in the world (ever heard the mantra Don't Repeat Yourself!). Often the hoops that have to be jumped through are more trouble than they are worth.

> Some ETL tools are a hodge podge of technologies and approaches with different data types and different syntaxes wherever you look. (SSIS I'm looking at you! This still is not being addressed in Denali.)

> ETL tools are too focused on their own execution engines and fail miserably to take advantage of the processing power of columnar and MPP databases by running processes on the database. This is understandable in open source tools (database specific SQL may be a bridge too far) but in commercial tools it's pathetic.

> Finally, where is the ETL equivalent of SQL? Why are we stuck with incompatible formats for each tool? The design graphs in each tool look very similar and the data they capture is near identical. Even the open source projects have failed to utilise a common format. Very poor show. This is the single biggest obstacle to more widespread ETL. Right now it's much easier for other parts of the stack to stick with SQL and pretend that ETL doesn't exist.
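The point in the list above about running processes on the database can be sketched as follows. This is only an illustration under stated assumptions: SQLite (via Python's standard sqlite3 module) stands in for a columnar or MPP database, and the table and function names are hypothetical. The first function behaves the way an ETL tool's own execution engine does; the second pushes the same transformation down to the database as a single statement.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (an assumption
# made so the example is self-contained).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_sales (region TEXT, amount REAL);
    CREATE TABLE sales_by_region_pulled (region TEXT, total REAL);
    CREATE TABLE sales_by_region_pushed (region TEXT, total REAL);
    INSERT INTO staging_sales VALUES
        ('north', 10.0), ('north', 5.0),
        ('south', 7.5), ('south', 2.5),
        ('east', 1.0);
""")


def transform_in_tool(conn):
    """Pull every row out, aggregate in the tool's own engine, write results back."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM staging_sales"):
        totals[region] = totals.get(region, 0.0) + amount
    conn.executemany(
        "INSERT INTO sales_by_region_pulled VALUES (?, ?)", totals.items()
    )


def transform_in_database(conn):
    """Push-down: one statement, and the rows never leave the database."""
    conn.execute("""
        INSERT INTO sales_by_region_pushed
        SELECT region, SUM(amount) FROM staging_sales GROUP BY region
    """)


transform_in_tool(conn)
transform_in_database(conn)

pulled = sorted(conn.execute("SELECT region, total FROM sales_by_region_pulled"))
pushed = sorted(conn.execute("SELECT region, total FROM sales_by_region_pushed"))
```

Both routes produce the same table. The difference is that the first one moves every source row across the wire and back, while the second leaves the work to whatever parallelism the database already has.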

Wednesday 12 January 2011

Chinese Mother: Psychology is Modern Shamanism

A couple of days ago there was a widely linked article in the WSJ called "Chinese Mother" ( http://on.wsj.com/f3nh9d ). The basic premise of the article is that Western mothers are too soft and don't push their children enough and Chinese mothers are like a blacksmith's hammer cruelly pounding their children until they become brilliant swords of achievement (or something equally pathetic).

I'm not going to deal with the premise though; it's the subtext that I'm interested in. The subtext is: 'Western people develop psychological problems because their parents make them weak, self indulgent quitters.' I've seen lots of counterpoints whose subtext is something like 'Chinese parents turn their children into soulless robots who can only take orders'. The really interesting thing about both of these ideas is that they tacitly accept the current fashions of Western psychology as if they were scientifically proven facts. You may well expect that from Western responses but in the original piece she frames Chinese Mothers as the antidote to the 'problems' identified by Western psychological ideas.

I'm going to digress for a minute but if you do nothing else make sure you read "The Americanization of Mental Illness" in the New York Times ( http://nyti.ms/ggQKCG ).

Let me introduce an imaginary world in which the internal combustion engine evolved on its own and everyone in this world is given an engine when they're born, sort of like a puppy, and the engine has to develop and eventually reach a mature state. They keep the engine through their life and use it to assist with physical work. These engines are completely sealed (a la Honda) and cannot be opened or disassembled without destroying them. The engines accept a few limited inputs (petroleum products, coolant and accelerator signals). They output power and waste (heated coolant and exhaust). Virtually all work is done by the engines and no one can conceive of life without them.

People in this imaginary world are naturally very curious about engines but they basically know nothing about them. They cannot create an engine from first principles. They have invested huge efforts in studying engines but this 'study' basically amounts to looking at engines while they're working and measuring which parts get hot. The engines display remarkably diverse behaviour. They are very sensitive to the quality of the petroleum products the user provides. Some substitutes have been found to work but others will kill the engine. Scientists studying engines have found that chemicals can be added to the fuel to generate different performance characteristics. It's not known whether these additives have a long term impact on the engine. Many other variables (temperature, humidity, age, etc.) also subtly affect the engines.

Alongside the scientists, a separate field of engine philosophy has grown up. These people develop complex theories about engine performance and how it can be influenced. Their theories are never tested (it would be unethical to destroy an engine to test a theory). Regardless, engine philosophies are extremely popular and wield a huge influence over people's perception of how engines should be used to best effect. Finally there is a third group - the practical philosophers. They are engine philosophers who also study all of the components and inputs of engines. They are called upon to intervene when an engine is not performing as expected. They use various mechanical devices and chemical cocktails depending on which school of philosophy they belong to. No one knows if these 'treatments' actually work but many people 'feel' like they do and that seems to be good enough.

Back to reality, clearly my imaginary world is ridiculous. Right? They sound like cargo cult tribes making earphones out of coconuts and waiting for wartime planes to return. And what does this have to do with the 'Chinese Mother' nonsense anyway? Well the truth is that the engine people are us and this is how our culture deals with the brain.

How much do we know about the brain? Nothing. Seriously - NOTHING! The brain is, in many ways, the last great mystery of the natural world. I don't want to demean the good work that scientists are doing with fMRIs of the brain, but they are a long way from explaining the mechanics of the brain and do not deserve sensational headlines. If the path from superstitious farmers to an explanation of brain phenomena from first principles is a mile - we've gone about 100 feet. Into that vacuum of understanding we have pushed a huge volume of nonsense. The nonsense varies widely in quality from laughably stupid 'ghost in the machine' stuff to the very sophisticated but utterly meaningless 'mental illnesses' of modern psychology.

To understand our progress in brain science let's consider a steam engine in our imaginary world. People have been tinkering with the idea of steam power since ancient Greece. The first workable steam engine appeared in 1712. In a world of natural 'engines' such machines would seem rudimentary and laughable. Compared to the high powered and perfectly working natural engines they would be. Many people would doubt that 'evolved' engines could possibly work on the same principles. Perhaps they would gain acceptance because you could create new ones as needed. Given time, steam engines could become increasingly sophisticated and perhaps eventually reach (or even surpass) the effectiveness of natural engines. I'd like to think that this is where we are now in our understanding of the brain. Modern computers are the 'steam age' of brain science. Compared to the brain they are incredibly inflexible and crude. Yet we have found them to be immensely useful and they have clearly changed our world.

So, if our brain science is in the steam age, at least scientists are studying something real. If you lived in a pre-Enlightenment tribe/village/etc. someone in the tribe was designated as the shaman (or whatever you called it). They were essentially selected at random and if you were lucky they had some knowledge about various plants that could be used if someone displayed a certain symptom. They also had a fancy story to explain what they were doing and why it worked. Sometimes their stuff worked, sometimes it killed the patient but they basically knew nothing. The function of the shaman was to provide you with a reason to believe you would get better. That works surprisingly well a lot of the time; it's called the placebo effect.

The problem with psychology and psychiatry is that it's still like that. There's a huge psycho-pharma industry geared up to give you a reason to believe you should feel better and charge you handsomely for the privilege. They're basically modern shamans! There is no detailed explanation for the effect of SRI anti-depressants. They are stuffing the world's population full of chemicals whose effect cannot be adequately explained. The use of the terms 'mental health' and 'mental illness' is basically ridiculous. The modern psycho-pharma practitioner has no better basis to label some symptom a 'mental illness' than a shaman had to explain why a tribe member was sick. They fundamentally DO NOT KNOW, they're just guessing.

Now, you may be about to rebuke me with various double blind, statistically valid and incredibly sophisticated studies that have been done on psycho-pharma drugs and mental illnesses. Those things are great but what are they really measuring? They're measuring deeply subjective experiences and outcomes as reported by human beings. These experiences and outcomes are very strongly shaped by the culture and expectations of the participants. They do not study the actual physical effects of the compounds; it's ALL subjective. It may be sophisticated but it's not science. Good science is not subjective. Good science relies on verifiable and repeatable outcomes. Good science says 'we don't know' very clearly when that's the truth. No one in psycho-pharma ever says 'we don't know'.

It's kind of depressing, or maybe that's a meaningless term. All I can say is be very careful about anyone who tries to sell you an explanation for how the brain works and remember that the placebo effect is a powerful force.

As far as parenting and being a Chinese Mother, I don't have any advice for you but I can promise you that simplistic explanations for complex outcomes (like the success or happiness of your kids) are invariably wrong. I guess you'll just have to do what seems best to you; know that your culture will have a huge effect that you can't really control; and trust your kids will probably turn out a little weird and mostly OK. As far as I can tell most people do.

Tuesday 4 January 2011

2011 Preview: BI-DW Top 5

Here are the trends I expect to see in 2011, but beware my crystal ball is hazy and known to be biased.

Top 5 for 2011

5) Niche BI acquisitions take off
Big BI consolidation may well be finished, but I think 2011 will be the start of niche vendor acquisitions as established BI vendors seek new growth in a (hopefully) recovering economy. I don't expect any given deal size to be huge (probably sub $100m) however we could easily see half a dozen vendors being picked up.

The driver for such acquisitions should be clear; Big BI vendors have ageing product stacks and many have been through post-merger product integration pains. Their focus on innovation has been sorely lacking (non-existent?). Also, there is huge leverage in applying a niche product to an existing portfolio. The Business Objects / Xcelsius acquisition is a great example of this (although BO seems to think Xcelsius is a lot better and more useful than I do).

I will not make any predictions about who might be acquired. However, here are some examples of companies with offerings that are not available from Big BI vendors. Tableau's data visualisation offering is 1st class IMHO and is a perfect fit for the people who actually use BI products in practice. Lyza's BI/ETL collaboration offering is unique (and hard to describe) and a great fit for business oriented BI projects. Jedox' Palo offering brings unique power to Excel power users and appears to be the only rival to Microsoft's PowerPivot offerings; I suspect a stronger US sales force would help them immensely.

4) GPU based computing comes to the fore
I blogged some time ago about GPUs offering a glimpse of the many-core future. Since then I've been waiting (and waiting) for signs that GPUs were making the jump into business servers. Finally, in April 2010, Jedox released Palo OLAP Accelerator for GPUs. And this autumn I discovered ParStream's new GPU accelerated database (I blogged about it last week). Finally in December we saw the announcement of a new class of Amazon EC2 instance featuring a GPU as part of the package.

Based on these weak signals, I think 2011 will be the year that GPU processing and GPU acceleration start to become a widely accepted part of business computing. The most recent GPU cards from Nvidia and AMD offer many hundreds (512+) of processing cores and multiple cards can be used in a single server. There is a large class of business computing problems that could be addressed by GPUs: analytic calculations (e.g. SAS / R), anything related to MapReduce / Hadoop, anything related to enterprise search / e-discovery, anything related to stream processing / CEP, etc. As a final note I would strongly suggest that vendors who sell columnar databases or in-memory BI products (or are losing sales to such) should point their R&D team at GPUs and get something together quickly. Niche vendors have an opportunity to push the price/performance baseline up by an order of magnitude and take market share while Big BI vendors try to catch up.

3) Data Warehousing morphs into Data Intensive Computing

I once asked Netezza CTO Justin Lindsey if he considers Netezza machines to be supercomputers. He said no he didn't but that the scientific computing 'guys' call it a "Data Intensive Supercomputer" and use it in applications where the ratio of data to calculations is very high, i.e., the opposite of classical supercomputing applications. That phrase really stuck with me and it seems to describe the direction that data warehousing is headed.

If you've been around BI-DW for a while you'll be familiar with the Inmon v Kimball ideology war. That fight illustrates the idea that data warehouses had a well defined purpose simply because we could argue about the right way to do 'it'. I've noticed the purpose of the data warehouse stretching out over the last few years. The rise of analytics and ever increasing data volumes mean that more activities are finding a home on the data warehouse as a platform. Either the activity cannot be done elsewhere or the data warehouse is the most accessible platform for data driven projects with short term data processing needs.

In 2011 we need to borrow this term from the supercomputing guys and apply it to ourselves. We need to change our thinking from delivering and supporting a data warehouse to offering a Data Intensive Computing service (that enables a data warehouse). Those that fail to make the change should not be surprised when departments implement their own analytic database, make it available to the wider business and start competing with them for funding.

2) SharePoint destabilises incumbent BI platforms

SharePoint is not typically considered a BI product and is rarely mentioned when I talk to fellow BI people. Those who specialise in Microsoft's products occasionally mention the special challenges (read headaches) associated with supporting it but it's "just a portal". Right? Not quite. Microsoft has managed to drive a nuclear Trojan horse into the safety of incumbent BI installations. SharePoint contains extensive BI capabilities and enables BI capabilities in other Microsoft products (like, um, Excel!). Worst of all, if you're the incumbent BI vendor, SharePoint is everywhere! It has something like 75% market share overall and effectively 100% market share in big companies.

So what? Well, when you want to deploy a dashboard solution where is the natural home for such content? The intranet portal. When you need to collaborate on analysis with widely dispersed teams, what can you use that's better than email? Excel docs on the portal. If report bursting is filling up your inboxes like sand in an hourglass, where can you put reports instead? Maybe the intranet? You get the point. We have a history in BI of pushing yet another friggin' portal onto the business when we select our BI platform. Our chosen platform comes with such a nice portal, heck that's part of why we bought it. A year later we wonder why it doesn't get used. We wonder why we spend more time unlocking expired logins than answering questions about reports.

Right now businesses are only using a small fraction of SharePoint's capability. But they pay for all of it and I expect businesses to push for more return from SharePoint investments in 2011. I expect a lot of these initiatives to involve communicating business performance (BI) and collaborating on performance analysis (BI again). The trouble for incumbent vendors is clear: SharePoint has no substitute; your BI suite has direct substitutes; Microsoft offers some substitutes for free; your BI content is going to end up on SharePoint; once it's there it's SharePoint content. BI vendors should expect hard conversations about maintenance fees and upgrade cycles in any account where dashboards are being hosted on SharePoint.

As a final note, I would suggest that vendors who sell to large customers need to have a compelling SharePoint story. It's basically a case of "if you can't beat them, join them". If you have a portal as part of your suite you need to integrate with SharePoint (yesterday). You need to make your products work better with SharePoint than Microsoft's own products do. This will be a huge, expensive PITA - do it anyway. You must find a way to embrace SharePoint without letting it own you. Good luck.

1) BI starts to dissolve into other systems

My final trend for 2011 is about BI becoming bifurcated (love that word) between the strategic stuff (dashboards and analysis) and everything else. That "everything else" doesn't naturally live on a portal or in a report that gets emailed out. It belongs in the system that generates the data in the first place; it belongs right at the point of interaction. James Taylor and Neil Raden talked about this idea in the book "Smart Enough Systems". I won't repeat their arguments here but I will outline some of the reasons why I think it's happening now.

First, 'greenfield' BI sites are a thing of the past. Everyone now has BI, it may not work very well but they have it. New companies use BI from day 1. The market is effectively saturated. Second, most of the Big BI vendors are now part of large companies that sell line of business systems. There is a natural concern about diluting the value of the BI suite, however "BI for the masses" is a dead-end and I think they probably get that. Third, deep integration is one of the last remaining levers that Big BI vendors can use against nimble niche vendors and against SharePoint. They will essentially have to go down this route at some point. Finally, many system vendors have reached an impasse with their customers regarding upgrades. Customers are simply refusing to upgrade systems that work perfectly well. These vendors must create a real, tangible reason for the customers to move. I suspect that deep BI integration is their best bet.

I have had too many conversations about 'completing the circle' and feeding the results of analysis back into source systems. Sadly it never happens in practice; the walls are just too high. Once the data has left the source system it is considered tainted and pushing tainted data into production systems is never taken lightly. Thus the ultimate answer seems to be to push the "smarts" that have been generated by analysis down into the source system instead. Expect to see plenty of marketing talk in 2011 about systems getting 'smarter' and more integrated.