Friday 8 April 2011

Data Management - Data is Data is Data is…

[Sometimes I want to write about one topic but I end up writing 1,500 words of background before I even touch on the subject at hand. Sometimes the background turns out to be more interesting; hopefully this is one of those times.]

In this post I talk about the problems with mainstream data management, especially SQL databases. I then touch on the advantages of SQL databases and the good attributes we need to retain.

Data is Data is Data is…
Current IT practice splits data management into lots of niches: SQL databases, email platforms, network file systems, enterprise search, etc. There is plenty of overlap between niches and, in truth, the separations are artificial. They merely reflect the way systems are implemented, not fundamental differences in the data. Have a look at your email client; see those headers in the messages list (From, Subject, etc.)? They're just database field names, and the message body is simply a BLOB field. Some email clients, e.g., Gmail, can also parse that blob and find links to previous messages, which is very much like a foreign key link.
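To make that concrete, here's a toy sketch (a hypothetical schema, not any particular mail system's) of email treated as plain old rows in a table; the message list in your client is just a query over these fields.

```python
import sqlite3

# Hypothetical schema: message headers become typed columns, the body is a BLOB,
# and a self-referencing "in_reply_to" behaves like a foreign key between messages.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        id          INTEGER PRIMARY KEY,
        sender      TEXT,
        subject     TEXT,
        sent_at     TEXT,
        in_reply_to INTEGER REFERENCES messages(id),
        body        BLOB
    )
""")
conn.execute(
    "INSERT INTO messages (sender, subject, sent_at, in_reply_to, body) VALUES (?, ?, ?, ?, ?)",
    ("jane@example.com", "Re: Q2 numbers", "2011-04-08T09:15:00", None, b"Looks fine to me."),
)

# The 'messages list' in a mail client is just this query with nicer fonts.
for row in conn.execute("SELECT sender, subject, sent_at FROM messages ORDER BY sent_at DESC"):
    print(row)
```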

File systems seem less like a database at first glance, but let's consider the big file system developments of the last 10 years: ZFS and BTRFS. Both of these introduce database-like ideas to the file system, such as copy-on-write (a la MVCC), deduplication (a la normalisation), data integrity guarantees (a la ACID) and enhanced file metadata (a la SQL DDL).

The basic point I'm making is that data is data. Simple as that. It may be more or less 'structured' but structure and meaning are essentially equivalent. The most 'unstructured' file I can imagine is just plain text but the written word is still very structured. At a high level it has a lot of metadata (name, created, changed, size, etc.), it has structure embedded in the text itself (language, punctuation, words used, etc.) and, looking deeper, we can analyse the semantic content of the text using techniques like NLP.

Data is data; it needs to be stored, changed, versioned, retrieved, backed up, restored, searched, indexed, etc. The methods may vary but it's all just data.


The SQL Database Black Box
Not all data can be kept in SQL databases because, amongst other things, they are opaque to other applications. Enterprise search illustrates the issue. Most enterprise search apps can look into JDBC/ODBC-accessible databases, profile the data and include its content in search results. However, access to any given database is typically highly restricted and there is a DBA whose job hangs on keeping that data safe and secure. The DBA must be convinced that the search system will not compromise the security of his data, and this typically means limiting search access to the people who also have database access. This is a time-consuming process and we have to repeat it for every database in the company.

So a year later, when we have access to all SQL databases and a process to mirror access credentials, the next problem is that SQL provides no mechanism to trace data history. For example, I search for 'John Doe' and find a result from the CRM database. I look in the database and the record now has a name of 'Jane Doe'. Why did it change? When did it change? Who changed it? There is no baseline answer to these questions. The CRM application may record some of this information but how much? The database has internal mechanisms that trace some of this but each product has its own scheme and, worse, the tables are often not user accessible for security reasons.
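To see what a workaround looks like, here's a minimal sketch of the usual application-level fix: a hand-rolled history table fed by a trigger (SQLite used purely for illustration). Every product and every in-house team builds some variation of this, which is exactly why there's no baseline answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

    -- A hand-rolled history table: who made the change is only captured
    -- if the application remembers to record it somewhere.
    CREATE TABLE customer_history (
        id INTEGER, old_name TEXT, new_name TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TRIGGER customer_audit AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_history (id, old_name, new_name)
        VALUES (OLD.id, OLD.name, NEW.name);
    END;
""")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'John Doe')")
conn.execute("UPDATE customer SET name = 'Jane Doe' WHERE id = 1")

print(conn.execute("SELECT * FROM customer_history").fetchall())
# e.g. [(1, 'John Doe', 'Jane Doe', '2011-04-08 10:30:00')]
```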

In my experience, 80% of the value actually gained from a data warehouse comes from resolving this issue in a single place and in a consistent way. Hence the growth of the MDM industry, but I won't digress on that. The data warehouse doesn't actually solve the problem; it merely limits the number of SQL databases that must be queried to one. And, of course, we never manage to get everything into the DW.

There are many other black box attributes of SQL databases: two very similar queries may perform in drastically different ways; background tasks can make the database extremely slow without warning; the database disk format cannot be accessed by other applications; the database may bypass the filesystem, leaving us entirely reliant on the database to detect disk errors; etc., etc.


The SQL Database Choke Point
Current SQL databases are also a very real constraint on day-to-day operation. For example, a large company may only be able to process bulk updates against a few percent of the customer base each night. SQL databases must be highly tuned towards high performance for a single type of access query, and that tuning usually makes other access styles unworkable.

Further, the schema of a production SQL database is effectively set in stone. Although SQL provides ALTER statements, the performance and risk of using ALTER are so bad that it's never used. Instead we either add a new small table and use a join when we need the additional data, or we create a new table and export the existing data into it. Both of these operations impose significant overheads when all we really want is a new field. So, in practice, production SQL databases satisfy a single type of access, are very resistant to other access patterns and are very difficult to change.
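Here's the 'small side table plus a join' workaround in miniature (SQLite standing in for the production database): all we wanted was one extra column, and instead every query that touches it now pays for a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The untouchable production table.
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customer VALUES (1, 'Jane Doe');

    -- All we wanted was a 'loyalty_tier' column; instead we bolt on a side table.
    CREATE TABLE customer_ext (customer_id INTEGER PRIMARY KEY, loyalty_tier TEXT);
    INSERT INTO customer_ext VALUES (1, 'gold');
""")

# Every query that needs the new field now carries this join.
row = conn.execute("""
    SELECT c.name, e.loyalty_tier
    FROM customer c LEFT JOIN customer_ext e ON e.customer_id = c.id
""").fetchone()
print(row)  # ('Jane Doe', 'gold')
```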

These issues are well recognised and the answer has come back that we need specialist SQL databases for each use case. Michael Stonebraker, in particular, has been beating a drum about this for at least 5 years (and, credit where it's due, Vertica paid off in spades). However, we haven't seen a huge uptake in specialist databases for markets other than analytics. In particular the mainstream OLTP market has very few specialist offerings. Perhaps it's a more difficult problem or perhaps the structure of SQL itself is less amenable to secondary innovation around OLTP. I sense a growing recognition that improvements in the OLTP space require significant re-engineering of existing applications.

Specialist databases have succeeded to some extent in the data warehouse and business intelligence sphere. I think this exception proves the observation. 15 years ago I would add another complaint to my black box attributes: it was impossible to get reports and analysis from my production systems. The data warehouse was invented and gained popular acceptance simply because this was such a serious problem. The great thing about selling analytic databases for the last 15 years was that you weren't displacing a production system. Businesses don't immediately start losing money if the DW goes down. The same cannot be said of most other uses for SQL databases and that's why they will only be replaced slowly and only when there is a compelling reason (mainframes are still around, right?).


There's a baby in this bathwater!
It's worth remembering that SQL databases offer a lot of advantages. Codd outlined 12 rules that relational databases should follow. I won't list them all here but, at a high level, a relational database maintains the absolute integrity of the data it stores and allows us to place constraints on that data, such as the type and length of the data or its relation to other data. We take it for granted now but this was a real breakthrough and it took years to implement in practice.

Just for kicks, imagine a CRM system based on Word docs. When you want to update a customer's information you open their file, make whatever changes you want and then save it. The system only checks that the doc exists; you can change whatever you want and the system won't care. If you want the system to make sure you only change the right things you'll have to build that function yourself. That's more or less what data management was like before SQL databases.


What to keep & what to throw away
So what would our ideal data management platform look like? It persists data in a format that can be freely parsed by other applications, i.e., plain text (XML? JSON? Protocol Buffers?). It maintains data integrity at an atomic level, probably by storing checksums alongside each item. It lets us define stored data as strictly or loosely as we want, but it enforces the definitions we set. All changes to our stored data actually create new versions and the system keeps a linked history of changes.
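As a rough illustration of those last three properties, here's a toy content-addressed store in a few lines of Python: items are stored as plain JSON, each version carries a checksum, and updates append a new version linked to its predecessor. It's a sketch of the shape of the thing, not a design.

```python
import hashlib
import json
import time

store = {}  # checksum -> serialised item; an append-only log, never overwritten

def put(data, parent=None):
    """Store a new version of an item; updates append rather than overwrite."""
    item = {"data": data, "parent": parent, "written_at": time.time()}
    blob = json.dumps(item, sort_keys=True).encode()       # open, parseable format
    checksum = hashlib.sha256(blob).hexdigest()             # integrity at the atomic level
    store[checksum] = blob
    return checksum

def history(checksum):
    """Walk the linked history of changes back to the first version."""
    while checksum is not None:
        item = json.loads(store[checksum])
        yield item["data"]
        checksum = item["parent"]

v1 = put({"name": "John Doe"})
v2 = put({"name": "Jane Doe"}, parent=v1)
print(list(history(v2)))  # [{'name': 'Jane Doe'}, {'name': 'John Doe'}]
```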

I think we're starting to see systems emerge that address some of the issues above. It's still early days but I'm excited about projects like Ceph and the very new Acunu.


In my next post I'll look at how the new breed of NoSQL databases display some of the traits we need for our ideal data management platform.

Wednesday 6 April 2011

Unsolicited advice for MarkLogic - Pivot!

[This is actually a really long comment on Curt Monash's post but I think it's worth cross posting here.] 


Seeing as I have been doing a lot of thinking about database opportunities lately I'll wade in on MarkLogic's as well. I can't really comment about the specific verticals that MarkLogic sells into or should sell into. However, I see 2 fundamental problems with MarkLogic's product positioning.

First problem; they backed the wrong horse by focusing exclusively on XML and XQuery. This has been toned down a lot but the die is cast. People who know about MarkLogic (not many) know of it as a 'really expensive XML database that you need if you have lots of XML and eXist-db is too slow for you'. They've put themselves into a niche within a niche, kind of like a talkative version of Ab Initio.

This problem is obvious if you compare them to the 'document oriented' NoSQLs such as CouchDB and MongoDB. Admittedly they were created long after MarkLogic but the NoSQLs offer far greater flexibility, talk about XML only as a problem to be dealt with and use a storage format that the market finds more appealing (JSON).

Second problem; 'Enterprise class' pricing is past its sell by date. What does MarkLogic actually cost? You won't find any pricing on the website. I presume that the answer is that old standby 'whatever you're looking to spend'. Again, the contrast with the new NoSQLs couldn't be more stark - they're all either pure open source or open core, e.g., free to start.

MarkLogic was essentially an accumulator bet: 1st bet - XML will flood the enterprise, 2nd bet - organisations will want to persist XML as XML, 3rd bet - an early, high quality XML product will move into an Oracle-like position.

The first bet was a win; XML certainly has flooded the enterprise. The second bet was a loss; XML has become almost a wire protocol rather than a persistence format. Rightly or not, very few organisations choose to persist significant volumes of data in XML. And the third bet was a loss as well; the huge growth of open source and the open core model make it extremely unlikely that we'll see another Oracle in the data persistence market.

The new MarkLogic CEO needs to acknowledge that the founding premise of the company has failed and they must pivot the product to find a much larger addressable market. Their underlying technology is probably very good and could certainly be put to use in other ways (Curt gives some examples). I would be tempted to split the company in 2; leaving a small company to continue selling and supporting MarkLogic at maximum margins (making them an acquisition target) and a new company to build a new product in start-up mode on the foundations of the existing tech.

Tuesday 29 March 2011

Analytic Database Market Opportunities

In my first post in this series I gave an overview of ParStream and their product. In the second post I gave an overview of the Analytic Database Market from my perspective. In the third post I introduced a simple Analytic Database Market Segmentation.

In this post I will look at the gaps in this market and the new opportunities for ParStream and RainStor to introduce differentiated offerings. First, though, I'll address the positioning of Hadapt.


Hadapt Positioning
Hadapt have recently come out of stealth and will be offering a very fast 'adaptive' version of Hadoop. Hadapt is a reworked and commercialized version of Daniel Abadi's HadoopDB project. You can read Curt Monash's overview for more on that. Basically Hadapt provides a Hadoop compatible interface (buzz phrase alert) and uses standard SQL databases (currently Postgres or VectorWise) underneath instead of HDFS. The unique part of their offering is keeping track of node performance and adapting queries to make the best use of each node. The devil is in the details of course, but a number of questions remain unanswered: How much of the Hadoop API will be mapped to the database? Will there be a big inflection in performance between logic that maps to the DB and logic that runs in Hadoop? Etc. At a high level Hadapt seems like a very smart play for cloud based Hadoop users. Amazon EC2 instances have notoriously inconsistent I/O performance and a product that works around that should find fertile ground.

RainStor's Current Positioning
RainStor, if you don't know, is a special archival database that features massive compression. They currently sell it as an OLDR solution (Online Data Retrieval) primarily aimed at companies that have large data volumes and stringent data retention requirements, e.g., anyone in Financial Services. They promise between 95% (20:1) and 98% (40:1) compression rates for data whilst remaining fully query-able. Again, Curt Monash has the best summary of their offering. I briefly met some RainStor guys a while back and I feel pretty confident that the product delivers what it promises. That said, I have never come across a RainStor client and I talk to lots of Teradata and Netezza types who would be their natural customers. So, though I have no direct knowledge of how they are doing, I suspect that it's been slow going to date and that focusing on a different part of the market might be more productive.

Hyper Compressed Hadoop - RainStor Opportunity
I tweeted a while back that "RainStor needs a MapReduce story like yesterday". I still think that's right although now I think they need a Hadoop compatible story. To me, RainStor and Hadoop/MapReduce seem like a great fit. Hadoop users value the ability to process large data volumes over simple speed. Sure, they're happy with speed when they can get it but Hadoop is about processing as much data as possible. RainStor massively compresses databases while keeping them online and fully query-able. If RainStor could bring that compression to Hadoop it would be incredibly valuable. Imagine a Hadoop cluster that's maxed out at 200TB of raw data, compressed using splittable LZO to 50TB and replicated on 200TB of disk. If RainStor (replacing HDFS) could compress that same data at 20:1, half their headline rate, that cluster can now scale out to roughly 2,000TB. And many operations in Hadoop are constrained by disk I/O so if RainStor can operate to some extent on compressed data the cluster might just run faster. Even if it runs slightly slower the potential cost savings are huge (insert your own Amazon EC2 calculation here where you take EC2+S3 spend and divide by 20).

ParStream Opportunities
I see 2 key opportunities for ParStream in the current market. They can co-exist but may require significant re-engineering of the product. First, a bit of background: I'm looking for 'Blue Ocean Strategies' where ParStream can create a temporary monopoly. Selling into the 'MPP Upstart' segment is not considered due to the large number of current competitors. It's interesting to note, though, that that is where ParStream's current marketing is targeted.

Real-Time Analytic Hadoop - ParStream Opportunity 1
ParStream's first opportunity is to repurpose their technology into a Hadoop compatible offering. Specifically, a 'real-time analytic Hadoop' product that uses GPU acceleration to vastly speed up Hadoop processing and opens up the MapReduce concept for many different and untapped use cases. ParStream claim to have a unique index format and to mix workloads across CPUs and GPUs to minimise response times. It should be possible to use this technology to replace HDFS with their own data layer and indexing. They should also aim to greatly simplify data loading and cluster administration work. Finally, transparent SQL access would be a very handy feature for businesses that want to provide BI directly from their 'analytic Hadoop' infrastructure. In summary: Hadoop's coding flexibility, processing speeds that approach CEP, and Data Warehouse style SQL access for downstream apps.

Target customers: Algorithmic trading companies (as always…), large-scale online ad networks, M2M communications, IP-based telcos, etc. Generally, businesses with large volumes of data and high inbound data rates who need to make semi-complex decisions quickly and who have a relatively small staff.

Single User Data Warehouses - ParStream Opportunity 2
ParStream's second opportunity is to market ParStream as a single user, desk side data warehouse for analytic professionals, specifically targeting GPU powered workstations (like this one: ~$12k => 4 GPUs [960 cores], 2 quad core CPUs, 48GB RAM, 3.6TB of fast disk). This version of ParStream must run on Windows (preferably Win7 x64, but Win Server at a minimum). Many IT departments will balk at having a non-Windows workstation out in the office running on the standard LAN. However, they are very used to analysts requesting 'special' powerful hardware. That's why the desk side element is so critical; this strategy is designed to penetrate restrictive centralised IT regimes.

In my experience a handful of users place 90% of the complex query demand on any given data warehouse. They're typically statisticians and operational researchers doing hard boiled analysis and what-if modelling. Many very large businesses have separate SAS environments that this group alone uses, but that's a huge investment that many can't afford. Sophisticated analysts are a scarce and expensive resource and many companies can't fill the vacancies they have. A system that improves analyst productivity and ensures their time is well used will justify a significant premium. It also gives the business an excellent tool to retain their most valuable 'quants'.

This opportunity avoids the challenges of selling a large scale GPU system into a business that has never purchased one before and avoids the red ocean approach of selling directly into the competitive MPP upstart segment. However, it will be difficult to talk directly to these users inside the larger corporation and, when you convince them they need ParStream, you still have to work up the chain of command to get purchase authority (not the normal direction). On the plus side, these users form a fairly tight community and they will market it themselves if it makes their jobs easier.

Target customers: Biotech/Bioscience start-up companies, university researchers, marketing departments or consultancies. Generally, if a business is running their data warehouse on Oracle or SQL Server, there will be an analytic professional who would give anything to have a very fast database all to themselves.

In my next post I will look at why Hadoop is getting so much press, whether the hype is warranted and, generally, the future shape of the thing we currently call the data warehouse.

Friday 25 March 2011

Analytic Database Market Segmentation

In my first post in this series I gave an overview of ParStream and their product.


In this post I will briefly talk about how vendors are positioned and introduce a simple market segmentation.

Please note that this is not exhaustive, I've left off numerous vendors with perfectly respectable offerings that I didn't feel I could reasonably place.

The chart above gives you my view of the current Analytic Database market and how the various vendors are positioned. The X axis is a log scale going from small data sizes (<100GB) to very large data sizes (~500TB). I have removed the scale because it is based purely on my own impressions of common customer data sizes for each vendor, drawn from published case studies and anecdotal information.

The Y axis is a log scale of the number of employees that a vendor's customers have. Employee size is highly variable for a given vendor but nonetheless each vendor seems to find a natural home in businesses of a certain size. Finally the size of each vendor's bubble represents the approximate $ cost per TB for their products (paid versions in the case of 'open core' vendors). Pricing information is notoriously difficult to come across so again this is very subjective but I have first hand experience with a number of these so it's not a stab in the dark.



Market Segments
SMP Defenders: Established vendors with large bases operating on SMP platforms
Teradata+Aster: The top of the tree. Big companies, big data, big money.
MPP Upstarts: Appliance – maybe, Columnar – maybe, Parallelism – always.
Open Source Upstarts: Columnar databases, smaller businesses, free to start.
Hadoop+Hive: The standard bearer for MapReduce. Big Data, small staff.

SMP > MPP Inflection
Still with me? Good, let's look at the notable segments of the market. First, there is a clear inflection point between the big single server (SMP) databases and the multi-server parallelised (MPP) databases. This point moves forward a little every year but not enough to keep up with the rising tide of data. For many years Teradata owned the MPP approach and charged a handsome rent. In the previous decade a bevy of new competitors jumped into the space with lower pricing, and now the SMP old guard are getting into MPP offerings, e.g., Oracle Exadata and Microsoft SQL Server PDW.

Teradata's Diminished Monopoly
Teradata have not lost their grip on the high end, however. They maintain a near monopoly on data warehouse implementations in the very largest companies with the largest volumes of 'traditional' DW data (customers & transactions). Even Netezza has failed to make a large dent in Teradata's customer base. Perhaps there are instances of Teradata being displaced by Netezza; however, I have never actually heard of one. There are 2 vendors who have a publicised history of being 'co-deployed' with Teradata: Greenplum and Aster Data. Greenplum's performance reputation is mixed and it was acquired last year by EMC. Aster's performance reputation is solid at very large scales and their SQL/MapReduce offering has earned them a lot of attention. It's no surprise that Teradata decided to acquire them.

The DBA Inflection Point
The other inflection point in this market happens when the database becomes complex enough to need full time babysitting, e.g., a Database Administrator. This gets a lot less attention than SMP>MPP because it's very difficult to prove. Nevertheless, word gets around fairly quickly about the effort required to keep a given product humming along. It's no surprise that vendors of notoriously fiddly products sell them primarily to large enterprises, where the cost of employing such expensive specimens as a collection of DBAs is not an issue.

Small, Simple & Open(ish)
Smaller businesses, if they really need an analytic DB, choose products that have a reputation for being usable by technically inclined end users without a DBA. Recent columnar database vendors fall into this end of the spectrum, especially those that target the MySQL installed base. It's not that a DBA is completely unnecessary, simply that you can take a project a long way without one.

MapReduce: Reduced to Hadoop
Finally we have those customers in smaller businesses (or perhaps government or universities) who need to analyse truly vast quantities of data with the minimum amount of resource. In the past it was literally impossible for them to do this; they were forced to rely on gross simplifications. Now, though, we have the MapReduce concept of processing data in fairly simple steps, in parallel across numerous cheap machines. In many ways this is MPP minus the database, sacrificing the convenience of SQL and ACID reliability for pure scale. Hadoop has become the face of MapReduce and is effectively the SQL of MapReduce, creating a common API that alternative approaches can offer to minimise adoption barriers. 'Hadoop compatible' is the Big Data buzz phrase for 2011.
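For anyone who hasn't met it yet, the MapReduce concept itself fits in a few lines. The toy below runs the map, shuffle and reduce steps on one machine in plain Python; Hadoop's value is running this same shape of program across thousands of cheap disks and carrying on when some of them fail.

```python
from collections import defaultdict

documents = ["data is data is data", "all data is just data"]

# Map: each document is processed independently and emits (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key (Hadoop does this step across the network).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: collapse each group down to a result.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals)  # {'data': 5, 'is': 3, 'all': 1, 'just': 1}
```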

Friday 28 January 2011

HandlerSocket - More grist for the ORM mill

A plugin called HandlerSocket was released last year that allows InnoDB to be used directly, bypassing the MySQL parsing and optimising steps. The genius of HandlerSocket is that the data is still "in" MySQL so you can use the entire MySQL toolchain (monitoring, replication, etc.). You also have your data stored in a highly reliable database, as opposed to some of the horror stories I'm seeing about newer NoSQL products.
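To give a flavour of what bypassing the parser looks like, here's a rough sketch of a primary-key read over HandlerSocket's plain-text, tab-separated protocol. The port number, table and exact field layout here are assumptions based on the original write-up; check the HandlerSocket docs before relying on any of it.

```python
import socket

# Assumed setup: MySQL with the HandlerSocket plugin on its read-only port
# (9998 in the original write-up) and a table test.user(user_id INT PRIMARY KEY, name VARCHAR).
sock = socket.create_connection(("127.0.0.1", 9998))
reader = sock.makefile("r")

def request(line):
    # Requests are tab-separated fields terminated by a newline; so are responses.
    sock.sendall(line.encode() + b"\n")
    return reader.readline().rstrip("\n").split("\t")

# Open the primary key of test.user as handler #1, exposing two columns,
# then fetch the row whose key equals 42. No SQL is ever parsed or optimised;
# the request goes more or less straight to the InnoDB handler API.
print(request("P\t1\ttest\tuser\tPRIMARY\tuser_id,name"))
print(request("1\t=\t1\t42"))
```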

In the original blog post (here) it talks about 720,000 qps on an 8 core Xeon with 32GB RAM. Granted, this is all in-memory data we're talking about, but that is a hell of a figure. He also claims it outperforms Memcached.

Next, Percona added HandlerSocket to their InnoDB fork back in December (here) so if you're looking for someone to talk to they may be the best people.

Finally, Ilya Grigorik (way-smart guy from PostRank) blogged about it a couple of weeks ago (here) and there's a fairly interesting discussion in the comments comparing this to prepared statements in Oracle.

All of this reinforces my opinion that new generation ORMs are the technology that will finally allow the RDBMS apple cart to tip all the way over. Products like Redis, Riak, CouchDB, etc. are not enough on their own.

The *really* interesting thing about HandlerSocket is that it shows open source databases are perfect fodder for the next wave.

Wednesday 19 January 2011

Analytic Database Market 'Fly Over'

This is a follow up to my previous post where I laid out my initial thoughts about ParStream. This is a very high level 'fly over' view of the analytic database market. I'll follow this up with some thoughts about how ParStream can position themselves in this market.


Powerhouse Vendors
The power players in the Analytic Database market are: Oracle (particularly Exadata), IBM (mostly Netezza, also DB2), and Teradata. Each of these vendors employs a large, very well funded and sophisticated sales force. A new vendor competing against them in accounts will find it very, very hard to win deals. They can easily put more people to work on a bid than a company like ParStream *employs*. If you are tendering for business in a Global 5000 corporation then you should expect to encounter them and you need a strategy for countering their access to the executive boards of these companies (which you will not get). In terms of technology their offerings have become very similar in recent years with all 3 emphasising MPP appliances of one kind or another, however most of the installed base are still using their traditional SMP offerings (Netezza and Teradata excepted).


New MPP niche players
There are a number of recent entrants to the market who also offer MPP technology, particularly Greenplum, Aster Data and ParAccel. All 3 offer software-only MPP databases, although Greenplum's emphasis has shifted slightly since being acquired by EMC. These vendors seem to focus mostly on (or succeed with) customers who have very large data volumes but are small companies in terms of employees. Many of these customers are in the web space. These vendors also have strong stories about supporting MapReduce/Hadoop inside their databases, which also plays to the leanings of web customers. According to testimonials on the vendors' websites, customers seem to choose them because they are very fast and software only.


Microsoft
Microsoft is a unique case. They do not employ a direct sales force (as far as I know), however they have steadily become a major force in enterprise software. Almost all companies run Windows desktops, have at least a few Windows servers and at least a few instances of SQL Server in production. Therefore Microsoft will be considered in virtually every selection process you're involved in. Microsoft have been steadily adding BI-DW features to the SQL Server product line and generally those features are all "free" with a SQL Server license. This doesn't necessarily make SQL Server cheaper but it does make it feel like very good value. Recent improvements include the Parallel Data Warehouse appliance (with HP hardware), columnar indexing for the next release and PowerPivot for local analysis of large data volumes.


Proprietary columnar
Columnar databases have been the hot technology in analytic databases for the last few years. The biggest vendors are Sybase with their very mature IQ product, SAND with an equally mature product and Vertica with their newer (and reportedly much faster) product. These databases can be used in single server (SMP / scale-up) and MPP (multi-server / scale-out) configurations. They appear to be most popular with customers who appreciate the high levels of compression that these databases offer and already have relatively mature star-schema / Kimball style data warehouses in place.  In my experience Sybase and SAND are used most in companies where they were introduced by an OEM as part of another product. Vertica is so new that it's not clear who their 'natural' customers are yet.


Open Source columnar
In the open source world there are 2 MySQL storage engines and a standalone product offering columnar databases. The MySQL engine Infobright was the first open source columnar database. It features very high compression and very fast loading, however it is not suited to lots of joins and may be better thought of as an OLAP tool managed via SQL. The InfiniDB MySQL engine, on the other hand, is very good at joins and very good at squeezing all the available performance out of a server, however it does not currently have any compression. Finally there is LucidDB, which is a Java based standalone product with performance characteristics somewhere between the other two. LucidDB features excellent compression, index support and generally good performance but can be slow to load.


Vectorised columnar
There is only one player here: VectorWise. VectorWise is a columnar database (AFAIK) that has been architected from top to bottom to take advantage of the vector pipelines built into all recent CPUs. Vectorisation is a way of running many highly parallel operations through a single CPU; it basically removes much of the waiting and memory shifting that slows a CPU down. Initial testers have been very positive about the performance of VectorWise and had nothing but good things to say. There is also talk of an open source release, so they are covering a lot of bases. They also have the advantage of being part of Ingres, who may not be the force they once were but have a significant installed base and are well placed to sell VectorWise. They are the biggest direct competitor to ParStream that I can see right now.
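A loose way to get a feel for the idea on an ordinary CPU is to compare value-at-a-time work with whole-column operations, as in the NumPy snippet below. VectorWise's internals are far more sophisticated than this, so treat it purely as an illustration of why feeding the CPU long runs of similar work pays off.

```python
import time
import numpy as np

prices = np.random.rand(5_000_000)
quantities = np.random.rand(5_000_000)

# Value-at-a-time: each iteration does a tiny piece of work with lots of
# overhead in between, so the CPU spends most of its time waiting.
start = time.time()
total_slow = 0.0
for p, q in zip(prices.tolist(), quantities.tolist()):
    total_slow += p * q
print("scalar loop:", round(time.time() - start, 3), "s")

# Column-at-a-time: one call runs a tight loop over contiguous memory,
# which is exactly the kind of work vector pipelines are built for.
start = time.time()
total_fast = float(np.dot(prices, quantities))
print("vectorised: ", round(time.time() - start, 3), "s")
```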


Open Source MapReduce/NoSQL
ParStream will also compete with a new breed of open source MapReduce/NoSQL products, most notably Hadoop (and its variants). These products are not databases per se but they have gained a lot of mindshare among developers who need to work with large data volumes. Part of their attraction is their 'cloud friendliness'. They are perfect for the cloud because they have been designed to run on many small servers and to expect that a single server could fail at any time. There is a trade-off to be made and MapReduce products tend to be much more complex to query; however, for a technically savvy audience the trade is well worth it.


Next time I'll talk about where I think ParStream need to place themselves to maximise their opportunity.

UPDATE: Actually, in the next post I talk about how analytic database vendors are positioned and introduce a simple market segmentation. A further post about market opportunities will follow.

My take on why businesses have problems with ETL tools

Check out this very nice piece by Rick about the reasons why companies have failed to get the most out of their ETL tools.

My take is from the other side of the fence. As a business user I'm often frustrated by ETL tools and have been known to campaign against them for the following reasons:

> ETL tools have been too focused on Extract-Transform-Load and too little focused on actual data integration. I have complex integration challenges that are not necessarily a good fit for the ETL strategy and sometimes I feel like I'm pushing a square peg into a round hole.

> It's still very challenging to generate reusable logic inside ETL tools and this really should be the easiest thing in the world (ever heard of the mantra Don't Repeat Yourself?). Often the hoops that have to be jumped through are more trouble than they are worth.

> Some ETL tools are a hodge podge of technologies and approaches with different data types and different syntaxes wherever you look. (SSIS I'm looking at you! This still is not being addressed in Denali.)

> ETL tools are too focused on their own execution engines and fail miserably to take advantage of the processing power of columnar and MPP databases by pushing processing down into the database (see the sketch after this list). This is understandable in open source tools (database specific SQL may be a bridge too far) but in commercial tools it's pathetic.

> Finally, where is the ETL equivalent of SQL? Why are we stuck with incompatible formats for each tool? The design graphs in each tool look very similar and the data they capture is near identical. Even the open source projects have failed to utilise a common format. Very poor show. This is the single biggest obstacle to more widespread ETL. Right now it's much easier for other parts of the stack to stick with SQL and pretend that ETL doesn't exist.
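On the pushdown point above, this is the sort of thing I mean, with SQLite standing in for the warehouse: rather than dragging every row through the ETL engine's own runtime, the tool generates one set-based statement and lets the database do the heavy lifting. The table names are made up for the example.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")  # stand-in for a columnar/MPP warehouse
warehouse.executescript("""
    CREATE TABLE stg_sales (sku TEXT, amount REAL, currency TEXT);
    CREATE TABLE fct_sales (sku TEXT, amount_usd REAL);
    CREATE TABLE fx_rates (currency TEXT PRIMARY KEY, to_usd REAL);
    INSERT INTO stg_sales VALUES ('A1', 10.0, 'GBP'), ('B2', 5.0, 'USD');
    INSERT INTO fx_rates VALUES ('GBP', 1.6), ('USD', 1.0);
""")

# The pushdown version: one set-based statement executed where the data lives,
# instead of fetching rows into the ETL engine and transforming them one at a time.
warehouse.execute("""
    INSERT INTO fct_sales (sku, amount_usd)
    SELECT s.sku, s.amount * r.to_usd
    FROM stg_sales s JOIN fx_rates r ON r.currency = s.currency
""")
print(warehouse.execute("SELECT * FROM fct_sales").fetchall())
# [('A1', 16.0), ('B2', 5.0)]
```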

Wednesday 12 January 2011

Chinese Mother: Psychology is Modern Shamanism

    A couple of days ago there was a widely linked article in the WSJ called "Chinese Mother" ( http://on.wsj.com/f3nh9d ). The basic premise of the article is that Western mothers are too soft and don't push their children enough, and Chinese mothers are like a blacksmith's hammer cruelly pounding their children until they become brilliant swords of achievement (or something equally pathetic).

    I'm not going to deal with the premise though; it's the subtext that I'm interested in. The subtext is: 'Western people develop psychological problems because their parents make them weak, self indulgent quitters.' I've seen lots of counterpoints whose subtext is something like 'Chinese parents turn their children into soulless robots who can only take orders'. The really interesting thing about both of these ideas is that they tacitly accept the current fashions of Western psychology as if they were scientifically proven facts. You may well expect that from the Western responses, but in the original piece she frames Chinese Mothers as the antidote to the 'problems' identified by Western psychological ideas.

    I'm going to digress for a minute but if you do nothing else make sure you read  "The Americanization of Mental Illness" in the New York Times ( http://nyti.ms/ggQKCG ).

    Let me introduce an imaginary world in which the internal combustion engine evolved on its own and everyone in this world is given an engine when they're born, sort of like a puppy, and the engine has to develop and eventually reach a mature state. They keep the engine through their life and use it to assist with physical work. These engines are completely sealed (a la Honda) and cannot be opened or disassembled without destroying them. The engines accept a few limited inputs (petroleum products, coolant and accelerator signals). They output power and waste (heated coolant and exhaust). Virtually all work is done by the engines and no one can conceive of life without them.
    People in this imaginary world are naturally very curious about engines but they basically know nothing about them. They cannot create an engine from first principles.  They have invested huge efforts in studying engines but this 'study' basically amounts to looking at engines while they're working and measuring which parts get hot.  The engines display  remarkably diverse behaviour.  They are very sensitive to the quality of the petroleum products the user provides.  Some substitutes have been found to work but others will kill the engine.  Scientists studying engines have found that chemicals can be added to the fuel to generate different performance characteristics.  It's not known whether these additives have a long term impact on the engine.  Temperature, humidity, age, etc; many other variables also subtly affect the engines.
    Alongside the scientists, a separate field of engine philosophy has grown up.  These people develop complex theories about engine performance and how it can be influenced.  Their theories are never tested (it would be unethical to destroy an engine to test a theory).  Regardless, engine philosophies are extremely popular and wield a huge influence over people's perception of how engines should be used to best effect.  Finally there is a third group - the practical philosophers.  They are engine philosophers who also study all of the components and inputs of engines.  They are called upon to intervene when an engine is not performing as expected.  They use various mechanical devices and chemical cocktails depending on which school of philosophy they belong to.  No one knows if these 'treatments' actually work but many people 'feel' like they do and that seems to be good enough.

    Back to reality, clearly my imaginary world is ridiculous.  Right?  They sound like cargo cult tribes making earphones out of coconuts and waiting for wartime planes to return.  And what does this have to do with the 'Chinese Mother' nonsense anyway?  Well the truth is that the engine people are us and this is how our culture deals with the brain.

    How much do we know about the brain? Nothing. Seriously - NOTHING! The brain is, in many ways, the last great mystery of the natural world. I don't want to demean the good work that scientists are doing with fMRIs of the brain, but they are a long way from explaining the mechanics of the brain and do not deserve sensational headlines. If the path from superstitious farmers to an explanation of brain phenomena from first principles is a mile, we've gone about 100 feet. Into that vacuum of understanding we have pushed a huge volume of nonsense. The nonsense varies widely in quality, from laughably stupid 'ghost in the machine' stuff to the very sophisticated but utterly meaningless 'mental illnesses' of modern psychology.

     To understand our progress in brain science, let's consider a steam engine in our imaginary world. People have been tinkering with the idea of steam power since ancient Greece. The first workable steam engine appeared in 1712. In a world of natural 'engines' such machines would seem rudimentary and laughable. Compared to the high powered and perfectly working natural engines they would be. Many people would doubt that 'evolved' engines could possibly work on the same principles. Perhaps they would gain acceptance because you could create new ones as needed. Given time, steam engines could become increasingly sophisticated and perhaps eventually reach (or even surpass) the effectiveness of natural engines. I'd like to think that this is where we are now in our understanding of the brain. Modern computers are the 'steam age' of brain science. Compared to the brain they are incredibly inflexible and crude. Yet we have found them to be immensely useful and they have clearly changed our world.

     So, if our brain science is in the steam age, at least scientists are studying something real. If you lived in a pre-Enlightenment tribe/village/etc., someone in the tribe was designated as the shaman (or whatever you called it). They were essentially selected at random and, if you were lucky, they had some knowledge about various plants that could be used if someone displayed a certain symptom. They also had a fancy story to explain what they were doing and why it worked. Sometimes their stuff worked, sometimes it killed the patient, but they basically knew nothing. The function of the shaman was to provide you with a reason to believe you would get better. That works surprisingly well a lot of the time; it's called the placebo effect.

    The problem with psychology and psychiatry is that it's still like that.  There's a huge psycho-pharma industry geared up to give you a reason to believe you should feel better and charge you handsomely for the privilege.   They're basically modern shamans!  There is no detailed explanation for the effect of SRI anti-depressants.  They are stuffing the world's population full of chemicals whose effect cannot be adequately explained.  The use of the terms 'mental health' and 'mental illness' are basically ridiculous.  The modern psycho-pharma practitioner has no better basis to label some symptom a 'mental illness' than a shaman had to explain why a tribe member was sick.  They fundamentally DO NOT KNOW, they're just guessing.

    Now, you may be about to rebuke me with various double blind, statistically valid and incredibly sophisticated studies that have been done on psycho-pharma drugs and mental illnesses. Those things are great but what are they really measuring? They're measuring deeply subjective experiences and outcomes as reported by human beings. These experiences and outcomes are very strongly shaped by the culture and expectations of the participants. They do not study the actual physical effects of the compounds; it's ALL subjective. It may be sophisticated but it's not science. Good science is not subjective. Good science relies on verifiable and repeatable outcomes. Good science says 'we don't know' very clearly when that's the truth. No one in psycho-pharma ever says 'we don't know'.

    It's kind of depressing, or maybe that's a meaningless term. All I can say is be very careful about anyone who tries to sell you an explanation for how the brain works, and remember that the placebo effect is a powerful force.

    As far as parenting and being a Chinese Mother, I don't have any advice for you, but I can promise you that simplistic explanations for complex outcomes (like the success or happiness of your kids) are invariably wrong. I guess you'll just have to do what seems best to you; know that your culture will have a huge effect that you can't really control; and trust that your kids will probably turn out a little weird and mostly OK. As far as I can tell most people do.

Tuesday 4 January 2011

2011 Preview: BI-DW Top 5

Here are the trends I expect to see in 2011, but beware my crystal ball is hazy and known to be biased.


Top 5 for 2011


5) Niche BI acquisitions take off   
  Big BI consolidation may well be finished, but I think 2011 will be the start of niche vendor acquisitions as established BI vendors seek new growth in a (hopefully) recovering economy.  I don't expect any given deal size to be huge (probably sub $100m) however we could easily see half a dozen vendors being picked up.
  The driver for such acquisitions should be clear; Big BI vendors have ageing product stacks and many have been through post-merger product integration pains.  Their focus on innovation has been sorely lacking (non-existent?).  Also, there is huge leverage in applying a niche product to an existing portfolio.  The Business Objects / Xcelsius acquisition is a great example of this (although BO seems to think Xcelsius is a lot better and more useful than I do).
  I will not make any predictions about who might be acquired. However, here are some examples of companies with offerings that are not available from Big BI vendors.  Tableau's data visualisation offering is 1st class IMHO and is a perfect fit for the people who actually use BI products in practice.  Lyza's BI/ETL collaboration offering is unique (and hard to describe) and a great fit for business oriented BI projects.  Jedox' Palo offering brings unique power to Excel power users and appears to be the only rival to Microsoft's PowerPivot offerings; I suspect a stronger US sales force would help them immensely.

4) GPU based computing comes to the fore
  I blogged some time ago about GPUs offering a glimpse of the many-core future. Since then I've been waiting (and waiting) for signs that GPUs were making the jump into business servers. Then, in April 2010, Jedox released Palo OLAP Accelerator for GPUs, and this autumn I discovered ParStream's new GPU accelerated database (I blogged about it last week). Finally, in December we saw the announcement of a new class of Amazon EC2 instance featuring a GPU as part of the package.
  Based on these weak signals, I think 2011 will be the year that GPU processing and GPU acceleration start to become a widely accepted part of business computing. The most recent GPU cards from Nvidia and AMD offer many hundreds (512+) of processing cores and multiple cards can be used in a single server. There is a large class of business computing problems that could be addressed by GPUs: analytic calculations (e.g. SAS / R), anything related to MapReduce / Hadoop, anything related to enterprise search / e-discovery, anything related to stream processing / CEP, etc. As a final note, I would strongly suggest that vendors who sell columnar databases or in-memory BI products (or are losing sales to such) should point their R&D team at GPUs and get something together quickly. Niche vendors have an opportunity to push the price/performance baseline up by an order of magnitude and take market share while Big BI vendors try to catch up.


3) Data Warehousing morphs into Data Intensive Computing
  I once asked Netezza CTO Justin Lindsey if he considers Netezza machines to be supercomputers.  He said no he didn't but that the scientific computing 'guys' call it a "Data Intensive Supercomputer" and use it in applications where the ratio of data to calculations is very high, i.e., the opposite of classical supercomputing applications.  That phrase really stuck with me and it seems to describe the direction that data warehousing is headed.
  If you've been around BI-DW for a while you'll be familiar with the Inmon v Kimball ideology war. That fight illustrates the idea that data warehouses had a well defined purpose simply because we could argue about the right way to do 'it'.  I've noticed the purpose of the data warehouse stretching out over the last few years. The rise of analytics and ever increasing data volumes mean that more activities are finding a home on the data warehouse as a platform.  Either the activity cannot be done elsewhere or the data warehouse is the most accessible platform for data driven projects with short term data processing needs.
  In 2011 we need to borrow this term from the supercomputing guys and apply it to ourselves.  We need to change our thinking from delivering and supporting a data warehouse to offering a Data Intensive Computing service (that enables a data warehouse).  Those that fail to make the change should not be surprised when departments implement their own analytic database, make it available to the wider business and start competing with them for funding.


2) SharePoint destabilises incumbent BI platforms
  SharePoint is not typically considered a BI product and is rarely mentioned when I talk to fellow BI people. Those who specialise in Microsoft's products occasionally mention the special challenges (read headaches) associated with supporting it but it's "just a portal".  Right?  Not quite.  Microsoft has managed to drive a nuclear Trojan horse into the safety of incumbent BI installations.  SharePoint contains extensive BI capabilities and enables BI capabilities in other Microsoft products (like, um, Excel!).  Worst of all, if you're the incumbent BI vendor, SharePoint is everywhere!  It has something like 75% market share overall and effectively 100% market share in big companies.
  So what?  Well, when you want to deploy a dashboard solution where is the natural home for such content?  The intranet portal.  When you need to collaborate on analysis with widely dispersed teams, what can you use that's better than email?  Excel docs on the portal.  If report bursting is filling up your inboxes like sand in an hourglass, where can you put reports instead?  Maybe the intranet?  You get the point. We have a history in BI of pushing yet another friggin' portal onto the business when we select our BI platform.  Our chosen platform comes with such a nice portal, heck that's part of why we bought it. A year later we wonder why it doesn't get used.  We wonder why we spend more time unlocking expired logins than answering questions about reports.
   Right now businesses are only using a small fraction of SharePoint's capabilities, but they pay for all of them and I expect businesses to push for more return from SharePoint investments in 2011. I expect a lot of these initiatives to involve communicating business performance (BI) and collaborating on performance analysis (BI again). The trouble for incumbent vendors is clear: SharePoint has no substitute; your BI suite has direct substitutes; Microsoft offers some substitutes for free; your BI content is going to end up on SharePoint, and once it's there it's SharePoint content. BI vendors should expect hard conversations about maintenance fees and upgrade cycles in any account where dashboards are being hosted on SharePoint.
  As a final note, I would suggest that vendors who sell to large customers need to have a compelling SharePoint story. It's basically a case of "if you can't beat them, join them". If you have a portal as part of your suite you need to integrate with SharePoint (yesterday). You need to make your products work better with SharePoint than Microsoft's own products do. This will be a huge, expensive PITA - do it anyway. You must find a way to embrace SharePoint without letting it own you. Good luck.


1) BI starts to dissolve into other systems
  My final trend for 2011 is about BI becoming bifurcated (love that word) between the strategic stuff (dashboards and analysis) and everything else. That "everything else" doesn't naturally live on a portal or in a report that gets emailed out. It belongs in the system that generates the data in the first place; it belongs right at the point of interaction. James Taylor and Neil Raden talked about this idea in the book "Smart Enough Systems". I won't repeat their arguments here but I will outline some of the reasons why I think it's happening now.
   First, 'greenfield' BI sites are a thing of the past. Everyone now has BI, it may not work very well but they have it.  New companies use BI from day 1.  The market is effectively saturated.  Second, most of the Big BI vendors are now part of large companies that sell line of business systems.  There is a natural concern about diluting the value of the BI suite, however "BI for the masses" is a dead-end and I think they probably get that.  Third, deep integration is one of the last remaining levers that Big BI vendors can use against nimble niche vendors and against SharePoint.  They will essentially have to go down this route at some point.  Finally, many system vendors have reached an impasse with their customers regarding upgrades. Customers are simply refusing to upgrade systems that work perfectly well. These vendors must create a real, tangible reason for the customers to move. I suspect that deep BI integration is their best bet.
  I have had too many conversations about 'completing the circle' and feeding the results of analysis back into source systems.  Sadly it never happens in practice, the walls are just too high.  Once the data has left the source system it is considered tainted and pushing tainted data into production systems is never taken lightly.  Thus the ultimate answer seems to be to push the "smarts" that have been generated by analysis down into the source system instead.  Expect to see plenty of marketing talk in 2011 about systems getting 'smarter' and more integrated.
