Friday, 8 April 2011

Data Management - Data is Data is Data is…

[Sometimes I want to write about one topic but I end up writing 1,500 words of background before I even touch on the subject at hand. Sometimes the background turns out to be more interesting; hopefully this is one of those times.]

In this post I talk about the problems with mainstream data management, especially SQL databases. I then touch on the advantages of SQL databases and the good attributes we need to retain.

Data is Data is Data is…
Current IT practice splits data management into lots of niches: SQL databases, email platforms, network file systems, enterprise search, etc. There is plenty of overlap between niches and, in truth, the separations are artificial. They merely reflect the way systems are implemented, not fundamental differences in the data. Have a look at your email client: those headers in the message list (From, Subject, etc.) are just database field names, and the message body is simply a BLOB field. Some email clients, e.g., Gmail, can also parse that blob and find links to previous messages, which is very much like a foreign key link.

File systems seem less like a database at first glance, but consider the big file system developments of the last 10 years: ZFS and Btrfs. Both introduce database-like ideas to the file system, such as copy-on-write (a la MVCC), deduplication (a la normalisation), data integrity guarantees (a la ACID) and enhanced file metadata (a la SQL DDL).

The basic point I'm making is that data is data. Simple as that. It may be more or less 'structured', but structure and meaning are essentially equivalent. The most 'unstructured' file I can imagine is plain text, yet the written word is still highly structured. At a high level it has a lot of metadata (name, created, changed, size, etc.), it has structure embedded in the text itself (language, punctuation, words used, etc.) and, looking deeper, we can analyse the semantic content of the text using techniques like NLP.

Data is data; it needs to be stored, changed, versioned, retrieved, backed up, restored, searched, indexed, etc. The methods may vary but it's all just data.


The SQL Database Black Box
Not all data can be kept in databases because, amongst other things, SQL databases are opaque to other applications. Enterprise search illustrates the issue. Most enterprise search apps can look into JDBC/ODBC accessible databases, profile the data and include its content in search results. However, access to any given database is typically highly restricted and there is a DBA whose job hangs on keeping that data safe and secure. The DBA must be convinced that the search system will not compromise the security of their data, and this typically means limiting search access to the people who also have database access. This is a time consuming process and we have to repeat it for every database in the company.

So a year later, when we have access to all SQL databases and a process to mirror access credentials, the next problem is that SQL provides no mechanism to trace data history. For example, I search for 'John Doe' and find a result from the CRM database. I look in the database and the record now has a name of 'Jane Doe'. Why did it change? When did it change? Who changed it? There is no baseline answer to these questions. The CRM application may record some of this information but how much? The database has internal mechanisms that trace some of this but each product has its own scheme and, worse, the tables are often not user accessible for security reasons.
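To make the gap concrete, here's a toy sketch of the kind of audit trail every application has to bolt on for itself (SQLite via Python; the customer table and trigger are invented for illustration, not taken from any real CRM):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
-- SQL offers no built-in lineage, so the application must keep its own history table.
CREATE TABLE customer_history (
    id INTEGER, old_name TEXT, changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER customer_audit AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_history (id, old_name) VALUES (OLD.id, OLD.name);
END;
""")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'John Doe')")
conn.execute("UPDATE customer SET name = 'Jane Doe' WHERE id = 1")

for row in conn.execute("SELECT id, old_name FROM customer_history"):
    print(row)  # (1, 'John Doe')
```

Even this only captures *what* changed and roughly *when*; the *who* and the *why* still have to come from the application layer, and every product rolls its own scheme.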

In my experience, 80% of the value actually gained from a data warehouse comes from resolving this issue in a single place and in a consistent way. Hence the growth of the MDM industry, but I won't digress on that. The data warehouse doesn't actually solve the problem, it merely limits the number of SQL databases that must be queried to 1. And, of course, we never manage to get everything in the DW.

There are many other black box attributes of SQL databases: two very similar queries may perform in drastically different ways; background tasks can make the database extremely slow without warning; the database disk format cannot be accessed by other applications; the database may bypass the filesystem, making us entirely reliant on it to detect disk errors; etc., etc.


The SQL Database Choke Point
Current SQL databases are also a very real constraint on day-to-day operation. For example, a large company may only be able to process bulk updates against a few percent of the customer base each night. SQL databases must be highly tuned for a single type of access query, and that tuning usually makes other access styles unworkable.

Further, the schema of a production SQL database is effectively set in stone. Although SQL provides ALTER statements, the performance and risk of using ALTER are so bad that it is almost never used. Instead we either add a new small table and use a join when we need the additional data, or we create a new table and export the existing data into it. Both of these operations impose significant overheads when all we really want is a new field. So, in practice, production SQL databases satisfy a single type of access, are very resistant to other access patterns and are very difficult to change.
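To illustrate the first workaround, here's a toy sketch (SQLite via Python; the table and column names are invented) of the 'small side table plus join' pattern:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Jane Doe')")

# All we really want is a new 'loyalty_tier' column, but ALTER TABLE on a
# large production table is deemed too risky, so we bolt on a side table...
conn.execute("CREATE TABLE customer_loyalty (id INTEGER PRIMARY KEY, tier TEXT)")
conn.execute("INSERT INTO customer_loyalty VALUES (1, 'gold')")

# ...and pay for it with a join on every read that needs the extra field.
row = conn.execute("""
    SELECT c.name, l.tier
    FROM customer c LEFT JOIN customer_loyalty l ON c.id = l.id
    WHERE c.id = 1
""").fetchone()
print(row)  # ('Jane Doe', 'gold')
```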

These issues are well recognised and the answer has come back that we need specialist SQL databases for each use case. Michael Stonebraker, in particular, has been beating a drum about this for at least 5 years (and, credit where it's due, Vertica paid off in spades). However, we haven't seen a huge uptake in specialist databases for markets other than analytics. In particular the mainstream OLTP market has very few specialist offerings. Perhaps it's a more difficult problem or perhaps the structure of SQL itself is less amenable to secondary innovation around OLTP. I sense a growing recognition that improvements in the OLTP space require significant re-engineering of existing applications.

Specialist databases have succeeded to some extent in the data warehouse and business intelligence sphere. I think this exception proves the observation. 15 years ago I would add another complaint to my black box attributes: it was impossible to get reports and analysis from my production systems. The data warehouse was invented and gained popular acceptance simply because this was such a serious problem. The great thing about selling analytic databases for the last 15 years was that you weren't displacing a production system. Businesses don't immediately start losing money if the DW goes down. The same cannot be said of most other uses for SQL databases and that's why they will only be replaced slowly and only when there is a compelling reason (mainframes are still around, right?).


There's a baby in this bathwater!
It's worth remembering that SQL databases offer a lot of advantages. Codd outlined 12 rules that relational databases should follow. I won't list them all here, but at a high level a relational database maintains the absolute integrity of the data it stores and allows us to place constraints on that data, such as the type and length of the data or its relation to other data. We take it for granted now, but this was a real breakthrough and it took years to implement in practice.

Just for kicks imagine a CRM system based on Word docs. When you want to update a customer's information you open their file and make whatever changes you want and then save it. The system only checks that the doc exists, you can change whatever you want and the system won't care. If you want the system to make sure you only change the right things you'll have to build that function yourself. That's more or less what data management was like before SQL databases.


What to keep & what to throw away
So what would our ideal data management platform look like? It persists data in a format that can be freely parsed by other applications, i.e., plain text (XML? JSON? Protocol Buffers?). It maintains data integrity at an atomic level, probably by storing checksums alongside each item. It lets us define stored data as strictly or loosely as we want, but it enforces the definitions we set. All changes to our stored data actually create new versions and the system keeps a linked history of changes.
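As a purely illustrative sketch, not a real system, the checksum and versioning pieces might look something like this in a few lines of Python: items are stored as plain JSON, addressed by their SHA-256 checksum, and each new version links back to the previous one.

```python
import hashlib, json

store = {}   # checksum -> serialized item (plain text, parseable by anyone)
heads = {}   # logical key -> checksum of the latest version

def put(key, data):
    """Store a new version of `key`, linked to the previous version."""
    item = json.dumps({"data": data, "prev": heads.get(key)}, sort_keys=True)
    digest = hashlib.sha256(item.encode()).hexdigest()
    store[digest] = item   # any corruption now breaks the checksum address
    heads[key] = digest
    return digest

def history(key):
    """Walk the linked versions of `key`, newest first."""
    versions, digest = [], heads.get(key)
    while digest is not None:
        item = json.loads(store[digest])
        versions.append(item["data"])
        digest = item["prev"]
    return versions

put("cust:1", {"name": "John Doe"})
put("cust:1", {"name": "Jane Doe"})
print(history("cust:1"))  # [{'name': 'Jane Doe'}, {'name': 'John Doe'}]
```

Updates never overwrite anything, so the "Who changed it and when?" question from the CRM example above at least has somewhere to hang an answer.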

I think we're starting to see systems emerge that address some of the issues above. It's still early days but I'm excited about projects like Ceph and the very new Acunu.


In my next post I'll look at how the new breed of NoSQL databases displays some of the traits we need for our ideal data management platform.

Wednesday, 6 April 2011

Unsolicited advice for MarkLogic - Pivot!

[This is actually a really long comment on Curt Monash's post but I think it's worth cross posting here.] 


Seeing as I have been doing a lot of thinking about database opportunities lately I'll wade in on MarkLogic's as well. I can't really comment about the specific verticals that MarkLogic sells into or should sell into. However, I see 2 fundamental problems with MarkLogic's product positioning.

First problem; they backed the wrong horse by focusing exclusively on XML and XQuery. This has been toned down a lot but the die is cast. People who know about MarkLogic (not many) know of it as a 'really expensive XML database that you need if you have lots of XML and eXist-db is too slow for you'. They've put themselves into a niche within a niche, kind of like a talkative version of Ab Initio.

This problem is obvious if you compare them to the 'document oriented' NoSQLs such as CouchDB and MongoDB. Admittedly they were created long after MarkLogic but the NoSQLs offer far greater flexibility, talk about XML only as a problem to be dealt with and use a storage format that the market finds more appealing (JSON).

Second problem; 'Enterprise class' pricing is past its sell-by date. What does MarkLogic actually cost? You won't find any pricing on the website. I presume the answer is that old standby: 'whatever you're looking to spend'. Again, the contrast with the new NoSQLs couldn't be more stark - they're all either pure open source or open core, i.e., free to start.

MarkLogic was essentially an accumulator bet: 1st bet - XML will flood the enterprise, 2nd bet - organisations will want to persist XML as XML, 3rd bet - an early, high quality XML product will move into an Oracle-like position.

The first bet was a win; XML certainly has flooded the enterprise. The second bet was a loss; XML has become almost a wire protocol rather than a persistence format. Rightly or not, very few organisations choose to persist significant volumes of data in XML. And the third bet was a loss as well; the huge growth of open source and the open core model makes it extremely unlikely that we'll see another Oracle in the data persistence market.

The new MarkLogic CEO needs to acknowledge that the founding premise of the company has failed and they must pivot the product to find a much larger addressable market. Their underlying technology is probably very good and could certainly be put to use in other ways (Curt gives some examples). I would be tempted to split the company in two: a small company to continue selling and supporting MarkLogic at maximum margins (making it an acquisition target), and a new company to build a new product in start-up mode on the foundations of the existing tech.
