Friday 8 April 2011

Data Management - Data is Data is Data is…

[Sometimes I want to write about one topic but I end up writing 1,500 words of background before I even touch on the subject at hand. Sometimes the background turns out to be more interesting; hopefully this is one of those times.]

In this post I talk about the problems with mainstream data management, especially SQL databases. I then touch on the advantages of SQL databases and the good attributes we need to retain.

Data is Data is Data is…
Current IT practice splits data management into lots of niches: SQL databases, email platforms, network file systems, enterprise search, etc. There is plenty of overlap between the niches and, in truth, the separations are artificial: they reflect the way systems are implemented, not fundamental differences in the data. Have a look at your email client: the headers in the message list (From, Subject, etc.) are just database field names, and the message body is simply a BLOB field. Some email clients, e.g. Gmail, can also parse that blob and find links to previous messages, which works very much like a foreign key.
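To make the analogy concrete, here's a minimal Python sketch (the message itself is invented for illustration) showing how the standard email module already treats a message as a record of named fields plus a blob:

```python
# A minimal sketch of the email-as-database idea, using Python's
# standard library. The raw message below is invented for illustration.
from email.parser import Parser

raw = (
    "From: jane@example.com\n"
    "Subject: Re: Q2 numbers\n"
    "Message-ID: <msg-002@example.com>\n"
    "In-Reply-To: <msg-001@example.com>\n"
    "\n"
    "Looks good to me.\n"
)

msg = Parser().parsestr(raw)

# The headers behave exactly like named database fields...
row = {
    "from_":       msg["From"],
    "subject":     msg["Subject"],
    "message_id":  msg["Message-ID"],
    # ...and In-Reply-To is effectively a foreign key to another message.
    "in_reply_to": msg["In-Reply-To"],
    "body":        msg.get_payload(),   # the 'BLOB' column
}
print(row)
```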

File systems seem less like a database at first glance, but consider the big file system developments of the last 10 years: ZFS and BTRFS. Both introduce database-like ideas to the file system, such as copy-on-write (à la MVCC), deduplication (à la normalisation), data integrity guarantees (à la ACID) and enhanced file metadata (à la SQL DDL).

The basic point I'm making is that data is data, simple as that. It may be more or less 'structured', but structure and meaning are essentially equivalent. The most 'unstructured' file I can imagine is plain text, but the written word is still highly structured. At a high level it has a lot of metadata (name, created, changed, size, etc.), it has structure embedded in the text itself (language, punctuation, word choice, etc.) and, looking deeper, we can analyse the semantic content of the text using techniques like NLP.

Data is data; it needs to be stored, changed, versioned, retrieved, backed up, restored, searched, indexed, etc. The methods may vary but it's all just data.


The SQL Database Black Box
Not all data can be kept in databases because, among other things, SQL databases are opaque to other applications. Enterprise search illustrates the issue. Most enterprise search apps can look into JDBC/ODBC-accessible databases, profile the data and include its content in search results. However, access to any given database is typically highly restricted, and there is a DBA whose job hangs on keeping that data safe and secure. The DBA must be convinced that the search system will not compromise the security of his data, which typically means limiting search access to the people who also have database access. This is a time-consuming process, and we have to repeat it for every database in the company.

So a year later, when we have access to all the SQL databases and a process to mirror access credentials, the next problem is that SQL provides no mechanism to trace data history. For example, I search for 'John Doe' and find a result from the CRM database. I look in the database and the record now has a name of 'Jane Doe'. Why did it change? When did it change? Who changed it? There is no baseline answer to these questions. The CRM application may record some of this information, but how much? The database has internal mechanisms that trace some of it, but each product has its own scheme and, worse, those tables are often not user-accessible for security reasons.
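Here's a rough sketch of the kind of audit trail we're forced to bolt on ourselves. The table and column names are invented, and SQLite stands in for a production database just to keep the example self-contained:

```python
# A sketch of a hand-rolled audit trail, since SQL itself keeps no
# history. All names are invented for illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

CREATE TABLE customer_history (
    customer_id INTEGER,
    old_name    TEXT,
    new_name    TEXT,
    changed_at  TEXT
    -- Note what's missing: WHO changed it and WHY. The database has no
    -- idea; the application would have to supply that too.
);

CREATE TRIGGER customer_audit AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_history
    VALUES (OLD.id, OLD.name, NEW.name, datetime('now'));
END;
""")

db.execute("INSERT INTO customer (name) VALUES ('John Doe')")
db.execute("UPDATE customer SET name = 'Jane Doe' WHERE id = 1")
print(db.execute("SELECT * FROM customer_history").fetchall())
# -> [(1, 'John Doe', 'Jane Doe', '<timestamp>')]
```

Even this only answers 'what' and 'when'; 'who' and 'why' still have to come from the application, which is precisely the gap.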

In my experience, 80% of the value actually gained from a data warehouse comes from resolving this issue in a single place and in a consistent way. Hence the growth of the master data management (MDM) industry, but I won't digress on that. The data warehouse doesn't actually solve the problem; it merely limits the number of SQL databases that must be queried to one. And, of course, we never manage to get everything into the DW.

There are many other black box attributes of SQL databases: two very similar queries may perform in drastically different ways; background tasks can make the database extremely slow without warning; the database's disk format cannot be read by other applications; the database may bypass the filesystem, making us entirely reliant on the database to detect disk errors; and so on.


The SQL Database Choke Point
Current SQL databases are also a very real constraint on day-to-day operations. For example, a large company may only be able to process bulk updates against a few percent of its customer base each night. SQL databases must be highly tuned for one type of query, and that tuning usually makes other access styles unworkable, as the sketch below illustrates.
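As a rough illustration (table names invented, SQLite standing in for a production database), an index that favours point lookups makes every bulk write pay for index maintenance:

```python
# A sketch of the tuning trade-off: an index that makes one access
# pattern fast makes another slower, because every write must now
# maintain the index too. Names invented; SQLite used for illustration.
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")

def bulk_update(n=100_000):
    """Reload n rows, then time a bulk update against every row."""
    db.execute("DELETE FROM customer")
    db.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                   ((i, f"cust{i}", "EU") for i in range(n)))
    start = time.perf_counter()
    db.execute("UPDATE customer SET name = name || '!'")
    return time.perf_counter() - start

plain = bulk_update()
db.execute("CREATE INDEX idx_name ON customer(name)")  # great for lookups...
indexed = bulk_update()                                # ...costly for bulk writes
print(f"bulk update: {plain:.2f}s without index, {indexed:.2f}s with index")
```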

Further, the schema of a production SQL database is effectively set in stone. Although SQL provides ALTER statements, the performance impact and risk of using ALTER are so bad that it's almost never used. Instead we either add a new small table and use a join when we need the additional data, or we create a whole new table and export the existing data into it. Both of these operations impose significant overheads when all we really want is a new field (see the sketch below). So, in practice, production SQL databases satisfy a single type of access, are very resistant to other access patterns and are very difficult to change.
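A sketch of the side-table workaround, again with invented names and SQLite purely for illustration: instead of a simple ALTER TABLE ... ADD COLUMN, the 'new field' becomes a whole new table and a join on every read:

```python
# A sketch of the side-table workaround for adding a field without
# running ALTER on a production table. All names are invented.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

-- The 'new field' lives in its own table, keyed back to customer.
CREATE TABLE customer_loyalty (
    customer_id  INTEGER REFERENCES customer(id),
    loyalty_tier TEXT
);
""")

db.execute("INSERT INTO customer (name) VALUES ('Jane Doe')")
db.execute("INSERT INTO customer_loyalty VALUES (1, 'gold')")

# Every query that wants the extra field now pays for the join.
rows = db.execute("""
    SELECT c.name, l.loyalty_tier
    FROM customer c
    LEFT JOIN customer_loyalty l ON l.customer_id = c.id
""").fetchall()
print(rows)  # [('Jane Doe', 'gold')]
```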

These issues are well recognised, and the answer has come back that we need specialist SQL databases for each use case. Michael Stonebraker, in particular, has been beating this drum for at least five years (and, credit where it's due, Vertica paid off in spades). However, we haven't seen a huge uptake of specialist databases in markets other than analytics. In particular, the mainstream OLTP market has very few specialist offerings. Perhaps it's a more difficult problem, or perhaps the structure of SQL itself is less amenable to secondary innovation around OLTP. I sense a growing recognition that improvements in the OLTP space require significant re-engineering of existing applications.

Specialist databases have succeeded to some extent in the data warehouse and business intelligence sphere. I think this exception proves the observation. Fifteen years ago I would have added another complaint to my list of black box attributes: it was impossible to get reports and analysis from my production systems. The data warehouse was invented, and gained popular acceptance, simply because this was such a serious problem. The great thing about selling analytic databases for the last 15 years was that you weren't displacing a production system: businesses don't immediately start losing money if the DW goes down. The same cannot be said of most other uses of SQL databases, and that's why they will be replaced only slowly and only when there is a compelling reason (mainframes are still around, right?).


There's a baby in this bathwater!
It's worth remembering that SQL databases offer a lot of advantages. Codd outlined 12 rules that relational databases should follow. I won't list them all here, but at a high level a relational database maintains the absolute integrity of the data it stores and allows us to place constraints on that data, such as its type and length or its relation to other data. We take this for granted now, but it was a real breakthrough, and it took years to implement in practice.

Just for kicks, imagine a CRM system based on Word docs. When you want to update a customer's information you open their file, make whatever changes you want and save it. The system only checks that the doc exists; you can change anything at all and the system won't care. If you want the system to make sure you only change the right things, you'll have to build that function yourself. That's more or less what data management was like before SQL databases.
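By contrast, a relational database enforces those rules declaratively. A minimal sketch, with invented names and SQLite standing in for a real system:

```python
# Contrast with the Word-doc CRM: the database itself enforces the
# constraints, so nobody has to build that function by hand.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite needs this switched on
db.executescript("""
CREATE TABLE country (code TEXT PRIMARY KEY);

CREATE TABLE customer (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL CHECK (length(name) <= 100),  -- length constraint
    country TEXT NOT NULL REFERENCES country(code)      -- relation to other data
);
""")

db.execute("INSERT INTO country VALUES ('GB')")
db.execute("INSERT INTO customer (name, country) VALUES ('Jane Doe', 'GB')")

try:
    # The database rejects this itself; no application code required.
    db.execute("INSERT INTO customer (name, country) VALUES ('John Doe', 'XX')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```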


What to keep & what to throw away
So what would our ideal data management platform look like? It persists data in a format that can be freely parsed by other applications, i.e., plain text (XML? JSON? Protocol Buffers?). It maintains data integrity at an atomic level, probably by storing checksums alongside each item. It lets us define stored data as strictly or loosely as we want, but it enforces the definitions we set. All changes to our stored data actually create new versions, and the system keeps a linked history of changes.
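As a toy sketch of those last two points (checksums plus a linked version history), here's what a content-addressed, versioned store might look like; everything here is invented for illustration:

```python
# A toy sketch of the ideal store: records persisted as plain-text JSON,
# checksummed for integrity, with every change creating a new version
# linked to its predecessor. All names are invented.
import hashlib
import json

store = {}  # checksum -> serialized record (stands in for real storage)

def put(record, parent=None):
    """Store a new version of a record and return its checksum."""
    body = json.dumps({"data": record, "parent": parent}, sort_keys=True)
    checksum = hashlib.sha256(body.encode()).hexdigest()
    store[checksum] = body  # content-addressed: the key IS the integrity check
    return checksum

def history(checksum):
    """Walk the linked chain of versions, newest first, verifying each one."""
    while checksum is not None:
        body = store[checksum]
        assert hashlib.sha256(body.encode()).hexdigest() == checksum
        entry = json.loads(body)
        yield entry["data"]
        checksum = entry["parent"]

v1 = put({"name": "John Doe"})
v2 = put({"name": "Jane Doe"}, parent=v1)
print(list(history(v2)))  # [{'name': 'Jane Doe'}, {'name': 'John Doe'}]
```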

I think we're starting to see systems emerge that address some of the issues above. It's still early days but I'm excited about projects like Ceph and the very new Acunu.


In my next post I'll look at how the new breed of NoSQL databases displays some of the traits we need for our ideal data management platform.

3 comments:

  1. Interesting post!

I share the same ideas: that data is the primary matter and that systems that work with data should rely on some common method of representing and, preferably, defining data.

Unfortunately, these aspects have always been blended into various data storage and processing systems (including programming languages) that expose or export data in custom formats using highly specialized interfaces (e.g. SQL).

As a consequence, substantial programming and integration efforts are usually needed in order to provide data portability by data providers or to deal with custom data formats for data consumers. Systems interoperation, in general, falls under this category as well.

    I'm working to address these aspects in the course of the Piqi project (http://piqi.org). It includes a powerful data definition language (Piqi), a standard way to represent data in XML, JSON and Protocol Buffers format, tools for type-based conversion between the three formats, etc.

Piqi also provides methods for non-disruptive extension of data schemas, and schemas themselves are serializable. This, in turn, allows creation of self-describing backward- and forward-compatible data bundles.

  2. Alvarik,

Thanks for commenting. Piqi looks very cool, almost exactly what I was envisioning.

The next post in this sequence is going to be about various NoSQL databases and how they're kind of halfway there.

    Joe

  3. Hi Joe,

I'm glad you liked Piqi. My primary use case for it is as a cross-language data serialization system and an RPC system for OCaml and Erlang. I don't currently use it for creating self-describing data bundles, so this part is still experimental and some details will change as I keep working on it. But I'm very pleased with what I've got so far, and the goal is to make it even more portable and robust.

    Looking forward to your next post!

    Anton

