Friday 22 October 2010

Disrespecting the database? ORM as disruptive technology

The premise of this post is that ORMs are a disruptive innovation for all parts of the IT industry that utilise databases, particularly relational databases. I'm especially interested in the ultimate impact of ORMs on my work in the BI-DW-OLAP-DSS-{insert acronym here} industry.

What makes a 'disruptive technology'?
     In case you are not familiar with the "innovator's dilemma" concept: it was originally expressed in those terms by Joseph L. Bower and Clayton M. Christensen in the article 'Disruptive Technologies: Catching the Wave'.
     "Generally, disruptive innovations were technologically straightforward, consisting of off-the-shelf components put together in a product architecture that was often simpler than prior approaches. They offered less of what customers in established markets wanted and so could rarely be initially employed there. They offered a different package of attributes valued only in emerging markets remote from, and unimportant to, the mainstream."

Let's talk about ORMs
     If you are a true BI-DW person you may not have heard of ORMs, and you are unlikely to have come across one directly in your work. An ORM [Object-Relational Mapper] is simply a set of code routines that 'map' tables and columns in a relational database to objects, attributes and methods in a programming language. Programmers can then interact with the data stored in the underlying database without writing any SQL.
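     Here's a minimal sketch of what that looks like in practice (using DataMapper, which I'll come back to below; the model and query are invented for illustration):

require 'rubygems'
require 'data_mapper'   # the full DataMapper stack

DataMapper.setup(:default, 'sqlite::memory:')

class Customer
  include DataMapper::Resource
  property :id,   Serial   # auto-incrementing primary key
  property :name, String
end

DataMapper.finalize
DataMapper.auto_migrate!   # creates the 'customers' table for us

# The developer works with objects; the ORM writes the SQL.
Customer.create(:name => 'Acme')
acme = Customer.first(:name => 'Acme')   # SELECT ... WHERE name = 'Acme' LIMIT 1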

An ugly history with DBAs
     Now, as you'd probably imagine, DBAs hate ORMs, and for good reason. They have typically produced horrible SQL and correspondingly awful performance problems for the DBAs to deal with. ORM use in "enterprise" IT environments is patchy and somewhat limited. It seems like a lot of enterprise ORM use is kept out of sight and only comes to light when the DBAs get really fired up about some bad SQL that keeps reappearing every time an update to the software is released.
     The early ORMs were rightly criticised (Hibernate seems to have taken the most heat) but ORMs haven't gone away. The sweet spot for early ORMs was small, 'simple' transactional applications: the kind of app that is needed quickly and where imperfect SQL was not a huge issue. But ORMs keep evolving and becoming more sophisticated in the way they generate SQL and deal with databases. This is where the disruptive innovation part comes in.

The hockey stick graph
     Looking at the graph from the Wikipedia article on disruptive innovation, you can see that ORMs started in the bottom left "low quality use" corner. My entire point for this post is that ORMs are going to follow the "disruptive technology" curve and eventually they will come to be the dominant way in which ALL database access occurs. Seriously.
     Let me explain why I see this happening. There are 3 good technical reasons and a human reason. As usual, the human reason is the trump card.

ORMs are getting better quickly
     First, we're seeing much better ORMs developed. In particular I want to draw your attention to DataMapper. It's a Ruby ORM that's been around for about two and a half years. The interesting thing about DataMapper (for me) is how much respect it has for the database. DB access is designed to minimise the number of queries hitting the backend and, at the same time, to minimise the data being pulled out unnecessarily (i.e. only get Text/Blob fields if you really want them). Here's the kicker though: it supports foreign keys. Real (honest-to-goodness, enforced-by-the-database) foreign keys. Nice.
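     To illustrate that respect for the database, here's a quick sketch of the lazy-loading behaviour as I understand it (the model is invented):

class Article
  include DataMapper::Resource
  property :id,    Serial
  property :title, String
  property :body,  Text    # Text properties are lazy-loaded by default
end

articles = Article.all               # no query fired yet -- collections are lazy
articles.each { |a| puts a.title }   # one query, and it doesn't fetch 'body'
puts articles.first.body             # 'body' is only pulled now, on demand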

ORMs are the ultimate metadata layer
    Second, because an ORM is deeply involved in the application itself it can contain a much richer set of metadata about the data that's being stored. Compare the following SQL DDL with the equivalent ORM setup code.

CREATE TABLE users (
  id           int(10)     NOT NULL AUTO_INCREMENT,
  tenant       int(10)     NOT NULL,
  first_name   varchar(50) NOT NULL,
  last_name    varchar(50) NOT NULL,
  title        varchar(50) DEFAULT NULL,
  email        varchar(99) DEFAULT NULL,
  passhash     varchar(50) DEFAULT NULL,
  salt         varchar(50) DEFAULT NULL,
  permission   int(11)     DEFAULT '1',
  created_at   datetime    NOT NULL,
  updated_at   datetime    DEFAULT NULL,
  deleted_at   datetime    NOT NULL DEFAULT '2999-12-31',
  manager_id   int(10)     NOT NULL,
  PRIMARY KEY (id, tenant),
  UNIQUE INDEX  unique_users_email (email),
  INDEX index_users_manager (manager_id),
  INDEX users_tenant_fk (tenant),
  CONSTRAINT users_tenant_fk  FOREIGN KEY (tenant)
                              REFERENCES  tenants (id) 
                              ON DELETE NO ACTION 
                              ON UPDATE NO ACTION,
  CONSTRAINT users_manager_fk FOREIGN KEY (manager_id)
                              REFERENCES  users (id) 
                              ON DELETE NO ACTION 
                              ON UPDATE NO ACTION
);
    What can we tell about this table? It's got an FK to itself on 'manager_id' and another to 'tenants' on 'tenant'. We don't gain a lot of insight. Here's the DataMapper syntax:
class User
  include DataMapper::Resource

  property :id,         Serial
  property :tenant,     Integer,  :min => 0, :required => true, :key => true
  property :first_name, String,   :required => true
  property :last_name,  String,   :required => true
  property :title,      String
  property :email,      String,   :length => (5..99), :unique => true,
                                  :format => :email_address,
                                  :messages => {:presence  => 'We need your email address.',
                                                :is_unique => 'That email is already registered.',
                                                :format    => "That's not an email address" }
  property :passhash,   String
  property :salt,       String
  property :permission, Integer,  :default => 1
  property :created_at, DateTime, :required => true
  property :updated_at, DateTime
  property :deleted_at, ParanoidDateTime, :required => true

  has n,  :authorities
  has n,  :subscriptions, :through => :authorities

  belongs_to :tenant,  :parent_key => [:id], :child_key => [:tenant]
  belongs_to :manager, self 
end
    An awful lot more insightful if you ask me, and I actually stripped out 50% of the metadata to avoid distracting you. We can see that:
  Columns
     > Email must be unique, it has a min and max length and a specific format.
     > Deleted_At has the special type ParanoidDateTime, which means deletes are logical not physical (see the sketch after this list).
  Child Tables {Try finding this out in SQL…}
     > Authorities has an FK that depends on this table (one to many)
     > A relationship to Subscriptions through Authorities (many to many)
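    To show what that paranoid type means in practice, here's a quick sketch (the email address is invented; the soft-delete behaviour is dm-types' ParanoidDateTime as I understand it):

user = User.first(:email => 'jane@example.com')
user.destroy    # issues an UPDATE setting deleted_at, not a DELETE
User.all        # later queries automatically exclude the 'deleted' rows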
    We're getting a much richer set of metadata here, and it's being specified this way because it's useful for the developer, not because we're trying to specify a top-down data dictionary. The really interesting thing about the ORM example is that nothing prevents us from enriching this further (see the sketch below). We are not bound by the constraints of SQL 99/2003/etc and the way it's been implemented by the vendor.
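    For example, with dm-validations we can attach a business rule that the MySQL DDL above has no way to enforce (the rule itself is hypothetical, just to illustrate):

class User
  # Only corporate email addresses may register -- try expressing that in MySQL DDL.
  validates_with_method :email, :method => :corporate_email?

  def corporate_email?
    return true if email =~ /@example\.com\z/
    [ false, 'Please use your corporate email address.' ]
  end
end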
     We've been talking about the power and importance of metadata for at least 10 years and, truthfully, we've made almost no progress. Every new BI-DW-ETL-OLAP project I work on still has to start more or less from nothing. The rise of ORMs creates an inflection point where we can change that if we become involved in the systems early in their lifecycle.

An aside on "metadata layers"
    There is another point here and it's important. We could make a lot of progress simply by using ORMs to interact with our existing (so-called 'legacy') databases. DataMapper has a lot of features and tricks for accommodating these databases, and it's open source so we can add anything else we need.
    Imagine an ETL tool that interacted with the database via DataMapper instead of using ODBC/JDBC plus its own metadata. You would start by declaring a very simple model, just table-column-datatype, and then as you learned more about the data you would capture that learning (new metadata) in the ORM itself. I think that's an incredibly powerful concept. The ETL tool becomes purely for orchestration, and all of the knowledge about how to interact with sources and destinations is held in a way that is usable by downstream tools (like a reporting tool or another ETL process).
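    Here's a sketch of what step one might look like against a legacy table (the table and column names are invented):

class LegacyCustomer
  include DataMapper::Resource

  # Map this model onto the existing legacy table
  self.storage_names[:default] = 'TBL_CUST_MASTER'

  property :id,   Integer, :key => true, :field => 'CUST_NO'
  property :name, String,  :field => 'CUST_NM'
  # As you learn more about the data, capture it here:
  # validations, formats, defaults, relationships...
end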
    I imagine this is what the Business Objects guys had in mind when they created their metadata layer ('Universes' in BO parlance) back in the 90s. By my reckoning they didn't quite get there. The re-use of Universe metadata for other processes is (in my experience) non-existent. Yet here is a universal metadata layer: spontaneously realised and completely open for everyone to take advantage of.

ORMs will ultimately write better SQL
    Third, ORMs can generate much better SQL than people do. The human element in any technology is the most unpredictable part. It's generally not a scheduled report that brings down the database; it's usually a badly formed query submitted ad hoc by a user. Maybe ORMs don't beat a good SQL hand-coder yet, but the existence of DataMapper indicates we're getting close.
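    For instance, DataMapper builds up chained conditions lazily and composes them into a single query rather than filtering rows in application code (column names are from the User model above; the generated SQL is approximate):

recent = User.all(:created_at.gte => Date.today - 30)   # no SQL issued yet
admins = recent.all(:permission => 3)                   # still no SQL
admins.each { |u| puts u.email }
# One query fires here, roughly:
#   SELECT ... FROM users WHERE created_at >= '...' AND permission = 3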
    Clearly SQL is a complex domain and it will take time for ORMs to be able to cover all of the edge cases, particularly in analytics. However, let me refer you to the previous discussion of Business Objects Universes. If you review the SQL BO generates, you'll see that the bar is not set very high.

ORMs are blessed by the kingmakers
    Fourth, developers love ORMs. Stephen O'Grady from RedMonk wrote recently that developers are the new/old kingmakers. He has a great quote from Abraham Lincoln: “With public sentiment, nothing can fail; without it nothing can succeed.” ORMs have the kind of positive sentiment that your fancy data dictionary / master data / shared metadata project could never dream of. Developers want to use an ORM for their projects because it helps them. They want to stuff all of the beautiful metadata in there. They want to take huge chunks of business logic out of spaghetti code and put them into a place where we can get at them and reuse them. Who are we to say they shouldn't?

A final thought on BI becoming "operationalised"
    My final thought for you is that the new 'edge' of BI is in putting classical BI functionality into operational apps, particularly web apps. If you think this post is a call for existing BI companies to get onboard with ORMs then you are only half right. It's also a warning that the way data is accessed is changing, and that holding a core BI-DW skillset may someday soon feel a bit like being a mainframe specialist.
