The premise of this post is that ORMs are a disruptive innovation for the all parts of the IT industry that utilise databases, particularly relational databases. I'm particularly interested in the ultimate impact of ORMs on my work in the BI-DW-OLAP-DSS-{insert acronym here} industry.
What makes a 'disruptive technology'?
In case you are not familiar with the "innovator's dilemma" concept; it was originally expressed in those terms by Clayton M. Christensen in the article 'Disruptive Technologies: Catching the Wave'.
"Generally, disruptive innovations were technologically straightforward, consisting of off-the-shelf components put together in a product architecture that was often simpler than prior approaches. They offered less of what customers in established markets wanted and so could rarely be initially employed there. They offered a different package of attributes valued only in emerging markets remote from, and unimportant to, the mainstream."
Let's talk about ORMs
If you are a true BI-DW person you may not have heard of ORM and are unlikely to have come across one directly in your work. An ORM [Object-Relational Mapper] is simply a set of code routines that 'map' tables and columns in a relational database to objects, attributes and methods in a programming language. Programmers can then interact with the data stored by the underlying database without writing any SQL.
An ugly history with DBAs
Now, as you'd probably imagine, DBA's hate ORMs and for good reason. They have typically produced horrible SQL and correspondingly awful performance problems for the DBAs to deal with. ORM use in "enterprise" IT environments is patchy and somewhat limited. It seems like a lot enterprise ORM use is kept out of sight and only comes to light when the DBAs get really fired up about some bad SQL that keeps reappearing every time an update to the software is released.
The early ORMs were rightly criticised (Hibernate seems to have taken the most heat) but ORMs haven't gone away. The sweet spot for early ORMs was small and 'simple' transactional applications. The kind of app that is needed quickly and where imperfect SQL was not a huge issue. But ORMs keep evolving and becoming more sophisticated in the way they generate SQL and deal with databases. This is where the disruptive innovation part comes in.
The hockey stick graph
Looking at the graph from the Wikipedia article I linked above you can see that ORMs started in the bottom left "low quality use" corner. My entire point for this post is that ORMs are going to follow the "disruptive technology" curve and eventually they will come be the dominate way in which ALL database access occurs. Seriously.
Let me explain why I see this happening. There are 3 good technical reasons and a human reason. As usual the human reason is the trump card.
ORMs are getting better quickly
First, we're seeing much better ORMs developed. In particular I want to draw your attention to Datamapper. It's a Ruby ORM that's been around for about 2 and a half years. The interesting thing about Datamapper (for me) is how much respect it has for the database. DB access is designed to minimise the number of queries hitting the backend and at the same to minimise the data being pulled out unnecessarily (i.e. only get Text/Blob fields if you really want them). Here's the kicker though: it supports foreign keys. Real (honest-to-goodness, enforced-by-the-database) foreign keys. Nice.
ORMs are the ultimate metadata layer
Second, because an ORM is deeply involved in the application itself it can contain a much richer set of metadata about the data that's being stored. Compare the following SQL DDL with the equivalent ORM setup code.
CREATE TABLE users (
id int(10) NOT NULL AUTO_INCREMENT,
tenant int(10) NOT NULL,
first_name varchar(50) NOT NULL,
last_name varchar(50) NOT NULL,
title varchar(50) DEFAULT NULL,
email varchar(99) DEFAULT NULL,
passhash varchar(50) DEFAULT NULL,
salt varchar(50) DEFAULT NULL,
permission int(11) DEFAULT '1',
created_at datetime NOT NULL,
updated_at datetime DEFAULT NULL,
deleted_at datetime NOT NULL DEFAULT '2999-12-31',
manager_id int(10) NOT NULL,
PRIMARY KEY (id, tenant),
UNIQUE INDEX unique_users_email (email),
INDEX index_users_manager (manager_id),
INDEX users_tenant_fk (tenant),
CONSTRAINT users_tenant_fk FOREIGN KEY (tenant)
REFERENCES tenants (id)
ON DELETE NO ACTION
ON UPDATE NO ACTION,
CONSTRAINT users_manager_fk FOREIGN KEY (manager_id)
REFERENCES users (id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
);
What can we tell about this table? It's got an FK to itself on 'manager_id' and another to 'tenants' on 'tenant'. We don't gain a lot of insight. Here's the Datamapper syntax:
class User
include DataMapper::Resource
property :id, Serial
property :tenant, Integer, :min => 0, :required => true, :key => true
property :first_name, String, :required => true
property :last_name, String, :required => true
property :title, String
property :email, String, :length => (5..99), :unique => true,
:format => :email_address,
:messages => {:presence => 'We need your email address.',
:is_unique => 'That email is already registered.',
:format => "That's not an email address" }
property :passhash, String
property :salt, String
property :permission, Integer, :default => 1
property :phone, String
property :mobile, String
property :created_at, DateTime, :required => true
property :updated_at, DateTime
property :deleted_at, ParanoidDateTime, :required => true
has 1, :manager
has n, :authorities
has n, :subscriptions, :through => :authorities
belongs_to :tenant, :parent_key => [:id], :child_key => [:tenant]
belongs_to :manager, self
end
An awful lot more insightful you ask me, and I actually stripped out 50% of the metadata to avoid distracting you. We can see that:
Columns
> Email must be unique, it has a Min and Max length and a specific format.
> Deleted_At has a special type ParanoidDateTime, which means deletes are logical not physical.
Child Tables {Try finding this out in SQL…}
> The table has an FK that depends on it from Authorities (1 to many)
> a relationship to Subscriptions through Authorities (many to many)
We're getting a much richer set of metadata here and it's being specified this way because it's useful for the developer not because we're trying to specify a top-down data dictionary. The really interesting thing about the ORM example is that nothing prevents us from enriching this further. We are not bound by the constraints of SQL 99/2003/etc and the way it's been implemented by the vendor.
We've been talking about the power and importance of metadata for at least 10 years and, truthfully, we've made almost no progress. Every new BI-DW-ETL-OLAP project I work on still has to start more or less from nothing. The rise of ORMs creates an inflection point where we can change that if we become involved in the systems early in their lifecycle.
An aside on "metadata layers"
There is another point here and it's important. We could make a lot of progress simply by using ORMs to interact with our existing (so called 'legacy') databases. Datamapper has a lot features and tricks for accommodating theses databases and it's open source so we can add anything else we need.
Imagine an ETL tool that interacted with the database via Datamapper instead of using ODBC/JDBC plus it's own metadata. You would start by declaring a very simple model, just table-column-datatype, and then as you learned more about the data you would specify that learning (new metadata) in the the ORM itself. I think that's an incredibly powerful concept. The ETL becomes purely for orchestration and all of the knowledge about how to interact with sources and destinations is held in a way that is usable by downstream tools (like a reporting tool or another ETL process).
I imagine this is what the Business Objects guys had in mind when they created their metadata layer ('Universes' in BO parlance) back in the 90s. To my reckoning they didn't quite get there. The re-use of Universe metadata for other processes is (in my experience) non-existent. Yet here is a universal metadata layer; spontaneously realised and completely open for everyone to take advantage of.
ORMs will ultimately write better SQL
Third, ORMs can generate much better SQL than people do. The human element in any technology is the most unpredictable. It's generally not a scheduled report that brings down the database. It's usually a badly formed query submitted ad-hoc by a user. Maybe this isn't the case right now but the existence of Datamapper indicates we're getting close.
Clearly SQL is a complex domain and it will take time for ORMs to be able to cover all of the edge cases, particularly in analytics. However, let me refer you to the previous discussion of Business Objects Universes. If you review the SQL BO generates you'll see that the bar is not set all very high.
ORMs are blessed by the kingmakers
Fourth, developers love ORMs. Stephen O'Grady from RedMonk wrote recently that developers are the new/old kingmakers. He has a great quote from Abraham Lincoln “With public sentiment, nothing can fail; without it nothing can succeed.” ORMs have the kind of positive sentiment that your fancy data dictionary / master data / shared metadata project could never dream of. Developers want to use an ORM for their projects because it helps them. They want to stuff all of the beautiful metadata in there. They want to take huge chunks business logic out of spaghetti code and put into a place where we can get at it and reuse it. Who are we to say they shouldn't?
A final thought on BI becoming "operationalised"
My final thought for you is that the new 'edge' of BI is in putting classical BI functionality into operational apps, particularly web apps. If you think this post is a call for existing BI companies to get onboard with ORMs then you are only half right. It's also a warning that the way data is accessed is changing and a lot of core BI-DW skillsets may feel a bit like being a mainframe specialist someday soon.
Friday, 22 October 2010
Friday, 8 October 2010
The easy way to go open source in BI-DW: slipsteaming
I'd like to propose a slightly devious strategy for getting open source Business Intelligence & Data Warehousing into your company. You've probably heard a lot about open source BI / DW offerings in the last few years. You're kind of self-selected into that group by simply reading this post! However, just in case, I'll wrap up a few of the leading lights for you. This is by no means comprehensive, consider it an invitation to do some 'googling'.
Open Source Reporting / Dashboards
> Pentaho: Open Core offering, very complete BI suite, the standard bearer IMO
> Jaspersoft: Open Core offering, very complete BI suite
> Actuate/BIRT: BIRT is very open, other offerings less clear, more OEM focused
> SpagoBI: The 'most' open of the FLOSS BI offerings, possibly less mature
> Palo: I'm kind of confused about what's open/closed but I hear good things
Open Source ETL / Data Integration
> Pentaho PDI/Kettle: Stream based, Open Core, excellent for multi-core/server
> Talend: Code generator, Open Core, suited to single big machine processing
> Palo ETL: Tightly integrated with Palo suite, if you like Palo give it a look
> CloverETL: New offering with a lot of features 'checked off'
> Apatar: Another new offering with a lot of features claimed
Open Source Analytic Databases
> Infobright: MySQL storage engine, Open Core, good compression, limitedDDL SQL
> InfiniDB: MySQL storage engine, Open Core, maximises modern CPUs , limitedDDL SQL
> LucidDB: Java based, completely open, good compression, supports most of SQL
> MonetDB: completely open but not very active, good compression, likes a lot of RAM
> VectorWise: promising to become open soon, maximises modern CPUs, good 'buzz'
> Greenplum: kinda-sorta open, free 'single node edition', good SQL support
Open Source OLAP
> Pentaho Analysis/Mondrian: Mature tool, Open Core, likes a fast DB underneath
> Palo: Well regarded OLAP, nice options for Excel use, tightly integrated with suite
How do you bring it in?
OK, with that out of the way, how can we bring open source into businesses that already have some sort of BI-DW infrastructure in place? One of the problems that open source faces is free licenses don't buy fast talking salespeople who'll come and woo senior managers and executives. So we often have to bring it in by the back door. You're not going to rip out the existing software and replace it with your new shiny open source alternative. You need to find pain points where the business is not getting what it needs but is blocked from getting something better, usually for political or financial reasons.
Typical pain points
Let's talk about some typical pain points. Is the main DW constantly overloaded by queries? Are some team's queries throttled because they're not considered important enough? Do you have an analysis team that is not allowed to run the complex queries that they'd like to? Do you have policy of killing queries that run over an certain time and it is killing a lot of queries? Does it take a *very* long time to produce the daily report burst? Has a certain team asked for Excel ODBC access to the DW and been blocked? Do some teams want to load their own data in the DW but are not allowed to? Do more people want access to ad-hoc reporting but you can't afford the licenses? Is your ETL development slow because you can't afford any more server licenses for your expensive tool? Are you still doing your ETL jobs as hand coded SQL?
Finally - the splistreaming strategy
If your company has more than 500 people I bet I could easily find at least 3 of those. These are the areas where you can implement open source first. You will be using a strategy that I call 'slipstreaming'. Have you ever watched the Tour De France on television? Did you notice that Lance Armstrong almost never rode at the front of the group? He always sat behind his team mates (in the slipstream) to conserve energy so he could attack at the end or breakaway on the climbs. Sitting behind his team reportedly saves 25% of his energy.
Open source as the 'domestique'
The bad news: your open source efforts are not in Lance's position. You are going to be the team mate out in front cutting the wind (a domestique). You need to find a pain point where you can put the open source solution in front of the existing solution to 'cut the wind'. Essentially you are going to make the existing solution work better by taking some of the demand away. You will then present this as a 'business-as-usual' or 'tactical' solution to your problem. You need to be very careful to position the work correctly. You goal is to run this as a small project within your team. Be careful to keep the scope down. Talk about fire-fighting, taking the pressure off, etc. I'm sure you'll know how to position it in you company. You don't want project managers or architecture astronauts getting involved and making things complicated.
How about some examples?
The data warehouse edge copy:
You have an Oracle based DW. It's been around for a few while and is suffering, despite hardware upgrades. The overnight load barely finishes by 8:30 and the daily report burst has been getting bigger and only finishes around 10:30 (sometimes 11). The customer insight team has been completely banned from running anything until the reports are out. They're not happy about starting their queries later and later.
Edge copy slipstream strategy
The slipstream solution to this is to add a edge copy of the DW between either the daily report run or the insight team. You should be able to make use of a reclaimed server (that the DW ran on previously) or you can purchase a "super PC" (basically a gaming machine with extra hard disks). The edge copy will run one of the analytic databases I mentioned. On an older machine I'd lean towards LucidDB or Infobright because of their compression. You then add a new step to the ETL that copies over just the changed data from the DW, or a time-limited subset, to the edge machine. Finally you switch them over to the edge copy. If your edge copy takes a while to load (for whatever reason) then talk to the insight team about running an extra day behind everyone else. You'll probably find that they're happy to run a day behind if they have a database to themselves, no restrictions.
The ETL co-worker:
You use Business Object's Data Integrator for your ETL processing. You've got a 4-core license and the word has come down that you are not getting any more. Your processing window is completely taken up with the exsiting run. ETL development has become a 1-in-1-out affair where new requests can only be delivered by killing something else. The DW devs have started using hand coded routines in the warehouse to deliver work that has political priority.
Co-worker slipstream strategy
The slipstream solution to this is to add an open source ETL tool as a co-worker to the existing processing. The idea is to leave all of the existing processing on BODI but put new requests onto the new open source package. Again you need to identify either a older server that you can reclaim or source a super-PC to run on. Think carefully about the kind of work that can be best done on the co-worker process. Isolated processes are best. You can also do a lot of post loading activities like data cleanup and aggregations. Once you've established the co-worker as a valid and reliable ETL solution then you should aim to set a policy that any existing ETL processing that has to be changed is moved to the new tool at the same time.
Be devious. Be political. But be nice.
Don't tell them, show them.
Ask forgiveness, not permission.
I wish you luck.
Open Source Reporting / Dashboards
> Pentaho: Open Core offering, very complete BI suite, the standard bearer IMO
> Jaspersoft: Open Core offering, very complete BI suite
> Actuate/BIRT: BIRT is very open, other offerings less clear, more OEM focused
> SpagoBI: The 'most' open of the FLOSS BI offerings, possibly less mature
> Palo: I'm kind of confused about what's open/closed but I hear good things
Open Source ETL / Data Integration
> Pentaho PDI/Kettle: Stream based, Open Core, excellent for multi-core/server
> Talend: Code generator, Open Core, suited to single big machine processing
> Palo ETL: Tightly integrated with Palo suite, if you like Palo give it a look
> CloverETL: New offering with a lot of features 'checked off'
> Apatar: Another new offering with a lot of features claimed
Open Source Analytic Databases
> Infobright: MySQL storage engine, Open Core, good compression, limited
> InfiniDB: MySQL storage engine, Open Core, maximises modern CPUs , limited
> LucidDB: Java based, completely open, good compression, supports most of SQL
> MonetDB: completely open but not very active, good compression, likes a lot of RAM
> VectorWise: promising to become open soon, maximises modern CPUs, good 'buzz'
> Greenplum: kinda-sorta open, free 'single node edition', good SQL support
Open Source OLAP
> Pentaho Analysis/Mondrian: Mature tool, Open Core, likes a fast DB underneath
> Palo: Well regarded OLAP, nice options for Excel use, tightly integrated with suite
How do you bring it in?
OK, with that out of the way, how can we bring open source into businesses that already have some sort of BI-DW infrastructure in place? One of the problems that open source faces is free licenses don't buy fast talking salespeople who'll come and woo senior managers and executives. So we often have to bring it in by the back door. You're not going to rip out the existing software and replace it with your new shiny open source alternative. You need to find pain points where the business is not getting what it needs but is blocked from getting something better, usually for political or financial reasons.
Typical pain points
Let's talk about some typical pain points. Is the main DW constantly overloaded by queries? Are some team's queries throttled because they're not considered important enough? Do you have an analysis team that is not allowed to run the complex queries that they'd like to? Do you have policy of killing queries that run over an certain time and it is killing a lot of queries? Does it take a *very* long time to produce the daily report burst? Has a certain team asked for Excel ODBC access to the DW and been blocked? Do some teams want to load their own data in the DW but are not allowed to? Do more people want access to ad-hoc reporting but you can't afford the licenses? Is your ETL development slow because you can't afford any more server licenses for your expensive tool? Are you still doing your ETL jobs as hand coded SQL?
Finally - the splistreaming strategy
If your company has more than 500 people I bet I could easily find at least 3 of those. These are the areas where you can implement open source first. You will be using a strategy that I call 'slipstreaming'. Have you ever watched the Tour De France on television? Did you notice that Lance Armstrong almost never rode at the front of the group? He always sat behind his team mates (in the slipstream) to conserve energy so he could attack at the end or breakaway on the climbs. Sitting behind his team reportedly saves 25% of his energy.
Open source as the 'domestique'
The bad news: your open source efforts are not in Lance's position. You are going to be the team mate out in front cutting the wind (a domestique). You need to find a pain point where you can put the open source solution in front of the existing solution to 'cut the wind'. Essentially you are going to make the existing solution work better by taking some of the demand away. You will then present this as a 'business-as-usual' or 'tactical' solution to your problem. You need to be very careful to position the work correctly. You goal is to run this as a small project within your team. Be careful to keep the scope down. Talk about fire-fighting, taking the pressure off, etc. I'm sure you'll know how to position it in you company. You don't want project managers or architecture astronauts getting involved and making things complicated.
How about some examples?
The data warehouse edge copy:
You have an Oracle based DW. It's been around for a few while and is suffering, despite hardware upgrades. The overnight load barely finishes by 8:30 and the daily report burst has been getting bigger and only finishes around 10:30 (sometimes 11). The customer insight team has been completely banned from running anything until the reports are out. They're not happy about starting their queries later and later.
Edge copy slipstream strategy
The slipstream solution to this is to add a edge copy of the DW between either the daily report run or the insight team. You should be able to make use of a reclaimed server (that the DW ran on previously) or you can purchase a "super PC" (basically a gaming machine with extra hard disks). The edge copy will run one of the analytic databases I mentioned. On an older machine I'd lean towards LucidDB or Infobright because of their compression. You then add a new step to the ETL that copies over just the changed data from the DW, or a time-limited subset, to the edge machine. Finally you switch them over to the edge copy. If your edge copy takes a while to load (for whatever reason) then talk to the insight team about running an extra day behind everyone else. You'll probably find that they're happy to run a day behind if they have a database to themselves, no restrictions.
The ETL co-worker:
You use Business Object's Data Integrator for your ETL processing. You've got a 4-core license and the word has come down that you are not getting any more. Your processing window is completely taken up with the exsiting run. ETL development has become a 1-in-1-out affair where new requests can only be delivered by killing something else. The DW devs have started using hand coded routines in the warehouse to deliver work that has political priority.
Co-worker slipstream strategy
The slipstream solution to this is to add an open source ETL tool as a co-worker to the existing processing. The idea is to leave all of the existing processing on BODI but put new requests onto the new open source package. Again you need to identify either a older server that you can reclaim or source a super-PC to run on. Think carefully about the kind of work that can be best done on the co-worker process. Isolated processes are best. You can also do a lot of post loading activities like data cleanup and aggregations. Once you've established the co-worker as a valid and reliable ETL solution then you should aim to set a policy that any existing ETL processing that has to be changed is moved to the new tool at the same time.
Be devious. Be political. But be nice.
Don't tell them, show them.
Ask forgiveness, not permission.
I wish you luck.
Tuesday, 5 October 2010
The trouble with SaaS BI - it's all about the data
Some data was released yesterday that purports to show that SaaS BI customer's are very pleased with it's ease of use, etc., etc. Boring. Seriously, I really like the idea of SaaS BI but I haven't seen anyone making great leaps forward. I'd say that they *can't* take us forward because of the box that they've painted themselves into. The box actually has a name: it's called BI.
The BI sandbox
Eh? What? Here's the thing; BI as we currently know it is the last stage in the information pipeline. It's the beautiful colours on the box that holds the cereal. But it's not the cereal and it's not even the box. It is *very* important (who would buy cereal in a plain cardboard box?) but is also *very* dependent on other elements in the pipeline.
I don't want to get into a long discussion about definitions of BI. Suffice it to say this: why are terms like 'data warehouse' and 'OLAP cube' still prevalent? Simply because BI does not imply data gathering, preparation and storage. Last example on this theme. If I tell you I'm a Business Intelligence manager, what would you guess is my remit? Does it include the entire data warehouse? The OLAP cubes? All of the ETL processing? No? It could but it rarely does.
It's all about the data
I once worked for a clever chap who's mantra was "it's all about the data". His daily struggle was to get the business to invest more time, effort and money into the data itself. It was a hard fight. We had a very fast data warehouse (NZ) and some perfectly serviceable BI software (BO) and nearly a dozen newly minted graduates to turn out our reports. What we did not have was a strong mandate to get the data itself absolutely sorted, to get every term clearly defined and to remove all of the wiggle room from the data. As a consequence we had the same problems that so many BI teams have. Conflicting numbers, conflicting metrics, and political battles using our data as ammunition.
Data is the 'other' 90%
I'd estimate that gathering, preparing, and storing the data for BI represents at least 90% of the total effort, with analysis and presentation being the last 10%. I really hope no one is surprised by that figure. I'd think that figure is consistent for any situation in which decisions need to be made from data. For instance a scientist in a lab would have to spend a lot of time collecting and collating measurements before she could do the final work of analyzing the results. A research doctor conducting a study will have to collect, organize and standardize all of the results study data before he can begin to evaluate the outcome.
It's NOT about speed
One of the tragedies of the Inmon-Kimball data warehouse definition war is the data warehouse has been conceived as something that you create because you want to speed up your data access. It's implied that we'd prefer to leave the data in it's original systems if we could, but alas that would be too slow to do anything with. What a load of tosh! Anyone who's been in the trenches knows that the *real* purpose of a data warehouse is to organize and preserve the data somewhere safe away from the many delete-ers and archive-ers of the IT world. We value the data for it's own sake and believe it deserves the respect of being properly stored and treated.
Nibbling at the edges
So, back to the topic, how does SaaS BI help with this issue? Let's assume that SaaS BI does what it claims and makes it much easier for "users" to produce reporting and analysis. Great, how much effort have we saved? Even if it takes half as much time and effort we've only knocked 5% off our total.
The real opportunity
And finally I come to my point: the great untapped opportunity for the SaaS [BI-DW-OLAP-ETL] acronym feast is the other 90% where the most of the hard work happens. Customers are increasingly using online applications in place of their old in-house apps. Everything from ERP to Invoicing to call centre IVRs and diallers are moving to a SaaS model. And every SaaS service that's worth it's salt offers an open API for accessing the data that they hold.
The holy grail - instant data
This is the mother-load, the shining path for data people. Imagine an end to custom integrations for each customer. Imagine an end to customers having to configure they're own ETL and design their own data warehouse before they can actually do anything with their data. The customer simply signs up to the service and you instantly present them with ready to use data. Magic. Sounds like a service worth paying for.
The BI sandbox
Eh? What? Here's the thing; BI as we currently know it is the last stage in the information pipeline. It's the beautiful colours on the box that holds the cereal. But it's not the cereal and it's not even the box. It is *very* important (who would buy cereal in a plain cardboard box?) but is also *very* dependent on other elements in the pipeline.
I don't want to get into a long discussion about definitions of BI. Suffice it to say this: why are terms like 'data warehouse' and 'OLAP cube' still prevalent? Simply because BI does not imply data gathering, preparation and storage. Last example on this theme. If I tell you I'm a Business Intelligence manager, what would you guess is my remit? Does it include the entire data warehouse? The OLAP cubes? All of the ETL processing? No? It could but it rarely does.
It's all about the data
I once worked for a clever chap who's mantra was "it's all about the data". His daily struggle was to get the business to invest more time, effort and money into the data itself. It was a hard fight. We had a very fast data warehouse (NZ) and some perfectly serviceable BI software (BO) and nearly a dozen newly minted graduates to turn out our reports. What we did not have was a strong mandate to get the data itself absolutely sorted, to get every term clearly defined and to remove all of the wiggle room from the data. As a consequence we had the same problems that so many BI teams have. Conflicting numbers, conflicting metrics, and political battles using our data as ammunition.
Data is the 'other' 90%
I'd estimate that gathering, preparing, and storing the data for BI represents at least 90% of the total effort, with analysis and presentation being the last 10%. I really hope no one is surprised by that figure. I'd think that figure is consistent for any situation in which decisions need to be made from data. For instance a scientist in a lab would have to spend a lot of time collecting and collating measurements before she could do the final work of analyzing the results. A research doctor conducting a study will have to collect, organize and standardize all of the results study data before he can begin to evaluate the outcome.
It's NOT about speed
One of the tragedies of the Inmon-Kimball data warehouse definition war is the data warehouse has been conceived as something that you create because you want to speed up your data access. It's implied that we'd prefer to leave the data in it's original systems if we could, but alas that would be too slow to do anything with. What a load of tosh! Anyone who's been in the trenches knows that the *real* purpose of a data warehouse is to organize and preserve the data somewhere safe away from the many delete-ers and archive-ers of the IT world. We value the data for it's own sake and believe it deserves the respect of being properly stored and treated.
Nibbling at the edges
So, back to the topic, how does SaaS BI help with this issue? Let's assume that SaaS BI does what it claims and makes it much easier for "users" to produce reporting and analysis. Great, how much effort have we saved? Even if it takes half as much time and effort we've only knocked 5% off our total.
The real opportunity
And finally I come to my point: the great untapped opportunity for the SaaS [BI-DW-OLAP-ETL] acronym feast is the other 90% where the most of the hard work happens. Customers are increasingly using online applications in place of their old in-house apps. Everything from ERP to Invoicing to call centre IVRs and diallers are moving to a SaaS model. And every SaaS service that's worth it's salt offers an open API for accessing the data that they hold.
The holy grail - instant data
This is the mother-load, the shining path for data people. Imagine an end to custom integrations for each customer. Imagine an end to customers having to configure they're own ETL and design their own data warehouse before they can actually do anything with their data. The customer simply signs up to the service and you instantly present them with ready to use data. Magic. Sounds like a service worth paying for.
Monday, 4 October 2010
Buying results versus buying potential in business IT
It's all about potential
You have probably noticed that business IT (in general) is very expensive. In fact we (the IT industry) invented a special word to justify the expense: Enterprise. We use the word Enterpise to imply that a product or service is: robust, sophisticated, reliable, professional, complex and (most of all) valuable. But if you look deeper you might notice something interesting about "enterprise" IT; it's all about potential. The most successful IT products and companies make all their money by selling potential. Oracle databases have the potential to handle to world's largest workloads. SAP software has the potential to handle the processes of the world's biggest companies. IBM servers have the potential to run the world's most demanding calculations and applications.You're buying what could be
Selling potential is insanely lucrative. After all you're not buying what actually is, you're buying what could be. You're not buying Oracle database to simply keep track of your local scrap metal business; you're buying a Oracle to keep working while you become the biggest scrap metal business in the world. You're not buying SAP to tame your 20 site tool hire business processes, you're buying SAP to help you become the world leader in tool hire. And when the purchase is framed like this customers actually want to pay more. I've been involved in more than one discussion where suppliers were eliminated from consideration for being too cheap. It was understood that they couldn't be "enterprise enough" at that price point.
Really expensive DIY
The funny thing is that buying potential actually means you'll have to do it yourself. This is actually the defining characteristic of enterprise IT; whatever it costs you'll spend the same again to get it working. You don't simply install the Oracle database and press run. You can't just install SAP on everyone's desktop. You need to hire experts, create a strategy, run a long project, put it live over a long weekend, perform extensive tuning, etc., etc. Of course your enterprise IT supplier will be happy to help with all this, but that'll cost extra.
Small business need results
Ironically, small businesses want to buy the exact opposite of potential. They want to buy results. Strike that; they need to buy results. A specific action should result in a specific outcome. In a small business you can't afford to waste time and money on something that might be great. What is required is something that is OK right now. I think the web is perfect for delivering this kind of service and I think that's why small businesses are embracing web services like Basecamp, Freshbooks and (hopefully) my very own AppConductor.com. I think there is latent demand for better technology in small businesses. They can't risk spending big money on potential but they're happy to try out a service will a small monthly fee. If it helps them they'll keep paying the monthly fee and if it doesn't then they won't have lost much.
Harris law of IT spending
I think this is so powerful I'm going to make it a law. The Harris law of IT spending: "Pay only for results. Never buy potential".
P.S. You might notice that open source embodies my law completely. The thing that has potential is free, and then you spend time/money on results.
Selling potential is insanely lucrative. After all you're not buying what actually is, you're buying what could be. You're not buying Oracle database to simply keep track of your local scrap metal business; you're buying a Oracle to keep working while you become the biggest scrap metal business in the world. You're not buying SAP to tame your 20 site tool hire business processes, you're buying SAP to help you become the world leader in tool hire. And when the purchase is framed like this customers actually want to pay more. I've been involved in more than one discussion where suppliers were eliminated from consideration for being too cheap. It was understood that they couldn't be "enterprise enough" at that price point.
Really expensive DIY
The funny thing is that buying potential actually means you'll have to do it yourself. This is actually the defining characteristic of enterprise IT; whatever it costs you'll spend the same again to get it working. You don't simply install the Oracle database and press run. You can't just install SAP on everyone's desktop. You need to hire experts, create a strategy, run a long project, put it live over a long weekend, perform extensive tuning, etc., etc. Of course your enterprise IT supplier will be happy to help with all this, but that'll cost extra.
Small business need results
Ironically, small businesses want to buy the exact opposite of potential. They want to buy results. Strike that; they need to buy results. A specific action should result in a specific outcome. In a small business you can't afford to waste time and money on something that might be great. What is required is something that is OK right now. I think the web is perfect for delivering this kind of service and I think that's why small businesses are embracing web services like Basecamp, Freshbooks and (hopefully) my very own AppConductor.com. I think there is latent demand for better technology in small businesses. They can't risk spending big money on potential but they're happy to try out a service will a small monthly fee. If it helps them they'll keep paying the monthly fee and if it doesn't then they won't have lost much.
Harris law of IT spending
I think this is so powerful I'm going to make it a law. The Harris law of IT spending: "Pay only for results. Never buy potential".
P.S. You might notice that open source embodies my law completely. The thing that has potential is free, and then you spend time/money on results.
Friday, 1 October 2010
Further thoughts on real-time ETL, the Magic of Queues
Now that we have polling out of the way, let's talk about integrating the data once you know something has changed. If you are doing this kind of thing for the first time then your inclination will be towards "one giant script"; do this > then this > then this > then this > ad infinitum. Avoid this pattern. It's not not practical and it's hard to change. However, in real-time ETL it will come back to bite you - hard.
Real-time = inconsistent
The hard thing about real-time ETL, in my view, is not the continuous processing. It's the inconsistency. One system may receive a large number of changes in a short spell, another system may go hours without a single update. Or all of your systems could (easily) burst into life at exactly 9 AM and hammer you with updates for an hour. But these are just the obvious options.
The dreaded 'mysterious slowdown'
A more insidious problems is the dreaded 'mysterious slowdown'. Let's say you have a dependency on address checking using an app from MoMoneySys. The app runs on it's own server, was installed by the MMS engineers and is not to be interfered with on penalty of death. The app is like a foreign footballer, brilliant when it works, liable to collapse under pressure. If your real-time ETL flows directly through this app then your performance is directly tied to it's performance, not a nice situation to be in.
Time-to-live? Get over it
I mentioned in the previous post that I'm not really interest in talking about how far my ETL process is behind live and this pretty much explains why; it's just meaningless talk. If no updates are happening on the source and my ETL process is empty then my 'time-to-live' is 0. If the app suddenly saturates me with changes then my TTL is going to go down the pan. The best I can do is monitor the average performance and make sure my ETL 'pipe' is 'wide' enough to deliver a performance I'm happy with (another post perhaps…).
Behold the QUEUE
So you need a way to cope with the vagaries of sporadic updates in the source, patchy performance in dependencies and, simply, your desire to remain sane. And behold I present to you the QUEUE. Queueing is an excellent way to break up your processing flow and handle the inconsistent, patchy nature of real-time ETL.
Contrary to the title, there is no magic involved in queues. It's simply a list of jobs that need to be worked. There are any number queueing apps/protocols available; from the complex, like AMQP (http://en.wikipedia.org/wiki/AMQP), to the fairly simple, like BeanstalkD (http://kr.github.com/beanstalkd/), to my personal favourite - the queue table. Jobs are added to the queue and then worked in order. Typically you also give them a priority and add some logic where a job is checked out of the queue to be worked and only deleted on success.
Queues beat the real-time blues
So how can we use queues to help our real-time blues? Basically we use queues to buffer each processing step from the next step. Let's think about processing the steps I've mentioned (above and previous post) in a real-time ETL process that uses internal and external sources.
A real-time ETL example
A. Check for changes
A.1 Poll external web service for changes
A.2 Receive change notices from internal database
B. Retrieve changed data
B.1 Extract changed data from web service
B.2 Extract changed data from internal database
C. Validate changed data
C.1 Append geocoding information (using geonames.org data)
C.2 Process new personal data with external identity verification service
D. Load data into final database (e.g. data warehouse)
This time with queues
A. Check for changes
> Take job from the {poll} queue
A.1 Poll external web service for changes (based on job)
> Add job to the {retrieve} queue
A.2 Receive change notices from internal database
> Add job to the {retrieve} queue
B. Retrieve changed data
> Take job from the {retrieve} queue
> Choose worker 1 or 2 for the job
B.1 Extract changed data from web service
> Load raw changes to the staging area
> Add job to the {validate} queue
B.2 Extract changed data from internal database
> Load raw changes to the staging area
> Add job to the {validate} queue
C. Validate changed data
> Take job from the {validate} queue
> Add job to {geocode} or {identity} queues (or both)
C.1 Append geocoding information (using geonames.org data)
> Take job from the {geocode} queue
> Update staging tables with geocoding.
> Add job to the {final} queue
C.2.a Submit new personal data to external identity verification service
> Take job from the {identity} queue
> Update staging tables with geocoding.
> Add job to the {ready} queue
C.2.b Retrieve verified personal data from verification service
> DON'T WAIT! Use a separate listener to receive the data back
> Update staging tables as required (or any other work…)
> Add job to the {final} queue
D. Load data into final database (e.g. data warehouse)
> Take jobs from {final}
> Load data in the final database
A mental model to finish
I've glossed over some issues here but you get the picture. You never want to have any part of your real-time ETL process waiting on another process that is still working. I like to picture it as a warehouse 'pick line': In normal ETL you ask each person to hand off directly to the next person. They do their work and then, if the next person isn't ready, they have to wait to hand off. In real-time ETL, there is a 'rolling table' (queue) between each person. They pick their items off the end of their table, do their work, and then place at the start of the next table.
Real-time = inconsistent
The hard thing about real-time ETL, in my view, is not the continuous processing. It's the inconsistency. One system may receive a large number of changes in a short spell, another system may go hours without a single update. Or all of your systems could (easily) burst into life at exactly 9 AM and hammer you with updates for an hour. But these are just the obvious options.
The dreaded 'mysterious slowdown'
A more insidious problems is the dreaded 'mysterious slowdown'. Let's say you have a dependency on address checking using an app from MoMoneySys. The app runs on it's own server, was installed by the MMS engineers and is not to be interfered with on penalty of death. The app is like a foreign footballer, brilliant when it works, liable to collapse under pressure. If your real-time ETL flows directly through this app then your performance is directly tied to it's performance, not a nice situation to be in.
Time-to-live? Get over it
I mentioned in the previous post that I'm not really interest in talking about how far my ETL process is behind live and this pretty much explains why; it's just meaningless talk. If no updates are happening on the source and my ETL process is empty then my 'time-to-live' is 0. If the app suddenly saturates me with changes then my TTL is going to go down the pan. The best I can do is monitor the average performance and make sure my ETL 'pipe' is 'wide' enough to deliver a performance I'm happy with (another post perhaps…).
Behold the QUEUE
So you need a way to cope with the vagaries of sporadic updates in the source, patchy performance in dependencies and, simply, your desire to remain sane. And behold I present to you the QUEUE. Queueing is an excellent way to break up your processing flow and handle the inconsistent, patchy nature of real-time ETL.
Contrary to the title, there is no magic involved in queues. It's simply a list of jobs that need to be worked. There are any number queueing apps/protocols available; from the complex, like AMQP (http://en.wikipedia.org/wiki/AMQP), to the fairly simple, like BeanstalkD (http://kr.github.com/beanstalkd/), to my personal favourite - the queue table. Jobs are added to the queue and then worked in order. Typically you also give them a priority and add some logic where a job is checked out of the queue to be worked and only deleted on success.
Queues beat the real-time blues
So how can we use queues to help our real-time blues? Basically we use queues to buffer each processing step from the next step. Let's think about processing the steps I've mentioned (above and previous post) in a real-time ETL process that uses internal and external sources.
A real-time ETL example
A. Check for changes
A.1 Poll external web service for changes
A.2 Receive change notices from internal database
B. Retrieve changed data
B.1 Extract changed data from web service
B.2 Extract changed data from internal database
C. Validate changed data
C.1 Append geocoding information (using geonames.org data)
C.2 Process new personal data with external identity verification service
D. Load data into final database (e.g. data warehouse)
This time with queues
A. Check for changes
> Take job from the {poll} queue
A.1 Poll external web service for changes (based on job)
> Add job to the {retrieve} queue
A.2 Receive change notices from internal database
> Add job to the {retrieve} queue
B. Retrieve changed data
> Take job from the {retrieve} queue
> Choose worker 1 or 2 for the job
B.1 Extract changed data from web service
> Load raw changes to the staging area
> Add job to the {validate} queue
B.2 Extract changed data from internal database
> Load raw changes to the staging area
> Add job to the {validate} queue
C. Validate changed data
> Take job from the {validate} queue
> Add job to {geocode} or {identity} queues (or both)
C.1 Append geocoding information (using geonames.org data)
> Take job from the {geocode} queue
> Update staging tables with geocoding.
> Add job to the {final} queue
C.2.a Submit new personal data to external identity verification service
> Take job from the {identity} queue
> Update staging tables with geocoding.
> Add job to the {ready} queue
C.2.b Retrieve verified personal data from verification service
> DON'T WAIT! Use a separate listener to receive the data back
> Update staging tables as required (or any other work…)
> Add job to the {final} queue
D. Load data into final database (e.g. data warehouse)
> Take jobs from {final}
> Load data in the final database
A mental model to finish
I've glossed over some issues here but you get the picture. You never want to have any part of your real-time ETL process waiting on another process that is still working. I like to picture it as a warehouse 'pick line': In normal ETL you ask each person to hand off directly to the next person. They do their work and then, if the next person isn't ready, they have to wait to hand off. In real-time ETL, there is a 'rolling table' (queue) between each person. They pick their items off the end of their table, do their work, and then place at the start of the next table.
Subscribe to:
Posts (Atom)