Check out this very nice piece by Rick about the reasons why companies have failed to get the most out of their ETL tools.
My take is from the other side of the fence. As a business user I'm often frustated by ETL tools and have been known to campaign against them for the following reasons:
> ETL tools have been too focussed on Extract-Transform-Load and too little focused on actual data integration. I have complex integration challenges that are not necessarily a good fit for the ETL strategy and sometimes I feel like I'm pushing a square peg into a round hole.
> It's still very challenging to generate reusable logic inside ETL tools and this really should be the easiest thing in the world (ever heard the mantra Don't Repeat Yourself!). Often the hoops that have to be jumped through are more trouble than they are worth.
> Some ETL tools are a hodge podge of technologies and approaches with different data types and different syntaxes wherever you look. (SSIS I'm looking at you! This still is not being addressed in Denali.)
> ETL tools are too focused on their own execution engines and fail miserably to take advantage of the processing power of columnar and MPP databases by running processes on the database. This is understandable in open source tools (database specific SQL may be a bridge too far) but in commercial tools it's pathetic.
> Finally, where is the ETL equivalent of SQL? Why are we stuck with incompatible formats for each tool. The design graphs in each tool look very similar and the data they capture is near identical. Even the open source projects have failed to utilise a common format. Very poor show. This is the single biggest obstacle to more widespread ETL. Right now it's much easier for other parts of the stack to stick with SQL and pretend that ETL doesn't exist.
There is a model (The Common Warehouse Metamodel, CWM) for metadata exchange between ETL tools. You're right, none of the open source ETL tools commit to it, but some of the closed source ETL tools do like Informatica and SAS. Sure, it would be nice if there is a SQL equivalent of ETL, but to use the absence of it to promote hard coding SQL will be not the way to go I hope. Several features of ETL tools are so powerful, in serious data integration projects it will give a proper return on investment.ReplyDelete
Thanks for your comment. I'm aware of the CWM but it's one of those things that I've *never* seen used in real life and if you look at the implementations they all seem a bit 'dusty' with no updates in years.
In calling for a SQL for ETL what I'd like to see is something very light weight that leaves the actual execution details up to the implementation. A human readable form of the XML in Kettle (PDI) kjb and ktr files would be a great start.
One final note: I'm not necessarily promoting "hard coded SQL" although I don't actually think that's the end of the world. You can get a better feel for my take on this in a previous post: "Disrespecting the database? ORMs as disruptive technology"