Thursday, 30 December 2010

2010 Review: a BI-DW Top 5

This post is written completely 'off the cuff' without any fact checking or referring back to sources. Just sayin'…

Top 5 from 2010

5) Big BI consolidation is finished
  There were no significant acquisitions of "Big BI" vendors in 2010.  Since Cognos went to IBM and BO went to SAP, the last remaining member of the old guard is MicroStrategy. (It's interesting to consider why they have not been acquired but that's for another post.)  In many ways the very definition of Big BI has shifted to encompass smaller players. Analysts, in particular, need things to talk about and they have effectively elevated a few companies to Big BI status that were previously somewhat ignored, e.g., SAS (as a BI provider), Information Builders, Pentaho, Actuate, etc.  All of the major conglomerates now have a 'serious' BI element in their offerings, so I don't see further big spending on BI acquisitions in 2011.  The only dark horse in this race is HP, and it's very unclear what their intentions are, particularly with the rumours of Neoview being cancelled; if HP were to move I see them going for either a few niche players or someone like Information Builders, who have solid software but lack name recognition.

4) Analytic database consolidation began
  We've seen an explosion of specialist analytic databases over the last ~5 years and 2010 saw the start of a consolidation phase amongst these players. The first big acquisition of 2010 was Sybase by SAP; everyone assumed Sybase's IQ product (the original columnar database) was the target, but the talk since then has been largely about Sybase's mobile offerings. I suspect both products are of interest to SAP: IQ allows them to move some of their ageing product lines forward, and Mobile will be an enabler for taking both SAP and Business Objects to smartphones.
  The banner acquisition was Netezza by IBM. I've long been very critical/sceptical of IBM's claims in the Data Warehouse / Analytic space, particularly as I've worked with a number of DWs that were taken off DB2 (and onto Teradata) but have never come across one actively running on DB2. I'm a big Netezza fan, so my hope is that they survive the integration and are able to leverage the resources of IBM going forward.
  We also saw Teradata acquiring the dry husk of Kickfire's ill-fated MySQL 'DW appliance'. Kickfire's fundamental technology appeared to be quite good but sadly their market strategy was quite bad. I think this is a good sign from Teradata: they are open to external ideas and they see where the market is going. The competition with Netezza seems to have revitalised them and given them a new enemy to focus on. A new version of the Teradata database that incorporates some columnar features (and a 'free' performance boost) could be just the ticket to get their very conservative customers migrated onto the latest version.

3) BI vendors started thinking about mobile
  Mobile BI became a 'front of mind' issue in 2010. MicroStrategy has marketed aggressively in this space, but other vendors are in the hunt and have more or less complete mobile offerings. Business Objects also made some big noise about mobile but everything seemed to be demos and prototypes. Cognos has had a 'mobile' offering for some time but they remained strangely quiet; my impression is that their mobile offerings are not designed for the iOS/Android touchscreen world.
  Niche vendors have been somewhat quiet on the mobile front, possibly waiting to see how it plays out before investing, with the notable exception of Qlikview, who have embraced it with both arms. This is a great strategic move for Qlikview (who IMHO prove the koan that 'strategy trumps product') because newer mobile platforms are being embraced by their mid-market customers far faster than at the Global 5000 companies that the Big BI vendors focus on. Other niche and mid-market vendors should take note of this move and get something (anything!) ready as quickly as possible.

2) Hadoop became the one true MapReduce
  I remain somewhat nonplussed by MapReduce personally; however, a lot of attention has been lavished on it over the last 2 years and during the course of 2010 the industry has settled on Hadoop as the MapReduce of choice.  From Daniel Abadi's HadoopDB project to Pentaho's extensive Hadoop integration to Aster's "seamless connectivity" with Hadoop to ParAccel's announcement of the same thing coming soon, and on and on.  The basic story of MapReduce was very sexy but in practice the details turned out to be "a bit more complicated" (as Ben Goldacre [read his book!] would say).  It's not clear that Hadoop is the best possible MR implementation but it looks likely to become the SQL of MapReduce. Expect other MapReduce implementations to start talking about Hadoop compatibility ad nauseam.
  All of this casts Cloudera in an interesting light. They are after all "the Hadoop company" according to themselves. It's far too early for a 'good' acquisition in this space however money talks and I wonder if we might see something happen in 2011.

1) The Cloud got real and we all got sick of hearing about it
  I'm not sure whether 2010 was truly the "year of the Cloud" but it certainly was the peak of its hype cycle.  In 2010 the reality of cloud pricing hit home; the short version is that a lot of the fundamental cost of cloud computing is operational and we shouldn't expect to see continuous price/performance gains like we have seen in the hardware world.  Savvy observers have noted that the bulk of enterprise IT spending has been non-hardware for a long time, but the existence of cloud offerings brings those costs into focus.
  Ultimately, my hope for the Cloud is that it will drive companies toward buying results, e.g., SaaS services that require little-to-no customisation, and away from buying potential, e.g., faster hardware and COTS software that is rarely fit for purpose. The cycle should go something like: "This Cloud stuff seems expensive, how much does it cost us to do the same thing?" > "OMG are you frickin' serious, we really spend that?!" > "Is there anyone out there that can provide the exact same thing for a monthly fee?".  Honestly, big companies are incredibly bad at hardware and even worse at software. The Cloud (as provided by Amazon, et al) is IMHO just a half step towards the endpoint, which is the use of SaaS offerings for everything.

Wednesday, 15 December 2010

Initial thoughts about ParStream

So here are my thoughts about ParStream based on researching their product on the internet only. I have not used the product, so I am simply assuming it lives up to all claims. As an analytics user and a BI-DW practitioner I sincerely hope that ParStream succeeds.

I'm a GPU believer
I'm a long-time believer in the importance of utilising GPUs for challenging database problems. I wrote a post in July 2009 about using GPUs for databases and implored database vendors to move in that direction: "Why GPUs matter for DW/BI" (http://joeharris76.blogspot.com/2009/07/why-gpus-matter-for-dwbi.html).  Here's the key quote - "There's a new world coming. It has a lot of cores. It will require new approaches. That world is accessible today through GPUs. Database vendors who move in this direction now will gain market share and momentum. Those who think they can wait on Intel and 'traditional' CPUs to 'catch up' may live to regret it."

On the right track
I think ParStream is *fundamentally* on the right track with a GPU accelerated analytic database. The ParStream presentation from Mike Hummel (http://www.youtube.com/watch?v=knicXkXd9hQ) talks about a query that took 12 minutes on Oracle taking just a few *milliseconds* on ParStream. If that is even half right, the potential to shake up the industry and radically raise the bar on database performance is very exciting.

Reminiscent of Netezza
I remember the first time I used Netezza back in 2004. I had just taken a new role and my new company had recently installed a first generation Netezza appliance. In my previous job we had an Oracle data warehouse that was updated *weekly* and contained roughly 100 million rows. Queries commonly took *hours* to return. The Netezza machine held just less than 1 *billion* rows. I ran the following query: "SELECT month,  COUNT(*), SUM(call_value) FROM cdr GROUP BY month;". It came back in 15 seconds! I was literally blown away.

A fast database changes the game
When you have a very fast analytic database it totally changes the game. You can ask more questions, ask more complex questions and ask them more often. Analytics requires a lot of trial and error, and removing time spent waiting on the database enables a new spectrum of possibilities. For example, Netezza enabled me to reprice _every_ call in our database against _every_ one of our competitors' tariffs (i.e. an 'explosive' operation: 50 mil records in => 800 mil records out) and then calculate the best *possible* price for each customer on any tariff. I used that information to benchmark my company on "value for money" and to understand the hidden drivers of customer churn.
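
For illustration, here's a minimal SQL sketch of that kind of 'explosive' repricing, assuming hypothetical table and column names (calls, competitor_tariffs) and a simplified per-minute pricing rule; the real tariff logic was far richer, this just shows the shape of the query:

-- Reprice every call against every competitor tariff, then take the best
-- possible total per customer. Table and column names are illustrative.
SELECT customer_id,
       MIN(repriced_value) AS best_possible_price
FROM (
    SELECT c.customer_id,
           t.tariff_id,
           SUM(c.duration_min * t.price_per_min) AS repriced_value
    FROM   calls c
    CROSS JOIN competitor_tariffs t   -- 50m calls x N tariffs => hundreds of millions of rows
    GROUP BY c.customer_id, t.tariff_id
) repriced
GROUP BY customer_id;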

ParStream appliance strategy:
So, given that background, let's look at the positioning of ParStream, the potential problems they may face, and the opportunities they need to pursue.

ParStream is not Netezza
I've positively compared ParStream to Netezza above, so you might expect me to applaud ParStream for offering an appliance. Sadly not; Netezza's appliance success was due to unique factors that ParStream cannot replicate. Netezza had to use custom hardware because they use a custom FPGA chip. Customers were (and are) nervous about investing heavily in such hardware, but Netezza went to great lengths to reassure them: providing service guarantees, keeping plenty of spare parts and using commodity components wherever possible (power supplies, disks, host server, etc.). We must also remember that most customers looking at Netezza were using very large servers (or server clusters) and required *very many* disks to get reasonable I/O performance for their databases. Netezza was actually reducing complexity for those customers.

The world has changed going into 2011
ParStream cannot replicate those market conditions. The world has changed considerably going into 2011 and different factors need to be emphasised. ParStream relies on Nvidia GPUs that are widely available and connect over commodity interconnects (e.g. PCIe). Moreover, there are high quality server offerings available in 2 form factors that make the appliance strategy more of a liability than an asset. First, Nvidia (and others) sell a 1U rack-mounted 'server' that contains 4 GPUs and connects to a 'host' server via a PCIe card. Second, Supermicro (and others) sell 4U 'super' servers that contain 2 Intel Xeons and 4 GPUs in a pre-integrated package. The ParStream appliance may well be superior to these offerings in some key way, but any such advantage will be quickly wiped out as the server manufacturers continuously refresh their product lines.

Focus on the database software business
ParStream should focus on the database software business, where they have a huge advantage, not the server business, where they have huge disadvantages. You should read this article if you have any further doubts: "The Power of Commodity Hardware" (http://www.svadventure.com/svadventure/2009/01/the-power-of-commodity-hardware.html). Key quotes: "Customers love commodity hardware.", "Competing with HP, IBM, and Dell is dumb.", "Commodity hardware is much more capital efficient".  Also consider the fates of Kickfire and Dataupia, who floundered on a database appliance strategy, and ParAccel, who are going strong after initially offering an appliance and quickly moving to emphasise software-only.

Position GPUs as a new commodity
ParStream must position GPUs and GPU acceleration as a new commodity. Explain that GPUs are an increasingly essential part of serious supercomputers and the technology is being embraced by everyone: Intel with Larrabee, AMD with Fusion, etc. Emphasise the option to add 'commodity' 4-GPU pizza-box servers alongside a customer's existing Xeon/Opteron servers and, using ParStream, make huge performance gains. Talk to Dell customers about using a single Dell PowerEdge C410x GPU chassis (http://www.dell.com/us/en/enterprise/servers/poweredge-c410x/pd.aspx) to accelerate an entire rack of "standard" servers running ParStream. The message must be clear: ParStream runs on commodity hardware; you may not have purchased GPU hardware before but you can get exactly what ParStream needs from your preferred vendor.

One final point here; ParStream needs to make Windows support a priority. This is probably not going to be fun, technically speaking, but Windows support will be important for the markets that ParStream should target (which will have to be another post, sadly).

UPDATE - I followed this post up with:
An overview of the analytic database market, a simple segmentation of the main analytic database vendors, and a summary of the key opportunities I see in the analytic databases market (esp. for ParStream and RainStor)

Thursday, 9 December 2010

Comment regarding Infobright's performance problems

UPDATE: This is a classic case of the comments being better than the post; make sure you read them! In summary, Jeff explained better and a lightbulb went off for me: Infobright is for OLAP in the classical sense with the huge advantage of being managed with a SQL interface. Cool.

I made a comment over on Tom Barber's blog post about a Columnar DB benchmarking exercise: http://pentahomusings.blogspot.com/2010/12/my-very-dodgy-col-store-database.html


Jeff Kibler said...
Tom –

Thanks for diving in! As indicated in your results, I believe your tests cater well to databases designed for star-schemas and full table-scan queries. Because a few of the benchmarked databases are engineered specifically for table scans, I would anticipate their lower query execution time. However, in analytics, companies overwhelmingly use aggregates, especially in ad-hoc fashion. Plus, they often go much higher than 90 gigs.

That said, Infobright caters to the full fledged analytic. As needed by the standard ad-hoc analytic query, Infobright uses software intelligence to drastically reduce the required query I/O. With denormalization and a larger data set, Infobright will show its dominance.

Cheers,

Jeff
Infobright Community Manager
8 December 2010 17:04


Joe Harris said...
Tom,

Awesome work, this is the first benchmark I've seen for VectorWise and it does look very good. Although, I'm actually surprised how close InfiniDB and LucidDB are, based on all the VW hype.

NFS on Dell Equilogic though? I always cringe when I see a database living on a SAN. So much potential for trouble (and really, really slow I/O).


Jeff,

I have to say that your comment is off base. I'm glad that Infobright has a community manager who's speaking for them but this comment is *not* helping.

First, your statement that "in analytics, companies overwhelmingly use aggregates" is plain wrong. We use aggregates as a fallback when absolutely necessary. Aggregates are a maintenance nightmare and introduce a huge "average of an average" issue that is difficult to work around. I'm sure I remember reading some Infobright PR about removing the need for aggregate tables.

Second, you guys have a very real performance problem with certain types of queries that should be straightforward. Just looking at it prima facie it seems that Infobright starts to struggle as soon as we introduce multiple joins and string or range predicates. The irony of the poor Infobright performance is that your compression is so good that the data could *almost* fit in RAM.

What I'd like to see from Infobright is: 1) a recognition of the issue as being real. 2) An explanation of why Infobright is not as fast in these circumstances. 3) An explanation of how to rewrite the queries to get better performance (if possible). 4) A statement about how Infobright is going to address the issues and when.

I like Infobright; I like MySQL; I'm an open source fan; I want you to succeed. The Star Schema Benchmark is not going away, Infobright needs to have a better response to it.

Joe
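
To make the 'average of an average' point concrete, here's a toy SQL illustration (the table names are invented for the example):

-- Correct: the average computed over the detail rows.
SELECT AVG(order_value) AS true_avg FROM orders;

-- Misleading: averaging a pre-built daily aggregate weights every day
-- equally, regardless of how many orders each day actually contained.
SELECT AVG(daily_avg_order_value) AS misleading_avg FROM daily_orders_agg;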

Thursday, 18 November 2010

Google Instant: how I *wish* it worked

There's something very grating about a product that could be really useful but just isn't. It's like the really promising kid on X-Factor / American Idol who keeps falling apart and forgetting their song. The first couple of times you're rooting for them, but after they blow it repeatedly you start to wish they'd just give up. For me, this is a perfect metaphor for the current iteration of Google Instant.

Google Instant, how do I hate thee? Let me count the ways.
>> Too damn fast
I'm on a fast connection and if anything GI is just too damn fast. Results are constantly flickering just below my field of focus. I tend to use very exact (i.e. long) search phrases and this gets old quickly. I find myself pausing while typing to look at GI results that are irrelevant to what I actually need. The cynic in me wonders whether this is what Google wants. Are they trying to be 'sticky' now, despite a decade of saying this isn't their goal?
>> Very generic phrases
What's frustrating is how useless the GI suggestions are. GI only gives you very generic phrases and they seem to be based on average global searches. The trouble is that I'm not an average global searcher. I've been using Google forever; they have a huge trove of data about what I've searched for and which results I've clicked on. They know which subjects I'm interested in and which ones I'm not. They put this info to use in 'normal' search in a variety of ways but apparently not in GI.

Technologically impressive
GI is clearly a very impressive bit of technology. The number of elements that have to work in harmony for it to return results that fast is honestly a little mind boggling. I take my hat off to the clever clogs who made this happen. Nevertheless, I'd prefer to wait (like a whole *second*) for even slightly better results.

An incomplete puzzle
Having said that, I'm quite sure that GI can be made a lot better and, as it's bad form to complain without offering a solution, I have some suggestions about how it could be better. None of my suggestions are particularly original (or insightful?); mostly I'm just suggesting that existing elements be combined in better ways.

Instant Example
Here's a sample of Google Instant in action. 5 suggestions and a ton of dead space.  It's not clear how the  suggestions are ordered or whether the order has some hidden meaning. (I should probably check out whether more costly PPC terms appear first…)


Related Searches
Here's a sample of the Related Searches option. This is buried under "More Search Tools" on the left.  Obviously there are more suggestions here but they are also different from Instant and in a different (also non-obvious) order.
Wonder Wheel (of Doom!)
Here's a sample of the Wonder Wheel option, also buried under "More Search Tools". Again there is no context around any of the terms, and the underlines suggest links but actually trigger a new 'wheel'; the results are displayed on the left but you need a very wide screen, otherwise they only get ~200 pixels of width.





A mockup example
Here's my mockup for your amusement. Points to note:
1) Search suggestions appear in columns, each column depending on the column to its left. If you've used Finder on the Mac you know the score here. Users can navigate with the mouse or the arrow keys.
2) Suggested terms are greyscale to indicate some hidden metric that may help the user choose between terms. Possible metrics: number of results, popularity of the term, previous visits, etc. Previously used terms could appear in purple. There are lots of possibilities here.
3) When a term is selected (with space or the right arrow) it's added to the search box. Terms can be removed the same way (left arrow or backspace). I strongly feel that users should be encouraged to build long and specific search terms. Long terms are far more likely to result in quality responses in my experience.
4) Note that all aspects of Google's offerings can be integrated into the Instant Search experience. I've noticed that the video, image and social aspects have dropped to the bottom of the results. My mockup allows them to become much more front and center.


*Do* be dense
Ultimately the Instant Search experience needs to become much more information dense. Sure "your mom" might not appreciate the color coding of the suggestions but it doesn't detract from her experience. Google needs to think much more holistically about Instant. Just getting any old crap faster is not an improvement regardless of how impressive it is, but getting the exact right result faster would be invaluable.

Tuesday, 2 November 2010

Thoughts on 37signals requirements for a "Business Analyst"

Jason Fried posted a job requirement on Friday evening for a new "Business Analyst" role at 37signals, although in reality the role is more of a Business Intelligence Analyst than a typical BA as I have experienced it. The role presents something of a conundrum for me and I thought it would be interesting to pick it apart in writing for your enjoyment.


UPDATE: I made a follow-up comment on the 37s blog that is a pretty good summary of this post - "I was reflecting on the requirement from years of experience doing this kind of thing (sifting meaning from piles of data). I actually said that they’ve described 2 roles not often combined in a single person.

So let me give an actual suggestion: > First, do the basics, make sure the data is organised and reliable. > Second, establish your metrics, build a performance baseline. > Third, outsource the hard analytics on a “pay for performance” basis. > Finally, if that works, then think about bringing analytic talent in-house.

It’s hard to describe the indignity of hiring a genius and then forcing them to spend 95% of their time just pushing the data around.
"

I call myself a Business Intelligence professional and my last title was "Solution Architect". You can review my claim to that title on my LinkedIn profile.  This should be a great opportunity for me. I'm a fan of 37signals products; I love the Rework book and I pretty much agree with all of it; and I really like their take on business practices like global teams and firing the workaholics.

However, it doesn't seem like this role is for me. Why not? It seems like they've left out the step where I do my best work: gather, integrate, prepare and verify the data. In the Business Intelligence industry we usually call the results of this phase the "data warehouse". A data warehouse, in practice, can't actually be defined in more detail than that. It's simply the place where we make the prepared data available for use. Nevertheless, the way that you choose to prepare the data inherently defines the outcomes that you get. It's the garbage in, garbage out axiom.

Jason tells us a little bit about where their data comes from: "[their] own databases, raw usage logs, Google Analytics, and occasional qualitative surveys." We're looking at very raw data sources here. Making good use of these data sources will take a lot of preparation (GA excepted) and will require a serious investment of time (and therefore money). The key is to create structures and processes that are automated and repeatable. This may seem obvious to you, but there's a sizeable number of white collar workers whose sole job is wrangling data between spreadsheets [ e.g. accountants ;) ].

The content and structure of the data is largely defined by the questions that you want to answer. Jason has at least given us an indication of their questions:  "How many customers that joined 6 months ago are still active?"; "What’s the average lifetime value of a Basecamp customer?"; "Which upgrade paths generate the most revenue?". That's a good start. I can easily imagine where I'd get that data from and how I'd organise it. This is 'meat & potatoes' BI and it's where 80% of the business value is found. These are the things you put on your "dashboard" and track closely over time.
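
For illustration, here's roughly how the first two of those questions might be answered once the data is organised; a hedged sketch against an invented schema (customers and payments tables), not 37signals' actual data model:

-- How many customers that joined 6 months ago are still active?
SELECT COUNT(*) AS still_active
FROM   customers
WHERE  signup_date >= '2010-04-01' AND signup_date < '2010-05-01'
AND    cancelled_date IS NULL;

-- Average lifetime value (revenue to date) of a Basecamp customer.
SELECT AVG(total_paid) AS avg_lifetime_value
FROM  (SELECT customer_id, SUM(amount) AS total_paid
       FROM   payments
       WHERE  product = 'Basecamp'
       GROUP BY customer_id) per_customer;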

Another question is trickier: "In the long term would it be worth picking up 20% more free customers at the expense of 5% pay customers?" There's a lot of implied data packed into that question: what's the long term?; what does 'worth' mean?; can you easily change that mix?; are those variables even related?; etc. This is more of a classical business analysis situation where we'd build a model of the business (usually in a spreadsheet) and then flex the various parameters to see what happens. If you want to get fancy you then run a Monte Carlo simulation where you (effectively) jitter the variables at random to see the 'shape' of all possible outcomes. This type of analysis requires a lot of experience with the business. It's also high risk because you have to decide on the allowed range of many variables, and guessing wrong invalidates the model. It can reveal very interesting structural limits to growth and revenue if done correctly. Often the credibility of these models is defined by the credibility of the person who produced them, for better or worse. Would that be the same in 37signals?

Now we move on to slippery territory: "What are the key drivers that encourage people to upgrade?"; "What usage patterns lead to long-term customers?". We're basically moving into operational research here. We want to split customers into various cohorts and analyse the differences in behaviour. The primary success factor in this kind of analysis is experimental design and this is a specialist skill. Think briefly about the factors involved and they make the business model seem tame. How do we define usage patterns? Will we discover them via very clever statistics or just create them _a priori_? What are the implications of both approaches? The people who can do this correctly from a base of zero are pretty rare, in my experience. However, this is an area where you can outsource the work very effectively if you have already put the effort in to capture and organise the underlying information.

And the coup de grace: "Which customers are likely to cancel their account in the next 7 days?". It sounds reasonable on its face. However, consider your own actions: Do you subscribe to any services you no longer need or use? The odds are good that you do. Why haven't you cancelled them? Have you said to yourself: "I really need to cancel that when I get a second."? And yet didn't do it. When you finally did cancel it, was it because you started paying for a substitute service? Or were you reminded of it at just the right time when you could take action? Predicting the behaviour of a single human is a fool's game. The best you can do is group similar people together and treat them as a whole ("we expect to lose 10% of this group"). Anyone who tells you they can do better than that is probably pulling your leg, in my humble opinion.
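
To illustrate the group-level approach, a small hedged sketch against the same invented schema (the segment column is an assumption):

-- Churn expressed for groups of similar customers, never for individuals.
SELECT segment,
       COUNT(*) AS customers,
       SUM(CASE WHEN cancelled_date IS NOT NULL THEN 1 ELSE 0 END) AS cancelled,
       100.0 * SUM(CASE WHEN cancelled_date IS NOT NULL THEN 1 ELSE 0 END)
             / COUNT(*) AS churn_pct
FROM   customers
GROUP BY segment;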

Finally, let's have a look at their questions for the applicant's cover letter:
1. Explain the process of determining the value of a visitor to the basecamphq.com home page.
>> Very open ended. How do we define value in this context? The cost per visit / cost per click? The cost per conversion (visitors needed to deliver a certain number of new paid/free signups)? Or perhaps the acquisition cost (usually marketing expenses as a % of year one revenue)? I'd say we need to track all of those metrics but this is pretty much baseline stuff.
2. How would you figure out which industry to target for a Highrise marketing campaign?
>> This isn't particularly analytical; I'm pretty sure the standard dogma is to sell to the people who already love your stuff. It would be interesting to know how much demographic data 37signals has about their customers' industries. This is an area where you typically need to spend money to get good data.
3. How would you segment our customer base and what can we do with that information?
>> This is a classical analytic piece of work. I've seen some amazing stuff done with segmentation (self organising maps spring to mind). However, in my experience models based on simple demographics (for individuals) or industry & company size (for businesses) perform nearly as well and are much easier to update and maintain.

As far as I can tell they want to hire a split personality. Someone who'll A) create a reliable infrastructure for common analysis requirements, B) build high quality models of business processes and C) do deep diving 'hard stats' analytics that can throw up unexpected insights. Good luck to them. Such people do exist, simple probability essentially dictates that that is the case. And 37signals seems to have a magnetic attraction for talent so I wouldn't bet against it. On the other hand one of their koans is not hiring rockstars. This sounds like a rockstar to me.

Full disclosure: I'm currently working on a web service (soon to be at appconductor.com) that synchronises  various web apps and also backs them up. It's going to launch with support for Basecamp, Highrise and Freshbooks (i.e. 2/3rds 37signals products). Make of that what you will.

Friday, 22 October 2010

Disrespecting the database? ORM as disruptive technology

The premise of this post is that ORMs are a disruptive innovation for all parts of the IT industry that utilise databases, particularly relational databases. I'm particularly interested in the ultimate impact of ORMs on my work in the BI-DW-OLAP-DSS-{insert acronym here} industry.

What makes a 'disruptive technology'?
     In case you are not familiar with the "innovator's dilemma" concept, it was originally expressed in those terms by Clayton M. Christensen in the article 'Disruptive Technologies: Catching the Wave'.
     "Generally, disruptive innovations were technologically straightforward, consisting of off-the-shelf components put together in a product architecture that was often simpler than prior approaches. They offered less of what customers in established markets wanted and so could rarely be initially employed there. They offered a different package of attributes valued only in emerging markets remote from, and unimportant to, the mainstream."

Let's talk about ORMs
     If you are a true BI-DW person you may not have heard of ORM and are unlikely to have come across one directly in your work. An ORM [Object-Relational Mapper] is simply a set of code routines that 'map' tables and columns in a relational database to objects, attributes and methods in a programming language. Programmers can then interact with the data stored by the underlying database without writing any SQL.

An ugly history with DBAs
     Now, as you'd probably imagine, DBAs hate ORMs and for good reason. They have typically produced horrible SQL and correspondingly awful performance problems for the DBAs to deal with. ORM use in "enterprise" IT environments is patchy and somewhat limited. It seems like a lot of enterprise ORM use is kept out of sight and only comes to light when the DBAs get really fired up about some bad SQL that keeps reappearing every time an update to the software is released.
     The early ORMs were rightly criticised (Hibernate seems to have taken the most heat) but ORMs haven't gone away. The sweet spot for early ORMs was small and 'simple' transactional applications. The kind of app that is needed quickly and where imperfect SQL was not a huge issue. But ORMs keep evolving and becoming more sophisticated in the way they generate SQL and deal with databases. This is where the disruptive innovation part comes in.

The hockey stick graph
     Looking at the graph from the Wikipedia article I linked above you can see that ORMs started in the bottom left "low quality use" corner. My entire point for this post is that ORMs are going to follow the "disruptive technology" curve and eventually they will come to be the dominant way in which ALL database access occurs. Seriously.
     Let me explain why I see this happening. There are 3 good technical reasons and a human reason. As usual the human reason is the trump card.

ORMs are getting better quickly
     First, we're seeing much better ORMs developed. In particular I want to draw your attention to Datamapper. It's a Ruby ORM that's been around for about 2 and a half years. The interesting thing about Datamapper (for me) is how much respect it has for the database. DB access is designed to minimise the number of queries hitting the backend and, at the same time, to minimise the data being pulled out unnecessarily (i.e. only get Text/Blob fields if you really want them). Here's the kicker though: it supports foreign keys. Real (honest-to-goodness, enforced-by-the-database) foreign keys. Nice.

ORMs are the ultimate metadata layer
    Second, because an  ORM is deeply involved in the application itself it can contain a much richer set of metadata about the data that's being stored. Compare the following SQL DDL with the equivalent ORM setup code.

CREATE TABLE users (
  id           int(10)     NOT NULL AUTO_INCREMENT,
  tenant       int(10)     NOT NULL,
  first_name   varchar(50) NOT NULL,
  last_name    varchar(50) NOT NULL,
  title        varchar(50) DEFAULT NULL,
  email        varchar(99) DEFAULT NULL,
  passhash     varchar(50) DEFAULT NULL,
  salt         varchar(50) DEFAULT NULL,
  permission   int(11)     DEFAULT '1',
  created_at   datetime    NOT NULL,
  updated_at   datetime    DEFAULT NULL,
  deleted_at   datetime    NOT NULL DEFAULT '2999-12-31',
  manager_id   int(10)     NOT NULL,
  PRIMARY KEY (id, tenant),
  UNIQUE INDEX  unique_users_email (email),
  INDEX index_users_manager (manager_id),
  INDEX users_tenant_fk (tenant),
  CONSTRAINT users_tenant_fk  FOREIGN KEY (tenant)
                              REFERENCES  tenants (id) 
                              ON DELETE NO ACTION 
                              ON UPDATE NO ACTION,
  CONSTRAINT users_manager_fk FOREIGN KEY (manager_id)
                              REFERENCES  users (id) 
                              ON DELETE NO ACTION 
                              ON UPDATE NO ACTION
);
    What can we tell about this table? It's got an FK to itself on 'manager_id' and another to 'tenants' on 'tenant'.  We don't gain a lot of insight. Here's the Datamapper syntax:
class User
  include DataMapper::Resource

  property :id, Serial
  property :tenant, Integer,  :min => 0, :required => true, :key => true
  property :first_name, String,   :required => true     
  property :last_name, String,   :required => true     
  property :title,      String                       
  property :email,      String,   :length => (5..99), :unique => true, 
                   :format => :email_address,
                   :messages => {:presence => 'We need your email address.',
                                 :is_unique => 'That email is already registered.',
                                 :format    => "That's not an email address" }
  property :passhash,   String
  property :salt,       String
  property :permission, Integer,  :default => 1   
  property :phone, String                         
  property :mobile, String                         
  property :created_at, DateTime, :required => true
  property :updated_at, DateTime
  property :deleted_at, ParanoidDateTime, :required => true

  has 1,  :manager
  has n,  :authorities
  has n,  :subscriptions, :through => :authorities


  belongs_to :tenant,  :parent_key => [:id], :child_key => [:tenant]
  belongs_to :manager, self 
end
    An awful lot more insightful if you ask me, and I actually stripped out 50% of the metadata to avoid distracting you. We can see that:
  Columns
     > Email must be unique, it has a Min and Max length and a specific format.
     > Deleted_At has a special type ParanoidDateTime, which means deletes are logical not physical.
  Child Tables {Try finding this out in SQL…}
     > The table has an FK that depends on it from Authorities (1 to many)
     > a relationship to Subscriptions through Authorities (many to many)
    We're getting a much richer set of metadata here and it's being specified this way because it's useful for the developer not because we're trying to specify a top-down data dictionary. The really interesting thing about the ORM example is that nothing prevents us from enriching this further. We are not bound by the constraints of SQL 99/2003/etc and the way it's been implemented by the vendor.
     We've been talking about the power and importance of metadata for at least 10 years and, truthfully, we've made almost no progress. Every new BI-DW-ETL-OLAP project I work on still has to start more or less from nothing. The rise of ORMs creates an inflection point where we can change that if we become involved in the systems early in their lifecycle.

An aside on "metadata layers"
    There is another point here and it's important. We could make a lot of progress simply by using ORMs to interact with our existing (so called 'legacy') databases. Datamapper has a lot of features and tricks for accommodating these databases and it's open source so we can add anything else we need.
    Imagine an ETL tool that interacted with the database via Datamapper instead of using ODBC/JDBC plus its own metadata. You would start by declaring a very simple model, just table-column-datatype, and then as you learned more about the data you would specify that learning (new metadata) in the ORM itself. I think that's an incredibly powerful concept. The ETL becomes purely for orchestration and all of the knowledge about how to interact with sources and destinations is held in a way that is usable by downstream tools (like a reporting tool or another ETL process).
    I imagine this is what the Business Objects guys had in mind when they created their metadata layer ('Universes' in BO parlance) back in the 90s. To my reckoning they didn't quite get there. The re-use of Universe metadata for other processes is (in my experience) non-existent. Yet here is a universal metadata layer; spontaneously realised and completely open for everyone to take advantage of.

ORMs will ultimately write better SQL
    Third, ORMs can generate much better SQL than people do. The human element in any technology is the most unpredictable. It's generally not a scheduled report that brings down the database. It's usually a badly formed query submitted ad-hoc by a user. Maybe this isn't the case right now but the existence of Datamapper indicates we're getting close.
    Clearly SQL is a complex domain and it will take time for ORMs to be able to cover all of the edge cases, particularly in analytics. However, let me refer you to the previous discussion of Business Objects Universes. If you review the SQL BO generates you'll see that the bar is not set very high.

ORMs are blessed by the kingmakers
    Fourth, developers love ORMs. Stephen O'Grady from RedMonk wrote recently that developers are the new/old kingmakers. He has a great quote from Abraham Lincoln: “With public sentiment, nothing can fail; without it nothing can succeed.” ORMs have the kind of positive sentiment that your fancy data dictionary / master data / shared metadata project could never dream of. Developers want to use an ORM for their projects because it helps them. They want to stuff all of the beautiful metadata in there. They want to take huge chunks of business logic out of spaghetti code and put it into a place where we can get at it and reuse it. Who are we to say they shouldn't?

A final thought on BI becoming "operationalised"
    My final thought for you is that the new 'edge' of BI is in putting classical BI functionality into operational apps, particularly web apps. If you think this post is a call for existing BI companies to get onboard with ORMs then you are only half right. It's also a warning that the way data is accessed is changing and a lot of core BI-DW skillsets may feel a bit like being a mainframe specialist someday soon.

Friday, 8 October 2010

The easy way to go open source in BI-DW: slipstreaming

I'd like to propose a slightly devious strategy for getting open source Business Intelligence & Data Warehousing into your company. You've probably heard a lot about open source BI / DW offerings in the last few years. You're kind of self-selected into that group by simply reading this post! However, just in case, I'll round up a few of the leading lights for you. This is by no means comprehensive; consider it an invitation to do some 'googling'.

Open Source Reporting / Dashboards
> Pentaho: Open Core offering, very complete BI suite, the standard bearer IMO
> Jaspersoft: Open Core offering, very complete BI suite
> Actuate/BIRT: BIRT is very open, other offerings less clear, more OEM focused 
> SpagoBI: The 'most' open of the FLOSS BI offerings, possibly less mature 
> Palo: I'm kind of confused about what's open/closed but I hear good things

Open Source ETL / Data Integration
> Pentaho PDI/Kettle: Stream based, Open Core, excellent for multi-core/server
> Talend: Code generator, Open Core, suited to single big machine processing 
> Palo ETL: Tightly integrated with Palo suite, if you like Palo give it a look
> CloverETL: New offering with a lot of features 'checked off'
> Apatar: Another new offering with a lot of features claimed

Open Source Analytic Databases
> Infobright: MySQL storage engine, Open Core, good compression, limited DDL SQL
> InfiniDB: MySQL storage engine, Open Core, maximises modern CPUs , limited DDL SQL
> LucidDB: Java based, completely open, good compression, supports most of SQL
> MonetDB: completely open but not very active, good compression, likes a lot of RAM
> VectorWise:  promising to become open soon, maximises modern CPUs, good 'buzz'
> Greenplum: kinda-sorta open, free 'single node edition', good SQL support

Open Source OLAP
> Pentaho Analysis/Mondrian: Mature tool, Open Core, likes a fast DB underneath
> Palo: Well regarded OLAP, nice options for Excel use, tightly integrated with suite

How do you bring it in?
OK, with that out of the way, how can we bring open source into businesses that already have some sort of BI-DW infrastructure in place? One of the problems that open source faces is that free licenses don't buy fast-talking salespeople who'll come and woo senior managers and executives. So we often have to bring it in by the back door. You're not going to rip out the existing software and replace it with your new shiny open source alternative. You need to find pain points where the business is not getting what it needs but is blocked from getting something better, usually for political or financial reasons.

Typical pain points
Let's talk about some typical pain points. Is the main DW constantly overloaded by queries? Are some team's queries throttled because they're not considered important enough? Do you have an analysis team that is not allowed to run the complex queries that they'd like to? Do you have a policy of killing queries that run over a certain time, and is it killing a lot of queries? Does it take a *very* long time to produce the daily report burst? Has a certain team asked for Excel ODBC access to the DW and been blocked? Do some teams want to load their own data into the DW but are not allowed to? Do more people want access to ad-hoc reporting but you can't afford the licenses? Is your ETL development slow because you can't afford any more server licenses for your expensive tool? Are you still doing your ETL jobs as hand coded SQL?

Finally - the slipstreaming strategy
If your company has more than 500 people I bet I could easily find at least 3 of those. These are the areas where you can implement open source first. You will be using a strategy that I call 'slipstreaming'. Have you ever watched the Tour de France on television? Did you notice that Lance Armstrong almost never rode at the front of the group? He always sat behind his team mates (in the slipstream) to conserve energy so he could attack at the end or break away on the climbs. Sitting behind his team reportedly saved 25% of his energy.

Open source as the 'domestique'
The bad news: your open source efforts are not in Lance's position. You are going to be the team mate out in front cutting the wind (a domestique). You need to find a pain point where you can put the open source solution in front of the existing solution to 'cut the wind'. Essentially you are going to make the existing solution work better by taking some of the demand away. You will then present this as a 'business-as-usual' or 'tactical' solution to your problem. You need to be very careful to position the work correctly. Your goal is to run this as a small project within your team. Be careful to keep the scope down. Talk about fire-fighting, taking the pressure off, etc. I'm sure you'll know how to position it in your company. You don't want project managers or architecture astronauts getting involved and making things complicated.

How about some examples?

The data warehouse edge copy:
You have an Oracle based DW. It's been around for a while and is suffering, despite hardware upgrades. The overnight load barely finishes by 8:30 and the daily report burst has been getting bigger and only finishes around 10:30 (sometimes 11). The customer insight team has been completely banned from running anything until the reports are out. They're not happy about starting their queries later and later.

Edge copy slipstream strategy
The slipstream solution to this is to add an edge copy of the DW serving either the daily report run or the insight team. You should be able to make use of a reclaimed server (that the DW ran on previously) or you can purchase a "super PC" (basically a gaming machine with extra hard disks). The edge copy will run one of the analytic databases I mentioned; on an older machine I'd lean towards LucidDB or Infobright because of their compression. You then add a new step to the ETL that copies just the changed data from the DW, or a time-limited subset, to the edge machine. Finally you switch them over to the edge copy. If your edge copy takes a while to load (for whatever reason) then talk to the insight team about running an extra day behind everyone else. You'll probably find that they're happy to run a day behind if they have a database to themselves, with no restrictions.
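
The extra ETL step can be very simple. In practice the copy would go via an export and bulk load into the edge database, but the selection logic looks something like this sketch (table and column names are illustrative):

-- Copy only the rows changed since the last edge load.
INSERT INTO edge_sales_fact
SELECT *
FROM   dw_sales_fact
WHERE  updated_at > (SELECT COALESCE(MAX(updated_at), '1900-01-01')
                     FROM   edge_sales_fact);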

The ETL co-worker:
You use Business Objects' Data Integrator for your ETL processing. You've got a 4-core license and the word has come down that you are not getting any more. Your processing window is completely taken up with the existing run. ETL development has become a 1-in-1-out affair where new requests can only be delivered by killing something else. The DW devs have started using hand coded routines in the warehouse to deliver work that has political priority.

Co-worker slipstream strategy
The slipstream solution to this is to add an open source ETL tool as a co-worker to the existing processing. The idea is to leave all of the existing processing on BODI but put new requests onto the new open source package. Again you need to identify either an older server that you can reclaim or source a super-PC to run it on. Think carefully about the kind of work that can best be done on the co-worker process. Isolated processes are best. You can also do a lot of post-load activities like data cleanup and aggregations. Once you've established the co-worker as a valid and reliable ETL solution then you should aim to set a policy that any existing ETL processing that has to be changed is moved to the new tool at the same time.
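
As an example of the kind of isolated, post-load work that suits the co-worker, the open source tool could own a daily summary rebuild that runs once the main BODI load has finished; a hedged sketch with invented table names:

-- Rebuild yesterday's slice of a summary table after the main load completes.
DELETE FROM daily_sales_summary
WHERE  sales_date = CURRENT_DATE - INTERVAL 1 DAY;

INSERT INTO daily_sales_summary (sales_date, region, order_count, revenue)
SELECT sales_date, region, COUNT(*), SUM(order_value)
FROM   sales_fact
WHERE  sales_date = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY sales_date, region;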

Be devious. Be political.  But be nice.
Don't tell them, show them.
Ask forgiveness, not permission.
I wish you luck.

Tuesday, 5 October 2010

The trouble with SaaS BI - it's all about the data

    Some data was released yesterday that purports to show that SaaS BI customers are very pleased with its ease of use, etc., etc. Boring. Seriously, I really like the idea of SaaS BI but I haven't seen anyone making great leaps forward. I'd say that they *can't* take us forward because of the box that they've painted themselves into. The box actually has a name: it's called BI.


The BI sandbox
    Eh? What? Here's the thing; BI as we currently know it is the last stage in the information pipeline. It's the beautiful colours on the box that holds the cereal. But it's not the cereal and it's not even the box. It is *very* important (who would buy cereal in a plain cardboard box?) but is also *very* dependent on other elements in the pipeline.
    I don't want to get into a long discussion about definitions of BI. Suffice it to say this: why are terms like 'data warehouse' and 'OLAP cube' still prevalent? Simply because BI does not imply data gathering, preparation and storage. Last example on this theme. If I tell you I'm a Business Intelligence manager, what would you guess is my remit? Does it include the entire data warehouse? The OLAP cubes? All of the ETL processing? No? It could but it rarely does.

It's all about the data
    I once worked for a clever chap whose mantra was "it's all about the data". His daily struggle was to get the business to invest more time, effort and money into the data itself. It was a hard fight. We had a very fast data warehouse (NZ) and some perfectly serviceable BI software (BO) and nearly a dozen newly minted graduates to turn out our reports. What we did not have was a strong mandate to get the data itself absolutely sorted, to get every term clearly defined and to remove all of the wiggle room from the data. As a consequence we had the same problems that so many BI teams have. Conflicting numbers, conflicting metrics, and political battles using our data as ammunition.

Data is the 'other' 90%
    I'd estimate that gathering, preparing, and storing the data for BI represents at least 90% of the total effort, with analysis and presentation being the last 10%. I really hope no one is surprised by that figure. I'd think that figure is consistent for any situation in which decisions need to be made from data. For instance a scientist in a lab would have to spend a lot of time collecting and collating measurements before she could do the final work of analyzing the results. A research doctor conducting a study will have to collect, organize and standardize all of the results study data before he can begin to evaluate the outcome.


It's NOT about speed
    One of the tragedies of the Inmon-Kimball data warehouse definition war is that the data warehouse has been conceived as something that you create because you want to speed up your data access. It's implied that we'd prefer to leave the data in its original systems if we could, but alas that would be too slow to do anything with. What a load of tosh! Anyone who's been in the trenches knows that the *real* purpose of a data warehouse is to organize and preserve the data somewhere safe, away from the many delete-ers and archive-ers of the IT world. We value the data for its own sake and believe it deserves the respect of being properly stored and treated.

Nibbling at the edges
    So, back to the topic, how does SaaS BI help with this issue?  Let's assume that SaaS BI does what it claims and makes it much easier for "users" to produce reporting and analysis. Great, how much effort have we saved? Even if it takes half as much time and effort we've only knocked 5% off our total.

The real opportunity
    And finally I come to my point: the great untapped opportunity for the SaaS [BI-DW-OLAP-ETL] acronym feast is the other 90% where most of the hard work happens. Customers are increasingly using online applications in place of their old in-house apps. Everything from ERP to invoicing to call centre IVRs and diallers is moving to a SaaS model. And every SaaS service that's worth its salt offers an open API for accessing the data that it holds.

The holy grail - instant data
    This is the mother lode, the shining path for data people. Imagine an end to custom integrations for each customer. Imagine an end to customers having to configure their own ETL and design their own data warehouse before they can actually do anything with their data. The customer simply signs up to the service and you instantly present them with ready-to-use data. Magic. Sounds like a service worth paying for.

Monday, 4 October 2010

Buying results versus buying potential in business IT

It's all about potential
    You have probably noticed that business IT (in general) is very expensive. In fact we (the IT industry) invented a special word to justify the expense: Enterprise. We use the word Enterprise to imply that a product or service is: robust, sophisticated, reliable, professional, complex and (most of all) valuable. But if you look deeper you might notice something interesting about "enterprise" IT; it's all about potential. The most successful IT products and companies make all their money by selling potential. Oracle databases have the potential to handle the world's largest workloads. SAP software has the potential to handle the processes of the world's biggest companies. IBM servers have the potential to run the world's most demanding calculations and applications.

You're buying what could be
    Selling potential is insanely lucrative. After all, you're not buying what actually is, you're buying what could be. You're not buying an Oracle database to simply keep track of your local scrap metal business; you're buying Oracle to keep working while you become the biggest scrap metal business in the world. You're not buying SAP to tame the processes of your 20-site tool hire business; you're buying SAP to help you become the world leader in tool hire. And when the purchase is framed like this customers actually want to pay more. I've been involved in more than one discussion where suppliers were eliminated from consideration for being too cheap. It was understood that they couldn't be "enterprise enough" at that price point.

Really expensive DIY
    The funny thing is that buying potential actually means you'll have to do it yourself. This is actually the defining characteristic of enterprise IT; whatever it costs you'll spend the same again to get it working. You don't simply install the Oracle database and press run. You can't just install SAP on everyone's desktop. You need to hire experts, create a strategy, run a long project, put it live over a long weekend, perform extensive tuning, etc., etc. Of course your enterprise IT supplier will be happy to help with all this, but that'll cost extra.

Small business need results
    Ironically, small businesses want to buy the exact opposite of potential. They want to buy results. Strike that; they need to buy results. A specific action should result in a specific outcome. In a small business you can't afford to waste time and money on something that might be great. What is required is something that is OK right now. I think the web is perfect for delivering this kind of service and I think that's why small businesses are embracing web services like Basecamp, Freshbooks and (hopefully) my very own AppConductor.com. I think there is latent demand for better technology in small businesses. They can't risk spending big money on potential but they're happy to try out a service with a small monthly fee. If it helps them they'll keep paying the monthly fee and if it doesn't then they won't have lost much.

Harris law of IT spending
    I think this is so powerful I'm going to make it a law. The Harris law of IT spending: "Pay only for results. Never buy potential".

P.S. You might notice that open source embodies my law completely. The thing that has potential is free, and then you  spend time/money on results.

Friday, 1 October 2010

Further thoughts on real-time ETL, the Magic of Queues

    Now that we have polling out of the way, let's talk about integrating the data once you know something has changed. If you are doing this kind of thing for the first time then your inclination will be towards "one giant script"; do this > then this > then this > then this > ad infinitum. Avoid this pattern. It's not that it's impractical or hard to change. However, in real-time ETL it will come back to bite you - hard.


Real-time = inconsistent
    The hard thing about real-time ETL, in my view, is not the continuous processing. It's the inconsistency. One system may receive a large number of changes in a short spell, another system may go hours without a single update. Or all of your systems could (easily) burst into life at exactly 9 AM and hammer you with updates for an hour. But these are just the obvious options.


The dreaded 'mysterious slowdown'
    A more insidious problem is the dreaded 'mysterious slowdown'. Let's say you have a dependency on address checking using an app from MoMoneySys. The app runs on its own server, was installed by the MMS engineers and is not to be interfered with on penalty of death. The app is like a foreign footballer: brilliant when it works, liable to collapse under pressure. If your real-time ETL flows directly through this app then your performance is directly tied to its performance - not a nice situation to be in.

Time-to-live? Get over it
    I mentioned in the previous post that I'm not really interested in talking about how far my ETL process is behind live, and this pretty much explains why: it's just meaningless talk. If no updates are happening on the source and my ETL process is empty then my 'time-to-live' is 0. If the source suddenly saturates me with changes then my TTL is going to go down the pan. The best I can do is monitor the average performance and make sure my ETL 'pipe' is 'wide' enough to deliver a performance I'm happy with (another post perhaps…).

Behold the QUEUE
    So you need a way to cope with the vagaries of sporadic updates in the source, patchy performance in dependencies and, simply, your desire to remain sane. And behold, I present to you the QUEUE. Queueing is an excellent way to break up your processing flow and handle the inconsistent, patchy nature of real-time ETL.
    Contrary to the title, there is no magic involved in queues. A queue is simply a list of jobs that need to be worked. There are any number of queueing apps/protocols available: from the complex, like AMQP (http://en.wikipedia.org/wiki/AMQP), to the fairly simple, like BeanstalkD (http://kr.github.com/beanstalkd/), to my personal favourite - the queue table. Jobs are added to the queue and then worked in order. Typically you also give them a priority and add some logic so that a job is checked out of the queue to be worked and only deleted on success.
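    To make the queue table idea concrete, here is a minimal sketch in Python on top of SQLite. The table and function names (job_queue, add_job, check_out_job, complete_job) are my own inventions rather than any product's API, and a production version with multiple workers would need to make the check-out step atomic.

import sqlite3

# A single shared connection in autocommit mode keeps the sketch short.
conn = sqlite3.connect("etl_queues.db", isolation_level=None)

conn.execute("""
    CREATE TABLE IF NOT EXISTS job_queue (
        job_id     INTEGER PRIMARY KEY AUTOINCREMENT,
        queue_name TEXT NOT NULL,        -- e.g. 'retrieve', 'validate', 'final'
        payload    TEXT NOT NULL,        -- what to work on: an ID, a JSON blob, ...
        priority   INTEGER DEFAULT 5,    -- lower number = more urgent
        status     TEXT DEFAULT 'ready'  -- ready -> working -> deleted on success
    )""")

def add_job(queue_name, payload, priority=5):
    conn.execute(
        "INSERT INTO job_queue (queue_name, payload, priority) VALUES (?, ?, ?)",
        (queue_name, payload, priority))

def check_out_job(queue_name):
    # Take the next ready job and mark it as being worked.
    # NOTE: a real multi-worker setup must do this select-and-update atomically.
    row = conn.execute(
        "SELECT job_id, payload FROM job_queue "
        "WHERE queue_name = ? AND status = 'ready' "
        "ORDER BY priority, job_id LIMIT 1", (queue_name,)).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE job_queue SET status = 'working' WHERE job_id = ?", (row[0],))
    return row  # (job_id, payload)

def complete_job(job_id):
    # Only delete the job once the work has actually succeeded.
    conn.execute("DELETE FROM job_queue WHERE job_id = ?", (job_id,))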

Queues beat the real-time blues
    So how can queues help with our real-time blues? Basically, we use queues to buffer each processing step from the next. Let's walk through the steps I've mentioned (above and in the previous post) for a real-time ETL process that uses internal and external sources; a small worker sketch follows the queued version below.

A real-time ETL example
A. Check for changes
A.1 Poll external web service for changes
A.2 Receive change notices from internal database
B. Retrieve changed data
B.1 Extract changed data from web service
B.2 Extract changed data from internal database
C. Validate changed data
C.1 Append geocoding information (using geonames.org data)
C.2 Process new personal data with external identity verification service
D. Load data into final database (e.g. data warehouse)

This time with queues
A. Check for changes
> Take job from the {poll} queue
A.1 Poll external web service for changes (based on job)
> Add job to the {retrieve} queue
A.2 Receive change notices from internal database
> Add job to the {retrieve} queue
B. Retrieve changed data
> Take job from the {retrieve} queue
> Choose worker 1 or 2 for the job
B.1 Extract changed data from web service
> Load raw changes to the staging area
> Add job to the {validate} queue
B.2 Extract changed data from internal database
> Load raw changes to the staging area
> Add job to the {validate} queue
C. Validate changed data
> Take job from the {validate} queue
> Add job to {geocode} or {identity} queues (or both) 
C.1 Append geocoding information (using geonames.org data)
> Take job from the {geocode} queue
> Update staging tables with geocoding.
> Add job to the {final} queue
C.2.a Submit new personal data to external identity verification service
> Take job from the {identity} queue
> Submit the personal data and mark the staging rows as awaiting verification
> Add job to the {ready} queue
C.2.b Retrieve verified personal data from verification service
> DON'T WAIT! Use a separate listener to receive the data back
> Update staging tables as required (or any other work…)
> Add job to the {final} queue
D. Load data into final database (e.g. data warehouse)
> Take jobs from {final}
> Load data in the final database
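
    To make the flow above concrete, here is a hedged sketch of a single worker in the chain, built on the queue table functions from the earlier sketch. It handles step C.1: geocode_address is a placeholder for the real work (e.g. a lookup against the geonames.org data), and the queue names simply mirror the list above.

import time

def geocode_address(payload):
    # Placeholder for the real geocoding step; return True on success.
    return True

def run_geocode_worker(poll_interval=5):
    # Drain the {geocode} queue and feed the {final} queue, forever.
    while True:
        job = check_out_job("geocode")      # from the queue table sketch above
        if job is None:
            time.sleep(poll_interval)       # nothing to do; don't hammer the queue
            continue
        job_id, payload = job
        if geocode_address(payload):
            add_job("final", payload)       # hand off to the next step's queue
            complete_job(job_id)            # delete only on success
        # on failure the job stays 'working' so it can be retried or flagged later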

A mental model to finish
    I've glossed over some issues here but you get the picture. You never want any part of your real-time ETL process waiting on another process that is still working. I like to picture it as a warehouse 'pick line': in normal ETL you ask each person to hand off directly to the next person. They do their work and then, if the next person isn't ready, they have to wait to hand off. In real-time ETL there is a 'rolling table' (queue) between each person. They pick their items off the end of their table, do their work, and then place them at the start of the next table.

Thursday, 30 September 2010

Getting started with real-time ETL and the dark art of polling:

    There has been a lot of discussion about real-time ETL over the last few years and a lot of it can be summarised as "don't do it unless you REALLY need to". Helpful, eh? I recently had the need to deal with real-time for the first time so I thought I would summarise my approach to give you some food for thought if you're starting (or struggling) on this same journey.


Is it really real-time?
    The question often asked is "how far behind the source (in time) can you be and still call it real-time?". I don't really care about this kind of latency. I think it's basically posturing: "I'm more real-time than you are". My feeling is that I want something that works continuously first and I'll worry about latency later. As long as the process is always working to catch up to the source, that's a good start.


Old options are out
    Next question (and one that actually matters): "How will you know when the data has changed on the source?" This is an old classic from batch ETL; the difference is that we have taken some of our traditional options away. In batch ETL we could periodically extract the whole resource and do a complete compare. Once you go real-time, this approach will miss a large number of changes where the same resource is updated several times between extracts. In fact, I would say that repeated updates of a single resource are the main type of insight that real-time adds, so you better make sure you're capturing them.


CDC:  awesome and out of reach
    What can you do to capture changes? Your first (and best) option is change data capture. CDC itself is beyond the scope of this discussion; the main point is that it is tightly bound to the source system. If you've been around data warehousing or data integration for more than 5 minutes you can see how that could be a problem. There are numerous halfway-house approaches which I won't go over; suffice it to say that most enterprise databases have metadata tables and pseudo-column values that they use internally to keep track of changes, and these can be a rich seam of information for your real-time ETL quest.


Polling: painful but necessary
    You will inevitably come across some system which allows you no detailed interaction with its backend. Web-based services are the perfect case here: you're not going to get access to the remote database, so you just have to cope with using their API. And that leaves you with - POLLING. Basically asking the source system 'has this resource changed?' or (when you can't ask that) extracting the resource and comparing it to your copy.
    A naive approach would be to simply iterate through the entire list of resources over a given interval. The time it takes to complete an iteration would be, roughly speaking, your latency from live. However, DON'T DO THIS unless you want to be strangled by the source system's SysAdmin or banned from API access to the web service.


My 'First law of real-time ETL'
    So I would propose the following heuristic: data changed by humans follows Newton's first law. Restated:
'Data in motion will stay in motion, data at rest will stay at rest.' 
    Basically, a resource that has changed is more likely to be changed again when you next check. Conversely, a resource which has not changed since you last checked is less likely to have changed when you check again. To implement this in your polling process you simply track how many times you've checked the resource without finding a change and adjust your retry interval accordingly; a sketch of this follows the example below.
For example:
> Check resource - no change - unchanged count = 1 - next retry = 4 min
> Check resource - no change - unchanged count = 2 - next retry = 8 min
> Check resource - no change - unchanged count = 3 - next retry = 16 min
> Check resource - no change - unchanged count = 4 - next retry = 32 min
> Check resource - CHANGED - unchanged count = 0 - next retry = 1 min
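
    Here is the sketch I promised: a minimal Python version of this back-off, with a cap so a dormant resource still gets checked occasionally. resource_has_changed and the interval cap are my own assumptions; the numbers reproduce the example above.

import time

BASE_INTERVAL = 60        # 1 minute, matching the example above
MAX_INTERVAL = 60 * 60    # never back off beyond an hour

def next_retry_seconds(unchanged_count):
    if unchanged_count == 0:      # it just changed, so check again soon
        return BASE_INTERVAL
    # 1 unchanged check -> 4 min, 2 -> 8 min, 3 -> 16 min, 4 -> 32 min, ...
    return min(BASE_INTERVAL * 2 ** (unchanged_count + 1), MAX_INTERVAL)

def resource_has_changed(resource_id):
    # Placeholder: in reality an API call or a compare against your stored copy.
    return False

def poll_forever(resource_id):
    unchanged_count = 0
    while True:
        if resource_has_changed(resource_id):
            unchanged_count = 0
            # ... add a job to the {retrieve} queue here ...
        else:
            unchanged_count += 1
        time.sleep(next_retry_seconds(unchanged_count))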


Keep it simple stupid
    This is a simplistic approach but it can massively reduce the strain you place on the source system. You should also be aware of system-driven changes (e.g. invoice generation) and data relationships (e.g. a company address change means you need to check the other company elements sooner than scheduled). Note too that changes which are not made by humans are much less likely to obey this heuristic.


A note for the web dudes
    Finally, if you are mostly working with web services then familiarise yourself with the following:
> Webhooks: basically change data capture for the web. You subscribe to a resource and changes are pushed to a URL you specify. Sadly, webhooks are not widely supported right now.
> RSS: that little orange icon you see on every blog you read. Many services offer RSS feeds of recently changed data, which is a good compromise.
> ETag and If-Modified-Since: HTTP headers that push the burden of looking for changes onto the remote service (which is nice); a sketch of a conditional GET follows below.
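
    As a quick illustration of that last point, here is a sketch of a conditional GET in Python using the requests library (strictly, the ETag value goes back in an If-None-Match header and the Last-Modified value in If-Modified-Since). The URL and the assumption that the service honours these headers are, of course, hypothetical.

import requests  # third-party library: pip install requests

def fetch_if_changed(url, etag=None, last_modified=None):
    # Send back the validators from the previous response; a compliant server
    # replies 304 Not Modified (with no body) if the resource hasn't changed.
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None, etag, last_modified      # unchanged: nothing to do
    # Changed (or server ignored the headers): keep the new validators for next time.
    return resp.content, resp.headers.get("ETag"), resp.headers.get("Last-Modified")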


Good luck.
