Friday 1 October 2010

Further thoughts on real-time ETL, the Magic of Queues

    Now that we have polling out of the way, let's talk about integrating the data once you know something has changed. If you are doing this kind of thing for the first time then your inclination will be towards "one giant script"; do this > then this > then this > then this > ad infinitum. Avoid this pattern. It's not not practical and it's hard to change. However, in real-time ETL it will come back to bite you - hard.


Real-time = inconsistent
    The hard thing about real-time ETL, in my view, is not the continuous processing. It's the inconsistency. One system may receive a large number of changes in a short spell, another system may go hours without a single update. Or all of your systems could (easily) burst into life at exactly 9 AM and hammer you with updates for an hour. But these are just the obvious options.


The dreaded 'mysterious slowdown'
    A more insidious problems is the dreaded 'mysterious slowdown'. Let's say you have a dependency on address checking using an app from MoMoneySys. The app runs on it's own server, was installed by the MMS engineers and is not to be interfered with on penalty of death. The app is like a foreign footballer, brilliant when it works, liable to collapse under pressure. If your real-time ETL flows directly through this app then your performance is directly tied to it's performance, not a nice situation to be in.

Time-to-live? Get over it
    I mentioned in the previous post that I'm not really interest in talking about how far my ETL process is behind live and this pretty much explains why; it's just meaningless talk. If no updates are happening on the source and my ETL process is empty then my 'time-to-live' is 0. If the app suddenly saturates me with changes then my TTL is going to go down the pan. The best I can do is monitor the average performance and make sure my ETL 'pipe' is 'wide' enough to deliver a performance I'm happy with (another post perhaps…).

Behold the QUEUE
      So you need a way to cope with the vagaries of sporadic updates in the source, patchy performance in dependencies and, simply, your desire to remain sane. And behold I present to you the QUEUE. Queueing is an excellent way to break up your processing flow and handle the inconsistent, patchy nature of real-time ETL.
     Contrary to the title, there is no magic involved in queues. It's simply a list of jobs that need to be worked. There are any number queueing apps/protocols available; from the complex, like AMQP (http://en.wikipedia.org/wiki/AMQP), to the fairly simple, like BeanstalkD (http://kr.github.com/beanstalkd/), to my personal favourite - the queue table. Jobs are added to the queue and then worked in order. Typically you also give them a priority and add some logic where a job is checked out of the queue to be worked and only deleted on success.

Queues beat the real-time blues
    So how can we use queues to help our real-time blues? Basically we use queues to buffer each processing step from the next step. Let's think about processing the steps I've mentioned (above and previous post) in a real-time ETL process that uses internal and external sources.

A real-time ETL example
A. Check for changes
A.1 Poll external web service for changes
A.2 Receive change notices from internal database
B. Retrieve changed data
B.1 Extract changed data from web service
B.2 Extract changed data from internal database
C. Validate changed data
C.1 Append geocoding information (using geonames.org data)
C.2 Process new personal data with external identity verification service
D. Load data into final database (e.g. data warehouse)

This time with queues
A. Check for changes
> Take job from the {poll} queue
A.1 Poll external web service for changes (based on job)
> Add job to the {retrieve} queue
A.2 Receive change notices from internal database
> Add job to the {retrieve} queue
B. Retrieve changed data
> Take job from the {retrieve} queue
> Choose worker 1 or 2 for the job
B.1 Extract changed data from web service
> Load raw changes to the staging area
> Add job to the {validate} queue
B.2 Extract changed data from internal database
> Load raw changes to the staging area
> Add job to the {validate} queue
C. Validate changed data
> Take job from the {validate} queue
> Add job to {geocode} or {identity} queues (or both) 
C.1 Append geocoding information (using geonames.org data)
> Take job from the {geocode} queue
> Update staging tables with geocoding.
> Add job to the {final} queue
C.2.a Submit new personal data to external identity verification service
> Take job from the {identity} queue
> Update staging tables with geocoding.
> Add job to the {ready} queue
C.2.b Retrieve verified personal data from verification service
> DON'T WAIT! Use a separate listener to receive the data back
> Update staging tables as required (or any other work…)
> Add job to the {final} queue
D. Load data into final database (e.g. data warehouse)
> Take jobs from {final}
> Load data in the final database

A mental model to finish
    I've glossed over some issues here but you get the picture. You never want to have any part of your real-time ETL process waiting on another process that is still working. I like to picture it as a warehouse 'pick line': In normal ETL you ask each person to hand off directly to the next person. They do their work and then, if the next person isn't ready, they have to wait to hand off. In real-time ETL, there is a 'rolling table' (queue) between each person. They pick their items off the end of their table, do their work, and then place at the start of the next table.

No comments:

Post a Comment

Disqus for @joeharris76