Thursday, 30 September 2010

Getting started with real-time ETL and the dark art of polling:

    There has been a lot of discussion about real-time ETL over the last few years and a lot of it can be summarised as "don't do it unless you REALLY need to". Helpful, eh? I recently had the need to deal with real-time for the first time so I thought I would summarise my approach to give you some food for thought if you're starting (or struggling) on this same journey.


Is it really real-time?
    The question often asked is "how far behind the source (in time) can you be, and still call it real-time?". I don't really care about this kind of latency. I think it's basically posturing; "I'm more real-time than you are". My feeling is that I want something that works continuously first and I'll worry about latency later. As long as the process is always working to catch up to the source that's a good start.


Old options are out
    Next question (and one that actually matters): "How will you know when the data has changed on the source?" This is an old classic from batch ETL; the difference is that we have taken some of our traditional options away. In batch ETL we could periodically extract the whole resource and do a complete compare. Once you go real-time, this approach will miss a large number of changes where the same resource is updated multiple times between extracts. In fact, I would say that repeated updates of a single resource are the main type of insight that real-time adds, so you had better make sure you're capturing them.


CDC:  awesome and out of reach
    What can you do to capture changes? Your first (and best) option is change data capture. CDC itself is beyond the scope of this discussion, however the main point is that it is tightly bound to the source system. If you've been around data warehousing or data integration for more than 5 minutes you can see how that could be a problem. There are numerous half-way house approaches which I won't go over; suffice it to say that most enterprise databases have metadata tables and pseudo-column values that they use internally to keep track of changes and these can be a rich seam of information for your real-time ETL quest.
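    As an illustration of the half-way house idea, here is a minimal sketch of timestamp-based incremental extraction. The table name and `last_modified` column are hypothetical stand-ins for whatever change metadata your source actually exposes, and SQLite is used only as a generic DB-API connection.

```python
import sqlite3  # stand-in for any DB-API source connection

def extract_changes(conn, last_checkpoint):
    """Pull only rows modified since the last checkpoint.
    Assumes a hypothetical 'last_modified' timestamp column; many databases
    expose similar metadata via pseudo-columns or change-tracking tables."""
    cursor = conn.execute(
        "SELECT id, payload, last_modified FROM source_table "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_checkpoint,),
    )
    rows = cursor.fetchall()
    # Advance the checkpoint to the newest change we saw this pass
    new_checkpoint = rows[-1][2] if rows else last_checkpoint
    return rows, new_checkpoint

# Usage: keep the checkpoint in durable storage between runs
# conn = sqlite3.connect("source.db")
# rows, checkpoint = extract_changes(conn, "2010-09-30 00:00:00")
```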


Polling: painful but necessary
    You will inevitably come across some system which allows you no detailed interaction with its backend. Web-based services are the perfect case here - you're not going to get access to the remote database so you just have to cope with using their API. And that leaves you with - POLLING. Basically asking the source system: 'has this resource changed' or (when you can't ask that) extracting the resource and comparing it to your copy.
    A naive approach would be to simply iterate through the entire list of resources over a given interval. The time it takes to complete an iteration would be, roughly speaking, your latency from live. However, DON'T DO THIS unless you want to be strangled by the source system's SysAdmin or banned from API access to the web service.


My 'First law of real-time ETL'
    So I would propose the following heuristic: data changed by humans follows Newton's first law. Restated:
'Data in motion will stay in motion, data at rest will stay at rest.' 
    Basically, a resource that has changed is more likely to be changed again when you next check. Conversely, a resource which has not changed since you last checked is less likely to change when you check again. To implement this in your polling process you would simply track how many times you've checked the resource without finding a change and adjust your retry interval accordingly (a code sketch of this follows the example below).
For example:
> Check resource - no change - unchanged count = 1 - next retry = 4 min
> Check resource - no change - unchanged count = 2 - next retry = 8 min
> Check resource - no change - unchanged count = 3 - next retry = 16 min
> Check resource - no change - unchanged count = 4 - next retry = 32 min
> Check resource - CHANGED - unchanged count = 0 - next retry = 1 min
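    Here is a minimal sketch of that backoff logic in Python. The `resource_changed()` and `process_change()` callbacks are hypothetical placeholders for your actual API call and load step, and the plain doubling factor gives a slightly gentler schedule than the example above; tune the base interval and cap to suit the source.

```python
import time

BASE_INTERVAL = 60          # 1 minute after a change is seen
MAX_INTERVAL = 60 * 60      # never back off for more than an hour

def next_interval(unchanged_count):
    """Double the wait for every consecutive unchanged check, up to a cap."""
    return min(BASE_INTERVAL * (2 ** unchanged_count), MAX_INTERVAL)

def poll(resource_changed, process_change):
    """resource_changed() and process_change() are supplied by the caller;
    they stand in for whatever API call and load step you actually use."""
    unchanged_count = 0
    while True:
        if resource_changed():
            process_change()
            unchanged_count = 0          # changed: check again soon
        else:
            unchanged_count += 1         # unchanged: back off further
        time.sleep(next_interval(unchanged_count))
```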


Keep it simple stupid
    This is a simplistic approach but it can massively reduce the strain you place on the source system. You should also be aware of system-driven changes (e.g. invoice generation) and data relationships (e.g. a company address change means you need to check the company's other elements sooner than scheduled). You should also note that changes which are not made by humans are much less likely to obey this heuristic.


A note for the web dudes
    Finally, if you are mostly working with web services then familiarise yourself with the following:
> Webhooks, basically change data capture for the web. You subscribe to a resource and notifications of changes are sent to a URL you specify. Sadly, webhooks are not widely supported right now.
> RSS, that little orange icon that you see on every blog you read. Many services offer RSS feeds of recently changed data and this is a good compromise.
> ETag and If-Modified-Since, HTTP headers that push the burden of looking for changes off to the remote service (which is nice).
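    As a rough sketch of the conditional GET approach, here is a Python example using the third-party requests library; the URL in the usage note is a hypothetical stand-in for whatever resource you're polling.

```python
import requests

def check_resource(url, etag=None, last_modified=None):
    """Ask the server whether the resource has changed since we last saw it.
    Returns (changed, body, new_etag, new_last_modified)."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    response = requests.get(url, headers=headers, timeout=30)

    if response.status_code == 304:      # Not Modified: nothing to do
        return False, None, etag, last_modified

    return (
        True,
        response.content,
        response.headers.get("ETag"),
        response.headers.get("Last-Modified"),
    )

# Usage: persist the returned ETag / Last-Modified values between polls
# changed, body, etag, last_mod = check_resource("https://example.com/resource.xml")
```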


Good luck.

Monday, 1 March 2010

Financialisation: Optimised to death.

A few previous tweets:
>> Today's theme: financialisation. Not really a word, made it up to describe the trend of running businesses as if they were hedge funds.
>> Financialisation: #1 benchmark industry for advanced analytics is (still) the financial industry & financial markets. Is this a good sign?
>> Financialisation: For me Enron is the base case of financialising the energy business. There was deceit but at core it was models run amok.
>> Financialisation: CEP, Real-time BI, etc: Can you filter signal from noise in real time? Processing delay may prevent overreactions.

Financialisation: a term that I invented (AFAIK) to describe running 'normal' businesses like they're hedge funds, i.e. using statistics and 'quantitative models' to wring every bit of excess / waste / inefficiency out of a business. Traders on the financial markets attempt to profit from small price movements by developing complex predictive models and the ability to move on changes very, very quickly. The problem is that financial markets are not like normal businesses. They are probably more like a casino than a business that provides tangible goods or services (e.g. dentists, dry cleaners, demolition, design, etc.).

Many business executives seem envious of this apparent ability to turn thin air into money using leverage and very fast moving transactions. One suspects they would love to turn their own business over so quickly and, perhaps, avoid messy interactions with opinionated customers. Business Intelligence and Analytic Database companies haven't failed to notice this desire and heavily market their Finance reference customers in other sectors.

My thoughts on this are influenced by Nassim Nicholas Taleb's books "Fooled by Randomness" and "The Black Swan". His premise is that the world is much more random and much less predictable than it appears to human observers and events often come out of left field to completely upset our ideas (hence the black swan). Taleb never mentions Business Intelligence or Analytics, but I'm struck by the relevance of his ideas to our industry.

On the other side of the fence; Thomas H. Davenport's "Competing on Analytics" is the standard bearer for financialisation and a favourite handout of BI vendors (e.g. Oracle, Microsoft). The choice quote: "Employees hired for their expertise with numbers …are armed with the best evidence… As a result, they make the best decisions." Really? Simply applying the power of numbers to a business, using very clever people of course, is a sure fire way to success? Does that mesh with your experience?

Consider the case of Enron. They were principally involved in energy supply, which has a very real need to analyse and forecast future demand. Enron got into trouble by using models (and modellers) to make highly leveraged plays on the energy futures market. Ultimately their crimes were about deceit (they used shell companies to inflate profits and conceal losses); however, it is my understanding that their losses stemmed from deals based on very sophisticated models that did not turn out as predicted.

A counter example very relevant to Business Intelligence is disaster recovery. It's common practice in IT to run a disaster recovery copy of important systems. We keep an exact duplicate of the system in another data centre far from the primary system so that, if the worst happens, business can carry on by switching to the DR instance. This is inherently excess capacity that bears significant costs and yet we hope it will never be used. We carry the cost of all this "excess" equipment for a very good reason; the cost of not having it is potentially much, much higher.

This sums up the risk of financialisation: can you be certain that what looks like excess (on the cost side) is not actually very important? Can you be sure that what looks like new profit (on the opportunity side) is not exposing you to a large unexpected loss?

{This is where I was going to go over some common types of analysis and discuss whether they are more or less likely financialised. But this has been in draft for long enough so that will have to wait. TTFN}

Friday, 1 January 2010

Unsolicited advice for Linked In and Stack Overflow - MERGE!

I think Linked In and Stack Overflow are on a collision course. They have both established impressive beachheads in the nascent market for professional reputation services, in particular for reputation that cannot be faked.

Linked In comes at this from the position of an online C.V./resume service that allows you to "connect" to people you've worked with. In theory it's a business version of Facebook, in practice it's actually a reputation service. I do not maintain highly active relationships with former colleagues and customers. We connect on Linked In because it allows us to keep in touch with little effort and verifies who we are, the roles we've held, and the work we've done.

Linked In have recently expanded their group functionality to create discussion forums so people can converse with "real" people. Sadly these groups are trending heavily towards spammy selling posts. There is no way to remove the noise from these groups and no obvious reward for high quality contributors.

The business model for Linked In seems to be selling premium access to user data for recruitment and sales professionals. In my opinion this is a short term model. They are really in direct competition with their customers. The recruitment industry exists because of a lack of quality information about potential employees and is ripe for disintermediation. They are also in competition with 'outside' sales professionals which represent a huge cost burden on B2B sales. The rewards for moving sales 'inside' are potentially huge.

Stack Overflow comes at reputation from the other side. They have created a high quality answer board for technical questions and with Stack Exchange have expanded the product into virtually any topic. Their key innovation is to reward high quality answers and to encourage quality contributors with incentives like badges and points.

Stack Overflow users generate a different form of reputation, they are verifying that they actually understand a specific subject. This is incredibly valuable because it verifies something that Linked In cannot: the ability to do something *again*. We've all worked with people who just scraped by, doing what they're told without necessarily understanding it. You can guarantee that those people could not generate a good reputation on Stack Overflow.

The business model for Stack Overflow seems to be around advertising and particularly job advertising via the Careers site. S.O. Careers allows users to create an online C.V./resume that is linked to their reputation. Sound familiar?

This is where they collide. Linked In has a lock on the verified C.V./resume side but the discussions functionality is poor. Stack Overflow has a lock on quality discussions and answer board functionality.

Building a reputation on Stack Overflow has some value but it's limited to that context. S.O. Careers may be relatively successful but it seems unlikely to eclipse Linked In in this respect, never mind the 800lb gorillas like Monster.

Likewise building a C.V. on Linked In has some value but participating in discussions is frustrating and has no clear payback. The depth of reputation is limited to fairly shallow "I'll scratch your back if you scratch mine" recommendations.

Potential options:
1. Stack Overflow adds social network functionality: seems unlikely to receive broad adoption for any number of reasons.
2. Linked In completely revamps their discussion functions to emulate S.O.: this would require their customers to recreate all of the questions and answers that exist across all S.O. sites.
3. Integrate using APIs. Move Linked In groups/discussions to Stack Exchange and use Linked In profiles in place of re-creating a CV on Stack Overflow.

4. MERGE!!!! - I'm dead serious here. Merging these services would create a very strong network effect. It would be the natural home for professional questions and discussions and provide clear incentives for people to share their knowledge. The combination would pretty much own the professional reputation space. If this ever happens Monster are toast…

It's a new year, time for thinking big.

Thursday, 24 December 2009

Comment on Cringely's "DVD is Dead" post

In regards to http://tr.im/Ivrs

Apple's "plan" to be "front and center" of the living room seems a lot more like an outside bet than a central strategy to me. When you see as many TV spots for the AppleTV as the iPhone then you'll know the strategy has changed. The living room tech cycle is super-slow compared to Apple's "normal" and thus difficult to integrate.

Bob's kinda missed the point here though, this actually signals the *failure* of Blu-Ray. It's just going to take over from DVD in a smooth flattish decline, no one is out there re-buying their library in Blu-Ray.

There's a huge pent up demand for an "iTunes" experience for video content. I.e. put DVD in the machine, machine makes digital copy, moves copy to my device(s), I make future purchases as downloads and everything lives in a single library.

My guess is Apple hasn't been able to swing that with the Hollywood studios yet. It's still not clear whether you're legally allowed to make a backup copy of a DVD you bought. In the meantime they're just keeping the AppleTV on life support until something gives.

Sunday, 25 October 2009

Unsolicited advice for Kickfire

Following up on the Kickfire BBBT tweetstream on Friday (23-Oct), I want to lay out my thoughts about Kickfire's positioning. I should point out that I have little experience with MySQL, no experience with Kickfire and I'm not a marketer ( but I play one on TV… ;) ).

Kickfire should consider doing the following:

1. Emphasise the benefits of the FPGA
We now know that Kickfire's "SQL chip" is in fact an FPGA. Great! They need to bring this out in the open and even emphasise it. This is actually a strength: FPGAs have seen major advances recently and a good argument can be made that they are not "proprietary hardware" but a commodity component advancing at Moore's Law speed (or better).
They should also obtain publishing rights to recent research about the speed advantages of executing SQL logic on an FPGA. Good research foundations and advances in FPGAs make Kickfire seem much more viable long term.

2. Pull back on the hyperbole.
Dump the P&G style 'Boswelox' overstatement. A lot of the key phrases in their copy seem tired. How many times have we heard about "revolutionary" advances? My suggestion is to use more concrete statements. Example: "Crunch 100 million web log records in under a minute". Focus on common tasks and provide concrete examples of improved performance.
Also, rein in the buzzwords: availability, scalability, sustainability, etc. If this is really for smaller shops and data marts then plain English is paramount. "Data mart" type customers will have to ram this down the throat of IT. They need to want it more than an iPhone or they'll just give up and go with the default.

3. Come up with a MapReduce story.
MapReduce is the new darling of the web industry. Google invented the term, Yahoo has released the main open source project and everyone just thinks it's yummy. Is it a mainstream practice? Probably not, but the bastion of MySQL is not mainstream either.
Kickfire's "natural" customers (e.g. web companies) may not have any experience with data warehousing. When they hit scaling issues with MySQL they may not go looking for a better MySQL. Even if they do they'll probably find and try Infobright in the first instance.
Kickfire needs a story about MapReduce and they need to insert themselves into the MapReduce dialogue. They need to start talking about things like "The power of MapReduce in a 4U server" or "Accelerating Hadoop with Kickfire".

4. Offer Kickfire as a service.
Kickfire needs to be available as a service. This may be a complete pain in the ass to do and it may seem like a distraction. I bet Kickfire's policy is to offer free POCs, but IMHO their prices are too low to make that scalable.
Customers need to be able to try the product out for a small project or even some weekend analysis. When they get a taste of the amazing performance then they'll be fired up to get Kickfire onsite and willing to jump through the hoops in IT.
If this is absolutely out of the question, the bargain basement approach would be to put up a publicly accessible system (registration required) filled with everything from data.gov. Stick Pentaho/Jasper on top (nice PR for the partner…) and let people play around.

5. Deliver code compatibility with Oracle and SQL Server.
There are probably compelling reasons for the choice of MySQL. However, many potential customers have never used it. They've never come across it in a previous role. It's not used anywhere in their company. Frankly, it makes them nervous.
Kickfire needs to maximise their code compatibility with Oracle and SQL Server and then they need to talk about it everywhere.

That is all. Comments?


Wednesday, 29 July 2009

Why GPUs matter for DW/BI

I tweeted a few days ago that I wasn't particularly excited about either the Groovy Corp or XtremeData announcements because I think any gains they achieve by using FPGAs will be swept away by GPGPU and related developments. I got a few replies asking what GPGPU is and a few dismissing it as irrelevant (vis-à-vis Intel x64 progress). So I want to explain my thoughts on GPGPU, and how it may affect the Database / Business Intelligence / Analytics industry (industries?).

GPGPU stands for "general-purpose computing on graphics processing units". ([dead link]) GPGPU is also referred to as "stream processing" or "stream computing" in some contexts. The idea is that you can offload the processing normally done by the CPU to the computer's graphics card(s).
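To make the offloading idea concrete, here is a minimal sketch using the PyOpenCL bindings (my choice for illustration, not something from the announcements discussed here). It assumes a machine with a working OpenCL driver and pushes a trivially parallel per-row calculation onto whatever GPU (or CPU) device is available.

```python
import numpy as np
import pyopencl as cl  # third-party OpenCL binding; an OpenCL-capable driver is assumed

# A million floats to transform - the sort of per-row work a DB engine does constantly
data = np.random.rand(1_000_000).astype(np.float32)

ctx = cl.create_some_context()           # picks an available OpenCL device
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

src_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=data)
dst_buf = cl.Buffer(ctx, mf.WRITE_ONLY, data.nbytes)

# The kernel runs once per element, spread across the device's many cores
program = cl.Program(ctx, """
__kernel void scale(__global const float *src, __global float *dst) {
    int gid = get_global_id(0);
    dst[gid] = src[gid] * 1.175f;   /* e.g. apply a rate to every row */
}
""").build()

program.scale(queue, data.shape, None, src_buf, dst_buf)

result = np.empty_like(data)
cl.enqueue_copy(queue, result, dst_buf)   # pull the answers back to host RAM
```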

But why would you want to? Well, GPUs are on a roll. Their performance is increasing exponentially faster than the increase in CPU performance. I don't want to overload this post with background info but suffice to say that GPUs are *incredibly* powerful now and getting more powerful much faster than CPUs. If you doubt that this is the case have a look at this article on the Top500 supercomputing site, point 4 specifically. ([dead link])

This is not a novel insight on my part. I've been reading about this trend since at least 2004. There was a memorable post on Coding Horror in 2006 ([dead link]). Nvidia released their C compatibility layer "CUDA" in 2006 ([dead link]) and ATI (now AMD) released their alternative "Stream SDK" in 2007 ([dead link]). More recently the OpenCL project has been established to allow programmers to tap the power of *any* GPU (Nvidia, AMD, etc) from within high level languages. This is being driven by Apple and their next OSX update will delegate many tasks to the GPU using OpenCL.

That's the what.

Some people feel that GPGPU will fail to take hold because Intel will eventually catch up. This is a reasonable point of view and in fact Intel has a project called Larrabee ([dead link]). They are attempting to make a hybrid chip that effectively emulates a GPU within the main processor. It's worth noting that this is very similar to the approach IBM have taken with the Cell chip used in the Playstation3 and many new supercomputers. Intel will be introducing a new set of extensions (like SSE2) that will have to be used to tap into the full functionality. The prototypes that have been demo'ed are significantly slower than current pure GPUs. The point is that Intel are aware of GPGPU and are embracing it. The issue for Intel is that the exponential growth of GPU power looks like it's going to put them on the wrong side of a technology growth curve for once.


Why are GPUs important to databases and analytics?
  1. The multi-core future is here now.
     I'm sure you've heard the expression "the future is already here it's just unevenly distributed". Well that applies double to GPGPU. We can all see that multi-core chips are where computing is going. The clock speed race ended in 2004. Current high end CPUs now have 4 cores and 8 cores will arrive next year and on it goes. GPUs have been pushing this trend for longer and are much further out on this curve. High end GPUs now contain up to 128 cores and the core count is doubling faster than CPUs.

  2. Core scale out is hard.
     Utilizing more cores is not straightforward. Current software does not utilize even 2 cores effectively. If you have a huge spreadsheet calculating on your dual core machine you'll notice that it only uses one core. So half the available power of your PC is just sitting there while you're twiddling your thumbs.

     Database software has a certain amount of parallelism built in already, particularly the big 3 "enterprise" databases. But the parallel strategies they employ were designed for single-core chips residing in their own sockets and having their own private supply of RAM. Can they use the cores we have right now? Yes, but the future now looks very different. Hundreds of cores on a single piece of silicon.

    Daniel Abadi's recent post about hadoopDB predicts a "scalability crisis for the parallel database system". His point is that current MPP databases don't scale well past 100 nodes ([dead link]). I'm predicting a similar crisis in scalability for *all database systems* at the CPU level. Strategies for dividing tasks up among 16 or 32 or even 64 processors with their own RAM will grind to a halt when used across 256 (and more) cores on a single chip with a single path to RAM.

  3. Main memory I/O is the new disk I/O.
     Disk access has long been our Achilles heel in the database industry. The rule of thumb for improving performance is to minimize the amount of disk I/O that you perform. This weakness has become ever more problematic as disk speeds have increased very, very slowly compared to CPU speed. Curt Monash had a great post about this a while ago ([dead link]).

    In our new multi-core world we will have a new problem. Every core we add increases the demand for data going into and out of RAM. Intel have doubled the width of this "pipe" in recent chips but practical considerations will constrain increases in this area in a similar manner to the constraints on disk speed seen in the past.

  4. Databases will have to change.
     Future databases will have to be heavily rewritten and probably re-architected to take advantage of multi-core processor improvements. Products that seek to fully utilize many cores will have to be just as parsimonious with RAM access as current generation columnar and "in-memory" databases are with disk. Further, they will have to be just as savvy about parallelizing work as current MPP databases, but they will have to co-ordinate this parallelism at 2 levels instead of just 1 (a rough sketch of the per-core split/recombine idea follows this list):

    • 1st: Activity and data must be split and recombined across Servers/Instances (as currently)

    • 2nd: Activity and data must be split and recombined across Cores, which will probably have dedicated RAM "pools".

  5. 1st movers will gain all the momentum.
     So, finally, this is my basic point. There's a new world coming. It has a lot of cores. It will require new approaches. That world is accessible today through GPUs. Database vendors who move in this direction now will gain market share and momentum. Those who think they can wait for Intel and "traditional" CPUs to "catch up" may live to regret it.
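Here is the per-core split-and-recombine idea from point 4, sketched with nothing more exotic than Python's multiprocessing module. It's a toy stand-in for what a database engine would do internally, not a claim about any vendor's implementation.

```python
from multiprocessing import Pool, cpu_count

def partial_sum(chunk):
    """The 'map' step: each core aggregates its own slice of the data."""
    return sum(chunk)

def parallel_total(values, workers=None):
    """Split the data across cores, aggregate per core, then recombine.
    A real engine would do the same dance again across servers (level 1)
    before doing it across cores (level 2)."""
    workers = workers or cpu_count()
    chunk_size = (len(values) + workers - 1) // workers
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with Pool(workers) as pool:
        partials = pool.map(partial_sum, chunks)   # split across cores
    return sum(partials)                           # recombine the partial results

if __name__ == "__main__":
    print(parallel_total(list(range(10_000_000))))
```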

A few parting thoughts…

I said at the start that I feel FPGAs will be swept away. I should make 2 caveats to that. First, I can well imagine a world where FPGAs come to the fore as a means to co-ordinate very large numbers of small simple cores. But I think we're still quite a long way from that time. Second, Netezza use FPGAs in a very specific way between the disk and CPU/RAM. This seems like a grey area to me, however Vertica are able to achieve very good performance without resorting to such tricks.

Kickfire is a very interesting case as regards GPGPU. They are using a "GPU-like" chip as their workhorse. Justin Swanhart was very insistent that their chip is not a GPU (that is an analogy) and that it is truly a unique chip. For their sake I hope this is marketing spin and the chip is actually 99% standard GPU with small modifications. Otherwise, I can't imagine how a start-up can engage in the core count arms race long term, especially when it sells to the mid-market. Perhaps they have plans to move to a commodity GPU platform.

A very interesting paper was published recently about performing database operations on a GPU. You can find it here ([dead link]). I'd love to know what you think of the ideas presented.

Finally, I want to point out that I'm not a database researcher nor an industry analyst. My opinion is merely that of a casual observer, albeit an observer with a vested interest. I hope you will do me the kindness of pointing out the flaws in my arguments in the comments.

Monday, 6 July 2009

Useful benchmarks vs human nature. A final thought on the TPC-H dust-up.

There was a considerable flap recently on Twitter and in the blogosphere about TPC-H in general. It was all triggered by the new benchmark submitted by ParAccel in the 30TB class. You can relive the gory details on Curt Monash's DBMS2 site here (http://tr.im/rbCe), if you're interested.

I stayed out of the discussion because I'm kind of burned out on benchmarks in general. I got fired up about benchmarks a while ago and even sent an email with some proposals to Curt. He was kind enough to respond and his response can be summed up as "What's in it for the DB vendor?". Great question and, to be honest, not one I could find a good answer for.

For the database buyer, a perfect benchmark tells them which database has the best mix of cost and performance, especially in data warehousing. This is what TPC-H appears to offer (leaving aside the calculation of their metrics). However, a lot of vendors have not submitted a benchmark. It's interesting to note that vendors such as Teradata, Netezza and Vertica are TPC members but have no benchmarks. The question is why not.

For a database vendor, a perfect benchmark is a benchmark that they can win. Curt has referred to Oracle's reputed policy of WAR (win all reviews). This is why their licenses specifically prohibit you from publishing benchmarks. There is simply no upside to being 3rd, 5th or anything but first in a benchmark. If Oracle are participating in a given benchmark, the simple economic reality is that they know they can win it.

This is the very nature of TPC-H: it is designed to be very elastic and to allow vendors wiggle room so that they can submit winning figures. I'm sure the TPC folks would disagree on principle, but the TPC is an industry group made up of vendors. Anything that denies them this wiggle room will either be vetoed or get even less participation than we currently see.

This is a bitter pill to swallow but seems unlikely to change. These days I'm delivering identical solutions across Teradata, Netezza, Oracle and SQL Server. I have some very well formed thoughts on the relative cost and performance of these databases but of course I can't actually publish any data.

By the way, the benchmark I suggested to Curt was about reducing the hardware variables. Get a hardware vendor to stand up a few common configurations (mid-size SMP using a SAN, 12 server cluster using local storage, etc.) at a few storage levels (1TB, 10TB, 100TB) and then test each database using identical hardware. The metrics would be things like max user data, aggregate performance, concurrent simple queries, concurrent complex queries, etc. Basically trying to isolate the performance elements that are driven by the database software and establish some approximate performance boundaries. With many more metrics being produced there can be a lot more winners. Maybe the TPC should look into it…
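For what it's worth, the "concurrent simple queries" metric is easy enough to sketch. In this toy harness SQLite stands in for whichever database is under test, and the query, thread count and query counts are arbitrary placeholders.

```python
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor

DB_PATH = ":memory:"         # stand-in; point this at the system under test
QUERY = "SELECT 1"           # stand-in for a 'simple' benchmark query
CONCURRENCY = 16
QUERIES_PER_WORKER = 100

def worker(_):
    # Each worker gets its own connection, as concurrent users would
    conn = sqlite3.connect(DB_PATH)
    for _ in range(QUERIES_PER_WORKER):
        conn.execute(QUERY).fetchall()
    conn.close()

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(worker, range(CONCURRENCY)))
elapsed = time.time() - start

total = CONCURRENCY * QUERIES_PER_WORKER
print(f"{total} queries at concurrency {CONCURRENCY}: "
      f"{total / elapsed:.0f} queries/sec")
```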
