Thursday 24 December 2009

Comment on Cringely's "DVD is Dead" post

In regard to http://tr.im/Ivrs

Apple's "plan" to be "front and center" of the living room seems a lot more like an outside bet than a central strategy to me. When you see as many TV spots for the AppleTV as the iPhone then you'll know the strategy has changed. The living room tech cycle is super-slow compared to Apple's "normal" and thus difficult to integrate.

Bob's kinda missed the point here though: this actually signals the *failure* of Blu-ray. It's just going to take over from DVD in a smooth, flattish decline; no one is out there re-buying their library on Blu-ray.

There's huge pent-up demand for an "iTunes" experience for video content. I.e. put a DVD in the machine, the machine makes a digital copy and moves it to my device(s), I make future purchases as downloads and everything lives in a single library.

My guess is Apple hasn't been able to swing that with the Hollywood studios yet. It's still not clear whether you're legally allowed to make a backup copy of a DVD you bought. In the meantime they're just keeping the AppleTV on life support until something gives.

Sunday 25 October 2009

Unsolicited advice for Kickfire

Following up on the Kickfire BBBT tweetstream on Friday (23-Oct), I want to lay out my thoughts about Kickfire's positioning. I should point out that I have little experience with MySQL, no experience with Kickfire and I'm not a marketer ( but I play one on TV… ;) ).

Kickfire should consider doing the following:

1. Emphasise the benefits of the FPGA
We now know that Kickfire's "SQL chip" is in fact an FPGA. Great! They need to bring this out in the open and even emphasise it. This is actually a strength: FPGAs have seen major advances recently, and a good argument can be made that they are not "proprietary hardware" but a commodity component advancing at Moore's Law speed (or better).
They should also obtain publishing rights to recent research about the speed advantages of executing SQL logic on an FPGA. Good research foundations and advances in FPGAs make Kickfire seem much more viable long term.

2. Pull back on the hyperbole.
Dump the P&G style 'Boswelox' overstatement. A lot of the key phrases in their copy seem tired. How many times have we heard about "revolutionary" advances? My suggestion is to use more concrete statements. Example: "Crunch 100 million web log records in under a minute". Focus on common tasks and provide concrete examples of improved performance.
Also, rein in the buzzwords: availability, scalability, sustainability, etc. If this is really for smaller shops and data marts then plain English is paramount. "Data mart" type customers will have to ram this down the throat of IT. They need to want it more than an iPhone or they'll just give up and go with the default.

3. Come up with a MapReduce story.
MapReduce is the new darling of the web industry. Google invented the term, Yahoo has released the main open source project and everyone just thinks it's yummy. Is it a mainstream practice? Probably not, but the bastion of MySQL is not mainstream either.
Kickfire's "natural" customers (e.g. web companies) may not have any experience with data warehousing. When they hit scaling issues with MySQL they may not go looking for a better MySQL. Even if they do they'll probably find and try Infobright in the first instance.
Kickfire needs a story about MapReduce and they need to insert themselves into the MapReduce dialogue. They need to start talking about things like "The power of MapReduce in a 4U server" or "Accelerating Hadoop with Kickfire".
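
For anyone who hasn't met the model, this is the whole of MapReduce in miniature (a Python toy of my own, nothing to do with Kickfire's actual product): a map step emits key/value pairs, a shuffle groups them by key, and a reduce step collapses each group. Counting tokens in web log lines is the canonical example.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Emit (key, value) pairs; here, one (token, 1) per token in a log line.
    return [(token, 1) for token in record.split()]

def reduce_phase(key, values):
    # Collapse all values seen for one key.
    return key, sum(values)

def mapreduce(records):
    grouped = defaultdict(list)
    pairs = chain.from_iterable(map_phase(r) for r in records)
    for key, value in pairs:                 # the "shuffle": group every value by its key
        grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(mapreduce(["GET /home", "GET /about", "POST /home"]))
# {'GET': 2, '/home': 2, '/about': 1, 'POST': 1}
```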

4. Offer Kickfire as a service.
Kickfire needs to be available as a service. This may be a complete pain in the ass to do and it may seem like a distraction. I bet Kickfire's policy is to offer free POCs. But IMHO their prices are too low to make this scalable.
Customers need to be able to try the product out for a small project or even some weekend analysis. When they get a taste of the amazing performance then they'll be fired up to get Kickfire onsite and willing to jump through the hoops in IT.
If this is absolutely out of the question, the bargain basement approach would be to put up a publicly accessible system (registration required) filled with everything from data.gov. Stick Pentaho/Jasper on top (nice PR for the partner…) and let people play around.

5. Deliver code compatibility with Oracle and SQL Server.
There are probably compelling reasons behind the choice of MySQL. However, many potential customers have never used it. They've never come across it in a previous role. It's not used anywhere in their company. Frankly, it makes them nervous.
Kickfire needs to maximise their code compatibility with Oracle and SQL Server and then they need to talk about it everywhere.
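
To make "code compatibility" concrete, this is the kind of dialect gap that stalls a migration: the same "top 10 with a NULL default" report query written for the three engines (the table and column names are invented for illustration).

```python
# Hypothetical example: the same report query in three dialects.
queries = {
    "mysql": """
        SELECT customer_id, IFNULL(region, 'UNKNOWN') AS region
        FROM orders
        ORDER BY order_date DESC
        LIMIT 10""",
    "oracle": """
        SELECT * FROM (
            SELECT customer_id, NVL(region, 'UNKNOWN') AS region
            FROM orders
            ORDER BY order_date DESC
        ) WHERE ROWNUM <= 10""",
    "sql_server": """
        SELECT TOP 10 customer_id, ISNULL(region, 'UNKNOWN') AS region
        FROM orders
        ORDER BY order_date DESC""",
}
```

Every line of that delta is porting effort a prospect has to sign up for, which is why compatibility deserves top billing.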

That is all. Comments?


Wednesday 29 July 2009

Why GPUs matter for DW/BI

I tweeted a few days ago that I wasn't particularly excited about either the Groovy Corp or XtremeData announcements because I think any gains they achieve by using FPGAs will be swept away by GPGPU and related developments. I got a few replies, some asking what GPGPU is and a few dismissing it as irrelevant (vis-à-vis Intel x64 progress). So I want to explain my thoughts on GPGPU, and how it may affect the Database / Business Intelligence / Analytics industry (industries?).

GPGPU stands for "general-purpose computing on graphics processing units". ([dead link]) GPGPU is also referred to as "stream processing" or "stream computing" in some contexts. The idea is that you can offload the processing normally done by the CPU to the computer's graphics card(s).
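
As a flavour of what this looks like in practice, here's a minimal sketch using PyCUDA (one of several ways to drive an Nvidia card from Python; it assumes the CUDA toolkit, a capable card and the pycuda package are installed). The kernel is deliberately trivial; the point is the offload pattern: copy the data to the card, run thousands of lightweight threads over it, copy the result back.

```python
import numpy as np
import pycuda.autoinit                      # initialises the driver and grabs the default GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Each GPU thread scales exactly one element of the array.
mod = SourceModule("""
__global__ void scale(float *a, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] *= factor;
}
""")
scale = mod.get_function("scale")

data = np.random.randn(256 * 1024).astype(np.float32)
scale(drv.InOut(data),                      # copy in, run the kernel, copy back out
      np.float32(2.0),
      block=(256, 1, 1), grid=(1024, 1))    # 1024 blocks of 256 threads covers every element
```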

But why would you want to? Well, GPUs are on a roll. Their performance is improving far faster than CPU performance. I don't want to overload this post with background info, but suffice it to say that GPUs are *incredibly* powerful now and getting more powerful much faster than CPUs. If you doubt that this is the case, have a look at this article on the Top500 supercomputing site, point 4 specifically. ([dead link])

This is not a novel insight on my part. I've been reading about this trend since at least 2004. There was a memorable post on Coding Horror in 2006 ([dead link]). Nvidia released their C compatibility layer "CUDA" in 2006 ([dead link]) and ATI (now AMD) released their alternative "Stream SDK" in 2007 ([dead link]). More recently the OpenCL project has been established to allow programmers to tap the power of *any* GPU (Nvidia, AMD, etc) from within high level languages. This is being driven by Apple and their next OSX update will delegate many tasks to the GPU using OpenCL.

That's the what.

Some people feel that GPGPU will fail to take hold because Intel will eventually catch up. This is a reasonable point of view and in fact Intel has a project called Larrabee ([dead link]). They are attempting to make a hybrid chip that effectively emulates a GPU within the main processor. It's worth noting that this is very similar to the approach IBM have taken with the Cell chip used in the Playstation3 and many new supercomputers. Intel will be introducing a new set of extensions (like SSE2) that will have to be used to tap into the full functionality. The prototypes that have been demo'ed are significantly slower than current pure GPUs. The point is that Intel are aware of GPGPU and are embracing it. The issue for Intel is that the exponential growth of GPU power looks like it's going to put them on the wrong side of a technology growth curve for once.


Why are GPUs important to databases and analytics?
  1. The multi-core future is here now.
    I'm sure you've heard the expression "the future is already here; it's just unevenly distributed". Well, that applies double to GPGPU. We can all see that multi-core chips are where computing is going. The clock speed race ended in 2004. Current high-end CPUs have 4 cores, 8 cores will arrive next year, and on it goes. GPUs have been pushing this trend for longer and are much further out on this curve. High-end GPUs now contain up to 128 cores and the core count is doubling faster than CPUs.

  2. Core scale-out is hard.
    Utilizing more cores is not straightforward. Current software does not utilize even 2 cores effectively. If you have a huge spreadsheet calculating on your dual-core machine you'll notice that it only uses one core. So half the available power of your PC is just sitting there while you're twiddling your thumbs.

    Database software has a certain amount of parallelism built in already, particularly the big 3 "enterprise" databases. But the parallel strategies they employ were designed for single-core chips residing in their own sockets and having their own private supply of RAM. Can they use the cores we have right now? Yes, but the future now looks very different: hundreds of cores on a single piece of silicon.

    Daniel Abadi's recent post about hadoopDB predicts a "scalability crisis for the parallel database system". His point is that current MPP databases don't scale well past 100 nodes ([dead link]). I'm predicting a similar crisis in scalability for *all database systems* at the CPU level. Strategies for dividing tasks up among 16 or 32 or even 64 processors with their own RAM will grind to a halt when used across 256 (and more) cores on a single chip with a single path to RAM.

  3. Main memory I/O is the new disk I/O.
    Disk access has long been our Achilles' heel in the database industry. The rule of thumb for improving performance is to minimize the amount of disk I/O that you perform. This weakness has become ever more problematic as disk speeds have increased very, very slowly compared to CPU speed. Curt Monash had a great post about this a while ago ([dead link]).

    In our new multi-core world we will have a new problem. Every core we add increases the demand for data going into and out of RAM. Intel have doubled the width of this "pipe" in recent chips but practical considerations will constrain increases in this area in a similar manner to the constraints on disk speed seen in the past.

  4. Databases will have to change.
    Future databases will have to be heavily rewritten and probably re-architected to take advantage of multi-core processor improvements. Products that seek to fully utilize many cores will have to be just as parsimonious with RAM access as current generation columnar and "in-memory" databases are with disk. Further, they will have to become just as savvy about parallelizing work as current MPP databases, but they will have to co-ordinate this parallelism at 2 levels instead of just 1 (see the sketch after this list):

    • 1st: Activity and data must be split and recombined across Servers/Instances (as currently)

    • 2nd: Activity and data must be split and recombined across Cores, which will probably have dedicated RAM "pools".

  5. 1st movers will gain all the momentum.
    So, finally, this is my basic point. There's a new world coming. It has a lot of cores. It will require new approaches. That world is accessible today through GPUs. Database vendors who move in this direction now will gain market share and momentum. Those who think they can wait for Intel and "traditional" CPUs to "catch up" may live to regret it.
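
Here's a rough sketch of the two-level split-and-recombine I mean in point 4, using plain Python as stand-in pseudo-code: a query is split across nodes (level 1), each node splits its partition across local cores (level 2), and the partial results are recombined at both levels. In a real MPP system the outer loop would run on separate servers; here everything runs on one machine purely for illustration.

```python
from multiprocessing import Pool

def local_sum(chunk):
    # Level 2 worker: one core aggregates its own slice of the data.
    return sum(chunk)

def node_aggregate(partition, cores=4):
    # Level 2: split this node's partition across its local cores, then recombine.
    step = max(1, len(partition) // cores)
    chunks = [partition[i:i + step] for i in range(0, len(partition), step)]
    with Pool(cores) as pool:
        return sum(pool.map(local_sum, chunks))

def cluster_aggregate(table, nodes=4):
    # Level 1: split the table across nodes/instances, then recombine their partials.
    step = max(1, len(table) // nodes)
    partitions = [table[i:i + step] for i in range(0, len(table), step)]
    partials = [node_aggregate(p) for p in partitions]   # sequential here; distributed in real life
    return sum(partials)

if __name__ == "__main__":
    fact_column = list(range(1000000))      # stand-in for a single fact table measure
    print(cluster_aggregate(fact_column))   # same answer as sum(fact_column), computed in two levels
```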

A few parting thoughts…

I said at the start that I feel FPGAs will be swept away. I should make 2 caveats to that. First, I can well imagine a world where FPGAs come to the fore as a means to co-ordinate very large numbers of small, simple cores. But I think we're still quite a long way from that time. Second, Netezza use FPGAs in a very specific way, between the disk and CPU/RAM. This seems like a grey area to me; however, Vertica are able to achieve very good performance without resorting to such tricks.

Kickfire is a very interesting case as regards GPGPU. They are using a "GPU-like" chip as their workhorse. Justin Swanhart was very insistent that their chip is not a GPU (the GPU description is just an analogy) and that it is truly a unique chip. For their sake I hope this is marketing spin and the chip is actually 99% standard GPU with small modifications. Otherwise, I can't imagine how a start-up can engage in the core-count arms race long term, especially when it sells to the mid-market. Perhaps they have plans to move to a commodity GPU platform.

A very interesting paper was published recently about performing database operations on a GPU. You can find it here ([dead link]). I'd love to know what you think of the ideas presented.

Finally, I want to point out that I'm not a database researcher nor an industry analyst. My opinion is merely that of a casual observer, albeit an observer with a vested interest. I hope you will do me the kindness of pointing out the flaws in my arguments in the comments.

Monday 6 July 2009

Useful benchmarks vs human nature. A final thought on the TPC-H dust-up.

There was a considerable flap recently on Twitter and in the blogosphere about TPC-H in general. It was all triggered by the new benchmark submitted by ParAccel in the 30TB class. You can relive the gory details on Curt Monash's DBMS2 site here (http://tr.im/rbCe), if you're interested.

I stayed out of the discussion because I'm kind of burned out on benchmarks in general. I got fired up about benchmarks a while ago and even sent an email with some proposals to Curt. He was kind enough to respond and his response can be summed up as "What's in it for the DB vendor?". Great question and, to be honest, not one I could find a good answer for.

For the database buyer, a perfect benchmark tells them which database has the best mix of cost and performance, especially in data warehousing. This is what TPC-H appears to offer (leaving aside the calculation of its metrics). However, a lot of vendors have not submitted a benchmark. It's interesting to note that vendors such as Teradata, Netezza and Vertica are TPC members but have no benchmarks. The question is why not.

For a database vendor, a perfect benchmark is a benchmark that they can win. Curt has referred to Oracle's reputed policy of WAR (win all reviews). This is why their licenses specifically prohibit you from publishing benchmarks. There is simply no upside to being 3rd, 5th or anything but first in a benchmark. If Oracle are participating in a given benchmark, the simple economic reality is that they know they can win it.

This is the very nature of TPC-H: it is designed to be very elastic and to allow vendors wiggle room so that they can submit winning figures. I'm sure the TPC folks would disagree on principle, but the TPC is an industry group made up of vendors. Anything that denies them this wiggle room will either be vetoed or get even less participation than we currently see.

This is a bitter pill to swallow but seems unlikely to change. These days I'm delivering identical solutions across Teradata, Netezza, Oracle and SQL Server. I have some very well formed thoughts on the relative cost and performance of these databases but of course I can't actually publish any data.

By the way, the benchmark I suggested to Curt was about reducing the hardware variables. Get a hardware vendor to stand up a few common configurations (mid-size SMP using a SAN, 12-server cluster using local storage, etc.) at a few storage levels (1TB, 10TB, 100TB) and then test each database on identical hardware. The metrics would be things like max user data, aggregate performance, concurrent simple queries, concurrent complex queries, etc. The aim is to isolate the performance elements that are driven by the database software and establish some approximate performance boundaries. With many more metrics being produced there can be a lot more winners. Maybe the TPC should look into it…
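
As a rough illustration of the "concurrent simple queries" metric, a bare-bones harness might look like the sketch below. The workload query and the connection factory are placeholders; any DB-API style driver the vendor provides would slot into `connect`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical workload query; in practice this would come from an agreed query set.
QUERY = "SELECT region, SUM(sales_amount) FROM fact_sales GROUP BY region"

def run_query(connect):
    # 'connect' is any zero-argument factory returning a DB-API connection (driver-specific).
    start = time.time()
    conn = connect()
    try:
        cursor = conn.cursor()
        cursor.execute(QUERY)
        cursor.fetchall()
    finally:
        conn.close()
    return time.time() - start

def concurrent_simple_queries(connect, concurrency=8, iterations=40):
    """Return (queries per second, mean latency) at a fixed concurrency level."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: run_query(connect), range(iterations)))
    elapsed = time.time() - start
    return iterations / elapsed, sum(latencies) / len(latencies)
```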

Sunday 5 July 2009

The future of BI? It has nothing to do with business…

I've been reading and re-reading Stuart Sutherland's excellent book Irrationality for several weeks (review to come - promise). One of the things he talks about is "making the wrong connections". His point is that humans are bad at mentally evaluating evidence and making connections: we focus on the elements that are unusual or different and we massively overvalue our initial guesses.

That really resonates with me. After all that's what Business Intelligence is about, right? We provide factual, numeric, and clean data in a format that allows the user to make reasonable, rational decisions. We lambast the BI nay-sayers who operate on "gut instinct" and rightly so. But we leave that hyper-rational approach at the office door and conduct the rest of our lives in our normal irrational way.

In truth we conduct 95% of our working lives that way as well. The minute-to-minute stuff that business is *really* made of is unrecorded, unanalysed and (of course) irrational. All those conversations, relationships, emails, phone calls and meaningful looks are dealt with by instinct.

Outside the office we're seeing an explosion in personal monitoring and self surveillance. Devices like the iPhone can track every interaction, accessories like Nike+ allow us to track every step we take, software like RescueTime continuously monitors our computer usage. Even Facebook is a way to monitor your relationships, something that seemed completely intangible a few years ago. Etc, etc, etc.

This is the future of BI: Rational Augmentation. Using tracking data to make faster, better and more rational decisions about everything in our lives. It's about dealing with huge volumes of hyper-personal data and finding the patterns that matter. It lives outside the office and outside the corporation. It's a dash of text-mining, a pinch of regression, a dollop of aggregation, a spoonful of advanced analytics and a heap of basic statistics.

Many people will feel uncomfortable about this but the young will adopt it without question and those who adopt it will do better. Let's face it, it's a sub-optimal world out there and an edge in rationality could be a very big edge indeed.


As a final thought, this has the makings of a classic innovator's dilemma for the current BI players. Rational Augmentation (I'm loving this phrase but call it what you like…) is going to need to deal with large data volumes very cheaply and very locally. It will probably be service-based. It will probably be free for at least some users. But ultimately it will be a huge market, dwarfing the current BI market. The current players may have the skills to take this on but they've been swallowed by the corporate quicksand and they will sit and watch it pass them by. C'est la vie.

Wednesday 20 May 2009

How to Fix the Newspaper Industry - everybody else is doing it…

NOTE: Don't expect me to be doing multiple posts per day. I don't know what's come over me!

Everyone seems to agree that newspapers are dead. Even here in the UK they're not doing great, although our papers seem to 'get' the web a lot more. One of the things that I hear quite a bit from the pundits is that they should make the physical paper free as well as the online version.

I was just reading a post on Tim Ferriss' blog about Alan Webber and his "RULE #24 - If you want to change the game, change the economics of how the game is played." In it he mentions the free paper theory.

This triggered a thought: giving the paper away is nowhere near a bold enough strategy. The problem with the paper is not that it costs too much (except on Sundays - £2! who are they kidding?). For a lot of people, especially the core newspaper market, the cost is not an issue. The issue is having to go get the damn thing, cart it around all day and then filter through the ads just to find a few interesting tidbits.

So here is my "fix": force people to take the paper. Stick it through *everyone's* mailbox every single day. Become *the* alternative delivery provider. I haven't bought a paper in ages but I can *guarantee* that if it came through my door I would look at it.

In the UK (and most of Europe) we have fairly strong opt-out regulation against so-called junk mail. However there is a huge loophole called the "door drop". Marketers are still allowed to put whatever they want through all of the doors in a given area. This allows a lot of room for targeting. Millionaires all live in the same neighbourhood, right? There is a big business around this. When I was involved (~2yrs ago) it cost about £0.05 per door. Now I get 3 or 4 drops a week, about 20 pieces in total. Hmm… that sounds like £1 of revenue per house per week (20 × £0.05), minus delivery costs. Seems workable.

Now you wouldn't want to push your paper on literally everyone. You would target the exact slice of the population that already reads you. Plus your economics are now much more predictable. You know exactly how many papers to print and you can streamline your distribution arm. In fact you'd want to buy or partner with someone like DHL or TNT who are already doing alternative deliveries. You also need to get your deliveries done *very* early to catch the commuters.

This is a winner-takes-all play. There is only room for a handful of players in a market like this. Once they have your paper in their hands, why would they buy a competing paper? If you get it right it should pay back in spades.

I don't really see anyone brave enough to make the switch right now. But they'll get more adventurous (desperate) as time goes on and profits dwindle.

Perhaps TNT should think about buying a newspaper group to beef up the delivery pipeline…

How To Fix Twitter - it came to me in the shower…

UPDATE: One Sentence Summary - It's possible to know in advance who will need to receive messages and therefore to structure the Twitter application and tweet data in a way that makes it much faster to deliver them.

So I got to thinking about Twitter and the ongoing problems they have keeping the service up and running smoothly. This line of thought was triggered by Twitter removing the ability to see all @ replies. This follows a long history of removing features to "streamline" the service (Google it if you care).

It's worth remembering that Twitter started out as a 'plain vanilla' Ruby on Rails app. Which is great, 'cuz RoR is great. But it means that Twitter was conceived as a database-backed, single-instance app. There are tons of articles out there about the architecture you need to scale such an app. Some of them were written by Twitter people who have since been ejected. (Again, Google it if you care).

The other thing to remember is that Twitter are only keeping a few weeks of tweets online (6-8 at last reporting). This may be a practical measure but it's also insane! There is huge value in all those old Tweets. I suspect they are doing this to limit the size of their databases. Which is a clue that they are still using a database (or probably a number of sharded databases) as the back-end.

Here's the thing though: Twitter is not a database app. It's a messaging platform. This is not a novel insight but it is important. We (the IT industry) know how to run messaging platforms at scale. We know how to run huge email services. We know how to run huge IM platforms. We know how to run huge IRC instances.

Of course Twitter is not exactly like any of those things. It's an asynchronous, asymmetric, instant micro-message stream. It's asynchronous because messages are simply pushed out (like email). It's asymmetric because there is no way to guarantee or confirm receipt (like IRC). But it's the instant streaming aspect that is key. That's what makes the experience unique.

My "fix" is based on the following observation: Twitter usage forms naturally into cliques. My wife tried out Twitter and found it boring. She didn't find a tribe that she connected with. I, on the other hand, love it because I can talk trash about Bikes, Business Intelligence and Data Warehousing all day long. What could be better?

Here's the architecture:
  • Load all of the data into a huge data warehouse (MPP of course!).
  • Cluster users into their natural cliques using data mining algorithms.
  • The cliques I follow might be:
    • BI-DW (~2,000)
    • UK Mountain Biking (~1,000)
    • Web 2.0 (~5,000)
    • Twitter Celebs (~1,000)
    • Of course cliques wouldn't really have names…
  • The backend database only contains users info, not tweets.
    • Following, Followers, Clique memberships, Bloom filter of following, etc.
  • Tweets are stored in "clique streams": all tweets for a clique in reverse order.
    • New tweets are added to the top/front of the stream.
    • Tweets can exist in multiple streams as required.
    • Streams have a maximum message age.
  • To provide an update the system only has to filter a small number of streams.
    • This has got to be a 1000x reduction. (60m users to 60k possibles)
  • The system stores a bloom filter of people a user follows as the first filter for streams.
    • Probably another 10x reduction, removes bulk of non-following clique messages.
  • The detailed filter should now be running over a very small dataset.
  • Final step is to combine the filtered streams and remove any duplicates.
  • It should go without saying that all tweets are added to the data warehouse in real time. ;-)
  • This also answers the question of how Twitter can make money: sell access to the data in that killer data warehouse.
{I have refrained from naming any specific technologies or products in the post because that's not really what it's about. Very restrained of me, don't you think?

I also haven't talked about DMs, mentions, etc. because I think that they can easily fit in this architecture and this post doesn't need to be any longer.}
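
To make the read path concrete, here's a toy Python sketch of the filtering steps in the list above: a Bloom filter of the accounts a user follows acts as the cheap first pass over each clique stream, an exact check catches the Bloom filter's false positives, and duplicates are removed because a tweet can sit in several streams. All the data structures are invented for illustration and would obviously not live in a single Python process at Twitter scale.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: cheap membership test with false positives but never false negatives."""
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                                     # one big int used as a bit array

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))


def build_timeline(user, cliques, streams, follows):
    """Assemble a user's timeline from the clique streams they belong to."""
    bloom = BloomFilter()
    for handle in follows[user]:
        bloom.add(handle)

    seen, timeline = set(), []
    for clique in cliques[user]:
        for tweet_id, author, text in streams[clique]:    # each stream is newest-first
            if not bloom.might_contain(author):           # coarse filter: definitely not followed
                continue
            if author not in follows[user]:               # exact filter catches false positives
                continue
            if tweet_id in seen:                          # a tweet can live in several streams
                continue
            seen.add(tweet_id)
            timeline.append((tweet_id, author, text))
    return sorted(timeline, reverse=True)                 # reverse-chronological, assuming ids increase over time


# Toy data: one clique, one followed account.
streams = {"bi-dw": [(3, "curt", "new post up"), (1, "joe", "MPP or bust")]}
cliques = {"me": ["bi-dw"]}
follows = {"me": {"joe"}}
print(build_timeline("me", cliques, streams, follows))    # -> [(1, 'joe', 'MPP or bust')]
```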

UPDATE 2: This approach also makes it a lot easier to spot spam accounts. Someone may *actually* want to follow 4,000 people but they will only be in a few cliques. A spam account would be following too many different cliques.


Wednesday 29 April 2009

Let the macro-blogging begin...

I'm setting this blog up as a place to put thoughts that don't fit into Twitter's 140 character limit.

I've made a couple of abortive blog starts in the past so… no promises! I'll also be putting up some essays that I've written in the past, probably reworked to save embarrassment.
