
Friday, 8 April 2011

Data Management - Data is Data is Data is…

[Sometimes I want to write about one topic but I end up writing 1,500 words of background before I even touch on the subject at hand. Sometimes the background turns out to be more interesting; hopefully this is one of those times.]

In this post I talk about the problems with mainstream data management, especially SQL databases. I then touch on the advantages of SQL databases and the good attributes we need to retain.

Data is Data is Data is…
Current IT practice splits data management into lots of niches: SQL databases, email platforms, network file systems, enterprise search, etc. There is plenty of overlap between niches and, in truth, the separations are artificial. They merely reflect the way systems are implemented, not fundamental differences in the data. Have a look at your email client: those headers in the messages list (From, Subject, etc.) are just database field names, and the message body is simply a BLOB field. Some email clients, e.g., Gmail, can also parse that blob and find links to previous messages, which is very much like a foreign key.
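
To make that concrete, here's a rough sketch using Python's standard email module (the message itself is made up): the headers behave exactly like columns and the body is the BLOB.

```python
# A toy illustration: an email message is just a structured record.
from email import message_from_string

raw = """From: jane@example.com
Subject: Quarterly numbers
Message-ID: <abc123@example.com>
In-Reply-To: <xyz789@example.com>

Please see the attached spreadsheet.
"""

msg = message_from_string(raw)
record = {
    "from": msg["From"],                 # a field, just like a database column
    "subject": msg["Subject"],           # another field
    "in_reply_to": msg["In-Reply-To"],   # behaves like a foreign key to an earlier message
    "body": msg.get_payload(),           # the BLOB
}
print(record)
```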

File systems seem less like a database at first glance, but let's consider the big file system developments of the last 10 years: ZFS and BTRFS. Both introduce database-like ideas to the file system, such as copy-on-write (a la MVCC), deduplication (a la normalisation), data integrity guarantees (a la ACID) and enhanced file metadata (a la SQL DDL).
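
Deduplication, for instance, is essentially content hashing. Here's a grossly simplified sketch of the idea in Python (block size and file contents are invented, and real filesystems do far more):

```python
# A toy sketch of block-level deduplication via content hashing,
# grossly simplified from what ZFS/BTRFS actually do.
import hashlib

block_store = {}   # content hash -> block bytes (each unique block stored once)
file_table = {}    # file name -> ordered list of block hashes

def write_file(name, data, block_size=4):
    hashes = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)   # identical blocks are not stored twice
        hashes.append(digest)
    file_table[name] = hashes

write_file("a.txt", b"AAAABBBBCCCC")
write_file("b.txt", b"AAAABBBBDDDD")
print(len(block_store))   # 4 unique blocks stored, not 6
```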

The basic point I'm making is that data is data. Simple as that. It may be more or less 'structured' but structure and meaning are essentially equivalent. The most 'unstructured' file I can imagine is just plain text but the written word is still very structured. At a high level it has a lot of metadata (name, created, changed, size, etc.), it has structure embedded in the text itself (language, punctuation, words used, etc.) and, looking deeper, we can analyse the semantic content of the text using techniques like NLP.

Data is data; it needs to be stored, changed, versioned, retrieved, backed up, restored, searched, indexed, etc. The methods may vary but it's all just data.


The SQL Database Black Box
Not all data can be kept in databases because, amongst other things, SQL databases are opaque to other applications. Enterprise search illustrates the issue. Most enterprise search apps can look into JDBC/ODBC-accessible databases, profile the data and include its content in search results. However, access to any given database is typically highly restricted, and there is a DBA whose job hangs on keeping that data safe and secure. The DBA must be convinced that the search system will not compromise the security of their data, and this typically means limiting search access to the people who also have database access. This is a time-consuming process and we have to repeat it for every database in the company.

So a year later, when we have access to all SQL databases and a process to mirror access credentials, the next problem is that SQL provides no mechanism to trace data history. For example, I search for 'John Doe' and find a result from the CRM database. I look in the database and the record now has a name of 'Jane Doe'. Why did it change? When did it change? Who changed it? There is no baseline answer to these questions. The CRM application may record some of this information, but how much? The database has internal mechanisms that trace some of this, but each product has its own scheme and, worse, the tables are often not user-accessible for security reasons.
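
The standard workaround is to build the history yourself. Here's a minimal sketch of an application-maintained audit table, using Python's built-in sqlite3 (table and column names are invented):

```python
# A minimal sketch of an application-maintained history table; plain SQL
# gives us nothing like this for free. Names here are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer_history (
    customer_id INTEGER, old_name TEXT, new_name TEXT,
    changed_by TEXT, changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
""")
con.execute("INSERT INTO customer (id, name) VALUES (1, 'John Doe')")

def rename_customer(con, cust_id, new_name, user):
    old = con.execute("SELECT name FROM customer WHERE id = ?", (cust_id,)).fetchone()[0]
    con.execute("UPDATE customer SET name = ? WHERE id = ?", (new_name, cust_id))
    # The application has to remember to write the audit row itself.
    con.execute(
        "INSERT INTO customer_history (customer_id, old_name, new_name, changed_by) "
        "VALUES (?, ?, ?, ?)",
        (cust_id, old, new_name, user),
    )

rename_customer(con, 1, "Jane Doe", "crm_app")
print(con.execute("SELECT * FROM customer_history").fetchall())
```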

In my experience, 80% of the value actually gained from a data warehouse comes from resolving this issue in a single place and in a consistent way. Hence the growth of the MDM industry, but I won't digress on that. The data warehouse doesn't actually solve the problem, it merely limits the number of SQL databases that must be queried to 1. And, of course, we never manage to get everything in the DW.

There are many other black box attributes of SQL databases: two very similar queries may perform in drastically different ways; background tasks can make the database extremely slow without warning; the database disk format cannot be accessed by other applications; the database may bypass the filesystem, making us entirely reliant on the database to detect disk errors; and so on.


The SQL Database Choke Point
Current SQL databases are also a very real constraint on day-to-day operation. For example, a large company may only be able to process bulk updates against a few percent of the customer base each night. SQL databases must be highly tuned for a single type of access query, and that tuning usually makes other access styles unworkable.

Further, the schema of a production SQL database is effectively set in stone. Although SQL provides ALTER statements, the performance and risk of using ALTER are so bad that it's effectively never used. Instead we either add a new small table and use a join when we need the additional data, or we create a new table and export the existing data into it. Both of these operations impose significant overheads when all we really want is a new field. So, in practice, production SQL databases satisfy a single type of access, are very resistant to other access patterns and are very difficult to change.
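
For illustration, here's the 'small side table plus join' workaround sketched with sqlite3 (all names invented); all we really wanted was one new column:

```python
# A sketch of the 'add a small side table and join' workaround that avoids
# running ALTER TABLE against a large production table. Names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

-- All we really wanted was a loyalty_tier column on customer, but instead
-- we bolt on a one-to-one extension table...
CREATE TABLE customer_loyalty (customer_id INTEGER PRIMARY KEY, loyalty_tier TEXT);

-- ...and every query that needs the new field now pays for a join.
CREATE VIEW customer_ext AS
SELECT c.id, c.name, l.loyalty_tier
FROM customer c
LEFT JOIN customer_loyalty l ON l.customer_id = c.id;
""")
con.execute("INSERT INTO customer VALUES (1, 'Jane Doe')")
con.execute("INSERT INTO customer_loyalty VALUES (1, 'gold')")
print(con.execute("SELECT * FROM customer_ext").fetchall())
```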

These issues are well recognised and the answer has come back that we need specialist SQL databases for each use case. Michael Stonebraker, in particular, has been beating a drum about this for at least 5 years (and, credit where it's due, Vertica paid off in spades). However, we haven't seen a huge uptake in specialist databases for markets other than analytics. In particular the mainstream OLTP market has very few specialist offerings. Perhaps it's a more difficult problem or perhaps the structure of SQL itself is less amenable to secondary innovation around OLTP. I sense a growing recognition that improvements in the OLTP space require significant re-engineering of existing applications.

Specialist databases have succeeded to some extent in the data warehouse and business intelligence sphere. I think this exception proves the observation. Fifteen years ago I would have added another complaint to my list of black box attributes: it was impossible to get reports and analysis from my production systems. The data warehouse was invented and gained popular acceptance simply because this was such a serious problem. The great thing about selling analytic databases for the last 15 years was that you weren't displacing a production system. Businesses don't immediately start losing money if the DW goes down. The same cannot be said of most other uses for SQL databases and that's why they will only be replaced slowly and only when there is a compelling reason (mainframes are still around, right?).


There's a baby in this bathwater!
It's worth remembering that SQL databases offer a lot of advantages. Codd outlined 12 rules that relational databases should follow. I won't list them all here but, at a high level, a relational database maintains the absolute integrity of the data it stores and allows us to place constraints on that data, such as the type and length of the data or its relation to other data. We take it for granted now, but this was a real breakthrough and it took years to implement in practice.

Just for kicks imagine a CRM system based on Word docs. When you want to update a customer's information you open their file and make whatever changes you want and then save it. The system only checks that the doc exists, you can change whatever you want and the system won't care. If you want the system to make sure you only change the right things you'll have to build that function yourself. That's more or less what data management was like before SQL databases.


What to keep & what to throw away
So what would our ideal data management platform look like? It persists data in a format that can be freely parsed by other applications, i.e., plain text (XML? JSON? Protocol Buffers?). It maintains data integrity at an atomic level, probably by storing checksums alongside each item. It lets us define stored data as strictly or loosely as we want, but enforces the definitions we set. All changes to our stored data actually create new versions, and the system keeps a linked history of changes.
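
As a toy sketch of those properties (hypothetical from top to bottom): plain-text JSON storage, a checksum stored alongside every item, and changes that append new versions rather than overwriting.

```python
# A toy sketch of the ideal platform's properties: plain-text storage (JSON),
# a checksum stored alongside each item, and append-only version history.
import hashlib, json, datetime

store = {}  # key -> list of versions (newest last)

def put(key, value, user):
    payload = json.dumps(value, sort_keys=True)
    version = {
        "data": payload,
        "checksum": hashlib.sha256(payload.encode()).hexdigest(),
        "changed_by": user,
        "changed_at": datetime.datetime.utcnow().isoformat(),
        "previous": store[key][-1]["checksum"] if key in store else None,
    }
    store.setdefault(key, []).append(version)

def get(key):
    current = store[key][-1]
    # Integrity check at the atomic level: recompute and compare the checksum.
    assert hashlib.sha256(current["data"].encode()).hexdigest() == current["checksum"]
    return json.loads(current["data"])

put("customer/1", {"name": "John Doe"}, "crm_app")
put("customer/1", {"name": "Jane Doe"}, "crm_app")
print(get("customer/1"), len(store["customer/1"]), "versions")
```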

I think we're starting to see systems emerge that address some of the issues above. It's still early days but I'm excited about projects like Ceph and the very new Acunu.


In my next post I'll look at how the new breed of NoSQL databases display some of the traits we need for our ideal data management platform.

Wednesday, 6 April 2011

Unsolicited advice for MarkLogic - Pivot!

[This is actually a really long comment on Curt Monash's post but I think it's worth cross posting here.] 


Seeing as I have been doing a lot of thinking about database opportunities lately I'll wade in on MarkLogic's as well. I can't really comment about the specific verticals that MarkLogic sells into or should sell into. However, I see 2 fundamental problems with MarkLogic's product positioning.

First problem: they backed the wrong horse by focusing exclusively on XML and XQuery. This has been toned down a lot, but the die is cast. People who know about MarkLogic (not many) know of it as a 'really expensive XML database that you need if you have lots of XML and eXist-db is too slow for you'. They've put themselves into a niche within a niche, kind of like a talkative version of Ab Initio.

This problem is obvious if you compare them to the 'document oriented' NoSQLs such as CouchDB and MongoDB. Admittedly they were created long after MarkLogic but the NoSQLs offer far greater flexibility, talk about XML only as a problem to be dealt with and use a storage format that the market finds more appealing (JSON).

Second problem: 'enterprise class' pricing is past its sell-by date. What does MarkLogic actually cost? You won't find any pricing on the website. I presume the answer is that old standby, 'whatever you're looking to spend'. Again, the contrast with the new NoSQLs couldn't be more stark - they're all either pure open source or open core, i.e., free to start.

MarkLogic was essentially an accumulator bet: 1st bet - XML will flood the enterprise, 2nd bet - organisations will want to persist XML as XML, 3rd bet - an early, high quality XML product will move into an Oracle-like position.

The first bet was a win; XML certainly has flooded the enterprise. The second bet was a loss; XML has become almost a wire protocol rather than a persistence format. Rightly or not, very few organisations choose to persist significant volumes of data in XML. And the third bet was a loss as well; the huge growth of open source and the open core model makes it extremely unlikely that we'll see another Oracle in the data persistence market.

The new MarkLogic CEO needs to acknowledge that the founding premise of the company has failed and they must pivot the product to find a much larger addressable market. Their underlying technology is probably very good and could certainly be put to use in other ways (Curt gives some examples). I would be tempted to split the company in 2; leaving a small company to continue selling and supporting MarkLogic at maximum margins (making them an acquisition target) and a new company to build a new product in start-up mode on the foundations of the existing tech.

Friday, 25 March 2011

Analytic Database Market Segmentation

In my first post in this series I gave an overview of ParStream and their product.


In this post I will briefly talk about how vendors are positioned and introduce a simple market segmentation.

Please note that this is not exhaustive; I've left off numerous vendors with perfectly respectable offerings that I didn't feel I could reasonably place.

The chart above gives you my view of the current analytic database market and how the various vendors are positioned. The X axis is a log scale going from small data sizes (<100GB) to very large data sizes (~500TB). I have removed the scale because it is based purely on my own impressions of common customer data sizes for each vendor, drawn from published case studies and anecdotal information.

The Y axis is a log scale of the number of employees that a vendor's customers have. Employee size is highly variable for a given vendor but, nonetheless, each vendor seems to find a natural home in businesses of a certain size. Finally, the size of each vendor's bubble represents the approximate $ cost per TB for their products (paid versions in the case of 'open core' vendors). Pricing information is notoriously difficult to come across, so again this is very subjective, but I have first-hand experience with a number of these so it's not a stab in the dark.



Market Segments
SMP Defenders: Established vendors with large bases operating on SMP platforms
Teradata+Aster: The top of the tree. Big companies, big data, big money.
MPP Upstarts: Appliance – maybe, Columnar – maybe, Parallelism – always.
Open Source Upstarts: Columnar databases, smaller businesses, free to start.
Hadoop+Hive: The standard bearer for MapReduce. Big Data, small staff.

SMP > MPP Inflection
Still with me? Good, let's look at the notable segments of the market. First, there is a clear inflection point between the big single-server (SMP) databases and the multi-server parallelised (MPP) databases. This point moves forward a little every year, but not enough to keep up with the rising tide of data. For many years Teradata owned the MPP approach and charged a handsome rent. In the previous decade a bevy of new competitors jumped into the space with lower pricing, and now the SMP old guard is getting into MPP offerings, e.g., Oracle Exadata and Microsoft SQL Server PDW.

Teradata's Diminished Monopoly
Teradata have not lost their grip on the high end, however. They maintain a near monopoly on data warehouse implementations in the very largest companies with the largest volumes of 'traditional' DW data (customers & transactions). Even Netezza has failed to make a large dent in Teradata's customer base. Perhaps there are instances of Teradata being displaced by Netezza; however, I have never actually heard of one. There are 2 vendors who have a publicised history of being 'co-deployed' with Teradata: Greenplum and Aster Data. Greenplum's performance reputation is mixed and it was acquired last year by EMC. Aster's performance reputation is solid at very large scales and their SQL/MapReduce offering has earned them a lot of attention. It's no surprise that Teradata decided to acquire them.

The DBA Inflection Point
The other inflection point in this market happens when the database becomes complex enough to need full-time babysitting, i.e., a Database Administrator. This gets a lot less attention than SMP>MPP because it's very difficult to prove. Nevertheless, word gets around fairly quickly about the effort required to keep a given product humming along. It's no surprise that vendors of notoriously fiddly products sell them primarily to large enterprises, where the cost of employing such expensive specimens as a collection of DBAs is not an issue.

Small, Simple & Open(ish)
Smaller businesses, if they really need an analytic DB, choose products that have a reputation for being usable by technically inclined end users without a DBA. Recent columnar database vendors fall into this end of the spectrum, especially those that target the MySQL installed base. It's not that a DBA is completely unnecessary, simply that you can take a project a long way without one.

MapReduce: Reduced to Hadoop
Finally we have those customers in smaller businesses (or perhaps government or universities) who need to analyse truly vast quantities of data with the minimum amount of resource. In the past it was literally impossible for them to do this; they were forced to rely on gross simplifications. Now, though, we have the MapReduce concept of processing data in fairly simple steps, in parallel across numerous cheap machines. In many ways this is MPP minus the database, sacrificing the convenience of SQL and ACID reliability for pure scale. Hadoop has become the face of MapReduce and is effectively the SQL of MapReduce, creating a common API that alternative approaches can offer to minimise adoption barriers. 'Hadoop compatible' is the Big Data buzz phrase for 2011.
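
The canonical word-count example shows the shape of the model; sketched here in plain Python rather than as a real Hadoop job:

```python
# The canonical word-count example, sketched in plain Python to show
# the shape of the MapReduce model (not a real Hadoop job).
from collections import defaultdict

documents = ["big data is data", "data is everywhere"]

# Map: each input record independently emits (key, value) pairs,
# so this step can run on as many cheap machines as we like.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group; again trivially parallel across keys.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals)  # {'big': 1, 'data': 3, 'is': 2, 'everywhere': 1}
```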

Friday, 28 January 2011

HandlerSocket - More grist for the ORM mill

A plugin called HandlerSocket was released last year that allows InnoDB to be used directly, bypassing the MySQL parsing and optimising steps. The genius of HandlerSocket is that the data is still "in" MySQL so you can use the entire MySQL toolchain (monitoring, replication, etc.). You also have your data stored in a highly reliable database, as opposed to some of the horror stories I'm seeing about newer NoSQL products.

In the original blog post (here) it talks about 720,000 qps on an 8-core Xeon with 32GB RAM. Granted, this is all in-memory data we're talking about, but that is a hell of a figure. The author also claims it outperforms Memcached.
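
Roughly, the trick is skipping the parse and optimise stages and going straight to an index lookup. A sketch of what client code looks like, using a purely hypothetical Python wrapper (real client libraries differ in detail):

```python
# A sketch of the HandlerSocket idea using a purely hypothetical client class;
# the point is the shape of the API: no SQL text, no parser, no optimiser,
# just open an index and read rows straight out of InnoDB.

class HandlerSocketClient:                      # hypothetical, for illustration only
    def __init__(self, host, port):
        ...                                     # connect to the HandlerSocket plugin port

    def open_index(self, db, table, index, columns):
        ...                                     # returns a handle bound to one index + column list

    def find(self, handle, op, keys, limit=1):
        ...                                     # direct index lookup, returns matching rows

# Equivalent of: SELECT name, email FROM user WHERE user_id = 42
# but without the SQL layer in the way.
hs = HandlerSocketClient("127.0.0.1", 9998)     # the plugin's default read-only port
handle = hs.open_index("crm", "user", "PRIMARY", ["name", "email"])
rows = hs.find(handle, "=", [42])
```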

Next, Percona added HandlerSocket to their InnoDB fork back in December (here), so if you're looking for someone to talk to they may be the best people.

Finally, Ilya Grigorik (way-smart guy from PostRank) blogged about it a couple of weeks ago (here) and there's a fairly interesting discussion in the comments comparing this to prepared statements in Oracle.

All of this reinforces my opinion that new generation ORMs are the technology that will finally allow the RDBMS apple cart to tip all the way over. Products like Redis, Riak, CouchDB, etc. are not enough on their own.

The *really* interesting thing about HandlerSocket is that it shows open source databases are perfect fodder for the next wave.

Wednesday, 19 January 2011

Analytic Database Market 'Fly Over'

This is a follow up to my previous post where I laid out my initial thoughts about ParStream. This is a very high level 'fly over' view of the analytic database market. I'll follow this up with some thoughts about how ParStream can position themselves in this market.


Powerhouse Vendors
The power players in the Analytic Database market are: Oracle (particularly Exadata), IBM (mostly Netezza, also DB2), and Teradata. Each of these vendors employs a large, very well-funded and sophisticated sales force. A new vendor competing against them in accounts will find it very, very hard to win deals. They can easily put more people to work on a bid than a company like ParStream *employs*. If you are tendering for business in a Global 5000 corporation then you should expect to encounter them, and you need a strategy for countering their access to the executive boards of these companies (which you will not get). In terms of technology their offerings have become very similar in recent years, with all 3 emphasising MPP appliances of one kind or another; however, most of the installed base is still using their traditional SMP offerings (Netezza and Teradata excepted).


New MPP niche players
There are a number of recent entrants to the market who also offer MPP technology, particularly Greenplum, Aster Data and ParAccel. All 3 offer software-only MPP databases, although Greenplum's emphasis has shifted slightly since being acquired by EMC. These vendors seem to focus mostly on (or succeed with) customers who have very large data volumes but are small companies in terms of employees. Many of these customers are in the web space. These vendors also have strong stories about supporting MapReduce/Hadoop inside their databases, which also plays to the leanings of web customers. According to testimonials on the vendors' websites, customers seem to choose them because they are very fast and software-only.


Microsoft
Microsoft is a unique case. They do not employ a direct sales force (as far as I know); however, they have steadily become a major force in enterprise software. Almost all companies run Windows desktops, have at least a few Windows servers and at least a few instances of SQL Server in production. Therefore Microsoft will be considered in virtually every selection process you're involved in. Microsoft have been steadily adding BI-DW features to the SQL Server product line and generally those features are all "free" with a SQL Server license. This doesn't necessarily make SQL Server cheaper but it does make it feel like very good value. Recent improvements include the Parallel Data Warehouse appliance (with HP hardware), columnar indexing for the next release and PowerPivot for local analysis of large data volumes.


Proprietary columnar
Columnar databases have been the hot technology in analytic databases for the last few years. The biggest vendors are Sybase with their very mature IQ product, SAND with an equally mature product and Vertica with their newer (and reportedly much faster) product. These databases can be used in single server (SMP / scale-up) and MPP (multi-server / scale-out) configurations. They appear to be most popular with customers who appreciate the high levels of compression that these databases offer and already have relatively mature star-schema / Kimball style data warehouses in place.  In my experience Sybase and SAND are used most in companies where they were introduced by an OEM as part of another product. Vertica is so new that it's not clear who their 'natural' customers are yet.


Open Source columnar
In the open source world there are 2 MySQL storage engines and a standalone product offering columnar databases. The MySQL engine Infobright was the first open source columnar database. It features very high compression and very fast loading; however, it is not suited to lots of joins and may be better thought of as an OLAP tool managed via SQL. The InfiniDB MySQL engine, on the other hand, is very good at joins and at squeezing all the available performance out of a server; however, it currently has no compression. Finally there is LucidDB, a Java-based standalone product with performance characteristics somewhere between the other two. LucidDB features excellent compression, index support and generally good performance, but can be slow to load.


Vectorised columnar
There is only one player here: VectorWise. VectorWise is a columnar database (AFAIK) that has been architected from top to bottom to take advantage of the vector pipelines built into all recent CPUs. Vectorisation is a way of running many highly parallel operations through a single CPU; it removes much of the waiting and memory shifting that slows a CPU down. Initial testers have had nothing but good things to say about VectorWise's performance. There is also talk of an open source release, so they are covering a lot of bases. They also have the advantage of being part of Ingres, who may not be the force they once were but have a significant installed base and are well placed to sell VectorWise. They are the biggest direct competitor to ParStream that I can see right now.
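
As a loose analogy (NumPy standing in for the CPU's SIMD units here, not VectorWise internals), compare row-at-a-time execution with a single vectorised operation over whole columns:

```python
# A loose analogy for vectorisation: the same aggregate computed row-at-a-time
# versus as one operation over whole columns. NumPy stands in for the CPU's
# vector pipeline; this is not how VectorWise is actually implemented.
import numpy as np

prices = np.random.rand(1_000_000)
quantities = np.random.randint(1, 10, size=1_000_000)

def total_scalar(prices, quantities):
    # Classic tuple-at-a-time execution: per-row overhead dominates.
    total = 0.0
    for p, q in zip(prices, quantities):
        total += p * q
    return total

def total_vectorised(prices, quantities):
    # One column-wide operation keeps the vector units and caches busy.
    return float(np.dot(prices, quantities))

print(total_scalar(prices, quantities), total_vectorised(prices, quantities))
```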


Open Source MapReduce/NoSQL
ParStream will also compete with a new breed of open source MapReduce/NoSQL products, most notably Hadoop (and its variants). These products are not databases per se, but they have gained a lot of mindshare among developers who need to work with large data volumes. Part of their attraction is their 'cloud friendliness'. They are perfect for the cloud because they have been designed to run on many small servers and to expect that a single server could fail at any time. There is a trade-off to be made, and MapReduce products tend to be much more complex to query; however, for a technically savvy audience the trade is well worth it.


Next time I'll talk about where I think ParStream need to place themselves to maximise their opportunity.

UPDATE: Actually, in the next post I talk about how analytic database vendors are positioned and introduce a simple market segmentation. A further post about market opportunities will follow.

My take on why businesses have problems with ETL tools

Check out this very nice piece by Rick about the reasons why companies have failed to get the most out of their ETL tools.

My take is from the other side of the fence. As a business user I'm often frustrated by ETL tools and have been known to campaign against them, for the following reasons:

> ETL tools have been too focused on Extract-Transform-Load and too little focused on actual data integration. I have complex integration challenges that are not necessarily a good fit for the ETL strategy, and sometimes I feel like I'm pushing a square peg into a round hole.

> It's still very challenging to create reusable logic inside ETL tools, and this really should be the easiest thing in the world (ever heard the mantra 'Don't Repeat Yourself'?). Often the hoops that have to be jumped through are more trouble than they're worth.

> Some ETL tools are a hodgepodge of technologies and approaches, with different data types and different syntaxes wherever you look. (SSIS, I'm looking at you! This still isn't being addressed in Denali.)

> ETL tools are too focused on their own execution engines and fail miserably to take advantage of the processing power of columnar and MPP databases by pushing work down to run in the database. This is understandable in open source tools (database-specific SQL may be a bridge too far) but in commercial tools it's pathetic.

> Finally, where is the ETL equivalent of SQL? Why are we stuck with incompatible formats for each tool? The design graphs in each tool look very similar and the data they capture is near identical. Even the open source projects have failed to adopt a common format. Very poor show. This is the single biggest obstacle to more widespread ETL. Right now it's much easier for other parts of the stack to stick with SQL and pretend that ETL doesn't exist. (A sketch of what a portable format might look like follows below.)
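
Purely as a thought experiment, a portable, tool-neutral job definition might look something like this (every name and field here is invented):

```python
# Purely hypothetical: a tool-neutral, declarative ETL job definition.
# Nothing like this standard exists today -- which is exactly the complaint.
import json

job = {
    "name": "load_customer_dim",
    "source": {"type": "table", "connection": "crm", "object": "customer"},
    "transforms": [
        {"op": "filter", "predicate": "is_active = 1"},
        {"op": "derive", "column": "full_name", "expr": "first_name || ' ' || last_name"},
        {"op": "lookup", "join_to": "country_dim", "on": "country_code"},
    ],
    "target": {"type": "table", "connection": "dw", "object": "customer_dim", "mode": "merge"},
}

# Any engine -- a commercial tool, an open source tool, or a generator that
# pushes the work down to an MPP database as SQL -- could execute the same spec.
print(json.dumps(job, indent=2))
```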

Thursday, 30 December 2010

2010 Review: a BI-DW Top 5

This post is written completely 'off the cuff' without any fact checking or referring back to sources. Just sayin'…

Top 5 from 2010

5) Big BI consolidation is finished
  There were no significant acquisitions of "Big BI" vendors in 2010. Since Cognos went to IBM and BO went to SAP, the last remaining member of the old guard is MicroStrategy. (It's interesting to consider why they have not been acquired but that's for another post.) In many ways the very definition of Big BI has shifted to encompass smaller players. Analysts, in particular, need things to talk about and they have effectively elevated a few companies to Big BI status that were previously somewhat ignored, e.g., SAS (as a BI provider), InformationBuilders, Pentaho, Actuate, etc. All of the major conglomerates now have a 'serious' BI element in their offerings and so I don't see further big spending on BI acquisitions in 2011. The only dark horse in this race seems to be HP and it's very unclear what their intentions are, particularly with the rumours of Neoview being cancelled; if HP were to move I see them going for either a few niche players or someone like InformationBuilders with solid software but lacking in name recognition.

4) Analytic database consolidation began
  We've seen an explosion of specialist Analytic databases over the last ~5 years and 2010 saw the start of a consolidation phase amongst these players. The first big acquisition of 2010 was Sybase by SAP; everyone assumed Sybase's IQ product (the original columnar database) was the target but the talk since then has been largely about the Sybase mobile offerings. I suspect both products are of interest to SAP; IQ allows them to move some of their ageing product lines forward and Mobile will be an enabler for taking both SAP and Business Objects to smartphones going forward.
  The banner acquisition was Netezza by IBM. I've long been very critical/sceptical of IBM's claims in the Data Warehouse / Analytic space, particularly as I've worked with a number of DWs that were taken off DB2 (onto Teradata) but have never come across one actively running on DB2. I'm a big Netezza fan, so my hope is that they survive the integration and are able to leverage the resources of IBM going forward.
  We also saw Teradata acquiring the dry husk of Kickfire's ill-fated MySQL 'DW appliance'. Kickfire's fundamental technology appeared to be quite good but sadly their market strategy was quite bad. I think this is a good sign from Teradata: they are open to external ideas and they see where the market is going. The competition with Netezza seems to have revitalised them and given them a new enemy to focus on. A new version of the Teradata database that incorporated some columnar features (and a 'free' performance boost) could be just the ticket to get their very conservative customers migrated onto the latest version.

3) BI vendors started thinking about mobile
  Mobile BI became a 'front of mind' issue in 2010. MicroStrategy has marketed aggressively in this space but other vendors are in the hunt and have more or less complete mobile offerings. Business Objects also made some big noise about mobile but everything seemed to be demos and prototypes. Cognos has had a 'mobile' offering for some time but they remained strangely quiet, my impression is that their mobile offerings are not designed for the iOS/Android touchscreen world.
  Niche vendors have been somewhat quiet on the mobile front, possibly waiting to see how it plays out before investing, with the notable exception of Qlikview, who have embraced it with both arms. This is a great strategic move for Qlikview (who IMHO prove the koan that 'strategy trumps product') because newer mobile platforms are being embraced by their mid-market customers far faster than at the Global 5000 companies that the Big BI vendors focus on. Other niche and mid-market vendors should take note of this move and get something (anything!) ready as quickly as possible.

2) Hadoop became the one true MapReduce
  I remain somewhat nonplussed by MapReduce personally; however, a lot of attention has been lavished on it over the last 2 years and during the course of 2010 the industry settled on Hadoop as the MapReduce of choice. From Daniel Abadi's HadoopDB project to Pentaho's extensive Hadoop integration to Aster's "seamless connectivity" with Hadoop to ParAccel's announcement of the same thing coming soon, and on and on. The basic story of MapReduce was very sexy but in practice the details turned out to be "a bit more complicated" (as Ben Goldacre [read his book!] would say). It's not clear that Hadoop is the best possible MR implementation but it looks likely to become the SQL of MapReduce. Expect other MapReduce implementations to start talking about Hadoop compatibility ad nauseam.
  All of this casts Cloudera in an interesting light. They are after all "the Hadoop company" according to themselves. It's far too early for a 'good' acquisition in this space however money talks and I wonder if we might see something happen in 2011.

1) The Cloud got real and we all got sick of hearing about it
  I'm not sure whether 2010 was truly the "year of the Cloud" but it certainly was the peak of its hype cycle. In 2010 the reality of cloud pricing hit home; the short version is that a lot of the fundamental cost of cloud computing is operational and we shouldn't expect to see continuous price/performance gains like we have seen in the hardware world. Savvy observers have noted that the bulk of enterprise IT spending has been non-hardware for a long time, but the existence of cloud offerings brings those costs into focus.
  Ultimately, my hope for the Cloud is that it will drive companies toward buying results, e.g., SaaS services that require little-to-no customisation, and away from buying potential, e.g., faster hardware and COTS software that is rarely fit for purpose. The cycle should go something like: "This Cloud stuff seems expensive, how much does it cost us to do the same thing?" > "OMG are you frickin' serious, we really spend that?!" > "Is there anyone out there that can provide the exact same thing for a monthly fee?". Honestly, big companies are incredibly bad at hardware and even worse at software. The Cloud (as provided by Amazon, et al) is IMHO just a half step towards the endpoint, which is the use of SaaS offerings for everything.

Thursday, 9 December 2010

Comment regarding Infobright's performance problems

UPDATE: This is a classic case of the comments being better than the post; make sure you read them! In summary, Jeff explained better and a lightbulb went off for me: Infobright is for OLAP in the classical sense with the huge advantage of being managed with a SQL interface. Cool.

I made a comment over on Tom Barber's blog post about a Columnar DB benchmarking exercise: http://pentahomusings.blogspot.com/2010/12/my-very-dodgy-col-store-database.html


Jeff Kibler said...
Tom –

Thanks for diving in! As indicated in your results, I believe your tests cater well to databases designed for star-schemas and full table-scan queries. Because a few of the benchmarked databases are engineered specifically for table scans, I would anticipate their lower query execution time. However, in analytics, companies overwhelmingly use aggregates, especially in ad-hoc fashion. Plus, they often go much higher than 90 gigs.

That said, Infobright caters to the full fledged analytic. As needed by the standard ad-hoc analytic query, Infobright uses software intelligence to drastically reduce the required query I/O. With denormalization and a larger data set, Infobright will show its dominance.

Cheers,

Jeff
Infobright Community Manager
8 December 2010 17:04


Joe Harris said...
Tom,

Awesome work, this is the first benchmark I've seen for VectorWise and it does look very good. Although, I'm actually surprised how close InfiniDB and LucidDB are, based on all the VW hype.

NFS on Dell Equilogic though? I always cringe when I see a database living on a SAN. So much potential for trouble (and really, really slow I/O).


Jeff,

I have to say that your comment is off base. I'm glad that Infobright has a community manager who's speaking for them but this comment is *not* helping.

First, your statement that "in analytics, companies overwhelmingly use aggregates" is plain wrong. We use aggregates as a fallback when absolutely necessary. Aggregates are a maintenance nightmare and introduce a huge "average of an average" issue that is difficult to work around. I'm sure I remember reading some Infobright PR about removing the need for aggregate tables.

Second, you guys have a very real performance problem with certain types of queries that should be straightforward. Just looking at it prima facie it seems that Infobright starts to struggle as soon as we introduce multiple joins and string or range predicates. The irony of the poor Infobright performance is that your compression is so good that the data could *almost* fit in RAM.

What I'd like to see from Infobright is: 1) a recognition of the issue as being real. 2) An explanation of why Infobright is not as fast in these circumstances. 3) An explanation of how to rewrite the queries to get better performance (if possible). 4) A statement about how Infobright is going to address the issues and when.

I like Infobright; I like MySQL; I'm an open source fan; I want you to succeed. The Star Schema Benchmark is not going away, Infobright needs to have a better response to it.

Joe
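
As an aside, the 'average of an average' issue I mention above is easy to demonstrate:

```python
# The 'average of an average' trap in miniature: two regions with very
# different order counts; averaging the regional averages gives the wrong answer.
orders_north = [100] * 9   # nine orders of 100
orders_south = [10]        # one order of 10

avg_north = sum(orders_north) / len(orders_north)               # 100.0
avg_south = sum(orders_south) / len(orders_south)               # 10.0

average_of_averages = (avg_north + avg_south) / 2               # 55.0
true_average = (sum(orders_north) + sum(orders_south)) / 10     # 91.0
print(average_of_averages, true_average)
```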
