Thursday 9 December 2010

Comment regarding Infobright's performance problems

UPDATE: This is a classic case of the comments being better than the post; make sure you read them! In summary, Jeff explained it better and a lightbulb went off for me: Infobright is for OLAP in the classical sense, with the huge advantage of being managed through a SQL interface. Cool.

I made a comment over on Tom Barber's blog post about a Columnar DB benchmarking exercise: http://pentahomusings.blogspot.com/2010/12/my-very-dodgy-col-store-database.html


Jeff Kibler said...
Tom –

Thanks for diving in! As indicated in your results, I believe your tests cater well to databases designed for star schemas and full-table-scan queries. Because a few of the benchmarked databases are engineered specifically for table scans, I would anticipate their lower query execution times. However, in analytics, companies overwhelmingly use aggregates, especially in an ad-hoc fashion. Plus, they often go much higher than 90 gigs.

That said, Infobright caters to full-fledged analytics. For the standard ad-hoc analytic query, Infobright uses software intelligence to drastically reduce the required query I/O. With denormalization and a larger data set, Infobright will show its dominance.

Cheers,

Jeff
Infobright Community Manager
8 December 2010 17:04


Joe Harris said...
Tom,

Awesome work; this is the first benchmark I've seen for VectorWise and it does look very good. Although I'm actually surprised how close InfiniDB and LucidDB are, given all the VW hype.

NFS on Dell EqualLogic, though? I always cringe when I see a database living on a SAN. So much potential for trouble (and really, really slow I/O).


Jeff,

I have to say that your comment is off base. I'm glad that Infobright has a community manager who's speaking for them but this comment is *not* helping.

First, your statement that "in analytics, companies overwhelmingly use aggregates" is plain wrong. We use aggregates as a fallback, only when absolutely necessary. Aggregates are a maintenance nightmare and they introduce a huge "average of an average" problem that is difficult to work around. I'm sure I remember reading some Infobright PR about removing the need for aggregate tables.
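
To illustrate the "average of an average" problem with a made-up example (the table and column names here are invented): averaging a pre-aggregated daily average weights every day equally, no matter how many orders each day actually had, so it generally disagrees with the true figure.

    -- True average: computed over the raw rows, each order counted once
    SELECT AVG(order_value)
    FROM   orders;

    -- Average of an average: each day weighted equally regardless of volume,
    -- so busy days are under-weighted and the result drifts from the truth
    SELECT AVG(daily_avg_order_value)
    FROM   daily_order_aggregates;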

Second, you guys have a very real performance problem with certain types of queries that should be straightforward. On its face, it seems that Infobright starts to struggle as soon as multiple joins and string or range predicates are introduced. The irony of the poor Infobright performance is that your compression is so good the data could *almost* fit in RAM.

What I'd like to see from Infobright is: 1) a recognition that the issue is real; 2) an explanation of why Infobright is not as fast in these circumstances; 3) an explanation of how to rewrite the queries to get better performance (if possible); and 4) a statement about how Infobright is going to address the issues, and when.

I like Infobright; I like MySQL; I'm an open source fan; I want you to succeed. But the Star Schema Benchmark is not going away, and Infobright needs a better response to it.

Joe

6 comments:

  1. Hey Joe,

    Firstly, let me take back the comment you threw back at me -- I completely agree, and I misspoke. My intention was to say that companies use their (raw) data to create aggregates (through BI tools, etc.). Upon re-reading my post, I admit the mistake in my reply.

    Secondly, I always agree that one should call a spade a spade. Infobright is truly designed to perform with denormalized data; joins and star schemas just aren't in our wheelhouse. The reason: when you join tables, you automatically lose most of the "high performance" improvements introduced by the knowledge grid. Instead of using joins, one should use nested selects; better yet, one should consider combining all the data into one table. With all that compression, you really have nothing to lose and everything to gain. In addition, Infobright's knowledge grid is there to reduce the amount of I/O a query requires, and full table scans by definition don't reduce I/O, so Infobright is not your solution if your workload is dominated by table scans.
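
    To sketch what I mean (all table and column names here are invented), here is the same question in its three shapes: the star-schema join, the nested-select rewrite, and the fully denormalized one-table form.

        -- 1. Star-schema join: loses most of the knowledge grid's pruning
        SELECT SUM(f.revenue)
        FROM   fact_sales f
        JOIN   dim_customer c ON c.customer_id = f.customer_id
        WHERE  c.region = 'EMEA';

        -- 2. Nested select instead of the join
        SELECT SUM(revenue)
        FROM   fact_sales
        WHERE  customer_id IN (SELECT customer_id
                               FROM   dim_customer
                               WHERE  region = 'EMEA');

        -- 3. Fully denormalized: the attribute lives on the one wide table
        SELECT SUM(revenue)
        FROM   sales_flat
        WHERE  region = 'EMEA';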

    I'm always open to recognizing and tackling a performance issue; we do receive forum posts and file bugs to help with those issues. Many times, though, the issue is with how the person has applied their chosen solution to the problem. That's why I'm here -- helping community members get Infobright rockin' on their systems.

    I'm a firm believer in open, candid discussion. If there are certain instances that fail your sniff test, send them my way; I'd love to tackle them with you. If you're looking for ways to improve query performance, take a look at our community forum (http://www.infobright.org/Forums); there are tips, tricks and other posts on this topic. In fact, if you paste your query and DDL in a forum post, more often than not we can suggest ways to improve execution times.

    Finally, most performance issues are directly tied to education. While ICE 3.5 and other upcoming versions will have improved performance, we will still provide the same insight and recommendations regarding queries and DDL as we do today. I'd love to find awesome, innovative ways to better inform our users about how to improve performance. If you have suggestions, I'm all ears.

    Cheers, and I look forward to your response.

    Jeff Kibler
    Infobright Community Manager
    Skype: jeff.at.infobright
    E-mail: jeff@infobright.com

  2. Thanks for the reply, Jeff.

    Your comment that "joins and star-schemas just aren't in our wheelhouse" really threw me for a loop. I've honestly never considered a completely denormalised (i.e. flat) schema in a SQL database. I guess I'm a lot more indoctrinated in 3NF and normalisation than I realised. This really changes my perspective on Infobright and what problems it's useful for.

    Thinking about a flat schema takes me back to my Excel days when I used to create Pivot Tables over very wide and sparse spreadsheets (and curse Microsoft for the Excel '95 row limit!). I can see how this is useful in quite a few cases, especially for data mart style deployments where people want to slice and dice against a fixed (or intermittently updated) data set.

    With this new perspective I'm thinking of Infobright as a very flexible OLAP platform that can be managed with standard SQL tools. Throughout years of creating Analysis Services cubes in the tortuous SSAS GUI, I've wished I could manage cubes using SQL. I'm now thinking of Infobright as an alternative to products like PowerPivot, SSAS and Essbase, rather than as a competitor to "general purpose" analytic databases such as VectorWise, Sybase IQ, etc. In this context Infobright's approach, features and limitations make a lot more sense.

    It also explains the SSB results. I don't expect to join across cubes the way I join tables in a database, but I do expect to ask separate questions and align the summarised results, which is presumably what Infobright's join mechanism is doing (and why some of the SSB queries are slow). In other words, if someone created a cube out of each SSB benchmark table and then showed me a benchmark that was very slow at joining them together, I'd say they were an idiot, not that the cube platform was crap.

    All of this raises questions about Infobright's marketing to date, though. I've invested quite a bit of time in Infobright: I've read many articles about it, downloaded numerous versions and evaluated it for a project I'm (still) working on. I never got the message about avoiding joins and using it in an OLAP style. Maybe that's because I'm entirely stupid (it's possible…); however, it's not just me making this mistake. If people really understood this they wouldn't even include Infobright in SSB benchmark round-ups (right?!). Instead they'd be comparing it to SSAS and Palo, and talking about how simple it is to load and manage an Infobright "cube" using SQL or PDI.

    So I guess you guys need to tighten up the message about what *exactly* Infobright is for. I understand that OLAP isn't a fashionable acronym these days, but you need to get the message across that Infobright is *purely* for analysis / slice-and-dice of denormalised data; it's not for EDWs, star schemas, ELT processing, etc.

    Future possibilities: I'm not a big fan of MDX, but it is the lingua franca of OLAP, and I think you need a way to make sure Infobright supports it properly. Perhaps you could create an Infobright-specific version of Mondrian, because AFAIK Mondrian expects to do a lot of joining and to use summary tables. I'm guessing it would mostly be a case of taking stuff out of Mondrian until all that's left is MDX<=>SQL translation.

    Thanks again for the response. I'm pretty excited about using Infobright again; I think I'll get a lot more out of it just from having a different set of expectations.


    Joe

  3. Interesting reading indeed. Just to expand on the Mondrian thing a bit: from my time implementing cubes, the SSB has always been reasonably close to the real-world case. As an integrator we were always told that having one large fact table was never good for performance, because Mondrian would have to do a SELECT DISTINCT across the fact table's dimension columns to ascertain what the dimension values were, as opposed to just retrieving the values from a dimension table.
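
    To make that concrete (hypothetical table names): resolving a dimension's members means either scanning the big table or reading a small one.

        -- One big flat table: Mondrian must scan the fact to find the members
        SELECT DISTINCT region
        FROM   sales_flat;

        -- Separate dimension table: the members come from a tiny lookup
        SELECT region
        FROM   dim_region;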

    Very interested to learn more though.

    Tom

  4. Actually, an email that just came through my inbox jogged my memory. To further my point above...

    The Mondrian developers are implementing the SSB benchmark and queries in Mondrian's test suite... so if it's that bad a test, why are they doing it? :)

  5. Tom,

    You're right: we create a set of Dimensions and Facts, each being a separate table, and then load the cube from there. As far as I understand it, traditional OLAP cubes (i.e. non-Mondrian), once loaded, are more like a collection of flat files at varying levels of aggregation that the engine picks from to provide the results.

    It's kind of hard to imagine, but you could push the dimensions into the fact table; after all, they can't logically be at a lower grain than the Fact. Maybe do it in Kettle on the fly as rows get pushed into the bulk loader. You could look at that process as being the "cube load".
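
    As a rough sketch of that "cube load" (table names invented); with ICE you'd presumably build this flat output upstream in Kettle and feed it to the bulk loader, rather than running the join inside the database:

        -- Join the dimensions down onto the fact grain once, at load time
        CREATE TABLE sales_cube AS
        SELECT f.order_date,
               f.quantity,
               f.revenue,
               c.region,      -- customer attributes, repeated per fact row
               c.segment,
               p.brand,       -- product attributes likewise
               p.category
        FROM   fact_sales f
        JOIN   dim_customer c ON c.customer_id = f.customer_id
        JOIN   dim_product  p ON p.product_id  = f.product_id;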

    If I've understood Jeff correctly (always questionable…) then Infobright should be able to do distincts and aggregates *very* cheaply. (Does that tally with your testing?) I understand completely why Mondrian would add SSB as a test; Mondrian is set up to work with lots of separate tables. I haven't tried Mondrian with Infobright, but I'm guessing it would have the same problems it showed in the SSB.

    That said, I'm guessing that Mondrian could be optimised to work with Infobright. At the very least they could take the MDX-to-SQL parsing and offer that as a standalone product.


    Joe

  6. Hi Joe,

    Sorry for the delay in my response! I appreciate the opportunity to clear things up.

    Infobright is geared towards analyzing extremely voluminous machine-generated data (MGD). The flat, amorphous nature of MGD fits well with the optimal denormalized structure. Plus, since MGD rarely (if ever) gets updated, one-off UPDATE statements will hardly (if ever) be used.

    As you mentioned, Infobright does analytics and aggregates very easily. You really see dramatic performance improvements when large parts of the data set are eliminated from the query execution plan (e.g. SUM(revenue) ... WHERE cola = 'blah' AND colb < bleh). We shine in these cases because we eliminate a lot of I/O.
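
    Cleaned up, with placeholder table and column names, that example looks like this:

        -- The knowledge grid stores min/max metadata for every data pack
        -- (a pack holds 65,536 values of one column); any pack whose
        -- metadata rules out these predicates is skipped without ever
        -- being read from disk.
        SELECT SUM(revenue)
        FROM   weblog_facts
        WHERE  cola = 'blah'
          AND  colb < 100;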

    Note, though, that we are continuing to improve on our joins. In fact, our upcoming release will see some improvement in this area.

    Cheers,

    Jeff

