Monday, 8 December 2014

Micro-review of InformaticaCloud for Amazon Redshift

I recently finished an evaluation of Informatica Cloud's platform as a candidate to take over our Redshift ETL workload.

I'm posting this 'thumbnail' for the sake of my own memory as much anything else. :)

TL;DR - It's very _close_ to being an ideal solution but, sadly, has some flaws we can't work around right now.

Good:

  • Very straightforward agent based architecture. Very simple to install and consume.
  • Reliably syncs/replicates *simple* data from source DBs via flat files and cleans up after itself.
  • Performs simple and complex joins on the data source. This is a *major* weakness for competitors as they can't do this.
  • All functionality is wizard driven and probably consumable by line of business users.


Bad:

  • Cannot create SQL based transforms without uploading PowerCenter mappings. This is too much to ask of LOB users.
  • Extracts land as plain text on the agent host before gzipping. This is a *very* slow approach. They should pipe directly to gzip.
  • Logging is *extremely* verbose. Every single row is written to the success or error. You'll quickly create many GB of logs.
  • Upsert operations update *every* matching row in Redshift. This is handled as a delete plus and insert within Redshift, which means we then need a very slow vacuum to reclaim space.
  • Finally, the deal breaker, cannot handle `timestamp with timezone` data types. In 2014. Redshift cannot handle these either but I need the ETL platform to be able to work with them.

1 comment:

  1. Following your blog with interest. What is your current ETL for Redshift?

    ReplyDelete

Disqus for @joeharris76