I'm posting this 'thumbnail' for the sake of my own memory as much anything else. :)
TL;DR - It's very _close_ to being an ideal solution but, sadly, has some flaws we can't work around right now.
Good:
- Very straightforward agent based architecture. Very simple to install and consume.
- Reliably syncs/replicates *simple* data from source DBs via flat files and cleans up after itself.
- Performs simple and complex joins on the data source. This is a *major* weakness for competitors as they can't do this.
- All functionality is wizard driven and probably consumable by line of business users.
Bad:
- Cannot create SQL based transforms without uploading PowerCenter mappings. This is too much to ask of LOB users.
- Extracts land as plain text on the agent host before gzipping. This is a *very* slow approach. They should pipe directly to gzip.
- Logging is *extremely* verbose. Every single row is written to the success or error. You'll quickly create many GB of logs.
- Upsert operations update *every* matching row in Redshift. This is handled as a delete plus and insert within Redshift, which means we then need a very slow vacuum to reclaim space.
- Finally, the deal breaker, cannot handle `timestamp with timezone` data types. In 2014. Redshift cannot handle these either but I need the ETL platform to be able to work with them.
Following your blog with interest. What is your current ETL for Redshift?
ReplyDelete