ETL Power with Pygrametl

14 May 2020 by Andrew Boag

Catalyst has been helping our clients wrestle data from one system to another for many years. Whether this be very formal Enterprise system integration engagements, or just custom data migrations. Thankfully the tool sets available to us in the open source world are extremely flexible. And have yet to let us down.

Catalyst has seen and worked with a number of Extract Transform Load (ETL) solutions. Ranging from mammoth perl scripts to graphical interfaces giving visual representations of transformations and data flow.

For some years now, Catalyst had been developing and supporting a particular data warehouse instance, with data imported using Pentaho Kettle (an open source ETL solution). Kettle allowed us to do some great stuff, and the visual interface was helpful for quickly throwing together new transformations.

As part of a recent review, we decided to assess some of the options that were available to us in order to address some of the shortcomings we were dealing with. It was not quite a “build it again from scratch” scenario, but all were clear that there was some hidden rust with the existing setup.

Initially, we were very keen on staying with a graphical ETL tool, one that was open source of course. They exist, but we didn’t find anything we liked. There are, of course, a lot of graphical ETL products in the proprietary software ETL space, some actually quite good, but our experience is that we are for more agile using open source tools that we can experiment and iterate with.

Research and experimentation lead us to pygrametl – an open source Python ETL framework maintained by a group led by Associate Professor Christian Thomsen from Aalborg University, Denmark. According to the documentation it has been around since 2009 with uptake in healthcare, finance and transport.

It’s easy to install and start getting your feet wet.

There was some level of hesitation in the team about moving towards a scripting language for ETLs as we feared that maintainability would suffer. It was always so handy to be able to “see” transformations. The visualisations are a sort of self-documentation.

But after an initial prototyping sprint, it was very clear that the authors of pygrametl have the right idea.

The things we loved:

  • Lots of good code examples in the documentation.
  • Clever implementation of Dimension and Fact Table inserts and updates. We were able to migrate a lot of our legacy logic and data warehouse table structure with minimal changes.
  • The framework is built on the power of python, meaning any bespoke scripting transformations are easy even for a novice python developer.
  • Easy to run on pretty much any Linux environment.

Paired with metabase, we were able to present real-time operational and strategic reports from a number of data sources. Very powerful. We’ve managed to achieve a massive amount in a few short months. More about metabase in another upcoming blog.

Links