Management consultancy releases free ETL tool

Management consultancy McKinsey has created and released its first ever open source software, Kedro – a Python-based data pipeline workflow development tool that McKinsey says it has used on over 50 of its own projects.

Kedro, designed for use by data scientists (source code here), is the brainchild of two QuantumBlack engineers, Nikolaos Tsaousis and Aris Valtazanos, who created it to manage their workstreams at the analytics firm. (McKinsey bought QuantumBlack in 2015.)

Kedro lets users structure analytics code in a uniform way and deliver it production-ready, as well as build modular, versioned data pipelines.

The release represents a “big step” for the firm, said Jeremy Palmer, CEO of QuantumBlack, “as we continue to balance the value of proprietary assets with opportunities to engage as part of the developer community.”

Rise of Open Source

McKinsey joins companies as diverse as dedicated software firms, retailers like Walmart and accommodation marketplace Airbnb in releasing open source tools for popular consumption, amid a resurgence in open source software use in the enterprise (albeit one that has come hand-in-hand with ongoing challenges surrounding license type, the cloud and concerns about “asset stripping” of code bases).

For businesses, becoming part of the open source community offers several benefits: it can help them attract developers (who would often rather learn a foundational technology than one vendor’s proprietary system); it opens up the potential to commercialise the project in future by offering a managed service; it provides a way to avoid vendor lock-in by creating and nurturing a free tool (for which “with enough eyeballs, all bugs are shallow”); and it brings the warm fuzzy glow of creating a good tool and letting others use it for free.

McKinsey on Kedro: What’s Special?

In a Q&A published alongside an installation guide and other documentation, the project team explained how the tool differs from other workflow schedulers and extract-transform-load (ETL) tools.

“Data pipelines consist of extract-transform-load (ETL) workflows. If we understand that data pipelines must be scaleable, monitored, versioned, testable and modular then this introduces us to a spectrum of tools that can be used to construct such data pipelines. Pipeline abstraction is implemented in workflow schedulers like Luigi and Airflow, as well as in ETL frameworks like Bonobo ETL and Bubbles.”

“We see Airflow and Luigi as complementary frameworks: Airflow and Luigi are tools that handle deployment, scheduling, monitoring and alerting. Kedro is the worker that should execute a series of tasks, and report to the Airflow and Luigi managers. We are building integrations for both tools and intend these integrations to offer a faster prototyping time and reduce the barriers to entry associated with moving pipelines to both workflow schedulers.”
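To make the split between "worker" and "manager" concrete, the pipeline abstraction the team describes can be sketched in a few lines of plain Python. This is an illustrative sketch only and does not use Kedro's actual API: the names Node, run_pipeline, PIPELINE and scheduled_task are hypothetical, invented for this example. The assumption is that each step is a small function that declares its inputs and output by name, so the run order can be worked out from the data dependencies, and that a scheduler such as Airflow or Luigi would simply call the whole pipeline as one task.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Node:
    """One pipeline step (hypothetical): a function plus named inputs and output."""
    func: Callable[..., Any]
    inputs: List[str]
    output: str


def run_pipeline(nodes: List[Node], catalog: Dict[str, Any]) -> Dict[str, Any]:
    """Run every node once, in an order derived from the dataset names."""
    data = dict(catalog)  # copy, so the caller's catalog is left untouched
    pending = list(nodes)
    while pending:
        # A node is ready once all of its named inputs exist.
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            missing = sorted(
                {name for n in pending for name in n.inputs if name not in data}
            )
            raise ValueError(f"unresolvable inputs: {missing}")
        for n in ready:
            data[n.output] = n.func(*(data[name] for name in n.inputs))
            pending.remove(n)
    return data


# Extract, transform and load as three small, independently testable functions.
def extract(raw: str) -> List[str]:
    return raw.split(",")


def transform(fields: List[str]) -> List[int]:
    return [int(field) for field in fields if field.strip()]


def load(values: List[int]) -> int:
    return sum(values)


# Nodes are listed out of order on purpose: the runner sorts them out.
PIPELINE = [
    Node(load, ["clean"], "total"),
    Node(transform, ["fields"], "clean"),
    Node(extract, ["raw"], "fields"),
]


def scheduled_task(raw: str) -> int:
    """What a scheduler (the "manager") would invoke; the pipeline is the "worker"."""
    return run_pipeline(PIPELINE, {"raw": raw})["total"]
```

Because each function knows nothing about the others, any one of them can be tested or swapped in isolation, which is the modularity the Q&A refers to. Scheduling, monitoring and alerting stay outside this code, in whichever workflow manager calls scheduled_task.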