Management consultancy releases free ETL tool
Management consultancy McKinsey has created and released its first ever piece of open source software: Kedro, a Python-based data pipeline development tool that McKinsey says it has used on more than 50 of its own projects.
Kedro, designed for use by data scientists (source code here), is the brainchild of two QuantumBlack engineers, Nikolaos Tsaousis and Aris Valtazanos, who created it to manage their workstreams at the analytics firm. (McKinsey bought QuantumBlack in 2015.)
Kedro lets users structure analytics code in a uniform way and deliver it production-ready, as well as build modular, versioned data pipelines.
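The "modular pipeline" idea can be illustrated with a minimal plain-Python sketch: pure functions wired together by named datasets, executed against a shared catalog. This is a hypothetical illustration of the concept, not Kedro's actual API; the function and dataset names are invented.

```python
# Illustrative sketch of the node-and-pipeline idea (not Kedro's API):
# each node is a pure function plus the named datasets it reads and writes.

def clean(raw_orders):
    """Drop records with missing totals."""
    return [o for o in raw_orders if o.get("total") is not None]

def summarise(clean_orders):
    """Aggregate order count and revenue."""
    return {"count": len(clean_orders),
            "revenue": sum(o["total"] for o in clean_orders)}

# The pipeline is just declarative wiring: function, inputs, output.
PIPELINE = [
    {"func": clean,     "inputs": ["raw_orders"],   "output": "clean_orders"},
    {"func": summarise, "inputs": ["clean_orders"], "output": "summary"},
]

def run(pipeline, catalog):
    """Execute nodes in order, resolving inputs from the shared catalog."""
    for node in pipeline:
        args = [catalog[name] for name in node["inputs"]]
        catalog[node["output"]] = node["func"](*args)
    return catalog

catalog = {"raw_orders": [{"total": 10.0}, {"total": None}, {"total": 5.5}]}
result = run(PIPELINE, catalog)
# result["summary"] == {"count": 2, "revenue": 15.5}
```

Because each node only declares named inputs and outputs, nodes can be swapped, tested, and versioned independently, which is the uniformity the tool is aiming for.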
The release represents a “big step” for the firm, said Jeremy Palmer, CEO of QuantumBlack, “as we continue to balance the value of proprietary assets with opportunities to engage as part of the developer community.”
Rise of Open Source
McKinsey joins companies as diverse as dedicated software firms, retailers like Walmart, and accommodation marketplace Airbnb in releasing open source tools for popular consumption, amid a resurgence of open source software in the enterprise (albeit one that has come hand-in-hand with ongoing challenges surrounding licence types, the cloud, and concerns about “asset stripping” of code bases).
Read this: Mark Shuttleworth on Taking Canonical Public, Legacy IT and Ubuntu, and his Botanical Garden
For businesses, becoming part of the open source community can help attract developers (who would often rather learn a foundational technology than one vendor’s proprietary system); it opens up the potential to commercialise the project in future by offering a managed service; it offers a way to avoid vendor lock-in by creating and nurturing a free tool (for which “with enough eyeballs, all bugs are shallow”); and it brings the warm fuzzy glow of creating a good tool and letting others use it for free.
McKinsey on Kedro: What’s Special?
In a Q&A published alongside an installation guide and other documentation, the project team explained how the tool differs from other workflow schedulers and extract-transform-load (ETL) tools.
“Data pipelines consist of extract-transform-load (ETL) workflows. If we understand that data pipelines must be scalable, monitored, versioned, testable and modular, then this introduces us to a spectrum of tools that can be used to construct such data pipelines.
Pipeline abstraction is implemented in workflow schedulers like Luigi and Airflow, as well as in ETL frameworks like Bonobo ETL and Bubbles.”
“We see Airflow and Luigi as complementary frameworks: Airflow and Luigi are tools that handle deployment, scheduling, monitoring and alerting. Kedro is the worker that should execute a series of tasks, and report to the Airflow and Luigi managers. We are building integrations for both tools and intend these integrations to offer a faster prototyping time and reduce the barriers to entry associated with moving pipelines to both workflow schedulers.”
Kedro vs Other ETL Frameworks
McKinsey said the primary differences to Bonobo ETL and Bubbles are:
- “Ability to support big data operations. Kedro supports big data operations by allowing you to use PySpark on your projects. We also look at processing dataframes differently to both tools, as we consider entire dataframes and do not make use of the slower line-by-line data stream processing.
- Project structure. Kedro provides a built-in project structure from the beginning of your project, configured for best-practice project management.
- Automatic dependency resolution for pipelines. The Pipeline module also maps out dependencies between nodes and displays the results in a sophisticated but easy-to-understand directed acyclic graph.
- Extensibility.”
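The automatic dependency resolution described above amounts to a topological sort over the datasets each node consumes and produces. A minimal sketch of the idea, using Python's standard-library `graphlib` rather than Kedro's internals (node and dataset names are invented):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical nodes, deliberately listed out of execution order.
# Each entry: node name -> (datasets it consumes, dataset it produces).
NODES = {
    "train":     (["features"], "model"),
    "featurise": (["clean"],    "features"),
    "clean":     (["raw"],      "clean"),
}

def execution_order(nodes):
    """Derive a valid run order from dataset dependencies alone."""
    # Map each produced dataset back to the node that produces it.
    produced_by = {out: name for name, (_, out) in nodes.items()}
    # A node depends on whichever nodes produce its inputs;
    # inputs no node produces (e.g. "raw") are free external data.
    graph = {
        name: {produced_by[i] for i in inputs if i in produced_by}
        for name, (inputs, _) in nodes.items()
    }
    return list(TopologicalSorter(graph).static_order())

order = execution_order(NODES)
# 'clean' runs first, then 'featurise', then 'train'
```

The same dependency graph, rendered as a directed acyclic graph, is what gives users the pipeline visualisation the team describes.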
Project manager Yetunde Dada said: “Data scientists are trained in mathematics, statistics and modelling—not necessarily in the software engineering principles required to write production code. Often, converting a pilot project into production code can add weeks to a timeline, a pain point with clients. Now, they can spend less time on the code, and more time focused on applying analytics.”