Open-source data stacks

Data engineering

Growing companies looking to develop their digital and data capabilities typically face the same question: when to start investing in commercial solutions. Numerous tools and software products offer data hosting, connectors and data visualisation, often at a steep price, even though a company may not yet be mature enough, or have the required skills, to use them fully.

An easily overlooked option is to implement open-source solutions, which can come close to offering the same features, albeit without the customer support.

This project implements such an open-source data stack inside Docker containers, which makes it easier both to install and to access.

Data orchestrator

One of the hallmarks of a good data stack is the ability to program tasks in advance along with their dependencies. Airflow allows the creation of Directed Acyclic Graphs (DAGs) of tasks and runs these tasks on a schedule. The tool can be set up to connect with others, such as Airbyte and dbt.
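To make the idea of tasks with dependencies concrete, here is a minimal sketch using only the Python standard library. It is not Airflow's API and not the code from this project; the task names (airbyte_sync, dbt_run, dbt_test) and the execution_order helper are hypothetical, chosen to mirror the stack described here. It only shows how a directed acyclic graph of tasks resolves into a valid run order.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names. Each task maps to the set of tasks
# that must finish before it can start.
dependencies = {
    "airbyte_sync": set(),         # load raw data first
    "dbt_run": {"airbyte_sync"},   # transform once the data has landed
    "dbt_test": {"dbt_run"},       # test the transformed tables last
}


def execution_order(deps):
    """Return one valid run order for the task graph.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is why the graph has to be acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)

# A circular dependency cannot be scheduled.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

An orchestrator adds scheduling, retries and monitoring on top of this ordering, but the dependency graph is the core of what gets declared.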

Data connector

Airbyte is, at the time of writing, an open-source data connector that can consolidate data into a number of databases. Singer, for instance, is one alternative.

Data transformation

Built on an SQL layer, dbt allows transformations to be version controlled and provides transformation testing, which is useful for ensuring that pre-programmed transformations are valid.
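The idea behind transformation testing can be sketched as follows. This is an illustration only: dbt itself defines models as SQL files and tests in its own configuration, whereas this sketch uses Python's built-in sqlite3 module. The table and column names (raw_orders, orders, order_id, amount) and the failing_rows helper are hypothetical. It follows the convention dbt uses for its tests, where a test is a query that selects failing rows and passes when it returns none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw table, standing in for data loaded by a connector.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, 1250), (2, 990), (3, 4000)],
)

# The "transformation": a SQL view that renames and rescales columns.
conn.execute(
    """
    CREATE VIEW orders AS
    SELECT id AS order_id, amount_cents / 100.0 AS amount
    FROM raw_orders
    """
)


def failing_rows(connection, query):
    """Run a test query and return the rows that violate the check."""
    return connection.execute(query).fetchall()


# Check 1: order_id must never be null.
not_null_failures = failing_rows(
    conn, "SELECT * FROM orders WHERE order_id IS NULL"
)

# Check 2: order_id must be unique.
unique_failures = failing_rows(
    conn,
    "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1",
)

first_amount = conn.execute(
    "SELECT amount FROM orders WHERE order_id = 1"
).fetchone()[0]

print("not null failures:", not_null_failures)
print("unique failures:", unique_failures)
```

Because each check is just a query, the checks can live in version control next to the transformation they guard.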

Data hosting

Any number of databases are compatible with dbt and Airbyte. PostgreSQL works well as a default, though MySQL or ClickHouse could be interesting alternatives.

To see the code and results, follow the link to my GitHub page.
