Pipelines

Introduction

This repository holds various data science pipelines.

Getting Started

Set up your development environment:

make install_dev

To run pipelines against the test fixtures, symlink the fixture input directory to the expected input location. The command below assumes you are in the repository root.

ln -s $PWD/data/fixtures/input $PWD/data/input

Run a pipeline with: pipes --pipeline sample or python -m pipelines --pipeline sample.

Run pipes --help to see all possible options.

Configuration Management

This repository uses hierarchical configuration management. Multiple configuration files are loaded, and their type and load order determine how they are consolidated.

The following configuration files are loaded upon pipeline start:

Main root config: config.yaml.
Data connector configuration: data_connectors.yaml.
Any further root configs specified as command-line arguments with --root_config.
Main pipeline config: {pipeline}/config.yaml.
Any further pipeline configs specified as command-line arguments with --pipeline_config.
Configuration specified as a command-line argument with --cli_config.

These files are combined into a single configuration object, in the order listed above. Previously unseen keys are added to this object. Conflicting keys are resolved by giving precedence to the configuration file loaded last.

One can experiment with this functionality by running:

# Multiplier 5.
pipes --pipeline sample
# Multiplier 7.
pipes --pipeline sample --pipeline_config e2e_test
# Multiplier 10.
pipes --pipeline sample --pipeline_config production
# Multiplier 7 as `e2e_test` is specified last.
pipes --pipeline sample --pipeline_config production e2e_test
# Printed value is cli_pipeline_value.
pipes --pipeline sample --cli_config sample.key=cli_pipeline_value

Inspect the price column in data/output/sample/products.csv to see the result.
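The precedence rule described above can be illustrated with a minimal Python sketch. This is not the repository's actual implementation: the helper names merge_configs and parse_cli_override, the multiplier key name, and the recursive merging of nested mappings are all assumptions made for illustration. Only the load order, the last-loaded-wins rule, and the sample.key=cli_pipeline_value argument come from the text above.

```python
from functools import reduce
from typing import Any


def merge_configs(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which keys from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Assumption: nested mappings are merged recursively.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Unseen keys are added; conflicting keys are overwritten.
            merged[key] = value
    return merged


def parse_cli_override(argument: str) -> dict[str, Any]:
    """Turn a dotted `a.b=value` argument into {"a": {"b": "value"}}."""
    dotted_key, _, value = argument.partition("=")
    nested: Any = value
    for part in reversed(dotted_key.split(".")):
        nested = {part: nested}
    return nested


# Hypothetical stand-ins for the parsed YAML files, listed in load order.
loaded = [
    {"sample": {"multiplier": 5, "key": "default_value"}},  # {pipeline}/config.yaml
    {"sample": {"multiplier": 10}},  # --pipeline_config production
    {"sample": {"multiplier": 7}},  # --pipeline_config e2e_test
    parse_cli_override("sample.key=cli_pipeline_value"),  # --cli_config
]

# Fold the list left to right so that later entries take precedence.
config = reduce(merge_configs, loaded, {})
print(config)
```

Folding the list from left to right is what makes the order of arguments matter: swapping the production and e2e_test entries in this sketch would leave the multiplier at 10 instead of 7.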

Data Connectors

Data connectors allow pipelines to read from and write to different data sources. Currently, the following connectors are supported:

Local file system.
Google Cloud Storage.
Database.

To interact with GCS, ensure you are logged into the gcloud CLI tool. To interact with the database, copy the .env.example file to .env and fill in the values.

cp .env.example .env

To benefit from the data connector abstraction, only read or write data using the functions provided in data/reader.py and data/writer.py. Never import data connectors in a pipeline or step module.
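The reasoning behind this rule can be sketched as follows. Everything in the example is hypothetical: the DataConnector protocol, the LocalFileConnector class, the CONNECTORS registry, and the read_data and write_data signatures are assumptions for illustration and do not reflect the actual contents of data/reader.py or data/writer.py. The point is only that a step which goes through reader and writer functions never needs to know which connector sits behind them.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class DataConnector(Protocol):
    """Hypothetical interface that every connector would implement."""

    def read(self, name: str) -> str: ...

    def write(self, name: str, content: str) -> None: ...


class LocalFileConnector:
    """Hypothetical connector backed by a directory on the local file system."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def read(self, name: str) -> str:
        return (self.root / name).read_text(encoding="utf-8")

    def write(self, name: str, content: str) -> None:
        path = self.root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")


# Assumption: connectors are registered by name, e.g. from data_connectors.yaml.
CONNECTORS: dict[str, DataConnector] = {}


def read_data(connector_name: str, name: str) -> str:
    """Reader function: the only read entry point a step would use."""
    return CONNECTORS[connector_name].read(name)


def write_data(connector_name: str, name: str, content: str) -> None:
    """Writer function: the only write entry point a step would use."""
    CONNECTORS[connector_name].write(name, content)


def run_step(connector_name: str) -> None:
    """A step that never imports or instantiates a connector itself."""
    raw = read_data(connector_name, "input/products.csv")
    write_data(connector_name, "output/products.csv", raw.upper())


with tempfile.TemporaryDirectory() as tmp:
    CONNECTORS["local"] = LocalFileConnector(Path(tmp))
    write_data("local", "input/products.csv", "name,price\napple,2\n")
    run_step("local")
    result = read_data("local", "output/products.csv")

print(result)
```

In a design like this, pointing the step at a different data source would only require registering another connector under a different name; run_step itself would stay unchanged.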

Services

This repository has one optional service, the database. To run the database, run docker-compose up --detach database.
