If you want it, kerblam it!

Kerblam! is a Rust command line tool to manage the execution of scientific data analysis, where having reproducible results and sharing the executed pipelines is important. It makes it easy to write multiple analysis pipelines and select what data is analysed.

With Kerblam! your analyses will be less bloated, more organized, and more reproducible.

Kerblam! is Free and Open Source Software, hosted on Github at MrHedmad/kerblam. The code is licensed under the MIT License.

Use the sidebar to jump to a specific section. If you have never used Kerblam! before, you can read the documentation from start to finish to learn all there is to know about Kerblam! by clicking on the arrows on the side of the page.

Kerblam! is very opinionated. To read more about why these choices were made, you can read the Kerblam! philosophy.

About

This page aggregates a series of meta information about Kerblam!.

License

The project is licensed under the MIT License. See the choose a license entry for the MIT License.

Citing

If you want or need to cite Kerblam!, provide a link to the Github repository or use the following Zenodo DOI: doi.org/10.5281/zenodo.10664806.

Naming

This project is named after the fictitious online shop/delivery company in S11E07 of Doctor Who. Kerblam! might be referred to as Kerblam!, Kerblam or Kerb!am, interchangeably, although Kerblam! is preferred. The Kerblam! logo is written in the Kwark Font by tup wanders.

About this book

This book is rendered by mdbook, and is written as a series of markdown files. Its source code is available in the Kerblam! repo under the ./docs/ folder.

The book hosted online always refers to the latest Kerblam! release. If you are looking for older or newer versions of this book, you should read the markdown files directly on Github, where you can select which tag to view from the top bar, or clone the repository locally, checkout to the commit you like, and rebuild from source. If you're interested, read the development guide to learn more.

Installation

You have a few options when installing Kerblam!.

Requirements

Currently, Kerblam! only supports macOS (both Intel and Apple Silicon) and GNU/Linux. Other Unix/Linux variants may work, but are untested. Kerblam! also relies on a few binaries that it assumes are already installed and visible from your $PATH: git, make, tar, bash, and either docker or podman.

If you can use git, make, tar, bash and docker or podman from your CLI, you're good to go!

Most, if not all, of these tools come pre-packaged in most Linux distributions. Check your package repositories for them.

You can find and download a Kerblam! binary for your operating system in the releases tab.

There are also helpful scripts that automatically download the correct version for your specific operating system thanks to cargo-dist. You can always install or update to the latest version with:

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/MrHedmad/kerblam/releases/latest/download/kerblam-installer.sh | sh

Be warned that the above command executes a script downloaded from the internet. You can manually follow the URL above to download the same installer script and inspect it before you run it, if you'd like.

Install from source

If you want to install the latest version from source, install Rust and cargo, then run:

cargo install kerblam

If you wish to instead use the latest development version, run:

cargo install --git https://github.com/MrHedmad/kerblam.git

The main branch should always compile on supported platforms with the above command. If it does not, please open an issue.

Adding the Kerblam! badge

You can add a Kerblam! badge in the README of your project to show that you use Kerblam! Just copy the following code and add it to the README:

![Kerblam!](https://img.shields.io/badge/Kerblam!-v0.5.1-blue?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAABlVBMVEUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADW1tYNDHwcnNLKFBQgIB/ExMS1tbWMjIufDQ3S0tLOzs6srKyioqJRUVFSS0o0MjIBARqPj48MC3pqaWkIB2MtLS1ybm3U1NS6uroXirqpqamYmJiSkpIPZ4yHh4eFhIV8fHwLWnuBe3kMC3cLCnIHBlwGBlgFBU8EBEVPRkICAi4ADRa+EhIAAAwJCQmJiYnQ0NDKysoZkMK2trYWhLOjo6MTeKMTd6KgoKCbm5uKiIaAgIAPDHhubm4JT20KCW0KCWoIS2cHBUxBQUEEAz9IQT4DAz0DKTpFPTgCAjcCASoBASAXFxcgGRa5ERG1ERGzEBCpDw+hDg4fFA2WDAyLCgouAQFaWloFO1MBHStWBATnwMkoAAAAK3RSTlMA7zRmHcOuDQYK52IwJtWZiXJWQgXw39q2jYBgE/j2187JubKjoJNLSvmSt94WZwAAAvlJREFUSMeF1GdXGkEUgOGliIgIorFH0+u7JBIChEgJamyJvWt6783eS8rvzszAusACvp88x4d7hsvsaqdU57h8oQnobGmtb6xMzwbOkV9jJdvWBRwf7e9uLyzs7B3+o7487miC+AjcvZ3rkNZyttolbKxPv2fyPVrKYKcPhp7oIpPv0FkGN5N5rmd7afAFKH0MH99DihrTK2j3RTICF/Pt0trPUr9AxXyXpkJ3xu6o97tgQJDQm+Xlt6E8vs+FfNrg6kQ1pOuREVSPoydf9YjLpg14gMW1X0IInGZ+9PWr0Xl+R43pxzgM3NgCiekvqfE50hFdT7Ly8Jbo2R/xWYNTl8Ptwk6lgsHUD+Ji2NMlBFZ8ntzZRziXW5kLZsaDom/0yH/G+CSkapS3CvfFCWTxJZgMyqbYVLtLMmzoVywrHaPrrNJX4IHCDyCmF+nXhHXRkzhtCncY+PMig3pu0FfzJG900RBNarTTxrTCEwne69miGV5k8cPst3wOHSfrmJmcCH6Y42NEzzXIX8EFXmFE/q4ZXJrKW4VsY13uzqivF74OD39CbT/0HV/1yQW9Xn8e1O0w+WAG0VJS4P4Mzc7CK+2B7jt6XtFYMhl7Kv4YWMKnsJkXZiW3NgQXxTEKamM2fL8EjzwGv1srykZveBULj6bBZX2Bwbs03cXTQ3HAb9FOGNsS4wt5fw9zv0q9oZo54Gf4UQ95PLbJj/E1HFZ9DRgTuMecPgjfUqlF7Jo1B9wX+JFxmMh7mAoGv9B1pkg2tDoVl7i3G8mjH1mUN3PaspJaqM1NH/sJq2L6QJzEZ4FTCRosuKomdxjYSofDs8DcRPZh8hQd5IbE3qt1ih+MveuVeP2DxOMJAlphgSs1mt3GVWO6yMNGUDZDi1uzJLDNqxbZDLab3mqQB5mExtLYrtU45L10qlfMeSbVQ91eFlfRmnclZyR2VcB5y7pOYhouuSvg2rxHCZG/HHZnsVkVtg7NmkdirS6LzbztTq1EPo9dXRWxqtP7D+wL5neoEOq/AAAAAElFTkSuQmCC&link=https%3A%2F%2Fgithub.com%2FMrHedmad%2Fkerblam)

The above link is very long - this is because the Kerblam! logo is baked in as a base64 image. You can update the badge's version by directly editing the link (e.g. change v0.5.1 to v0.4.0) manually.

Quickstart

Welcome to Kerblam! This introductory chapter will give you a general overview of Kerblam!: what it does and how it does it.

Kerblam! is a project manager. It helps you write clean, concise data analysis pipelines, and takes care of chores for you.

Every Kerblam! project has a kerblam.toml file in its root. When Kerblam! looks for files, it does it relative to the position of the kerblam.toml file and in specific, pre-determined folders. This helps you keep everything in its place, so that others that are unfamiliar with your project can understand it if they ever need to look at it.

These folders, relative to where the kerblam.toml file is, are:

  • ./data/: Where all the project's data is saved. Intermediate data files are specifically saved here.
  • ./data/in/: Input data files are saved and should be looked for in here.
  • ./data/out/: Output data files are saved and should be looked for in here.
  • ./src/: Code you want to be executed should be saved here.
  • ./src/pipes/: Makefiles and shell scripts (your pipes) should be saved here. They have to be written as if they were saved in ./, the project root.
  • ./src/dockerfiles/: Container build scripts should be saved here.

Any sub-folder of one of these specific folders (with the exception of src/pipes and src/dockerfiles) contains the same type of files as the parent directory. For instance, data/in/fastq is treated as if it contains input data by Kerblam! just as the data/in directory is.

You can configure almost all of these paths in the kerblam.toml file, if you so desire. This is mostly done for compatibility reasons with non-kerblam! projects. New projects that wish to use Kerblam! are strongly encouraged to follow the standard folder structure, however.

The rest of these docs are written as if you are using the standard folder structure. If you are not, don't worry! All Kerblam! commands respect your choices in the kerblam.toml file.

If you want to convert an existing project to use Kerblam!, you can take a look at the kerblam.toml section of the documentation to learn how to configure these paths.
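For instance, a legacy project that keeps its data and workflows in non-standard folders could remap them with something like this in its kerblam.toml (a sketch; the folder names here are made up, and the keys are described in the kerblam.toml chapter):

[data.paths]
input = "./raw_data"          # instead of ./data/in
output = "./results"          # instead of ./data/out
intermediate = "./scratch"    # instead of ./data

[code]
pipes_dir = "./workflows"     # instead of ./src/pipes
env_dir = "./containers"      # instead of ./src/dockerfiles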

If you follow this standard (or you write proper configuration), you can use Kerblam! to do a bunch of things:

  • You can run pipelines written in make or arbitrary shell files in src/pipes/ as if you ran them from the root directory of your project by simply using kerblam run <pipe>;
  • You can wrap your pipelines in docker containers by just writing new dockerfiles in src/dockerfiles, with essentially just the installation of the dependencies, letting Kerblam! take care of the rest;
  • If you have wrapped up pipelines, you can export them for later execution (or to send them to a reviewer) with kerblam package <pipe> without needing to edit your dockerfiles;
  • If you have a package from someone else, you can run it with kerblam replay.
  • You can fetch remote data from the internet with kerblam data fetch, see how much disk space your project's data is using with kerblam data and safely cleanup all the files that are not needed to re-run your project with kerblam data clean.
  • You can show others your work by packing up the data with kerblam data pack and share the .tar.gz file around.
  • And more!

The rest of this tutorial walks you through every feature.
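For a quick taste before diving in, this is roughly what a Kerblam! session looks like once a project is set up (a sketch; it assumes a pipe called process_csv with a matching dockerfile, and the tag and directory names are made up):

kerblam data fetch                                    # download any remote input data
kerblam run process_csv                               # run the pipe (in a container, if one exists)
kerblam data                                          # see how much space the project's data takes
kerblam package process_csv --tag process_csv_pkg    # freeze the pipeline for later
kerblam replay process_csv.kerblam.tar ./replay_dir  # re-run it somewhere else
kerblam data clean                                    # remove everything that can be regenerated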

I hope you enjoy Kerblam! and that it makes your projects easier to understand, run and reproduce!

If you like Kerblam!, please consider leaving a star on Github. Thank you for supporting Kerblam!

Creating new projects - kerblam new

You can quickly create new kerblam! projects by using kerblam new.

Go to a directory where you want to store the new project and run kerblam new test-project. Kerblam! asks you some setup questions:

  • If you want to use Python;
  • If you want to use R;
  • If you want to use pre-commit;
  • If you have a Github account, and would like to set up the origin of your repository to github.com.

Say 'yes' to all of these questions to follow along. Kerblam! will then:

  • Create the project directory,
  • Make a new git repository,
  • Create the kerblam.toml file,
  • Create all the default project directories,
  • Make an empty .pre-commit-config file for you,
  • Create a venv environment, as well as the requirements.txt and requirements-dev.txt files (if you opted to use Python),
  • And set up the .gitignore file with appropriate ignores.

Kerblam! will NOT make an initial commit for you! You still need to do that manually once you've finished setting up.

You can now start working in your new project: simply cd test-project.
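Put together, bootstrapping a project looks something like this (a sketch; the parent directory is up to you):

cd ~/projects                      # or wherever you keep your work
kerblam new test-project           # answer the setup questions
cd test-project
git add --all
git commit -m "Initial commit"     # remember: Kerblam! does not commit for you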

Akin to git, Kerblam! will look in parent directories for a kerblam.toml file and run there if you call it from a project sub-folder. Efficient!

Pipelines

Kerblam! is first and foremost a pipeline runner.

Say that you have a script in ./src/calc_sum.py. It takes an input .csv file, processes it, and outputs a new .csv file, using stdin and stdout.

You have an input.csv file that you'd like to process with calc_sum.py. You could write a shell script or a makefile with the command to run. We'll refer to these scripts as "pipes".

Here's an example makefile pipe:

./data/out/output.csv: ./data/in/input.csv ./src/calc_sum.py
    cat $< | ./src/calc_sum.py > $@

You'd generally place this file in the root of the repository and run make to execute it.

This is perfectly fine for projects with a relatively simple structure and just one execution pipeline.

Imagine, however, that you have to change your pipeline to run two different jobs which share a lot of code and input data but have slightly (or dramatically) different execution. You might modify your pipe with if statements, use environment variables, or perhaps write many pipes and run them separately. In any case, a single file that has the job of running all the different pipelines adds complexity and makes managing the different execution scripts harder than it needs to be.

Kerblam! manages your pipes for you. You can write different makefiles and/or shell files for different types of runs of your project and save them in ./src/pipes/. When you kerblam run, Kerblam! looks into that folder, finds (by name) the makefiles that you've written, and brings them to the top level of the project (e.g. ./) for execution. In this way, you can write your pipelines as if they were in the root of the repository, cutting down on a lot of boilerplate paths.

For instance, you could have written a ./src/pipes/process_csv.makefile for the previous step, and you could invoke it with kerblam run process_csv. You could then write more makefiles or shell files for other tasks and run them similarly, keeping them all neatly separated from the rest of the code.
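In shell terms, adopting the example pipe is as simple as this (a sketch; it assumes the makefile from above is currently sitting in the project root):

mv Makefile src/pipes/process_csv.makefile   # store the pipe where Kerblam! looks for it
kerblam run process_csv                      # runs it as if it were still in ./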

The next sections outline the specifics of how Kerblam! executes pipes.

Executing code - kerblam run

The kerblam run command is used to run pipelines.

Kerblam! looks for files ending in the .makefile extension for makefiles and .sh for shell files in the pipelines directory (by default src/pipes/). It automatically uses the proper execution strategy based on what extension the file is saved as.

Shell files are always executed in bash. You can use anything that is installed on your system this way, e.g. snakemake or nextflow.

Make has a special execution policy to allow it to work with as little boilerplate as possible. You can read more on Make in the GNU Make book.

kerblam run supports the following flags:

  • --profile <profile>: Execute this pipeline with a profile. Read more about profiles in the section below.
  • --desc (-d): Show the description of the pipeline, then exit.
  • --local (-l): Skip running in a container, if a container is available, preferring a local run.

In short, kerblam run does something similar to this:

  • Move your pipe.sh or pipe.makefile file to the root of the project, under the name executor;
  • Launch make -f executor or bash executor for you.

This is why pipelines are written as if they are executed in the root of the project, because they are.
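Roughly, the equivalent manual steps would be (a sketch; Kerblam! also handles containers, profiles and cleanup for you):

cp src/pipes/process_csv.makefile ./executor   # bring the pipe to the project root
make -f executor                               # or: bash executor, for .sh pipes
rm executor                                    # Kerblam! cleans this up for you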

Data Profiles - Running the same pipelines on different data

You can run your same pipelines, as-is, on different data thanks to data profiles.

By default, Kerblam! will use your untouched ./data/in/ folder when executing pipes. If you want the same pipes to run on different sets of input data, Kerblam! can temporarily swap out your real data with this 'substitute' data during execution.

For example, a process_csv.makefile requires an input ./data/in/input.csv file. However, you might want to run the same pipe on another, different_input.csv file. You could copy and paste the first pipe and change every path to the first file to point to this alternative one. However, you would then have to maintain two essentially identical pipelines, and you are prone to introducing errors while you modify them (what if you forget to change one reference to the original file?). Kerblam! can do the same for you, but in an easy, declarative and less error-prone way.

Define in your kerblam.toml file a new section under data.profiles:

# You can use any ASCII name in place of 'alternate'.
[data.profiles.alternate]
# The quotes are important!
"input.csv" = "different_input.csv"

You can then run the same makefile with the new data with:

kerblam run process_csv --profile alternate

Paths under every profile section are relative to the input data directory, by default data/in.

Under the hood, Kerblam! will:

  • Rename input.csv to input.csv.original;
  • Move different_input.csv to input.csv;
  • Run the analysis as normal;
  • When the run ends (it finishes, it crashes or you kill it), Kerblam! will undo both actions: it moves different_input.csv back to its original place and renames input.csv.original back to input.csv.

This effectively causes the makefile to run with different input data.
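In shell terms, a profile run is little more than this (a sketch; Kerblam! performs and undoes the swaps for you, even when the run fails):

mv data/in/input.csv data/in/input.csv.original    # stash the real input
mv data/in/different_input.csv data/in/input.csv   # swap in the profile's file
make -f executor                                   # run the pipe as normal
mv data/in/input.csv data/in/different_input.csv   # undo the swap...
mv data/in/input.csv.original data/in/input.csv    # ...and restore the original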

Be careful: the output data will (most likely) be saved under the same file names as in a "normal" run!

Kerblam! does not look into where the output files are saved or what they are saved as. If you really want to, use the KERBLAM_PROFILE environment variable described below and change the output paths accordingly.

Profiles are most commonly useful to run the pipelines on test data that is faster to process or that produces pre-defined outputs. For example, you could define something similar to:

[data.profiles.test]
"input.csv" = "test_input.csv"
"configs/config_file.yaml" = "configs/test_config_file.yaml"

And execute your test run with kerblam run pipe --profile test.

The profiles feature is used so commonly for test data that Kerblam! will automatically make a test profile for you, swapping all input files in the ./data/in folder that start with test_xxx with their "regular" counterparts xxx. For example, the profile above is redundant!

If you write a [data.profiles.test] profile yourself, Kerblam! will not modify it in any way, effectively disabling the automatic test profile feature.

Kerblam! tries its best to cleanup after itself (e.g. undo profiles, delete temporary files, etc...) when you use kerblam run, even if the pipe fails, and even if you kill your pipe with CTRL-C.

If your pipeline is unresponsive to a CTRL-C, pressing it twice (sending two SIGINT signals in a row) will kill Kerblam! instead, leaving the child process to be cleaned up by the OS and any active profile not cleaned up.

This is to allow you to stop whatever Kerblam! or the pipe is doing in case of emergency.

Kerblam! will run the pipelines with the environment variable KERBLAM_PROFILE set to whatever the name of the profile is. In this way, you can detect from inside the pipeline if you are in a profile or not. This is useful if you want to keep the outputs of different profiles separate, for instance.
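For example, a shell pipe could branch on KERBLAM_PROFILE to keep outputs separate (a sketch; the paths and script name come from the earlier example):

# If KERBLAM_PROFILE is set, append "/<profile name>" to the output directory
OUT_DIR="data/out${KERBLAM_PROFILE:+/$KERBLAM_PROFILE}"
mkdir -p "$OUT_DIR"
./src/calc_sum.py < data/in/input.csv > "$OUT_DIR/output.csv"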

Containerized Execution of Pipelines

Kerblam! can ergonomically run pipelines inside containers for you, making it easier to be reproducible.

If Kerblam! finds a container recipe (such as a Dockerfile) of the same name as one of your pipes in the ./src/dockerfiles/ folder (e.g. ./src/dockerfiles/process_csv.dockerfile for the ./src/pipes/process_csv.makefile pipe), it will use it automatically when you execute a pipeline (e.g. kerblam run process_csv) to run the pipeline inside a container.

Specifically, it will do something similar to this:

  • Copy the pipeline to the root of the directory (as it does normally when you launch kerblam run), as ./executor;
  • Run docker build -f ./src/dockerfiles/process_csv.dockerfile --tag process_csv_kerblam_runtime . to build the container;
  • Run docker run --rm -it -v ./data:/data --entrypoint make process_csv_kerblam_runtime -f /executor.

This last command runs the container, telling it to execute make on the /executor makefile (the -f /executor arguments). Note that this is not exactly what Kerblam! does - it has additional logic to correctly mount your paths, capture stdin and stdout, etc...

If your dockerfile has a COPY . . directive, you can effectively have Kerblam! run your projects in docker environments: you can tweak your dependencies and tooling (which might be different from your dev environment) and execute in a protected, reproducible environment.

Kerblam! will build the container images without moving the recipes around (this is what the -f flag does). The .dockerignore file in the build context (next to the kerblam.toml) is shared by all pipes. See the 'using a dockerignore' section of the Docker documentation for more.

You can write dockerfiles for both make and sh pipes. Kerblam! automatically configures the correct entrypoint and arguments to run the pipe in the container.

Read the "writing dockerfiles for Kerblam!" section to learn more about how to write dockerfiles that work nicely with Kerblam! (spoiler: it's easier than writing canonical dockerfiles!).

For example, you can have the following Dockerfile:

# ./src/dockerfiles/process_csv.dockerfile

FROM ubuntu:latest

RUN apt-get update && apt-get install -y python3 python3-pip && \
    pip install pandas

COPY . .

and this dockerignore file:

# ./src/dockerfiles/.dockerignore
.git
data
venv

and simply run kerblam run process_csv to build the container and run your code inside it.

If you run kerblam run without a pipeline (or with a non-existent pipeline), you will get the list of available pipelines. You can see at a glance which pipelines have an associated dockerfile, as they are prepended with a little whale (πŸ‹):

Error: No runtime specified. Available runtimes:
    πŸ‹β—Ύ my_pipeline :: Generate the output data in a docker container
    β—Ύβ—Ύ local_pipeline :: Run some code locally

Default dockerfile

Kerblam! will look for a default.dockerfile if it cannot find a container recipe for the specific pipe (e.g. pipe.dockerfile), and use that instead. You can use this to write a generic dockerfile that works for your simplest pipelines. The whale (πŸ‹) emoji in the list of pipes is replaced by a fish (🐟) for pipes that use the default container, so you can identify them easily:

Error: No runtime specified. Available runtimes:
    πŸ‹β—Ύ my_pipeline :: Generate the output data in a docker container
    πŸŸβ—Ύ another :: Run in the default container

Switching backends

Kerblam! runs containers by default with Docker, but you can tell it to use Podman instead by setting the execution > backend option in your kerblam.toml:

[execution]
backend = "podman" # by default "docker"

Podman is slightly harder to set up, but has a few benefits, mainly not having to run as root and being a FOSS program. For 90% of use cases, you can use Podman instead of Docker and it will work exactly the same. Podman and Docker images are interchangeable, so you can use Podman with Docker Hub with no issues.

Setting the container working directory

Kerblam! does not parse your dockerfile or add any magic to the calls that it makes based on heuristics. This means that if you wish to save your code somewhere other than the root of the container, you must tell Kerblam! about it.

For instance, this recipe copies the contents of the analysis in a folder called "/app":

COPY . /app/

This one does the same by using the WORKDIR directive:

WORKDIR /app
COPY . .

If you change the working directory, let Kerblam! know by setting the execution > workdir option in kerblam.toml:

[execution]
workdir = "/app"

In this way, Kerblam! will run the containers with the proper paths.

This option applies to ALL containers managed by Kerblam!

There is currently no way to configure a different working directory for every specific dockerfile.

Writing Dockerfiles for Kerblam!

When you write dockerfiles for use with Kerblam! there are a few things you should keep in mind:

  • Kerblam! will automatically set the proper entrypoints for you;
  • The build context of the dockerfile will always be the place where the kerblam.toml file is.
  • Kerblam! will not ignore any file for you.
  • The behaviour of kerblam package is slightly different than kerblam run, in that the context of kerblam package is an isolated "restarted" project, as if kerblam data clean --yes was run on it, while the context of kerblam run is the current project, as-is.

This means a few things:

COPY directives are executed in the root of the repository

This is exactly what you want, usually. This makes it possible to copy the whole project over to the container by just using COPY . ..

The data directory is excluded from packages

If you have a COPY . . directive in the dockerfile, it will behave differently when you kerblam run versus when you kerblam package.

When you run kerblam package, Kerblam! will create a temporary build context with no input data. This is what you want: Kerblam! needs to separately package your (precious) input data on the side, and copy in the container only code and other execution-specific files.

In a run, the current local project directory is used as-is as the build context. This means that the data directory will be copied over. At the same time, Kerblam! will also mount the same directory to the running container, so the copied files will be "overwritten" by the live mountpoint while the container is running.

This generally means that copying the whole data directory is useless in a run, and that it cannot be done during packaging.

Therefore, a best practice is to ignore the contents of the data folders in the .dockerignore file. This makes no difference while packaging containers but a big difference when running them, as docker skips copying the useless data files.

To do this in a standard Kerblam! project, simply add this to your .dockerignore:

# Ignore the intermediate/output directory
data

You might also want to add any files that you know are not useful in the docker environment, such as local python virtual environments.

Your dockerfiles can be very small

Since the configuration is handled by Kerblam!, the main reason to write dockerfiles is to install dependencies.

This makes your dockerfiles generally very small:

FROM ubuntu:latest

RUN apt-get update && apt-get install # a list of packages

COPY . .

You might also be interested in the article 'best practices while writing dockerfiles' by Docker.

Docker images are named based on the pipeline name

If you run kerblam run my_pipeline twice, the same container image is rebuilt both times to run the pipeline, meaning that caching will make your execution quite fast if you place the COPY . . directive near the bottom of the dockerfile.

This way, you can essentially work exclusively in docker and never install anything locally.

Kerblam! names the container images for the pipelines as <pipeline name>_kerblam_runtime.
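Since the image names are predictable, you can also jump into a built image yourself to poke around (a sketch; swap docker for podman if that is your backend):

docker run --rm -it --entrypoint bash my_pipeline_kerblam_runtime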

Describing pipelines

If you execute kerblam run without specifying a pipe (or you try to run a pipe that does not exist), you will get a message like this:

Error: no runtime specified. Available runtimes:
    β—Ύβ—Ύ process_csv
    πŸ‹β—Ύ save_plots
    β—Ύβ—Ύ generate_metrics

The whale emoji (πŸ‹) represents pipes that have an associated Docker container.

If you wish, you can add additional information to this list by writing a section in the makefile/shellfile itself. Using the same example as above:

#? Calculate the sums of the input metrics
#?
#? The script takes the input metrics, then calculates the row-wise sums.
#? These are useful since we can refer to this calculation later.

./data/out/output.csv: ./data/in/input.csv ./src/calc_sum.py
    cat $< | ./src/calc_sum.py > $@

If you add this block of lines starting with #? , Kerblam! will use them as descriptions (note that the space after the ? is important!), and it will treat them as markdown. The first paragraph of text (#? lines not separated by an empty #? line) will be the title of the pipeline. Try to keep this short and to the point. The rest of the lines will be the long description.

Kerblam! will parse all lines starting with #? , although it's preferable to only have a single contiguous description block in each file.

The output of kerblam run will now read:

Error: no runtime specified. Available runtimes:
    β—ΎπŸ“œ process_csv :: Calculate the sums of the input metrics
    πŸ‹β—Ύ save_plots
    β—Ύβ—Ύ generate_metrics

The scroll (πŸ“œ) emoji appears when Kerblam! notices a long description. You can show the full description for such pipes with kerblam run process_csv --desc.

With pipeline docstrings, you can have a record of what the pipeline does for both yourself and others who review your work.

You cannot write docstrings inside dockerfiles1.

1

You actually can. I can't stop you. But Kerblam! ignores them.

Packaging pipelines for later

The kerblam package command is one of the most useful features of Kerblam! It allows you to package everything needed to execute a pipeline in a docker container and export it for execution later.

You must have a matching dockerfile for every pipeline that you want to package, or Kerblam! won't know what to package your pipeline into.

For example, say that you have a process pipe that uses make to run, and requires both a remotely-downloaded remote.txt file and a local-only precious.txt file.

If you execute:

kerblam package process --tag my_process_package

Kerblam! will:

  • Create a temporary build context;
  • Copy all non-data files to the temporary context;
  • Build the specified dockerfile as normal, but using this temporary context;
  • Create a new Dockerfile that:
    • Inherits from the image built before;
    • Copies the Kerblam! executable to the root of the container;
    • Configures the default execution command to something suitable for execution (just like kerblam run does, but "baked in").
  • Build the docker container and tag it with my_process_package;
  • Export all precious data, the kerblam.toml and the --tag of the container to a process.kerblam.tar tarball.

If you don't specify a --tag, Kerblam! will name the result as <pipe>_exec. The --tag parameter is a docker tag. You can specify a remote repository and push it with docker push ... as you would normally do.
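A full packaging round-trip might look like this (a sketch; the registry path is made up):

kerblam package process --tag ghcr.io/you/process_package   # build the image and export the tarball
docker push ghcr.io/you/process_package                     # optionally push the image to a registry
# process.kerblam.tar now holds the precious data, the kerblam.toml and the image tag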

After Kerblam! packages your project, you can re-run the analysis with kerblam replay by using the process.kerblam.tar file:

kerblam replay process.kerblam.tar ./replay_directory

Kerblam! reads the .kerblam.tar file, recreates the execution environment from it by unpacking the packed data, and executes the exported docker container with the proper mountpoints (as described in the kerblam.toml file).

Inside the container, Kerblam! fetches remote files (i.e. runs kerblam data fetch) and then triggers the pipeline via kerblam run. Since the container's output folder is mounted to the output directory on disk, the final output of the pipeline is saved locally.

These packages are meant to make pipelines reproducible in the long-term. For day-to-day runs, kerblam run is much faster.

The responsibility of having the resulting docker image work in the long term is up to you, not Kerblam! In most cases, having kerblam run work is enough for the package made by kerblam package to work, but depending on your dockerfiles this might not be the case. Kerblam! does not test the resulting package - it's up to you to do that. It's best to try your packaged pipeline once before shipping it off.

However, even a broken kerblam package is still useful! You can always enter with --entrypoint bash and interactively work inside the container later, manually fixing any issues that time or wrong setup might have introduced.

Kerblam! respects your choices of execution options when it packages, changing backend or working directory as you'd expect. See the kerblam.toml specification to learn more.

Managing Data

Kerblam! has a bunch of utilities to help you manage the local data for your project. If you follow open science guidelines, chances are that a lot of your data is FAIR, and you can fetch it remotely.

Kerblam! is perfect to work with such data. The next tutorial sections outline what Kerblam! can do to help you work with data.

Remember that Kerblam! recognizes what data is what by the location where you save the data in. If you need a refresher, read this section of the book.

kerblam data will give you an overview of the status of local data:

> kerblam data
./data       500 KiB [2]
└── in       1.2 MiB [8]
└── out      823 KiB [2]
──────────────────────
Total        2.5 MiB [12]
└── cleanup  2.3 MiB [9] (92.0%)
└── remote   1.0 MiB [5]
! There are 3 undownloaded files.

The first lines show the size and number of files (e.g. 500 KiB across 2 files) in the ./data (intermediate), ./data/in (input) and ./data/out (output) folders.

The total size of all the files in the ./data/ folder is then broken down between categories: the Total data size, how much data can be removed with kerblam data clean or kerblam data pack, and how many files are specified to be downloaded but are not yet present locally.

Fetching remote data

If you define in kerblam.toml the section data.remote you can have Kerblam! automatically fetch remote data for you:

[data.remote]
# This follows the form "url_to_download" = "save_as_file"
"https://raw.githubusercontent.com/MrHedmad/kerblam/main/README.md" = "some_readme.md"

When you run kerblam data fetch, Kerblam! will attempt to download some_readme.md by following the URL you provided and save it in the input data directory (e.g. data/in).

Most importantly, some_readme.md is treated as a file that is remotely available and therefore locally expendable for the sake of saving disk size (see the data clean and data pack commands).

You can specify any number of URLs and file names in [data.remote], one for each file that you wish to be downloaded.

The download directory for all fetched data is your input directory, so if you specify some/nested/dir/file.txt, Kerblam! will save the file in ./data/in/some/nested/dir/file.txt. This also means that if you write an absolute path (e.g. /some_file.txt), Kerblam! will take it literally - creating some_file.txt in the root of the filesystem (and most likely failing to do so). It will, however, warn you before acting that it is about to do something potentially unwanted, giving you the chance to abort.
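For instance (a sketch with a made-up URL):

[data.remote]
# Saved as ./data/in/some/nested/dir/annotations.csv; directories are created as needed
"https://example.com/annotations.csv" = "some/nested/dir/annotations.csv"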

Package and distribute data

Say that you wish to send all your data folder to a colleague for inspection. You can tar -czvf exported_data.tar.gz ./data/ and send your whole data folder, but you might want to only pick the output and non-remotely available inputs, and leave re-downloading the (potentially bulky) remote data to your colleague.

It is widely known that remembering tar commands is impossible.

If you run kerblam data pack you can do just that. Kerblam! will create an exported_data.tar.gz file and save it locally, containing the non-remotely-available ./data/in files and the files in ./data/out. You can also pass the --cleanup flag to delete the packed files after packing.

You can then share the data pack with others.
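In practice (a sketch):

kerblam data pack --cleanup     # create exported_data.tar.gz, then free the local disk space
# whoever receives it can simply unpack it:
tar -xzvf exported_data.tar.gz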

Cleanup data

If you want to cleanup your data (perhaps you have finished your work, and would like to save some disk space), you can run kerblam data clean.

Kerblam! will remove:

  • All temporary files in ./data/;
  • All output files in ./data/out;
  • All input files that can be downloaded remotely in ./data/in;
  • All empty (even nested) folders in ./data/ and ./data/out.

This essentially only leaves on disk the input data that cannot be retrieved remotely.

Kerblam! will consider as "remotely available" the files that are present in the data.remote section of kerblam.toml. See this chapter of the book to learn more about remote data. If you wish to preserve the remote data (perhaps you merely want to "reset" the pipelines and start again quickly), you can use the --keep-remote flag to do so.

If you want to preserve the empty folders left behind after cleaning, pass the --keep-dirs flag to do just that.

Kerblam! will ask for your confirmation before deleting the files. If you're feeling bold, skip it with the --yes flag.
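A few common invocations (a sketch):

kerblam data clean                       # interactive: asks before deleting anything
kerblam data clean --keep-remote         # reset the pipelines, but keep downloaded inputs
kerblam data clean --keep-dirs --yes     # keep the empty folders and skip the confirmation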

Other utilities

Kerblam! has a few other utilities to deal with the most tedious steps of working with projects.

kerblam ignore - Add items to your .gitignore quickly

Oops! You forgot to add your preferred language's template to your .gitignore. You now need to google for the template .gitignore, open the file, and copy-paste it in.

With Kerblam! you can do that in just one command. For example:

kerblam ignore Rust

will fetch Rust.gitignore from the Github gitignore repository and append it to your .gitignore for you. Be careful that this command is case sensitive (e.g. Rust works, rust does not).

You can also add specific files or folders this way:

kerblam ignore ./src/something_useless.txt

Kerblam! will add the proper pattern to the .gitignore file to filter out that specific file.

The optional --compress flag makes Kerblam! check the .gitignore file for duplicated entries, and only retain one copy of each pattern. This also cleans up comments and whitespace in a sensible way.

The --compress flag also lets you fix having ignored something twice. E.g. kerblam ignore Rust && kerblam ignore Rust --compress is the same as running kerblam ignore Rust just once.

Getting help

You can get help with Kerblam! via a number of channels - for example, by opening an issue on the Github repository.

Thank you so much for giving Kerblam! a go.

Usage examples

There are a bunch of examples in the MrHedmad/kerblam-examples repository, ready for your perusal.

The latest development version of Kerblam! is tested against these examples, so you can be sure they are as fresh as they can be.

The Kerblam.toml file

The kerblam.toml file is the control center of Kerblam! All of its configuration is found there. Here are the available fields, and what they do.

Extra fields not found here are silently ignored. This means that you must be careful of typos!

The fields are annotated where possible with the default value.

[meta] # Metadata regarding kerblam!
version = "0.4.0"
# Kerblam! will check this version and give you a warning
# if you are not running the same executable.
# To save you headaches!

# The [data] section has options regarding... well, data.
[data.paths]
input = "./data/in"
output = "./data/out"
intermediate = "./data"

[data.profiles] # Specify profiles here
profile_name = {
    "original_name" = "profile_name",
    "other_name" = "other_profile_name"
}

# Or, alternatively
[data.profiles.profile_name]
"original_name" = "profile_name"
"other_name" = "other_profile_name"
# Any number of profiles can be specified, but stick to just one of these
# two methods of defining them.

[data.remote] # Specify how to fetch remote data
"url_to_fetch" = "file_to_save_to"
# there can be any number of "url" = "file" entries here.
# Files are saved inside `[data.paths.input]`

##### --- #####
[code] # Where to look for containers and pipes
env_dir = "./src/dockerfiles"
pipes_dir = "./src/pipes"

[execution] # How to execute the pipelines
backend = "docker" # or "podman", the backend to use to build and run containers
workdir = "/" # The working directory inside all built containers

Note that this is not meant to be valid TOML, just a reference. Don't expect to copy-paste it and obtain a valid Kerblam! configuration.
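For comparison, a small configuration that is actually valid TOML might look like this (a sketch; the values are just examples):

[meta]
version = "0.5.1"

[data.profiles.test]
"input.csv" = "test_input.csv"

[data.remote]
"https://raw.githubusercontent.com/MrHedmad/kerblam/main/README.md" = "some_readme.md"

[execution]
backend = "podman"
workdir = "/app"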

Contributing to Kerblam!

Thank you for wanting to contribute!

The developer guide changes more often than this book, so you can read it directly on Github.

The Kerblam! philosophy

Hello! This is the maintainer. This article covers the design principles behind how Kerblam! functions. It is both targeted at myself - to remind me why I did what I did - and to anyone who is interested in the topic of managing data analysis projects.

Reading this is not at all necessary to start using Kerblam!. Perhaps you want to read the tutorial instead.

I am an advocate of open science, open software and of sharing your work as soon and as openly as possible. I also believe that documenting your code is even more important than the code itself. Keep this in mind when reading this article, as it is strongly opinionated.

The first time I use an acronym I'll try to make it bold italics so you can have an easier time finding it if you forget what it means. However, I try to keep acronyms to a minimum.

Introduction

After three years doing bioinformatics work as my actual job, I think I have come across many of the different types of projects that one encounters as a bioinformatician:

  1. You need to analyse some data either directly from someone or from some online repository. This requires the usage of both pre-established tools and new code and/or some configuration.
    • For example, someone in your research group performed RNA-Seq, and you are tasked with the data analysis.
  2. You wish to create a new tool/pipeline/method of analysis and apply it to some data to both test its performance and/or functionality, before releasing the software package to the public.

The first point is data analysis. The second point is software development. Both require writing software, but they are not exactly the same.

You'd generally work on point 2 like a generalist programmer would. In terms of how you work, there are many different workflow mental schemas that you can choose from, each with its own following, pros, and cons. Simply search for "coding workflow" to find a plethora of different styles, methods and ways you can use to manage what to do and when while you code.

In any case, while working with a specific programming language, you usually have only one possible way to lay out your files. A Python project uses a quite specific structure: you have a pyproject.toml/setup.py, a module directory1... Similarly, when you work on a Rust project, you use cargo, and therefore have a Cargo.toml file, a /src directory...

The topic of structuring the code itself is even deeper, with different ways to think of your coding problem: object oriented vs functional vs procedural, monolithic vs microservices, etcetera, but it's out of the scope of this piece.

At its core, software is a collection of text files written in a way that the computer can understand. The process of laying out these files logically in the filesystem is what I mean when I say project layout (PL). A project layout system (PLS) is a pre-established way to lay out these files. Kerblam! is a tool that can help you with general tasks if you follow the Kerblam! project layout system.

There are also project management systems, that are tasked with managing what has to be done while writing code. They are not the subject of this piece, however.

Since we are talking about code, there are a few characteristics in common between all code-centric projects:

  • The changes between different versions of the text files are important. We need to be able to go back to a previous version if we need to. This can be needed for a number of reasons: we may realize that we changed something that we shouldn't have, we may just want to see a previous version of the code, or we may need to run a previous version of the program for reproducibility purposes.
  • Code must be documented to be useful. While it is often sufficient to read a piece of code to understand what it does, the why is often unclear. This is even more important when creating new tools: a tool without clear documentation is unusable, and an unusable tool might as well not exist.
  • Often, code has to be edited by multiple people simultaneously. It's important to have a way to coordinate between people as you add your edits in.
  • Code layout is often driven by convention or by the requirements of build systems, interpreters or external tools that need to read your code. Each language is unique in this respect.

From these initial observations we can start to think about a generic PLS. Version control takes care of - well - version control and is essential for collaboration. Version control generally does not affect the PL meaningfully. However, version control often does not work well with large files, especially binary files.

Design principle A: We must use a version control system.

Design principle B: Big binary blobs bad2!

2

I'm very proud of this pun. Please don't take it from me.

I assume that the reader knows how vital version control is when writing software. In case you do not, I want to briefly outline why you'd want to use a version control system in your work:

  • It takes care of tracking what you did on your project;
  • You can quickly turn back time if you mess up and change something that should not have been changed.
  • It allows you to collaborate both in your own team (if any) and with the public (in the case of open-source codebases). Collaboration is nigh impossible without a version control system.
  • It allows you to categorize and compartmentalize your work, so you can keep track of every different project neatly.
  • It makes the analysis (or tool) accessible - and if you are careful also reproducible - to others, which is an essential part of the scientific process.

These are just some of the advantages you get when using a version control system. One of the most popular version control systems is git. With git, you can progressively add changes to code over time, with git taking care of recording what you did and managing different versions made by others.

If you are not familiar with version control systems and specifically with git, I suggest you stop reading and look up the git user manual.

Design principle A makes it so that the basic unit of our PLS is the repository. Our project therefore is a repository of code.

As we said, documentation is important. It should be versioned together with the code, as that is what it is describing and it should change at the same pace.

Design principle C: Documentation is good. We should do more of that.

Code is read more times than it is written, therefore, it's important for a PLS to be logical and obvious. To be logical, one should categorize files based on their content, and logically arrange them in a way that makes sense when you or a stranger looks through them. To be obvious, the categorization and the choice of folder and file names should make sense at a glance (e.g. the 'scripts' directory is for scripts, not for data).

Design principle D: Be logical, obvious and predictable

Scientific computing needs to be reproduced by others. The best kind of reproducibility is computational reproducibility, by which the same output is generated given the same input. There are a lot of things that you can do while writing code to achieve computational reproducibility, but one of the main contributors to reproducibility is still containerization.

Additionally, being easily reproducible is - in my mind - as important as being reproducible to begin with. The easier it is to reproduce your work, the more "morally upright" you will be in the eyes of the reader. This has a lot of benefits, of course, with the main one being that you are more resilient to backlash in the inevitable case that you commit an error.

Design principle E: Be (easily) reproducible.

Structuring data analysis

While structuring single programs is relatively straightforward, doing the same for a data analysis project is less set in stone. However, given the design principles that we have established in the previous section, we can try to find a way to fulfill all of them for the broadest scope of application possible.

To design such a system, it's important to find the points in common between all types of data analysis projects. In essence, a data analysis project encompasses:

  • Input data that must be analysed in order to answer some question.
  • Output data that is created as a result of analysing the input data.
  • Code that analyses that data.
  • It is often the case that data analysis requires many different external tools, each with its own set of requirements. These add to the requirements of your own code and scripts.

"Data analysis" code is not "tool" code: it usually uses more than one programming language, it is not monolithic (i.e builds up to just one "thing") and can differ wildly in structure (from just one script, to external tool, to complex pieces of code that run many steps of the analysis).

This complexity results in a plethora of different ways to structure the code and the data during the project.

I will not say that the Kerblam! way is the one-and-only, cover-all way to structure your project, but I will say that it is a sensible default.

Kerblam!

The Kerblam! way to structure a project is based on the design principles that we have seen, the characteristics of all data analysis projects, and some additional fundamental observations, which I list below:

  1. All projects deal with input and output data.
  2. Some projects have intermediate data that can be stored to speed up the execution, but can be regenerated if lost (or the pipeline changes).
  3. Some projects generate temporary data that is needed during the pipeline but then becomes obsolete when the execution ends.
  4. Projects may deal with very large data files.
  5. Projects may use different programming languages.
  6. Projects, especially exploratory data analyses, require a record of all the trials that were made during the exploratory phase. Often, the last execution is the final one, and its resulting output is the one that gets presented.

Having these in mind, we can start to outline how Kerblam! deals with each of them.

Data

Points 1, 2, 3 and 4 deal with data. A Kerblam! project has a dedicated data directory, as you'd expect. However, Kerblam! actually differentiates between the different data types. Other than input, output, temporary and intermediate data, Kerblam! also considers:

  • Remote data is data that can be downloaded at runtime from a (static) remote source.
  • Input data that is not remote is called precious, since it cannot be substituted if it is lost.
  • All data that is not precious is fragile, since it can be deleted with little repercussion (i.e. you can just re-download it or re-run the pipeline to obtain it again).

Practically, data can be input/output/temp/intermediate, either fragile or precious and either local or remote.

To make the distinction between these different data types, we could either keep a separate configuration that points at each file (a git-like system), or specify directories where each type of file is stored.

Kerblam! takes both of these approaches. The distinction between input/output/temp/intermediate data is given by directories. It's up to the user to save each file in the appropriate directory. The distinction between remote and local files is, however, given by a config file, kerblam.toml, so that Kerblam! can fetch the remote files for you on demand3. Whether data is fragile or precious can be inferred from the other two properties.

3

Two birds with one stone, or so they say.

The only data that needs to be manually shared with others is precious data. Everything else can be downloaded or regenerated by the code. This means that the only data that needs to be committed to version control is the precious one. If you strive to keep precious data to a minimum - as should already be the case - the analysis repository can be kept tiny, size-wise. This makes Kerblam! compliant with principle B4 and makes it easier (or in some cases possible) to be compliant with principle A5.

Execution

Points 5 and 6 are generally covered by pipeline managers. A pipeline manager, like snakemake or nextflow, executes code in a controlled way in order to obtain output files. While both of these were made with data analysis in mind, they are very powerful but also quite "complex"6 and unwieldy for most projects.

Kerblam! natively supports simple shell scripts (which, in theory, can be used to run anything, even pipeline managers like nextflow or snakemake) and makefiles. make is a quite old GNU utility that is mainly used to build packages and compile C/C++ projects. However, it supports and manages the creation of any file with any creation recipe. It is easy to learn and quick to write, and sits at the perfect spot for most analyses, between a simple shell script and a full-fledged pipeline manager.

Kerblam! considers these executable scripts and makefiles as "pipes", where each pipe can be executed to obtain some output. Each pipe should call external tools and internal code. If code is structured following the unix philosophy, each different piece of code ("program") can be reused in the different pipelines and interlocked with one another inside pipelines.

With these considerations, point 6 can be addressed by making different pipes with sensible names and saving them in version control. Point 5 is easy if each program is independent of the others and developed in its own folder. Kerblam! appoints the ./src directory to contain the program code (e.g. scripts, directories with programs, etc...) and the ./src/pipes directory to contain shell scripts and makefile pipelines.

These steps fulfill the design principle D7: Makefiles and shell scripts are easy to read, and having separate folders for pipelines and actual code that runs makes it easy to know what is what. Having the rest of the code be sensibly managed is up to the programmer.

Principle E8 can be messed up very easily, and the reproducibility crisis is a symptom of this. A very common way to make any analysis reproducible is to package the execution environment into containers, executable bundles that can be configured to do basically anything in an isolated, controlled environment.

Kerblam! projects leverage docker containers to make the analysis as easily reproducible as possible. Using docker for the most basic tasks is relatively straightforward:

  • Start with an image;
  • Add dependencies;
  • Copy the current environment;
  • Setup the proper entrypoint;
  • Execute the container with a directory mounted to the local file system in order to extract the output files as needed.

Kerblam! automatically detects dockerfiles in the ./src/dockerfiles directory and builds and executes the containers following this simple schema. To give as much freedom to the user as possible, Kerblam! does not edit or check these dockerfiles; it just executes them in the proper environment and with the correct mount points.

The output of a locally-run pipeline cannot be trusted as it is not reproducible. Having Kerblam! natively run all pipelines in containers allows development runs to be exactly the same as the output runs when development ends.

Knowing which dockerfile is needed for which pipeline can be challenging. To stay compliant with principle D7, Kerblam! requires that pipes and their respective dockerfiles have the same name.

Documentation

Documentation is essential, as we said in principle C9. However, documentation is for humans, and is generally well established how to layout the documentation files in a repository:

  • Add a README file.
  • Add a LICENSE, so it's clear how others may use your code.
  • Create a /docs folder with other documentation, such as CONTRIBUTING guides, tutorials and generally human-readable text needed to understand your project.

There is little that an automated tool can do to help with documentation. There are plenty of guides online that deal with the task of documenting a project, so I will not cover it further.

1

Python packaging is a bit weird since there are so many packaging engines that create Python packages. Most online guides use setuptools, but modern Python (as of Dec 2023) works with the build tool and a pyproject.toml file, which supports different build engines. See this PEP for more info.

6

I cannot find a good adjective other than "complex". These tools are not hard to use, or particularly difficult to learn, but they do have an initial learning curve. The thing that I want to highlight is that they are so formal, and require such careful specification of inputs, outputs, channels and pipelines, that they become a bit unwieldy to use as a default. For large projects with many moving parts and a lot of computing (e.g. the need to run on a cluster), using programs such as these can be very important and useful. However, bringing a tank to a fist fight could be a bit too much.

4

Big binary blobs bad.

5

We must use a version control system.

7

Be logical, obvious and predictable.

8

Be (easily) reproducible.

9

Documentation is good. We should do more of that.