Managing Datasets

invisibleroads · August 29, 2023, 9:25pm

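Sometimes you want to run your notebooks and scripts on different datasets without changing the notebook or script code. One way to do this is with symbolic links: your code always reads from the same path, and the link decides which file or folder that path points to. In the command below, source_path is the existing file or folder and target_path is the link to create.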

ln -s source_path target_path

Accessing the datasets folder from different folders

Create symbolic links to the datasets folder:

mkdir datasets
mkdir -p reports/abc
mkdir -p reports/bcd
# A relative link target is resolved from the folder that contains the link
ln -s ../../datasets reports/abc/datasets
ln -s ../../datasets reports/bcd/datasets
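
The same layout can be built from Python. This is a minimal sketch, assuming a hypothetical helper named link_datasets_folder that is not part of any library. It computes each link target relative to the report folder, because a relative link target is resolved from the folder that contains the link.

```python
import os
from pathlib import Path


def link_datasets_folder(base_folder, report_names):
    """Link base_folder/datasets into each base_folder/reports/<name>."""
    base_folder = Path(base_folder)
    datasets_folder = base_folder / 'datasets'
    datasets_folder.mkdir(exist_ok=True)
    link_paths = []
    for report_name in report_names:
        report_folder = base_folder / 'reports' / report_name
        report_folder.mkdir(parents=True, exist_ok=True)
        link_path = report_folder / 'datasets'
        # For reports/abc this works out to ../../datasets
        relative_target = os.path.relpath(datasets_folder, report_folder)
        if not link_path.is_symlink():
            link_path.symlink_to(relative_target, target_is_directory=True)
        link_paths.append(link_path)
    return link_paths
```

Calling link_datasets_folder('.', ['abc', 'bcd']) from the project folder should give each report folder its own datasets link.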

Creating symbolic links to specific datasets

cd datasets
ln -s xyz-20230801.csv xyz.csv
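
When new dated files arrive regularly, repointing the link can be scripted. The sketch below assumes files are named like xyz-YYYYMMDD.csv and uses a hypothetical helper named point_link_at_latest. It builds the new link under a temporary name and renames it over the old one, so readers should not see a moment where xyz.csv is missing.

```python
import os
from pathlib import Path


def point_link_at_latest(datasets_folder, prefix='xyz', suffix='.csv'):
    """Repoint <prefix><suffix> at the newest <prefix>-YYYYMMDD<suffix>."""
    datasets_folder = Path(datasets_folder)
    # YYYYMMDD stamps sort chronologically when sorted as text
    dated_paths = sorted(datasets_folder.glob(f'{prefix}-????????{suffix}'))
    if not dated_paths:
        raise FileNotFoundError(f'no dated {prefix} files in {datasets_folder}')
    latest_path = dated_paths[-1]
    link_path = datasets_folder / f'{prefix}{suffix}'
    # Create the link under a temporary name, then rename it into place
    temporary_path = datasets_folder / f'.{prefix}{suffix}.tmp'
    if temporary_path.is_symlink():
        temporary_path.unlink()
    # The target is a bare file name because it sits next to the link
    temporary_path.symlink_to(latest_path.name)
    os.replace(temporary_path, link_path)
    return latest_path.name
```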

Configuring dataset references

You can configure dataset references in automate.yml, which will update symbolic links when you run the automation.

datasets:
  - path: datasets/xyz.csv
    reference:
      path: datasets/xyz-20230801.csv

Your notebooks and scripts should be written to access the normalized path:

from pathlib import Path

datasets_folder = Path('datasets')
source_path = datasets_folder / 'xyz.csv'
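
Before loading, it can help to confirm which file the normalized path currently points to. This is a minimal sketch with a hypothetical helper named describe_dataset; it only uses pathlib and assumes nothing about automate.yml.

```python
from pathlib import Path


def describe_dataset(source_path):
    """Return the link path and the name of the file it resolves to."""
    source_path = Path(source_path)
    if not source_path.exists():
        # exists() follows symbolic links, so a dangling link lands here
        raise FileNotFoundError(f'{source_path} is missing or is a broken link')
    return f'{source_path} -> {source_path.resolve().name}'
```

Logging this string at the top of a notebook records which dated file a given run actually used.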

If you need to change the dataset, update the corresponding reference path in automate.yml:

datasets:
  - path: datasets/xyz.csv
    reference:
      path: datasets/xyz-20231231.csv

Getting the dataset timestamp

from datetime import datetime
from pathlib import Path

p = Path('raw-data-20221231-20230808.csv')
terms = p.stem.split('-')

raw_timestamp = terms[-2]
raw_datetime = datetime.strptime(raw_timestamp, '%Y%m%d')
raw_datetime.strftime('%A, %B %d, %Y')
# Saturday, December 31, 2022

processed_timestamp = terms[-1]
processed_datetime = datetime.strptime(processed_timestamp, '%Y%m%d')
processed_datetime.strftime('%A, %B %d, %Y')
# Tuesday, August 08, 2023
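
If your code reads through a link such as datasets/xyz.csv, the link name itself carries no timestamp. The sketch below assumes the link target ends in -YYYYMMDD and uses a hypothetical helper named get_dataset_datetime, which follows the link first and then parses the last term of the file name.

```python
from datetime import datetime
from pathlib import Path


def get_dataset_datetime(source_path):
    """Follow any symbolic link, then parse the trailing YYYYMMDD stamp."""
    # resolve() replaces the link with the path of the file it points to
    target_path = Path(source_path).resolve()
    timestamp = target_path.stem.split('-')[-1]
    # strptime raises ValueError if the last term is not a valid date
    return datetime.strptime(timestamp, '%Y%m%d')
```

For a name with two timestamps, such as raw-data-20221231-20230808.csv, this returns the last one, which is the processed date in the example above.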