Sometimes you want to run your notebooks and scripts on different datasets without changing their code. One way to do this, and to increase the longevity of your code, is to use symbolic links.
ln -s source_path target_path
Accessing the datasets folder from different folders
Create symbolic links to the datasets folder:
mkdir datasets
mkdir -p reports/abc
mkdir -p reports/bcd
# A relative link target is resolved from the folder that contains the link
ln -s ../../datasets reports/abc/datasets
ln -s ../../datasets reports/bcd/datasets
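If you prefer to create the links from Python, the following is a minimal sketch, assuming a POSIX system. The helper name link_folder is hypothetical. It computes the link target relative to the folder that will contain the link, so the link keeps working if you move the whole project folder.

```python
import os
import tempfile
from pathlib import Path


def link_folder(target_folder: Path, link_path: Path) -> Path:
    """Create link_path as a relative symbolic link to target_folder."""
    link_path.parent.mkdir(parents=True, exist_ok=True)
    # Express the target relative to the folder that will contain the link
    relative_target = os.path.relpath(target_folder, start=link_path.parent)
    link_path.symlink_to(relative_target, target_is_directory=True)
    return link_path


# Demonstrate in a temporary folder so that nothing in your project changes
with tempfile.TemporaryDirectory() as temporary_folder:
    base_folder = Path(temporary_folder)
    datasets_folder = base_folder / 'datasets'
    datasets_folder.mkdir()
    (datasets_folder / 'xyz.csv').write_text('a,b\n1,2\n')
    for report_name in ['abc', 'bcd']:
        link_folder(
            datasets_folder,
            base_folder / 'reports' / report_name / 'datasets')
    abc_datasets_folder = base_folder / 'reports' / 'abc' / 'datasets'
    # The stored target is relative, not absolute
    link_target = os.readlink(abc_datasets_folder)
    # Reading through the link reaches the file in the real datasets folder
    linked_text = (abc_datasets_folder / 'xyz.csv').read_text()
```

Each report folder then sees the same datasets folder, and link_target holds a relative path rather than an absolute one.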
Creating symbolic links to specific datasets
cd datasets
ln -s xyz-20230801.csv xyz.csv
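When a newer dataset arrives, you can repoint the link instead of editing your code. Here is a rough sketch in Python, assuming a POSIX system. The helper name point_link and the '.tmp' suffix are hypothetical. It creates the new link under a temporary name and renames it over the old one, so xyz.csv never disappears between the two steps.

```python
import os
import tempfile
from pathlib import Path


def point_link(link_path: Path, target_name: str) -> None:
    """Make link_path a symbolic link to target_name, replacing any old link."""
    temporary_path = link_path.with_name(link_path.name + '.tmp')
    # Remove a leftover temporary link from an interrupted run
    if temporary_path.is_symlink():
        temporary_path.unlink()
    # target_name is resolved relative to the folder that contains the link
    temporary_path.symlink_to(target_name)
    # Renaming over the old link swaps it in one step on POSIX systems
    os.replace(temporary_path, link_path)


# Demonstrate in a temporary folder so that nothing in your project changes
with tempfile.TemporaryDirectory() as temporary_folder:
    datasets_folder = Path(temporary_folder)
    (datasets_folder / 'xyz-20230801.csv').write_text('old\n')
    (datasets_folder / 'xyz-20231231.csv').write_text('new\n')
    link_path = datasets_folder / 'xyz.csv'
    point_link(link_path, 'xyz-20230801.csv')
    first_text = link_path.read_text()
    point_link(link_path, 'xyz-20231231.csv')
    second_text = link_path.read_text()
```

Code that reads xyz.csv picks up the newer file on its next run without any change.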
Configuring dataset references
You can configure dataset references in automate.yml, which will update symbolic links when you run the automation.
datasets:
  - path: datasets/xyz.csv
    reference:
      path: datasets/xyz-20230801.csv
Your notebooks and scripts should be written to access the normalized path:
from pathlib import Path
datasets_folder = Path('datasets')
source_path = datasets_folder / 'xyz.csv'
If you need to change the dataset, update the corresponding reference path in automate.yml:
datasets:
  - path: datasets/xyz.csv
    reference:
      path: datasets/xyz-20231231.csv
Getting the dataset timestamp
from datetime import datetime
from pathlib import Path
# The last two terms of the file name are dates in YYYYMMDD format
p = Path('raw-data-20221231-20230808.csv')
terms = p.stem.split('-')
raw_timestamp = terms[-2]
raw_datetime = datetime.strptime(raw_timestamp, '%Y%m%d')
raw_datetime.strftime('%A, %B %d, %Y')
# Saturday, December 31, 2022
processed_timestamp = terms[-1]
processed_datetime = datetime.strptime(processed_timestamp, '%Y%m%d')
processed_datetime.strftime('%A, %B %d, %Y')
# Tuesday, August 08, 2023
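If your code reads the dataset through a symbolic link, you can still recover the timestamp by resolving the link first. The sketch below assumes a POSIX system and a file name that ends with a date in YYYYMMDD format. The function name get_dataset_datetime is hypothetical.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def get_dataset_datetime(path: Path) -> datetime:
    """Return the date encoded at the end of the file name behind path."""
    # resolve() follows symbolic links, so xyz.csv becomes xyz-20230801.csv
    resolved_path = Path(path).resolve()
    timestamp = resolved_path.stem.split('-')[-1]
    return datetime.strptime(timestamp, '%Y%m%d')


# Demonstrate in a temporary folder so that nothing in your project changes
with tempfile.TemporaryDirectory() as temporary_folder:
    datasets_folder = Path(temporary_folder)
    (datasets_folder / 'xyz-20230801.csv').write_text('a,b\n1,2\n')
    (datasets_folder / 'xyz.csv').symlink_to('xyz-20230801.csv')
    dataset_datetime = get_dataset_datetime(datasets_folder / 'xyz.csv')

print(dataset_datetime.strftime('%Y-%m-%d'))
# 2023-08-01
```

This lets a notebook or script record which version of the dataset it actually used, even though it only refers to the normalized path.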