Report Automation

Here we gather notes and questions that might help beginners as they automate their own reports.

Skills

Answer Your Own Questions

You can answer most of your own technical questions by searching the internet. Even if you already know an answer, searching can surface better and newer ways of doing the same thing. Here are some tips:

  • Describe what you are trying to do.
    python open excel
  • Search for the error message enclosed in quotes.
    "IOError: No such file or directory"
  • Include the name of the package in your search query.
    "IOError: No such file or directory" +pandas
  • On StackOverflow, first try the most upvoted answer and then try the most recent answer.

For more tips, see Use Google More Efficiently.

Structure Your Workflow

Design

  1. Identify who will read your report. Think about what motivates them, what they know and what they don’t know.
  2. List questions from your readers that your report will answer and hypotheses that your report will test.
  3. Sketch tables, charts, maps using dummy data to answer the questions.
  4. Test your dummy report on potential readers and iterate until they love it.

Prioritize

  1. Rank the questions you want to answer and the hypotheses you want to test.
  2. Enumerate and sort the numbers, tables, charts, images, maps in your report into categories that indicate the order in which you will tackle them. For example, you can categorize variables into Critical, High, Medium, Low, Future:
Hypothesis 1: Enforcement of stop signs is correlated with fewer pedestrian deaths from cars.
Hypothesis 2: Enforcement of stop signs is correlated with fewer biker deaths from cars.

# Critical
- [ ] Chart: Month by month number of pedestrian deaths when stop signs are enforced vs not enforced
- [ ] Chart: Month by month number of biker deaths when stop signs are enforced vs not enforced

# High
- [ ] Map: Distribution of stop signs and car accidents in region 

# Medium
- [ ] Number: Probability of a pedestrian death at an intersection when there is a stop sign
- [ ] Number: Probability of a biker death at an intersection when there is a stop sign
- [ ] Number: Probability of a pedestrian death at an intersection when there is a stop sign and it has been enforced recently
- [ ] Number: Probability of a biker death at an intersection when there is a stop sign and it has been enforced recently

# Low
- [ ] Number: Number of stop signs in region
- [ ] Number: Percentage of stop signs that have had related accidents
- [ ] Image: Front cover
- [ ] Image: Back cover

# Future
- [ ] Map: Proposed locations of new stop signs
- [ ] Map: Proposed locations where stop signs should be enforced on a random basis

Prototype

  1. Find existing tables, charts, maps that look like what you want to make.
  2. Prepare partial datasets to reduce iteration time.
  3. Prototype the tables, charts, maps using the partial datasets.

Automate

  1. Prepare configuration files that tell CrossCompute how to run your report.
  2. Test your report over full datasets.
  3. Add text and styling.

Prototype Your Tables, Charts, Maps

When looking at your list of variables, do not be afraid!

  • Prototype the most important variables first. You can always add more later.
  • Work on each variable one at a time.
  • Work in phases/iterations so that at any point, you can send your draft report to anyone and it will make sense as a story.
  • Push your work to a repository at the end of the day so that you do not lose your work.

Step 1: Decide Which Table, Chart, Map You Want to Draw

  1. Choose the most important variable from your list of variables.
  2. Imagine how best to illustrate the variable for the story you want to tell in your report.
  3. Sketch the variable as a table, chart, map or image on a piece of paper. Despite all the advances in technology, there is still nothing quite like drawing with pen, pencil or crayon on paper.

Step 2: Find an Existing Table, Chart, Map that Looks Like What You Want to Draw

Browse the internet to see if you can find a representation that looks like what you have drawn. We will call this our Reference Table/Chart/Map.

Tables

Charts

Maps

Step 3: Prepare Partial Datasets to Reduce Iteration Time

  • It makes no sense to wait an hour each time you modify a table, chart or map. Prepare a partial dataset that contains a small fraction of the full dataset (a sketch follows this list).
  • If you are pulling the data from a live API, download some elements to make a static dummy dataset.
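
For example, here is a minimal sketch of preparing a partial dataset, assuming a hypothetical full dataset saved at datasets/car-accidents.csv:

import pandas as pd

# Hypothetical paths; adjust these to your own dataset
full_path = 'datasets/car-accidents.csv'
partial_path = 'datasets/car-accidents-partial.csv'

# Load the full dataset once
full_table = pd.read_csv(full_path)

# Keep a small random sample (or use full_table.head(1000) for the first rows)
partial_table = full_table.sample(n=1000, random_state=0)

# Save the partial dataset and prototype against it instead of the full file
partial_table.to_csv(partial_path, index=False)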

Step 4: Prototype Your Tables, Charts, Maps Using Partial Datasets

  1. Review your hypotheses.
  2. Examine your real data.
  3. Extract 1-2 elements from your real data to serve as your Start Point.
  4. Look at the code from your Reference Table/Chart/Map to identify the required format for your data.
  5. Make dummy data based on your Reference Table/Chart/Map to serve as your End Point.
  6. Write pseudocode to take you from your Start Point to your End Point (a minimal sketch follows this list).
  7. Convert your pseudocode into real code one line at a time.
  8. When writing your code, take the extra effort to make it readable: use descriptive variable names and keep each function or code cell short enough to fit on a single screen.
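
For example, here is a minimal sketch of moving from a Start Point to an End Point, with the column names and values invented for illustration only:

import pandas as pd

# Start Point: a couple of rows copied from the real data (hypothetical columns)
start_rows = pd.DataFrame([
    ['2020-01', 'enforced', 2],
    ['2020-01', 'not enforced', 5],
    ['2020-02', 'enforced', 1],
    ['2020-02', 'not enforced', 4],
], columns=['month', 'enforcement', 'pedestrian_deaths'])

# End Point: dummy data shaped the way the reference chart expects,
# here one column per enforcement status, indexed by month
end_table = pd.DataFrame({
    'enforced': [2, 1],
    'not enforced': [5, 4],
}, index=['2020-01', '2020-02'])

# Pseudocode: pivot the start rows so each enforcement status becomes a column,
# then draw a grouped bar chart like the reference chart
pivoted_table = start_rows.pivot(
    index='month', columns='enforcement', values='pedestrian_deaths')
pivoted_table.plot.bar()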

Here are some important links and concepts:

Find

The first step is to locate your dataset.

  1. In JupyterLab, use the standard bash commands to situate yourself and find your file.

    # See files in your current folder
    ls
    
    # Change to a different folder
    cd SOME-OTHER-FOLDER
    
    # Change to the parent folder
    cd ..
    
    # Toggle to the previous folder
    cd -
    
  2. Define variables to remember your datasets folder and your dataset path.

    from os.path import join    
    datasets_folder = 'datasets'
    spreadsheet_path = join(datasets_folder, 'your-data.xlsx')
    

Load

  1. Use the right package and function to load your dataset (see the CSV sketch after this list if your data is not in Excel).

    import pandas as pd
    spreadsheet = pd.ExcelFile(spreadsheet_path)
    spreadsheet.sheet_names  # See all sheet names
    
  2. In some cases, you might need to skip rows if the column names appear as Unnamed: 1.

    stop_signs_table = spreadsheet.parse('Stop Signs', skiprows=1)
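
If your dataset is a CSV file instead of an Excel spreadsheet, pandas has a dedicated function. Here is a minimal sketch, assuming a hypothetical file path and a column named timestamp:

import pandas as pd

# Hypothetical CSV path; adjust to your own dataset
csv_path = 'datasets/your-data.csv'

# read_csv loads a comma-separated file directly into a table;
# parse_dates converts the named columns into pandas Timestamps
csv_table = pd.read_csv(csv_path, parse_dates=['timestamp'])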
    

Preview

  1. Show the first three rows using slice notation.

    stop_signs_table[:3]
    
  2. Examine the first row.

    stop_signs_table.iloc[0]
    
  3. See descriptive statistics.

    stop_signs_table.describe()
    

Filter

  1. Filter a table using a single condition.

    # Make a boolean mask
    mask = car_accident_table['vehicle_id'] == vehicle_id
    
    # Apply mask to extract rows where the boolean mask is True
    car_accident_table[mask]
    
  2. Filter a table that has timestamps (if the column holds plain text, see the conversion sketch after this list).

    # Check that the column is of type pandas.Timestamp
    car_accident_table['timestamp'].iloc[0]
    
    # Find rows that occurred after 2020
    mask = car_accident_table['timestamp'] > '2020'
    car_accident_table[mask]
    
    # Find rows that occurred after 2020-06
    mask = car_accident_table['timestamp'] > '2020-06'
    car_accident_table[mask]
    
  3. Filter a table using multiple conditions.

    condition1 = car_accident_table['vehicle_id'] == vehicle_id
    condition2 = car_accident_table['timestamp'] > '2019'
    condition3 = car_accident_table['timestamp'] < '2021'
    car_accident_table[condition1 & condition2 & condition3]
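
If the timestamp column was loaded as plain text rather than as pandas Timestamps, the comparisons above will not behave as date comparisons. Here is a minimal sketch of converting the column first, assuming it is named timestamp:

import pandas as pd

# Convert the column to pandas Timestamps; errors='coerce' turns
# unparseable values into NaT instead of raising an exception
car_accident_table['timestamp'] = pd.to_datetime(
    car_accident_table['timestamp'], errors='coerce')

# Now string comparisons such as > '2020-06' behave as date comparisons
mask = car_accident_table['timestamp'] > '2020-06'
car_accident_table[mask]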
    

Count

There are many ways to aggregate values in pandas. First, set up some dummy data to experiment with.

import pandas as pd

vehicle_ids_by_vehicle_type = {
    'truck': [2],
    'suv': [0, 1, 3],
    'van': [],
    'sedan': [4],
}

car_accident_table = pd.DataFrame([
    [0, '2020-01-02', 'p'],
    [0, '2020-02-01', 'p'],
    [0, '2020-03-01', 'b'],
    [1, '2020-02-27', 'c'],
    [2, '2020-03-01', 'c'],
    [2, '2020-04-01', 'p'],
    [3, '2020-04-01', 'p'],
    [3, '2020-01-17', 'b'],
    [4, '2020-05-01', 'c'],
    [None, '2020-06-01', 'c'],
], columns=[
    'vehicle_id',
    'timestamp',
    'accident_code',
])

  1. Attempt a one-liner using groupby.

    car_accident_table.groupby('vehicle_id').count()
    
  2. Apply a function over each row and store the values in a column.

    def get_category(row):
        accident_code = row['accident_code']
        try:
            return {
                'p': 'pedestrian',
                'b': 'bike',
                'c': 'car',
            }[accident_code]
        except KeyError:
            pass
    
    car_accident_table['category'] = car_accident_table.apply(get_category, axis=1)
    
  3. Iterate manually. Though least efficient, looping manually is also the most flexible.

    average_accident_count_by_vehicle_type = {}
    for vehicle_type, vehicle_ids in vehicle_ids_by_vehicle_type.items():
        total_accident_count = 0
        for vehicle_id in vehicle_ids:
            mask = car_accident_table['vehicle_id'] == vehicle_id
            specific_car_accident_table = car_accident_table[mask]
            total_accident_count += len(specific_car_accident_table)
        vehicle_count = len(vehicle_ids)
        if not vehicle_count:
            continue
        average_accident_count = total_accident_count / vehicle_count
        average_accident_count_by_vehicle_type[vehicle_type] = average_accident_count
    average_accident_count_by_vehicle_type
    

Note that you will often encounter missing values encoded in different ways. Sometimes you will see the value nan, which means “not a number”. You can test whether a value is nan using np.isnan.

import numpy as np

for row_index, row in car_accident_table.iterrows():
    value = row['vehicle_id']
    if np.isnan(value):
        print(row_index)
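
pandas also ships its own helpers for missing values, which treat None and nan the same way. A short sketch:

import pandas as pd

# True wherever vehicle_id is missing
pd.isna(car_accident_table['vehicle_id'])

# Drop the rows where vehicle_id is missing before counting
car_accident_table.dropna(subset=['vehicle_id'])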

Step 5: Save Your Work

It is a good habit to push your work to your repository at the end of each session so that, in the event your local machine fails, you can still recover everything up to your last push.

Note: If you are using Windows, please use Git Bash to run the commands.

# Load your SSH credentials -- note the backticks are not quotes!
eval `ssh-agent`
ssh-add

# Enter your machine
ssh your-machine-name

# Enter your project
cd ~/Projects/your-project

# Clear all outputs from your notebooks before committing
find . -name "*.ipynb" -prune -exec jupyter nbconvert --clear-output --inplace '{}' ';'

We strongly recommend clearing all outputs from your notebooks before committing as this will reduce the messiness of any merge conflicts.

# Check which files have changed
git status

# Review changes
git diff HEAD

# Add all changes
git add -A

# Check which files have changed
git status

# Commit
git commit -m 'YOUR MESSAGE'

# Pull latest changes and resolve any merge conflicts
git pull

# Push changes to the repository
git push

Note that your changes are not safely backed up until you have successfully pushed them to the repository.

Configure Report Automation

  • CrossCompute (see examples)
    • automation.yml tells CrossCompute what to do when you click the CrossCompute button
    • report.yml defines the report structure
    • report.css defines the report styling
    • run.ipynb contains your code to generate the tables, charts, maps
    • result.yml defines a section structure
    • result.md contains the report text
  • Linux (see tutorial)
    • ls, cd, nano, vim, pwd
  • Git (see tutorial)
    • clone, status, log, commit, push, pull, branch

Add Text and Styling

TODO

  • [ ] Add notes about INPUT_FOLDER and OUTPUT_FOLDER
  • [ ] Document report.yml syntax