Here we gather notes and questions that might help beginners as you automate your own reports.
- Video: Report Automation Introduction
- Video: Introduction to Prototyping Tables, Charts, Maps
- Video: Recreating the Load Profiles Report Example
Skills
Answer Your Own Questions
You can answer most of your own technical questions by searching the internet. Even if you know the answer, you can find better and newer ways to do what you already know how to do. Here are some tips:
- Describe what you are trying to do.
python open excel
- Search for the error message enclosed in quotes.
"IOError: No such file or directory"
- Include the name of the package in your search query.
"IOError: No such file or directory" +pandas
- On StackOverflow, first try the most upvoted answer and then try the most recent answer.
For more tips, see Use Google More Efficiently.
Structure Your Workflow
Design
- Identify who will read your report. Think about what motivates them, what they know and what they don’t know.
- List questions from your readers that your report will answer and hypotheses that your report will test.
- Sketch tables, charts, maps using dummy data to answer the questions.
- Test your dummy report on potential readers and iterate until they love it.
Prioritize
- Rank the questions you want to answer and the hypotheses you want to test.
- Enumerate and sort the numbers, tables, charts, images, maps in your report into categories that indicate the order in which you will tackle them. For example, you can categorize variables into Critical, High, Medium, Low, Future.
Hypothesis 1: Enforcement of stop signs is correlated with fewer pedestrian deaths from cars.
Hypothesis 2: Enforcement of stop signs is correlated with fewer biker deaths from cars.
# Critical
- [ ] Chart: Month by month number of pedestrian deaths when stop signs are enforced vs not enforced
- [ ] Chart: Month by month number of biker deaths when stop signs are enforced vs not enforced
# High
- [ ] Map: Distribution of stop signs and car accidents in region
# Medium
- [ ] Number: Probability of a pedestrian death at an intersection when there is a stop sign
- [ ] Number: Probability of a biker death at an intersection when there is a stop sign
- [ ] Number: Probability of a pedestrian death at an intersection when there is a stop sign and it has been enforced recently
- [ ] Number: Probability of a biker death at an intersection when there is a stop sign and it has been enforced recently
# Low
- [ ] Number: Number of stop signs in region
- [ ] Number: Percentage of stop signs that have had related accidents
- [ ] Image: Front cover
- [ ] Image: Back cover
# Future
- [ ] Map: Proposed locations of new stop signs
- [ ] Map: Proposed locations where stop signs should be enforced on a random basis
Prototype
- Find existing tables, charts, maps that look like what you want to make.
- Prepare partial datasets to reduce iteration time.
- Prototype the tables, charts, maps using the partial datasets.
Automate
- Prepare configuration files that tell CrossCompute how to run your report.
- Test your report over full datasets.
- Add text and styling.
Prototype Your Tables, Charts, Maps
When looking at your list of variables, do not be afraid!
- Prototype the most important variables first. You can always add more later.
- Work on each variable one at a time.
- Work in phases/iterations so that any point, you can send your draft report to anyone and your draft makes sense as a story.
- Push your work to a repository at the end of the day so that you do not lose your work.
Step 1: Decide Which Table, Chart, Map You Want to Draw
- Choose the most important variable from your list of variables.
- Imagine how best to illustrate the variable for the story you want to tell in your report.
- Sketch the variable as a table, chart, map or image on a piece of paper. Despite all the advances in technology, there is still nothing quite like drawing with pen, pencil or crayon on paper.
Step 2: Find an Existing Table, Chart, Map that Looks Like What You Want to Draw
Browse the internet to see if you can find a representation that looks like what you have drawn. We will call this our Reference Table/Chart/Map.
Tables
Charts
Maps
Step 3: Prepare Partial Datasets to Reduce Iteration Time
- It makes no sense to wait an hour each time you modify a table, chart or map. Prepare a partial dataset that has a small fraction of the full dataset.
- If you are pulling the data from a live API, download some elements to make a static dummy dataset.
Step 4: Prototype Your Tables, Charts, Maps Using Partial Datasets
- Review your hypotheses.
- Examine your real data.
- Extract 1-2 elements from your real data to serve as your Start Point.
- Look at the code from your Reference Table/Chart/Map to identify the required format for your data.
- Make dummy data based on your Reference Table/Chart/Map to serve as your End Point.
- Write pseudocode to take you from your Start Point to your End Point.
- Convert your pseudocode into real code one line at a time.
- When writing your code, take some extra effort to make your code readable by using descriptive variable names and by making sure the entirety of each function or code cell fits in a single screen or less.
Here are some important links and concepts:
- Jupyter (see documentation): Notebooks, Kernels, Terminals
- Python (see tutorial): Control Flow, Data Structures, Input and Output, Package Installation and Imports
- Tables (see tutorial): Series, DataFrames
- Charts (see matplotlib and seaborn): Figure, Axes
- Maps (see geojson): Point, LineString, Polygon
Find
The first step is to locate your dataset.
-
In JupyterLab, use the standard
bash
commands to situate yourself and find your file.# See files in your current folder ls # Change to a different folder cd SOME-OTHER-FOLDER # Change to the parent folder cd .. # Toggle to the previous folder cd -
-
Define variables to remember your datasets folder and your dataset path.
from os.path import join datasets_folder = 'datasets' spreadsheet_path = join(datasets_folder, 'your-data.xlsx')
Load
-
Use the right package and function to load your dataset.
import pandas as pd spreadsheet = pd.ExcelFile(spreadsheet_path) spreadsheet.sheet_names # See all sheet names
-
In some cases, you might need to skip rows if the columns are appearing as
Unnamed: 1
stop_signs_table = spreadsheet.parse('Stop Signs', skiprows=1)
Preview
-
Show the first three rows using
numpy
indexing.stop_signs_table[:3]
-
Examine the first row.
stop_signs_table.iloc[0]
-
See descriptive statistics.
stop_signs_table.describe()
Filter
-
Filter a table using a single condition.
# Make a boolean mask mask = car_accident_table['vehicle_id'] == vehicle_id # Apply mask to extract rows where the boolean mask is True car_accident_table[mask]
-
Filter a table that has timestamps.
# Check that the column is of type pandas.Timestamp car_accident_table['timestamp'].iloc[0] # Find rows that occurred after 2020 mask = car_accident_table['timestamp'] > '2020' car_accident_table[mask] # Find rows that occurred after 2020-06 mask = car_accident_table['timestamp'] > '2020-06' car_accident_table[mask]
-
Filter a table using multiple conditions.
condition1 = car_accident_table['vehicle_id'] == vehicle_id condition2 = car_accident_table['timestamp'] > 2019 condition3 = car_accident_table['timestamp'] < 2021 car_accident_table[condition1 & condition2 & condition3]
Count
There are many ways to aggregate values in pandas
.
import pandas as pd
vehicle_ids_by_vehicle_type = {
'truck': [2],
'suv': [0, 1, 3],
'van': [],
'sedan': [4],
}
car_accident_table = pd.DataFrame([
[0, '2020-01-02', 'p'],
[0, '2020-02-01', 'p'],
[0, '2020-03-01', 'b'],
[1, '2020-02-27', 'c'],
[2, '2020-03-01', 'c'],
[2, '2020-04-01', 'p'],
[3, '2020-04-01', 'p'],
[3, '2020-01-17', 'b'],
[4, '2020-05-01', 'c'],
[None, '2020-06-01', 'c'],
], columns=[
'vehicle_id',
'timestamp',
'accident_code',
])
-
Attempt a one-liner using
groupby
.car_accident_table.groupby('vehicle_id').count()
-
Apply a function over each row and store the values in a column.
def get_category(row): accident_code = row['accident_code'] try: return { 'p': 'pedestrian', 'b': 'bike', 'c': 'car', }[accident_code] except KeyError: pass car_accident_table['category'] = car_accident_table.apply(get_category, axis=1)
-
Iterate manually. Though least efficient, looping manually is also the most flexible.
average_accident_count_by_vehicle_type = {} for vehicle_type, vehicle_ids in vehicle_ids_by_vehicle_type.items(): total_accident_count = 0 for vehicle_id in vehicle_ids: mask = car_accident_table['vehicle_id'] == vehicle_id specific_car_accident_table = car_accident_table[mask] total_accident_count += len(specific_car_accident_table) vehicle_count = len(vehicle_ids) if not vehicle_count: continue average_accident_count = total_accident_count / vehicle_count average_accident_count_by_vehicle_type[vehicle_type] = average_accident_count average_accident_count_by_vehicle_type
Note that you will often encounter missing values encoded in different ways. Sometimes, you will see the value nan
that means “not a number”. You can test whether a value is nan
using np.isnan
.
import numpy as np
for row_index, row in car_accident_table.iterrows():
value = row['vehicle_id']
if np.isnan(value):
print(row_index)
Step 5: Save Your Work
It is a good habit to push your work to your repository at the end of each session so that in the event your local machine fails, you can still recover all of your work since when you last pushed it.
Note: If you are using Windows, please use Git Bash to run the commands.
# Load your SSH credentials -- note the backticks are not quotes!
eval `ssh-agent`
ssh-add
# Enter your machine
ssh your-machine-name
# Enter your project
cd ~/Projects/your-project
# Clear all outputs from your notebooks before committing
find . -name "*.ipynb" -prune -exec jupyter nbconvert --clear-output --inplace '{}' ';'
We strongly recommend clearing all outputs from your notebooks before committing as this will reduce the messiness of any merge conflicts.
# Check which files have changed
git status
# Review changes
git diff HEAD
# Add all changes
git add -A
# Check which files have changed
git status
# Commit
git commit -m 'YOUR MESSAGE'
# Pull latest changes and resolve any merge conflicts
git pull
# Push changes to the repository
git push
Note that your changes have not been saved until after you have successfully pushed your changes to the repository.
Configure Report Automation
- CrossCompute (see examples)
-
automation.yml
tells CrossCompute what to do when you click the CrossCompute button -
report.yml
defines the report structure -
report.css
defines the report styling -
run.ipynb
contains your code to generate the tables, charts, maps -
result.yml
defines a section structure -
result.md
contains the report text
-
- Linux (see tutorial)
-
ls
,cd
,nano
,vim
,pwd
-
- Git (see tutorial)
-
clone
,status
,log
,commit
,push
,pull
,branch
-
Add Text and Styling
- Markdown (see documentation): Headings, Links
- CSS (see documentation): Colors, Margins, Tables
TODO
- [ ] Add notes about INPUT_FOLDER and OUTPUT_FOLDER
- [ ] Document report.yml syntax