NYC Open Data Student Showcase 20190303 Updates

invisibleroads · February 22, 2019, 7:25pm

Teams that are presenting at our second annual NYC Open Data Student Showcase on Sunday, March 3 should check here for last-minute technical updates.

GeoTable Maximum Display Count Increased to 1024

We updated crosscompute-geotable to increase the maximum display count for geotables rendered as maps to 1024.

PySAL>=2 Breaks Previous Code that Uses KDTree and KNN

While updating the platform, we inadvertently updated PySAL to 2.0.0, which introduces changes that break existing code.

# OLD
import numpy as np
from pysal.cg import RADIUS_EARTH_KM
from pysal.cg.kdtree import KDTree
from pysal import knnW_from_array
kd_tree = KDTree(xys, distance_metric='Arc', radius=RADIUS_EARTH_KM)
distances, indices = kd_tree.query(xy, k=count, distance_upper_bound=1)
relative_indices = indices[~np.isnan(indices)]
relative_distances = distances[~np.isnan(indices)]
w = knnW_from_array(xys, k=2)
w.transform = 'R'

# NEW
from pysal.lib.cg import KDTree, RADIUS_EARTH_KM
from pysal.lib.weights import KNN
kd_tree = KDTree(xys, distance_metric='Arc', radius=RADIUS_EARTH_KM)
distances, indices = kd_tree.query(xy, k=count, distance_upper_bound=1)
# Missing neighbors come back with an index equal to len(xys)
relative_indices = indices[indices < len(xys)]
relative_distances = distances[indices < len(xys)]
w = KNN(kd_tree, k=2)
w.set_transform('R')

Click here to see a complete example.
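
For readers unsure what the KNN weights and the 'R' transform produce, the sketch below recomputes k-nearest-neighbor weights by brute force in plain Python and row-standardizes them so that each row sums to 1. The helper name knn_row_standardized_weights is hypothetical and the distance is straight-line rather than Arc; this illustrates the idea only and is not PySAL's implementation.

```python
import math


def knn_row_standardized_weights(points, k):
    """Return {i: {j: weight}} where j runs over the k nearest neighbors of i.

    Hypothetical helper for illustration; uses straight-line distance.
    """
    weights = {}
    for i, (xi, yi) in enumerate(points):
        # Rank every other point by its distance to point i
        others = [j for j in range(len(points)) if j != i]
        others.sort(
            key=lambda j: math.hypot(points[j][0] - xi, points[j][1] - yi))
        # Row standardization: the k neighbors share a total weight of 1
        weights[i] = {j: 1 / k for j in others[:k]}
    return weights


xys = [(0, 0), (0, 1), (1, 0), (1, 1)]
w = knn_row_standardized_weights(xys, k=2)
print(w[0])
```

With k=2, each point gives a weight of 0.5 to each of its two nearest neighbors.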

Spatial Regression Tutorials Fixed and Updated

We fixed and updated all code in our Spatial Regression Tutorial Notebooks and Example Tools.

NYC Department of Health and Mental Hygiene temporarily restricted access to their Communicable Disease Surveillance Data, which broke our walkthrough. We replaced the dataset with HIV/AIDS Diagnoses by Neighborhood, Sex, and Race/Ethnicity and updated the Prepare Dependent Variable By Aggregating Over Column Walkthrough.
We updated the Prepare Dependent Variable By Aggregating Over Image Walkthrough to extract the spatial reference in proj4 format using pollution_raster.crs.to_proj4().

Almost everything that you need to prepare your predictive tool using NYC Open Data is written in our Spatial Regression Tutorials:

How to Create Your First Fun Predictive Tool
- README
How to Load Open Data
- Load NYC Open Data
How to Geocode Addresses
- Geocode Addresses
How to Prepare Your Training Dataset with Spatial Statistics
How to Train and Select Your Predictive Model
- Train Example Model 20190201
- Train Model to Estimate Graduation Rate from Tree Count
How to Create a Predictive Tool
How to Prepare and Fit a Spatial Regression Model Using PySAL
- Compare Spatial Regression Models on Airbnb Listings
How to Create an Animated Heatmap from a Raster
- Animate Air Pollution

Convenience Method for Loading Datasets Fixed

We fixed our suggested method for loading open data from Socrata-based open data portals like NYC Open Data. Specifically, there was an issue where index labels were duplicated because we forgot to specify ignore_index=True when concatenating the buffered tables.

import pandas as pd

def load(
    endpoint_url,
    selected_columns=None,
    buffer_size=1000,
    search_term_by_column=None,
    **kw,
):
    buffer_url = (f'{endpoint_url}?$limit={buffer_size}')
    if selected_columns:
        select_string = ','.join(selected_columns)
        buffer_url += f'&$select={select_string}'
    for column, search_term in (search_term_by_column or {}).items():
        buffer_url += f'&$where={column}+like+"%25{search_term}%25"'
    print(buffer_url)
    tables = []

    # Pick the pandas reader that matches the endpoint format
    if endpoint_url.endswith('.json'):
        f = pd.read_json
    else:
        f = pd.read_csv

    # Fetch one buffer at a time until the portal returns an empty page
    t = f(buffer_url, **kw)
    while len(t):
        print(len(tables) * buffer_size + len(t))
        tables.append(t)
        offset = buffer_size * len(tables)
        t = f(buffer_url + f'&$offset={offset}', **kw)
    return pd.concat(tables, ignore_index=True, sort=False)
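
The effect of ignore_index on the concatenated result can be seen with two tiny stand-in buffers. The column name and values below are made up for illustration.

```python
import pandas as pd

# Two stand-in buffers; each page read separately starts its index at 0
first_buffer = pd.DataFrame({'tree_id': [101, 102]})
second_buffer = pd.DataFrame({'tree_id': [103, 104]})

# Default behavior keeps the original labels, so 0 and 1 appear twice
duplicated = pd.concat([first_buffer, second_buffer], sort=False)
print(list(duplicated.index))

# ignore_index=True renumbers the rows from 0 to n - 1
renumbered = pd.concat(
    [first_buffer, second_buffer], ignore_index=True, sort=False)
print(list(renumbered.index))
```

With duplicated labels, a lookup such as duplicated.loc[0] returns two rows instead of one, which is easy to miss until a later join or assignment misbehaves.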

Please use the latest method suggested in our Load NYC Open Data Walkthrough.
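
If you adapt the loader yourself, one option is to let the standard library escape the query string instead of hand-writing %25 and + characters. The sketch below is a hypothetical helper, build_buffer_url, not the method from the walkthrough. It makes three assumptions: that multiple filters should be combined with AND in a single $where clause, that string literals are written with single quotes, and that paging is ordered by the :id system field. The endpoint is a placeholder and the column names are examples.

```python
from urllib.parse import urlencode


def build_buffer_url(
    endpoint_url,
    selected_columns=None,
    buffer_size=1000,
    offset=0,
    search_term_by_column=None,
    order_by=':id',
):
    """Build the URL for one page (hypothetical helper for illustration)."""
    params = {'$limit': buffer_size, '$offset': offset, '$order': order_by}
    if selected_columns:
        params['$select'] = ','.join(selected_columns)
    if search_term_by_column:
        # Assumption: filters are combined with AND in a single $where clause
        params['$where'] = ' AND '.join(
            f"{column} like '%{search_term}%'"
            for column, search_term in search_term_by_column.items())
    # urlencode escapes spaces, quotes and percent signs;
    # safe keeps $, commas and colons readable
    return f'{endpoint_url}?{urlencode(params, safe="$,:")}'


# Placeholder endpoint, not a real dataset
url = build_buffer_url(
    'https://example.com/resource/abcd-1234.csv',
    selected_columns=['spc_common', 'boroname'],
    offset=2000,
    search_term_by_column={'boroname': 'Bronx'})
print(url)
```

Passing an explicit offset keeps the helper stateless, so the paging loop can stay where it is and only the string concatenation changes.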

Tool to Help Prepare Your Training Dataset Using Tree Statistics Now Available

If your team has not yet created a training dataset, you can now use our shortcut tool! We have created a tool that will augment your dataset with basic tree statistics from the 2015 Street Tree Census.

What You Can Do For the Next Few Weeks

Your team is only nine days away from presenting at our second annual NYC Open Data Student Showcase!

Friday, March 1, 2019 4pm - Dress Rehearsal to Test Your Slides and Results
Sunday, March 3, 2019 2pm - NYC Open Data Student Showcase 2019

Rehearse your presentation at least three times. Make sure everyone on your team gets a chance to speak!
Draft slides on slides.com only AFTER you have rehearsed your presentation. If you are using Google Slides, make sure to make your slides available to the public.
Link your slides next to your team name on this spreadsheet.
Add visualizations using matplotlib or seaborn. If you are using seaborn, remember to install the package first using the commands below, as described in this post.

import subprocess
assert subprocess.call('pip install seaborn'.split()) == 0

Link to your pre-generated results in this spreadsheet. Don’t forget to take a screenshot or record a screencast of your result and include it in your slides. You do not want to have technical difficulties, nor do you want to wait anxiously for your demo to run in front of a live audience.
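
If the seaborn install command above succeeds but the import still fails, the pip on your path may belong to a different Python than the one running your notebook. A slightly more defensive variant, sketched here with hypothetical helper names, calls pip through the current interpreter:

```python
import subprocess
import sys


def build_install_command(package_name):
    """Return a pip command tied to the interpreter running this notebook."""
    return [sys.executable, '-m', 'pip', 'install', package_name]


def install(package_name):
    # check_call raises CalledProcessError if pip exits with a non-zero code
    subprocess.check_call(build_install_command(package_name))


# In a notebook cell you would then run: install('seaborn')
```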

You are welcome to ask last-minute questions on this forum. Good luck!

invisibleroads · February 24, 2019, 2:28pm

PySAL>=2 Changes Return Values for KDTree

OLD
relative_indices = indices[~np.isnan(indices)]

NEW
relative_indices = indices[indices < len(xys)]

Here is a complete example:

xys = [
    (0, 0),
    (0, 1),
    (1, 0),
    (1, 1),
]
xy = 0.1, 0.1
count = len(xys)

from pysal.lib.cg import KDTree, RADIUS_EARTH_KM
kd_tree = KDTree(xys, distance_metric='Arc', radius=RADIUS_EARTH_KM)
distances, indices = kd_tree.query(
    xy, k=count, distance_upper_bound=100)
relative_indices = indices[indices < count]
relative_distances = distances[indices < count]
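
To see what the mask filters out, the sketch below queries SciPy's cKDTree directly. It assumes that PySAL 2's KDTree passes through SciPy's convention for missing neighbors, which is an infinite distance paired with an index equal to the number of points. It uses plain Euclidean distance rather than the Arc metric so that it runs without PySAL.

```python
import numpy as np
from scipy.spatial import cKDTree

xys = np.array([(0, 0), (0, 1), (1, 0), (1, 1)], dtype=float)
tree = cKDTree(xys)

# Ask for 2 neighbors but allow only those closer than 0.5
distances, indices = tree.query((0.9, 0.9), k=2, distance_upper_bound=0.5)

# Only (1, 1) is in range; the empty slot gets distance inf and index len(xys)
print(distances)
print(indices)

# Comparing against the number of points keeps the real neighbor
found_indices = indices[indices < len(xys)]
found_distances = distances[indices < len(xys)]

# Checking for finite distances gives the same selection
finite_indices = indices[np.isfinite(distances)]
```

Here k is 2 but the valid neighbor has index 3, so comparing against k would discard it. Writing indices < count is safe when count equals len(xys), as in the complete example.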

Thanks to Jendri and Paloma for pinpointing these issues.

invisibleroads · February 25, 2019, 2:55pm

Please avoid using single or double quotes in the name of your notebook for now. There is an issue where having a single or double quote in the name will prevent your tool from running properly.

BAD
Find NYC's Restrooms

GOOD
Find Restrooms in NYC
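
A quick way to check a name before saving your notebook, sketched with a hypothetical helper, has_quote:

```python
QUOTE_CHARACTERS = '\'"'


def has_quote(notebook_name):
    """Return True if the name contains a single or double quote."""
    return any(c in QUOTE_CHARACTERS for c in notebook_name)


print(has_quote("Find NYC's Restrooms"))
print(has_quote('Find Restrooms in NYC'))
```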