Sep 22, 2021
Last lecture
Today
# Let's setup the imports we'll need first
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import geopandas as gpd
Load 311 requests in Philadelphia from the data/ directory.
Source: OpenDataPhilly
# Load the data from a CSV file into a pandas DataFrame
requests = pd.read_csv('./data/public_cases_fc_2020.csv.tar.gz')
print("number of requests = ", len(requests))
requests.head()
Remove the requests missing lat/lon coordinates
requests = requests.dropna(subset=['lat', 'lon'])
Create Point objects for each lat and lon combination.
We can use the helper utility function: geopandas.points_from_xy()
requests['Coordinates'] = gpd.points_from_xy(requests['lon'], requests['lat'])
requests['Coordinates'].head()
Now, convert to a GeoDataFrame.
Important
EPSG:4326
Since we're only using a few EPSG codes in this course, you can usually tell what the CRS is by looking at the values in the Point() objects.
Philadelphia has a latitude of about 40 deg and longitude of about -75 deg.
Our data must be in the usual lat/lng EPSG=4326.
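As a quick illustration of that rule of thumb (the numbers below are approximate and not taken from the dataset):

```python
# Rough sanity check: values within (-180, 180) / (-90, 90) look like
# lat/lng degrees (EPSG:4326); Web Mercator (EPSG:3857) coordinates for
# Philadelphia would instead be on the order of millions of meters.
lon, lat = -75.16, 39.95  # approximate center of Philadelphia
looks_like_degrees = abs(lon) <= 180 and abs(lat) <= 90
print(looks_like_degrees)  # True
```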
requests = gpd.GeoDataFrame(requests,
geometry="Coordinates",
crs="EPSG:4326")
Group by the service name and calculate the size of each group:
service_types = requests.groupby('service_name').size()
Sort by the number (in descending order):
service_types = service_types.sort_values(ascending=False)
Slice the data to take the first 20 elements:
top20 = service_types.iloc[:20]
top20
trash_requests = requests.loc[
requests["service_name"] == "Rubbish/Recyclable Material Collection"
].copy()
print("The number of trash-related requests = ", len(trash_requests))
See, for example, this article in the Philadelphia Inquirer.
# Convert the requested datetime to a column of Datetime objects
trash_requests['requested_datetime'] = pd.to_datetime(trash_requests['requested_datetime'])
# Use the .dt attribute to extract out the month name
trash_requests['month_int'] = trash_requests['requested_datetime'].dt.month
trash_requests['month'] = trash_requests['requested_datetime'].dt.month_name()
TL;DR: This is usually fine!
If you select a subset of a dataframe (a "slice") and then make changes (like adding a new column), you will get this warning. There is a good discussion of the issue on StackOverflow.
You can usually make this go away if you add a .copy() after you perform your selection. For example, this warning would go away if we had done:
trash_requests = requests.loc[requests["service_name"] == "Rubbish/Recyclable Material Collection"].copy()
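A toy version of the same pattern (hypothetical data, just to show the shape of the fix):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the 311 requests
df = pd.DataFrame({"service": ["trash", "noise", "trash"], "n": [1, 2, 3]})

# Selecting and then assigning on the slice can trigger SettingWithCopyWarning;
# an explicit .copy() makes the subset independent of df
subset = df.loc[df["service"] == "trash"].copy()
subset["doubled"] = subset["n"] * 2

print(subset["doubled"].tolist())  # [2, 6]
```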
totals_by_month = trash_requests.groupby(["month_int", "month"], as_index=False).size()
totals_by_month
Note the as_index=False syntax here: it forces the size() function to return a DataFrame instead of leaving the month column as the index of the resulting groupby operation. It saves us from having to call .reset_index() after running .size().
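A minimal demonstration of the difference (toy data, not the 311 requests):

```python
import pandas as pd

df = pd.DataFrame({"month": ["Jan", "Jan", "Feb"], "value": [1, 2, 3]})

# Default: .size() returns a Series with "month" as the index
as_series = df.groupby("month").size()

# With as_index=False: a DataFrame with "month" and "size" columns
as_frame = df.groupby("month", as_index=False).size()

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(list(as_frame.columns))    # ['month', 'size']
```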
For making static bar charts with Python, seaborn's sns.barplot() is the best option.
import seaborn as sns
# Initialize figure/axes
fig, ax = plt.subplots(figsize=(12, 6))
# Plot!
sns.barplot(
x="month_int",
y="size",
data=totals_by_month,
color="#2176d2",
ax=ax,
)
ax.set_xticklabels(totals_by_month["month"]);
The trend is clear in the previous chart, but can we do a better job with the aesthetics? Yes!
For reference, here is a common way to clean up charts in matplotlib:
# Initialize figure/axes
fig, ax = plt.subplots(figsize=(12, 6))
# Plot!
sns.barplot(
x="month",
y="size",
data=totals_by_month,
color="#2176d2",
ax=ax,
order=[
"January",
"February",
"March",
"April",
"May",
"June",
"July",
"August",
"September",
"October",
"November",
"December",
],
zorder=999, # Make sure the bar charts are on top of the grid
)
# Remove x/y axis labels
ax.set_xlabel("")
ax.set_ylabel("")
# Format the ytick labels to use a comma and no decimal places
ax.set_yticklabels([f"{yval:,.0f}" for yval in ax.get_yticks()])
# Rotate the xticks
plt.setp(ax.get_xticklabels(), rotation=60)
# Add a grid background
ax.grid(True, axis="y")
# Remove the top and right axes lines
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
# Add a title
ax.set_title(
"Philadelphia's Trash-Related 311 Requests in 2020", weight="bold", fontsize=16
);
The original data is in EPSG=4326 (lat/lng degrees). We'll convert to EPSG=3857 (Web Mercator, with units of meters).
trash_requests = trash_requests.to_crs(epsg=3857)
trash_requests.head()
A GeoJSON holding Zillow definitions for Philadelphia neighborhoods is available in the data/ directory.
zillow = gpd.read_file('data/zillow_neighborhoods.geojson')
zillow = zillow.to_crs(epsg=3857)
zillow.head()
fig, ax = plt.subplots(figsize=(8, 8))
ax = zillow.plot(ax=ax, facecolor='none', edgecolor='black')
ax.set_axis_off()
ax.set_aspect("equal")
Use the sjoin() function to match point data (requests) to polygon data (neighborhoods).
joined = gpd.sjoin(trash_requests, zillow, op='within', how='left')
joined.head()
Note that this operation can be slow
Group by neighborhood and calculate the size:
totals = joined.groupby('ZillowName', as_index=False).size()
type(totals)
Note: we're once again using as_index=False to ensure the result of the .size() function is a DataFrame rather than a Series with ZillowName as its index.
totals.head()
Lastly, merge Zillow geometries (GeoDataFrame) with the total # of requests per neighborhood (DataFrame).
Important
When merging a GeoDataFrame (spatial) and a DataFrame (non-spatial), you should always call the .merge() function of the spatial data set to ensure that the merged result is a GeoDataFrame.
For example...
totals = zillow.merge(totals, on='ZillowName')
totals.head()
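The direction matters because pandas' DataFrame.merge() returns a plain DataFrame; a minimal sketch (toy geometries, assuming geopandas and shapely are available):

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# A tiny GeoDataFrame with one polygon, and a matching non-spatial table
polys = gpd.GeoDataFrame(
    {"name": ["A"], "geometry": [Point(0, 0).buffer(1)]}, crs="EPSG:4326"
)
counts = pd.DataFrame({"name": ["A"], "size": [10]})

# Calling .merge() on the spatial side keeps the result spatial...
print(type(polys.merge(counts, on="name")).__name__)  # GeoDataFrame
# ...while calling it on the non-spatial side drops to a plain DataFrame
print(type(counts.merge(polys, on="name")).__name__)  # DataFrame
```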
Choropleth maps color polygon regions according to the values of a specific data attribute. They are built-in to GeoDataFrame objects.
First, plot the total number of requests per neighborhood.
# Create the figure/axes
fig, ax = plt.subplots(figsize=(8, 8))
# Plot
totals.plot(
ax=ax,
column="size",
edgecolor="white",
linewidth=0.5,
legend=True,
cmap="viridis"
)
# Format
ax.set_axis_off()
ax.set_aspect("equal")
# Needed to line up the colorbar properly
from mpl_toolkits.axes_grid1 import make_axes_locatable
# Create the figure
fig, ax = plt.subplots(figsize=(8, 8))
# NEW: Create a nice, lined up colorbar axes (called "cax" here)
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.2)
# Plot
totals.plot(
ax=ax,
cax=cax,
column="size",
edgecolor="white",
linewidth=0.5,
legend=True,
cmap="viridis",
)
# NEW: Get the limits of the GeoDataFrame
xmin, ymin, xmax, ymax = totals.total_bounds
# NEW: Set the xlims and ylims
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
# Format
ax.set_axis_off()
ax.set_aspect("equal")
These improvements are optional, but they definitely make for nicer plots!
Yes, built-in to the plot() function!
Many different schemes, but here are some of the most common ones:
# Quantiles Scheme
fig, ax = plt.subplots(figsize=(10, 7), facecolor="#cfcfcf")
totals.plot(
ax=ax,
column="size",
edgecolor="white",
linewidth=0.5,
legend=True,
legend_kwds=dict(loc="lower right"),
cmap="Reds",
scheme="Quantiles",
k=5,
)
ax.set_title("Quantiles: k = 5")
ax.set_axis_off()
ax.set_aspect("equal")
# Equal Interval Scheme
fig, ax = plt.subplots(figsize=(10,7), facecolor='#cfcfcf')
totals.plot(
ax=ax,
column="size",
edgecolor="white",
linewidth=0.5,
legend=True,
legend_kwds=dict(loc='lower right'),
cmap="Reds",
scheme="EqualInterval",
k=5
)
ax.set_title("Equal Interval: k = 5")
ax.set_axis_off()
ax.set_aspect("equal")
# Fisher Jenks Scheme
fig, ax = plt.subplots(figsize=(10,7), facecolor='#cfcfcf')
totals.plot(
ax=ax,
column="size",
edgecolor="white",
linewidth=0.5,
legend=True,
legend_kwds=dict(loc='lower right'),
cmap="Reds",
scheme="FisherJenks",
)
ax.set_title("Fisher Jenks: k = 5")
ax.set_axis_off()
ax.set_aspect("equal")
# User Defined Scheme
fig, ax = plt.subplots(figsize=(10,7), facecolor='lightgray')
totals.plot(
ax=ax,
column="size",
edgecolor="white",
linewidth=0.5,
legend=True,
legend_kwds=dict(loc='lower right'),
cmap="Reds",
scheme="UserDefined",
classification_kwds=dict(bins=[100, 300, 500, 800, 1400])  # NEW: specify user-defined bins
)
ax.set_title("User Defined Bins")
ax.set_axis_off()
ax.set_aspect("equal")
The documentation can be found here: https://pysal.org/mapclassify/api.html
Contains the full list of schemes and the function definitions for each.
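For intuition, the Quantiles scheme splits the data at quantile breaks so each class holds roughly the same number of observations; numpy can show the idea (toy values, not the 311 data):

```python
import numpy as np

values = np.array([10, 20, 30, 100, 200, 300, 1000, 1400])

# Upper bin edges for k = 4 classes: the 25th, 50th, 75th, and 100th percentiles
breaks = np.quantile(values, [0.25, 0.5, 0.75, 1.0])
print(breaks)  # [  27.5  150.   475.  1400. ]
```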
Better to normalize by area: use the .area attribute of the geometry series
# Requests per hectare (areas are in m² in EPSG:3857; 1 hectare = 10,000 m²)
totals['N_per_area'] = totals['size'] / totals.geometry.area * 1e4
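To make the units concrete (hypothetical numbers): with areas in square meters, multiplying by 1e4 converts the rate to requests per hectare, since 1 hectare = 10,000 m².

```python
size = 500            # hypothetical number of requests in a neighborhood
area_m2 = 2_500_000   # hypothetical neighborhood area in square meters

# requests per hectare = (requests per m²) * (10,000 m² per hectare)
rate = size / area_m2 * 1e4
print(rate)  # 2.0
```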
Now plot the normalized totals:
totals
# Create the figure
fig, ax = plt.subplots(figsize=(8, 8))
# NEW: Create a nice, lined up colorbar axes (called "cax" here)
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.2)
# Plot
totals.plot(
ax=ax,
cax=cax,
column="N_per_area",
edgecolor="white",
linewidth=0.5,
legend=True,
cmap="viridis",
)
# NEW: Get the limits of the GeoDataFrame
xmin, ymin, xmax, ymax = totals.total_bounds
# NEW: Set the xlims and ylims
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
# Format
ax.set_axis_off()
ax.set_aspect("equal")
Since households are driving the 311 requests, it would be even better to normalize by the number of properties in a given neighborhood rather than neighborhood area
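A sketch of what that would look like, assuming we had a table of property counts per neighborhood (the neighborhood names and counts below are made up for illustration):

```python
import pandas as pd

# Hypothetical property counts per neighborhood (illustrative values only)
properties = pd.DataFrame(
    {"ZillowName": ["Fishtown", "Point Breeze"], "n_properties": [8000, 5000]}
)
# Hypothetical request totals, mirroring the shape of the `totals` DataFrame
request_totals = pd.DataFrame(
    {"ZillowName": ["Fishtown", "Point Breeze"], "size": [1200, 900]}
)

# Merge and normalize: requests per property in each neighborhood
merged = request_totals.merge(properties, on="ZillowName")
merged["N_per_property"] = merged["size"] / merged["n_properties"]
print(merged["N_per_property"].tolist())  # [0.15, 0.18]
```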
Hexagonal bins aggregate quantities over small spatial regions.
Use matplotlib's hexbin()
function
# create the axes
fig, ax = plt.subplots(figsize=(12, 12))
# Extract out the x/y coordinates of the Point objects
xcoords = trash_requests.geometry.x
ycoords = trash_requests.geometry.y
# Plot a hexbin chart
hex_vals = ax.hexbin(xcoords, ycoords, gridsize=50)
# Add the zillow geometry boundaries
zillow.plot(ax=ax, facecolor="none", edgecolor="white", linewidth=0.25)
# add a colorbar and format
fig.colorbar(hex_vals, ax=ax)
ax.set_axis_off()
ax.set_aspect("equal")
Let's plot a random sample of the requests, with a nice basemap underneath.
We'll use the contextily utility package.
import contextily as ctx
# load the city limits data
city_limits = gpd.read_file('./data/City_Limits')
# create the axes
fig, ax = plt.subplots(figsize=(12, 12))
# Plot a random sample of 1,000 requests as points
random_requests = trash_requests.sample(1000)
# Plot
random_requests.plot(ax=ax, marker='.', color='crimson')
# Add the city limits
city_limits.to_crs(trash_requests.crs).plot(ax=ax, edgecolor='black', linewidth=3, facecolor='none')
# NEW: plot the basemap underneath
ctx.add_basemap(ax=ax, crs=trash_requests.crs, source=ctx.providers.CartoDB.Positron)
# remove axis lines
ax.set_axis_off()
Lots of different tile providers are available. The easiest way to explore them is tab completion on ctx.providers:
ctx.providers
# create the axes
fig, ax = plt.subplots(figsize=(12, 12))
# plot a random sample of requests
trash_requests.sample(1000).plot(ax=ax, marker='.', color='crimson')
# add the city limits
city_limits.to_crs(trash_requests.crs).plot(ax=ax, edgecolor='white', linewidth=3, facecolor='none')
# plot the basemap underneath
ctx.add_basemap(ax=ax, crs=trash_requests.crs, source=ctx.providers.CartoDB.DarkMatter) # NEW: DarkMatter
# remove axis lines
ax.set_axis_off()
Yes! Let's add interactivity. We'll start with altair.
Altair recently added full support for GeoDataFrames, making interactive choropleths very easy to make!
import altair as alt
# IMPORTANT: Altair needs the GeoDataFrame to be in EPSG:4326
totals_4326 = totals.to_crs(epsg=4326)
# Plot the map, where variables are nested within `properties`
(
alt.Chart(totals_4326)
.mark_geoshape(stroke="white")
.encode(
tooltip=["N_per_area:Q", "ZillowName:N", "size:Q"],
color=alt.Color("N_per_area:Q", scale=alt.Scale(scheme="viridis")),
).properties(width=500, height=400)
)
Challenge for later: use altair's repeated charts to show several choropleths for different 311 request types at once.
A similar example (using a different dataset) is available in the altair gallery.
Goals: Visualize the property assessment values by neighborhood in Philadelphia, using a choropleth map.
Challenge (if time remaining):
Visualize the highest-valued residential and commercial properties as points on top of a contextily
basemap
2019 property assessment data:
data = pd.read_csv('./data/opa_residential.csv')
data.head()
We'll focus on the market_value column for this analysis.
Remember to set the EPSG of the input data — this data is in the typical lat/lng coordinates (EPSG=4326)
data = data.dropna(subset=['lat', 'lng'])
data["Coordinates"] = gpd.points_from_xy(data["lng"], data["lat"])
gdata = gpd.GeoDataFrame(data, geometry="Coordinates", crs="EPSG:4326")
gdata.head()
len(gdata)
Use the sjoin() function.
Make sure your CRSs match before doing the sjoin!
zillow.crs
gdata = gdata.to_crs(epsg=3857)
# could also do:
# gdata = gdata.to_crs(zillow.crs)
gdata = gpd.sjoin(gdata, zillow, op='within', how='left')
gdata.head()
Hints:
# Do the group operation
grouped = gdata.groupby('ZillowName', as_index=False)
# Take the median of market value for each group
median_values = grouped['market_value'].median()
median_values.head()
# Merge in the median values (call .merge() on the zillow GeoDataFrame so the result stays spatial)
median_values = zillow.merge(median_values, on='ZillowName')
# Normalize
median_values['market_value'] /= 1e3 # in thousands
median_values.head()
# Create the figure
fig, ax = plt.subplots(figsize=(8, 8))
# Create a nice, lined up colorbar axes (called "cax" here)
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.2)
# Plot
median_values.plot(
ax=ax,
cax=cax,
column="market_value",
edgecolor="white",
linewidth=0.5,
legend=True
)
# Get the limits of the GeoDataFrame
xmin, ymin, xmax, ymax = median_values.total_bounds
# Set the xlims and ylims
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
# Format
ax.set_axis_off()
ax.set_aspect("equal")
# Format cax labels
cax.set_yticklabels([f"${val:.0f}k" for val in cax.get_yticks()]);
Hints:
- Use the C and reduce_C_function keywords of the hexbin() function
- Run plt.hexbin? for more help
- Try bins='log' on the resulting map
Note: you should pass in the raw point data rather than any aggregated data to the hexbin() function.
# Create the axes
fig, ax = plt.subplots(figsize=(10, 8))
# Use the .x and .y attributes
# NOTE: we are passing in the raw point values here!
# Matplotlib is doing the binning and aggregation work for us!
xcoords = gdata.geometry.x
ycoords = gdata.geometry.y
hex_vals = ax.hexbin(
xcoords,
ycoords,
C=gdata.market_value / 1e3,
reduce_C_function=np.median,
bins="log",
gridsize=30,
)
# Add the zillow geometry boundaries
zillow.plot(ax=ax, facecolor="none", edgecolor="white", linewidth=1, alpha=0.5)
# Add a colorbar and format
cbar = fig.colorbar(hex_vals, ax=ax)
ax.set_axis_off()
ax.set_aspect("equal")
# Format cbar labels
cbar.set_ticks([100, 1000])
cbar.set_ticklabels(["$100k", "$1M"]);
# Convert median values to EPSG=4326
median_values_4326 = median_values.to_crs(epsg=4326)
# Plot the map
chart = (
alt.Chart(median_values_4326)
.mark_geoshape(stroke="white")
.encode(
tooltip=["market_value:Q", "ZillowName:N"],
color=alt.Color("market_value:Q", scale=alt.Scale(scheme='viridis'))
).properties(width=500, height=400)
)
chart
They are all located in Center City
sorted_properties = gdata.sort_values(by='market_value', ascending=False)
top20 = sorted_properties.head(n=20)
top20
# Create the axes
fig, ax = plt.subplots(figsize=(12, 12))
# Plot the top 20 properties by market value
top20.plot(ax=ax, marker=".", color="crimson", markersize=200)
# Add the city limits (with the same CRS!)
city_limits = city_limits.to_crs(top20.crs)
city_limits.plot(
ax=ax, edgecolor="black", linewidth=3, facecolor="none"
)
# Plot the basemap underneath
ctx.add_basemap(ax=ax, crs=top20.crs, source=ctx.providers.CartoDB.Positron)
# remove axis lines
ax.set_axis_off()
Hint: You can use the total_bounds of the top 20 properties dataframe to get the x/y limits of the axes.
# Create the axes
fig, ax = plt.subplots(figsize=(12, 12))
# Plot the top 20 properties by market value
top20.plot(ax=ax, marker=".", color="crimson", markersize=200)
# Add the city limits (with the same CRS!)
city_limits = city_limits.to_crs(top20.crs)
city_limits.plot(
ax=ax, edgecolor="black", linewidth=3, facecolor="none"
)
# Set the xlims and ylims using the boundaries
# of the top 20 properties and a padding (in meters)
xmin, ymin, xmax, ymax = top20.total_bounds
PAD = 1000
ax.set_xlim(xmin-PAD, xmax+PAD)
ax.set_ylim(ymin-PAD, ymax+PAD)
# Plot the basemap underneath
ctx.add_basemap(ax=ax, crs=top20.crs, source=ctx.providers.CartoDB.Positron)
# Remove axis lines
ax.set_axis_off()
The axes limits must be set before the call to ctx.add_basemap(). This enables contextily to load a basemap with the proper level of zoom!
The top 20 properties are actually all contained within just 6 buildings in Center City.