Sep 27, 2021
Two parts:
- Proper data visualization is crucial throughout all of the steps of the data science pipeline: data wrangling, modeling, and storytelling.
- GeoViews builds on HoloViews to add support for geographic data.
Where does hvPlot fit in?
The hvPlot package
It's relatively new: officially released in February 2019.
"We are very pleased officially announce the release of hvPlot! It provides a high-level plotting API for the PyData ecosystem including @pandas_dev, @xarray_dev, @dask_dev, @geopandas and more, generating interactive @datashader and @BokehPlots." — HoloViews (@HoloViews), February 4, 2019 (https://twitter.com/HoloViews/status/1092409050283819010)
An interface just like the pandas .plot() function, but much more useful.
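Since .hvplot() mirrors the .plot() API, switching between the two is mostly a matter of changing the accessor name. A minimal sketch (using a made-up DataFrame, not data from this notebook):
import pandas as pd
import hvplot.pandas  # registers the .hvplot accessor on DataFrames and Series

df = pd.DataFrame({"x": range(10), "y": range(10)})

df.plot(x="x", y="y")    # static matplotlib chart
df.hvplot(x="x", y="y")  # interactive bokeh chart, same arguments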
# Our usual imports
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# let's load the measles data from week 2
url = "https://raw.githubusercontent.com/MUSA-550-Fall-2021/week-2/master/data/measles_incidence.csv"
measles_data_raw = pd.read_csv(url, skiprows=2, na_values='-')
measles_data_raw.head()
measles_data = measles_data_raw.melt(id_vars=["YEAR", "WEEK"],
value_name="incidence",
var_name="state")
measles_data.head()
Plotting with pandas
The default .plot() doesn't know which variables to plot.
fig, ax = plt.subplots(figsize=(10, 6))
measles_data.plot(ax=ax)
But we can group by the year and plot the total national incidence in each year:
by_year = measles_data.groupby("YEAR")['incidence'].sum()
by_year.head()
fig, ax = plt.subplots(figsize=(10, 6))
# Plot the annual total by year
by_year.plot(ax=ax)
# Add the vaccine year and label
ax.axvline(x=1963, c='k', linewidth=2)
ax.text(1963, 27000, " Vaccine introduced", ha='left', fontsize=18);
Plotting with hvplot
Use the .hvplot() function to create interactive plots.
# This will add the .hvplot() function to your DataFrame!
import hvplot.pandas
# Import holoviews too
import holoviews as hv
# Load bokeh
hv.extension('bokeh')
img = by_year.hvplot(kind='line')
img
In this case, .hvplot() creates a HoloViews Curve object. Not unlike altair Chart objects, it's an object that knows how to translate from your DataFrame data to a visualization.
print(img)
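Since the result is a HoloViews object rather than a static image, you can also restyle it after the fact with .opts(). A quick sketch (the specific option values here are just illustrative, not from the original notebook):
# Customize the existing Curve object in place
img.opts(width=700, height=300, color="red", title="Annual measles incidence")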
by_year.hvplot(kind='scatter')
by_year.hvplot(kind='bar', rot=90, width=1000)
Use the * operator to layer together chart elements.
Note: the same thing can be accomplished in altair, but with the + operator.
# The line chart of incidence vs year
incidence = by_year.hvplot(kind='line')
# Vertical line + label for vaccine year
vline = hv.VLine(1963).opts(color='black')
label = hv.Text(1963, 27000, " Vaccine introduced", halign='left')
final_chart = incidence * vline * label
final_chart
This is some powerful magic.
Let's calculate the annual measles incidence for each year and state:
by_state = measles_data.groupby(['YEAR', 'state'])['incidence'].sum()
by_state.head()
Now, tell hvplot to produce charts for each state:
by_state_chart = by_state.hvplot(x="YEAR",
y="incidence",
groupby="state",
width=400,
kind="line")
by_state_chart
PA = by_state_chart['PENNSYLVANIA'].relabel('PA')
NJ = by_state_chart['NEW JERSEY'].relabel('NJ')
Combining charts with the + operator
combined = PA + NJ
combined
print(combined)
The charts are side-by-side by default. You can also specify the number of rows/columns explicitly.
# one column
combined.cols(1)
Overlaying multiple states on a single chart, using the by keyword:
states = ['NEW YORK', 'NEW JERSEY', 'CALIFORNIA', 'PENNSYLVANIA']
sub_states = by_state.loc[:, states]
sub_state_chart = sub_states.hvplot(x='YEAR',
y='incidence',
by='state',
kind='line')
sub_state_chart * vline
Just like in altair, when we used the alt.Chart().facet(column='state') syntax.
Below, we specify that the state column should be used to facet the chart, with one facet per column:
img = sub_states.hvplot(x="YEAR",
y='incidence',
col="state",
kind="line",
rot=90,
frame_width=200)
img * vline
Explicit plotting functions are also available from the by_state.hvplot namespace:
by_state.loc[1960:1970, states].hvplot.bar(x='YEAR',
y='incidence',
by='state', rot=90)
Change bar() to line() and we get the same thing as before.
by_state.loc[1960:1970, states].hvplot.line(x='YEAR',
y='incidence',
by='state', rot=90)
See the help message for explicit hvplot functions:
by_state.hvplot?
by_state.hvplot.line?
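The other explicit plotting functions follow the same pattern. For example, a quick illustrative histogram (not in the original notebook) of the annual incidence values in by_state:
# Histogram of the annual incidence values (illustrative example)
by_state.hvplot.hist(bins=50)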
Can we reproduce the WSJ measles heatmap that we made in altair in week 2?
Use the help function:
measles_data.hvplot.heatmap?
We want to plot 'YEAR' on the x axis, 'state' on the y axis, and specify 'incidence' as the values being plotted in each heatmap bin.
Two methods:
1. Use the by_state data frame, which has already summed over weeks for each state.
2. Use the raw data (measles_data) with columns for state, week, year, and incidence, and use the reduce_function keyword to sum over weeks.
First, using by_state:
# METHOD #1: just plot the incidence
heatmap = by_state.hvplot.heatmap(
x="YEAR",
y="state",
C="incidence",
cmap="viridis",
height=500,
width=1000,
flip_yaxis=True,
rot=90,
)
heatmap.redim(
state="State", YEAR="Year",
)
# METHOD #2: hvplot does the aggregation
heatmap = measles_data.hvplot.heatmap(
x="YEAR",
y="state",
C="incidence",
cmap='viridis',
reduce_function=np.sum,
height=500,
width=1000,
flip_yaxis=True,
rot=90,
)
heatmap.redim(state="State", YEAR="Year")
import hvplot

# Save the interactive chart to a standalone HTML file
hvplot.save(heatmap, 'measles.html')
# load the html file and display it
from IPython.display import HTML
HTML('measles.html')
Let's load the penguins data set from week 2
url = "https://raw.githubusercontent.com/MUSA-550-Fall-2021/week-2/master/data/penguins.csv"
penguins = pd.read_csv(url)
penguins.head()
Use the hvplot.scatter_matrix() function:
penguins.hvplot.scatter?
columns = ['flipper_length_mm',
'bill_length_mm',
'body_mass_g',
'species']
hvplot.scatter_matrix(penguins[columns], c='species')
Note the "box select" and "lasso" features on the tool bar for interactions
GeoPandas support in the .hvplot() function
Let's load some geographic data for countries:
import geopandas as gpd
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head()
fig, ax = plt.subplots(figsize=(10,10))
world.plot(column='gdp_md_est', ax=ax)
ax.set_axis_off()
world.hvplot.polygons?
# Can also just do world.hvplot()
world.hvplot.polygons(c='gdp_md_est',
geo=True,
frame_height=400)
import geopandas as gpd
# Load the data
url = "https://raw.githubusercontent.com/MUSA-550-Fall-2021/week-3/master/data/opa_residential.csv"
data = pd.read_csv(url)
# Create the Point() objects
data['Coordinates'] = gpd.points_from_xy(data['lng'], data['lat'])
# Create the GeoDataFrame
data = gpd.GeoDataFrame(data, geometry='Coordinates', crs="EPSG:4326")
# load the Zillow data from GitHub
url = "https://raw.githubusercontent.com/MUSA-550-Fall-2020/week-3/master/data/zillow_neighborhoods.geojson"
zillow = gpd.read_file(url)
# Important: Make sure the CRS match
data = data.to_crs(zillow.crs)
# perform the spatial join
data = gpd.sjoin(data, zillow, op='within', how='left')
# Calculate the median market value per Zillow neighborhood
median_values = data.groupby('ZillowName', as_index=False)['market_value'].median()
# Merge median values with the Zillow geometries
median_values = zillow.merge(median_values, on='ZillowName')
print(type(median_values))
median_values.head()
median_values.crs
# pass arguments directly to hvplot()
# and it recognizes polygons automatically
median_values.hvplot(c='market_value',
frame_width=600,
frame_height=500,
geo=True,
cmap='viridis',
hover_cols=['ZillowName'])
geo=True assumes EPSG:4326
If you specify geo=True, the data needs to be in the typical lat/lng CRS (EPSG:4326). If not, you can use the crs keyword to specify the CRS your data is in.
median_values_3857 = median_values.to_crs(epsg=3857)
median_values_3857.crs
median_values_3857.hvplot(c='market_value',
frame_width=600,
frame_height=500,
geo=True,
crs=3857, # NEW: specify the CRS
cmap='viridis',
hover_cols=['ZillowName'])
Let's add a tile source underneath the choropleth map
import geoviews as gv
import geoviews.tile_sources as gvts
%%opts WMTS [width=800, height=800, xaxis=None, yaxis=None]
choro = median_values.hvplot(c='market_value',
width=500,
height=400,
alpha=0.5,
geo=True,
cmap='viridis',
hover_cols=['ZillowName'])
gvts.ESRI * choro
print(type(gvts.ESRI))
%%opts WMTS [width=200, height=200, xaxis=None, yaxis=None]
(gvts.OSM +
gvts.Wikipedia +
gvts.StamenToner +
gvts.EsriNatGeo +
gvts.EsriImagery +
gvts.EsriUSATopo +
gvts.EsriTerrain +
gvts.CartoDark).cols(4)
Note: we've used the %%opts cell magic to apply styling options to any charts generated in the cell. See the documentation guide on customizations for more details.
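If you prefer not to use the cell magic, the same styling can be applied by calling .opts() on the objects directly. A rough equivalent of the cell above (a sketch, not from the original notebook):
# Apply the same styling without the %%opts cell magic
basemap = gvts.ESRI.opts(width=800, height=800, xaxis=None, yaxis=None)
basemap * choro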
You can do it with hvplot! Sort of.
data['x'] = data.geometry.x
data['y'] = data.geometry.y
data.head()
Extract the x and y coordinate columns and the associated market_value column, dropping the geometry column:
subdata = data[['x', 'y', 'market_value']]
type(subdata)
The hexbin function takes:
- a C column to aggregate for each bin (raw counts are shown if not provided)
- a reduce_function that determines how to aggregate the C column
data.hvplot.hexbin?
subdata.head()
subdata.hvplot.hexbin(x='x',
y='y',
C='market_value',
reduce_function=np.median,
logz=True,
geo=True,
gridsize=40,
cmap='viridis')
Not the prettiest but it gets the job done for some quick exploratory analysis!
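One easy way to dress it up (a sketch, using the same arguments as above): store the hexbin chart in a variable and layer a tile source underneath it, just as we did for the choropleth map.
# Overlay a basemap under the hexbin chart (illustrative)
hexmap = subdata.hvplot.hexbin(x='x',
                               y='y',
                               C='market_value',
                               reduce_function=np.median,
                               geo=True,
                               gridsize=40,
                               cmap='viridis')
gvts.CartoLight * hexmap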
Some very cool examples are available in the galleries.