Sep 15, 2021
Reminder: Links to course materials and main sites (Piazza, Canvas, Github) can be found on the home page of the main course website:
Recommended readings for the week listed here
Last time
Today
We'll use the Palmer penguins data set: data collected for three species of penguins at Palmer Station in Antarctica
Artwork by @allison_horst
import pandas as pd
from matplotlib import pyplot as plt
# Load data on Palmer penguins
penguins = pd.read_csv("./data/penguins.csv")
penguins.head(n=10)
import seaborn as sns
# Initialize the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# style keywords as dict
color_map = {"Adelie": "#1f77b4", "Gentoo": "#ff7f0e", "Chinstrap": "#D62728"}
style = dict(palette=color_map, s=60, edgecolor="none", alpha=0.75)
# use the scatterplot() function
sns.scatterplot(
x="flipper_length_mm", # the x column
y="bill_length_mm", # the y column
hue="species", # the third dimension (color)
data=penguins, # pass in the data
ax=ax, # plot on the axes object we made
**style # add our style keywords
)
# Format with matplotlib commands
ax.set_xlabel("Flipper Length (mm)")
ax.set_ylabel("Bill Length (mm)")
ax.grid(True)
ax.legend(loc='best')
The ** syntax is the unpacking operator. It will unpack the dictionary and pass each key-value pair to the function as a keyword argument.
So the previous code is the same as:
sns.scatterplot(
x="flipper_length_mm",
y="bill_length_mm",
hue="species",
data=penguins,
ax=ax,
palette=color_map, # defined in the style dict
s=60, # defined in the style dict
edgecolor="none", # defined in the style dict
alpha=0.75 # defined in the style dict
)
But we can use **style as a shortcut!
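To see the unpacking operator on its own, here is a minimal sketch that does not involve seaborn at all. The function `describe_marker` is a hypothetical stand-in for any function that accepts style keywords; it is not part of seaborn or matplotlib.

```python
def describe_marker(s=20, alpha=1.0, edgecolor="black"):
    """Hypothetical stand-in for a plotting function that accepts style keywords."""
    return f"s={s}, alpha={alpha}, edgecolor={edgecolor}"


# Style keywords collected in a dict, as in the scatterplot example
style = dict(s=60, edgecolor="none", alpha=0.75)

# Unpack the dict into keyword arguments...
unpacked = describe_marker(**style)

# ...which is equivalent to writing each keyword out by hand
explicit = describe_marker(s=60, edgecolor="none", alpha=0.75)

print(unpacked)
print(unpacked == explicit)
```

The two calls return the same string, which is why a shared style dict is a convenient way to reuse the same keywords across several plotting calls.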
In general, seaborn is fantastic for visualizing relationships between variables in a more quantitative way
Don't memorize every function...
I always look at the beautiful Example Gallery for ideas.
How about adding linear regression lines?
Use lmplot()
sns.lmplot(
x="flipper_length_mm",
y="bill_length_mm",
hue="species",
data=penguins,
height=6,
aspect=1.5,
palette=color_map,
scatter_kws=dict(edgecolor="none", alpha=0.5),
);
Use jointplot()
sns.jointplot(
x="flipper_length_mm",
y="bill_length_mm",
data=penguins,
height=8,
kind="kde",
cmap="viridis",
);
Use pairplot()
# The variables to plot
variables = [
"species",
"bill_length_mm",
"flipper_length_mm",
"body_mass_g",
"bill_depth_mm",
]
# Set the seaborn style
sns.set_context("notebook", font_scale=1.5)
# make the pair plot
sns.pairplot(
penguins[variables].dropna(),
palette=color_map,
hue="species",
plot_kws=dict(alpha=0.5, edgecolor="none"),
)
sns.catplot(x="species", y="bill_length_mm", hue="sex", data=penguins);
Great tutorial available in the seaborn documentation
The color_palette() function in seaborn is very useful. It is the easiest way to get a list of hex strings for a specific color map.
viridis = sns.color_palette("viridis", n_colors=7).as_hex()
print(viridis)
sns.palplot(viridis)
You can also create custom light, dark, or diverging color maps, based on the desired hues at either end of the color map.
sns.palplot(sns.diverging_palette(10, 220, sep=50, n=7))
import altair as alt
Important: Altair focuses on tidy data, so you'll often find yourself running pd.melt() to get to tidy format
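As a rough illustration of what "getting to tidy format" means, the sketch below melts a small wide table into one row per observation. The table and its values are made up for the example, and the column names `measurement` and `value` are arbitrary choices.

```python
import pandas as pd

# Hypothetical wide-format table: one column per measurement (made-up values)
wide = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo"],
        "bill_length_mm": [39.0, 46.0],
        "flipper_length_mm": [181.0, 211.0],
    }
)

# Tidy format: one row per (species, measurement) pair
tidy = wide.melt(id_vars="species", var_name="measurement", value_name="value")
print(tidy)
```

Each measurement column becomes a set of rows, so a single column (here `measurement`) can be mapped to a visual channel such as color.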
Let's try out our flipper length vs bill length example from last lecture...
# initialize the chart with the data
chart = alt.Chart(penguins)
# define what kind of marks to use
chart = chart.mark_circle(size=60)
# encode the visual channels
chart = chart.encode(
x="flipper_length_mm",
y="bill_length_mm",
color="species",
tooltip=["species", "flipper_length_mm", "bill_length_mm", "island", "sex"],
)
# make the chart interactive
chart.interactive()
Example: previous code is the same as
chart = chart.encode(
x=alt.X("flipper_length_mm"),
y=alt.Y("bill_length_mm"),
color=alt.Color("species"),
tooltip=alt.Tooltip(["species", "flipper_length_mm", "bill_length_mm", "island", "sex"]),
)
Use the alt.Scale() object to specify the scale
# initialize the chart with the data
chart = alt.Chart(penguins)
# define what kind of marks to use
chart = chart.mark_circle(size=60)
# encode the visual channels
chart = chart.encode(
x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species",
tooltip=["species", "flipper_length_mm", "bill_length_mm", "island", "sex"],
)
# make the chart interactive
chart = chart.interactive()
chart
For a complete list of these encodings, see the Encodings section of the documentation.
Altair charts can be fully specified as JSON $\rightarrow$ easy to embed in HTML on websites!
# Save the chart as a JSON string!
json = chart.to_json()
# Print out the first 1,000 characters
print(json[:1000])
chart.save("chart.html")
# Display IFrame in IPython
from IPython.display import IFrame
IFrame('chart.html', width=600, height=375)
chart = (
alt.Chart(penguins)
.mark_circle(size=60)
.encode(
x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species:N",
)
.interactive()
)
chart
Note that the interactive() call allows users to pan and zoom.
Altair is able to automatically determine the type of the variable using built-in heuristics. Altair and Vega-Lite support four primitive data types:
Data Type | Code | Description |
---|---|---|
quantitative | Q | Numerical quantity (real-valued) |
nominal | N | Name / Unordered categorical |
ordinal | O | Ordered categorical |
temporal | T | Date/time |
You can set the data type of a column explicitly using a one letter code attached to the column name with a colon:
Easily create multiple views of a dataset.
(
alt.Chart(penguins)
.mark_point()
.encode(
x=alt.X("flipper_length_mm:Q", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm:Q", scale=alt.Scale(zero=False)),
color="species:N"
).properties(
width=200, height=200
).facet(column="species").interactive()
)
Note: I've added the variable type identifiers (Q, N) to the previous example
Lots of features to create compound charts: repeated charts, faceted charts, vertical and horizontal stacking of subplots.
See the documentation for examples
A relatively new addition to Altair, Vega, and Vega-Lite. This allows you to define what happens when users interact with your visualization.
# create the selection box
brush = alt.selection_interval()
alt.Chart(penguins).mark_point().encode(
x=alt.X(
"flipper_length_mm", scale=alt.Scale(zero=False)
), # x
y=alt.Y(
"bill_length_mm", scale=alt.Scale(zero=False)
), # y
color=alt.condition(
brush, "species", alt.value("lightgray")
), # color
tooltip=["species", "flipper_length_mm", "bill_length_mm"],
).properties(
width=200, height=200, selection=brush
).facet(column="species")
We used the alt.condition() function to specify a conditional color for the markers. It takes three arguments:
- The brush object determines if a data point is selected
- If the point is inside the brush, color the marker according to the "species" column
- If the point is outside the brush, use the literal color "lightgray"

Let's examine the relationship between flipper_length_mm, bill_length_mm, and body_mass_g
We'll use a repeated chart that repeats variables across rows and columns.
Use a conditional color again, based on a brush selection.
# Setup the selection brush
brush = alt.selection(type='interval', resolve='global')
# Setup the chart
alt.Chart(penguins).mark_circle().encode(
x=alt.X(alt.repeat("column"), type='quantitative', scale=alt.Scale(zero=False)),
y=alt.Y(alt.repeat("row"), type='quantitative', scale=alt.Scale(zero=False)),
color=alt.condition(brush, 'species:N', alt.value('lightgray')), # conditional color
).properties(
width=200,
height=200,
selection=brush
).repeat( # repeat variables across rows and columns
row=['flipper_length_mm', 'bill_length_mm', 'body_mass_g'],
column=['body_mass_g', 'bill_length_mm', 'flipper_length_mm']
)
Let's explore the relationship between flipper length, body mass, and sex.
Scatter flipper length vs body mass for each species, colored by sex
alt.Chart(penguins).mark_point().encode(
x=alt.X('flipper_length_mm', scale=alt.Scale(zero=False)),
y=alt.Y('body_mass_g', scale=alt.Scale(zero=False)),
color=alt.Color("sex:N", scale=alt.Scale(scheme="Set2")),
).properties(
width=400, height=150
).facet(row='species')
I've specified the scale keyword to the alt.Color() object and passed a scheme value:
scale=alt.Scale(scheme="Set2")
Set2 is a ColorBrewer color scheme. The available color schemes are very similar to those in matplotlib. A list is available in the Vega documentation: https://vega.github.io/vega/docs/schemes/.
Next, plot the number of penguins per species by the island they are found on. Because the bars use stack='normalize', each bar shows the share of each species on that island.
(
alt.Chart(penguins)
.mark_bar()
.encode(
x=alt.X('*:Q', aggregate='count', stack='normalize'),
y='island:N',
color='species:N',
tooltip=['island','species', 'count(*):Q']
)
)
Plot a histogram of number of penguins by flipper length, grouped by species.
(
alt.Chart(penguins)
.mark_bar()
.encode(
x=alt.X('flipper_length_mm', bin=alt.Bin(maxbins=20)),
y='count():Q', #shorthand
color='species',
tooltip=['species', alt.Tooltip('count()', title='Number of Penguins')]
).properties(height=250)
)
Finally, let's bin the data by body mass and plot the average flipper length per bin, colored by the species.
(
alt.Chart(penguins.dropna())
.mark_line()
.encode(
x=alt.X("body_mass_g:Q", bin=alt.Bin(maxbins=10)),
y=alt.Y('mean(flipper_length_mm):Q', scale=alt.Scale(zero=False)), # apply a mean to the flipper length in each bin
color='species:N',
tooltip=['mean(flipper_length_mm):Q', "count():Q"]
).properties(height=300, width=500)
)
In addition to mean() and count(), you can apply a number of different transformations to the data before plotting, including binning, arbitrary functions, and filters.
See the Data Transformations section of the user guide for more details.
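To make the "bin, then aggregate" idea concrete, here is a rough pandas analogue of binning by body mass and averaging flipper length per bin. It is only a sketch: the data are made up, the bin edges are chosen by hand, and Altair picks its own bin boundaries from `maxbins`, so the bins will not match the chart exactly.

```python
import pandas as pd

# Hypothetical measurements (made-up values, not the real penguins data)
df = pd.DataFrame(
    {
        "species": ["Adelie"] * 4 + ["Gentoo"] * 4,
        "body_mass_g": [3000, 3400, 4200, 4600, 4400, 4800, 5400, 5800],
        "flipper_length_mm": [180, 184, 190, 194, 208, 212, 220, 224],
    }
)

# Assign each penguin to a body-mass bin (hand-picked edges for this sketch)
df["mass_bin"] = pd.cut(
    df["body_mass_g"], bins=[3000, 4000, 5000, 6000], include_lowest=True
)

# Mean flipper length and number of penguins per species and bin
binned = (
    df.groupby(["species", "mass_bin"], observed=True)["flipper_length_mm"]
    .agg(["mean", "count"])
    .reset_index()
)
print(binned)
```

In Altair the same two steps are expressed declaratively: `bin=alt.Bin(...)` on the x channel and `mean(...)` in the y shorthand.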
# Setup a brush selection
brush = alt.selection(type='interval')
# The top scatterplot: flipper length vs bill length
points = (
alt.Chart()
.mark_point()
.encode(
x=alt.X('flipper_length_mm:Q', scale=alt.Scale(zero=False)),
y=alt.Y('bill_length_mm:Q', scale=alt.Scale(zero=False)),
color=alt.condition(brush, 'species:N', alt.value('lightgray'))
).properties(
selection=brush,
width=800
)
)
# the bottom bar plot
bars = (
alt.Chart()
.mark_bar()
.encode(
x='count(species):Q',
y='species:N',
color='species:N',
).transform_filter(
brush.ref() # the filter transform uses the selection to filter the input data to this chart
).properties(width=800)
)
chart = alt.vconcat(points, bars, data=penguins) # vertical stacking
chart
Exercise: let's reproduce this famous Wall Street Journal visualization showing measles incidence over time.
# Print out the current working directory
%pwd
# List the files in the current working directory
%ls
path = './data/measles_incidence.csv' # this is a relative path
data = pd.read_csv(path, skiprows=2, na_values='-')
data.head()
Note: data is weekly
Hints
- Use the groupby() then sum() workflow to get the annual totals.
- Drop the WEEK column; you don't need that in the grouping operation.
# drop week first
annual = data.drop('WEEK', axis=1)
grped = annual.groupby('YEAR')
print(grped)
annual = grped.sum()
annual
You can use melt() to get tidy data. You should have 3 columns: year, state, and total incidence.
measles = annual.reset_index()
measles.head()
measles = measles.melt(id_vars='YEAR', value_name="incidence", var_name="state")
measles.head(n=10)
Use the mark_rect() function to encode the values as rectangles and then color them according to the average annual measles incidence per state.
You'll want to take advantage of the custom color map defined below to best match the WSJ's graphic.
# Define a custom colormap using hex codes & HTML color names
colormap = alt.Scale(
domain=[0, 100, 200, 300, 1000, 3000],
range=[
"#F0F8FF",
"cornflowerblue",
"mediumseagreen",
"#FFEE00",
"darkorange",
"firebrick",
],
type="sqrt",
)
See the documentation for more information.
For data sources with more than 5,000 rows, you'll need to run the code below for Altair to work; it forces Altair to save a local copy of the data.
alt.data_transformers.enable('json')
# Heatmap of YEAR vs state, colored by incidence
chart = (
alt.Chart(measles)
.mark_rect()
.encode(
x=alt.X("YEAR:O", axis=alt.Axis(title=None, ticks=False)),
y=alt.Y("state:N", axis=alt.Axis(title=None, ticks=False)),
color=alt.Color("incidence:Q", sort="ascending", scale=colormap),
tooltip=["state", "YEAR", "incidence"],
)
.properties(width=700, height=500)
)
chart
threshold = pd.DataFrame([{"threshold": 1963}])
threshold
# Vertical line for vaccination year
rule = alt.Chart(threshold).mark_rule(strokeWidth=4).encode(x="threshold:O")
chart + rule
Note: I've used the "+" shorthand operator for layering two charts on top of each other — see the documentation on Layered Charts for more info!
The categorical color scale choice is probably not the best. It's best to use a perceptually uniform color scale like viridis. See below:
# Heatmap of YEAR vs state, colored by incidence
chart = (
alt.Chart(measles)
.mark_rect()
.encode(
x=alt.X("YEAR:O", axis=alt.Axis(title=None, ticks=False)),
y=alt.Y("state:N", axis=alt.Axis(title=None, ticks=False)),
color=alt.Color(
"incidence:Q",
sort="ascending",
scale=alt.Scale(scheme="viridis"),
legend=None,
),
tooltip=["state", "YEAR", "incidence"],
)
.properties(width=700, height=450)
)
# Vertical line for vaccination year
rule = (
alt.Chart(threshold).mark_rule(strokeWidth=4, color="white").encode(x="threshold:O")
)
chart + rule
# The heatmap
chart = (
alt.Chart(measles)
.mark_rect()
.encode(
x=alt.X("YEAR:O", axis=alt.Axis(title=None, ticks=False)),
y=alt.Y("state:N", axis=alt.Axis(title=None, ticks=False)),
color=alt.Color(
"incidence:Q",
sort="ascending",
scale=alt.Scale(scheme="viridis"),
legend=None,
),
tooltip=["state", "YEAR", "incidence"],
)
.properties(width=700, height=400)
)
# The annual average
annual_avg = (
alt.Chart(measles)
.mark_line()
.encode(
x=alt.X("YEAR:O", axis=alt.Axis(title=None, ticks=False)),
y=alt.Y("mean(incidence):Q", axis=alt.Axis(title=None, ticks=False)),
)
.properties(width=700, height=200)
)
# Add the vertical line
rule = (
alt.Chart(threshold).mark_rule(strokeWidth=4, color="white").encode(x="threshold:O")
)
# Combine everything
alt.vconcat(annual_avg, chart + rule)