Sep 13, 2021
Office hours survey: https://www.surveymonkey.com/r/TCKNWTX
Questions / concerns?
Guides to installing Python, using conda for managing packages, and working with Jupyter notebook on course website:
Piazza post walking through some tips for managing the folder structure on your laptop:
Part 1
Part 2
import pandas as pd
from matplotlib import pyplot as plt
Load citywide Zillow Rent Index (ZRI) and Zillow Home Value Index (ZHVI) data.
Files were downloaded from https://www.zillow.com/research/data/
home_values = pd.read_csv("data/zillow/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv")
rent_values = pd.read_csv("data/zillow/Metro_ZORI_AllHomesPlusMultifamily_SSA.csv")
Peek at the first few rows of the ZRI data:
rent_values.head()
And do the same for the ZHVI data:
home_values.head()
# The cities we want to keep
valid_cities = [
    "New York, NY",
    "Chicago, IL",
    "Los Angeles-Long Beach-Anaheim, CA",
    "Philadelphia, PA",
    "Houston, TX",
    "Phoenix, AZ",
]

# Trim the home value data to these cities
selection = home_values['RegionName'].isin(valid_cities)
home_values_trimmed = home_values.loc[selection]

# Trim the rent data to these cities
selection = rent_values['RegionName'].isin(valid_cities)
rent_values_trimmed = rent_values.loc[selection]
rent_values_trimmed
Unwanted columns can be dropped from the data frame using the drop() function.
Note that the column axis is the second axis (axis=1), and if you wanted to remove rows, you could use the first axis (axis=0).
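As a quick illustration of dropping along each axis (a small, made-up data frame, not our Zillow data):
# A small, made-up data frame
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)

# Drop columns "B" and "C" (axis=1)
df.drop(["B", "C"], axis=1)

# Drop the row labeled "x" (axis=0)
df.drop(["x"], axis=0)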
# Columns to drop from the home value data
cols_to_drop = ['SizeRank', 'RegionID', 'RegionType', 'StateName']
home_values_final = home_values_trimmed.drop(cols_to_drop, axis=1)

# Columns to drop from the rent data
cols_to_drop = ['SizeRank', 'RegionID']
rent_values_final = rent_values_trimmed.drop(cols_to_drop, axis=1)
rent_values_final
Currently, our data is in wide format $\rightarrow$ each observation (here, each month) has its own column. This usually results in many columns but few rows.
home_values_final
Usually it's better to have data in tidy (also known as long) format.
Tidy datasets are arranged such that each variable is a column and each observation is a row. In our case, we want to have a column called ZRI and one called ZHVI, and a row for each month that the indices were measured.
pandas provides the melt() function for converting from wide format to tidy format.
melt() doesn't aggregate or summarize the data; it transforms it into a different shape, but it contains the exact same information as before.
Imagine you have 6 rows of data (each row is a unique city) with 10 columns of home values (each column is a different month). That is wide data and is the format usually seen in spreadsheets or tables in a report.
If you melt() that wide data, you would get a table with 60 rows and 3 columns. Each row would contain the city name, the month, and the home value for that city and month. This tidy-formatted data contains the same info as the wide data, but in a different form.
This animation shows the transformation from wide to long / long to wide. You can ignore gather() and spread() - those are the R versions of the pandas functions.
pd.melt?
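As a toy illustration of the reshaping (made-up values, not our real data), a small wide table melts like this:
# A made-up wide table: one row per city, one column per month
wide = pd.DataFrame({
    "RegionName": ["New York, NY", "Chicago, IL"],
    "2021-01": [500000, 300000],
    "2021-02": [505000, 301000],
})

# Melt to tidy format: one row per (city, month) pair
tidy = pd.melt(wide, id_vars=["RegionName"], var_name="Date", value_name="ZHVI")
tidy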
Now, let's melt our datasets:
ZHVI = pd.melt(
    home_values_final,
    id_vars=["RegionName"],
    value_name="ZHVI",
    var_name="Date",
)

ZRI = pd.melt(
    rent_values_final,
    id_vars=["RegionName"],
    value_name="ZRI",
    var_name="Date",
)
and take a look:
ZRI.tail()
ZHVI.head()
Another common operation is merging (also known as joining) two datasets.
We can use the merge() function to merge observations that have the same Date and RegionName values.
- ZHVI has the Date column in the format YYYY-MM-DD
- ZRI has the Date column in the format YYYY-MM
We need to put them into the same format before merging the data!
We can fix this by creating Datetime objects and formatting the dates into the same format.
Currently our Date column is stored as a string.
pandas includes additional functionality for dates, but first we must convert the strings using the to_datetime() function.
# Convert the Date column to Datetime objects
ZHVI["Date"] = pd.to_datetime(ZHVI["Date"])
The strftime() function
We can use the ".dt" property of the Date column to access datetime functions of the new Datetime column.
For converting to strings in a certain format, we can use the strftime() function (docs). This uses a special syntax to convert the date object to a string with a specific format.
Important reference: Use this strftime guide to look up the syntax!
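A few common format codes (standard strftime syntax, shown here with a made-up date):
# A single Timestamp for illustration
example = pd.to_datetime("2021-09-13")

example.strftime("%Y-%m")      # '2021-09'
example.strftime("%Y-%m-%d")   # '2021-09-13'
example.strftime("%B %d, %Y")  # 'September 13, 2021'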
# Extract YYYY-MM string
date_strings = ZHVI["Date"].dt.strftime("%Y-%m")
# First entry is a string!
date_strings.iloc[0]
# Add the strings back as a column
ZHVI["Date"] = date_strings
ZHVI.head()
# Left dataframe is ZRI
# Right dataframe is ZHVI
zillow_data = pd.merge(ZRI, ZHVI, on=['Date', 'RegionName'], how='outer')
# Let's sort the data by Date
zillow_data = zillow_data.sort_values("Date", ascending=True)
zillow_data
Merging is very powerful and the merge can be done in a number of ways. In this case, we did an outer merge in order to keep all parts of each dataframe. By contrast, an inner merge only keeps the overlapping intersection of the two dataframes.
See the infographic on joining in this repository.
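To see the difference on a minimal, made-up example: the outer merge keeps every Date from either table (filling missing values with NaN), while an inner merge keeps only the Dates present in both.
left = pd.DataFrame({"Date": ["2021-01", "2021-02"], "ZRI": [2500, 2550]})
right = pd.DataFrame({"Date": ["2021-02", "2021-03"], "ZHVI": [300000, 310000]})

# Outer merge: 3 rows, with NaNs where a table has no data for that Date
pd.merge(left, right, on="Date", how="outer")

# Inner merge: 1 row, only the shared Date "2021-02"
pd.merge(left, right, on="Date", how="inner")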
# Convert the Date column back to a Datetime
zillow_data["Date"] = pd.to_datetime(zillow_data["Date"])
Quick trick: Series that hold Datetime objects have a dt attribute that lets you grab parts of the date easily.
For example, we can easily add new columns for the month and year using:
# Note that the dtype is now datetime64[ns]
zillow_data['Date'].head()
# Extract out the month and year of each date
# Add them to the data frame as new columns!
zillow_data['Month'] = zillow_data['Date'].dt.month
zillow_data['Year'] = zillow_data['Date'].dt.year
zillow_data.head()
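A few other handy attributes hang off the same dt accessor (not needed below, just for reference):
# Quarter of the year (1-4), day of the week (Monday=0), and day name
zillow_data["Date"].dt.quarter.head()
zillow_data["Date"].dt.dayofweek.head()
zillow_data["Date"].dt.day_name().head()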
Grouping by Year
pandas is especially useful for grouping and aggregating data via the groupby() function.
From the pandas documentation, groupby means splitting the data into groups based on some criteria, applying a function to each group independently, and combining the results into a data structure. The documentation is available here.
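A minimal, made-up example of this split / apply / combine pattern:
# A small, made-up data frame for illustration
df = pd.DataFrame({
    "city": ["NYC", "NYC", "Chicago", "Chicago"],
    "value": [10, 20, 3, 5],
})

# Split the rows by city, apply mean() to each group, combine into a new Series
df.groupby("city")["value"].mean()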
We can calculate annual averages by grouping by the RegionName and Year columns and taking the mean of our desired column. For example:
# calculate mean values for each Year and City (RegionName)
annual_ZHVI = zillow_data.groupby(['RegionName', 'Year'])['ZHVI'].mean()
annual_ZRI = zillow_data.groupby(['RegionName', 'Year'])['ZRI'].mean()
print(type(annual_ZHVI))
annual_ZHVI.head()
Important: The result of the groupby operation is always indexed by the group keys!
In this case, the result is indexed by the columns we grouped by (RegionName and Year).
We can reset the index so that the index values are listed as columns in the data frame again.
annual_ZHVI = annual_ZHVI.reset_index()
annual_ZRI = annual_ZRI.reset_index()
annual_ZHVI.head(n=50)
with plt.style.context("ggplot"):

    # Create figure and axes
    fig, ax = plt.subplots(figsize=(10, 6))

    # Plot for each unique city
    for city in annual_ZHVI["RegionName"].unique():

        # Select the data for this city
        selection = annual_ZHVI["RegionName"] == city
        df = annual_ZHVI.loc[selection]

        # Plot
        ax.plot(df["Year"], df["ZHVI"] / 1e3, label=city, linewidth=4)

    # Format the axes
    ax.set_ylim(50, 800)
    ax.legend(loc=0, ncol=2, fontsize=12)
    ax.set_ylabel("Zillow Home Value Index\n(in thousands of dollars)")
with plt.style.context('ggplot'):

    # Create the figure and axes
    fig, ax = plt.subplots(figsize=(10, 6))

    # Loop over the cities to plot each one
    for city in annual_ZRI['RegionName'].unique():

        # Select the city data
        selection = annual_ZRI['RegionName'] == city
        df = annual_ZRI.loc[selection]

        # Plot
        ax.plot(df['Year'], df['ZRI'], label=city, linewidth=4)

    # Format
    ax.set_ylim(1000, 3300)
    ax.legend(loc=0, ncol=2)
    ax.set_ylabel('Zillow Rent Index (in dollars)')
Starting with two of my favorite historical examples, and their modern renditions...
See http://projects.flowingdata.com/atlas, by Nathan Yau
ggplot2 provides an R implementation of The Grammar of Graphics. See, e.g., Data Sketches.
A graphic by Simon Scarr in 2011:
The same data, but different design choices...
Some recent examples...
Lots of companies, cities, institutions, etc. have started design guidelines to improve and standardize their data visualizations.
One I particularly like: City of London Data Design Guidelines
The first few pages are listed in the "Recommended Reading" portion of this week's README.
London's style guide includes some basic data viz principles that everyone should know and includes the following example:
Choose your colors carefully:
A number of well-designed color maps are included in matplotlib and available by default. For quantitative data, these color maps are very strong options.
Almost too many tools available...
So many tools...so little time
You'll use different packages to achieve different goals, and they each have different things they are good at.
Today, we'll focus on:
And next week for geospatial data:
Goal: introduce you to the most common tools and enable you to know the best package for the job in the future
We'll use the object-oriented interface to matplotlib:
- First, create the Figure and Axes objects
- Then, plot by calling functions of the Axes object
- Finally, customize the result via the Figure or Axes objects
A minimal sketch of this pattern follows the imports below.
import numpy as np
from matplotlib import pyplot as plt
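A minimal sketch of the object-oriented pattern (with made-up data, not part of our analysis):
# Create the Figure and Axes objects
fig, ax = plt.subplots(figsize=(6, 4))

# Plot by calling functions of the Axes object
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], marker="o")

# Customize via functions of the Axes object
ax.set_xlabel("x")
ax.set_ylabel("x squared")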
# Generate some random data using numpy (numbers between -1 and 1)
# Shape is (100, 100)
data = 2 * np.random.random(size=(100,100)) - 1
print(data.min(), data.max(), data.mean())
plt.pcolormesh(data, cmap='viridis')
plt.pcolormesh(data, cmap='jet')
plt.pcolormesh(data, cmap='coolwarm')
Important bookmark: Choosing Color Maps in Matplotlib
# print out all available color map names
print(len(plt.colormaps()))
We'll use the Palmer penguins data set: data collected for three species of penguins at Palmer Station in Antarctica.
Artwork by @allison_horst
# Load data on Palmer penguins
penguins = pd.read_csv("./data/penguins.csv")
penguins.head(n=10)
Data is already in tidy format
I want to scatter flipper length vs. bill length, colored by the penguin species
# Initialize the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# Color for each species
color_map = {"Adelie": "#1f77b4", "Gentoo": "#ff7f0e", "Chinstrap": "#D62728"}

# Group the data frame by species and loop over each group
# NOTE: "group" will be the dataframe holding the data for "species"
for species, group in penguins.groupby("species"):

    print(f"Plotting {species}...")

    # Plot flipper length vs bill length for this group
    ax.scatter(
        group["flipper_length_mm"],
        group["bill_length_mm"],
        marker="o",
        label=species,
        color=color_map[species],
        alpha=0.75,
    )

# Format the axes
ax.legend(loc="best")
ax.set_xlabel("Flipper Length (mm)")
ax.set_ylabel("Bill Length (mm)")
ax.grid(True)
Can we do this with pandas?
# Tab complete on the plot attribute of a dataframe to see the available functions
# penguins.plot.scatter?
# Initialize the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# Calculate a list of colors
color_map = {"Adelie": "#1f77b4", "Gentoo": "#ff7f0e", "Chinstrap": "#D62728"}
colors = [color_map[species] for species in penguins["species"]]

# Scatter plot two columns, colored by a third
penguins.plot.scatter(
    x="flipper_length_mm",
    y="bill_length_mm",
    c=colors,
    alpha=0.75,
    ax=ax,  # Plot on the axes object we created already!
)

# Format
ax.set_xlabel("Flipper Length (mm)")
ax.set_ylabel("Bill Length (mm)")
ax.grid(True)
Note: there's no easy way to add a legend to the plot in this case...
pandas plotting capabilities are good for quick and unpolished plots during the data exploration phase.
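For example, a quick, unpolished histogram of one of the penguins columns takes a single line:
# Quick exploratory histogram of flipper lengths
penguins["flipper_length_mm"].plot.hist(bins=20)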