9/8/2021
To follow along today, go to https://github.com/MUSA-550-Fall-2021/week-1
The following line imports the pandas package:
import pandas as pd
The primary objects in pandas are the:
- DataFrame, which is very similar to an Excel spreadsheet, and has rows and named columns
- Series, which represents a single column of data. A DataFrame contains one or more Series and a name for each Series.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in R.
You can think of Series objects as fancier versions of Python's built-in list data type (we'll see one example of this below).
To create a Series object:
# city names and population
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
city_names
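One way a Series is "fancier" than a plain Python list: arithmetic operations apply element-wise to every value at once. A minimal sketch using the population Series from above:
# vectorized arithmetic: divides every value in the Series by 1e6
# (a plain Python list would raise a TypeError here)
population / 1e6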
DataFrame objects can be created by passing a dict mapping string column names to their respective Series.
cities_df = pd.DataFrame({'City Name': city_names,
                          'Population': population})
cities_df
Note: always try to give your variables meaningful names. It will help immensely when you are trying to debug problems or when you're trying to understand the code you've written months later.
# access columns with a dict-like syntax
cities_df['Population']
# list slicing: get the first two elements (indices 0 and 1, but NOT 2)
cities_list = ['San Francisco', 'San Jose', 'Sacramento']
cities_list[:2]
Unfortunately, the slicing functionality for built-in lists is fairly limited...but pandas has many more features!
# slicing data frame rows is very similar!
cities_df[:2]
pandas includes functionality for many different ways of selecting data. See the documentation for many more examples.
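As a quick preview of a few other selection patterns (we'll cover label- and position-based selection in more detail below), here is a minimal sketch using the cities_df frame from above:
# select multiple columns by passing a list of column names
cities_df[['City Name', 'Population']]
# select a row by its index label
cities_df.loc[0]
# select a row by its integer position
cities_df.iloc[-1]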
cities_df['Population'].median()
NumPy is a popular toolkit for scientific computing.
pandas Series can be used as arguments to most NumPy functions:
import numpy as np
# calculate the median population value
np.median(cities_df['Population'])
For more complex single-column transformations, you can use Series.apply. It accepts a function that is applied to each value in the Series.
For example, we can find which cities have a population greater than a million:
# define our function
def get_large_cities(population):
    return population > 1e6
large_cities_sel = cities_df['Population'].apply(get_large_cities)
large_cities_sel
# add the new computed column to our original data frame
cities_df["Large Cities"] = large_cities_sel
# we can also use lambda (unnamed, inline) functions
small_cities_func = lambda population: population < 1e6

cities_df["Small Cities"] = cities_df["Population"].apply(
    small_cities_func
)
# print out
cities_df
We can select the "large" cities by passing the boolean values to the .loc[] indexer of a DataFrame:
cities_df['Large Cities']
cities_df.loc[cities_df['Large Cities']]
cities_df['Population'] > 1e6
# this is equivalent to doing
cities_df.loc[cities_df['Population'] > 1e6]
Use the Python tilde operator to do a logical not operation:
# reverse the large cities boolean selection
cities_df.loc[~large_cities_sel]
# or equivalently:
# NOTE: you need to put the whole expression in () and then apply the tilde!
cities_df.loc[~ (cities_df['Population'] > 1e6) ]
We can also use the pandas query() function.
The query() function will return a subset of your dataframe based on a string version of the boolean expression.
cities_df.query("Population < 1e6")
The documentation is your best friend! Use the question mark operator!
# use the question mark
pd.DataFrame.loc?
Let's load census data on rental rates from the 2019 American Community Survey.
We've stored the data locally in the comma-separated value (CSV) format. Later in the course, you'll learn how to download them directly using Python.
Use the read_csv()
function from pandas. The first argument to read_csv()
is the file path to load.
census_df = pd.read_csv("./data/census/ACS_19_1YR_B25003.csv")
census_df.head(n=5)
Select certain columns and rename them.
## FIRST STEP: let's trim to the columns we want
# columns we want
columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
census_df = census_df[columns]
# Peek at the dataframe with trimmed columns
census_df.head()
## STEP 2: Let's rename the columns!
# rename columns
census_df.columns = ['City', 'Total', 'Rentals']
# Return the head (first 5 rows) from the cell
census_df.head(n=5)
# Step 3: Add the rental rate as a new column
census_df['Rental Rate'] = census_df['Rentals'] / census_df['Total'] * 100
census_df.head()
# How many rows are in the dataframe? --> use the len() function
len(census_df)
We can select the largest cities by population using the Series.isin() function, which checks if each value in the Series is in the specified list.
# Define the list of city names we want
top_cities_list = [
    "Philadelphia city, Pennsylvania",
    "New York city, New York",
    "Los Angeles city, California",
    "Houston city, Texas",
    "Chicago city, Illinois",
    "Phoenix city, Arizona",
]
# Use isin() to test whether each row's City value is in the list
census_df["City"].isin(top_cities_list)
# Now do the selection!
census_df.loc[census_df["City"].isin(top_cities_list)] # select the valid rows
Only 6 rows as expected!
Alternatively, we could use the query() function. The syntax is a bit different (and I always forget and have to look at the documentation).
In the query() string, you can reference existing variables by prefacing the variable name with an "@" symbol.
For example:
# Get the rows where the City column is in "top_cities_list"
census_df.query("City in @top_cities_list")
Python uses the "%" operator to insert variable values into strings. For example, the file name of the data we want to load for 2019 is:
year_str = "19"
path = "./data/census/ACS_%s_1YR_B25003.csv" % year_str
print(path)
See this guide on this type of string formatting for more info.
Python version 3.6 introduced a new and improved string formatting syntax, where the variable that is being inserted can be referenced directly using the variable name.
The variable name should be enclosed in curly braces inside the string:
year_str = "19"
path = f"./data/census/ACS_{year_str}_1YR_B25003.csv"
print(path)
For more info: see this guide
Let's use a for loop to loop over each year value from 2005 to 2019:
all_rental_data = []
start_year = 2005
stop_year = 2019

for year in range(start_year, stop_year + 1):

    #---------------------------------------------
    # Step 0: Print out year
    # The variable year changes for each iteration of the loop
    #----------------------------------------------
    print(f"Loading data for year = {year}...")

    #---------------------------------
    # Step 1: Read data for this year
    #---------------------------------
    year_str = str(year)  # convert integer value of "year" to a string
    year_str = year_str[2:]  # extract the last two digits of the year string
    rental_df = pd.read_csv(f"./data/census/ACS_{year_str}_1YR_B25003.csv")

    #---------------------------------
    # Step 2: Columns we want
    #---------------------------------
    columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
    rental_df = rental_df[columns]

    #---------------------------------
    # Step 3: Rename columns
    #---------------------------------
    rental_df.columns = ['City', 'Total', 'Rentals']

    #----------------------------------
    # Step 4: Calculate the rental rate
    #----------------------------------
    rental_df['Rental Rate'] = rental_df['Rentals'] / rental_df['Total'] * 100

    #----------------------------------
    # Step 5: Select the cities we want
    #----------------------------------
    selection = rental_df['City'].isin(top_cities_list)
    top_cities_df = rental_df.loc[selection].copy()

    #----------------------------------
    # Step 6: Add a column for the year
    #----------------------------------
    top_cities_df['Year'] = year

    #-----------------------------------
    # Step 7: Save this dataframe in the list
    #-----------------------------------
    all_rental_data.append(top_cities_df)

#---------------------------------------------------
# Step 8: After the for loop, combine all of the
# data frames into one along the row axis
#---------------------------------------------------
all_rental_data = pd.concat(all_rental_data, axis=0)
# Note: the pd.concat() took a list of dataframes and converted them
# into a single dataframe!
type(all_rental_data)
all_rental_data.head()
# What about the tail?
all_rental_data.tail()
# How many rows? Hopefully (6 cities x 15 years = 90!)
len(all_rental_data)
matplotlib
Much more to come next week. For now, we'll use some very simple plotting commands with Python's main plotting library, matplotlib.
Import the library we'll use from matplotlib:
from matplotlib import pyplot as plt
We'll use the plot() function, which plots a simple x vs y line. We'll call the plot() function on the sub-selection of the data frame for each city.
To find the unique city names in the 'City' column, we can use the unique() function:
all_rental_data['City'].unique()
with plt.style.context('ggplot'):

    # Step 1: Create a figure and axes to plot on
    fig, ax = plt.subplots(figsize=(10, 6))

    # Step 2: Loop over each city and plot
    for city in all_rental_data['City'].unique():

        # Print out the city in each loop iteration
        print(city)

        # Select data for this city
        this_city = all_rental_data['City'] == city
        city_df = all_rental_data.loc[this_city]

        # Plot Year vs Rental Rate
        ax.plot(city_df['Year'], city_df['Rental Rate'], label=city, linewidth=4)

    # Add a legend
    ax.legend(loc=0, ncol=3, fontsize=12)

    # Set the y limit
    ax.set_ylim(30, 72)

    # Add a y label
    ax.set_ylabel("Rental Rate in Percent");
Note: we used the "ggplot" theme to style our plots. Matplotlib comes with a number of built-in styles: see all of the examples here.
# trim to just Philadelphia first, using this boolean selection index
all_rental_data['City'] == 'Philadelphia city, Pennsylvania'
# select Philadelphia only
philly = all_rental_data.loc[all_rental_data['City']=='Philadelphia city, Pennsylvania']
philly
Note how the first row of the above dataframe has an index label of 393. This means we can access it using the .loc[] function like:
philly.loc[393]
This is different than the .iloc[] function, which indexes based on the integer position of the row, regardless of what the labels in the index are. For example, the first row of a dataframe can always be accessed using:
philly.iloc[0]
For more details on the iloc() function, see the documentation on indexing by position.
We can reset the index labels so they range from 0 to the length of the dataframe, using the reset_index() function. For example:
philly.reset_index(drop=True)
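Note that by default reset_index() keeps the old index labels around as a new column; passing drop=True (as above) discards them instead. A minimal sketch:
# without drop=True, the old index labels become a new 'index' column
philly.reset_index()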
Next, to calculate our percent difference we need to get the values for Philadelphia in 2005 and 2019:
# select 2005
philly_2005 = philly.loc[philly['Year']==2005]
philly_2005
# Notice: because we have a dataframe with a length of 1, we don't get back just a number
philly_2005["Rental Rate"]
len(philly_2005["Rental Rate"])
Notice that philly_2005 has only a single row. It would be more useful to have this as a scalar (just a single number).
We can use the .squeeze() function to do this. It does just what it sounds like: if you have a DataFrame with only one row, it will "squeeze" the row dimension by removing it, returning just a Series object:
philly_2005 = philly.loc[philly['Year']==2005].squeeze()
philly_2005
value_2005 = philly_2005["Rental Rate"]
value_2005
# Do the same thing for 2019 — make sure you squeeze!
philly_2019 = philly.loc[philly['Year']==2019].squeeze()
# Get the value
value_2019 = philly_2019['Rental Rate']
value_2019
# calculate percent change
percent_change = (value_2019 - value_2005)/ value_2005 * 100
print(f"change in rental rate in Philadelphia from 2005 to 2019 = {percent_change}")
Load citywide Zillow Rent Index (ZRI) and Zillow Home Value Index (ZHVI) data.
Files were downloaded from https://www.zillow.com/research/data/
home_values = pd.read_csv("data/zillow/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv")
rent_values = pd.read_csv("data/zillow/Metro_ZORI_AllHomesPlusMultifamily_SSA.csv")
Peek at the first few rows of the ZRI data:
rent_values.head()
And do the same for the ZHVI data:
home_values.head()
valid_cities = [
    "New York, NY",
    "Chicago, IL",
    "Los Angeles-Long Beach-Anaheim, CA",
    "Philadelphia, PA",
    "Houston, TX",
    "Phoenix, AZ",
]
selection = home_values['RegionName'].isin(valid_cities)
home_values_trimmed = home_values.loc[selection]
selection = rent_values['RegionName'].isin(valid_cities)
rent_values_trimmed = rent_values.loc[selection]
rent_values_trimmed
Unwanted columns can be dropped from the data frame using the drop() function.
Note that the column axis is the second axis (axis=1), and if you wanted to remove rows, you could use the first axis (axis=0).
columns = ['SizeRank', 'RegionID', 'RegionType', 'StateName']
home_values_final = home_values_trimmed.drop(columns, axis=1)
columns = ['SizeRank', 'RegionID']
rent_values_final = rent_values_trimmed.drop(columns, axis=1)
rent_values_final
Currently, our data is in wide format $\rightarrow$ each observation has its own column. This usually results in many columns but few rows.
home_values_final
Usually it's better to have data in tidy (also known as long) format.
Tidy datasets are arranged such that each variable is a column and each observation is a row.
In our case, we want to have a column called ZRI and one called ZHVI, and a row for each month that the indices were measured.
pandas provides the melt() function for converting from wide formats to tidy formats.
melt() doesn't aggregate or summarize the data. It transforms it into a different shape, but it contains the exact same information as before.
Imagine you have 6 rows of data (each row is a unique city) with 10 columns of home values (each column is a different month). That is wide data and is the format usually seen in spreadsheets or tables in a report.
If you melt() that wide data, you would get a table with 60 rows and 3 columns. Each row would contain the city name, the month, and the home value for that city and month. This tidy-formatted data contains the same info as the wide data, but in a different form.
This animation shows the transformation from wide to long / long to wide. You can ignore gather() and spread() - those are the R versions of the pandas functions.
pd.melt?
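To make the reshaping concrete, here is a minimal sketch using a tiny, made-up wide table (the city names and values below are invented for illustration):
# a tiny wide table: one row per city, one column per month
wide = pd.DataFrame({
    "RegionName": ["Philadelphia, PA", "Houston, TX"],
    "2021-01": [1400.0, 1200.0],
    "2021-02": [1410.0, 1210.0],
})

# melt to tidy format: one row per (city, month) observation
pd.melt(wide, id_vars=["RegionName"], var_name="Date", value_name="Value")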
Now, let's melt our datasets:
ZHVI = pd.melt(
    home_values_final,
    id_vars=["RegionName"],
    value_name="ZHVI",
    var_name="Date"
)
ZRI = pd.melt(
    rent_values_final,
    id_vars=["RegionName"],
    value_name="ZRI",
    var_name="Date"
)
and take a look:
ZRI.tail()
ZHVI.head()
Another common operation is merging, also known as joining, two datasets.
We can use the merge() function to merge observations that have the same Date and RegionName values.
- ZHVI has the Date column in the format YYYY-MM-DD
- ZRI has the Date column in the format YYYY-MM

We need to put them into the same format before merging the data!
We can fix this by creating Datetime objects and formatting the dates into the same format.
Currently our Date column is stored as a string.
pandas includes additional functionality for dates, but first we must convert the strings using the to_datetime() function.
# Convert the Date column to Datetime objects
ZHVI["Date"] = pd.to_datetime(ZHVI["Date"])
The strftime function
We can use the ".dt" property of the Date column to access datetime functions of the new Datetime column.
For converting to strings in a certain format, we can use the "strftime" function (docs). This uses a special syntax to convert the date object to a string with a specific format.
Important reference: Use this strftime guide to look up the syntax!
# Extract YYYY-MM string
date_strings = ZHVI["Date"].dt.strftime("%Y-%m")
# First entry is a string!
date_strings.iloc[0]
# Add the strings back as a column
ZHVI["Date"] = date_strings
ZHVI.head()
zillow_data = pd.merge(ZRI, ZHVI, on=['Date', 'RegionName'], how='outer')
zillow_data.sort_values("Date")
Merging is very powerful and the merge can be done in a number of ways. See the infographic on joining in this repository.
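As a rough sketch of how the how= keyword changes which rows survive the merge, consider two tiny, made-up frames that only partially overlap:
# two small frames sharing only the key 'b'
left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

pd.merge(left, right, on="key", how="inner")  # keys in both: just 'b'
pd.merge(left, right, on="key", how="left")   # all keys from left: 'a', 'b'
pd.merge(left, right, on="key", how="outer")  # all keys from either: 'a', 'b', 'c'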
Available on GitHub:
Two parts:
Due date: Monday 9/20 by the start of class (7pm)