Week 1B: Exploratory Data Science in Python

9/8/2021

Today

  • Introduction to Pandas
  • Key data analysis concepts
  • Example: Zillow + Census housing data

Reminder: The weekly workflow

  • You'll set up your local Python environment as part of the first homework assignment (posted today!)
  • Each week, you will have two options to follow along with lectures:
    1. Using Binder in the cloud, launching via the button on the week's repository
    2. Download the week's repository to your laptop and launch the notebook locally
  • Work on homeworks locally on your laptop — Binder is only a temporary environment (no save features)

To follow along today, go to https://github.com/MUSA-550-Fall-2021/week-1


Python Data Analysis: the pandas package

Documentation is available at https://pandas.pydata.org

The following line imports the pandas package:

In [63]:
import pandas as pd

Basic pandas concepts

The primary objects in pandas are the:

  • DataFrame, which is very similar to an Excel spreadsheet, and has rows and named columns
  • Series, which represents a single column of data. A DataFrame contains one or more Series and a name for each Series.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in R.

You can think of Series objects as fancier versions of Python's built-in list data type

To create a Series object:

In [64]:
# city names and population
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199]) 
In [65]:
city_names
Out[65]:
0    San Francisco
1         San Jose
2       Sacramento
dtype: object

DataFrame objects can be created by passing a dict mapping string column names to their respective Series.

In [66]:
cities_df = pd.DataFrame({ 'City Name': city_names, 
                          'Population': population })
cities_df
Out[66]:
City Name Population
0 San Francisco 852469
1 San Jose 1015785
2 Sacramento 485199

Note: always try to give your variables meaningful names. It will help immensely when you are trying to debug problems or when you're trying to understand the code you've written months later.

Accessing data

In [67]:
# access columns with a dict-like syntax
cities_df['Population']
Out[67]:
0     852469
1    1015785
2     485199
Name: Population, dtype: int64

Rows can be accessed using Python's syntax for slicing

In [68]:
# list slicing: get the elements with indices 0 and 1 (but NOT 2)
cities_list = ['San Francisco', 'San Jose', 'Sacramento']
cities_list[:2]
Out[68]:
['San Francisco', 'San Jose']

List slicing on its own is fairly limited...but pandas offers many more powerful selection features!

In [69]:
# slicing data frame rows is very similar!
cities_df[:2]
Out[69]:
City Name Population
0 San Francisco 852469
1 San Jose 1015785

pandas includes functionality for many different ways of selecting data. See the documentation for many more examples.
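As a quick sketch of these selection patterns on a toy dataframe (hypothetical data, mirroring the cities example above):

```python
import pandas as pd

df = pd.DataFrame({
    "City Name": ["San Francisco", "San Jose", "Sacramento"],
    "Population": [852469, 1015785, 485199],
})

# dict-like column access returns a Series
pops = df["Population"]

# .loc selects by index label; .iloc selects by integer position
first_by_label = df.loc[0]      # the row whose index label is 0
first_by_position = df.iloc[0]  # the first row, whatever its label

# boolean selection: keep rows where the population is under one million
small = df[df["Population"] < 1e6]
```

We'll see each of these again below with the census and Zillow data.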

Manipulating data

In [70]:
cities_df['Population'].median()
Out[70]:
852469.0

NumPy is a popular toolkit for scientific computing.

pandas Series can be used as arguments to most NumPy functions:

In [71]:
import numpy as np

# calculate the median population value
np.median(cities_df['Population'])
Out[71]:
852469.0

For more complex single-column transformations, you can use Series.apply. It accepts a function that is applied to each value in the Series.

For example, we can find which cities have a population greater than a million:

In [72]:
# define our function
def get_large_cities(population):
    return population > 1e6

large_cities_sel = cities_df['Population'].apply(get_large_cities)
In [73]:
large_cities_sel
Out[73]:
0    False
1     True
2    False
Name: Population, dtype: bool
In [74]:
# add the new computed column to our original data frame
cities_df["Large Cities"] = large_cities_sel

small_cities_func =  lambda population: population < 1e6

# we can also use lambda (unnamed, inline) functions
cities_df["Small Cities"] = cities_df["Population"].apply(
    small_cities_func
)

# print out
cities_df
Out[74]:
City Name Population Large Cities Small Cities
0 San Francisco 852469 False True
1 San Jose 1015785 True False
2 Sacramento 485199 False True

Data selection

We can select the "large" cities by passing the boolean values to a DataFrame's .loc[] indexer:

In [75]:
cities_df['Large Cities']
Out[75]:
0    False
1     True
2    False
Name: Large Cities, dtype: bool
In [76]:
cities_df.loc[cities_df['Large Cities']]
Out[76]:
City Name Population Large Cities Small Cities
1 San Jose 1015785 True False

There is a faster, more concise way to achieve this!

In [77]:
cities_df['Population'] > 1e6
Out[77]:
0    False
1     True
2    False
Name: Population, dtype: bool
In [78]:
# passing the boolean expression directly to .loc is equivalent
# to the "Small Cities" selection we computed with apply()
cities_df.loc[ cities_df['Population'] < 1e6 ]
Out[78]:
City Name Population Large Cities Small Cities
0 San Francisco 852469 False True
2 Sacramento 485199 False True

What about getting rows where "Large Cities" is False?

Use the Python tilde operator to do a logical not operation:

In [79]:
# reverse the large cities boolean selection
cities_df.loc[~large_cities_sel]
Out[79]:
City Name Population Large Cities Small Cities
0 San Francisco 852469 False True
2 Sacramento 485199 False True
In [80]:
# or equivalently:
# NOTE: you need to put the whole expression in () and then apply the tilde!
cities_df.loc[~ (cities_df['Population'] > 1e6) ]
Out[80]:
City Name Population Large Cities Small Cities
0 San Francisco 852469 False True
2 Sacramento 485199 False True

An even more concise way!

We can use the pandas query() function.

query() returns the subset of your dataframe that satisfies a boolean expression, written as a string.

In [81]:
cities_df.query("Population < 1e6")
Out[81]:
City Name Population Large Cities Small Cities
0 San Francisco 852469 False True
2 Sacramento 485199 False True

How to remember the specifics of all of these functions?

The documentation is your best friend! Use the question mark operator!

In [82]:
# use the question mark
pd.DataFrame.loc?
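Note that the question mark is IPython/Jupyter-specific syntax; in a plain Python script, the built-in help() function shows the same docstring:

```python
import pandas as pd

# equivalent to `pd.DataFrame.loc?` in a Jupyter notebook
help(pd.DataFrame.loc)
```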

Now onto a more interesting example...

Let's load census data on rental rates from the 2019 American Community Survey.

We've stored the data locally in the comma-separated value (CSV) format. Later in the course, you'll learn how to download them directly using Python.

Use the read_csv() function from pandas. The first argument to read_csv() is the file path to load.

In [83]:
census_df = pd.read_csv("./data/census/ACS_19_1YR_B25003.csv")
In [84]:
census_df.head(n=5)
Out[84]:
GEO.id GEO.display-label HD01_VD01 HD02_VD01 HD01_VD02 HD02_VD02 HD01_VD03 HD02_VD03
0 1600000US0103076 Auburn city, Alabama 26156 2687 13095 1829 13061 2108
1 1600000US0107000 Birmingham city, Alabama 93300 3618 37788 2635 55512 2788
2 1600000US0121184 Dothan city, Alabama 27103 1044 15891 824 11212 817
3 1600000US0135896 Hoover city, Alabama 32461 1606 22391 1515 10070 1496
4 1600000US0137000 Huntsville city, Alabama 88930 2823 48728 2758 40202 2550

Select certain columns and rename them.

  • Total: Total occupied housing units
  • Rentals: Renter-occupied housing units
In [85]:
## FIRST STEP: let's trim to the columns we want

# columns we want
columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
census_df = census_df[columns]
In [86]:
# Peek at the dataframe with trimmed columns
census_df.head()
Out[86]:
GEO.display-label HD01_VD01 HD01_VD03
0 Auburn city, Alabama 26156 13061
1 Birmingham city, Alabama 93300 55512
2 Dothan city, Alabama 27103 11212
3 Hoover city, Alabama 32461 10070
4 Huntsville city, Alabama 88930 40202
In [87]:
## STEP 2: Let's rename the columns!

# rename columns
census_df.columns = ['City', 'Total', 'Rentals']

# Return the head (first 5 rows) from the cell
census_df.head(n=5)
Out[87]:
City Total Rentals
0 Auburn city, Alabama 26156 13061
1 Birmingham city, Alabama 93300 55512
2 Dothan city, Alabama 27103 11212
3 Hoover city, Alabama 32461 10070
4 Huntsville city, Alabama 88930 40202
In [88]:
# Step 3: Add the rental rate as a new column
census_df['Rental Rate'] = census_df['Rentals'] / census_df['Total'] * 100

census_df.head()
Out[88]:
City Total Rentals Rental Rate
0 Auburn city, Alabama 26156 13061 49.935005
1 Birmingham city, Alabama 93300 55512 59.498392
2 Dothan city, Alabama 27103 11212 41.368114
3 Hoover city, Alabama 32461 10070 31.021842
4 Huntsville city, Alabama 88930 40202 45.206342
In [89]:
# How many rows are in the dataframe? --> use the len() operator

len(census_df)
Out[89]:
634

We can select the largest cities by name using the Series.isin() function, which checks whether each value in the Series is in the specified list.

In [90]:
# Define the list of city names we want
top_cities_list = [
    "Philadelphia city, Pennsylvania",
    "New York city, New York",
    "Los Angeles city, California",
    "Houston city, Texas",
    "Chicago city, Illinois",
    "Phoenix city, Arizona",
]
In [91]:
# Use isin() to test whether each row's city name is in our list
census_df["City"].isin(top_cities_list)
Out[91]:
0      False
1      False
2      False
3      False
4      False
       ...  
629    False
630    False
631    False
632    False
633    False
Name: City, Length: 634, dtype: bool
In [92]:
# Now do the selection!
census_df.loc[census_df["City"].isin(top_cities_list)]  # select the valid rows
Out[92]:
City Total Rentals Rental Rate
19 Phoenix city, Arizona 586878 267225 45.533314
94 Los Angeles city, California 1398900 887785 63.463078
284 Chicago city, Illinois 1080345 603352 55.848086
439 New York city, New York 3211033 2188214 68.146730
487 Philadelphia city, Pennsylvania 619505 295783 47.745055
538 Houston city, Texas 876504 522942 59.662249

Only 6 rows as expected!

Alternatively, we could use the query() function. The syntax is a bit different (and I always forget and have to look at the documentation).

In the query() string, you can reference existing variables by prefacing the variable name with an "@" symbol.

For example:

In [93]:
# Get the rows where the City column is in "top_cities_list"
census_df.query("City in @top_cities_list")
Out[93]:
City Total Rentals Rental Rate
19 Phoenix city, Arizona 586878 267225 45.533314
94 Los Angeles city, California 1398900 887785 63.463078
284 Chicago city, Illinois 1080345 603352 55.848086
439 New York city, New York 3211033 2188214 68.146730
487 Philadelphia city, Pennsylvania 619505 295783 47.745055
538 Houston city, Texas 876504 522942 59.662249

An aside about string formatting

Python uses the "%" operator to insert variable values into strings. For example, the file name of the data we want to load for 2019 is:

In [94]:
year_str = "19"
path = "./data/census/ACS_%s_1YR_B25003.csv" % year_str
print(path)
./data/census/ACS_19_1YR_B25003.csv

See this guide for more info on this type of string formatting.

Python f-strings

Python version 3.6 introduced a new and improved string formatting syntax, where the variable that is being inserted can be referenced directly using the variable name.

The variable name should be enclosed in curly braces inside the string:

In [95]:
year_str = "19"
path = f"./data/census/ACS_{year_str}_1YR_B25003.csv"
print(path) 
./data/census/ACS_19_1YR_B25003.csv

For more info: see this guide
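f-strings also accept format specifiers after a colon, which is handy for rounding numbers when printing (a small sketch; the variable name is just illustrative):

```python
rate = 43.292839293072745

# format specifiers go after a colon inside the braces
print(f"rental rate = {rate:.2f}")   # two decimal places
print(f"rental rate = {rate:.1f}%")  # one decimal place
```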

Now let's load data from 2005 to 2019

Let's use a for loop to loop over each year value from 2005 to 2019

In [96]:
all_rental_data = []
start_year = 2005
stop_year = 2019
for year in range(start_year, stop_year + 1):
    
    #---------------------------------------------
    # Step 0: Print out year 
    # The variable year changes for each iteration of the loop
    #----------------------------------------------
    print(f"Loading data for year = {year}...")
    
    #---------------------------------
    # Step 1: Read data for this year
    #---------------------------------
    year_str = str(year) # convert integer value of "year" to a string
    year_str = year_str[2:] # extract the last two digits of the year string
    rental_df = pd.read_csv(f"./data/census/ACS_{year_str}_1YR_B25003.csv")
     
    #---------------------------------
    # Step 2: Columns we want
    #---------------------------------
    columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
    rental_df = rental_df[columns]

    #---------------------------------
    # Step 3: Rename columns
    #---------------------------------
    rental_df.columns = ['City', 'Total', 'Rentals']
    
    #----------------------------------
    # Step 4: Calculate the rental rate
    #----------------------------------
    rental_df['Rental Rate'] = rental_df['Rentals'] / rental_df['Total'] * 100
    
    #----------------------------------
    # Step 5: Select the cities we want
    #----------------------------------
    selection = rental_df['City'].isin(top_cities_list)
    top_cities_df = rental_df.loc[selection].copy()
    
    #----------------------------------
    # Step 6: Add a column for the year
    #----------------------------------
    top_cities_df['Year'] = year
    
    #-----------------------------------
    # Step 7: Save this dataframe in the list
    #-----------------------------------
    all_rental_data.append(top_cities_df)

#---------------------------------------------------
# Step 8: After the for loop, combine all of the 
# data frames into one along the row axis
#---------------------------------------------------
all_rental_data = pd.concat(all_rental_data, axis=0)
Loading data for year = 2005...
Loading data for year = 2006...
Loading data for year = 2007...
Loading data for year = 2008...
Loading data for year = 2009...
Loading data for year = 2010...
Loading data for year = 2011...
Loading data for year = 2012...
Loading data for year = 2013...
Loading data for year = 2014...
Loading data for year = 2015...
Loading data for year = 2016...
Loading data for year = 2017...
Loading data for year = 2018...
Loading data for year = 2019...
In [97]:
# Note: the pd.concat() took a list of dataframes and converted them 
# into a single dataframe!
type(all_rental_data)
Out[97]:
pandas.core.frame.DataFrame
In [98]:
all_rental_data.head()
Out[98]:
City Total Rentals Rental Rate Year
13 Phoenix city, Arizona 503753 204914 40.677475 2005
75 Los Angeles city, California 1284124 771325 60.066240 2005
218 Chicago city, Illinois 1020605 525620 51.500825 2005
349 New York city, New York 3026196 2023924 66.880136 2005
393 Philadelphia city, Pennsylvania 565433 244792 43.292839 2005
In [99]:
# What about the tail?
all_rental_data.tail()
Out[99]:
City Total Rentals Rental Rate Year
94 Los Angeles city, California 1398900 887785 63.463078 2019
284 Chicago city, Illinois 1080345 603352 55.848086 2019
439 New York city, New York 3211033 2188214 68.146730 2019
487 Philadelphia city, Pennsylvania 619505 295783 47.745055 2019
538 Houston city, Texas 876504 522942 59.662249 2019
In [100]:
# How many rows? Hopefully (6 cities x 15 years = 90!)
len(all_rental_data)
Out[100]:
90
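The concatenation step at the end of the loop can be sketched in isolation with two hypothetical single-row dataframes:

```python
import pandas as pd

# toy stand-ins for the per-year dataframes built in the loop above
df_2005 = pd.DataFrame({"City": ["Philadelphia"], "Year": [2005]})
df_2006 = pd.DataFrame({"City": ["Philadelphia"], "Year": [2006]})

# stack along the row axis (axis=0); the original index labels are kept
combined = pd.concat([df_2005, df_2006], axis=0)

# pass ignore_index=True to get a fresh 0..N-1 index instead
combined_fresh = pd.concat([df_2005, df_2006], ignore_index=True)
```

This is why the combined rental dataframe above keeps the row labels from each year's original file.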

Basic plotting: matplotlib

Much more to come next week. For now, we'll use some very simple plotting commands with Python's main plotting library matplotlib.

Import the library we'll use from matplotlib.

In [101]:
from matplotlib import pyplot as plt

We'll use the plot() function which plots a simple x vs y line.

The strategy

  1. Create a figure axes to plot to
  2. Loop over each unique city
  3. Select the subset of the data for each city
  4. Plot the 'Year' vs. 'Rental Rate' columns using the plot() function for the sub-selection data frame

Note

To find the unique city names in the 'City' column, we can use the unique() function:

In [102]:
all_rental_data['City'].unique()
Out[102]:
array(['Phoenix city, Arizona', 'Los Angeles city, California',
       'Chicago city, Illinois', 'New York city, New York',
       'Philadelphia city, Pennsylvania', 'Houston city, Texas'],
      dtype=object)
In [103]:
with plt.style.context('ggplot'):
    
    # Step 1: Create a figure and axes to plot on
    fig, ax = plt.subplots(figsize=(10,6))
    
    # Step 2: Loop over each city and plot
    for city in all_rental_data['City'].unique():
        
        # Print out the city in each loop iteration
        print(city)
        
        # Select data for this city
        this_city = all_rental_data['City'] == city
        city_df = all_rental_data.loc[this_city]
        
        # Plot Year vs Rental Rate
        ax.plot(city_df['Year'], city_df['Rental Rate'], label=city, linewidth=4)

    # Add a legend
    ax.legend(loc=0, ncol=3, fontsize=12)

    # Set the y limit
    ax.set_ylim(30, 72)

    # Add a y label
    ax.set_ylabel("Rental Rate in Percent");
Phoenix city, Arizona
Los Angeles city, California
Chicago city, Illinois
New York city, New York
Philadelphia city, Pennsylvania
Houston city, Texas

Note: we used the "ggplot" theme to style our plots. Matplotlib comes with a number of built-in styles: see all of the examples here.
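To see which style names your matplotlib installation ships with (the exact list varies by version), you can print plt.style.available:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe outside notebooks
from matplotlib import pyplot as plt

# list the names of the built-in styles; "ggplot" is among them
print(plt.style.available)
```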

From 2005 to 2019, Philadelphia had the third largest percent increase

Exercise: Can you calculate what the increase was for Philadelphia?

You will need to:

  • Trim the full data set to just Philadelphia
  • Select the rental rate for 2019
  • Select the rental rate for 2005
  • Calculate the percent change

Exercise solution

In [104]:
# trim to just Philadelphia first, using this boolean selection index
all_rental_data['City'] == 'Philadelphia city, Pennsylvania'
Out[104]:
13     False
75     False
218    False
349    False
393     True
       ...  
94     False
284    False
439    False
487     True
538    False
Name: City, Length: 90, dtype: bool
In [105]:
# select Philadelphia only
philly = all_rental_data.loc[all_rental_data['City']=='Philadelphia city, Pennsylvania']
philly
Out[105]:
City Total Rentals Rental Rate Year
393 Philadelphia city, Pennsylvania 565433 244792 43.292839 2005
400 Philadelphia city, Pennsylvania 554048 231323 41.751437 2006
408 Philadelphia city, Pennsylvania 562384 239363 42.562199 2007
415 Philadelphia city, Pennsylvania 578263 253084 43.766245 2008
419 Philadelphia city, Pennsylvania 569835 258459 45.356814 2009
427 Philadelphia city, Pennsylvania 575413 264129 45.902508 2010
434 Philadelphia city, Pennsylvania 576429 265126 45.994563 2011
437 Philadelphia city, Pennsylvania 579874 277323 47.824700 2012
449 Philadelphia city, Pennsylvania 582528 285430 48.998503 2013
454 Philadelphia city, Pennsylvania 577862 277301 47.987409 2014
458 Philadelphia city, Pennsylvania 581604 275397 47.351291 2015
465 Philadelphia city, Pennsylvania 580205 277707 47.863600 2016
474 Philadelphia city, Pennsylvania 606142 309279 51.024182 2017
486 Philadelphia city, Pennsylvania 608233 287543 47.275140 2018
487 Philadelphia city, Pennsylvania 619505 295783 47.745055 2019

An aside about labels and the index in Pandas

Note how the rows of the above dataframe kept their original index labels — e.g., 393 for the first row and 487 for the last. This means we can access a row by its label using .loc[]; for example, the 2019 row:

In [106]:
philly.loc[487]
Out[106]:
City           Philadelphia city, Pennsylvania
Total                                   619505
Rentals                                 295783
Rental Rate                          47.745055
Year                                      2019
Name: 487, dtype: object

This is different than the .iloc[] indexer, which selects based on the integer position of the row, regardless of what the labels in the index are. For example, the first row of a dataframe can always be accessed using .iloc[0], and the third row using .iloc[2]:

For more details on .iloc[], see the documentation on indexing by position.

In [107]:
philly.iloc[2]
Out[107]:
City           Philadelphia city, Pennsylvania
Total                                   562384
Rentals                                 239363
Rental Rate                          42.562199
Year                                      2007
Name: 408, dtype: object

We can reset the index labels so they range from 0 to the length of the dataframe, using the reset_index() function. Note that reset_index() returns a new dataframe and leaves the original unchanged. For example:

In [108]:
philly.reset_index(drop=True)
Out[108]:
City Total Rentals Rental Rate Year
0 Philadelphia city, Pennsylvania 565433 244792 43.292839 2005
1 Philadelphia city, Pennsylvania 554048 231323 41.751437 2006
2 Philadelphia city, Pennsylvania 562384 239363 42.562199 2007
3 Philadelphia city, Pennsylvania 578263 253084 43.766245 2008
4 Philadelphia city, Pennsylvania 569835 258459 45.356814 2009
5 Philadelphia city, Pennsylvania 575413 264129 45.902508 2010
6 Philadelphia city, Pennsylvania 576429 265126 45.994563 2011
7 Philadelphia city, Pennsylvania 579874 277323 47.824700 2012
8 Philadelphia city, Pennsylvania 582528 285430 48.998503 2013
9 Philadelphia city, Pennsylvania 577862 277301 47.987409 2014
10 Philadelphia city, Pennsylvania 581604 275397 47.351291 2015
11 Philadelphia city, Pennsylvania 580205 277707 47.863600 2016
12 Philadelphia city, Pennsylvania 606142 309279 51.024182 2017
13 Philadelphia city, Pennsylvania 608233 287543 47.275140 2018
14 Philadelphia city, Pennsylvania 619505 295783 47.745055 2019

Next, to calculate our percent difference we need to get the values for Philadelphia in 2005 and 2019:

In [109]:
# select 2005
philly_2005 = philly.loc[philly['Year']==2005]


philly_2005
Out[109]:
City Total Rentals Rental Rate Year
393 Philadelphia city, Pennsylvania 565433 244792 43.292839 2005
In [110]:
# Notice: b/c we selected from a dataframe, we get back a Series of length 1, not just a number
philly_2005["Rental Rate"]
Out[110]:
393    43.292839
Name: Rental Rate, dtype: float64
In [111]:
len(philly_2005["Rental Rate"])
Out[111]:
1

An aside: the squeeze() function

Notice that philly_2005 has only a single row. It would be more useful to have this as a scalar (just a single number).

We can use the .squeeze() function to do this. It does just what it sounds like: if you have a DataFrame with only one row, it will "squeeze" out the row dimension, returning just a Series object:

In [112]:
philly_2005 = philly.loc[philly['Year']==2005].squeeze()

philly_2005
Out[112]:
City           Philadelphia city, Pennsylvania
Total                                   565433
Rentals                                 244792
Rental Rate                          43.292839
Year                                      2005
Name: 393, dtype: object
In [113]:
value_2005 = philly_2005["Rental Rate"]

value_2005
Out[113]:
43.292839293072745
In [114]:
# Do the same thing for 2019 — make sure you squeeze!
philly_2019 = philly.loc[philly['Year']==2019].squeeze()

# Get the value
value_2019 = philly_2019['Rental Rate']

value_2019
Out[114]:
47.74505451933398
In [115]:
# calculate percent change
percent_change = (value_2019 - value_2005)/ value_2005 * 100
In [116]:
print(f"change in rental rate in Philadelphia from 2005 to 2019 = {percent_change}")
change in rental rate in Philadelphia from 2005 to 2019 = 10.283952956103837

Now let's compare Zillow rental and home value data

Load citywide Zillow Rent Index (ZRI) and Zillow Home Value Index (ZHVI) data.

Files were downloaded from https://www.zillow.com/research/data/

In [117]:
home_values = pd.read_csv("data/zillow/Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv")
rent_values = pd.read_csv("data/zillow/Metro_ZORI_AllHomesPlusMultifamily_SSA.csv")

Peek at the first few rows of the ZRI data:

In [118]:
rent_values.head()
Out[118]:
RegionID RegionName SizeRank 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 ... 2020-10 2020-11 2020-12 2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07
0 102001 United States 0 1356.0 1361 1367.0 1373 1378 1384 1390 ... 1712 1721 1729 1738 1747 1757 1766 1776.0 1786 1796
1 394913 New York, NY 1 2205.0 2214 2224.0 2234 2244 2254 2264 ... 2437 2433 2428 2424 2421 2418 2415 2414.0 2413 2413
2 753899 Los Angeles-Long Beach-Anaheim, CA 2 1868.0 1879 1890.0 1902 1913 1924 1935 ... 2529 2538 2546 2554 2563 2572 2581 2591.0 2601 2611
3 394463 Chicago, IL 3 1437.0 1441 1445.0 1449 1453 1456 1460 ... 1651 1653 1655 1657 1659 1662 1664 1667.0 1670 1674
4 394514 Dallas-Fort Worth, TX 4 1179.0 1182 1186.0 1190 1194 1198 1202 ... 1519 1529 1540 1551 1562 1573 1585 1597.0 1608 1620

5 rows × 94 columns

And do the same for the ZHVI data:

In [119]:
home_values.head()
Out[119]:
RegionID SizeRank RegionName RegionType StateName 1996-01-31 1996-02-29 1996-03-31 1996-04-30 1996-05-31 ... 2020-10-31 2020-11-30 2020-12-31 2021-01-31 2021-02-28 2021-03-31 2021-04-30 2021-05-31 2021-06-30 2021-07-31
0 102001 0 United States Country NaN 107860.0 107887.0 107937.0 108064.0 108208.0 ... 262913.0 265716.0 268690.0 271763.0 275071.0 278662.0 282735.0 287579.0 293121.0 298933.0
1 394913 1 New York, NY Msa NY 186908.0 186471.0 186194.0 185663.0 185347.0 ... 499371.0 504428.0 509356.0 514095.0 518935.0 524000.0 529570.0 536247.0 544198.0 552607.0
2 753899 2 Los Angeles-Long Beach-Anaheim, CA Msa CA 184839.0 185096.0 185116.0 185224.0 185197.0 ... 719725.0 727136.0 735212.0 743347.0 752071.0 761150.0 773063.0 790724.0 811628.0 831593.0
3 394463 3 Chicago, IL Msa IL 147491.0 147472.0 147351.0 147412.0 147317.0 ... 252974.0 255348.0 257714.0 259803.0 262422.0 265051.0 268420.0 271938.0 276069.0 280130.0
4 394514 4 Dallas-Fort Worth, TX Msa TX 112545.0 112609.0 112770.0 113092.0 113439.0 ... 268525.0 271296.0 274597.0 277507.0 281346.0 285684.0 291484.0 298128.0 305540.0 313393.0

5 rows × 312 columns

Selecting the cities we want

In [120]:
valid_cities = [
    "New York, NY",
    "Chicago, IL",
    "Los Angeles-Long Beach-Anaheim, CA",
    "Philadelphia, PA",
    "Houston, TX",
    "Phoenix, AZ",
]
In [121]:
selection =  home_values['RegionName'].isin(valid_cities)
home_values_trimmed = home_values.loc[selection]
In [122]:
selection = rent_values['RegionName'].isin(valid_cities)
rent_values_trimmed = rent_values.loc[selection]
In [123]:
rent_values_trimmed
Out[123]:
RegionID RegionName SizeRank 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 ... 2020-10 2020-11 2020-12 2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07
1 394913 New York, NY 1 2205.0 2214 2224.0 2234 2244 2254 2264 ... 2437 2433 2428 2424 2421 2418 2415 2414.0 2413 2413
2 753899 Los Angeles-Long Beach-Anaheim, CA 2 1868.0 1879 1890.0 1902 1913 1924 1935 ... 2529 2538 2546 2554 2563 2572 2581 2591.0 2601 2611
3 394463 Chicago, IL 3 1437.0 1441 1445.0 1449 1453 1456 1460 ... 1651 1653 1655 1657 1659 1662 1664 1667.0 1670 1674
5 394974 Philadelphia, PA 5 1456.0 1458 1459.0 1461 1463 1465 1467 ... 1723 1729 1735 1741 1748 1754 1761 1768.0 1774 1781
6 394692 Houston, TX 6 1135.0 1142 1149.0 1155 1161 1168 1174 ... 1319 1325 1331 1336 1342 1348 1354 1361.0 1368 1374
14 394976 Phoenix, AZ 14 997.0 1001 1005.0 1009 1013 1017 1021 ... 1530 1551 1573 1595 1617 1640 1662 1686.0 1709 1732

6 rows × 94 columns

Removing unwanted columns

Unwanted columns can be dropped from the data frame using the drop() function.

Note that the column axis is the second axis (axis=1), and if you wanted to remove rows, you could use the first axis (axis=0).

In [124]:
x = ['SizeRank', 'RegionID', "RegionType", "StateName"]
home_values_final = home_values_trimmed.drop(x, axis=1)
In [125]:
columns = ['SizeRank', 'RegionID']
rent_values_final = rent_values_trimmed.drop(columns, axis=1)
In [126]:
rent_values_final
Out[126]:
RegionName 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 2014-08 2014-09 ... 2020-10 2020-11 2020-12 2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07
1 New York, NY 2205.0 2214 2224.0 2234 2244 2254 2264 2273 2283 ... 2437 2433 2428 2424 2421 2418 2415 2414.0 2413 2413
2 Los Angeles-Long Beach-Anaheim, CA 1868.0 1879 1890.0 1902 1913 1924 1935 1947 1958 ... 2529 2538 2546 2554 2563 2572 2581 2591.0 2601 2611
3 Chicago, IL 1437.0 1441 1445.0 1449 1453 1456 1460 1463 1467 ... 1651 1653 1655 1657 1659 1662 1664 1667.0 1670 1674
5 Philadelphia, PA 1456.0 1458 1459.0 1461 1463 1465 1467 1469 1471 ... 1723 1729 1735 1741 1748 1754 1761 1768.0 1774 1781
6 Houston, TX 1135.0 1142 1149.0 1155 1161 1168 1174 1180 1186 ... 1319 1325 1331 1336 1342 1348 1354 1361.0 1368 1374
14 Phoenix, AZ 997.0 1001 1005.0 1009 1013 1017 1021 1025 1030 ... 1530 1551 1573 1595 1617 1640 1662 1686.0 1709 1732

6 rows × 92 columns
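For completeness, dropping rows works the same way with axis=0 (the default), where the values you pass are index labels. A sketch with a hypothetical dataframe:

```python
import pandas as pd

# a toy frame with string index labels
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

# drop rows by index label using the first axis (axis=0)
trimmed = df.drop(["a", "c"], axis=0)
```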

Wide vs long format for datasets

Currently, our data is in wide format → each observation (here, each month's value) has its own column. This usually results in many columns but few rows.

In [127]:
home_values_final
Out[127]:
RegionName 1996-01-31 1996-02-29 1996-03-31 1996-04-30 1996-05-31 1996-06-30 1996-07-31 1996-08-31 1996-09-30 ... 2020-10-31 2020-11-30 2020-12-31 2021-01-31 2021-02-28 2021-03-31 2021-04-30 2021-05-31 2021-06-30 2021-07-31
1 New York, NY 186908.0 186471.0 186194.0 185663.0 185347.0 185059.0 184882.0 184790.0 184835.0 ... 499371.0 504428.0 509356.0 514095.0 518935.0 524000.0 529570.0 536247.0 544198.0 552607.0
2 Los Angeles-Long Beach-Anaheim, CA 184839.0 185096.0 185116.0 185224.0 185197.0 185225.0 185325.0 185277.0 185163.0 ... 719725.0 727136.0 735212.0 743347.0 752071.0 761150.0 773063.0 790724.0 811628.0 831593.0
3 Chicago, IL 147491.0 147472.0 147351.0 147412.0 147317.0 147480.0 147523.0 148537.0 149784.0 ... 252974.0 255348.0 257714.0 259803.0 262422.0 265051.0 268420.0 271938.0 276069.0 280130.0
5 Philadelphia, PA 120665.0 120510.0 120370.0 120127.0 119962.0 119867.0 119823.0 119801.0 119842.0 ... 268118.0 271333.0 274637.0 277895.0 281276.0 285001.0 288799.0 293343.0 298129.0 302822.0
6 Houston, TX 110158.0 110234.0 110242.0 110391.0 110531.0 110631.0 110669.0 110729.0 110896.0 ... 228397.0 230431.0 232626.0 235215.0 238045.0 240714.0 243831.0 247646.0 252661.0 258174.0
14 Phoenix, AZ 113486.0 113802.0 114163.0 114858.0 115537.0 116149.0 116715.0 117234.0 117747.0 ... 315444.0 321009.0 326891.0 333799.0 340815.0 348899.0 356187.0 366575.0 378013.0 390733.0

6 rows × 308 columns

Usually it's better to have data in tidy (also known as long) format.

Tidy datasets are arranged such that each variable is a column and each observation is a row.

In our case, we want to have a column called ZRI and one called ZHVI and a row for each month that the indices were measured.

pandas provides the melt() function for converting from wide formats to tidy formats.

melt() doesn’t aggregate or summarize the data. It transforms it into a different shape, but it contains the exact same information as before.

Imagine you have 6 rows of data (each row is a unique city) with 10 columns of home values (each column is a different month). That is wide data and is the format usually seen in spreadsheets or tables in a report.

If you melt() that wide data, you would get a table with 60 rows and 3 columns. Each row would contain the city name, the month, and the home value for that city and month. This tidy-formatted data contains the same info as the wide data, but in a different form.

[Animation: reshaping a table from wide to long and back, using R's gather() and spread()]

This animation shows the transformation from wide to long / long to wide. You can ignore gather() and spread() - those are the R versions of the pandas functions.

In [128]:
pd.melt?

Now, let's melt our datasets:

In [129]:
ZHVI = pd.melt(
    home_values_final, 
    id_vars=["RegionName"], 
    value_name="ZHVI", 
    var_name="Date"
)
ZRI = pd.melt(
    rent_values_final, 
    id_vars=["RegionName"], 
    value_name="ZRI", 
    var_name="Date"
)

and take a look:

In [130]:
ZRI.tail()
Out[130]:
RegionName Date ZRI
541 Los Angeles-Long Beach-Anaheim, CA 2021-07 2611.0
542 Chicago, IL 2021-07 1674.0
543 Philadelphia, PA 2021-07 1781.0
544 Houston, TX 2021-07 1374.0
545 Phoenix, AZ 2021-07 1732.0
In [131]:
ZHVI.head()
Out[131]:
RegionName Date ZHVI
0 New York, NY 1996-01-31 186908.0
1 Los Angeles-Long Beach-Anaheim, CA 1996-01-31 184839.0
2 Chicago, IL 1996-01-31 147491.0
3 Philadelphia, PA 1996-01-31 120665.0
4 Houston, TX 1996-01-31 110158.0

Merging data frames

Another common operation is merging, also known as joining, two datasets.

We can use the merge() function to merge observations that have the same Date and RegionName values.

But first! Our date string formats don't match!

  • ZHVI has the Date column in the format of YYYY-MM-DD
  • ZRI has the Date column in the format of YYYY-MM

We need to put them into the same format before merging the data!

We can fix this by creating datetime objects and formatting the dates into the same format.

Datetime objects

Currently our Date column is stored as a string.

pandas includes additional functionality for dates, but first we must convert the strings using the to_datetime() function.

In [132]:
# Convert the Date column to Datetime objects
ZHVI["Date"] = pd.to_datetime(ZHVI["Date"])

The strftime function

We can use the ".dt" property of the Date column to access datetime functions of the new Datetime column.

For converting to strings in a certain format, we can use the "strftime" function (docs). This uses a special syntax to convert the date object to a string with a specific format.

Important reference: Use this strftime guide to look up the syntax!

In [133]:
# Extract YYYY-MM string
date_strings = ZHVI["Date"].dt.strftime("%Y-%m")
In [134]:
# First entry is a string!
date_strings.iloc[0]
Out[134]:
'1996-01'
In [135]:
# Add the strings back as a column
ZHVI["Date"] = date_strings
In [136]:
ZHVI.head()
Out[136]:
RegionName Date ZHVI
0 New York, NY 1996-01 186908.0
1 Los Angeles-Long Beach-Anaheim, CA 1996-01 184839.0
2 Chicago, IL 1996-01 147491.0
3 Philadelphia, PA 1996-01 120665.0
4 Houston, TX 1996-01 110158.0

Now we can merge!

In [137]:
zillow_data = pd.merge(ZRI, ZHVI, on=['Date', 'RegionName'], how='outer')
In [138]:
zillow_data.sort_values("Date")
Out[138]:
RegionName Date ZRI ZHVI
546 New York, NY 1996-01 NaN 186908.0
551 Phoenix, AZ 1996-01 NaN 113486.0
550 Houston, TX 1996-01 NaN 110158.0
547 Los Angeles-Long Beach-Anaheim, CA 1996-01 NaN 184839.0
548 Chicago, IL 1996-01 NaN 147491.0
... ... ... ... ...
541 Los Angeles-Long Beach-Anaheim, CA 2021-07 2611.0 831593.0
542 Chicago, IL 2021-07 1674.0 280130.0
543 Philadelphia, PA 2021-07 1781.0 302822.0
544 Houston, TX 2021-07 1374.0 258174.0
545 Phoenix, AZ 2021-07 1732.0 390733.0

1842 rows × 4 columns

Merging is very powerful and the merge can be done in a number of ways. See the infographic on joining in this repository.
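The how keyword controls which rows survive the merge. A sketch on two toy frames (illustrative numbers, not the full Zillow data):

```python
import pandas as pd

# two small frames sharing a "Date" key, with different date coverage
left = pd.DataFrame({"Date": ["2021-06", "2021-07"], "ZRI": [1774, 1781]})
right = pd.DataFrame({"Date": ["2021-07", "2021-08"], "ZHVI": [302822, 306000]})

# inner join: only Dates present in BOTH frames survive
inner = pd.merge(left, right, on="Date", how="inner")

# outer join: every Date from either frame, with NaN where a value is missing
outer = pd.merge(left, right, on="Date", how="outer")
```

An outer merge is why the combined zillow_data above has NaN values in the ZRI column for dates before 2014, when the rent index begins.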

Rent prices in Philadelphia have remained relatively flat, relative to other large cities.

First Homework Assignment

Two parts:

  1. Download and install Python locally on your computer. Instructions in the assignment README!
  2. Analyze Zillow ZHVI data by ZIP code in Philadelphia, submitting your Jupyter notebook.

Due date: Monday 9/20 by the start of class (7pm)

That's it!

  • Next week: Data Visualization Fundamentals
  • Office hours starting next week
    • Nick: Mon and Wed before class, 6 to 7pm in lecture hall (or remote via Zoom, if preferred)
    • Stella: Monday 12:30-2pm, remote via Zoom
    • Sign up for 15-minute time slots on Canvas (Zoom info in calendar invite)
  • Post questions on Piazza!
  • Email questions/concerns to nhand@design.upenn.edu