In [1]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
np.random.seed(42)
In [2]:
plt.rcParams['figure.figsize'] = (10,6)

Week 10: Clustering Analysis in Python

Nov 8, 2021

Housekeeping

  • Assignment #5 (optional) due on Wednesday (11/10)
  • Assignment #6 (optional) will be assigned on Wednesday, due in 2 weeks (11/24)
  • Remember: you must complete one of assignments #4, #5, or #6

Recap

  • Last week: urban street networks + interactive web maps
  • New tools: OSMnx, Pandana, and Folium

Where we left off last week: choropleth maps in Folium

Example: A map of households without internet for US counties

Two ways:

  • The easy way: folium.Choropleth (a sketch appears at the end of this example)
  • The hard way: folium.GeoJson
In [3]:
import folium

Load the data, remove counties with no households, and add our new column:

In [4]:
# Load CSV data from the data/ folder
census_data = pd.read_csv("./data/internet_avail_census.csv", dtype={"geoid": str})

# Remove counties with no households
valid = census_data['universe'] > 0
census_data = census_data.loc[valid]

# Calculate the percent without internet
census_data['percent_no_internet'] = census_data['no_internet'] / census_data['universe']
In [5]:
census_data.head()
Out[5]:
NAME universe no_internet state county geoid percent_no_internet
0 Washington County, Mississippi 18299.0 6166.0 28 151 28151 0.336958
1 Perry County, Mississippi 4563.0 1415.0 28 111 28111 0.310103
2 Choctaw County, Mississippi 3164.0 1167.0 28 19 28019 0.368837
3 Itawamba County, Mississippi 8706.0 1970.0 28 57 28057 0.226281
4 Carroll County, Mississippi 3658.0 1218.0 28 15 28015 0.332969

Load the counties geometry data too:

In [6]:
# Load counties GeoJSON from the data/ folder
counties = gpd.read_file("./data/us-counties-10m.geojson")
In [7]:
counties.head()
Out[7]:
id geometry
0 53073 MULTIPOLYGON (((-120.85361 49.00011, -120.7674...
1 30105 POLYGON ((-106.11238 48.99904, -106.15187 48.8...
2 30029 POLYGON ((-114.06985 48.99904, -114.05908 48.8...
3 16021 POLYGON ((-116.04755 49.00065, -116.04755 48.5...
4 30071 POLYGON ((-107.17840 49.00011, -107.20712 48.9...

The hard way: use folium.GeoJson

  • The good:
    • More customizable, and can add user interaction
  • The bad:
    • More work required: we have to write the style functions ourselves

The steps involved

  1. Join data and geometry features into a single GeoDataFrame
  2. Define a function to style features based on data values
  3. Create GeoJSON layer and add it to the map

Step 1: Join the census data and features

Note: this is different from using folium.Choropleth, where data and features are stored in two separate data frames.

In [8]:
# Merge the county geometries with census data
# Left column: "id"
# Right column: "geoid"
census_joined = counties.merge(census_data, left_on="id", right_on="geoid")
In [9]:
census_joined.head()
Out[9]:
id geometry NAME universe no_internet state county geoid percent_no_internet
0 53073 MULTIPOLYGON (((-120.85361 49.00011, -120.7674... Whatcom County, Washington 85008.0 9189.0 53 73 53073 0.108096
1 30105 POLYGON ((-106.11238 48.99904, -106.15187 48.8... Valley County, Montana 3436.0 672.0 30 105 30105 0.195576
2 30029 POLYGON ((-114.06985 48.99904, -114.05908 48.8... Flathead County, Montana 38252.0 5662.0 30 29 30029 0.148018
3 16021 POLYGON ((-116.04755 49.00065, -116.04755 48.5... Boundary County, Idaho 4605.0 1004.0 16 21 16021 0.218024
4 30071 POLYGON ((-107.17840 49.00011, -107.20712 48.9... Phillips County, Montana 1770.0 484.0 30 71 30071 0.273446

Step 2: Normalize the data column to the range 0 to 1

  • We will use a matplotlib color map that requires data to be between 0 and 1
  • Normalize our "percent_no_internet" column to be between 0 and 1
In [10]:
# Minimum
min_val = census_joined['percent_no_internet'].min()

# Maximum
max_val = census_joined['percent_no_internet'].max()

# Calculate a normalized column
normalized = (census_joined['percent_no_internet'] - min_val) / (max_val - min_val)

# Add to the dataframe
census_joined['percent_no_internet_normalized'] = normalized

Step 3: Define our style functions

  • Create a matplotlib colormap object using plt.get_cmap()
  • Color map objects are functions: give the function a number between 0 and 1 and it will return a corresponding color from the color map
  • Based on the feature data, evaluate the color map and convert to a hex string
In [11]:
import matplotlib.colors as mcolors
In [12]:
# Use a red-purple colorbrewer color scheme
cmap = plt.get_cmap('RdPu')
In [13]:
# The minimum value of the color map as an RGB tuple
cmap(0)
Out[13]:
(1.0, 0.9686274509803922, 0.9529411764705882, 1.0)
In [14]:
# The minimum value of the color map as a hex string
mcolors.rgb2hex(cmap(0.0))
Out[14]:
'#fff7f3'
In [15]:
# The maximum value of the color map as a hex string
mcolors.rgb2hex(cmap(1.0))
Out[15]:
'#49006a'
In [16]:
def get_style(feature):
    """
    Given an input GeoJSON feature, return a style dict.
    
    Notes
    -----
    The color in the style dict is determined by the 
    "percent_no_internet_normalized" column in the 
    input "feature".
    """
    # Get the data value from the feature
    value = feature['properties']['percent_no_internet_normalized']
    
    # Evaluate the color map
    # NOTE: value must be between 0 and 1
    rgb_color = cmap(value) # this is an RGB tuple
    
    # Convert to hex string
    color = mcolors.rgb2hex(rgb_color)
    
    # Return the style dictionary
    return {'weight': 0.5, 'color': color, 'fillColor': color, "fillOpacity": 0.75}
In [17]:
def get_highlighted_style(feature):
    """
    Return a style dict to use when the user highlights a 
    feature with the mouse.
    """
    
    return {"weight": 3, "color": "black"}

Step 4: Convert our data to GeoJSON

  • Tip: To limit the amount of data Folium has to process, it's best to trim our GeoDataFrame to only the columns we'll need before converting to GeoJSON
  • You can use the .to_json() function to convert to a GeoJSON string
In [18]:
needed_cols = ['NAME', 'percent_no_internet', 'percent_no_internet_normalized', 'geometry']
census_json = census_joined[needed_cols].to_json()
In [19]:
# STEP 1: Initialize the map
m = folium.Map(location=[40, -98], zoom_start=4)

# STEP 2: Add the GeoJson to the map
folium.GeoJson(
    census_json, # The geometry + data columns in GeoJSON format
    style_function=get_style, # The style function to color counties differently
    highlight_function=get_highlighted_style, 
    tooltip=folium.GeoJsonTooltip(['NAME', 'percent_no_internet'])
).add_to(m)



# avoid a rendering bug by saving as HTML and re-loading
m.save('percent_no_internet.html')

And voilà!

The hard way is harder, but we have a tooltip and highlight interactivity!

In [20]:
from IPython.display import IFrame
In [21]:
IFrame('percent_no_internet.html', width=800, height=500)
Out[21]:
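For comparison, the easy way with folium.Choropleth is much shorter, at the cost of the custom styling and highlight interaction. A sketch, using the same data files as above (the styling keywords shown are illustrative):

# The "easy way": folium.Choropleth builds the style functions for us
m2 = folium.Map(location=[40, -98], zoom_start=4)

folium.Choropleth(
    geo_data="./data/us-counties-10m.geojson",  # the geometry features
    data=census_data,                           # the data values
    columns=["geoid", "percent_no_internet"],   # [key column, value column]
    key_on="feature.id",                        # GeoJSON field to match "geoid" against
    fill_color="RdPu",                          # colorbrewer scheme name
    legend_name="Percent Without Internet",
).add_to(m2)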

At-home exercise: Can we repeat this with altair?

Try to replicate the above interactive map exactly (minus the background tiles). This includes:

  • Using the red-purple colorbrewer scheme
  • Having a tooltip with the percentage and county name

Note: Altair's syntax is similar to that of folium.Choropleth. Pass the counties GeoDataFrame to alt.Chart() and then use transform_lookup() to merge those geometries with the census data, pulling in the column we need ("percent_no_internet").

Hints

In [22]:
import altair as alt
In [23]:
# Initialize the chart with the counties data
alt.Chart(counties).mark_geoshape(stroke="white", strokeWidth=0.25).encode(
    # Encode the color 
    color=alt.Color(
        "percent_no_internet:Q",
        title="Percent Without Internet",
        scale=alt.Scale(scheme="redpurple"),
        legend=alt.Legend(format=".0%")
    ),
    # Tooltip
    tooltip=[
        alt.Tooltip("NAME:N", title="Name"),
        alt.Tooltip("percent_no_internet:Q", title="Percent Without Internet", format=".1%"),
    ],
).transform_lookup(
    lookup="id", # The column name in the counties data to match on
    from_=alt.LookupData(census_data, "geoid", ["percent_no_internet", "NAME"]), # Match census data on "geoid"
).project(
    type="albersUsa"
).properties(
    width=700, height=500
)
Out[23]:

Leaflet/Folium plugins

One of leaflet's strengths: a rich set of open-source plugins

https://leafletjs.com/plugins.html

Many of these are available in Folium!
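For example, the MarkerCluster plugin groups nearby points into zoomable clusters. A quick sketch (the coordinates below are made up for illustration):

from folium.plugins import MarkerCluster

# MarkerCluster groups nearby markers and expands them as you zoom in
m = folium.Map(location=[39.95, -75.16], zoom_start=12)
cluster = MarkerCluster().add_to(m)

# Hypothetical (latitude, longitude) points
for lat, lng in [(39.95, -75.16), (39.96, -75.17), (39.94, -75.15)]:
    folium.Marker(location=[lat, lng]).add_to(cluster)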

Example: Heatmap

In [24]:
from folium.plugins import HeatMap
In [25]:
HeatMap?

Example: A heatmap of new construction permits in Philadelphia in the last 30 days

New construction in Philadelphia requires a building permit, which we can pull from Open Data Philly.

Step 1: Download the data from CARTO

The "current_date" variable in SQL databases

You can use the pre-defined "current_date" variable to get the current date. For example, to get the permits from the past 30 days, we could do:

SELECT * FROM permits WHERE permitissuedate >= current_date - 30

Selecting only new construction permits

To select new construction permits, you can use the "permitdescription" field. There are two relevant categories:

  • "RESIDENTIAL BUILDING PERMIT"
  • "COMMERCIAL BUILDING PERMIT"

We can use the SQL IN command (documentation) to easily select rows that have these categories.

In [26]:
import carto2gpd
In [27]:
# API URL
url = "https://phl.carto.com/api/v2/sql"

# Table name on CARTO
table_name = "permits"

# The where clause, with two parts
DAYS = 30
where = f"permitissuedate >= current_date - {DAYS}"
where += " and permitdescription IN ('RESIDENTIAL BUILDING PERMIT', 'COMMERCIAL BUILDING PERMIT')"

where
Out[27]:
"permitissuedate >= current_date - 30 and permitdescription IN ('RESIDENTIAL BUILDING PERMIT', 'COMMERCIAL BUILDING PERMIT')"
In [28]:
# Run the query
permits = carto2gpd.get(url, table_name, where=where)
In [29]:
len(permits)
Out[29]:
565
In [30]:
permits.head()
Out[30]:
geometry cartodb_id objectid permitnumber addressobjectid parcel_id_num permittype permitdescription commercialorresidential typeofwork ... unit_type unit_num zip censustract council_district opa_owner systemofrecord geocode_x geocode_y posse_jobid
0 POINT (-75.17740 39.93647) 4985 4861 CP-2021-006100 136746075 350978 BUILDING COMMERCIAL BUILDING PERMIT COMMERCIAL ADDITION AND/OR ALTERATION ... None None 19146-4354 21 2 CLEMONS TIMOTHY ECLIPSE 2.689812e+06 230221.856589 391995256
1 None 6036 5279 RP-2021-006695 313880290 Parcel A RESIDENTIAL BUILDING RESIDENTIAL BUILDING PERMIT RESIDENTIAL NEW CONSTRUCTION ... # PARCELA 19122 144 None None ECLIPSE NaN NaN 347147032
2 POINT (-75.16793 39.94323) 11470 10792 RP-2021-013327 130247323 452359 RESIDENTIAL BUILDING RESIDENTIAL BUILDING PERMIT RESIDENTIAL ADDITION AND/OR ALTERATION ... None None 19146-1634 14 2 HEISLER TRACY A, HEISLER ERIC J ECLIPSE 2.692394e+06 232759.345244 385569983
3 POINT (-75.13402 39.99340) 11493 10794 RP-2021-016523 134395653 182217 RESIDENTIAL BUILDING RESIDENTIAL BUILDING PERMIT RESIDENTIAL ADDITION AND/OR ALTERATION ... None None 19133-3529 176.01 7 BURGOS MICHELLE ECLIPSE 2.701358e+06 251308.340764 403583350
4 POINT (-75.18463 39.95285) 15459 15886 CP-2021-001669 132682359 152856 BUILDING COMMERCIAL BUILDING PERMIT COMMERCIAL NEW CONSTRUCTION (SHELL ONLY) ... None None 19104-3413 369 3 HORIZON HOUSE INC ECLIPSE 2.687613e+06 236125.949970 329652657

5 rows × 33 columns

Step 2: Remove missing geometries

Some permits don't have locations; use the .geometry.notnull() function to trim the data frame to the permits with valid geometries.

In [31]:
permits = permits.loc[permits.geometry.notnull()].copy()

Step 3: Extract the lat/lng coordinates

Note: We need an array of (latitude, longitude) pairs. Be careful about the order!

In [32]:
# Extract the latitude and longitude from the geometry column
permits['lat'] = permits.geometry.y
permits['lng'] = permits.geometry.x
In [33]:
permits.head()
Out[33]:
geometry cartodb_id objectid permitnumber addressobjectid parcel_id_num permittype permitdescription commercialorresidential typeofwork ... zip censustract council_district opa_owner systemofrecord geocode_x geocode_y posse_jobid lat lng
0 POINT (-75.17740 39.93647) 4985 4861 CP-2021-006100 136746075 350978 BUILDING COMMERCIAL BUILDING PERMIT COMMERCIAL ADDITION AND/OR ALTERATION ... 19146-4354 21 2 CLEMONS TIMOTHY ECLIPSE 2.689812e+06 230221.856589 391995256 39.936471 -75.177401
2 POINT (-75.16793 39.94323) 11470 10792 RP-2021-013327 130247323 452359 RESIDENTIAL BUILDING RESIDENTIAL BUILDING PERMIT RESIDENTIAL ADDITION AND/OR ALTERATION ... 19146-1634 14 2 HEISLER TRACY A, HEISLER ERIC J ECLIPSE 2.692394e+06 232759.345244 385569983 39.943226 -75.167934
3 POINT (-75.13402 39.99340) 11493 10794 RP-2021-016523 134395653 182217 RESIDENTIAL BUILDING RESIDENTIAL BUILDING PERMIT RESIDENTIAL ADDITION AND/OR ALTERATION ... 19133-3529 176.01 7 BURGOS MICHELLE ECLIPSE 2.701358e+06 251308.340764 403583350 39.993400 -75.134018
4 POINT (-75.18463 39.95285) 15459 15886 CP-2021-001669 132682359 152856 BUILDING COMMERCIAL BUILDING PERMIT COMMERCIAL NEW CONSTRUCTION (SHELL ONLY) ... 19104-3413 369 3 HORIZON HOUSE INC ECLIPSE 2.687613e+06 236125.949970 329652657 39.952847 -75.184629
5 POINT (-75.16541 39.92964) 16078 16365 RP-2021-011113 15443506 428359 RESIDENTIAL BUILDING RESIDENTIAL BUILDING PERMIT RESIDENTIAL ADDITION AND/OR ALTERATION ... 19148-1007 29 1 BUCKMAN LAURIE G ECLIPSE 2.693247e+06 227834.344041 373025230 39.929644 -75.165408

5 rows × 35 columns

In [34]:
# Split out the residential and commercial
residential = permits.query("permitdescription == 'RESIDENTIAL BUILDING PERMIT'")
commercial = permits.query("permitdescription == 'COMMERCIAL BUILDING PERMIT'")
In [35]:
# Make a NumPy array (use the "values" attribute)
residential_coords = residential[['lat', 'lng']].values
commercial_coords = commercial[['lat', 'lng']].values
In [36]:
commercial_coords[:5]
Out[36]:
array([[ 39.936471, -75.177401],
       [ 39.952847, -75.184629],
       [ 39.949254, -75.170423],
       [ 40.058061, -75.078664],
       [ 39.971096, -75.21212 ]])

Step 4: Make a Folium map and add a HeatMap

The HeatMap takes a list of coordinates: the first column is latitude and the second is longitude

Commercial building permits

In [37]:
# Initialize map
m = folium.Map(
    location=[39.99, -75.13],
    tiles='Cartodb Positron',
    zoom_start=12
)


# Add heat map coordinates
HeatMap(commercial_coords, radius=15).add_to(m)

m
Out[37]:

Residential building permits

In [39]:
# Initialize map
m = folium.Map(
    location=[39.99, -75.13],
    tiles='Cartodb Positron',
    zoom_start=12
)


# Add heat map
HeatMap(residential_coords, radius=15).add_to(m)

m
Out[39]:

Takeaways

Commercial construction is concentrated in the greater Center City area, while residential construction happens primarily outside of Center City...

That's it for interactive maps w/ Folium...

Now on to clustering...

Clustering in Python

  • Both spatial and non-spatial datasets
  • Two new techniques:
    • Non-spatial: K-means
    • Spatial: DBSCAN
  • Two labs/exercises this week:
    1. Grouping Philadelphia neighborhoods by AirBnb listings
    2. Identifying clusters in taxi rides in NYC

"Machine learning"

  • The computer learns patterns and properties of an input data set without the user specifying them beforehand
  • Can be both supervised and unsupervised

Supervised

  • Example: classification
  • Given a training set of labeled data, learn to assign labels to new data

Unsupervised

  • Example: clustering
  • Identify structure / clusters in data without any prior knowledge
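In scikit-learn terms, the difference shows up in the fit() call: supervised estimators are fit on features and labels, while unsupervised estimators see only the features. A toy sketch (with made-up data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0.0], [0.1], [5.0], [5.1]])  # toy features
y = np.array([0, 0, 1, 1])                  # known labels (supervised case only)

# Supervised: learn from labeled examples, then label new data
clf = RandomForestClassifier().fit(X, y)
clf.predict([[0.05], [5.05]])               # should recover the two classes

# Unsupervised: no labels given; groups are discovered from X alone
km = KMeans(n_clusters=2).fit(X)
km.labels_                                  # cluster labels (arbitrary order)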

Machine learning in Python: scikit-learn

  • State-of-the-art machine learning in Python
  • Easy to use, lots of functionality

Clustering is just one (of many) features

https://scikit-learn.org/stable/

Note: We will focus on clustering algorithms today and discuss a few other machine learning techniques in the next two weeks. If there is a specific scikit-learn use case we won't cover, I'm open to ideas for incorporating it as part of the final project.

Part 1: Non-spatial clustering

The goal

Partition a dataset into groups that have a similar set of attributes, or features, within the group and a dissimilar set of features between groups.

Minimize the intra-cluster variance and maximize the inter-cluster variance of features.

Some intuition

K-Means clustering

  • Simple but robust clustering algorithm
  • Widely used
  • Important: user must specify the number of clusters
  • Cannot be used to find density-based clusters

A good introduction

Andrew Ng's Coursera lecture

How does it work?

Minimizes the intra-cluster variance: minimizes the sum of the squared distances between all points in a cluster and the cluster centroid
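In symbols: with $K$ clusters $C_k$ and centroids $\mu_k$, K-means seeks the assignment of points to clusters that minimizes the total within-cluster sum of squares (the "inertia"):

$$\sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$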

K-means in action

Example: Clustering countries by health and income

  • Life expectancy in years vs. GDP per capita and population for 187 countries (as of 2015)
  • Data from Gapminder
In [40]:
import altair as alt
from vega_datasets import data as vega_data

Read the data from a URL:

In [41]:
gapminder = pd.read_csv(vega_data.gapminder_health_income.url)
gapminder.head()
Out[41]:
country income health population
0 Afghanistan 1925 57.63 32526562
1 Albania 10620 76.00 2896679
2 Algeria 13434 76.50 39666519
3 Andorra 46577 84.10 70473
4 Angola 7615 61.00 25021974

Plot it with altair

In [42]:
alt.Chart(gapminder).mark_circle().encode(
    alt.X("income:Q", scale=alt.Scale(type="log")),
    alt.Y("health:Q", scale=alt.Scale(zero=False)),
    size='population:Q',
    tooltip=list(gapminder.columns),
).interactive()
Out[42]:

K-Means with scikit-learn

In [43]:
from sklearn.cluster import KMeans

Let's start with 5 clusters

In [44]:
KMeans?
In [45]:
kmeans = KMeans(n_clusters=5)

Lots of optional parameters, but n_clusters is the most important:

In [46]:
kmeans
Out[46]:
KMeans(n_clusters=5)

Let's fit just income first

Use the fit() function

In [47]:
kmeans.fit(gapminder[['income']])
Out[47]:
KMeans(n_clusters=5)

Extract the cluster labels

Use the labels_ attribute

In [48]:
gapminder['label'] = kmeans.labels_

How big are our clusters?

In [49]:
gapminder.groupby('label').size()
Out[49]:
label
0    106
1     24
2      6
3     50
4      1
dtype: int64

Plot it again, coloring by our labels

In [50]:
alt.Chart(gapminder).mark_circle().encode(
    alt.X('income:Q', scale=alt.Scale(type='log')),
    alt.Y('health:Q', scale=alt.Scale(zero=False)),
    size='population:Q',
    color=alt.Color('label:N', scale=alt.Scale(scheme='dark2')),
    tooltip=list(gapminder.columns)
).interactive()
Out[50]:

Calculate average income by group

In [51]:
gapminder.groupby("label")['income'].mean().sort_values()
Out[51]:
label
0      5279.830189
3     21040.820000
1     42835.500000
2     74966.166667
4    132877.000000
Name: income, dtype: float64

Data is nicely partitioned into income levels
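Side note: once fit, the model can also assign cluster labels to new observations via predict(). A quick sketch (the income values here are hypothetical):

# Label new, made-up income values using the fitted income clusters
new_incomes = np.array([[1000], [20000], [100000]])
kmeans.predict(new_incomes)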

How about health, income, and population?

In [52]:
# Fit all three columns
kmeans.fit(gapminder[['income', 'health', 'population']])

# Extract the labels
gapminder['label'] = kmeans.labels_

Plot the new labels

In [53]:
alt.Chart(gapminder).mark_circle().encode(
    alt.X('income:Q', scale=alt.Scale(type='log')),
    alt.Y('health:Q', scale=alt.Scale(zero=False)),
    size='population:Q',
    color=alt.Color('label:N', scale=alt.Scale(scheme='dark2')),
    tooltip=list(gapminder.columns)
).interactive()
Out[53]:

It... didn't work that well

What's wrong?

K-means is distance-based, but our features have wildly different distance scales

scikit-learn to the rescue: pre-processing

  • Scikit-learn has a utility to normalize features so they have a mean of zero and a variance of one
  • Use the StandardScaler class
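Under the hood the transformation is simple: each feature $x$ is replaced by $z = (x - \mu) / \sigma$, where $\mu$ is the feature's mean and $\sigma$ its standard deviation.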
In [54]:
# An alternative scaler: MinMaxScaler rescales each feature to the range [0, 1]
from sklearn.preprocessing import MinMaxScaler
In [55]:
from sklearn.preprocessing import StandardScaler
In [56]:
scaler = StandardScaler()

Use the fit_transform() function to scale your features

In [57]:
gapminder_scaled = scaler.fit_transform(gapminder[['income', 'health', 'population']])

Important: The fit_transform() function converts the DataFrame to a numpy array:

In [58]:
# fit_transform() converts the data into a numpy array
gapminder_scaled[:5]
Out[58]:
array([[-0.79481258, -1.8171424 , -0.04592039],
       [-0.34333373,  0.55986273, -0.25325837],
       [-0.1972197 ,  0.62456075,  0.00404216],
       [ 1.52369617,  1.6079706 , -0.27303503],
       [-0.49936524, -1.38107777, -0.09843447]])
In [59]:
# mean of zero
gapminder_scaled.mean(axis=0)
Out[59]:
array([ 8.07434927e-17, -1.70511258e-15, -1.89984689e-17])
In [60]:
# standard deviation (and hence variance) of one
gapminder_scaled.std(axis=0)
Out[60]:
array([1., 1., 1.])

Now fit the scaled features

In [61]:
# Perform the fit
kmeans.fit(gapminder_scaled)

# Extract the labels
gapminder['label'] = kmeans.labels_
In [62]:
alt.Chart(gapminder).mark_circle().encode(
    alt.X('income:Q', scale=alt.Scale(type='log')),
    alt.Y('health:Q', scale=alt.Scale(zero=False)),
    size='population:Q',
    color=alt.Color('label:N', scale=alt.Scale(scheme='dark2')),
    tooltip=list(gapminder.columns)
).interactive()
Out[62]:
In [63]:
# Number of countries per cluster
gapminder.groupby("label").size()
Out[63]:
label
0    86
1     2
2    32
3    61
4     6
dtype: int64
In [64]:
# Average population per cluster (in millions)
gapminder.groupby("label")['population'].mean().sort_values() / 1e6
Out[64]:
label
4       2.988746
3      21.441296
0      26.239937
2      32.501032
1    1343.549735
Name: population, dtype: float64
In [65]:
# Average life expectancy per cluster
gapminder.groupby("label")['health'].mean().sort_values()
Out[65]:
label
3    62.232951
1    71.850000
0    74.313837
2    80.806250
4    81.033333
Name: health, dtype: float64
In [66]:
# Average income per cluster (in thousands)
gapminder.groupby("label")['income'].mean().sort_values() / 1e3
Out[66]:
label
3     4.150623
1     9.618500
0    13.229907
2    40.322094
4    86.987500
Name: income, dtype: float64
In [67]:
gapminder.loc[gapminder['label']==4]
Out[67]:
country income health population label
24 Brunei 73003 78.7 423188 4
88 Kuwait 82633 80.7 3892115 4
97 Luxembourg 88314 81.1 567110 4
124 Norway 64304 81.6 5210967 4
134 Qatar 132877 82.0 2235355 4
145 Singapore 80794 82.1 5603740 4
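The fitted model's inertia_ attribute stores the sum of squared distances from each sample to its nearest cluster center, i.e., the quantity K-means minimizes: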
In [68]:
kmeans.inertia_
Out[68]:
80.84493050696734
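Since the inertia always decreases as clusters are added, a common heuristic for choosing n_clusters is the "elbow method": fit a range of cluster counts and look for the bend in the inertia curve. A minimal sketch, reusing the scaled features from above:

# Elbow method: plot inertia vs. number of clusters and look for the "elbow"
inertias = []
ks = range(1, 11)
for k in ks:
    inertias.append(KMeans(n_clusters=k).fit(gapminder_scaled).inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")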
In [69]:
gapminder.loc[gapminder['label']==2]
Out[69]:
country income health population label
3 Andorra 46577 84.1 70473 2
8 Australia 44056 81.8 23968973 2
9 Austria 44401 81.0 8544586 2
12 Bahrain 44138 79.2 1377237 2
16 Belgium 41240 80.4 11299192 2
30 Canada 43294 81.7 35939927 2
44 Cyprus 29797 82.6 1165300 2
45 Czech Republic 29437 78.6 10543186 2
46 Denmark 43495 80.1 5669081 2
58 Finland 38923 80.8 5503457 2
59 France 37599 81.9 64395345 2
63 Germany 44053 81.1 80688545 2
65 Greece 25430 79.8 10954617 2
74 Iceland 42182 82.8 329425 2
79 Ireland 47758 80.4 4688465 2
80 Israel 31590 82.4 8064036 2
81 Italy 33297 82.1 59797685 2
83 Japan 36162 83.5 126573481 2
104 Malta 30265 82.1 418670 2
118 Netherlands 45784 80.6 16924929 2
119 New Zealand 34186 80.6 4528526 2
125 Oman 48226 75.7 4490541 2
133 Portugal 26437 79.8 10349803 2
140 Saudi Arabia 52469 78.1 31540372 2
147 Slovenia 28550 80.2 2067526 2
151 South Korea 34644 80.7 50293439 2
153 Spain 32979 81.7 46121699 2
160 Sweden 44892 82.0 9779426 2
161 Switzerland 56118 82.9 8298663 2
175 United Arab Emirates 60749 76.6 9156963 2
176 United Kingdom 38225 81.4 64715810 2
177 United States 53354 79.1 321773631 2

Exercise: Clustering neighborhoods by Airbnb stats

I've extracted Airbnb statistics for Philadelphia neighborhoods from Tom Slee's website.

The data includes average price per person, overall satisfaction, and number of listings.

Two good references for Airbnb data

Original research study: How Airbnb's Data Hid the Facts in New York City

To be continued!

In [ ]: