In [1]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
np.random.seed(42)
In [2]:
plt.rcParams['figure.figsize'] = (10,6)

Lecture 10B: Clustering Analysis in Python

Nov 10, 2021

Housekeeping

  • Assignment #5 (optional) due by the end of the day today
  • Assignment #6 (optional) assigned today and due in two weeks (11/24)
    • Covers urban street networks and osmnx
    • Plotting a folium heatmap of a dataset of your choosing — similar to the building permit example from last lecture
  • Remember, you have to do one of homeworks #4, #5, or #6

Last lecture (10A)

  • Making a Folium Choropleth the "hard way" — using folium.GeoJson()
  • Folium heatmap of new construction permits in Philadelphia
  • Introduction to clustering with scikit-learn
  • K-means algorithm for non-spatial clustering

Today

  • Exercise on non-spatial clustering with k-means: Airbnb data in Philly neighborhoods
  • Spatial clustering with the DBSCAN algorithm
  • Exercise on spatial clustering of NYC taxi trips
In [3]:
import altair as alt
from vega_datasets import data as vega_data
In [4]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Exercise: Clustering neighborhoods by Airbnb stats

I've extracted Airbnb statistics for Philadelphia neighborhoods from Tom Slee's website.

The data includes average price per person, overall satisfaction, and number of listings.

Two good references for Airbnb data

Original research study: How Airbnb's Data Hid the Facts in New York City

Step 1: Load the data with pandas

The data is available in CSV format ("philly_airbnb_by_neighborhoods.csv") in the "data/" folder of the repository.

In [5]:
airbnb = pd.read_csv("data/philly_airbnb_by_neighborhoods.csv")
airbnb.head()
Out[5]:
neighborhood price_per_person overall_satisfaction N
0 ALLEGHENY_WEST 120.791667 4.666667 23
1 BELLA_VISTA 87.407920 3.158333 204
2 BELMONT 69.425000 3.250000 11
3 BREWERYTOWN 71.788188 1.943182 142
4 BUSTLETON 55.833333 1.250000 19

Step 2: Perform the K-Means fit

  • Use our three features: price_per_person, overall_satisfaction, N
  • I used 5 clusters, but you are welcome to experiment with different values!
  • Scaling the features is recommended, but the scales aren't too different here, so it probably isn't strictly necessary in this case
In [6]:
# Initialize the Kmeans object
kmeans = KMeans(n_clusters=5, random_state=42)

# Scale the data features we want
scaler = StandardScaler()
scaled_airbnb_data = scaler.fit_transform(airbnb[['price_per_person', 'overall_satisfaction', 'N']])
In [7]:
# Run the fit!
kmeans.fit(scaled_airbnb_data)

# Save the cluster labels
airbnb['label'] = kmeans.labels_
In [8]:
# New column "label"!
airbnb.head()
Out[8]:
neighborhood price_per_person overall_satisfaction N label
0 ALLEGHENY_WEST 120.791667 4.666667 23 2
1 BELLA_VISTA 87.407920 3.158333 204 2
2 BELMONT 69.425000 3.250000 11 2
3 BREWERYTOWN 71.788188 1.943182 142 1
4 BUSTLETON 55.833333 1.250000 19 1

Step 3: Calculate average features per cluster

To gain some insight into our clusters, after calculating the K-Means labels:

  • group by the label column
  • calculate the mean() of each of our features
  • calculate the number of neighborhoods per cluster
In [9]:
airbnb.groupby('label', as_index=False).size()
Out[9]:
label size
0 0 21
1 1 23
2 2 59
3 3 1
4 4 1
In [10]:
airbnb.groupby('label', as_index=False).mean().sort_values(by='price_per_person')
Out[10]:
label price_per_person overall_satisfaction N
2 2 67.526755 3.232236 69.864407
1 1 78.935232 0.920238 38.347826
0 0 119.329173 2.947605 313.000000
4 4 136.263996 3.000924 1499.000000
3 3 387.626984 5.000000 31.000000
In [11]:
airbnb.loc[airbnb['label'] == 2]
Out[11]:
neighborhood price_per_person overall_satisfaction N label
0 ALLEGHENY_WEST 120.791667 4.666667 23 2
1 BELLA_VISTA 87.407920 3.158333 204 2
2 BELMONT 69.425000 3.250000 11 2
5 CALLOWHILL 94.733269 3.017241 143 2
6 CEDAR_PARK 39.053171 3.220000 261 2
8 CHESTNUT_HILL 69.123478 3.444444 74 2
9 CHINATOWN 48.476667 2.984848 101 2
11 DICKINSON_NARROWS 62.626393 3.227273 125 2
12 EASTWICK 63.566667 2.142857 18 2
13 EAST_FALLS 64.491228 2.522222 156 2
14 EAST_KENSINGTON 58.984428 2.602564 108 2
15 EAST_OAK_LANE 113.243590 4.800000 20 2
17 EAST_PASSYUNK 81.253676 3.716981 190 2
18 EAST_POPLAR 69.164368 3.350000 36 2
24 FRANKFORD 28.500000 3.583333 19 2
25 FRANKLINVILLE 74.968750 2.409091 12 2
26 GARDEN_COURT 45.954375 3.467391 93 2
27 GERMANTOWN_EAST 86.875000 3.750000 12 2
28 GERMANTOWN_MORTON 37.357143 3.750000 12 2
29 GERMANTOWN_PENN_KNOX 36.030702 4.666667 21 2
30 GERMANTOWN_SOUTHWEST 39.105263 3.083333 25 2
31 GERMANTOWN_WESTSIDE 49.830128 2.375000 27 2
32 GERMANTOWN_WEST_CENT 57.004884 2.891892 90 2
33 GERMANY_HILL 135.208333 5.000000 20 2
34 GIRARD_ESTATES 90.461111 2.875000 78 2
36 GRAYS_FERRY 66.833535 3.035714 65 2
37 GREENWICH 57.557576 2.500000 14 2
39 HARROWGATE 24.796429 3.458333 16 2
40 HARTRANFT 77.105300 2.800000 73 2
41 HAVERFORD_NORTH 102.025000 4.000000 11 2
44 KINGSESSING 41.005000 2.781250 44 2
46 LOGAN 63.582598 2.800000 18 2
48 LOWER_MOYAMENSING 72.605479 2.838710 100 2
49 LUDLOW 63.333333 4.400000 11 2
51 MANTUA 96.527778 2.968750 162 2
53 MILL_CREEK 53.020833 4.357143 37 2
54 MOUNT_AIRY_EAST 47.598086 3.435484 93 2
55 MOUNT_AIRY_WEST 52.799916 3.468750 115 2
56 NEWBOLD 82.836628 2.976190 121 2
59 OGONTZ 35.389194 2.250000 13 2
61 OLD_KENSINGTON 75.641209 2.945946 118 2
62 OLNEY 66.591479 2.416667 22 2
66 PASCHALL 21.722222 4.071429 12 2
67 PASSYUNK_SQUARE 86.239845 3.742647 232 2
68 PENNSPORT 88.773897 3.100000 100 2
71 POWELTON 77.697032 2.673077 79 2
74 RICHMOND 79.432611 2.613636 71 2
77 ROXBOROUGH 88.504646 3.270833 124 2
81 SOUTHWEST_SCHUYLKILL 83.098639 2.375000 26 2
84 STADIUM_DISTRICT 118.981061 2.875000 31 2
87 TIOGA 44.636364 3.833333 14 2
89 UPPER_ROXBOROUGH 96.724194 3.230769 43 2
90 WALNUT_HILL 55.033582 3.104167 82 2
92 WEST_KENSINGTON 39.422287 3.119048 54 2
94 WEST_PASSYUNK 48.846491 3.500000 48 2
96 WEST_POWELTON 59.056780 2.733333 139 2
97 WHITMAN 63.386458 3.409091 50 2
98 WISSAHICKON 62.733060 2.600000 83 2
101 WOODLAND_TERRACE 66.902778 3.062500 22 2
In [12]:
airbnb.loc[airbnb['label'] == 3]
Out[12]:
neighborhood price_per_person overall_satisfaction N label
78 SHARSWOOD 387.626984 5.0 31 3
In [13]:
airbnb.loc[airbnb['label'] == 4]
Out[13]:
neighborhood price_per_person overall_satisfaction N label
75 RITTENHOUSE 136.263996 3.000924 1499 4

Step 4: Plot a choropleth, coloring neighborhoods by their cluster label

  • Part 1: Load the Philadelphia neighborhoods available in the data directory:
    • ./data/philly_neighborhoods.geojson
  • Part 2: Merge the Airbnb data (with labels) and the neighborhood polygons
  • Part 3: Use geopandas to plot the neighborhoods
    • The categorical=True and legend=True keywords will be useful here
In [14]:
hoods = gpd.read_file("./data/philly_neighborhoods.geojson")
hoods.head()
Out[14]:
Name geometry
0 LAWNDALE POLYGON ((-75.08616 40.05013, -75.08893 40.044...
1 ASTON_WOODBRIDGE POLYGON ((-75.00860 40.05369, -75.00861 40.053...
2 CARROLL_PARK POLYGON ((-75.22673 39.97720, -75.22022 39.974...
3 CHESTNUT_HILL POLYGON ((-75.21278 40.08637, -75.21272 40.086...
4 BURNHOLME POLYGON ((-75.08768 40.06861, -75.08758 40.068...
In [15]:
airbnb.head()
Out[15]:
neighborhood price_per_person overall_satisfaction N label
0 ALLEGHENY_WEST 120.791667 4.666667 23 2
1 BELLA_VISTA 87.407920 3.158333 204 2
2 BELMONT 69.425000 3.250000 11 2
3 BREWERYTOWN 71.788188 1.943182 142 1
4 BUSTLETON 55.833333 1.250000 19 1
In [16]:
# do the merge
airbnb2 = hoods.merge(airbnb, left_on='Name', right_on='neighborhood', how='left')

# assign -1 to the neighborhoods without any listings
airbnb2['label'] = airbnb2['label'].fillna(-1)
In [17]:
# plot the data
airbnb2 = airbnb2.to_crs(epsg=3857)

# setup the figure
f, ax = plt.subplots(figsize=(10, 8))

# plot, coloring by label column
# specify categorical data and add legend
airbnb2.plot(
    column="label",
    cmap="Dark2",
    categorical=True,
    legend=True,
    edgecolor="k",
    lw=0.5,
    ax=ax,
)


ax.set_axis_off()
plt.axis("equal");

Step 5: Plot an interactive map

Use altair to plot the clustering results with a tooltip showing the neighborhood name and cluster label.

Hint: See week 3B's lecture on interactive choropleths with altair

In [18]:
# plot map, where variables are nested within `properties`
alt.Chart(airbnb2.to_crs(epsg=4326)).mark_geoshape().properties(
    width=500, height=400,
).encode(
    tooltip=["Name:N", "label:N"],
    color=alt.Color("label:N", scale=alt.Scale(scheme="Dark2")),
)
Out[18]:

Based on these results, where would you want to stay?

Group 2!

Why 5 groups?

Use the "Elbow" method...

In [19]:
# Number of clusters to try out
n_clusters = list(range(2, 10))

# Run kmeans for each value of k
inertias = []
for k in n_clusters:
    
    # Initialize and run
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(scaled_airbnb_data)
    
    # Save the "inertia"
    inertias.append(kmeans.inertia_)
    
# Plot it!
plt.plot(n_clusters, inertias, marker='o', ms=10, mfc='white', lw=4, mew=3)
Out[19]:
[<matplotlib.lines.Line2D at 0x7fc05e8e2a60>]

The kneed package

There is also a nice package called kneed that can determine the “knee” point quantitatively, using the kneedle algorithm.

In [20]:
from kneed import KneeLocator

# Initialize the knee algorithm
kn = KneeLocator(n_clusters, inertias, curve='convex', direction='decreasing')

# Print out the knee 
print(kn.knee)
5

Part 2: Spatial clustering

Now on to the more traditional view of "clustering"...

DBSCAN

"Density-Based Spatial Clustering of Applications with Noise"

  • Clusters are areas of high density separated by areas of low density.
  • Can identify clusters of any shape
  • Good at separating core samples in high-density regions from low-density noise samples
  • Best for spatial data

Two key parameters

  1. eps: The maximum distance between two samples for them to be considered as in the same neighborhood.
  2. min_samples: The number of samples in a neighborhood for a point to be considered as a core point (including the point itself).

Example Scenario

DBSCAN-Illustration.svg

  • min_samples = 4.
  • Point A and the other red points are core points
    • There are at least min_samples (4) points (including the point itself) within a distance of eps from each of these points.
    • These points are all reachable from one another, so they form a cluster
  • Points B and C are edge points of the cluster
    • They are reachable (within a distance of eps) from the core points, so they are part of the cluster
    • They do not have at least min_samples points within a distance of eps themselves, so they are not core points
  • Point N is a noise point: it is not within a distance of eps of any of the cluster's points (see the sketch below)
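A minimal sketch (toy coordinates invented for illustration, not lecture data) that reproduces this scenario with the sklearn dbscan function used later in the lecture, classifying each point as core, edge, or noise:

In [ ]:
import numpy as np
from sklearn.cluster import dbscan

# Toy points: a tight square of four points, one nearby-but-sparse point,
# and one isolated point far away
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense region
    [0.35, 0.0],                                      # near the cluster, but sparse
    [5.0, 5.0],                                       # far from everything
])

# Same min_samples as the scenario above; eps chosen to match the toy scale
cores, labels = dbscan(points, eps=0.3, min_samples=4)

# Mark which indices are core samples
is_core = np.zeros(len(points), dtype=bool)
is_core[cores] = True

for i in range(len(points)):
    if is_core[i]:
        kind = "core"   # at least min_samples points (incl. itself) within eps
    elif labels[i] != -1:
        kind = "edge"   # assigned to a cluster, but not dense enough itself
    else:
        kind = "noise"  # label of -1
    print(i, labels[i], kind)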

Importance of parameter choices

A higher min_samples or a lower eps means a higher density is necessary to form a cluster.
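As a rough illustration of that trade-off, here is a hedged sketch on synthetic data (three blobs from make_blobs plus uniform background points; the exact counts will vary): fix min_samples and vary eps, and watch the number of clusters and noise points change.

In [ ]:
import numpy as np
from sklearn.cluster import dbscan
from sklearn.datasets import make_blobs

# Synthetic data: three dense blobs plus a sprinkling of uniform background points
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.5, random_state=42)
background = np.random.default_rng(42).uniform(-10, 10, size=(100, 2))
X = np.vstack([X, background])

# Small eps -> most points end up as noise; large eps -> clusters start to merge
for eps in [0.05, 0.5, 5.0]:
    cores, labels = dbscan(X, eps=eps, min_samples=10)
    num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    num_noise = (labels == -1).sum()
    print(f"eps = {eps}: {num_clusters} clusters, {num_noise} noise points")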

Example: OpenStreetMap GPS traces in Philadelphia

  • Data extracted from the set of 1 billion GPS traces from OSM.
  • CRS is EPSG:3857 — x and y are in units of meters
In [22]:
coords = gpd.read_file('./data/osm_gps_philadelphia.geojson')
coords.head() 
Out[22]:
x y geometry
0 -8370750.5 4865303.0 POINT (-8370750.500 4865303.000)
1 -8368298.0 4859096.5 POINT (-8368298.000 4859096.500)
2 -8365991.0 4860380.0 POINT (-8365991.000 4860380.000)
3 -8372306.5 4868231.0 POINT (-8372306.500 4868231.000)
4 -8376768.5 4864341.0 POINT (-8376768.500 4864341.000)
In [23]:
num_points = len(coords)

print(f"Total number of points = {num_points}")
Total number of points = 52358

DBSCAN basics

In [24]:
from sklearn.cluster import dbscan 
In [25]:
dbscan?
In [32]:
# some parameters to start with
eps = 50  # in meters
min_samples = 50

cores, labels = dbscan(coords[["x", "y"]], eps=eps, min_samples=min_samples)

The function returns two objects, which we call cores and labels. cores contains the indices of each point that is classified as a core sample.

In [33]:
# The first 5 elements
cores[:5]
Out[33]:
array([ 1,  4,  6, 10, 12])

The length of cores tells you how many core samples we have:

In [34]:
num_cores = len(cores)
print(f"Number of core samples = {num_cores}")
Number of core samples = 19370

The labels array tells you the cluster number each point belongs to. Points classified as noise receive a cluster number of -1:

In [35]:
# The first 5 elements
labels[:5]
Out[35]:
array([-1,  0, -1, -1,  1])

The labels array is the same length as our input data, so we can add it as a column in our original data frame

In [36]:
# Add our labels to the original data
coords['label'] = labels
In [37]:
coords.head()
Out[37]:
x y geometry label
0 -8370750.5 4865303.0 POINT (-8370750.500 4865303.000) -1
1 -8368298.0 4859096.5 POINT (-8368298.000 4859096.500) 0
2 -8365991.0 4860380.0 POINT (-8365991.000 4860380.000) -1
3 -8372306.5 4868231.0 POINT (-8372306.500 4868231.000) -1
4 -8376768.5 4864341.0 POINT (-8376768.500 4864341.000) 1

The number of clusters is the number of unique labels minus one (because noise has a label of -1)

In [38]:
num_clusters = coords['label'].nunique() - 1
print(f"number of clusters = {num_clusters}")
number of clusters = 87

We can group by the label column to get the size of each cluster:

In [39]:
cluster_sizes = coords.groupby('label', as_index=False).size()

cluster_sizes
Out[39]:
label size
0 -1 29054
1 0 113
2 1 4073
3 2 1787
4 3 3370
... ... ...
83 82 54
84 83 56
85 84 50
86 85 53
87 86 50

88 rows × 2 columns

In [40]:
# All points get assigned a cluster label (-1 reserved for noise)
cluster_sizes['size'].sum() == num_points
Out[40]:
True

The number of noise points is the size of the cluster with label "-1":

In [41]:
num_noise = cluster_sizes.iloc[0]['size']

print(f"number of noise points = {num_noise}")
number of noise points = 29054

If points aren't noise or core samples, they must be edge points:

In [42]:
num_edges = num_points - num_cores - num_noise
print(f"Number of edge points = {num_edges}")
Number of edge points = 3934

Now let's plot the noise and clusters

  • Extract each cluster: select points with the same label number
  • Plot the cluster centers: the mean x and mean y value for each cluster
In [43]:
# Setup figure and axis
f, ax = plt.subplots(figsize=(10, 10), facecolor="black")

# Plot the noise samples in grey
noise = coords.loc[coords["label"] == -1]
ax.scatter(noise["x"], noise["y"], c="grey", s=5, linewidth=0)

# Loop over each cluster number
for label_num in range(0, num_clusters):

    # Extract the samples with this label number
    this_cluster = coords.loc[coords["label"] == label_num]

    # Calculate the mean (x,y) point for this cluster
    x_mean = this_cluster["x"].mean()
    y_mean = this_cluster["y"].mean()
    
    # Plot this centroid point in red
    ax.scatter(x_mean, y_mean, linewidth=0, color="red")

# Format
ax.set_axis_off()
ax.set_aspect("equal")

Extending DBSCAN beyond just spatial coordinates

DBSCAN can extract high-density clusters using more than just spatial coordinates, as long as all of the features are properly normalized.
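A minimal sketch of that recipe (the DataFrame below is made up for illustration): scale every feature to a comparable range with StandardScaler, then cluster the scaled array, which is exactly what the exercise that follows does with the taxi data.

In [ ]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import dbscan

# Hypothetical data: projected x/y coordinates (meters) plus hour of day
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "x": rng.normal(-8234000, 2000, size=1000),
    "y": rng.normal(4971000, 2000, size=1000),
    "hour": rng.integers(0, 24, size=1000),
})

# Put every feature on a comparable scale so no single one dominates the distance
scaled = StandardScaler().fit_transform(df[["x", "y", "hour"]])

# Note: eps is now in scaled (unitless) space rather than meters
cores, labels = dbscan(scaled, eps=0.25, min_samples=50)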

Exercise: Extracting patterns from NYC taxi rides

I've extracted data for taxi pickups or drop offs occurring in the Williamsburg neighborhood of NYC from the NYC taxi open data.

Includes data for:

  • Pickup/dropoff location
  • Fare amount
  • Trip distance
  • Pickup/dropoff hour

Goal: identify clusters of similar taxi rides that are not only clustered spatially, but also clustered in features like hour of day and trip distance

Inspired by this CARTO blog post

Step 1: Load the data

In [44]:
taxi = pd.read_csv("./data/williamsburg_taxi_trips.csv")
taxi.head()
Out[44]:
tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_x pickup_y dropoff_x dropoff_y fare_amount tip_amount dropoff_hour pickup_hour
0 2015-01-15 19:05:41 2015-01-15 19:20:22 2 7.13 -8223667.0 4979065.0 -8232341.0 4970922.0 21.5 4.50 19 19
1 2015-01-15 19:05:44 2015-01-15 19:17:44 1 2.92 -8237459.0 4971133.5 -8232725.0 4970482.5 12.5 2.70 19 19
2 2015-01-25 00:13:06 2015-01-25 00:34:32 1 3.05 -8236711.5 4972170.5 -8232267.0 4970362.0 16.5 5.34 0 0
3 2015-01-26 12:41:15 2015-01-26 12:59:22 1 8.10 -8222485.5 4978445.5 -8233442.5 4969903.5 24.5 5.05 12 12
4 2015-01-20 22:49:11 2015-01-20 22:58:46 1 3.50 -8236294.5 4970916.5 -8231820.5 4971722.0 12.5 2.00 22 22

Step 2: Extract and normalize several features

We will focus on the following columns:

  • pickup_x and pickup_y
  • dropoff_x and dropoff_y
  • trip_distance
  • pickup_hour

Use the StandardScaler to normalize these features.

In [45]:
feature_columns = [
    "pickup_x",
    "pickup_y",
    "dropoff_x",
    "dropoff_y",
    "trip_distance",
    "pickup_hour",
]
features = taxi[feature_columns].copy()
In [46]:
features.head()
Out[46]:
pickup_x pickup_y dropoff_x dropoff_y trip_distance pickup_hour
0 -8223667.0 4979065.0 -8232341.0 4970922.0 7.13 19
1 -8237459.0 4971133.5 -8232725.0 4970482.5 2.92 19
2 -8236711.5 4972170.5 -8232267.0 4970362.0 3.05 0
3 -8222485.5 4978445.5 -8233442.5 4969903.5 8.10 12
4 -8236294.5 4970916.5 -8231820.5 4971722.0 3.50 22
In [47]:
# Scale these features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

scaled_features
Out[47]:
array([[ 3.35254171e+00,  2.18196697e+00,  8.59345108e-02,
         1.89871932e-01, -2.58698769e-03,  8.16908067e-01],
       [-9.37728802e-01, -4.08167622e-01, -9.76176333e-02,
        -9.77141849e-03, -2.81690985e-03,  8.16908067e-01],
       [-7.05204353e-01, -6.95217705e-02,  1.21306539e-01,
        -6.45086739e-02, -2.80981012e-03, -1.32713022e+00],
       ...,
       [-1.32952083e+00, -1.14848599e+00, -3.37095821e-01,
        -1.09933782e-01, -2.76011198e-03,  7.04063946e-01],
       [-7.52953521e-01, -7.01094651e-01, -2.61571762e-01,
        -3.00037860e-01, -2.84530879e-03,  7.04063946e-01],
       [-3.97090015e-01, -1.71084059e-02, -1.11647543e+00,
         2.84810408e-01, -2.93269014e-03,  8.16908067e-01]])
In [48]:
print(scaled_features.shape)
print(features.shape)
(223722, 6)
(223722, 6)

Step 3: Run DBSCAN to extract high-density clusters

  • We want the highest density clusters, ideally no more than about 30-50 clusters.
  • Run the DBSCAN and experiment with different values of eps and min_samples

    • I started with eps of 0.25 and min_samples of 50
  • Add the labels to the original data frame and calculate the number of clusters. It should be less than 50 or so.

Hint: If the algorithm is taking a long time to run (more than a few minutes), the eps is probably too big!

In [49]:
# Run DBSCAN 
cores, labels = dbscan(scaled_features, eps=0.25, min_samples=50)

# Add the labels back to the original (unscaled) dataset
features['label'] = labels
In [50]:
# Extract the number of clusters 
num_clusters = features['label'].nunique() - 1
print(num_clusters)
27
In [51]:
features
Out[51]:
pickup_x pickup_y dropoff_x dropoff_y trip_distance pickup_hour label
0 -8223667.0 4979065.0 -8232341.0 4970922.0 7.13 19 0
1 -8237459.0 4971133.5 -8232725.0 4970482.5 2.92 19 1
2 -8236711.5 4972170.5 -8232267.0 4970362.0 3.05 0 2
3 -8222485.5 4978445.5 -8233442.5 4969903.5 8.10 12 -1
4 -8236294.5 4970916.5 -8231820.5 4971722.0 3.50 22 1
... ... ... ... ... ... ... ...
223717 -8234402.0 4975683.0 -8231220.0 4971597.0 4.28 21 -1
223718 -8238660.0 4970696.5 -8233435.0 4969915.0 3.60 17 1
223719 -8238718.5 4968866.5 -8233226.0 4970262.0 3.96 18 1
223720 -8236865.0 4970236.5 -8233068.0 4969843.5 2.40 18 1
223721 -8235721.0 4972331.0 -8234856.5 4971131.0 0.80 19 7

223722 rows × 7 columns

Step 4: Identify the 5 largest clusters

Group by the label, calculate and sort the sizes to find the label numbers of the top 5 largest clusters

In [52]:
N = features.groupby('label').size()
print(N)
label
-1     101292
 0       4481
 1      50673
 2      33277
 3      24360
 4        912
 5       2215
 6       2270
 7       1459
 8        254
 9        519
 10       183
 11       414
 12       224
 13       211
 14       116
 15        70
 16        85
 17       143
 18        76
 19        69
 20        86
 21        49
 22        51
 23        97
 24        52
 25        41
 26        43
dtype: int64
In [53]:
# sort from largest to smallest
N = N.sort_values(ascending=False)

N
Out[53]:
label
-1     101292
 1      50673
 2      33277
 3      24360
 0       4481
 6       2270
 5       2215
 7       1459
 4        912
 9        519
 11       414
 8        254
 12       224
 13       211
 10       183
 17       143
 14       116
 23        97
 20        86
 16        85
 18        76
 15        70
 19        69
 24        52
 22        51
 21        49
 26        43
 25        41
dtype: int64
In [59]:
# extract labels (ignoring label -1 for noise)
top5 = N.iloc[1:6]

top5
Out[59]:
label
1    50673
2    33277
3    24360
0     4481
6     2270
dtype: int64
In [60]:
top5_labels = top5.index.tolist()

top5_labels
Out[60]:
[1, 2, 3, 0, 6]

Step 5: Get mean statistics for the top 5 largest clusters

To better identify trends in the top 5 clusters, calculate the mean trip distance and pickup_hour for each of the clusters.

In [61]:
# get the features for the top 5 labels
selection = features['label'].isin(top5_labels)

# select top 5 and groupby by the label
grps = features.loc[selection].groupby('label')

# calculate average pickup hour and trip distance per cluster
avg_values = grps[['pickup_hour', 'trip_distance']].mean()

avg_values
Out[61]:
pickup_hour trip_distance
label
0 18.599643 7.508730
1 20.127405 4.025859
2 1.699943 3.915581
3 9.536905 1.175154
6 1.494714 2.620546

Step 6a: Visualize the top 5 largest clusters

Now visualize the top 5 largest clusters:

  • plot the dropoffs and pickups (same color) for the 5 largest clusters
  • include the "noise" samples, shown in gray

Hints:

  • For a given cluster, plot the dropoffs and pickups with the same color so we can visualize patterns in the taxi trips
  • A good color scheme for a black background is given below
In [62]:
# a good color scheme for a black background
colors = ['aqua', 'lime', 'red', 'fuchsia', 'yellow']
In [63]:
# EXAMPLE: enumerating a list
for i, label_num in enumerate([0, 1, 2, 3, 4]):
    print(f"i = {i}")
    print(f"label_num = {label_num}")
i = 0
label_num = 0
i = 1
label_num = 1
i = 2
label_num = 2
i = 3
label_num = 3
i = 4
label_num = 4
In [64]:
# Setup figure and axis
f, ax = plt.subplots(figsize=(10, 10), facecolor="black")

# Plot noise in grey
noise = features.loc[features["label"] == -1]
ax.scatter(noise["pickup_x"], noise["pickup_y"], c="grey", s=5, linewidth=0)

# specify colors for each of the top 5 clusters
colors = ["aqua", "lime", "red", "fuchsia", "yellow"]

# loop over top 5 largest clusters
for i, label_num in enumerate(top5_labels):
    print(f"Plotting cluster #{label_num}...")

    # select all the samples with label equals "label_num"
    this_cluster = features.loc[features["label"] == label_num]

    # plot pickups
    ax.scatter(
        this_cluster["pickup_x"],
        this_cluster["pickup_y"],
        linewidth=0,
        color=colors[i],
        s=5,
        alpha=0.3,
    )

    # plot dropoffs
    ax.scatter(
        this_cluster["dropoff_x"],
        this_cluster["dropoff_y"],
        linewidth=0,
        color=colors[i],
        s=5,
        alpha=0.3,
    )

# Display the figure
ax.set_axis_off()
Plotting cluster #1...
Plotting cluster #2...
Plotting cluster #3...
Plotting cluster #0...
Plotting cluster #6...

If you're feeling ambitious, and time-permitting...

Step 6b: Visualizing one cluster at a time

Another good way to visualize the results is to explore the other clusters one at a time, plotting both the pickups and dropoffs to identify the trends.

Use different colors for pickups/dropoffs to easily identify them.

Make it a function so we can repeat it easily:

In [ ]:
def plot_taxi_cluster(label_num):
    """
    Plot the pickups and dropoffs for the input cluster label
    """
    # Setup figure and axis
    f, ax = plt.subplots(figsize=(10, 10), facecolor='black')

    # Plot noise in grey
    noise = features.loc[features['label']==-1]
    ax.scatter(noise['pickup_x'], noise['pickup_y'], c='grey', s=5, linewidth=0)

    # get this cluster
    this_cluster = features.loc[features['label']==label_num]

    # Plot pickups in fuchsia
    ax.scatter(this_cluster['pickup_x'], this_cluster['pickup_y'], 
               linewidth=0, color='fuchsia', s=5, alpha=0.3)

    # Plot dropoffs in aqua
    ax.scatter(this_cluster['dropoff_x'], this_cluster['dropoff_y'], 
               linewidth=0, color='aqua', s=5, alpha=0.3)

    # Display the figure
    ax.set_axis_off()
    plt.show()
In [ ]:
# plot specific label
plot_taxi_cluster(label_num=6)

Step 7: an interactive map of clusters with hvplot + datashader

  • We'll plot the pickup/dropoff locations for the top 5 clusters
  • Use the .hvplot.scatter() function to plot the x/y points
  • Specify the c= keyword as the column holding the cluster label
  • Use the below color_key as the cmap= keyword
  • Specify the aggregator as ds.count_cat("label") — this will color the data points using our categorical color map
  • Use the datashade=True keyword to tell hvplot to use datashader
  • Add a background tile using Geoviews
  • Combine the pickups, dropoffs, and background tile into a single interactive map
In [65]:
# map colors to the top 5 cluster labels
color_key = dict(zip(top5_labels, ['aqua', 'lime', 'red', 'fuchsia', 'yellow']))
print(color_key)
{1: 'aqua', 2: 'lime', 3: 'red', 0: 'fuchsia', 6: 'yellow'}
In [66]:
# extract the features for the top 5 clusters
top5_features = features.loc[features['label'].isin(top5_labels)]
In [67]:
import hvplot.pandas
import datashader as ds
import geoviews as gv
In [68]:
# Pickups using a categorical color map
pickups = top5_features.hvplot.scatter(
    x="pickup_x",
    y="pickup_y",
    c="label",
    aggregator=ds.count_cat("label"),
    datashade=True,
    cmap=color_key,
    width=800,
    height=600,
)

# Dropoffs using a categorical color map
dropoffs = top5_features.hvplot.scatter(
    x="dropoff_x",
    y="dropoff_y",
    c="label",
    aggregator=ds.count_cat("label"),
    datashade=True,
    cmap=color_key,
    width=800,
    height=600,
)


pickups * dropoffs * gv.tile_sources.CartoDark
Out[68]:
In [ ]: