Week 5B
Getting Data Part 1: Working with APIs

Oct 6, 2021

Week #5 Agenda

Last time:

  • Introduction to APIs
  • Pulling census data and shape files using Python

Today:

  • API Example: Lead poisoning in Philadelphia
  • Using the Twitter API
    • Plotting geocoded tweets
    • Word frequencies
    • Sentiment analysis
In [43]:
import geopandas as gpd
import pandas as pd
import numpy as np
from shapely.geometry import Point

from matplotlib import pyplot as plt
import seaborn as sns

import hvplot.pandas
import holoviews as hv
In [44]:
import esri2gpd
import carto2gpd
import cenpy
In [45]:
pd.options.display.max_columns = 999

Last time: Querying the census with cenpy

First, let's initialize a connection to the 2019 5-year ACS dataset

In [46]:
acs = cenpy.remote.APIConnection("ACSDT5Y2019")
In [47]:
# Set the map service for pulling geometries
acs.set_mapservice("tigerWMS_ACS2019")
Out[47]:
Connection to American Community Survey: 5-Year Estimates: Detailed Tables 5-Year(ID: https://api.census.gov/data/id/ACSDT5Y2019)
With MapServer: Census ACS 2019 WMS

Exercise: lead poisoning in Philadelphia

Let's pull demographic census data for Philadelphia by census tract and compare it to a dataset of childhood lead poisoning.

Step 1. Download the demographic data for Philadelphia

  • We are going to be examining the percent of the population that identifies as black in each census tract, so we will need:
    • Total population: 'B03002_001E'
    • Non-Hispanic, Black population: 'B03002_004E'
  • You'll want to use the state --> county --> tract hierarchy, using the * operator to get all tracts in Philadelphia County
  • Remember PA has a FIPS code of "42" and Philadelphia County is "101"
In [48]:
philly_demo_tract = acs.query(
    cols=["NAME", "B03002_001E", "B03002_004E"],
    geo_unit="tract:*",
    geo_filter={
                "state" : "42", 
                "county" : "101"
               },
)
In [49]:
philly_demo_tract.head()
Out[49]:
NAME B03002_001E B03002_004E state county tract
0 Census Tract 260, Philadelphia County, Pennsyl... 3055 2717 42 101 026000
1 Census Tract 263.02, Philadelphia County, Penn... 4927 4704 42 101 026302
2 Census Tract 264, Philadelphia County, Pennsyl... 5660 5576 42 101 026400
3 Census Tract 101, Philadelphia County, Pennsyl... 6925 6671 42 101 010100
4 Census Tract 107, Philadelphia County, Pennsyl... 3612 3301 42 101 010700
In [50]:
philly_demo_tract.dtypes # "object" means string!
Out[50]:
NAME           object
B03002_001E    object
B03002_004E    object
state          object
county         object
tract          object
dtype: object

Step 2. Download and merge in the census tract geometries

In [51]:
# Census tracts are the 9th layer (index 8 starting from 0)
acs.mapservice.layers[8]
Out[51]:
(ESRILayer) Census Tracts

Option 1: Use esri2gpd to load the data

In [52]:
# The base url for the map service API endpoint
acs.mapservice._baseurl
Out[52]:
'http://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/tigerWMS_ACS2019/MapServer'
In [53]:
## We're just querying a GeoService — let's use esri2gpd

# Only Philadelphia
where_clause = "STATE = 42 AND COUNTY = 101"

# Create the API url with the layer ID added at the end
url = f"{acs.mapservice._baseurl}/8" 

# Query
philly_census_tracts = esri2gpd.get(url, where=where_clause)

Option 2: Query the map service in cenpy

In [54]:
# Query for census tracts using cenpy API
# philly_census_tracts = acs.mapservice.layers[8].query(where=where_clause)
In [55]:
philly_census_tracts.head(n=1)
Out[55]:
geometry MTFCC OID GEOID STATE COUNTY TRACT BASENAME NAME LSADC FUNCSTAT AREALAND AREAWATER CENTLAT CENTLON INTPTLAT INTPTLON OBJECTID
0 POLYGON ((-75.05110 40.02641, -75.05065 40.025... G5020 20759510236568 42101032500 42 101 032500 325 Census Tract 325 CT S 839319 0 +40.0263918 -075.0452561 +40.0263918 -075.0452561 1554
In [56]:
philly_census_tracts.dtypes
Out[56]:
geometry     geometry
MTFCC          object
OID            object
GEOID          object
STATE          object
COUNTY         object
TRACT          object
BASENAME       object
NAME           object
LSADC          object
FUNCSTAT       object
AREALAND        int64
AREAWATER       int64
CENTLAT        object
CENTLON        object
INTPTLAT       object
INTPTLON       object
OBJECTID        int64
dtype: object
In [57]:
philly_demo_tract.head(n=1)
Out[57]:
NAME B03002_001E B03002_004E state county tract
0 Census Tract 260, Philadelphia County, Pennsyl... 3055 2717 42 101 026000
In [58]:
# Merge them together
# IMPORTANT: Make sure your merge keys are the same dtypes (e.g., all strings or all ints)
philly_demo_tract = philly_census_tracts.merge(
    philly_demo_tract,
    left_on=["STATE", "COUNTY", "TRACT"],
    right_on=["state", "county", "tract"],
)

Step 3. Calculate the black percentage

Add a new column to your data called percent_black.

Important: Make sure you convert the data to floats!

In [59]:
for col in ['B03002_001E', 'B03002_004E']:
    philly_demo_tract[col] = philly_demo_tract[col].astype(float)
In [60]:
philly_demo_tract["percent_black"] = (
    100 * philly_demo_tract["B03002_004E"] / philly_demo_tract["B03002_001E"]
)

Step 4. Query CARTO to get the childhood lead levels by census tract

In [61]:
# Documentation includes an example for help!
# carto2gpd.get?
In [62]:
table_name = 'child_blood_lead_levels_by_ct'
lead_levels = carto2gpd.get("https://phl.carto.com/api/v2/sql", table_name)
In [63]:
lead_levels.head()
Out[63]:
geometry cartodb_id census_tract data_redacted num_bll_5plus num_screen perc_5plus
0 POLYGON ((-75.14147 39.95171, -75.14150 39.951... 1 42101000100 False 0.0 100.0 0.0
1 POLYGON ((-75.16238 39.95766, -75.16236 39.957... 2 42101000200 True NaN 109.0 NaN
2 POLYGON ((-75.17821 39.95981, -75.17743 39.959... 3 42101000300 True NaN 110.0 NaN
3 POLYGON ((-75.17299 39.95464, -75.17301 39.954... 4 42101000401 True NaN 61.0 NaN
4 POLYGON ((-75.16333 39.95334, -75.16340 39.953... 5 42101000402 False 0.0 41.0 0.0

Step 5. Remove census tracts with missing lead measurements

See the .dropna() function and the subset= keyword.

In [64]:
lead_levels = lead_levels.dropna(subset=['perc_5plus'])

Step 6. Merge the demographic and lead level data frames

  • From the lead data, we only need the 'census_tract' and 'perc_5plus'. Before merging, trim your data to only these columns.
  • You can perform the merge by comparing the census_tract and GEOID fields
  • Remember: when merging, the left data frame should be the GeoDataFrame — use GeoDataFrame.merge(...)
In [65]:
# Trim the lead levels data
lead_levels_trimmed = lead_levels[['census_tract', 'perc_5plus']]

# Merge into the demographic data
# Use "GEOID" — that is the unique identifier here
merged = philly_demo_tract.merge(lead_levels_trimmed, 
                                 how='left', 
                                 left_on='GEOID', 
                                 right_on='census_tract')
In [66]:
merged.head()
Out[66]:
geometry MTFCC OID GEOID STATE COUNTY TRACT BASENAME NAME_x LSADC FUNCSTAT AREALAND AREAWATER CENTLAT CENTLON INTPTLAT INTPTLON OBJECTID NAME_y B03002_001E B03002_004E state county tract percent_black census_tract perc_5plus
0 POLYGON ((-75.05110 40.02641, -75.05065 40.025... G5020 20759510236568 42101032500 42 101 032500 325 Census Tract 325 CT S 839319 0 +40.0263918 -075.0452561 +40.0263918 -075.0452561 1554 Census Tract 325, Philadelphia County, Pennsyl... 6143.0 1200.0 42 101 032500 19.534429 42101032500 2.9
1 POLYGON ((-75.16012 39.94340, -75.16020 39.943... G5020 207593717001443 42101001102 42 101 001102 11.02 Census Tract 11.02 CT S 204063 0 +39.9442654 -075.1566960 +39.9442654 -075.1566960 1910 Census Tract 11.02, Philadelphia County, Penns... 2687.0 257.0 42 101 001102 9.564570 NaN NaN
2 POLYGON ((-75.17776 39.94897, -75.17784 39.948... G5020 207593717001453 42101000803 42 101 000803 8.03 Census Tract 8.03 CT S 152822 0 +39.9493748 -075.1742489 +39.9493748 -075.1742489 1911 Census Tract 8.03, Philadelphia County, Pennsy... 2774.0 81.0 42 101 000803 2.919971 42101000803 0.0
3 POLYGON ((-75.26108 39.87660, -75.26125 39.876... G5020 207593717001459 42101980900 42 101 980900 9809 Census Tract 9809 CT S 17332840 3354833 +39.9015541 -075.2140251 +39.9051802 -075.2174146 1912 Census Tract 9809, Philadelphia County, Pennsy... 0.0 0.0 42 101 980900 NaN NaN NaN
4 POLYGON ((-75.17226 39.99671, -75.17235 39.996... G5020 207593717001480 42101017201 42 101 017201 172.01 Census Tract 172.01 CT S 262958 0 +39.9993972 -075.1686145 +39.9993972 -075.1686145 1957 Census Tract 172.01, Philadelphia County, Penn... 2835.0 2146.0 42 101 017201 75.696649 42101017201 12.8

Step 7. Trim to the columns we need

We only need the 'NAME_x', 'geometry', 'percent_black', and 'perc_5plus' columns (the merge renamed the census 'NAME' column to 'NAME_x')

In [67]:
merged = merged[['NAME_x', 'geometry', 'percent_black', 'perc_5plus']]

Step 8. Plot the results

Make two plots:

  1. A two panel, side-by-side chart showing a choropleth of the lead levels and the percent black
  2. A scatter plot comparing the lead level percentage to the percent black

You can make these using hvplot or geopandas/matplotlib — whichever you prefer!

In [68]:
# Lead levels plot
img1 = merged.hvplot(geo=True, 
                     crs=3857, 
                     c='perc_5plus', 
                     width=500, 
                     height=400, 
                     cmap='viridis', 
                     title='Lead Levels')

# Percent black 
img2 = merged.hvplot(geo=True,
                     crs=3857,
                     c='percent_black', 
                     width=500, 
                     height=400, 
                     cmap='viridis', 
                     title='% Black')

img1 + img2
Out[68]:
In [69]:
cols = ['perc_5plus', 'percent_black']
merged[cols].hvplot.scatter(x=cols[0], y=cols[1])
Out[69]:

Challenge: Step 9. Use seaborn to plot a 2d density map

In the previous plots, it's still hard to see the relationship. Use the kdeplot() function in seaborn to better visualize the relationship.

You will need to remove any NaN entries first.

You should see two peaks in the distribution clearly now!

In [70]:
fig, ax = plt.subplots(figsize=(8,6))

X = merged.dropna()
sns.kdeplot(x=X['perc_5plus'], y=X['percent_black'], ax=ax);

API Example #2: the Twitter API

Twitter provides a rich source of information, but the challenge is how to extract the information from semi-structured data.

Semi-structured data

Data that contains some elements that cannot be easily consumed by computers

Examples: human-readable text, audio, images, etc.

Key challenges

  • Text mining: analyzing blocks of text to extract the relevant pieces of information
  • Natural language processing (NLP): programming computers to process and analyze human languages
  • Sentiment analysis: analyzing blocks of text to derive the attitude or emotional state of the person

First: Getting an API key

Step 1: Make a Twitter account

Step 2: Apply for Developer access

See: https://developer.twitter.com/apps

You will need to apply for a Developer Account, answer a few questions, and then confirm your email address.

Once you submit your application, you'll need to wait for approval...usually this happens immediately, but there can sometimes be a short delay

Sample answer

Needs to be at least 100 characters

  1. I'm using Twitter's API to perform a sentiment analysis as part of a class teaching Python. I will be interacting with the API using the Python package tweepy.
  2. I plan to analyze tweets to understand topic sentiments.
  3. I will not be interacting with Twitter users as part of this analysis.
  4. I will not be displaying Twitter content off of Twitter.

Step 3: Create your API keys

In the "Keys and Tokens" section, generate new access tokens.

You will need the Consumer API keys and access tokens to use the Twitter API.

We'll be using tweepy to search recent tweets

The standard, free API lets you search tweets from the last 7 days

For more information, see the Twitter Developer docs

Tweepy: a Python interface to Twitter

https://tweepy.readthedocs.io

In [71]:
import tweepy as tw

Define your API keys

In [72]:
# INPUT YOUR API AND ACCESS KEYS HERE
api_key = ""
api_key_secret = ""
access_token = ""
access_token_secret = ""

Initialize an API object

We need to:

  • set up authentication
  • initialize a tweepy.API object
In [73]:
auth = tw.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)


api = tw.API(auth, wait_on_rate_limit=True)

Rate Limits

Be careful: With the free API, many endpoints allow only 15 calls every 15 minutes.

See the Rate Limits documentation for more details.

What does wait_on_rate_limit do?

If you run into a rate limit while pulling tweets, this tells tweepy to wait (up to 15 minutes) until the limit resets and then continue.

Unfortunately, you need to sign up (and pay) for the premium API to avoid these rate limits.

How to find out how many requests you've made?

In [74]:
data = api.rate_limit_status() 
In [75]:
data
Out[75]:
{'rate_limit_context': {'access_token': '706239336-...'},
 'resources': {'lists': {'/lists/list': {'limit': 15,
     'remaining': 15,
     'reset': 1633655934},
    ...},
   'search': {'/search/tweets': {'limit': 180,
     'remaining': 180,
     'reset': 1633655934}},
   ...}}

(output truncated: the full response lists the limit, remaining calls, and reset time for every endpoint)
In [76]:
data['resources']['search']
Out[76]:
{'/search/tweets': {'limit': 180, 'remaining': 180, 'reset': 1633655934}}

Tip: converting a time stamp to a date

In [77]:
import datetime
datetime.datetime.fromtimestamp(1633566967)
Out[77]:
datetime.datetime(2021, 10, 6, 20, 36, 7)

Several different APIs available

Including user tweets, mentions, searching keywords, favoriting, direct messages, and more...

See the Tweepy API documentation
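
For instance, a couple of other endpoints look like this (a minimal sketch using tweepy 3.x method names; "PennMusa" is just an example handle from later in these notes):

In [ ]:
# A user's most recent tweets (the user_timeline endpoint)
timeline = api.user_timeline(screen_name="PennMusa", count=5)
for t in timeline:
    print(t.text)

# Profile information for a single user
user = api.get_user(screen_name="PennMusa")
print(user.followers_count)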

You can also stream tweets

You can set up a listener to listen for new tweets and download them in real time (subject to rate limits).

We won't focus on this, but there is a nice tutorial on the Tweepy documentation.
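
For reference, a bare-bones listener might look like this (a sketch against the tweepy 3.x streaming interface; the track terms are just an example):

In [ ]:
class PrintListener(tw.StreamListener):
    def on_status(self, status):
        # Called once for every new matching tweet
        print(status.text)

stream = tw.Stream(auth=api.auth, listener=PrintListener())
# stream.filter(track=["#eagles"])  # uncomment to start listening (this blocks)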

You can also tweet (if you want!)

You can post tweets using the update_status() function. For example:

tweet = 'Hello World! @PennMusa'
api.update_status(tweet)

We'll focus on the search API

We'll use a tweepy.Cursor object to query the API.

In [79]:
# collect tweets related to the eagles
search_words = "#eagles"
In [80]:
# initialize the cursor
cursor = tw.Cursor(api.search,
                   q=search_words,
                   lang="en",
                   tweet_mode='extended')
cursor
Out[80]:
<tweepy.cursor.Cursor at 0x7fb92bedafa0>

Next, specify how many tweets we want

Use the Cursor.items() function:

In [81]:
# select 5 tweets
tweets = cursor.items(5)
tweets
Out[81]:
<tweepy.cursor.ItemIterator at 0x7fb92bece4c0>

Python iterators

As the name suggests, iterators need to be iterated over to return objects. In our case, we can use a for loop to iterate through the tweets that we pulled from the API.

In [82]:
# Iterate on tweets
for tweet in tweets:
    print(tweet.full_text)
RT @Jeff_McLane: Jason Kelce, who didn’t practice Wed because of “foot/rest,” just went indoors. Hasn’t come out in a bit. Will update. #Ea…
RT @Jeff_McLane: No Lane Johnson again at #Eagles practice. Still dealing with a personal issue. Looking more and more unlikely he’ll be in…
RT @Jeff_McLane: #Eagles beat writers make their predictions for the Panthers game in Week 5: Will the losing streak end?

From @EJSmith94,…
RT @Jeff_McLane: Narrative Fletcher Cox sees only doubles is false. Number based upon film review is exactly half.

More obvious answer for…
RT @Jeff_McLane: #Eagles’ Lane Johnson not back with team; Jordan Mailata making progress. @EJSmith94: https://t.co/zIwrUL8wXO via @phillyi…

The concept of "paging"

Unfortunately, there is no way to search for tweets between specific dates.

The API pulls tweets from the most recent page of the search result, and then grabs tweets from the previous page, and so on, to return the requested number of tweets.
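
For example, Cursor.pages() walks the results page by page instead of item by item (a quick sketch; it consumes API calls like any other request):

In [ ]:
page_cursor = tw.Cursor(api.search, q=search_words, lang="en", tweet_mode="extended")
for page in page_cursor.pages(2):  # the two most recent pages of results
    print(len(page), "tweets on this page")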

Customizing our search query

The API documentation has examples of different query string use cases
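
For instance, a few of the common operators (illustrative query strings; see the docs for the full syntax):

In [ ]:
q1 = '"fly eagles fly"'          # an exact phrase
q2 = "#eagles OR #flyeaglesfly"  # either hashtag
q3 = "#eagles -filter:retweets"  # exclude retweets (used below)
q4 = "from:Eagles"               # tweets from a specific account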

Examples

Let's remove retweets

In [83]:
new_search = search_words + " -filter:retweets"

Get new tweets using our new search query:

In [84]:
cursor = tw.Cursor(api.search,
                   q=new_search,
                   lang="en",
                   tweet_mode='extended')
tweets = cursor.items(5)

Did it work?

In [85]:
for tweet in tweets:
    print(tweet.full_text)
**Ending Soon**
#EAGLES 7" #VINYL #SINGLE #1970s EX/EX https://t.co/vcSOVppSBl via @eBay_UK #records #OneOfTheseNights #ebay
November 20, 1960 - Chuck Bednarik (60) of the Eagles walks off the field after game against the Giants at Yankee Stadium.
#Philadelphia #Eagles #NFL #1960s #FlyEaglesFly https://t.co/wFfUjQS8W3
Every time I watch the #Seahawks I say THANK GOD the #Eagles passed on this turd DK Metcalf for… &lt;CHECKS NOTES&gt;… JJ Arcega-Whiteside! 🧐😞
New Kultural Sport Episode!! NFL 2021 WEEK 4 RECAP/ WEEK 5 PREDICTIONS. Click link below and hit that subscribe button greatly appreciate it! #ColtsNation #colts #Eagles #NFL #sports #

https://t.co/t1MhprSIcg https://t.co/HQugdiCsaQ
NEW: #Eagles Thurs notebook:
-OL speculation. "If you knew what the lineup would be, it would be let’s attack this...right now we’re just trying to attack scheme the best we can”
-DeVonta Smith's training and "stepping stone"
-Christian McCaffery
And more:
https://t.co/rTLSBf7Rm3

How to save the tweets?

Create a list of the tweets using Python's list comprehension syntax

In [86]:
# select the next 10 tweets
tweets = [t for t in cursor.items(10)]

print(len(tweets))
10

A wealth of information is available

Beyond the text of the tweet, things like favorite and retweet counts are available. You can also extract info on the user behind the tweet.

In [87]:
first_tweet = tweets[0]
In [88]:
first_tweet.full_text
Out[88]:
'Ryan Paganetti, Former #Eagles Assistant Coach &amp; Gameday management specialist Breaks Down 2021 Eagles... https://t.co/BnLOGa1sKe via @YouTube'
In [89]:
first_tweet.favorite_count
Out[89]:
1
In [90]:
first_tweet.created_at
Out[90]:
datetime.datetime(2021, 10, 7, 23, 44, 51)
In [91]:
first_tweet.retweet_count
Out[91]:
1
In [92]:
first_tweet.user.description
Out[92]:
'#NFL 🏈, #Eagles insider 🎙️📝for @sinow, @thephillyvoice, @JAKIBMedia, @birds365show, @6ABC postgame, @ExtendthePlay & #Football247 on @1490SportsBet. DM Me'
In [93]:
first_tweet.user.followers_count
Out[93]:
7185

Let's extract screen names and locations

A fraction of tweets have locations associated with user profiles, giving (very rough) location data.

In [94]:
users_locs = [[tweet.user.screen_name, tweet.user.location] for tweet in tweets]
users_locs
Out[94]:
[['JFMcMullen', 'Philadelphia'],
 ['champton0927', ''],
 ['FlamingOldies1', 'Worcester, Massachusetts'],
 ['ThomasFox_4th', 'Vermont, USA'],
 ['designguyca', 'South Coast of Canada'],
 ['ByADiBona', 'Brooklyn, NY'],
 ['nflrums', 'Georgia '],
 ['kracze', 'Bucks County, PA'],
 ['RonBohning', 'Fayetteville, AR'],
 ['DSM_Media', 'Philadelphia, PA']]

Note: only about 1% of tweets have a latitude/longitude.

It's difficult to extract geographic trends without pulling a large number of tweets, which requires the premium API.
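
That said, if you do have geotagged tweets, the Status.coordinates field holds GeoJSON (so [longitude, latitude]), and a sketch like this would collect them into a GeoDataFrame:

In [ ]:
# Keep only the (rare) tweets that carry exact coordinates
geotagged = [t for t in tweets if t.coordinates is not None]

gdf = gpd.GeoDataFrame(
    {"text": [t.full_text for t in geotagged]},
    geometry=[Point(t.coordinates["coordinates"]) for t in geotagged],
    crs="EPSG:4326",  # coordinates come back as lng/lat
)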

Use case #1: calculating word frequencies

An example of text mining

Load the most recent 1,000 tweets

Save the text of 1,000 tweets after querying our cursor object.

In [95]:
cursor = tw.Cursor(api.search,
                   q="#eagles -filter:retweets",
                   lang="en",
                   tweet_mode='extended')
tweets = [tweet for tweet in cursor.items(1000)]
In [96]:
# get the text of the tweets
tweets_text = [tweet.full_text for tweet in tweets]
In [97]:
# the first five tweets
tweets_text[:5]
Out[97]:
['**Ending Soon**\n#EAGLES 7" #VINYL #SINGLE #1970s EX/EX https://t.co/vcSOVppSBl via @eBay_UK #records #OneOfTheseNights #ebay',
 'November 20, 1960 - Chuck Bednarik (60) of the Eagles walks off the field after game against the Giants at Yankee Stadium.\n#Philadelphia #Eagles #NFL #1960s #FlyEaglesFly https://t.co/wFfUjQS8W3',
 'Every time I watch the #Seahawks I say THANK GOD the #Eagles passed on this turd DK Metcalf for… &lt;CHECKS NOTES&gt;… JJ Arcega-Whiteside! 🧐😞',
 'New Kultural Sport Episode!! NFL 2021 WEEK 4 RECAP/ WEEK 5 PREDICTIONS. Click link below and hit that subscribe button greatly appreciate it! #ColtsNation #colts #Eagles #NFL #sports #\n\nhttps://t.co/t1MhprSIcg https://t.co/HQugdiCsaQ',
 'NEW: #Eagles Thurs notebook:\n-OL speculation. "If you knew what the lineup would be, it would be let’s attack this...right now we’re just trying to attack scheme the best we can”\n-DeVonta Smith\'s training and "stepping stone"\n-Christian McCaffery\nAnd more:\nhttps://t.co/rTLSBf7Rm3']

Text mining and dealing with messy data

  1. Remove URLs $\rightarrow$ regular expressions
  2. Remove stop words
  3. Remove the search terms

Step 1: removing URLs

Regular expressions


This will identify "t.co" in URLs, e.g. https://t.co/Sp1Qtf5Fnl

Don't worry about mastering regular expression syntax...

StackOverflow is your friend

In [98]:
def remove_url(txt):
    """
    Replace URLs found in a text string with nothing 
    (i.e. it will remove the URL from the string).

    Parameters
    ----------
    txt : string
        A text string that you want to parse and remove urls.

    Returns
    -------
    The same txt string with url's removed.
    """
    import re
    return " ".join(re.sub("https://t.co/[A-Za-z\\d]+|&amp", "", txt).split())

Remove any URLs

In [99]:
tweets_no_urls = [remove_url(tweet) for tweet in tweets_text]
tweets_no_urls[:5]
Out[99]:
['**Ending Soon** #EAGLES 7" #VINYL #SINGLE #1970s EX/EX via @eBay_UK #records #OneOfTheseNights #ebay',
 'November 20, 1960 - Chuck Bednarik (60) of the Eagles walks off the field after game against the Giants at Yankee Stadium. #Philadelphia #Eagles #NFL #1960s #FlyEaglesFly',
 'Every time I watch the #Seahawks I say THANK GOD the #Eagles passed on this turd DK Metcalf for… &lt;CHECKS NOTES&gt;… JJ Arcega-Whiteside! 🧐😞',
 'New Kultural Sport Episode!! NFL 2021 WEEK 4 RECAP/ WEEK 5 PREDICTIONS. Click link below and hit that subscribe button greatly appreciate it! #ColtsNation #colts #Eagles #NFL #sports #',
 'NEW: #Eagles Thurs notebook: -OL speculation. "If you knew what the lineup would be, it would be let’s attack this...right now we’re just trying to attack scheme the best we can” -DeVonta Smith\'s training and "stepping stone" -Christian McCaffery And more:']

Extract a list of lower-cased words in a tweet

  • .lower() makes all words lower cased
  • .split() splits a string into the individual words
In [100]:
"This is an Example".lower()
Out[100]:
'this is an example'
In [101]:
"This is an Example".lower().split()
Out[101]:
['this', 'is', 'an', 'example']

Apply these functions to all tweets:

In [102]:
words_in_tweet = [tweet.lower().split() for tweet in tweets_no_urls]
words_in_tweet[:2]
Out[102]:
[['**ending',
  'soon**',
  '#eagles',
  '7"',
  '#vinyl',
  '#single',
  '#1970s',
  'ex/ex',
  'via',
  '@ebay_uk',
  '#records',
  '#oneofthesenights',
  '#ebay'],
 ['november',
  '20,',
  '1960',
  '-',
  'chuck',
  'bednarik',
  '(60)',
  'of',
  'the',
  'eagles',
  'walks',
  'off',
  'the',
  'field',
  'after',
  'game',
  'against',
  'the',
  'giants',
  'at',
  'yankee',
  'stadium.',
  '#philadelphia',
  '#eagles',
  '#nfl',
  '#1960s',
  '#flyeaglesfly']]

Counting word frequencies

We'll define a helper function to calculate word frequencies from our lists of words.

In [103]:
def count_word_frequencies(words_in_tweet, top=15):
    """
    Given a list of all words for every tweet, count
    word frequencies across all tweets.
    
    By default, this returns the top 15 words, but you 
    can specify a different value for `top`.
    """
    import itertools, collections

    # List of all words across tweets
    all_words = list(itertools.chain(*words_in_tweet))

    # Create counter
    counter = collections.Counter(all_words)
    
    return pd.DataFrame(counter.most_common(top),
                        columns=['words', 'count'])
In [104]:
counts_no_urls = count_word_frequencies(words_in_tweet, top=15)
counts_no_urls.head(n=15)
Out[104]:
words count
0 #eagles 953
1 the 937
2 to 463
3 and 362
4 a 360
5 in 286
6 on 255
7 of 235
8 for 225
9 is 219
10 with 172
11 #flyeaglesfly 150
12 this 145
13 be 141
14 i 136

Now let's plot the frequencies

Use seaborn to plot our DataFrame of word counts...

In [105]:
fig, ax = plt.subplots(figsize=(8, 8))

# Plot horizontal bar graph
sns.barplot(
    y="words",
    x="count",
    data=counts_no_urls.sort_values(by="count", ascending=False),
    ax=ax,
    color="#cc3000",
    saturation=1.0,
)

ax.set_title("Common Words Found in Tweets (Including All Words)", fontsize=16)
Out[105]:
Text(0.5, 1.0, 'Common Words Found in Tweets (Including All Words)')

Step 2: remove stop words and punctuation

Stop words are common words that do not carry much significance and are often ignored in text analysis.

We can use the nltk package.

The "Natural Language Toolkit" https://www.nltk.org/

Import and download the stop words

In [106]:
import nltk
nltk.download('stopwords');
[nltk_data] Downloading package stopwords to /Users/nhand/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Get the list of common stop words

In [107]:
stop_words = list(set(nltk.corpus.stopwords.words('english')))

stop_words[:10]
Out[107]:
["needn't", 'y', 'hasn', 'no', 'isn', 'why', "don't", 'ain', "won't", 'if']
In [108]:
len(stop_words)
Out[108]:
179

Get the list of common punctuation

In [109]:
import string
In [110]:
punctuation = list(string.punctuation)
In [111]:
punctuation[:5]
Out[111]:
['!', '"', '#', '$', '%']

Remove stop words from our tweets

In [112]:
ignored = stop_words + punctuation
In [113]:
ignored[:10]
Out[113]:
["needn't", 'y', 'hasn', 'no', 'isn', 'why', "don't", 'ain', "won't", 'if']
In [114]:
# Remove stop words from each tweet list of words
tweets_nsw = [[word for word in tweet_words if word not in ignored]
              for tweet_words in words_in_tweet]

tweets_nsw[0]
Out[114]:
['**ending',
 'soon**',
 '#eagles',
 '7"',
 '#vinyl',
 '#single',
 '#1970s',
 'ex/ex',
 'via',
 '@ebay_uk',
 '#records',
 '#oneofthesenights',
 '#ebay']

Get our DataFrame of frequencies

In [115]:
counts_nsw = count_word_frequencies(tweets_nsw)
counts_nsw.head()
Out[115]:
words count
0 #eagles 953
1 #flyeaglesfly 150
2 #nfl 107
3 eagles 96
4 get 79

And plot...

In [116]:
fig, ax = plt.subplots(figsize=(8, 8))

sns.barplot(
    y="words",
    x="count",
    data=counts_nsw.sort_values(by="count", ascending=False),
    ax=ax,
    color="#cc3000",
    saturation=1.0,
)

ax.set_title("Common Words Found in Tweets (Without Stop Words)", fontsize=16);

Step 3: remove our query terms

Now, we'll be left with only the meaningful words...

In [117]:
search_terms = ['#eagles', "eagles", "@eagles"]
tweets_final = [[w for w in word if w not in search_terms]
                 for word in tweets_nsw]
In [118]:
# frequency counts
counts_final = count_word_frequencies(tweets_final)

And now, plot the cleaned tweets...

In [119]:
fig, ax = plt.subplots(figsize=(8, 8))

sns.barplot(
    y="words",
    x="count",
    data=counts_final.sort_values(by="count", ascending=False),
    ax=ax,
    color="#cc3000",
    saturation=1.0,
)

ax.set_title("Common Words Found in Tweets (Cleaned)", fontsize=16)
Out[119]:
Text(0.5, 1.0, 'Common Words Found in Tweets (Cleaned)')

At home exercise

Get 1,000 tweets using a query string of your choice and plot the word frequencies.

Be sure to:

  • remove URLs
  • remove stop words / punctuation
  • remove your search query terms

Note: if you try to pull more than 1,000 tweets you will likely run into the rate limit and have to wait 15 minutes.

Remember: the API documentation describes how to customize a query string.
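
If it helps, a starter along these lines mirrors the steps above ("#phillies" is just a stand-in query):

In [ ]:
my_query = "#phillies -filter:retweets"  # swap in your own search terms
cursor = tw.Cursor(api.search, q=my_query, lang="en", tweet_mode="extended")
my_tweets = [remove_url(t.full_text) for t in cursor.items(1000)]
words = [t.lower().split() for t in my_tweets]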

In [ ]:
 

Use case #2: sentiment analysis

The goal of a sentiment analysis is to determine the attitude or emotional state of the person who sent a particular tweet.

Often used by brands to evaluate public opinion about a product.

The goal:

Determine the "sentiment" of every word in the English language

The hard way

Train a machine learning algorithm to classify words as positive vs. negative, given an input training sample of words.

The easy way

Luckily, this is a very common task in NLP and there are several packages available that have done the hard work for you.

They provide out-of-the-box sentiment analysis using pre-trained machine learning algorithms.

We'll be using textblob

In [120]:
import textblob

Let's analyze our set of 1,000 #eagles tweets

Create our "text blobs"

Simply pass the tweet text to the TextBlob() object.

Note: it's best to remove any URLs first!

In [121]:
blobs = [textblob.TextBlob(remove_url(t.full_text)) for t in tweets]
In [122]:
blobs[0]
Out[122]:
TextBlob("**Ending Soon** #EAGLES 7" #VINYL #SINGLE #1970s EX/EX via @eBay_UK #records #OneOfTheseNights #ebay")
In [123]:
blobs[0].sentiment
Out[123]:
Sentiment(polarity=-0.07142857142857142, subjectivity=0.21428571428571427)

Combine the data into a DataFrame

Track the polarity, subjectivity, and date of each tweet.

In [124]:
data = {}
data['date'] = [tweet.created_at for tweet in tweets]
data['polarity'] = [blob.sentiment.polarity for blob in blobs]
data['subjectivity'] = [blob.sentiment.subjectivity for blob in blobs]
data['text'] = [remove_url(tweet.full_text) for tweet in tweets]
data = pd.DataFrame(data)
In [125]:
data.head()
Out[125]:
date polarity subjectivity text
0 2021-10-08 00:57:10 -0.071429 0.214286 **Ending Soon** #EAGLES 7" #VINYL #SINGLE #197...
1 2021-10-08 00:48:11 -0.400000 0.400000 November 20, 1960 - Chuck Bednarik (60) of the...
2 2021-10-08 00:47:09 0.000000 0.000000 Every time I watch the #Seahawks I say THANK G...
3 2021-10-08 00:40:30 0.606534 0.602273 New Kultural Sport Episode!! NFL 2021 WEEK 4 R...
4 2021-10-08 00:40:27 0.409091 0.313636 NEW: #Eagles Thurs notebook: -OL speculation. ...

How many are unbiased?

We can remove tweets with a polarity of zero to get a better sense of emotions.

In [126]:
zero = (data['polarity']==0).sum()
print("number of unbiased tweets = ", zero)
number of unbiased tweets =  366
In [127]:
# remove unbiased tweets
biased = data.loc[ data['polarity'] != 0 ].copy()

What does a polarized tweet look like?

We can find the tweet with the maximum positive/negative scores

The most negative

Use the idxmin() function:

In [128]:
biased.loc[biased['polarity'].idxmin(), 'text']
Out[128]:
'This was nasty.. @DeVontaSmith_6 #Eagles'

The most positive

Use the idxmax() function

In [129]:
biased.loc[biased['polarity'].idxmax(), 'text']
Out[129]:
'Eagles are underdogs to one of the NFL’s best teams against the spread #NFLBeast #NFL #NFLTwitter #NFLUpdate #NFLNews #NFLBlogs #Philadelphia #Eagles #PhiladelphiaEagles #NFC #BleedingGreenNation By: Brandon Lee Gowton Mark J. Rebilas-USA TODAY ...'

Plot a histogram of polarity

Important: Polarity runs from -1 (most negative) to +1 (most positive)

We can use matplotlib's hist() function:

In [130]:
# create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# histogram
ax.hist(biased['polarity'], bins='auto')
ax.axvline(x=0, c='k', lw=2)

# format
ax.set_xlabel("Polarity")
ax.set_title("Polarity of #eagles Tweets", fontsize=16);
In [131]:
biased['polarity'].median()
Out[131]:
0.1777087279040404
In [132]:
biased['polarity'].mean()
Out[132]:
0.1813234361242552

And subjectivity too...

The most objective

In [133]:
biased.loc[biased['subjectivity'].idxmin(), 'text']
Out[133]:
'It was “dress like your team day” on our “Jean-tober” calendar! Thankful for this leadership team and their passion for our @brookglenn3 students (Missing Marie🤣)! @theotooles @BerniceMJackso1 @MarieHavran #soaringtonewheights #Eagles'

The most subjective

In [134]:
biased.loc[biased['subjectivity'].idxmax(), 'text']
Out[134]:
'Got to cheer on both of my favorite schools at the volleyball match! #eagles #dragons #wearefcs 🏐'

The distribution of subjectivity

Important: Subjectivity runs from 0 (most objective) to +1 (most subjective)

In [135]:
# create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# histogram
ax.hist(biased['subjectivity'], bins='auto')
ax.axvline(x=0.5, c='k', lw=2)

# format
ax.set_xlabel("Subjectivity")
ax.set_title("Subjectivity of #eagles Tweets", fontsize=16);

How does polarity influence subjectivity?

Are positive/negative tweets more or less objective?

Seaborn's regplot() function

Is there a linear trend?

In [136]:
ax = sns.regplot(x=biased['subjectivity'], y=biased['polarity'])

Seaborn's kdeplot()

Shade the bivariate relationship

In [137]:
ax = sns.kdeplot(x=biased['subjectivity'], y=biased['polarity'])

Insight: the most subjective tweets tend to be most polarized as well...

We can plot the distribution of polarity by the tweet's hour

First, we'll add a new column that gives the day and hour of the tweet.

We can use the built-in strftime() function.

In [138]:
# this is month/day hour AM/PM
biased['date_string'] = biased['date'].dt.strftime("%-m/%d %I %p")
In [139]:
biased.head()
Out[139]:
date polarity subjectivity text date_string
0 2021-10-08 00:57:10 -0.071429 0.214286 **Ending Soon** #EAGLES 7" #VINYL #SINGLE #197... 10/08 12 AM
1 2021-10-08 00:48:11 -0.400000 0.400000 November 20, 1960 - Chuck Bednarik (60) of the... 10/08 12 AM
3 2021-10-08 00:40:30 0.606534 0.602273 New Kultural Sport Episode!! NFL 2021 WEEK 4 R... 10/08 12 AM
4 2021-10-08 00:40:27 0.409091 0.313636 NEW: #Eagles Thurs notebook: -OL speculation. ... 10/08 12 AM
5 2021-10-08 00:39:01 0.291667 0.416667 Excited to watch former #Eagles CB, Sidney Jon... 10/08 12 AM

Sort the tweets in chronological order...

In [140]:
biased = biased.sort_values(by='date', ascending=True)

Make a box and whiskers plot of the polarity

Use Seaborn's boxplot() function

In [141]:
fig, ax = plt.subplots(figsize=(8, 14))

sns.boxplot(y='date_string', x='polarity', data=biased, ax=ax)
ax.axvline(x=0, c='k', lw=2) # neutral

# Set yticks to every other hour
yticks = ax.get_yticks()
ax.set_yticks(range(0, len(yticks), 2))
plt.setp(ax.get_yticklabels(), fontsize=10);

And subjectivity over time...

In [142]:
fig, ax = plt.subplots(figsize=(8,14))

sns.boxplot(y='date_string', x='subjectivity', data=biased, ax=ax)
ax.axvline(x=0.5, c='k', lw=2) # neutral

# Set yticks to every other hour
yticks = ax.get_yticks()
ax.set_yticks(range(0, len(yticks), 2))
plt.setp(ax.get_yticklabels(), fontsize=10);

At home exercise: sentiment analysis

Analyze your set of tweets from the last exercise (or get a new set), and explore the sentiments by:

  • plotting histograms of the subjectivity and polarity
  • finding the most/least subjective and polarized tweets
  • plotting the relationship between polarity and subjectivity
  • showing hourly trends in polarity/subjectivity

Or explore trends in some new way!

In [ ]:
 

That's it!

  • Next week: creating your own datasets through web scraping
  • See you on Monday!
In [ ]: