Oct 6, 2021
Last time:
Today: working with census data in Python, getting data from the Twitter API with tweepy, and analyzing tweets (word frequencies and sentiment analysis)
import geopandas as gpd
import pandas as pd
import numpy as np
from shapely.geometry import Point
from matplotlib import pyplot as plt
import seaborn as sns
import hvplot.pandas
import holoviews as hv
import esri2gpd
import carto2gpd
import cenpy
pd.options.display.max_columns = 999
First, let's initialize a connection to the 2019 5-year ACS dataset
acs = cenpy.remote.APIConnection("ACSDT5Y2019")
# Set the map service for pulling geometries
acs.set_mapservice("tigerWMS_ACS2019")
Let's use demographic census data in Philadelphia by census tract and compare to a dataset of childhood lead poisoning.
Use the * operator to get all tracts in Philadelphia County:
philly_demo_tract = acs.query(
cols=["NAME", "B03002_001E", "B03002_004E"],
geo_unit="tract:*",
geo_filter={
"state" : "42",
"county" : "101"
},
)
philly_demo_tract.head()
philly_demo_tract.dtypes # "object" means string!
# Census tracts are the 9th layer (index 8 starting from 0)
acs.mapservice.layers[8]
# The base url for the map service API endpoint
acs.mapservice._baseurl
## We're just querying a GeoService — let's use esri2gpd
# Only Philadelphia
where_clause = "STATE = 42 AND COUNTY = 101"
# Create the API url with the layer ID at the end
url = f"{acs.mapservice._baseurl}/8"
# Query
philly_census_tracts = esri2gpd.get(url, where=where_clause)
# Query for census tracts using cenpy API
# philly_census_tracts = acs.mapservice.layers[8].query(where=where_clause)
philly_census_tracts.head(n=1)
philly_census_tracts.dtypes
philly_demo_tract.head(n=1)
# Merge them together
# IMPORTANT: Make sure your merge keys are the same dtypes (e.g., all strings or all ints)
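# A hedged sketch (not in the original notes): if the key dtypes differ,
# cast the merge keys to strings on both sides first
# (column names assumed from the .head() outputs above)
for col in ["STATE", "COUNTY", "TRACT"]:
    philly_census_tracts[col] = philly_census_tracts[col].astype(str)
for col in ["state", "county", "tract"]:
    philly_demo_tract[col] = philly_demo_tract[col].astype(str)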
philly_demo_tract = philly_census_tracts.merge(
philly_demo_tract,
left_on=["STATE", "COUNTY", "TRACT"],
right_on=["state", "county", "tract"],
)
Add a new column to your data called percent_black.
Important: Make sure you convert the data to floats!
for col in ['B03002_001E', 'B03002_004E']:
philly_demo_tract[col] = philly_demo_tract[col].astype(float)
philly_demo_tract["percent_black"] = (
100 * philly_demo_tract["B03002_004E"] / philly_demo_tract["B03002_001E"]
)
Query the lead level data using the carto2gpd package.
# Documentation includes an example for help!
# carto2gpd.get?
table_name = 'child_blood_lead_levels_by_ct'
lead_levels = carto2gpd.get("https://phl.carto.com/api/v2/sql", table_name)
lead_levels.head()
See the .dropna() function and the subset= keyword.
lead_levels = lead_levels.dropna(subset=['perc_5plus'])
Merge the lead levels and the demographic data, matching the census_tract and GEOID fields, using GeoDataFrame.merge(...).
# Trim the lead levels data
lead_levels_trimmed = lead_levels[['census_tract', 'perc_5plus']]
# Merge into the demographic data
# Use "GEOID" — that is the unique identifier here
merged = philly_demo_tract.merge(lead_levels_trimmed,
how='left',
left_on='GEOID',
right_on='census_tract')
merged.head()
We only need the 'NAME', 'geometry', 'percent_black', and 'perc_5plus' columns
merged = merged[['NAME_x', 'geometry', 'percent_black', 'perc_5plus']]
Make two plots: a choropleth of the lead levels (perc_5plus) and a choropleth of percent_black.
You can make these using hvplot or geopandas/matplotlib, whichever you prefer!
# Lead levels plot
img1 = merged.hvplot(geo=True,
crs=3857,
c='perc_5plus',
width=500,
height=400,
cmap='viridis',
title='Lead Levels')
# Percent black
img2 = merged.hvplot(geo=True,
crs=3857,
c='percent_black',
width=500,
height=400,
cmap='viridis',
title='% Black')
img1 + img2
cols = ['perc_5plus', 'percent_black']
merged[cols].hvplot.scatter(x=cols[0], y=cols[1])
In the previous plots, it's still hard to see the relationship. Use the kdeplot()
function in seaborn
to better visualize the relationship.
You will need to remove any NaN entries first.
You should see two peaks in the distribution clearly now!
fig, ax = plt.subplots(figsize=(8,6))
X = merged.dropna()
sns.kdeplot(x=X['perc_5plus'], y=X['percent_black'], ax=ax);
Twitter provides a rich source of information, but the challenge is how to extract the information from semi-structured data.
Semi-structured data: data that contains some elements that cannot be easily consumed by computers.
Examples: human-readable text, audio, images, etc.
You will need to apply for a Developer Account, answering a few questions, and then confirm your email address.
Once you submit your application, you'll need to wait for approval. Usually this happens immediately, but there can sometimes be a short delay.
Your description of how you plan to use the API needs to be at least 100 characters.
In the "Keys and Tokens" section, generate new access tokens.
You will need the Consumer API keys and access tokens to use the Twitter API.
## Using tweepy to search recent tweets
The standard, free API lets you search tweets from the last 7 days.
For more information, see the Twitter Developer docs
import tweepy as tw
# INPUT YOUR API AND ACCESS KEYS HERE
api_key = ""
api_key_secret = ""
access_token = ""
access_token_secret = ""
auth = tw.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)
Be careful: With the free API, you are allowed 15 API calls every 15 minutes.
See the Rate Limits documentation for more details.
## What does wait_on_rate_limit do?
If you run into a rate limit while pulling tweets, this will tell tweepy to wait 15 minutes until it can continue.
Unfortunately, you need to sign up (and pay) for the premium API to avoid these rate limits.
data = api.rate_limit_status()
data
data['resources']['search']
import datetime
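# The rate-limit 'reset' values above are Unix timestamps; convert one to a readable date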
datetime.datetime.fromtimestamp(1633566967)
Including user tweets, mentions, searching keywords, favoriting, direct messages, and more...
See the Tweepy API documentation
You can post tweets using the update_status() function. For example:
tweet = 'Hello World! @PennMusa'
api.update_status(tweet)
We'll use a tweepy.Cursor object to query the API.
# collect tweets related to the eagles
search_words = "#eagles"
# initialize the cursor
cursor = tw.Cursor(api.search,
q=search_words,
lang="en",
tweet_mode='extended')
cursor
Use the Cursor.items() function:
# select 5 tweets
tweets = cursor.items(5)
tweets
As the name suggests, iterators need to be iterated over to return objects. In our case, we can use a for loop to iterate through the tweets that we pulled from the API.
# Iterate on tweets
for tweet in tweets:
print(tweet.full_text)
Unfortunately, there is no way to search for tweets between specific dates.
The API pulls tweets from the most recent page of the search result, and then grabs tweets from the previous page, and so on, to return the requested number of tweets.
The API documentation has examples of different query string use cases
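For example, a few of the standard search operators, shown here only as illustrations (the hashtags and phrase below are hypothetical; check the search docs for the full set):
# Hypothetical example queries using standard search operators
query_or = "#eagles OR #flyeaglesfly"          # match either hashtag
query_phrase = '"go birds" -filter:retweets'   # exact phrase, no retweets
query_links = "#eagles filter:links"           # only tweets containing links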
new_search = search_words + " -filter:retweets"
Get new tweets using our new search query:
cursor = tw.Cursor(api.search,
q=new_search,
lang="en",
tweet_mode='extended')
tweets = cursor.items(5)
for tweet in tweets:
print(tweet.full_text)
Create a list of the tweets using Python's list comprehension syntax:
# select the next 10 tweets
tweets = [t for t in cursor.items(10)]
print(len(tweets))
Beyond the text of the tweet, things like favorite and retweet counts are available. You can also extract info on the user behind the tweet.
first_tweet = tweets[0]
first_tweet.full_text
first_tweet.favorite_count
first_tweet.created_at
first_tweet.retweet_count
first_tweet.user.description
first_tweet.user.followers_count
A fraction of tweets have locations associated with user profiles, giving (very rough) location data.
users_locs = [[tweet.user.screen_name, tweet.user.location] for tweet in tweets]
users_locs
Note: only about 1% of tweets have a latitude/longitude.
Difficult to extract geographic trends without pulling a large number of tweets, requiring a premium API.
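As a quick check, here is a sketch that counts how many of our pulled tweets are actually geotagged, using the coordinates and place attributes carried by v1.1 tweet objects:
# Count how many tweets carry an exact point or a tagged place (usually very few)
n_coords = sum(getattr(t, "coordinates", None) is not None for t in tweets)
n_place = sum(getattr(t, "place", None) is not None for t in tweets)
print(f"{n_coords} of {len(tweets)} tweets have coordinates; {n_place} have a place tag")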
An example of text mining
Save the text of 1,000 tweets after querying our cursor object.
cursor = tw.Cursor(api.search,
q="#eagles -filter:retweets",
lang="en",
tweet_mode='extended')
tweets = [tweet for tweet in cursor.items(1000)]
# get the text of the tweets
tweets_text = [tweet.full_text for tweet in tweets]
# the first five tweets
tweets_text[:5]
This will identify "t.co" in URLs, e.g. https://t.co/Sp1Qtf5Fnl
Don't worry about mastering regular expression syntax...
StackOverflow is your friend
def remove_url(txt):
"""
Replace URLs found in a text string with nothing
(i.e. it will remove the URL from the string).
Parameters
----------
txt : string
A text string that you want to parse and remove urls.
Returns
-------
The same txt string with URLs removed.
"""
import re
return " ".join(re.sub("https://t.co/[A-Za-z\\d]+|&", "", txt).split())
tweets_no_urls = [remove_url(tweet) for tweet in tweets_text]
tweets_no_urls[:5]
.lower() makes all words lower case
.split() splits a string into the individual words
"This is an Example".lower()
"This is an Example".lower().split()
Apply these functions to all tweets:
words_in_tweet = [tweet.lower().split() for tweet in tweets_no_urls]
words_in_tweet[:2]
We'll define a helper function to calculate word frequencies from our lists of words.
def count_word_frequencies(words_in_tweet, top=15):
"""
Given a list of all words for every tweet, count
word frequencies across all tweets.
By default, this returns the top 15 words, but you
can specify a different value for `top`.
"""
import itertools, collections
# List of all words across tweets
all_words = list(itertools.chain(*words_in_tweet))
# Create counter
counter = collections.Counter(all_words)
return pd.DataFrame(counter.most_common(top),
columns=['words', 'count'])
counts_no_urls = count_word_frequencies(words_in_tweet, top=15)
counts_no_urls.head(n=15)
Use seaborn to plot our DataFrame of word counts...
fig, ax = plt.subplots(figsize=(8, 8))
# Plot horizontal bar graph
sns.barplot(
y="words",
x="count",
data=counts_no_urls.sort_values(by="count", ascending=False),
ax=ax,
color="#cc3000",
saturation=1.0,
)
ax.set_title("Common Words Found in Tweets (Including All Words)", fontsize=16)
Stop words: common words that do not carry much significance and are often ignored in text analysis.
We can use the nltk package, the "Natural Language Toolkit" (https://www.nltk.org/).
import nltk
nltk.download('stopwords');
stop_words = list(set(nltk.corpus.stopwords.words('english')))
stop_words[:10]
len(stop_words)
import string
punctuation = list(string.punctuation)
punctuation[:5]
ignored = stop_words + punctuation
ignored[:10]
# Remove stop words from each tweet list of words
tweets_nsw = [[word for word in tweet_words if word not in ignored]
for tweet_words in words_in_tweet]
tweets_nsw[0]
counts_nsw = count_word_frequencies(tweets_nsw)
counts_nsw.head()
And plot...
fig, ax = plt.subplots(figsize=(8, 8))
sns.barplot(
y="words",
x="count",
data=counts_nsw.sort_values(by="count", ascending=False),
ax=ax,
color="#cc3000",
saturation=1.0,
)
ax.set_title("Common Words Found in Tweets (Without Stop Words)", fontsize=16);
Finally, remove the search terms themselves, so we're left with only the meaningful words...
search_terms = ['#eagles', "eagles", "@eagles"]
tweets_final = [[w for w in word if w not in search_terms]
for word in tweets_nsw]
# frequency counts
counts_final = count_word_frequencies(tweets_final)
fig, ax = plt.subplots(figsize=(8, 8))
sns.barplot(
y="words",
x="count",
data=counts_final.sort_values(by="count", ascending=False),
ax=ax,
color="#cc3000",
saturation=1.0,
)
ax.set_title("Common Words Found in Tweets (Cleaned)", fontsize=16)
Get 1,000 tweets using a query string of your choice and plot the word frequencies.
Be sure to remove URLs, stop words, punctuation, and your own search terms from the text first.
Note: if you try to pull more than 1,000 tweets you will likely run into the rate limit and have to wait 15 minutes.
Remember: the API documentation describes how to customize a query string.
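A minimal sketch of the workflow, assuming a hypothetical #sixers query (swap in whatever search string you like); it reuses remove_url(), ignored, and count_word_frequencies() from above:
# Hypothetical query string; adjust to your own search
cursor = tw.Cursor(api.search, q="#sixers -filter:retweets", lang="en", tweet_mode="extended")
tweets_ex = [tweet for tweet in cursor.items(1000)]
# Clean: strip URLs, lower-case, split, drop stop words/punctuation/search terms
ignored_ex = ignored + ["#sixers", "sixers", "@sixers"]
words_ex = [
    [w for w in remove_url(t.full_text).lower().split() if w not in ignored_ex]
    for t in tweets_ex
]
# Count and plot the most common words
counts_ex = count_word_frequencies(words_ex, top=15)
fig, ax = plt.subplots(figsize=(8, 8))
sns.barplot(
    y="words", x="count",
    data=counts_ex.sort_values("count", ascending=False),
    ax=ax, color="#cc3000",
)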
The goal of a sentiment analysis is to determine the attitude or emotional state of the person who sent a particular tweet.
Often used by brands to evaluate public opinion about a product.
Determine the "sentiment" of every word in the English language
Train a machine learning algorithm to classify words as positive vs. negative, given an input training sample of words.
Luckily, this is a very common task in NLP and there are several packages available that have done the hard work for you.
They provide out-of-the-box sentiment analysis using pre-trained machine learning algorithms.
## Using textblob
import textblob
Simply pass the tweet text to the TextBlob() object.
Note: it's best to remove any URLs first!
blobs = [textblob.TextBlob(remove_url(t.full_text)) for t in tweets]
blobs[0]
blobs[0].sentiment
Track the polarity, subjectivity, and date of each tweet.
data = {}
data['date'] = [tweet.created_at for tweet in tweets]
data['polarity'] = [blob.sentiment.polarity for blob in blobs]
data['subjectivity'] = [blob.sentiment.subjectivity for blob in blobs]
data['text'] = [remove_url(tweet.full_text) for tweet in tweets]
data = pd.DataFrame(data)
data.head()
We can remove tweets with a polarity of zero to get a better sense of emotions.
zero = (data['polarity']==0).sum()
print("number of unbiased tweets = ", zero)
# remove unbiased tweets
biased = data.loc[ data['polarity'] != 0 ].copy()
We can find the tweets with the most negative and most positive polarity scores.
Use the idxmin() function:
biased.loc[biased['polarity'].idxmin(), 'text']
Use the idxmax() function:
biased.loc[biased['polarity'].idxmax(), 'text']
Important: Polarity runs from -1 (most negative) to +1 (most positive)
We can use matplotlib's hist() function:
# create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# histogram
ax.hist(biased['polarity'], bins='auto')
ax.axvline(x=0, c='k', lw=2)
# format
ax.set_xlabel("Polarity")
ax.set_title("Polarity of #eagles Tweets", fontsize=16);
biased['polarity'].median()
biased['polarity'].mean()
biased.loc[biased['subjectivity'].idxmin(), 'text']
biased.loc[biased['subjectivity'].idxmax(), 'text']
Important: Subjectivity runs from 0 (most objective) to +1 (most subjective)
# create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# histogram
ax.hist(biased['subjectivity'], bins='auto')
ax.axvline(x=0.5, c='k', lw=2)
# format
ax.set_xlabel("Subjectivity")
ax.set_title("Subjectivity of #eagles Tweets", fontsize=16);
Are positive/negative tweets more or less objective?
## Use the regplot() function
Is there a linear trend?
ax = sns.regplot(x=biased['subjectivity'], y=biased['polarity'])
## Use kdeplot()
Shade the bivariate relationship:
ax = sns.kdeplot(x=biased['subjectivity'], y=biased['polarity'], fill=True)
Insight: the most subjective tweets tend to be most polarized as well...
We can plot the distribution of polarity by the tweet's hour
First, we'll add a new column that gives the day and hour of the tweet.
We can use the built-in strftime() function.
# this is month/day hour AM/PM
biased['date_string'] = biased['date'].dt.strftime("%-m/%d %I %p")
biased.head()
Sort the tweets in chronological order...
biased = biased.sort_values(by='date', ascending=True)
Use Seaborn's boxplot() function:
fig, ax = plt.subplots(figsize=(8, 14))
sns.boxplot(y='date_string', x='polarity', data=biased, ax=ax)
ax.axvline(x=0, c='k', lw=2) # neutral
# Set yticks to every other hour
yticks = ax.get_yticks()
ax.set_yticks(range(0, len(yticks), 2))
plt.setp(ax.get_yticklabels(), fontsize=10);
fig, ax = plt.subplots(figsize=(8,14))
sns.boxplot(y='date_string', x='subjectivity', data=biased)
ax.axvline(x=0.5, c='k', lw=2) # neutral
# Set yticks to every other hour
yticks = ax.get_yticks()
ax.set_yticks(range(0, len(yticks), 2))
plt.setp(ax.get_yticklabels(), fontsize=10);
Analyze your set of tweets from the last exercise (or get a new set), and explore the sentiments, e.g., by calculating polarity and subjectivity, plotting their distributions, and examining how they change over time.
Or explore trends in some new way!
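For example, here's a small sketch (reusing the biased DataFrame and the date_string column from above) that summarizes average sentiment by hour:
# Average polarity and subjectivity per hour; sort=False keeps the chronological order
hourly = biased.groupby("date_string", sort=False)[["polarity", "subjectivity"]].mean()
hourly.head()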