Oct 6, 2021
Last time:
Today: working with census data in Python, getting data from the Twitter API with tweepy, and analyzing tweets (word frequencies and sentiment analysis)
import geopandas as gpd
import pandas as pd
import numpy as np
from shapely.geometry import Point
from matplotlib import pyplot as plt
import seaborn as sns
import hvplot.pandas
import holoviews as hv
import esri2gpd
import carto2gpd
import cenpy
pd.options.display.max_columns = 999
First, let's initialize a connection to the 2019 5-year ACS dataset
acs = cenpy.remote.APIConnection("ACSDT5Y2019")
# Set the map service for pulling geometries
acs.set_mapservice("tigerWMS_ACS2019")
Let's use demographic census data in Philadelphia by census tract and compare to a dataset of childhood lead poisoning.
Use the * operator to get all tracts in Philadelphia County:
philly_demo_tract = acs.query(
cols=["NAME", "B03002_001E", "B03002_004E"],
geo_unit="tract:*",
geo_filter={
"state" : "42",
"county" : "101"
},
)
philly_demo_tract.head()
philly_demo_tract.dtypes # "object" means string!
# Census tracts are the 9th layer (index 8 starting from 0)
acs.mapservice.layers[8]
# The base url for the map service API endpoint
acs.mapservice._baseurl
## We're just querying a GeoService — let's use esri2gpd
# Only Philadelphia
where_clause = "STATE = 42 AND COUNTY = 101"
# Create the API url with the layer ID at the end
url = f"{acs.mapservice._baseurl}/8"
# Query
philly_census_tracts = esri2gpd.get(url, where=where_clause)
# Query for census tracts using cenpy API
# philly_census_tracts = acs.mapservice.layers[8].query(where=where_clause)
philly_census_tracts.head(n=1)
philly_census_tracts.dtypes
philly_demo_tract.head(n=1)
# Merge them together
# IMPORTANT: Make sure your merge keys are the same dtypes (e.g., all strings or all ints)
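# A hedged sketch (not in the original notes): if the key dtypes differ,
# cast the merge keys to strings on both sides first
# (column names assumed from the .head() outputs above)
for col in ["STATE", "COUNTY", "TRACT"]:
    philly_census_tracts[col] = philly_census_tracts[col].astype(str)
for col in ["state", "county", "tract"]:
    philly_demo_tract[col] = philly_demo_tract[col].astype(str)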
philly_demo_tract = philly_census_tracts.merge(
philly_demo_tract,
left_on=["STATE", "COUNTY", "TRACT"],
right_on=["state", "county", "tract"],
)
Add a new column to your data called percent_black.
Important: Make sure you convert the data to floats!
for col in ['B03002_001E', 'B03002_004E']:
philly_demo_tract[col] = philly_demo_tract[col].astype(float)
philly_demo_tract["percent_black"] = (
100 * philly_demo_tract["B03002_004E"] / philly_demo_tract["B03002_001E"]
)
Query the lead level data using the carto2gpd package.
# Documentation includes an example for help!
# carto2gpd.get?
table_name = 'child_blood_lead_levels_by_ct'
lead_levels = carto2gpd.get("https://phl.carto.com/api/v2/sql", table_name)
lead_levels.head()
See the .dropna() function and the subset= keyword.
lead_levels = lead_levels.dropna(subset=['perc_5plus'])
Merge the lead levels and the demographic data, matching the census_tract and GEOID fields, using GeoDataFrame.merge(...).
# Trim the lead levels data
lead_levels_trimmed = lead_levels[['census_tract', 'perc_5plus']]
# Merge into the demographic data
# Use "GEOID" — that is the unique identifier here
merged = philly_demo_tract.merge(lead_levels_trimmed,
how='left',
left_on='GEOID',
right_on='census_tract')
merged.head()
We only need the 'NAME', 'geometry', 'percent_black', and 'perc_5plus' columns
merged = merged[['NAME_x', 'geometry', 'percent_black', 'perc_5plus']]
Make two plots: a choropleth of the lead levels (perc_5plus) and a choropleth of percent_black.
You can make these using hvplot or geopandas/matplotlib, whichever you prefer!
# Lead levels plot
img1 = merged.hvplot(geo=True,
crs=3857,
c='perc_5plus',
width=500,
height=400,
cmap='viridis',
title='Lead Levels')
# Percent black
img2 = merged.hvplot(geo=True,
crs=3857,
c='percent_black',
width=500,
height=400,
cmap='viridis',
title='% Black')
img1 + img2
cols = ['perc_5plus', 'percent_black']
merged[cols].hvplot.scatter(x=cols[0], y=cols[1])
In the previous plots, it's still hard to see the relationship. Use the kdeplot()
function in seaborn
to better visualize the relationship.
You will need to remove any NaN entries first.
You should see two peaks in the distribution clearly now!
fig, ax = plt.subplots(figsize=(8,6))
X = merged.dropna()
sns.kdeplot(x=X['perc_5plus'], y=X['percent_black'], ax=ax);
Twitter provides a rich source of information, but the challenge is how to extract the information from semi-structured data.
Semi-structured data: data that contains some elements that cannot be easily consumed by computers.
Examples: human-readable text, audio, images, etc.
You will need to apply for a Developer Account, answering a few questions, and then confirm your email address.
Once you submit your application, you'll need to wait for approval. Usually this happens immediately, but there can sometimes be a short delay.
Your description of how you plan to use the API needs to be at least 100 characters.
In the "Keys and Tokens" section, generate new access tokens.
You will need the Consumer API keys and access tokens to use the Twitter API.
## Using tweepy to search recent tweets
The standard, free API lets you search tweets from the last 7 days.
For more information, see the Twitter Developer docs
import tweepy as tw
# INPUT YOUR API AND ACCESS KEYS HERE
api_key = ""
api_key_secret = ""
access_token = ""
access_token_secret = ""
auth = tw.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)
Be careful: With the free API, you are allowed 15 API calls every 15 minutes.
See the Rate Limits documentation for more details.
## What does wait_on_rate_limit do?
If you run into a rate limit while pulling tweets, this will tell tweepy to wait 15 minutes until it can continue.
Unfortunately, you need to sign up (and pay) for the premium API to avoid these rate limits.
data = api.rate_limit_status()
data
data['resources']['search']
import datetime
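# The rate-limit 'reset' values above are Unix timestamps; convert one to a readable date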
datetime.datetime.fromtimestamp(1633566967)
Including user tweets, mentions, searching keywords, favoriting, direct messages, and more...
See the Tweepy API documentation
You can post tweets using the update_status() function. For example:
tweet = 'Hello World! @PennMusa'
api.update_status(tweet)
We'll use a tweepy.Cursor object to query the API.
# collect tweets related to the eagles
search_words = "#eagles"
# initialize the cursor
cursor = tw.Cursor(api.search,
q=search_words,
lang="en",
tweet_mode='extended')
cursor
Use the Cursor.items() function:
# select 5 tweets
tweets = cursor.items(5)
tweets
As the name suggests, iterators need to be iterated over to return objects. In our case, we can use a for loop to iterate through the tweets that we pulled from the API.
# Iterate on tweets
for tweet in tweets:
print(tweet.full_text)
Unfortunately, there is no way to search for tweets between specific dates.
The API pulls tweets from the most recent page of the search result, and then grabs tweets from the previous page, and so on, to return the requested number of tweets.
The API documentation has examples of different query string use cases
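For example, a few of the standard search operators, shown here only as illustrations (the hashtags and phrase below are hypothetical; check the search docs for the full set):
# Hypothetical example queries using standard search operators
query_or = "#eagles OR #flyeaglesfly"          # match either hashtag
query_phrase = '"go birds" -filter:retweets'   # exact phrase, no retweets
query_links = "#eagles filter:links"           # only tweets containing links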
new_search = search_words + " -filter:retweets"
Get new tweets using our new search query:
cursor = tw.Cursor(api.search,
q=new_search,
lang="en",
tweet_mode='extended')
tweets = cursor.items(5)
for tweet in tweets:
print(tweet.full_text)
Create a list of the tweets using Python's list comprehension syntax:
# select the next 10 tweets
tweets = [t for t in cursor.items(10)]
print(len(tweets))
Beyond the text of the tweet, things like favorite and retweet counts are available. You can also extract info on the user behind the tweet.
first_tweet = tweets[0]
first_tweet.full_text
first_tweet.favorite_count
first_tweet.created_at
first_tweet.retweet_count
first_tweet.user.description
first_tweet.user.followers_count
A fraction of tweets have locations associated with user profiles, giving (very rough) location data.
users_locs = [[tweet.user.screen_name, tweet.user.location] for tweet in tweets]
users_locs
Note: only about 1% of tweets have a latitude/longitude.
Difficult to extract geographic trends without pulling a large number of tweets, requiring a premium API.
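As a quick check, here is a sketch that counts how many of our pulled tweets are actually geotagged, using the coordinates and place attributes carried by v1.1 tweet objects:
# Count how many tweets carry an exact point or a tagged place (usually very few)
n_coords = sum(getattr(t, "coordinates", None) is not None for t in tweets)
n_place = sum(getattr(t, "place", None) is not None for t in tweets)
print(f"{n_coords} of {len(tweets)} tweets have coordinates; {n_place} have a place tag")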
An example of text mining
Save the text of 1,000 tweets after querying our cursor object.
cursor = tw.Cursor(api.search,
q="#eagles -filter:retweets",
lang="en",
tweet_mode='extended')
tweets = [tweet for tweet in cursor.items(1000)]
# get the text of the tweets
tweets_text = [tweet.full_text for tweet in tweets]
# the first five tweets
tweets_text[:5]
This will identify "t.co" in URLs, e.g. https://t.co/Sp1Qtf5Fnl
Don't worry about mastering regular expression syntax...
StackOverflow is your friend
def remove_url(txt):
"""
Replace URLs found in a text string with nothing
(i.e. it will remove the URL from the string).
Parameters
----------
txt : string
A text string that you want to parse and remove urls.
Returns
-------
The same txt string with URLs removed.
"""
import re
return " ".join(re.sub("https://t.co/[A-Za-z\\d]+|&", "", txt).split())
tweets_no_urls = [remove_url(tweet) for tweet in tweets_text]
tweets_no_urls[:5]
.lower() makes all words lower case
.split() splits a string into the individual words
"This is an Example".lower()
"This is an Example".lower().split()
Apply these functions to all tweets:
words_in_tweet = [tweet.lower().split() for tweet in tweets_no_urls]
words_in_tweet[:2]
We'll define a helper function to calculate word frequencies from our lists of words.
def count_word_frequencies(words_in_tweet, top=15):
"""
Given a list of all words for every tweet, count
word frequencies across all tweets.
By default, this returns the top 15 words, but you
can specify a different value for `top`.
"""
import itertools, collections
# List of all words across tweets
all_words = list(itertools.chain(*words_in_tweet))
# Create counter
counter = collections.Counter(all_words)
return pd.DataFrame(counter.most_common(top),
columns=['words', 'count'])
counts_no_urls = count_word_frequencies(words_in_tweet, top=15)
counts_no_urls.head(n=15)
Use seaborn to plot our DataFrame of word counts...
fig, ax = plt.subplots(figsize=(8, 8))
# Plot horizontal bar graph
sns.barplot(
y="words",
x="count",
data=counts_no_urls.sort_values(by="count", ascending=False),
ax=ax,
color="#cc3000",
saturation=1.0,
)
ax.set_title("Common Words Found in Tweets (Including All Words)", fontsize=16)
Stop words: common words that do not carry much significance and are often ignored in text analysis.
We can use the nltk package, the "Natural Language Toolkit" (https://www.nltk.org/).
import nltk
nltk.download('stopwords');
stop_words = list(set(nltk.corpus.stopwords.words('english')))
stop_words[:10]
len(stop_words)
import string
punctuation = list(string.punctuation)
punctuation[:5]
ignored = stop_words + punctuation
ignored[:10]
# Remove stop words from each tweet list of words
tweets_nsw = [[word for word in tweet_words if word not in ignored]
for tweet_words in words_in_tweet]
tweets_nsw[0]
counts_nsw = count_word_frequencies(tweets_nsw)
counts_nsw.head()
And plot...
fig, ax = plt.subplots(figsize=(8, 8))
sns.barplot(
y="words",
x="count",
data=counts_nsw.sort_values(by="count", ascending=False),
ax=ax,
color="#cc3000",
saturation=1.0,
)
ax.set_title("Common Words Found in Tweets (Without Stop Words)", fontsize=16);
Finally, remove the search terms themselves, so we're left with only the meaningful words...
search_terms = ['#eagles', "eagles", "@eagles"]
tweets_final = [[w for w in word if w not in search_terms]
for word in tweets_nsw]
# frequency counts
counts_final = count_word_frequencies(tweets_final)
fig, ax = plt.subplots(figsize=(8, 8))
sns.barplot(
y="words",
x="count",
data=counts_final.sort_values(by="count", ascending=False),
ax=ax,
color="#cc3000",
saturation=1.0,
)
ax.set_title("Common Words Found in Tweets (Cleaned)", fontsize=16)
Get 1,000 tweets using a query string of your choice and plot the word frequencies.
Be sure to remove URLs, stop words, punctuation, and your own search terms from the text first.
Note: if you try to pull more than 1,000 tweets you will likely run into the rate limit and have to wait 15 minutes.
Remember: the API documentation describes how to customize a query string.
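A minimal sketch of the workflow, assuming a hypothetical #sixers query (swap in whatever search string you like); it reuses remove_url(), ignored, and count_word_frequencies() from above:
# Hypothetical query string; adjust to your own search
cursor = tw.Cursor(api.search, q="#sixers -filter:retweets", lang="en", tweet_mode="extended")
tweets_ex = [tweet for tweet in cursor.items(1000)]
# Clean: strip URLs, lower-case, split, drop stop words/punctuation/search terms
ignored_ex = ignored + ["#sixers", "sixers", "@sixers"]
words_ex = [
    [w for w in remove_url(t.full_text).lower().split() if w not in ignored_ex]
    for t in tweets_ex
]
# Count and plot the most common words
counts_ex = count_word_frequencies(words_ex, top=15)
fig, ax = plt.subplots(figsize=(8, 8))
sns.barplot(
    y="words", x="count",
    data=counts_ex.sort_values("count", ascending=False),
    ax=ax, color="#cc3000",
)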
The goal of a sentiment analysis is to determine the attitude or emotional state of the person who sent a particular tweet.
Often used by brands to evaluate public opinion about a product.
Determine the "sentiment" of every word in the English language
Train a machine learning algorithm to classify words as positive vs. negative, given an input training sample of words.
Luckily, this is a very common task in NLP and there are several packages available that have done the hard work for you.
They provide out-of-the-box sentiment analysis using pre-trained machine learning algorithms.
## Using textblob
import textblob
Simply pass the tweet text to the TextBlob() object.
Note: it's best to remove any URLs first!
blobs = [textblob.TextBlob(remove_url(t.full_text)) for t in tweets]
blobs[0]
blobs[0].sentiment
Track the polarity, subjectivity, and date of each tweet.
data = {}
data['date'] = [tweet.created_at for tweet in tweets]
data['polarity'] = [blob.sentiment.polarity for blob in blobs]
data['subjectivity'] = [blob.sentiment.subjectivity for blob in blobs]
data['text'] = [remove_url(tweet.full_text) for tweet in tweets]
data = pd.DataFrame(data)
data.head()
We can remove tweets with a polarity of zero to get a better sense of emotions.
zero = (data['polarity']==0).sum()
print("number of unbiased tweets = ", zero)
# remove unbiased tweets
biased = data.loc[ data['polarity'] != 0 ].copy()
We can find the tweets with the most negative and most positive polarity scores.
Use the idxmin() function:
biased.loc[biased['polarity'].idxmin(), 'text']
Use the idxmax() function:
biased.loc[biased['polarity'].idxmax(), 'text']
Important: Polarity runs from -1 (most negative) to +1 (most positive)
We can use matplotlib's hist() function:
# create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# histogram
ax.hist(biased['polarity'], bins='auto')
ax.axvline(x=0, c='k', lw=2)
# format
ax.set_xlabel("Polarity")
ax.set_title("Polarity of #eagles Tweets", fontsize=16);
biased['polarity'].median()
biased['polarity'].mean()
biased.loc[biased['subjectivity'].idxmin(), 'text']
biased.loc[biased['subjectivity'].idxmax(), 'text']
Important: Subjectivity runs from 0 (most objective) to +1 (most subjective)
# create a figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# histogram
ax.hist(biased['subjectivity'], bins='auto')
ax.axvline(x=0.5, c='k', lw=2)
# format
ax.set_xlabel("Subjectivity")
ax.set_title("Subjectivity of #eagles Tweets", fontsize=16);
Are positive/negative tweets more or less objective?
## Use the regplot() function
Is there a linear trend?
ax = sns.regplot(x=biased['subjectivity'], y=biased['polarity'])
## Use kdeplot()
Shade the bivariate relationship:
ax = sns.kdeplot(x=biased['subjectivity'], y=biased['polarity'], fill=True)
Insight: the most subjective tweets tend to be most polarized as well...
We can plot the distribution of polarity by the tweet's hour
First, we'll add a new column that gives the day and hour of the tweet.
We can use the built-in strftime() function.
# this is month/day hour AM/PM
biased['date_string'] = biased['date'].dt.strftime("%-m/%d %I %p")
biased.head()
Sort the tweets in chronological order...
biased = biased.sort_values(by='date', ascending=True)
Use Seaborn's boxplot() function:
fig, ax = plt.subplots(figsize=(8, 14))
sns.boxplot(y='date_string', x='polarity', data=biased, ax=ax)
ax.axvline(x=0, c='k', lw=2) # neutral
# Set yticks to every other hour
yticks = ax.get_yticks()
ax.set_yticks(range(0, len(yticks), 2))
plt.setp(ax.get_yticklabels(), fontsize=10);
fig, ax = plt.subplots(figsize=(8,14))
sns.boxplot(y='date_string', x='subjectivity', data=biased)
ax.axvline(x=0.5, c='k', lw=2) # neutral
# Set yticks to every other hour
yticks = ax.get_yticks()
ax.set_yticks(range(0, len(yticks), 2))
plt.setp(ax.get_yticklabels(), fontsize=10);
Analyze your set of tweets from the last exercise (or get a new set), and explore the sentiments, e.g., by calculating polarity and subjectivity, plotting their distributions, and examining how they change over time.
Or explore trends in some new way!
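For example, here's a small sketch (reusing the biased DataFrame and the date_string column from above) that summarizes average sentiment by hour:
# Average polarity and subjectivity per hour; sort=False keeps the chronological order
hourly = biased.groupby("date_string", sort=False)[["polarity", "subjectivity"]].mean()
hourly.head()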