Week 6
Web Scraping

Oct 11, 2021

Housekeeping

  • Homework #3 (required) due a week from today (10/18)
  • Homework #4 will be assigned next week
  • You must complete one of homeworks #4, #5, or #6
  • Final project due at the end of the finals period; more details soon

The roadmap

  • Last time: APIs, Census data, Twitter
  • Today: web scraping
  • Next: big data, geo data science in the wild, machine learning, interactive web maps, dashboarding & web servers

The final project will ask you to combine several of these topics/techniques to analyze a data set and produce a web-based data visualization

Today: web scraping

  • Why web scraping?
  • Getting familiar with the Web
  • Web scraping: extracting data from static sites
  • How to deal with dynamic content
In [1]:
# Start with the usual imports
# We'll use these throughout
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

What is web scraping?

Using software to gather and extract data/content from websites

Why is web scraping useful?

  • Not every data source provides an API
  • The Web contains a lot of information
  • Unique data sources that may not be available elsewhere

What is possible: 11 million rental listings from Craigslist

Source: Geoff Boeing

Why isn't web scraping more popular?

  • It can be time-consuming, and extracting large volumes of data is difficult
  • You are at the mercy of website maintainers — if the website structure changes, your code breaks
  • Most importantly, there are ethical and legal concerns

RadPad scraped the entirety of Craigslist; Craigslist sued RadPad and was awarded $60 million

The main legal concerns:

  1. Copyright infringement
    • For example: pictures, rental listing text
  2. Terms of Use violations
    • Unauthorized: Is scraping prohibited in the website’s terms of use?
    • Intentional: Was the person aware of the terms? Did they check an “I agree to these terms” box?
    • Causes damage: Did the scraping overload the website, blocking user access?

When is web scraping probably okay?

  • .gov sites and, to a lesser degree, .edu sites
  • Website owner has no business reason to protect the information
  • Not prohibited in terms of use
  • Limited number of requests
  • Not too many requests all at once
  • Done at night, when web traffic is low
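
The guidelines above can be partly automated. A minimal "polite scraping" sketch, using only the standard library: check the site's robots.txt rules before requesting a page, and pace your requests. The rules and URLs below are inline placeholders, not fetched from a real site.

```python
import time
from urllib.robotparser import RobotFileParser

# Example robots.txt content (a real scraper would fetch this
# from https://<site>/robots.txt)
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Only request pages the rules allow
allowed = parser.can_fetch("*", "https://example.com/listings")   # True
blocked = parser.can_fetch("*", "https://example.com/private/x")  # False

def fetch_politely(urls, delay=2):
    """Visit URLs one at a time, pausing between requests."""
    for url in urls:
        # in real code: response = requests.get(url)
        time.sleep(delay)
```

A `Crawl-delay` directive, when present, tells you how long the site asks you to wait between requests; `parser.crawl_delay("*")` returns it.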

When is it less likely to be okay?

  • search engines
  • E-commerce sites (e.g. Zillow, Expedia, Amazon)
  • Social media
  • Prohibited in terms of use
  • Large number of requests
  • High frequency of requests

With that being said, let's do some web scraping

A primer on Web definitions

So many acronyms:

  • HTML
  • The DOM
  • CSS

HTML: HyperText Markup Language

  • The language most websites are written in
  • The browser knows how to read this language and renders the output for you
  • HTML is what a web crawler will see
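
To see the difference between what a crawler sees and what a user reads, you can strip the tags yourself. A quick sketch using only the standard library (the markup is a made-up example):

```python
from html.parser import HTMLParser

raw = "<html><body><h1>Listings</h1><p>2 bed, $1,500/mo</p></body></html>"

class TextExtractor(HTMLParser):
    """Collect only the text content, ignoring the tags a browser would render."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

extractor = TextExtractor()
extractor.feed(raw)
text = " ".join(extractor.chunks)

print(raw)   # what the scraper sees: markup, tags and all
print(text)  # what the user reads: "Listings 2 bed, $1,500/mo"
```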

HTML tags

  • There are a standard set of tags to define the different structural components of a webpage
  • For example:
    • <h1>, <h2> tags define headers
    • <p> tags define paragraphs
    • <ol> and <ul> are ordered and unordered lists

Jupyter notebooks can render HTML

Use the %%html magic cell command

In [2]:
%%html

<html>
  <head>
    <title>TITLE GOES HERE</title>
  </head>
  <body>
    <h1>MAIN CONTENT GOES IN THE BODY TAG</h1>
    <p>This is a paragraph tag</p>
    <p>This is a second paragraph tag</p>
  </body>
</html>
TITLE GOES HERE

MAIN CONTENT GOES IN THE BODY TAG

This is a paragraph tag

This is a second paragraph tag
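
The same snippet can also be parsed programmatically. A sketch assuming the BeautifulSoup library is installed (`pip install beautifulsoup4`):

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>TITLE GOES HERE</title></head>
  <body>
    <h1>MAIN CONTENT GOES IN THE BODY TAG</h1>
    <p>This is a paragraph tag</p>
    <p>This is a second paragraph tag</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.title.text                        # text inside the <title> tag
paragraphs = [p.text for p in soup.find_all("p")]  # all <p> tags, in order
```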

HTML: elements, tags, and attributes

Learning the notation:

In [4]:
%%html

<a id="my-link" style="color: orange;" href="https://www.design.upenn.edu" target="_blank">This is my link</a>

The element: [screenshot]

The tag: [screenshot]

The attributes: [screenshot]

Some attributes have special meaning

  • In particular: id and class
  • Allows you to:
    • select and manipulate specific elements
    • apply styling to specific types of elements

CSS: Cascading Stylesheets

  • A language for styling HTML pages
  • CSS rules are applied to HTML elements via selectors, which match elements by tag name, class, or ID

Basic Web selectors

  • Class
    • e.g., .red
  • ID
    • e.g., #some-id
  • Tag
    • e.g., p, li, div
  • IDs: unique identifiers
    • no two elements on a page will have the same ID.
  • Classes: not unique
    • many elements will have the same class
    • a single element can have multiple classes

And many more: look up the syntax when you need it

https://www.w3schools.com/cssref/css_selectors.asp
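
The same CSS selector syntax works in BeautifulSoup's `.select()` method, so learning it pays off for scraping too. A sketch with made-up markup, assuming BeautifulSoup is installed:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="red">Philadelphia</li>
  <li class="red">Pittsburgh</li>
  <li id="capital">Harrisburg</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".red")    # class selector: matches two <li> tags
by_id = soup.select("#capital")   # ID selector: matches one <li> tag
by_tag = soup.select("li")        # tag selector: matches all three
```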

Finally: dynamic content

The DOM: Document Object Model

  • An interactive object tree created from the HTML tag hierarchy on a page
  • Created by the browser
  • Tracks user interactions
  • It is dynamic: stores the current state of the webpage

Step 1: Inspecting a webpage

  • Modern web browsers provide tools for inspecting the source HTML and DOM of websites
  • Also tells you data sources that have been loaded by the page
  • This should also be your first step when starting to scrape a page

Simply hit F12 to load the Web Inspector


The Elements tab

  • Allows you to inspect the DOM directly
  • The tool that will allow you to identify what data you want to scrape from a website


The Elements tab

  1. Right click on the element you want to view
  2. Click on "Inspect"
  3. The element will be highlighted in the DOM (in the Elements tab)

The Network Tab

  • Shows all resources the page downloaded after the initial HTML loaded (these background requests are often called AJAX requests)
  • Filter by JS or XHR (XHR = XMLHttpRequest)

We're usually looking for .json files, which often hold the page's JSON or GeoJSON data
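
Once the Network tab reveals a JSON endpoint, you can often skip HTML parsing entirely and request the data directly. A sketch with a hypothetical endpoint; the URL is a placeholder, and the response text stands in for what `requests.get(endpoint_url).json()` would return:

```python
import json
import pandas as pd
# import requests  # in practice: records = requests.get(endpoint_url).json()

# Hypothetical endpoint spotted in the XHR list
endpoint_url = "https://example.com/data/listings.json"

# Stand-in for the endpoint's response body
response_text = '[{"id": 1, "price": 1500}, {"id": 2, "price": 1850}]'
records = json.loads(response_text)

# JSON arrays of objects convert directly to a DataFrame
df = pd.DataFrame(records)
```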

To be continued...
