Week 6
Web Scraping

Oct 11, 2021

Housekeeping

  • Homework #3 (required) due a week from today (10/18)
  • Homework #4 will be assigned next week
  • You must complete one of homeworks #4, #5, or #6
  • Final project due at the end of the finals period; more details soon

The roadmap

  • Last time: APIs, Census data, Twitter
  • Today: web scraping
  • Next: big data, geo data science in the wild, machine learning, interactive web maps, dashboarding & web servers

The final project will ask you to combine several of these topics/techniques to analyze a data set and produce a web-based data visualization

Today: web scraping

  • Why web scraping?
  • Getting familiar with the Web
  • Web scraping: extracting data from static sites
  • How to deal with dynamic content
In [1]:
# Start with the usual imports
# We'll use these throughout
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

What is web scraping?

Using software to gather and extract data/content from websites

Why is web scraping useful?

  • Not every data source provides an API
  • The Web contains a lot of information
  • Unique data sources that may not be available elsewhere

What is possible: 11 million rental listings from Craigslist

Source: Geoff Boeing

Why isn't web scraping more popular?

  • It can be time-consuming, and extracting large volumes of data is difficult
  • You are at the mercy of website maintainers — if the website structure changes, your code breaks
  • Most importantly, there are ethical and legal concerns

RadPad scraped the entirety of Craigslist; Craigslist sued RadPad and was awarded $60 million

The main legal concerns:

  1. Copyright infringement
    • For example: pictures, rental listing text
  2. Terms of Use violations
    • Unauthorized: Is scraping prohibited in the website’s terms of use?
    • Intentional: Was the person aware of the terms? Did they check an “I agree to these terms” box?
    • Causes damage: Did the scraping overload the website, blocking user access?

When is web scraping probably okay?

  • .gov sites and, to a lesser degree, .edu sites
  • Website owner has no business reason to protect the information
  • Not prohibited in terms of use
  • Limited number of requests
  • Not too many requests all at once
  • Done at night, when web traffic is low
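
The guidelines above can be partly automated. A minimal "polite scraping" sketch, using only the standard library: check the site's robots.txt rules before requesting a page, and pace your requests. The rules and URLs below are inline placeholders, not fetched from a real site.

```python
import time
from urllib.robotparser import RobotFileParser

# Example robots.txt content (a real scraper would fetch this
# from https://<site>/robots.txt)
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Only request pages the rules allow
allowed = parser.can_fetch("*", "https://example.com/listings")   # True
blocked = parser.can_fetch("*", "https://example.com/private/x")  # False

def fetch_politely(urls, delay=2):
    """Visit URLs one at a time, pausing between requests."""
    for url in urls:
        # in real code: response = requests.get(url)
        time.sleep(delay)
```

A `Crawl-delay` directive, when present, tells you how long the site asks you to wait between requests; `parser.crawl_delay("*")` returns it.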

When is it less likely to be okay?

  • search engines
  • E-commerce sites (e.g. Zillow, Expedia, Amazon)
  • Social media
  • Prohibited in terms of use
  • Large number of requests
  • High frequency of requests

With that being said, let's do some web scraping

A primer on Web definitions

So many acronyms:

  • HTML
  • The DOM
  • CSS

HTML: HyperText Markup Language

  • The language most websites are written in
  • The browser knows how to read this language and renders the output for you
  • HTML is what a web crawler will see
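
To see the difference between what a crawler sees and what a user reads, you can strip the tags yourself. A quick sketch using only the standard library (the markup is a made-up example):

```python
from html.parser import HTMLParser

raw = "<html><body><h1>Listings</h1><p>2 bed, $1,500/mo</p></body></html>"

class TextExtractor(HTMLParser):
    """Collect only the text content, ignoring the tags a browser would render."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

extractor = TextExtractor()
extractor.feed(raw)
text = " ".join(extractor.chunks)

print(raw)   # what the scraper sees: markup, tags and all
print(text)  # what the user reads: "Listings 2 bed, $1,500/mo"
```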

HTML tags

  • There are a standard set of tags to define the different structural components of a webpage
  • For example:
    • <h1>, <h2> tags define headers
    • <p> tags define paragraphs
    • <ol> and <ul> are ordered and unordered lists

Jupyter notebooks can render HTML

Use the %%html magic cell command

In [2]:
%%html

<html>
  <head>
    <title>TITLE GOES HERE</title>
  </head>
  <body>
    <h1>MAIN CONTENT GOES IN THE BODY TAG</h1>
    <p>This is a paragraph tag</p>
    <p>This is a second paragraph tag</p>
  </body>
</html>
TITLE GOES HERE

MAIN CONTENT GOES IN THE BODY TAG

This is a paragraph tag

This is a second paragraph tag
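
The same snippet can also be parsed programmatically. A sketch assuming the BeautifulSoup library is installed (`pip install beautifulsoup4`):

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>TITLE GOES HERE</title></head>
  <body>
    <h1>MAIN CONTENT GOES IN THE BODY TAG</h1>
    <p>This is a paragraph tag</p>
    <p>This is a second paragraph tag</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.title.text                        # text inside the <title> tag
paragraphs = [p.text for p in soup.find_all("p")]  # all <p> tags, in order
```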

HTML: elements, tags, and attributes

Learning the notation:

In [4]:
%%html

<a id="my-link" style="color: orange;" href="https://www.design.upenn.edu" target="_blank">This is my link</a>

The element: [screenshot]

The tag: [screenshot]

The attributes: [screenshot]

Some attributes have special meaning

  • In particular: id and class
  • Allows you to:
    • select and manipulate specific elements
    • apply styling to specific types of elements

CSS: Cascading Stylesheets

  • A language for styling HTML pages
  • CSS rules are applied to HTML elements via selectors, which match elements by tag name, class, or ID

Basic Web selectors

  • Class
    • e.g., .red
  • ID
    • e.g., #some-id
  • Tag
    • e.g., p, li, div
  • IDs: unique identifiers
    • no two elements on a page will have the same ID.
  • Classes: not unique
    • many elements will have the same class
    • a single element can have multiple classes

And many more: look up the syntax when you need it

https://www.w3schools.com/cssref/css_selectors.asp
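
The same CSS selector syntax works in BeautifulSoup's `.select()` method, so learning it pays off for scraping too. A sketch with made-up markup, assuming BeautifulSoup is installed:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="red">Philadelphia</li>
  <li class="red">Pittsburgh</li>
  <li id="capital">Harrisburg</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".red")    # class selector: matches two <li> tags
by_id = soup.select("#capital")   # ID selector: matches one <li> tag
by_tag = soup.select("li")        # tag selector: matches all three
```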

Finally: dynamic content

The DOM: Document Object Model

  • An interactive object tree created from the HTML tag hierarchy on a page
  • Created by the browser
  • Tracks user interactions
  • It is dynamic: stores the current state of the webpage

Step 1: Inspecting a webpage

  • Modern web browsers provide tools for inspecting the source HTML and DOM of websites
  • Also tells you data sources that have been loaded by the page
  • This should also be your first step when starting to scrape a page

Simply hit F12 to load the Web Inspector


The Elements tab

  • Allows you to inspect the DOM directly
  • The tool that will allow you to identify what data you want to scrape from a website


The Elements tab

  1. Right click on the element you want to view
  2. Click on "Inspect"
  3. The element will be highlighted in the DOM (in the Elements tab)

The Network Tab

  • Shows all resources the page downloaded after the initial HTML loaded (these background requests are often called AJAX requests)
  • Filter by JS or XHR (XHR = XMLHttpRequest)

We're usually looking for .json files, which often hold the page's JSON or GeoJSON data
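
Once the Network tab reveals a JSON endpoint, you can often skip HTML parsing entirely and request the data directly. A sketch with a hypothetical endpoint; the URL is a placeholder, and the response text stands in for what `requests.get(endpoint_url).json()` would return:

```python
import json
import pandas as pd
# import requests  # in practice: records = requests.get(endpoint_url).json()

# Hypothetical endpoint spotted in the XHR list
endpoint_url = "https://example.com/data/listings.json"

# Stand-in for the endpoint's response body
response_text = '[{"id": 1, "price": 1500}, {"id": 2, "price": 1850}]'
records = json.loads(response_text)

# JSON arrays of objects convert directly to a DataFrame
df = pd.DataFrame(records)
```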

To be continued...
