“The goal is to turn data into information, and information into insight.” - Carly Fiorina


Prerequisites

  • We will use Google Chrome in this section.

  • Download the ChromeDriver file, unzip it, and move the folder to your working directory. The ChromeDriver version must match the version of Google Chrome you have installed.

  • Modules used in this section:

# Load modules
import os
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from datetime import datetime, timedelta
  • The Jupyter Notebook for this section is available here.

Web scraping

Web scraping refers to the process of extracting data from a website. This can, of course, be done manually: you could go to a website, find the relevant data or information, and enter that information into some data file that you have stored locally. But imagine that you want to pull a very large dataset from hundreds or thousands of individual URLs. In this case, extracting the data manually sounds overwhelming and time-consuming.

In this article, we’ll see how to automate this process with Python, using the requests, BeautifulSoup, and Selenium modules. requests is an elegant and simple HTTP library for Python, BeautifulSoup is a Python module for extracting and parsing data from HTML files, and Selenium automates the interaction between Python and your web browser.

Installing libraries

Activate the cper environment, as we will install requests, BeautifulSoup, and Selenium under it.

conda activate cper

If you fail to find cper, follow Creating a Python environment to create one.

To install requests, BeautifulSoup, and Selenium:

conda install requests
conda install beautifulsoup4
conda install selenium

Fetching and rendering a web page

First, let’s try rendering a web page without Selenium or other tools, and parse the HTML directly. For this demo, we’re going to scrape the information contained in the Daily Observations table from a Weather Underground page for weather observations taken at Shenzhen Bao’an International Airport (ZGSZ) on December 10, 2022. Here’s the URL. You may need a VPN to open the webpage.

Before we try rendering the web page, we want to inspect the HTML elements to determine which element we need to select. Go to the Daily Observations table, right-click, and choose Inspect.

We can see that the entire Daily Observations table is contained under lib-city-history-observation. Let’s use that tag to grab just the Daily Observations table elements from the HTML. We’ll also use the .prettify() method to print out the selected HTML to make it just a little more readable.

# URL
data_url = 'https://www.wunderground.com/history/daily/cn/shenzhen/ZGSZ/date/2022-12-10'

# Use requests.get() method to load the page
data_page = requests.get(data_url)

# Use BeautifulSoup to parse the page
data_soup = BeautifulSoup(data_page.text, 'html.parser')

# Find information under lib-city-history-observation
soup_container = data_soup.find('lib-city-history-observation')

# Print out the selected HTML 
# to make it just a little more readable
print(soup_container.prettify())

Here is the output:

<lib-city-history-observation _ngcontent-sc275="" _nghost-sc236="">
 <div _ngcontent-sc236="">
  <div _ngcontent-sc236="" class="observation-title">
   Daily Observations
  </div>
  No Data Recorded
  <!-- -->
  <!-- -->
 </div>
</lib-city-history-observation>

The HTML returned by BeautifulSoup shows No Data Recorded, but when we go to the Daily Observations table on the web page, the table is obviously populated. What happened?

In this case, requests.get() only retrieved the initial page source; the Daily Observations table is filled in afterwards by JavaScript running in the browser, so it wasn’t present in the HTML that BeautifulSoup parsed. Here’s where Selenium will come in: it drives a real browser and lets the page content load completely before we try to parse it.
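
If you want to confirm this before switching tools, you can check the raw response directly; a minimal check, using the data_page object from the code above:

# The HTTP request itself succeeded (expect 200 here)...
print(data_page.status_code)
# ...but the placeholder text is what the server actually sent,
# because the table rows are filled in later by JavaScript in the browser
print('No Data Recorded' in data_page.text)   # expect True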

Using web drivers

At this point, you also need to install ChromeDriver.

Download the chromedriver_win32.zip file (if you use Windows), unzip it, and move the folder to your working directory. You should end up with a single executable file called chromedriver.exe inside a folder called chromedriver_win32.

Next, we’re going to write a my_rendering function that runs the ChromeDriver. It will open a Chrome window, load the web page fully, return the page HTML, and quit the ChromeDriver, closing the Chrome window. Be sure to update the path passed to webdriver.Chrome to wherever you placed the ChromeDriver; in our case, it’s chromedriver_win32\\chromedriver.exe.

# Define a rendering function that runs the ChromeDriver
def my_rendering(url):
    # The path to the ChromeDriver, change if necessary
    driver = webdriver.Chrome('chromedriver_win32\\chromedriver.exe')
    # Load the web page from the URL
    driver.get(url)
    # Wait for the web page to load
    time.sleep(3)
    # Get the page source HTML
    render = driver.page_source
    # Quit ChromeDriver
    driver.quit()
    # Return the page source HTML
    return render
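
Note that passing the driver path positionally, as above, matches older Selenium releases. If you are on Selenium 4.x, the path is usually supplied through a Service object instead (and very recent releases can often locate a matching driver on their own). Here is a sketch of the same function under that assumption:

# Variant of my_rendering for Selenium 4.x, where the ChromeDriver path
# is supplied via a Service object (same path assumption as above)
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

def my_rendering_v4(url):
    service = Service('chromedriver_win32\\chromedriver.exe')
    driver = webdriver.Chrome(service=service)
    # Load the web page from the URL
    driver.get(url)
    # Wait for the web page to load
    time.sleep(3)
    # Get the page source HTML, then quit ChromeDriver
    render = driver.page_source
    driver.quit()
    return render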

Now, using the my_rendering function, let’s try again to get the Daily Observations table elements from the HTML. The code follows the same steps as before, except that instead of requests.get(), we call our rendering function:

# URL
data_url = 'https://www.wunderground.com/history/daily/cn/shenzhen/ZGSZ/date/2022-12-10'

# Use my_rendering function to load the page
data_page = my_rendering(data_url)

# Use BeautifulSoup to parse the page
data_soup = BeautifulSoup(data_page, 'html.parser')

# Find information under lib-city-history-observation
soup_container = data_soup.find('lib-city-history-observation')

# Print out the selected HTML 
# to make it just a little more readable
print(soup_container.prettify())
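
This time the table elements should come back populated. One caveat: the fixed three-second sleep in my_rendering may be too short on a slow connection and wasteful on a fast one. Selenium’s explicit waits can instead block until the table rows actually appear. Here is a sketch, assuming the same driver path as above and the row class names we will rely on for parsing below:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Variant of my_rendering that waits for the observation rows themselves
# instead of sleeping for a fixed number of seconds
def my_rendering_waited(url, timeout=15):
    driver = webdriver.Chrome('chromedriver_win32\\chromedriver.exe')
    driver.get(url)
    # Block until at least one table row is present,
    # or raise TimeoutException after `timeout` seconds
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, 'lib-city-history-observation tr.mat-row')))
    render = driver.page_source
    driver.quit()
    return render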

Parsing the HTML

The returned HTML might look overwhelming, but you can do a quick search for specific keywords in it. For example, we know the table contains parameters of interest such as Temperature, Dew Point, and Humidity.

Looking through the HTML for these keywords, we find that each row (e.g., the 2:00 AM row) is a tr element, each cell within a row is a td element, and each data point (e.g., the temperature) sits inside a span. We also find that different fields are associated with different class attributes, which we can list and loop over.
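
Before writing the full parser, it can be helpful to print the cells of a single row to see this structure; a quick look, using the soup_container object from above:

# Grab the first data row and print the text of each of its cells
first_row = soup_container.find('tr', class_='mat-row cdk-row ng-star-inserted')
for cell in first_row.find_all('td'):
    print(cell.get_text(strip=True))

With that structure confirmed, we can list the class attributes we need and loop over every row: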

# Get the full table
soup_data = soup_container.find_all('tr', class_='mat-row cdk-row ng-star-inserted')

# Tag of each element cell, in the following order:
# Time, Temperature [F], Dew point [F], Humidity [%],
# Wind Direction, Wind Speed [mph], Wind Gust [mph],
# Pressure [in], Precipitation [in], Condition
cell_tag = ['mat-cell cdk-cell cdk-column-dateString mat-column-dateString ng-star-inserted',
            'mat-cell cdk-cell cdk-column-temperature mat-column-temperature ng-star-inserted',
            'mat-cell cdk-cell cdk-column-dewPoint mat-column-dewPoint ng-star-inserted',
            'mat-cell cdk-cell cdk-column-humidity mat-column-humidity ng-star-inserted',
            'mat-cell cdk-cell cdk-column-windcardinal mat-column-windcardinal ng-star-inserted',
            'mat-cell cdk-cell cdk-column-windSpeed mat-column-windSpeed ng-star-inserted',
            'mat-cell cdk-cell cdk-column-windGust mat-column-windGust ng-star-inserted',
            'mat-cell cdk-cell cdk-column-pressure mat-column-pressure ng-star-inserted',
            'mat-cell cdk-cell cdk-column-precipRate mat-column-precipRate ng-star-inserted',
            'mat-cell cdk-cell cdk-column-condition mat-column-condition ng-star-inserted']

# Tag of each element value, in the following order:
# Time, Temperature [F], Dew point [F], Humidity [%],
# Wind Direction, Wind Speed [mph], Wind Gust [mph],
# Pressure [in], Precipitation [in], Condition
value_tag = ['ng-star-inserted'    , 'wu-value wu-value-to', 'wu-value wu-value-to',
             'wu-value wu-value-to', 'ng-star-inserted'    , 'wu-value wu-value-to',
             'wu-value wu-value-to', 'wu-value wu-value-to', 'wu-value wu-value-to',
              'ng-star-inserted']

# Loop through rows
for i, dat in enumerate(soup_data):
    # Create an empty list for saving values
    row = []  
    # Loop through the different fields, i.e.,
    # Time, Temperature [F], Dew point [F], Humidity [%],
    # Wind Direction, Wind Speed [mph], Wind Gust [mph],
    # Pressure [in], Precipitation [in], Condition
    for field in range(0,len(cell_tag)):
        # Loop through cells
        for j, cell in enumerate(dat.find_all('td', class_=cell_tag[field])): 
            # Loop through element value
            for k in cell.find_all('span', class_=value_tag[field]):
                # Get text
                tmp = k.text
                # Remove any extra spaces
                tmp = tmp.strip('  ')
                # Append data
                row.append(tmp)
    # Print the row
    print(i, row)
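
If you would rather keep the parsed rows in memory than print them, you could collect them into a pandas DataFrame instead. pandas is not among the modules listed in the prerequisites, so treat this as an optional sketch that reuses soup_data, cell_tag, and value_tag from above:

import pandas as pd

# Column names for the ten values parsed from each row
columns = ['Time', 'Temperature [F]', 'Dew point [F]', 'Humidity [%]',
           'Wind Direction', 'Wind Speed [mph]', 'Wind Gust [mph]',
           'Pressure [in]', 'Precipitation [in]', 'Condition']

# Collect the rows with the same nested loops as above
rows = []
for dat in soup_data:
    row = []
    for field in range(len(cell_tag)):
        for cell in dat.find_all('td', class_=cell_tag[field]):
            for k in cell.find_all('span', class_=value_tag[field]):
                row.append(k.text.strip())
    rows.append(row[:10])

df = pd.DataFrame(rows, columns=columns)
print(df.head())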

Saving HTML Data to CSV File

Now let’s modify the above chunk of code to save data to a local file (Shenzhen_20221210.csv).

# Get the full table
soup_data = soup_container.find_all('tr', class_='mat-row cdk-row ng-star-inserted')

# Tag of each element cell, in the following order:
# Time, Temperature [F], Dew point [F], Humidity [%],
# Wind Direction, Wind Speed [mph], Wind Gust [mph],
# Pressure [in], Precipitation [in], Condition
cell_tag = ['mat-cell cdk-cell cdk-column-dateString mat-column-dateString ng-star-inserted',
            'mat-cell cdk-cell cdk-column-temperature mat-column-temperature ng-star-inserted',
            'mat-cell cdk-cell cdk-column-dewPoint mat-column-dewPoint ng-star-inserted',
            'mat-cell cdk-cell cdk-column-humidity mat-column-humidity ng-star-inserted',
            'mat-cell cdk-cell cdk-column-windcardinal mat-column-windcardinal ng-star-inserted',
            'mat-cell cdk-cell cdk-column-windSpeed mat-column-windSpeed ng-star-inserted',
            'mat-cell cdk-cell cdk-column-windGust mat-column-windGust ng-star-inserted',
            'mat-cell cdk-cell cdk-column-pressure mat-column-pressure ng-star-inserted',
            'mat-cell cdk-cell cdk-column-precipRate mat-column-precipRate ng-star-inserted',
            'mat-cell cdk-cell cdk-column-condition mat-column-condition ng-star-inserted']

# Tag of each element value, in the following order:
# Time, Temperature [F], Dew point [F], Humidity [%],
# Wind Direction, Wind Speed [mph], Wind Gust [mph],
# Pressure [in], Precipitation [in], Condition
value_tag = ['ng-star-inserted'    , 'wu-value wu-value-to', 'wu-value wu-value-to',
             'wu-value wu-value-to', 'ng-star-inserted'    , 'wu-value wu-value-to',
             'wu-value wu-value-to', 'wu-value wu-value-to', 'wu-value wu-value-to',
              'ng-star-inserted']

# Open a file to save data
outfile = 'Shenzhen_20221210.csv'

with open(outfile, 'w') as f:
    # Write the header
    f.write('Date, Time, Temperature [F], Dew point [F], Humidity [%],'
            ' Wind Direction, Wind Speed [mph], Wind Gust [mph],'
            ' Pressure [in], Precipitation [in], Condition\n')
    # Loop through rows
    for i, dat in enumerate(soup_data):
        # Create an empty list for saving values
        row = []  
        # Loop through the different fields, i.e.,
        # Time, Temperature [F], Dew point [F], Humidity [%],
        # Wind Direction, Wind Speed [mph], Wind Gust [mph],
        # Pressure [in], Precipitation [in], Condition
        for field in range(0,len(cell_tag)):
            # Loop through cells
            for j, cell in enumerate(dat.find_all('td', class_=cell_tag[field])): 
                # Loop through element value
                for k in cell.find_all('span', class_=value_tag[field]):
                    # Get text
                    tmp = k.text
                    # Remove any extra spaces
                    tmp = tmp.strip('  ')
                    # Append data
                    row.append(tmp)
        # Print the row
        print(i, row)
        # Write the date of the recorded data into the file
        f.write('2022-12-10,') 
        # Write data into the file
        f.write(','.join(row[:10])) 
        # New line, to append more rows to the same file later on
        f.write('\n')
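
To make sure the file came out as expected, you can read back its first few lines; a minimal check:

# Print the header and the first few data rows of the file we just wrote
with open('Shenzhen_20221210.csv') as f:
    for line in f.readlines()[:5]:
        print(line.rstrip())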

Scraping multiple URLs

Let’s say we want to scrape the information contained in the Daily Observations table from Weather Underground pages for weather observations taken on any date. To do this, we need to convert the code above into its own function that can loop through multiple web pages for a range of input dates:

# Define the scraper function that
# gets data within a date range
def scrape_shenzhen_weather(start_date, end_date):
    
    # Tag of each element cell, in the following order:
    # Time, Temperature [F], Dew point [F], Humidity [%],
    # Wind Direction, Wind Speed [mph], Wind Gust [mph],
    # Pressure [in], Precipitation [in], Condition
    cell_tag = ['mat-cell cdk-cell cdk-column-dateString mat-column-dateString ng-star-inserted',
                'mat-cell cdk-cell cdk-column-temperature mat-column-temperature ng-star-inserted',
                'mat-cell cdk-cell cdk-column-dewPoint mat-column-dewPoint ng-star-inserted',
                'mat-cell cdk-cell cdk-column-humidity mat-column-humidity ng-star-inserted',
                'mat-cell cdk-cell cdk-column-windcardinal mat-column-windcardinal ng-star-inserted',
                'mat-cell cdk-cell cdk-column-windSpeed mat-column-windSpeed ng-star-inserted',
                'mat-cell cdk-cell cdk-column-windGust mat-column-windGust ng-star-inserted',
                'mat-cell cdk-cell cdk-column-pressure mat-column-pressure ng-star-inserted',
                'mat-cell cdk-cell cdk-column-precipRate mat-column-precipRate ng-star-inserted',
                'mat-cell cdk-cell cdk-column-condition mat-column-condition ng-star-inserted']
    
    # Tag of each element value, in the following order:
    # Time, Temperature [F], Dew point [F], Humidity [%],
    # Wind Direction, Wind Speed [mph], Wind Gust [mph],
    # Pressure [in], Precipitation [in], Condition
    value_tag = ['ng-star-inserted'    , 'wu-value wu-value-to', 'wu-value wu-value-to',
                 'wu-value wu-value-to', 'ng-star-inserted'    , 'wu-value wu-value-to',
                 'wu-value wu-value-to', 'wu-value wu-value-to', 'wu-value wu-value-to',
                 'ng-star-inserted']

    # Data URL that can be formatted to find the webpage for any observation date
    data_url = 'https://www.wunderground.com/history/daily/cn/shenzhen/ZGSZ/date/{}-{}-{}'
    
    # The CSV file name, based on the year of the start date
    outfile = 'Shenzhen_{}.csv'.format(start_date.year)
    
    # Write mode if the CSV file does not exist; otherwise append
    if not os.path.exists(outfile):
        mode = 'w'
    else:
        mode = 'a'
        
    with open(outfile, mode) as f:
        # Write the header
        if mode == 'w':
            f.write('Date, Time, Temperature [F], Dew point [F], Humidity [%],'
                    ' Wind Direction, Wind Speed [mph], Wind Gust [mph],'
                    ' Pressure [in], Precipitation [in], Condition\n')
    
        # Loop over dates until the (exclusive) end date is reached
        while start_date != end_date:     
            # Format the search URL for the given observation date 
            format_data_url = data_url.format(start_date.year,start_date.month,start_date.day)
            # Use my_rendering function to load the page
            data_page = my_rendering(format_data_url)
            # Use BeautifulSoup to parse the page
            data_soup = BeautifulSoup(data_page, 'html.parser')
            # Find information under lib-city-history-observation
            soup_container = data_soup.find('lib-city-history-observation')         
            # Get the full table
            soup_data = soup_container.find_all('tr',class_='mat-row cdk-row ng-star-inserted')
            
            # Loop through rows
            for i, dat in enumerate(soup_data):
                # Create an empty list for saving values
                row = []         
                # Loop through the different fields, i.e.,
                # Time, Temperature [F], Dew point [F], Humidity [%],
                # Wind Direction, Wind Speed [mph], Wind Gust [mph],
                # Pressure [in], Precipitation [in], Condition
                for field in range(0,len(cell_tag)):
                    # Loop through cells
                    for j, d in enumerate(dat.find_all('td', class_= cell_tag[field])):
                        # Loop through element value
                        for k in d.find_all('span', class_= value_tag[field]):
                            # Get text
                            tmp = k.text
                            # Remove any extra spaces
                            tmp = tmp.strip('  ')
                            # Append data
                            row.append(tmp)
                # Write the date of the recorded data into the file
                f.write('{}-{}-{},'.format(start_date.year, start_date.month, start_date.day))
                # Write data into the file
                f.write(','.join(row[:10]))
                # New line, to append more rows to the same file later on
                f.write('\n') 
                
            # Go to next date, i.e., next URL
            start_date += timedelta(days=1) 

Now that we have our new scrape_shenzhen_weather function defined, let’s use it to scrape data from 2022-01-01 up to (but not including) 2022-01-05. To do this, we simply need to define the start_date and end_date arguments and run the function.

# Set start and end dates
start_date = datetime(year=2022, month=1, day=1)
end_date = datetime(year=2022, month=1, day=5)

# Call the scraper function
scrape_shenzhen_weather(start_date, end_date)

You should end up with a CSV file (Shenzhen_2022.csv) containing one dated row per observation. Open it to check the results.
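
One quick way to do that, assuming pandas is available, is to load the file and look at its size and first rows:

import pandas as pd

# Load the scraped file and take a quick look at it
df = pd.read_csv('Shenzhen_2022.csv')
print(df.shape)   # (number of observation rows, number of columns)
print(df.head())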


The notes are modified from the excellent Getting Started with Web Scraping in Python, available freely online.


In-class exercise

Modify the scrape_shenzhen_weather() function so that it scrapes the information contained in the Daily Observations table from Weather Underground pages for weather observations taken at ANY airport station within ANY date range.

[Hint: Add another input parameter, station.]