Python
“The goal is to turn data into information, and information into insight.” - Carly Fiorina
We will use Google Chrome in this section. Download the chromedriver file, unzip it, and move the folder to your working directory. The ChromeDriver version must match the version of Google Chrome you have installed. If you are using the latest Chrome, use version 131.0.6778.108 from the Stable release. Either the 64-bit or the 32-bit version works.
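To confirm that the two versions match, check Chrome's version under Help > About Google Chrome, and query the driver itself. A minimal sketch, assuming the Windows folder layout used later in this section:
# Print the ChromeDriver version (path assumed; adjust to where you unzipped it)
import subprocess
print(subprocess.run(['chromedriver-win64\\chromedriver.exe', '--version'],
                     capture_output=True, text=True).stdout)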
Modules used in this section:
# Load modules
import os
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from datetime import datetime, timedelta
The Jupyter Notebook for this section is available here.
Web scraping refers to the process of extracting data from a website. This can, of course, be done manually: you could go to a website, find the relevant data or information, and enter that information into some data file that you have stored locally. But imagine that you want to pull a very large dataset from hundreds or thousands of individual URLs. In this case, extracting the data manually sounds overwhelming and time-consuming.
In this article, we'll see how to automate this process with Python, using the requests, BeautifulSoup, and Selenium modules. requests is an elegant and simple HTTP library for Python, BeautifulSoup is a Python module for extracting and parsing data from HTML files, and Selenium is used to automate interaction between Python and your web browser.
Activate the cper environment, as we will install requests, BeautifulSoup, and Selenium under it. If you do not have cper yet, follow Creating a Python environment to create one.
To install requests, BeautifulSoup, and Selenium:
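The installation commands themselves are not shown in these notes; with the cper environment activated, a typical installation from PyPI would be the following (beautifulsoup4 is the PyPI/conda package name for BeautifulSoup; conda-forge carries the same names if you prefer conda install):
# Run in a terminal with the cper environment activated
pip install requests beautifulsoup4 selenium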
First, let's try rendering a web page without using Selenium and other tools, and try to parse the HTML. For this demo, we're going to try to scrape the information contained in the Daily Observations table from a Weather Underground page for weather observations taken at Shenzhen Bao'an International Airport (ZGSZ) on December 15, 2024. Here's the URL. You may need a VPN to open the webpage.
Before we try rendering the web page, we want to inspect the HTML elements to determine what class we need to select. Go to the Daily Observations table, right-click, and choose Inspect.
We can see that the entire Daily Observations table is contained under lib-city-history-observation. Let's use that tag to grab just the Daily Observations table elements from the HTML. We'll also use the .prettify() method to print out the selected HTML to make it just a little more readable.
# URL
data_url = 'https://www.wunderground.com/history/daily/cn/shenzhen/ZGSZ/date/2024-12-15'
# Use requests.get() method to load the page
data_page = requests.get(data_url)
# Use BeautifulSoup to parse the page
data_soup = BeautifulSoup(data_page.text, 'html.parser')
# Find information under lib-city-history-observation
soup_container = data_soup.find('lib-city-history-observation')
# Print out the selected HTML
# to make it just a little more readable
print(soup_container.prettify())
Here is the output:
<lib-city-history-observation _ngcontent-sc275="" _nghost-sc236="">
 <div _ngcontent-sc236="">
  <div _ngcontent-sc236="" class="observation-title">
   Daily Observations
  </div>
  No Data Recorded
  <!-- -->
  <!-- -->
 </div>
</lib-city-history-observation>
The HTML returned by BeautifulSoup shows No Data Recorded, but when we go to the Daily Observations table on the web page, the table is obviously populated. What happened?
In this case, the requests.get() method ran faster than the web page could load, so the Daily Observations table wasn't available by the time BeautifulSoup parsed the web page text. Here's where Selenium comes in: it allows the page content to load completely before we try to parse it.
At this point, you also need to install ChromeDriver. Download the chromedriver-win64.zip file (if you use Windows), unzip it, and move the folder to your working directory. You should have a single executable file called chromedriver.exe in a folder called chromedriver-win64 (if you use Windows).
Next, we're going to write a my_rendering function that will run the ChromeDriver. This will open a Chrome window, load the web page fully, return the page HTML, and quit the ChromeDriver, closing the Chrome window. Be sure to update the path in the webdriver.Chrome function to the path where you installed the ChromeDriver; in our case, it's chromedriver-win64\\chromedriver.exe.
# Define a rendering function that runs the ChromeDriver
def my_rendering(url):
    # The path to the ChromeDriver, change if necessary
    chrome_service = webdriver.ChromeService(executable_path='chromedriver-win64\\chromedriver.exe')
    driver = webdriver.Chrome(service=chrome_service)
    # Load the web page from the URL
    driver.get(url)
    # Wait for the web page to load
    time.sleep(3)
    # Get the page source HTML
    render = driver.page_source
    # Quit ChromeDriver
    driver.quit()
    # Return the page source HTML
    return render
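A fixed time.sleep(3) is simple but fragile: on a slow connection, the table may take longer than three seconds to appear. As a sketch of a more robust alternative (not used in the rest of this section), Selenium's explicit waits can block until the container element is actually present in the DOM:
# Sketch: wait for the observation container instead of sleeping a fixed time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def my_rendering_wait(url, timeout=30):
    chrome_service = webdriver.ChromeService(executable_path='chromedriver-win64\\chromedriver.exe')
    driver = webdriver.Chrome(service=chrome_service)
    driver.get(url)
    # Block until <lib-city-history-observation> exists, up to `timeout` seconds
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.TAG_NAME, 'lib-city-history-observation')))
    render = driver.page_source
    driver.quit()
    return render
Note that presence of the container does not guarantee the table rows have finished loading; stricter wait conditions are possible, but this sketch already removes the fixed three-second guess.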
Now, using this function, let's try again to get the Daily Observations table elements from the HTML. The code is the same as above, except that instead of requests.get(), we'll use our new rendering function:
# URL
data_url = 'https://www.wunderground.com/history/daily/cn/shenzhen/ZGSZ/date/2024-12-15'
# Use my_rendering function to load the page
data_page = my_rendering(data_url)
# Use BeautifulSoup to parse the page
data_soup = BeautifulSoup(data_page, 'html.parser')
# Find information under lib-city-history-observation
soup_container = data_soup.find('lib-city-history-observation')
# Print out the selected HTML
# to make it just a little more readable
print(soup_container.prettify())
The returned HTML might look overwhelming, but you can do a quick search for specific keywords in it. For example, we know the table contains parameters of interest such as Temperature, Dew Point, and Humidity.
Looking through the HTML for these keywords, we find that each row (e.g., the 2:00 AM row) is tagged under tr, and that each cell is tagged under td, with each data point (e.g., Temperature) tagged as span. We also find that different fields are associated with different class tags.
# Get the full table
soup_data = soup_container.find_all('tr', class_='mat-row cdk-row ng-star-inserted')
# Tag of each element cell, in the following order:
# Time, Temperature [F], Dew point [F], Humidity [%],
# Wind Direction, Wind Speed [mph], Wind Gust [mph],
# Pressure [in], Precipitation [in], Condition
cell_tag = ['mat-cell cdk-cell cdk-column-dateString mat-column-dateString ng-star-inserted',
            'mat-cell cdk-cell cdk-column-temperature mat-column-temperature ng-star-inserted',
            'mat-cell cdk-cell cdk-column-dewPoint mat-column-dewPoint ng-star-inserted',
            'mat-cell cdk-cell cdk-column-humidity mat-column-humidity ng-star-inserted',
            'mat-cell cdk-cell cdk-column-windcardinal mat-column-windcardinal ng-star-inserted',
            'mat-cell cdk-cell cdk-column-windSpeed mat-column-windSpeed ng-star-inserted',
            'mat-cell cdk-cell cdk-column-windGust mat-column-windGust ng-star-inserted',
            'mat-cell cdk-cell cdk-column-pressure mat-column-pressure ng-star-inserted',
            'mat-cell cdk-cell cdk-column-precipRate mat-column-precipRate ng-star-inserted',
            'mat-cell cdk-cell cdk-column-condition mat-column-condition ng-star-inserted']
# Tag of each element value, in the same order as cell_tag above
value_tag = ['ng-star-inserted',     'wu-value wu-value-to', 'wu-value wu-value-to',
             'wu-value wu-value-to', 'ng-star-inserted',     'wu-value wu-value-to',
             'wu-value wu-value-to', 'wu-value wu-value-to', 'wu-value wu-value-to',
             'ng-star-inserted']
# Loop through rows
for i, dat in enumerate(soup_data):
    # Create an empty list for saving values
    row = []
    # Loop through the fields listed above
    for field in range(len(cell_tag)):
        # Loop through cells
        for j, cell in enumerate(dat.find_all('td', class_=cell_tag[field])):
            # Loop through element values
            for k in cell.find_all('span', class_=value_tag[field]):
                # Get text
                tmp = k.text
                # Remove any extra spaces
                tmp = tmp.strip(' ')
                # Append data
                row.append(tmp)
    # Print the row
    print(i, row)
Now let's modify the above chunk of code to save the data to a local file (Shenzhen_20241215.csv).
# Get the full table
soup_data = soup_container.find_all('tr', class_='mat-row cdk-row ng-star-inserted')
# Tag of each element cell, in the following order:
# Time, Temperature [F], Dew point [F], Humidity [%],
# Wind Direction, Wind Speed [mph], Wind Gust [mph],
# Pressure [in], Precipitation [in], Condition
cell_tag = ['mat-cell cdk-cell cdk-column-dateString mat-column-dateString ng-star-inserted',
            'mat-cell cdk-cell cdk-column-temperature mat-column-temperature ng-star-inserted',
            'mat-cell cdk-cell cdk-column-dewPoint mat-column-dewPoint ng-star-inserted',
            'mat-cell cdk-cell cdk-column-humidity mat-column-humidity ng-star-inserted',
            'mat-cell cdk-cell cdk-column-windcardinal mat-column-windcardinal ng-star-inserted',
            'mat-cell cdk-cell cdk-column-windSpeed mat-column-windSpeed ng-star-inserted',
            'mat-cell cdk-cell cdk-column-windGust mat-column-windGust ng-star-inserted',
            'mat-cell cdk-cell cdk-column-pressure mat-column-pressure ng-star-inserted',
            'mat-cell cdk-cell cdk-column-precipRate mat-column-precipRate ng-star-inserted',
            'mat-cell cdk-cell cdk-column-condition mat-column-condition ng-star-inserted']
# Tag of each element value, in the same order as cell_tag above
value_tag = ['ng-star-inserted',     'wu-value wu-value-to', 'wu-value wu-value-to',
             'wu-value wu-value-to', 'ng-star-inserted',     'wu-value wu-value-to',
             'wu-value wu-value-to', 'wu-value wu-value-to', 'wu-value wu-value-to',
             'ng-star-inserted']
# Open a file to save data
outfile = 'Shenzhen_20241215.csv'
with open(outfile, 'w') as f:
    # Write the header
    f.write('Date, Time, Temperature [F], Dew point [F], Humidity [%], '
            'Wind Direction, Wind Speed [mph], Wind Gust [mph], '
            'Pressure [in], Precipitation [in], Condition\n')
    # Loop through rows
    for i, dat in enumerate(soup_data):
        # Create an empty list for saving values
        row = []
        # Loop through the fields listed above
        for field in range(len(cell_tag)):
            # Loop through cells
            for j, cell in enumerate(dat.find_all('td', class_=cell_tag[field])):
                # Loop through element values
                for k in cell.find_all('span', class_=value_tag[field]):
                    # Get text
                    tmp = k.text
                    # Remove any extra spaces
                    tmp = tmp.strip(' ')
                    # Append data
                    row.append(tmp)
        # Print the row
        print(i, row)
        # Write the date of the recorded data into the file
        f.write('2024-12-15,')
        # Write data into the file
        f.write(','.join(row[:10]))
        # New line, to append more rows to the same file later on
        f.write('\n')
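Writing comma-joined strings works here because none of the scraped values contain commas; if a field ever did (e.g., a longer Condition string), it would silently shift the columns. A minimal sketch using the standard csv module, which handles quoting automatically (reusing soup_data, cell_tag, and value_tag from above):
import csv

# Sketch: same loop as above, but rows written via csv.writer
with open('Shenzhen_20241215.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Header: Date plus the ten scraped fields
    writer.writerow(['Date', 'Time', 'Temperature [F]', 'Dew point [F]',
                     'Humidity [%]', 'Wind Direction', 'Wind Speed [mph]',
                     'Wind Gust [mph]', 'Pressure [in]', 'Precipitation [in]',
                     'Condition'])
    for dat in soup_data:
        row = []
        for field in range(len(cell_tag)):
            for cell in dat.find_all('td', class_=cell_tag[field]):
                for k in cell.find_all('span', class_=value_tag[field]):
                    row.append(k.text.strip(' '))
        # csv.writer quotes any value that contains a comma
        writer.writerow(['2024-12-15'] + row[:10])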
Let's say we want to scrape the information contained in the Daily Observations table from Weather Underground pages for weather observations taken on any date. To do this, we need to convert the code above into a function that can loop through multiple web pages for a range of input dates:
# Define the scraper function that
# gets data within a date range
def scrape_shenzhen_weather(start_date, end_date):
    # Tag of each element cell, in the following order:
    # Time, Temperature [F], Dew point [F], Humidity [%],
    # Wind Direction, Wind Speed [mph], Wind Gust [mph],
    # Pressure [in], Precipitation [in], Condition
    cell_tag = ['mat-cell cdk-cell cdk-column-dateString mat-column-dateString ng-star-inserted',
                'mat-cell cdk-cell cdk-column-temperature mat-column-temperature ng-star-inserted',
                'mat-cell cdk-cell cdk-column-dewPoint mat-column-dewPoint ng-star-inserted',
                'mat-cell cdk-cell cdk-column-humidity mat-column-humidity ng-star-inserted',
                'mat-cell cdk-cell cdk-column-windcardinal mat-column-windcardinal ng-star-inserted',
                'mat-cell cdk-cell cdk-column-windSpeed mat-column-windSpeed ng-star-inserted',
                'mat-cell cdk-cell cdk-column-windGust mat-column-windGust ng-star-inserted',
                'mat-cell cdk-cell cdk-column-pressure mat-column-pressure ng-star-inserted',
                'mat-cell cdk-cell cdk-column-precipRate mat-column-precipRate ng-star-inserted',
                'mat-cell cdk-cell cdk-column-condition mat-column-condition ng-star-inserted']
    # Tag of each element value, in the same order as cell_tag above
    value_tag = ['ng-star-inserted',     'wu-value wu-value-to', 'wu-value wu-value-to',
                 'wu-value wu-value-to', 'ng-star-inserted',     'wu-value wu-value-to',
                 'wu-value wu-value-to', 'wu-value wu-value-to', 'wu-value wu-value-to',
                 'ng-star-inserted']
    # Data URL that can be formatted to find the webpage for any observation date
    data_url = 'http://www.wunderground.com/history/daily/cn/shenzhen/ZGSZ/date/{}-{}-{}'
    # The csv file name, formatted with the year of the start date
    outfile = 'Shenzhen_{}.csv'.format(start_date.year)
    # If the csv file does not exist, write; if it does exist, append
    if not os.path.exists(outfile):
        mode = 'w'
    else:
        mode = 'a'
    with open(outfile, mode) as f:
        # Write the header only for a new file
        if mode == 'w':
            f.write('Date, Time, Temperature [F], Dew point [F], Humidity [%], '
                    'Wind Direction, Wind Speed [mph], Wind Gust [mph], '
                    'Pressure [in], Precipitation [in], Condition\n')
        # The while loop continues until it reaches the given end date
        while start_date != end_date:
            # Format the search URL for the given observation date
            format_data_url = data_url.format(start_date.year, start_date.month, start_date.day)
            # Use my_rendering function to load the page
            data_page = my_rendering(format_data_url)
            # Use BeautifulSoup to parse the page
            data_soup = BeautifulSoup(data_page, 'html.parser')
            # Find information under lib-city-history-observation
            soup_container = data_soup.find('lib-city-history-observation')
            # Get the full table
            soup_data = soup_container.find_all('tr', class_='mat-row cdk-row ng-star-inserted')
            # Loop through rows
            for i, dat in enumerate(soup_data):
                # Create an empty list for saving values
                row = []
                # Loop through the fields listed above
                for field in range(len(cell_tag)):
                    # Loop through cells
                    for j, d in enumerate(dat.find_all('td', class_=cell_tag[field])):
                        # Loop through element values
                        for k in d.find_all('span', class_=value_tag[field]):
                            # Get text
                            tmp = k.text
                            # Remove any extra spaces
                            tmp = tmp.strip(' ')
                            # Append data
                            row.append(tmp)
                # Write the date of the recorded data into the file
                f.write('{}-{}-{},'.format(start_date.year, start_date.month, start_date.day))
                # Write data into the file
                f.write(','.join(row[:10]))
                # New line, to append more rows to the same file later on
                f.write('\n')
            # Go to the next date, i.e., the next URL
            start_date += timedelta(days=1)
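One detail worth noting: the condition while start_date != end_date stops before the end date itself, so the function scrapes from start_date up to but not including end_date (and would loop forever if end_date were earlier than start_date). If you want the end date included, a safer condition is:
# Sketch: an inclusive, order-safe loop condition
while start_date <= end_date:
    # ... same scraping body as above ...
    start_date += timedelta(days=1)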
Now that we have our new scrape_shenzhen_weather function defined, let's use it to scrape data between 2024-12-01 and 2024-12-05. To do this, we simply need to define the start_date and end_date parameters, and run the function.
# Set start and end dates
start_date = datetime(year=2024, month=12, day=1)
end_date = datetime(year=2024, month=12, day=5)
# Call the scraper function
scrape_shenzhen_weather(start_date, end_date)
You should end up with a .csv file (Shenzhen_2024.csv) containing dated rows. Check it.
The notes are modified from the excellent Getting Started with Web Scraping in Python, available freely online.
Modify the scrape_shenzhen_weather() function so that it scrapes the information contained in the Daily Observations table from Weather Underground pages for weather observations taken at ANY airport station within ANY date range. [Hint: Add another input parameter station.]
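As a starting point (not a full solution), the changes are confined to the function signature, the URL template, and the output file name. The function name, the station path format, and the file-name pattern below are illustrative choices, not part of the original notes:
# Sketch: parameterize the station path (e.g., station='cn/shenzhen/ZGSZ')
def scrape_station_weather(station, start_date, end_date):
    # The country/city/code path segment is now an input
    data_url = 'http://www.wunderground.com/history/daily/' + station + '/date/{}-{}-{}'
    # Name the output file after the station code and year
    outfile = '{}_{}.csv'.format(station.split('/')[-1], start_date.year)
    # ... the rest of the body is unchanged from scrape_shenzhen_weather ...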