Python: pandas
“The exponential growth of pandas is due to the fact that it just works.” - Ryan Abernathey
pandas
Pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of tabular data, i.e., data that can go into a table. In other words, if you can imagine the data in an Excel spreadsheet, then pandas is the tool for the job.
As we covered in Section 03, Python is the fastest growing and most widely used programming language in the world. This growth is driven by data science packages such as numpy, matplotlib, and especially pandas.
Some highlights of pandas (from the pandas website):
A fast and efficient DataFrame object for data manipulation with integrated indexing;
Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
Flexible reshaping and pivoting of data sets;
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
Columns can be inserted and deleted from data structures for size mutability;
Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
High performance merging and joining of data sets;
Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
Time series-functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging. Even create domain-specific time offsets and join time series without losing data;
Highly optimized for performance, with critical code paths written in Cython or C.
Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
Pandas basics
In this section, we will go over the basic capabilities of pandas.
Pandas is a very deep library; please dig into the pandas documentation for more advanced usage.
Import pandas
pandas should be installed along with Anaconda. To import it, run:
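# Import pandas with its conventional alias
import pandas as pd
# numpy is imported as well, since the np alias is used in later examples
import numpy as np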
Pandas data structures: series
A series represents a one-dimensional array of data. The main difference between a series and a numpy array is that a series has an index. The index contains the labels that we use to access the data.
There are many ways to create a Series. We will just show a few.
Let’s create a pandas series to store AQI (2023-10-25 10:00) at Beijing, Shanghai, Guangzhou, and Shenzhen with city as the index:
# AQI of 4 cities on 10/25/2023
city = ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen']
aqi = [172, 151, 157, 97]
# Create a pandas series
my_data = pd.Series(aqi, index=city)
# Show the series using "city" as the index
my_data
You can show the series or even plot it:
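# Plot the series as vertical bars
my_data.plot(kind='bar')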
Here we use the pandas built-in plot function to plot the series as vertical bars (kind='bar'). Feel free to try other kinds (e.g., barh, hist, box, area, pie, etc.)
Arithmetic operations and most numpy functions can be applied to a series. An important point is that the series keeps its index during such operations. For example:
# Apply operations and functions
# The values make no sense, just for demonstration
np.log(my_data) / my_data**2
Here are some useful ways to access a series and its information:
# Access the index object
my_data.index
# Access the values
my_data.values
# Get values using the index via the .loc attribute
my_data.loc['Shenzhen']
my_data.loc[['Shenzhen', 'Beijing']]
# Or by row position via .iloc
my_data.iloc[0]
# .iloc is left inclusive but right exclusive
my_data.iloc[0:2]
Pandas data structures: dataframe
Series are limited to a single “column”. A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of series that share the same index. It’s a lot like a table in a spreadsheet.
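As a quick sketch of that idea, a DataFrame can be built from several series that share one index; the second series below uses made-up AQI values purely for illustration:
# Build a dataframe from two series that share the "city" index
aqi_10am = pd.Series([172, 151, 157, 97], index=city)
aqi_11am = pd.Series([165, 148, 160, 102], index=city)  # hypothetical values
pd.DataFrame({'AQI_10': aqi_10am, 'AQI_11': aqi_11am})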
In this part, let’s use the hourly meteorological data observed in 2020 at Xichong, Shenzhen. Download the Met_2020.csv file, and move it to your working directory.
Read a csv file
To read a csv file, use the read_csv() function from the pandas module:
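# Read the csv file and save it as a dataframe named met
met = pd.read_csv('Met_2020.csv')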
Here pandas reads the Met_2020.csv file and saves the object as a dataframe named met. Here are some functions to check a dataframe:
# Check information
met.info()
# Show column names
met.columns
# Show header
met.head()
met.head(10)
# Show tail
met.tail()
met.tail(10)
# Basic statistics
met.describe()
# Get access to a specific column
met['P']
Similar to a series, a dataframe can also be handled with loc and iloc:
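A minimal sketch, using the default integer row index and the T and P columns:
# Select the first five rows by position (all columns)
met.iloc[0:5]
# Select rows and columns by label; unlike .iloc, .loc is inclusive on both ends
met.loc[0:4, ['T', 'P']]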
You may notice that some rows contain -999.0, which stands for NaN or missing data. We can select those rows:
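For instance, using the T column (any other column works the same way):
# Select rows where temperature is flagged as missing
met.loc[ met['T'] == -999.0 ]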
Or do more advanced filtering:
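A minimal sketch combining two conditions (wrap each condition in parentheses and join them with & for “and” or | for “or”):
# Rows with valid temperature and wind speed higher than 4.0
met.loc[ (met['T'] != -999.0) & (met['WS'] > 4.0) ]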
To sort by one or more columns:
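A sketch using the T and WS columns:
# Sort by one column (ascending by default)
met.sort_values('T')
# Sort by two columns, in descending order
met.sort_values(['T', 'WS'], ascending=False)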
To add one column:
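A sketch that derives a new column from an existing one (the T_K name is made up here, and it assumes T is recorded in degrees Celsius):
# Add a new column: temperature in Kelvin, assuming 'T' is in degrees Celsius
met['T_K'] = met['T'] + 273.15
met.head()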
We can save a dataframe to csv, text, or xlsx files:
# Only use a few columns
met2 = met[['T', 'P', 'WS']]
# Check head
met2.head()
# Save to a csv file with index
met2.to_csv('Met_2020_new1.csv')
# Save to a csv file without index
met2.to_csv('Met_2020_new2.csv', index=False)
# Save to an xlsx file
met2.to_excel('Met_2020_new3.xlsx', index=False)
# Save to a txt file, separated by tabs
met2.to_csv('Met_2020_new4.txt', index=False, sep='\t')
Let’s first create some categorical data for wind direction and remove missing data:
# Make a new column
met['WD_new'] = 'N'
# Assign wind directions
met.loc[ (met['WD'] <= 135 ) & (met['WD'] > 45 ), ['WD_new'] ] = 'E'
met.loc[ (met['WD'] <= 225 ) & (met['WD'] > 135 ), ['WD_new'] ] = 'S'
met.loc[ (met['WD'] <= 315 ) & (met['WD'] > 225 ), ['WD_new'] ] = 'W'
# Remove missing data
met = met.loc[ (met['WS']!=-999) & (met['WD']!=-999) &
(met['T']!=-999) & (met['RH']!=-999) &
(met['P']!=-999) & (met['Precip']!=-999) ]
met.head()
Now let’s group data by WD_new:
# Group by WD_new; numeric_only=True averages only the numeric columns
met.groupby(['WD_new']).mean(numeric_only=True)
# Group by followed by sorting
met.groupby(['WD_new']).mean(numeric_only=True).sort_values('T', ascending=True)
We can also create a new column Hour to save observation hours:
# Get the observation hour as a new column
met['Hour'] = pd.to_datetime(met['Time']).dt.hour
met.head()
# Group by the observation hour, get hourly means, then plot
met.groupby(['Hour']).mean(numeric_only=True)['T'].plot()
# Group by the observation hour then by wind direction
met.groupby(['Hour', 'WD_new']).mean(numeric_only=True)
Go over the notes and make sure you understand the scripts.
read_csv() can also read a txt file. Download this file and take a look at it. Then read it with the option delimiter='\t'.
Use read_excel() to read this file.
Download the daily ozone data file and read it with pandas.
Get the lowest temperature.
Count the total number of days with ozone higher than 40.
Get the date that has the highest ozone.
Plot a histogram of ozone.
Get the mean pressure in May.
Get the mean pressure when the temperature is lower than 40 and wind speed is higher than 4.0.
Group the data by month, plot a bar plot to show mean ozone in different months.