Python: pandas

“The exponential growth of
pandas is due to the fact that it just works.” - Ryan Abernathey
Pandas is an open-source
library providing high-performance, easy-to-use data structures and data
analysis tools. Pandas is particularly suited to the
analysis of tabular data, i.e. data that can go into a
table. In other words, if you can imagine the data in an Excel
spreadsheet, then pandas is the tool for the job.
As we covered in Section
03, Python is the fastest growing and most widely used programming
language in the world. This growth is driven by data science
packages such as numpy, matplotlib, and
especially pandas.
Some highlights of pandas (from pandas
website):
A fast and efficient DataFrame object for data manipulation with integrated indexing;
Tools for reading and writing data between
in-memory data structures and different formats: CSV and
text files, Microsoft Excel, SQL databases,
and the fast HDF5 format;
Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
Flexible reshaping and pivoting of data sets;
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
Columns can be inserted and deleted from data structures for size mutability;
Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
High performance merging and joining of data sets;
Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
Time series functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging. Even create domain-specific time offsets and join time series without losing data;
Highly optimized for performance, with critical
code paths written in Cython or C.
Python with pandas is in use in a wide
variety of academic and commercial domains, including
Finance, Neuroscience, Economics, Statistics, Advertising, Web
Analytics, and more.
Pandas basics

In this section, we will go over the basic
capabilities of pandas.
Pandas is a very deep library; please
dig into the pandas
documentation for more advanced usage.
pandas should be installed along with
Anaconda. To import it, run:
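A minimal sketch of the imports, using the pd and np aliases that the later snippets rely on:

# Standard import conventions; pd and np are the names used in the snippets below
import pandas as pd
import numpy as np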
Pandas data structures: series

A series represents a one-dimensional array
of data. The main difference between a series and a numpy
array is that a series has an index. The index contains
the labels that we use to access the data.
There are many ways to create a Series. We will just show a few.
Let’s create a pandas series to store AQI
(2023-10-25 10:00) at Beijing, Shanghai, Guangzhou, and
Shenzhen with city as the index:
# AQI of 4 cities on 10/25/2023
city = ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen']
aqi = [172, 151, 157, 97]
# Create a pandas series
my_data = pd.Series(aqi, index=city)
# Show the series using "city" as the index
my_data

You can show the series or even plot it:
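A minimal sketch of the plot call, assuming matplotlib is available as the plotting backend:

# Plot the series as vertical bars
my_data.plot(kind='bar')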
Here we use the pandas built-in
function plot to plot the series as vertical bars
(kind='bar'). Feel free to try other kinds (e.g.,
barh, hist, box,
area, pie, etc.)
Arithmetic operations and most numpy functions can be
applied to a series. An important point is that a series keeps its
index during such operations. For example:
# Apply operations and functions
# The values make no sense, just for demonstration
np.log(my_data) / my_data**2

Here are some useful functions to access a series and its information:
# Get access the index object
my_data.index
# Get access the values
my_data.values
# Get values using the index via .loc attribute
my_data.loc['Shenzhen']
my_data.loc[['Shenzhen', 'Beijing']]
# Or by row position via .iloc
my_data.iloc[0]
# .iloc is left inclusive but right exclusive
my_data.iloc[0:2]

Pandas data structures: dataframe

Series are limited to a single “column”. A more useful Pandas data
structure is the DataFrame. A DataFrame is basically a
bunch of series that share the same index. It’s a lot like a table in a
spreadsheet.
In this part, let’s use the hourly
meteorological data observed in 2020 at Xichong, Shenzhen. Download
the Met_2020.csv file, and move it to your working
directory.
To read a csv file, use the read_csv()
function from the pandas module:
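A minimal sketch of the read call, assuming Met_2020.csv sits in the working directory:

# Read the csv file into a DataFrame named met
met = pd.read_csv('Met_2020.csv')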
Here pandas reads the Met_2020.csv file and
saves the result as a DataFrame named met. Here are some
functions to check a DataFrame:
# Check information
met.info()
# Show column names
met.columns
# Show header
met.head()
met.head(10)
# Show tail
met.tail()
met.tail(10)
# Basic statistics
met.describe()
# Get access to a specific column
met['P']

Similar to series, a dataframe can also be indexed with
loc and iloc:
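A minimal sketch of such indexing, assuming the default integer index and using columns that appear later in this section (T, P, WS):

# Select the first rows of a few columns by label via .loc
met.loc[0:5, ['T', 'P', 'WS']]
# Select rows and columns by position via .iloc
met.iloc[0:5, 0:3]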
You may notice that some rows contain -999.0, which
indicates missing data. We can select those
rows:
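One possible way, which flags any row holding -999.0 in any column:

# Select rows that contain the -999.0 missing-data flag in any column
met.loc[(met == -999.0).any(axis=1)]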
Or do more advanced filtering:
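For instance, conditions can be combined with & and |; the thresholds below are only illustrative:

# Rows with temperature above 25 and wind speed above 4.0 (illustrative thresholds)
met.loc[(met['T'] > 25) & (met['WS'] > 4.0)]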
To sort a column or more columns:
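A minimal sketch with sort_values():

# Sort by one column
met.sort_values('T')
# Sort by two columns, in descending order
met.sort_values(['T', 'WS'], ascending=False)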
To add one column:
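For example, a new column can be assigned directly; the unit conversion below assumes T is in degrees Celsius:

# Add a column with temperature in Kelvin (assumes T is in Celsius)
met['T_K'] = met['T'] + 273.15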
We can save a dataframe to csv, text, or
xlsx files:
# Only use a few columns
met2 = met[['T', 'P', 'WS']]
# Check head
met2.head()
# Save to a csv file with index
met2.to_csv('Met_2020_new1.csv')
# Save to a csv file without index
met2.to_csv('Met_2020_new2.csv', index=False)
# Save to a xlsx file
met2.to_excel('Met_2020_new3.xlsx', index=False)
# Save to txt file, separated by tab
met2.to_csv('Met_2020_new4.txt', index=False, sep='\t')

Let’s first create some categorical data for wind direction and remove missing data:
# Make a new column
met['WD_new'] = 'N'
# Assign wind directions
met.loc[ (met['WD'] <= 135 ) & (met['WD'] > 45 ), ['WD_new'] ] = 'E'
met.loc[ (met['WD'] <= 225 ) & (met['WD'] > 135 ), ['WD_new'] ] = 'S'
met.loc[ (met['WD'] <= 315 ) & (met['WD'] > 225 ), ['WD_new'] ] = 'W'
# Remove missing data
met = met.loc[ (met['WS']!=-999) & (met['WD']!=-999) &
(met['T']!=-999) & (met['RH']!=-999) &
(met['P']!=-999) & (met['Precip']!=-999) ]
met.head()

Now let’s group data by WD_new:
# Group by WD_new, averaging only the numeric columns
met.groupby(['WD_new']).mean(numeric_only=True)
# Group by followed by sorting
met.groupby(['WD_new']).mean(numeric_only=True).sort_values('T', ascending=True)

We can also create a new column Hour to save observation
hours:
# Get the observation hour as a new column
met['Hour'] = pd.to_datetime(met['Time']).dt.hour
met.head()
# Group by the observation hour, get hourly means, then plot
met.groupby(['Hour']).mean(numeric_only=True)['T'].plot()
# Group by the observation hour then by wind direction
met.groupby(['Hour', 'WD_new']).mean(numeric_only=True)

Go over the notes and make sure you understand the scripts.
read_csv() can also read txt files. Download
this
file and take a look at it. Then read it with the option
delimiter='\t'.
Use read_excel() to read this
file.
Download the daily
ozone data file and read it with pandas.
Get the lowest temperature.
Count the total number of days with ozone higher than
40.
Get the date that has the highest ozone.
Plot a histogram of ozone.
Get the mean pressure in May.
Get the mean pressure when the temperature is lower than
40 and wind speed is higher than 4.0.
Group the data by month, then make a bar plot to show the mean ozone in different months.