SciPy
“Since its initial release in 2001,
SciPy
has become a de facto standard for leveraging scientific algorithms inPython
.” - Virtanen et al. (2020)
SciPy
SciPy
is a library of numerical routines for the
Python
programming language that provides fundamental
building blocks for modeling and solving scientific problems.
SciPy
includes algorithms for optimization, integration,
interpolation, eigenvalue problems, algebraic equations, differential
equations and many other classes of problems; it also provides
specialized data structures, such as sparse matrices and
k-dimensional trees.
The SciPy
library depends on NumPy
, which
provides convenient and fast N-dimensional array manipulation. Check SciPy 1.0:
fundamental algorithms for scientific computing in Python for an
overview of the capabilities and development practices of
SciPy
.
When you want to do scientific work in Python
, the first
library you can turn to is SciPy
. As you’ll see in this
Section, SciPy
is not just a library, but a whole
ecosystem of libraries that work together to help you
accomplish complicated scientific tasks quickly and reliably.
SciPy
should be installed along with
Anaconda
. If not, in
Anaconda Powershell Prompt
, install it:
To make sure SciPy
is installed:
# Import modules
import numpy as np
import matplotlib.pyplot as plt
# Show plots in the notebook
%matplotlib inline
# Import SciPy
import scipy
# Print the location of the file from where `SciPy` is loaded
print(scipy.__file__)
Now you’ve imported SciPy
and printed the location of
the file from where SciPy
is loaded. Your computer will
probably show a different location. SciPy
should be ready
for use.
SciPy
As non-professional programmers, scientists often tend to
re-invent the wheel, which leads to buggy, non-optimal,
difficult-to-share and unmaintainable code. By contrast,
SciPy
’s routines are optimized and tested, and should
therefore be used when possible.
SciPy
is composed of task-specific sub-modules,
including:
Sub-module | Usage |
---|---|
scipy.cluster |
Vector quantization / Kmeans |
scipy.constants |
Physical and mathematical constants |
scipy.fftpack |
Fourier transform |
scipy.integrate |
Integration routines |
scipy.interpolate |
Interpolation |
scipy.io |
Data input and output |
scipy.linalg |
Linear algebra routines |
scipy.ndimage |
n-dimensional image package |
scipy.odr |
Orthogonal distance regression |
scipy.optimize |
Optimization |
scipy.signal |
Signal processing |
scipy.sparse |
Sparse matrices |
scipy.spatial |
Spatial data structures and algorithms |
scipy.special |
Special mathematical functions |
scipy.stats |
Statistics |
This section is far from an introduction to numerical computing. As
enumerating the different sub-modules and functions in
SciPy
would be very boring, we concentrate instead on a few
problems you may encounter to give a general idea of how to use
SciPy
for scientific computing.
scipy.constants
SciPy
constants package provides a wide range of
constants, which are used in the general scientific area. We have to
import the required constant and use them as per the requirement.
Import mathematical constant, for example:
Import physical constant, for example:
In addition to the above variables, scipy.constants
also
contains the 2018
CODATA recommended values (CODATA2018) database containing more
physical constants. Call
scipy.constants.physical_constants[NAME]
to return a
dictionary of physical constants, of the format
(value, unit, uncertainty)
. For example:
scipy.constants
also contains SI units, for example:
# SI prefixes
scipy.constants.micro
# Mass, in [kg]
scipy.constants.long_ton
# Speed, in [m/s]
scipy.constants.mph
# Pressure, in [Pa]
scipy.constants.atmosphere
Remembering all of these are a bit tough. The easy way to get which
key is for which function is looking up things in the SciPy
Constants page.
scipy.integrate
When a function cannot be integrated analytically, or is very
difficult to integrate analytically, one generally turns to numerical
integration methods. SciPy
has a number of routines for
performing numerical integration. Most of them are found in the same
scipy.integrate
library. For example:
from numpy import exp
from scipy import integrate
# Define a function
f = lambda x:exp(-x**2)
# Single integral of f from 0 to 1
res = integrate.quad(f, 0, 1)
print(res)
Here we use a lambda
expression and then call the quadrature method (for single
integrals) on that function. The quad
function returns the
two values, in which the first number is the value of integral and the
second value is the estimate of the absolute error in the value of
integral. For multiple integrals or solving initial value problems for
ODE systems, check Integration
and ODEs.
scipy.interpolate
Interpolation is the process of finding a value between two points on a line or a curve. This tool, interpolation, is not only useful in statistics, but is also useful in science, business, or when there is a need to predict values that fall within two existing data points.
# Make up some data points
x = np.linspace(0, 4, 12)
y = np.cos(x**2/3+4)
# Plot data
plt.plot(x, y,'ko')
Suppose we want to find the value of y
when
x=3.12
, you can use interp1d()
as:
from scipy.interpolate import interp1d
# Using interp1d function
f1 = interp1d(x, y, kind = 'linear')
f2 = interp1d(x, y, kind = 'cubic')
# Now make x finer
xnew = np.linspace(0, 4, 30)
# Plot interpolated values
plt.plot(x, y, 'ko', xnew, f1(xnew), '-', xnew, f2(xnew), '--')
# Add legend
plt.legend(['data', 'linear', 'cubic'], loc = 'best')
# Show data at a specific location
print(f1(3.12), f2(3.12))
The interp1d()
function in the
scipy.interpolate
is a convenient method to create a
function based on fixed data points, which can be evaluated anywhere
within the domain defined by the given data using linear interpolation.
Using the interp1d()
function, we created two functions
f1
and f2
. These functions, for a given input
x
returns y
. The third variable kind
represents the type of the interpolation technique. linear
,
nearest
, zero
, slinear
,
quadratic
, cubic
are a few techniques of
interpolation.
One-dimensional smoothing spline fits a given set of data points. The
univariatespline
function in scipy.interpolate
is a convenient method to
create a function, based on fixed data points class.
# Make up some data points
x = np.linspace(-3, 3, 50)
y = np.exp(-x**2) + 0.1 * np.random.randn(50)
# Plot data
plt.plot(x, y, 'ko')
Suppose we want to find the value of y
when
x=1.23
, you can use UnivariateSpline()
as:
from scipy.interpolate import UnivariateSpline
# Using UnivariateSpline function
spl = UnivariateSpline(x, y, k=3)
spl.set_smoothing_factor(0.5)
# Now make x finer
xnew = np.linspace(-3, 3, 1000)
# Plot interpolated values
plt.plot(x, y, 'ko')
plt.plot(xnew, spl(xnew), 'b', lw = 3)
# Show data at a specific location
spl(1.23)
scipy.optimize
Suppose we have data on a sine wave, with some noise:
# Make up some data points
x = np.linspace(-5, 5, num=50)
y = 2.9 * np.sin(1.5 * x) + np.random.normal(size=50)
# Plot data
plt.plot(x, y, 'ko')
If we know that the data lies on a sine wave, but not the amplitudes or the period, we can find those by least squares curve fitting. First we have to define the test function to fit, here a sine with unknown amplitude and period:
We then use scipy.optimize.curve_fit
to find
a
and b
:
Let’s define the following function and plot it:
# Define a function
def g(x):
return x**2 + 20*np.sin(x)
# Set the range
x = np.arange(-10, 10, 0.1)
# Plot the function
plt.plot(x, g(x))
This function has a global minimum around -1.5
and local
minimums around -7.1
and 4.3
. Searching for
minimum can be done with scipy.optimize.minimize
, given a
starting point x0
, it returns the location of the minimum
that it has found.
scipy.stats
The module scipy.stats
contains statistical tools and
probabilistic descriptions of random processes.
For example, we can fit a sample to a distribution:
# Generate a sample
sample = np.random.normal(size=1000)
from scipy import stats
# Fit a normal distribution
stats.norm.fit(sample)
A statistical test is a decision indicator. For instance, if we have two sets of observations, that we assume are generated from normal distributions, we can use a t-test to decide whether the means of two sets of observations are significantly different:
# Sample 1
sample1 = np.random.normal(0, 1, size=20)
# Sample 2
sample2 = np.random.normal(1, 1, size=15)
# Independent t-test
stats.ttest_ind(sample1, sample2)
The chapter on
statistics introduces much more elaborate tools for statistical
testing and statistical data loading and visualization outside of
SciPy
. You are also welcome to check ESE335
Environmental Data Analysis for conducting statistical tests with
the R
language.
scipy.cluster
Clustering is a popular technique to categorize data by associating
it into groups. The SciPy
library includes an
implementation of the k-means clustering algorithm as well as
several hierarchical clustering algorithms. Here we will be
using the k-means algorithm in scipy.cluster.vq
, where
vq
stands for vector quantization.
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq
# Data generation, 200 rows, 2 cols
data = vstack((rand(100,2) + array([.8,.8]), rand(100,2)))
# Show data
plt.plot(data[:,0],data[:,1],'k+')
# Computing K-Means with K = 2 (2 clusters)
#centroids, mean_dist = kmeans(data,2)
# Assign each sample to a cluster
#clusters, dist = vq(data,centroids)
# Plot clusters
#plt.plot(data[clusters==0][:,0], data[clusters==0][:,1],'bo')
#plt.plot(data[clusters==1][:,0], data[clusters==1][:,1],'rd')
# Plot cluster centroids
#plt.plot(centroids[:,0],centroids[:,1],'gX',markersize=10)
# Plot raw data
#plt.plot(data[:,0],data[:,1],'w+')
Go through the notes, make sure you understand the scripts.
The temperature extremes in Shenzhen for each month, starting in January, are given by (in degrees Celcius):
max: 19.8, 20.2, 22.7, 26.3, 29.5, 31.1, 32.3, 32.3, 31.3, 29.2, 25.4, 21.5
min: 12.5, 13.8, 16.5, 20.3, 23.6, 25.6, 26.3, 26.1, 25.0, 22.5, 18.2, 13.8
2.1 Plot these temperature extremes.
2.2 Define a function that can describe min and max
temperatures.
[Hint]: this function has to have a period of 1 year.
2.3 Fit this function to the data with
scipy.optimize.curve_fit
.
2.4 Plot the result. Is the fit reasonable? If not,
why?
2.5 Is the time offset for min and max temperatures the
same within the fit accuracy?