0 1
1 2
2 3
3 4
dtype: int64
0 C
1 B
2 A
dtype: object
0 True
dtype: bool
Lecture 07
pandas is an implementation of data frames in Python - it takes much of its inspiration from R and NumPy.
pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.
Key features:
DataFrame and Series (column) object classes
Reading and writing tabular data
Data munging (filtering, grouping, summarizing, joining, etc.)
Data reshaping
The columns of a DataFrame are constructed using the Series
class - these are a 1d array like object containing values of the same type (similar to an numpy array).
Once constructed the components of a series can be accessed via array
and index
attributes.
An index (row names) can also be explicitly provided when constructing a Series,
Series objects are compatible with NumPy like functions (i.e. vectorized)
Series can be indexed in the same was as NumPy arrays with the addition of being able to use index label(s) when selecting elements.
When performing operations with multiple series, generally pandas will attempt to align the operation by the index values,
Series can also be constructed from dictionaries, in which case the keys are used as the index,
Pandas encodes missing values using NaN (mostly),
np.isna()
?none
In some cases none
can also be used as a missing value, for example:
This can have a side effect of changing the dtype of the series.
If instead of using base dtypes we use Pandas’ built-in dtypes we get “native” support for missing values,
Series containing strings can their strings accessed via the str
attribute,
Just like R a DataFrame is a collection of vectors (Series) with a common length (and a common index)
Column dtypes can be heterogeneous
Columns have names stored in the columns
index.
It can be useful to think of a dictionary of Series objects where the keys are the column names.
<class 'pandas.core.frame.DataFrame'>
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
We just saw how to read a DataFrame via read_csv()
, DataFrames
can also be constructed via DataFrame()
, in general this is done using a dictionary of columns / Series
:
2d ndarrays can also be used to construct a DataFrame
- generally it is a good idea to provide column and row names (indexes)
20
(5, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 5 non-null int64
1 weight 5 non-null float64
2 height 5 non-null float64
3 date 5 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(1)
memory usage: 292.0 bytes
id int64
weight float64
height float64
date datetime64[ns]
dtype: object
Index(['id', 'weight', 'height', 'date'], dtype='object')
RangeIndex(start=0, stop=5, step=1)
[RangeIndex(start=0, stop=5, step=1), Index(['id', 'weight', 'height', 'date'], dtype='object')]
Columns can be selected by name or via .
accessor,
a single slice is assumed to refer to the rows
id 541
weight 33.469345
height 195.730662
date 2022-02-02 00:00:00
Name: 1, dtype: object
id weight height date
1 541 33.469345 195.730662 2022-02-02
id weight height date
0 482 64.162174 169.468134 2022-02-01
1 541 33.469345 195.730662 2022-02-02
weight height
1 33.469345 195.730662
2 93.782322 147.946539
id date
0 482 2022-02-01
1 541 2022-02-02
2 213 2022-02-03
id weight
0 482 64.162174
1 541 33.469345
2 213 93.782322
id weight height date
1 541 33.469345 195.730662 2022-02-02
3 523 48.479028 164.486509 2022-02-04
In general most pandas operations will generate a new object but some will return views, mostly the later occurs with subsetting.
When constructing a DataFrame we can specify the indexes for both the rows (index
) and columns (columns
),
A B C
0 -0.875270 1.313213 -0.528093
1 1.136586 -0.645874 -0.945650
2 0.616353 0.541941 -0.273260
3 0.322153 -0.424912 -0.195107
4 1.491297 -0.304639 1.245868
Index(['A', 'B', 'C'], dtype='object')
RangeIndex(start=0, stop=5, step=1)
pandas’ Index
class and its subclasses provide the infrastructure necessary for lookups, data alignment, and other related tasks. You can think of them as being an immutable multiset (i.e. duplicate values are allowed).
Index objects can have names which are shown when printing the DataFrame or Index,
df = pd.DataFrame(
np.random.randn(3, 3),
index=pd.Index(['x','y','z'], name="rows"),
columns=pd.Index(['A', 'B', 'C'], name="cols")
)
df
cols A B C
rows
x 2.627667 -1.008846 -2.032781
y -1.360111 0.195396 -0.230822
z 0.154858 -0.921222 -0.204913
Index(['A', 'B', 'C'], dtype='object', name='cols')
Index(['x', 'y', 'z'], dtype='object', name='rows')
It is possible for an index to contain missing values (e.g. np.nan
) but this is generally a bad idea and should be avoided.
Existing columns can be made into an index via set_index()
and removed via reset_index()
,
New index values can be attached to a DataFrame via reindex()
,
These are a hierarchical analog of standard Index objects and are used to represent nested indexes. There are a number of methods for constructing them based on the initial object
MultiIndexes can also be used for columns as well,
ridx = pd.MultiIndex.from_product(
[["m","n"],["l","p"]], names=["r1","r2"]
)
pd.DataFrame(
np.random.rand(4,4),
index= ridx, columns = cidx
)
c1 A B
c2 x y x y
r1 r2
m l 0.569111 0.639990 0.693537 0.170564
p 0.367974 0.961939 0.573365 0.527121
n l 0.637089 0.860972 0.008284 0.141591
p 0.665466 0.060594 0.121356 0.941145
c1 A B
c2 x y x y
r1 r2
m l 0.361458 0.506350 0.429574 0.342180
p 0.406104 0.755411 0.416626 0.938283
n l 0.384019 0.734839 0.455678 0.423700
p 0.224225 0.684089 0.813723 0.471471
iloc
c1 c2
A x 0.361458
y 0.506350
B x 0.429574
y 0.342180
Name: (m, l), dtype: float64
<class 'pandas.core.series.Series'>
np.float64(0.5063499771744547)
c1 A B
c2 x y x y
r1 r2
m l 0.361458 0.506350 0.429574 0.342180
p 0.406104 0.755411 0.416626 0.938283
loc
loc
Index slices can also be used with combinations of indexes and index tuples,
c1 A B
c2 x y x y
r1 r2
m l 0.361458 0.506350 0.429574 0.342180
p 0.406104 0.755411 0.416626 0.938283
n l 0.384019 0.734839 0.455678 0.423700
p 0.224225 0.684089 0.813723 0.471471
c1 A B
c2 x y x y
r1 r2
m l 0.361458 0.506350 0.429574 0.342180
p 0.406104 0.755411 0.416626 0.938283
n l 0.384019 0.734839 0.455678 0.423700
c1 A B
c2 x y x y
r1 r2
m p 0.406104 0.755411 0.416626 0.938283
n l 0.384019 0.734839 0.455678 0.423700
p 0.224225 0.684089 0.813723 0.471471
c1 A B
c2 x y x y
r1 r2
m p 0.406104 0.755411 0.416626 0.938283
n l 0.384019 0.734839 0.455678 0.423700
The previous methods don’t give easy access to indexing on nested index levels, this is possible via the cross-section method xs()
,
It is also possible to construct a MultiIndex or modify an existing one using set_index()
and reset_index()
,
The query()
method can be used for filtering rows, it evaluates a string expression in the context of the data frame.
Empty DataFrame
Columns: [id, weight, height, date]
Index: []
id weight height date
anna 202 79.477217 162.607949 2025-02-01
bob 535 97.369002 175.888696 2025-02-02
carol 960 51.663463 156.062230 2025-02-03
dave 370 67.517056 171.197477 2025-02-04
Beyond the use of loc()
and iloc()
there is also the filter()
method which can be used to select columns (or indices) by name with pattern matching
id weight
anna 202 79.477217
bob 535 97.369002
carol 960 51.663463
dave 370 67.517056
erin 206 29.780742
id weight height
anna 202 79.477217 162.607949
bob 535 97.369002 175.888696
carol 960 51.663463 156.062230
dave 370 67.517056 171.197477
erin 206 29.780742 167.607252
weight height
anna 79.477217 162.607949
bob 97.369002 175.888696
carol 51.663463 156.062230
dave 67.517056 171.197477
erin 29.780742 167.607252
id weight height date
anna 202 79.477217 162.607949 2025-02-01
carol 960 51.663463 156.062230 2025-02-03
dave 370 67.517056 171.197477 2025-02-04
Indexing with assignment allows for inplace modification of a DataFrame, while assign()
creates a new object (but is chainable)
id weight height date student age
anna 202 79.477217 162.607949 2025-02-01 True 19.0
bob 535 97.369002 175.888696 2025-02-02 True 22.0
carol 960 51.663463 156.062230 2025-02-03 True 25.0
dave 370 67.517056 171.197477 2025-02-04 False NaN
erin 206 29.780742 167.607252 2025-02-05 None NaN
id weight height date student age rand
anna 202 79.477217 162.607949 2025-02-01 yes 19.0 0.938553
bob 535 97.369002 175.888696 2025-02-02 yes 22.0 0.000779
carol 960 51.663463 156.062230 2025-02-03 yes 25.0 0.992212
dave 370 67.517056 171.197477 2025-02-04 no NaN 0.617482
erin 206 29.780742 167.607252 2025-02-05 no NaN 0.611653
Columns or rows can be removed via the drop()
method,
KeyError: "['student'] not found in axis"
id weight height date age
anna 202 79.477217 162.607949 2025-02-01 19.0
bob 535 97.369002 175.888696 2025-02-02 22.0
carol 960 51.663463 156.062230 2025-02-03 25.0
dave 370 67.517056 171.197477 2025-02-04 NaN
erin 206 29.780742 167.607252 2025-02-05 NaN
id weight height date student age
bob 535 97.369002 175.888696 2025-02-02 True 22.0
carol 960 51.663463 156.062230 2025-02-03 True 25.0
erin 206 29.780742 167.607252 2025-02-05 None NaN
KeyError: '[False, False, False, False, False, True] not found in axis'
id weight height date student
anna 202 79.477217 162.607949 2025-02-01 True
bob 535 97.369002 175.888696 2025-02-02 True
carol 960 51.663463 156.062230 2025-02-03 True
dave 370 67.517056 171.197477 2025-02-04 False
erin 206 29.780742 167.607252 2025-02-05 None
id date student age
anna 202 2025-02-01 True 19.0
bob 535 2025-02-02 True 22.0
carol 960 2025-02-03 True 25.0
dave 370 2025-02-04 False NaN
erin 206 2025-02-05 None NaN
DataFrames can be sorted on one or more columns via sort_values()
,
id weight height date student age
anna 202 79.477217 162.607949 2025-02-01 True 19.0
bob 535 97.369002 175.888696 2025-02-02 True 22.0
carol 960 51.663463 156.062230 2025-02-03 True 25.0
dave 370 67.517056 171.197477 2025-02-04 False NaN
erin 206 29.780742 167.607252 2025-02-05 None NaN
id weight height date student age
dave 370 67.517056 171.197477 2025-02-04 False NaN
carol 960 51.663463 156.062230 2025-02-03 True 25.0
bob 535 97.369002 175.888696 2025-02-02 True 22.0
anna 202 79.477217 162.607949 2025-02-01 True 19.0
erin 206 29.780742 167.607252 2025-02-05 None NaN
All three can be used to combine data frames,
concat()
stacks DataFrames on either axis, with basic alignment based on (row) indexes. join
argument only supports “inner” and “outer”.
merge()
aligns based on one or more shared columns. how
supports “inner”, “outer”, “left”, “right”, and “cross”.
join()
uses merge()
behind the scenes, but prefers to join based on (row) indexes. Also has different default how
compared to merge()
, “left” vs “inner”.
Sta 663 - Spring 2025