id weight height date
anna 202 79.477217 162.607949 2025-02-01
bob 535 97.369002 175.888696 2025-02-02
carol 960 51.663463 156.062230 2025-02-03
dave 370 67.517056 171.197477 2025-02-04
df.query('weight > 50 & height < 165')
id weight height date
anna 202 79.477217 162.607949 2025-02-01
carol 960 51.663463 156.062230 2025-02-03
qid = 202
df.query('id == @qid')
id weight height date
anna 202 79.477217 162.607949 2025-02-01
Selecting Columns
Beyond loc() and iloc(), there is also the filter() method, which can be used to select columns (or row indices) by name with pattern matching,
df.filter(items=["id", "weight"])
id weight
anna 202 79.477217
bob 535 97.369002
carol 960 51.663463
dave 370 67.517056
erin 206 29.780742
df.filter(like="i")
id weight height
anna 202 79.477217 162.607949
bob 535 97.369002 175.888696
carol 960 51.663463 156.062230
dave 370 67.517056 171.197477
erin 206 29.780742 167.607252
df.filter(regex="ght$")
weight height
anna 79.477217 162.607949
bob 97.369002 175.888696
carol 51.663463 156.062230
dave 67.517056 171.197477
erin 29.780742 167.607252
df.filter(like="a", axis=0)
id weight height date
anna 202 79.477217 162.607949 2025-02-01
carol 960 51.663463 156.062230 2025-02-03
dave 370 67.517056 171.197477 2025-02-04
Adding columns
Indexing with assignment allows for in-place modification of a DataFrame, while assign() creates a new object (but is chainable),
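The code producing the next output is missing from this extract; a plausible reconstruction of the in-place additions (values inferred from the table below) is:

df['student'] = [True, True, True, False, None]  # object column mixing bools and None
df['age'] = [19., 22., 25., None, None]          # None becomes NaN in a float column
df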
id weight height date student age
anna 202 79.477217 162.607949 2025-02-01 True 19.0
bob 535 97.369002 175.888696 2025-02-02 True 22.0
carol 960 51.663463 156.062230 2025-02-03 True 25.0
dave 370 67.517056 171.197477 2025-02-04 False NaN
erin 206 29.780742 167.607252 2025-02-05 None NaN
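The assign() version below additionally recodes student to yes/no and adds a random column; one plausible sketch (the exact random values will differ):

import numpy as np
df.assign(
    student = lambda d: np.where(d.student == True, "yes", "no"),  # recode bools/None to strings
    rand = np.random.rand(5)                                       # new column of uniform draws
)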
id weight height date student age rand
anna 202 79.477217 162.607949 2025-02-01 yes 19.0 0.938553
bob 535 97.369002 175.888696 2025-02-02 yes 22.0 0.000779
carol 960 51.663463 156.062230 2025-02-03 yes 25.0 0.992212
dave 370 67.517056 171.197477 2025-02-04 no NaN 0.617482
erin 206 29.780742 167.607252 2025-02-05 no NaN 0.611653
Removing columns (and rows)
Columns or rows can be removed via the drop() method,
df.drop(['student'])
KeyError: "['student'] not found in axis"
df.drop(['student'], axis=1)
id weight height date age
anna 202 79.477217 162.607949 2025-02-01 19.0
bob 535 97.369002 175.888696 2025-02-02 22.0
carol 960 51.663463 156.062230 2025-02-03 25.0
dave 370 67.517056 171.197477 2025-02-04 NaN
erin 206 29.780742 167.607252 2025-02-05 NaN
df.drop(['anna', 'dave'])
id weight height date student age
bob 535 97.369002 175.888696 2025-02-02 True 22.0
carol 960 51.663463 156.062230 2025-02-03 True 25.0
erin 206 29.780742 167.607252 2025-02-05 None NaN
df.drop(columns = df.columns == "age")
KeyError: '[False, False, False, False, False, True] not found in axis'
df.drop(columns = df.columns[df.columns == "age"])
id weight height date student
anna 202 79.477217 162.607949 2025-02-01 True
bob 535 97.369002 175.888696 2025-02-02 True
carol 960 51.663463 156.062230 2025-02-03 True
dave 370 67.517056 171.197477 2025-02-04 False
erin 206 29.780742 167.607252 2025-02-05 None
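The output that follows (only id, date, student, and age remain) suggests weight and height were dropped with something like:

df.drop(columns = ["weight", "height"])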
id date student age
anna 202 2025-02-01 True 19.0
bob 535 2025-02-02 True 22.0
carol 960 2025-02-03 True 25.0
dave 370 2025-02-04 False NaN
erin 206 2025-02-05 None NaN
Sorting
DataFrames can be sorted on one or more columns via sort_values(),
df
id weight height date student age
anna 202 79.477217 162.607949 2025-02-01 True 19.0
bob 535 97.369002 175.888696 2025-02-02 True 22.0
carol 960 51.663463 156.062230 2025-02-03 True 25.0
dave 370 67.517056 171.197477 2025-02-04 False NaN
erin 206 29.780742 167.607252 2025-02-05 None NaN
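The code for the sorted output below is not shown in this extract; the ordering is consistent with sorting by student ascending and then age descending, e.g.:

df.sort_values(by = ["student", "age"], ascending = [True, False])  # NaN/None placed last by default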
id weight height date student age
dave 370 67.517056 171.197477 2025-02-04 False NaN
carol 960 51.663463 156.062230 2025-02-03 True 25.0
bob 535 97.369002 175.888696 2025-02-02 True 22.0
anna 202 79.477217 162.607949 2025-02-01 True 19.0
erin 206 29.780742 167.607252 2025-02-05 None NaN
join vs merge vs concat
All three can be used to combine data frames,
concat() stacks DataFrames along either axis, with basic alignment based on the (row) indexes. Its join argument only supports “inner” and “outer”.
merge() aligns based on one or more shared columns. Its how argument supports “inner”, “outer”, “left”, “right”, and “cross”.
join() uses merge() behind the scenes, but prefers to join based on the (row) indexes. It also has a different default for how than merge(): “left” vs “inner”. A minimal sketch contrasting the three follows.
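The DataFrames and column names in this sketch are hypothetical:

import pandas as pd

x = pd.DataFrame({"id": [1, 2, 3], "a": ["x", "y", "z"]})
y = pd.DataFrame({"id": [2, 3, 4], "b": ["u", "v", "w"]})

pd.concat([x, y], join = "outer")            # stack rows, union of columns, NaN where absent
x.merge(y, on = "id", how = "inner")         # keep only ids present in both
x.set_index("id").join(y.set_index("id"))    # align on the indexes, how = "left" by default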
Pivoting - long to wide
df
country year type count
0 A 1999 cases 0.7K
1 A 1999 pop 19M
2 A 2000 cases 2K
3 A 2000 pop 20M
4 B 1999 cases 37K
5 B 1999 pop 172M
6 B 2000 cases 80K
7 B 2000 pop 174M
8 C 1999 cases 212K
9 C 1999 pop 1T
10 C 2000 cases 213K
11 C 2000 pop 1T
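The wide result is not shown in this extract; the reshaping step would be along the lines of:

df.pivot(index = ["country", "year"], columns = "type", values = "count")

Grouping

The remaining pandas examples use a cereal dataset and its groupby() method. The code for the first table is also missing; it was presumably a mean of the numeric columns by type, e.g.:

cereal.groupby("type")[["calories", "sugars", "rating"]].mean()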
calories sugars rating
type
Cold 107.162162 7.175676 42.095218
Hot 100.000000 1.333333 56.737708
cereal.groupby("mfr").size()
mfr
General Mills 22
Kellogg's 23
Maltex 1
Nabisco 6
Post 9
Quaker Oats 8
Ralston Purina 8
dtype: int64
Selecting groups
Groups can be accessed via get_group()
cereal.groupby("type").get_group("Hot")
name mfr type calories sugars rating
20 Cream of Wheat (Quick) Nabisco Hot 100 0 64.533816
43 Maypo Maltex Hot 100 3 54.850917
57 Quaker Oatmeal Quaker Oats Hot 100 1 50.828392
cereal.groupby("mfr").get_group("Post")
name mfr ... sugars rating
9 Bran Flakes Post ... 5 53.313813
27 Fruit & Fibre Dates; Walnuts; and Oats Post ... 10 40.917047
29 Fruity Pebbles Post ... 12 28.025765
30 Golden Crisp Post ... 15 35.252444
32 Grape Nuts Flakes Post ... 5 52.076897
33 Grape-Nuts Post ... 3 53.371007
34 Great Grains Pecan Post ... 4 45.811716
37 Honey-comb Post ... 11 28.742414
52 Post Nat. Raisin Bran Post ... 14 37.840594
[9 rows x 6 columns]
Iterating groups
DataFrameGroupBy objects can also be iterated over,
for name, group in cereal.groupby("type"):
    print(f"# {name}\n{group}\n\n")
# Cold
name mfr ... sugars rating
0 100% Bran Nabisco ... 6 68.402973
1 100% Natural Bran Quaker Oats ... 8 33.983679
2 All-Bran Kellogg's ... 5 59.425505
3 All-Bran with Extra Fiber Kellogg's ... 0 93.704912
4 Almond Delight Ralston Purina ... 8 34.384843
.. ... ... ... ... ...
72 Triples General Mills ... 3 39.106174
73 Trix General Mills ... 12 27.753301
74 Wheat Chex Ralston Purina ... 3 49.787445
75 Wheaties General Mills ... 3 51.592193
76 Wheaties Honey Gold General Mills ... 8 36.187559
[74 rows x 6 columns]
# Hot
name mfr type calories sugars rating
20 Cream of Wheat (Quick) Nabisco Hot 100 0 64.533816
43 Maypo Maltex Hot 100 3 54.850917
57 Quaker Oatmeal Quaker Oats Hot 100 1 50.828392
Aggregation
The aggregate() method, or its alias agg(), can be used to compute summary statistics for each group,
cereal.groupby("mfr").agg("mean")
TypeError: agg function failed [how->mean,dtype->object]
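The error comes from trying to average the non-numeric columns (name, mfr, type); restricting the aggregation to the numeric columns avoids it, e.g.:

cereal.groupby("mfr")[["calories", "sugars", "rating"]].agg("mean")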
Missing values
In Polars, missing data is represented by the value null, which is used for all data types, including numerical types.
pl.Series("ints", [1, 2, 3, None])
shape: (4,)
ints
i64
1
2
3
null
pl.Series("dbls", [1., 2., 3., None])
shape: (4,)
dbls
f64
1.0
2.0
3.0
null
pl.Series("bools", [True, False, True, None])
shape: (4,)
bools
bool
true
false
true
null
pl.Series("strs", ["A", "B", "C", None])
shape: (4,)
strs
str
"A"
"B"
"C"
null
pl.Series("ints", [1, 2, 3, np.nan])
TypeError: unexpected value while building Series of type Int64; found value of type Float64: NaN
Hint: Try setting `strict=False` to allow passing data with mixed types.
pl.Series("dbls", [1., 2., 3., np.nan])
shape: (4,)
dbls
f64
1.0
2.0
3.0
NaN
Missing value checking
Checking for missing values can be done via the is_null() method
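For example (a minimal sketch),

pl.Series("ints", [1, 2, 3, None]).is_null()  # boolean Series: [false, false, false, true]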
Expressions
A Polars expression represents a potential computation that can be executed later. Much of the power of Polars comes from the ability to chain together / compose these expressions.
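For example, the following builds an expression without touching any data (the column names are hypothetical):

expr = (pl.col("fare_amount") / pl.col("trip_distance")).mean()  # nothing is computed yet

Values are only produced once the expression is evaluated inside a context.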
Contexts
Contexts are the environments in which expressions are evaluated - examples of common contexts include: select, with_columns, filter, and group_by.
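The code that built the query whose plan appears below is not included in this extract; the following is a plausible reconstruction from that plan (the file paths, the zones lookup table, and the rounding precision are assumptions):

import polars as pl

taxi = pl.scan_parquet("yellow_tripdata_*.parquet")  # hypothetical paths; the plan shows 59 parquet sources
zones = pl.read_csv("taxi_zones.csv").lazy()         # hypothetical lookup: pickup_zone, Borough, Zone, service_zone

query = (
    taxi
    .filter(pl.col("trip_distance") > 0)             # the plan also shows a RENAME step, omitted here
    .group_by("pickup_zone")
    .agg(
        pl.len().alias("num_rides"),
        (pl.col("fare_amount") / pl.col("trip_distance")).mean().round(2).alias("avg_fare_per_mile"),
    )
    .join(zones, on = "pickup_zone", how = "left")
    .sort("pickup_zone")
)
query.explain(optimized = False)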
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)
SORT BY [col("pickup_zone")]
  LEFT JOIN:
  LEFT PLAN ON: [col("pickup_zone").strict_cast(Int64)]
    AGGREGATE
      [len().alias("num_rides"), [(col("fare_amount")) / (col("trip_distance"))].mean().round().alias("avg_fare_per_mile")] BY [col("pickup_zone")]
      FROM
      RENAME
        FILTER [(col("trip_distance")) > (0.0)]
        FROM
          Parquet SCAN [/Users/rundel/Scratch/nyctaxi/yellow_tripdata_2020-01_fix.parquet, ... 58 other sources]
          PROJECT */19 COLUMNS
  RIGHT PLAN ON: [col("pickup_zone").strict_cast(Int64)]
    DF ["pickup_zone", "Borough", "Zone", "service_zone"]; PROJECT */4 COLUMNS
  END LEFT JOIN
Result
query.collect()
shape: (263, 6)
┌─────────────┬───────────┬───────────────────┬─────────────────┬───────────────────────────┬────────────────┐
│ pickup_zone ┆ num_rides ┆ avg_fare_per_mile ┆ Borough         ┆ Zone                      ┆ service_zone   │
│ ---         ┆ ---       ┆ ---               ┆ ---             ┆ ---                       ┆ ---            │
│ i32         ┆ u32       ┆ f64               ┆ str             ┆ str                       ┆ str            │
╞═════════════╪═══════════╪═══════════════════╪═════════════════╪═══════════════════════════╪════════════════╡
│ 1           ┆ 6022      ┆ 2205.09           ┆ "EWR"           ┆ "Newark Airport"          ┆ "EWR"          │
│ 2           ┆ 149       ┆ 4.93              ┆ "Queens"        ┆ "Jamaica Bay"             ┆ "Boro Zone"    │
│ 3           ┆ 5812      ┆ 11.98             ┆ "Bronx"         ┆ "Allerton/Pelham Gardens" ┆ "Boro Zone"    │
│ 4           ┆ 225977    ┆ 9.9               ┆ "Manhattan"     ┆ "Alphabet City"           ┆ "Yellow Zone"  │
│ 5           ┆ 891       ┆ 20.02             ┆ "Staten Island" ┆ "Arden Heights"           ┆ "Boro Zone"    │
│ …           ┆ …         ┆ …                 ┆ …               ┆ …                         ┆ …              │
│ 261         ┆ 818903    ┆ 9.25              ┆ "Manhattan"     ┆ "World Trade Center"      ┆ "Yellow Zone"  │
│ 262         ┆ 2336728   ┆ 7.93              ┆ "Manhattan"     ┆ "Yorkville East"          ┆ "Yellow Zone"  │
│ 263         ┆ 3531531   ┆ 7.76              ┆ "Manhattan"     ┆ "Yorkville West"          ┆ "Yellow Zone"  │
│ 264         ┆ 1245641   ┆ 22.66             ┆ "Unknown"       ┆ "N/A"                     ┆ "N/A"          │
│ 265         ┆ 275794    ┆ 66.35             ┆ "N/A"           ┆ "Outside of NYC"          ┆ "N/A"          │
└─────────────┴───────────┴───────────────────┴─────────────────┴───────────────────────────┴────────────────┘
Performance
%timeit query.collect()
1.14 s ± 62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)