2  Getting to know the data

Code
import polars as pl
import numpy as np
from IPython.display import Markdown
from tabulate import tabulate

np.random.seed(1)

Let’s start by reading in the data and printing some five random rows of it to get a sense for what we’re dealing with.

Code
df = pl.read_csv("../data/data.csv", separator=";", try_parse_dates=True)
df.sample(n=5)
shape: (5, 13)
Rank Mark Competitor DOB Nat Pos Venue Date Results Score Mark [meters or seconds] Event Wind Sex
i64 str str date str str str date i64 f64 str str str
12875 "13.59" "Louis François… 1999-03-02 "SEN" "1" "Bonneuil-sur-M… 2019-06-18 1141 13.5 "110 Metres Hur… "0.0" "male"
13542 "10:16.3h" "Holly PAGE" 2000-08-17 "GBR" "2" "West Suffolk A… 2021-05-15 1056 616.3 "3000 Metres St… null "female"
8585 "10:03.1h" "Birtukan ADAMU… 1992-04-29 "ETH" "1" "Addis Abeba (E… 2019-05-11 1088 603.1 "3000 Metres St… null "female"
12129 "19.82" "Tim NEDOW" 1990-10-16 "CAN" "1" "Azusa, CA (USA… 2013-04-19 1110 19.8 "Shot Put" null "male"
11362 "20.63" "Marlon DEVONIS… 1976-06-01 "GBR" "6" "Rieti (ITA)" 2008-09-07 1123 20.6 "200 Metres" "0.6" "male"

We can easily understand the different columns:

2.1 Some basic counts

Below are some basic counts of the data.

Code
print("Shape of the dataframe:")
df.shape
Shape of the dataframe:
(463847, 13)
Code
print("Counts for male and female performance:")
df.groupby("Sex").count()
Counts for male and female performance:
shape: (2, 2)
Sex count
str u32
"male" 236187
"female" 227660
Code
print("Performance count by sex and event, colored:")
(
    df.groupby("Sex", "Event")
    .count()
    .pivot(index="Event", columns="Sex", values="count", aggregate_function=None)
    .sort("female", descending=True)
    .to_pandas()
    .style.format(precision=0)
    .background_gradient(vmax=35_000)
    # .to_markdown()
)
Performance count by sex and event, colored:
  Event female male
0 Hammer Throw 33647 12984
1 100 Metres 26983 24875
2 200 Metres 22067 15005
3 Pole Vault 15670 16388
4 3000 Metres Steeplechase 15125 9665
5 Javelin Throw 13449 7564
6 800 Metres 11597 7283
7 400 Metres 10867 8189
8 1500 Metres 9456 9107
9 20 Kilometres Race Walk 9275 3773
10 100 Metres Hurdles 7585 nan
11 Shot Put 7123 16284
12 400 Metres Hurdles 6978 15502
13 Triple Jump 4724 5446
14 5000 Metres 4563 8357
15 Marathon 4356 8496
16 10000 Metres 3111 4867
17 3000 Metres 2924 2078
18 High Jump 2440 8980
19 Long Jump 2343 6570
20 Half Marathon 1955 5084
21 10 Kilometres Race Walk 1926 360
22 Discus Throw 1774 13709
23 Heptathlon 1124 nan
24 2000 Metres Steeplechase 831 105
25 35 Kilometres Race Walk 699 225
26 5 Kilometres Race Walk 691 49
27 10 Kilometres 640 1129
28 One Mile 601 1753
29 5 Kilometres 587 1196
30 1000 Metres 516 513
31 50 Kilometres Race Walk 341 1477
32 5000 Metres Race Walk 303 663
33 15 Kilometres 198 189
34 10000 Metres Race Walk 191 496
35 10 Miles Road 163 143
36 600 Metres 159 35
37 20000 Metres Race Walk 159 44
38 300 Metres 151 843
39 3000 Metres Race Walk 102 237
40 20 Kilometres 95 283
41 Two Miles 92 146
42 2000 Metres 79 223
43 Decathlon nan 1790
44 110 Metres Hurdles nan 13970
45 30 Kilometres Race Walk nan 112
Code
import altair as alt
from camminapy.plot import altair_theme

print("Count of performances grouped by year (starting 1960):")

altair_theme()
alt.Chart(
    df.with_columns(pl.col("Date").dt.year())
    .groupby("Date", "Sex")
    .count()
    .filter(pl.col("Date") > 1960)
    .sort("Date")
    .to_pandas()
).mark_bar(clip=True).encode(
    x=alt.X("Date:N").axis(labelAngle=-90, values=list(range(1960, 2024, 2))),
    y="count:Q",
    color=alt.Color("Sex:N").scale(
        domain=["female", "male"], range=["purple", "green"]
    ),
).properties(
    height=300, width=550
)
Count of performances grouped by year (starting 1960):

Interesting to see COVID pop up in this data as well.

Code
print("The top 10 events with the most performances:")
(
    df.groupby("Event")
    .count()
    .sort("count", descending=True)
    .head(10)
    .to_pandas()
    .style.background_gradient(subset="count")
)
The top 10 events with the most performances:
  Event count
0 100 Metres 51858
1 Hammer Throw 46631
2 200 Metres 37072
3 Pole Vault 32058
4 3000 Metres Steeplechase 24790
5 Shot Put 23407
6 400 Metres Hurdles 22480
7 Javelin Throw 21013
8 400 Metres 19056
9 800 Metres 18880
Code
print(
    "Let's finish with an overview that shows during which months, which events are held:"
)
alt.Chart(
    df.with_columns(pl.col("Date").dt.month())
    .groupby("Date", "Sex", "Event")
    .count()
    .filter(pl.col("Event").str.contains("Walk").is_not())
    .sort("Date")
    .to_pandas()
).mark_bar(clip=True).encode(
    x=alt.X("Date:N").title("Month of Year"),
    y="count:Q",
    color=alt.Color("Sex:N").scale(
        domain=["female", "male"], range=["purple", "green"]
    ),
    row="Event:N",
).properties(
    height=200, width=550
).resolve_scale(
    y="independent"
)
Let's finish with an overview that shows during which months, which events are held: