2 Getting to know the data
Let’s start by reading in the data and printing some five random rows of it to get a sense for what we’re dealing with.
shape: (5, 13)
Rank | Mark | Competitor | DOB | Nat | Pos | Venue | Date | Results Score | Mark [meters or seconds] | Event | Wind | Sex |
---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | date | str | str | str | date | i64 | f64 | str | str | str |
12875 | "13.59" | "Louis François… | 1999-03-02 | "SEN" | "1" | "Bonneuil-sur-M… | 2019-06-18 | 1141 | 13.5 | "110 Metres Hur… | "0.0" | "male" |
13542 | "10:16.3h" | "Holly PAGE" | 2000-08-17 | "GBR" | "2" | "West Suffolk A… | 2021-05-15 | 1056 | 616.3 | "3000 Metres St… | null | "female" |
8585 | "10:03.1h" | "Birtukan ADAMU… | 1992-04-29 | "ETH" | "1" | "Addis Abeba (E… | 2019-05-11 | 1088 | 603.1 | "3000 Metres St… | null | "female" |
12129 | "19.82" | "Tim NEDOW" | 1990-10-16 | "CAN" | "1" | "Azusa, CA (USA… | 2013-04-19 | 1110 | 19.8 | "Shot Put" | null | "male" |
11362 | "20.63" | "Marlon DEVONIS… | 1976-06-01 | "GBR" | "6" | "Rieti (ITA)" | 2008-09-07 | 1123 | 20.6 | "200 Metres" | "0.6" | "male" |
We can easily understand the different columns:
Rank
is the rank of that performance for the given sex and event.- The
Mark
is the unparsed result entry for that performance. This can be in minutes, seconds, hours, meters (long jump) or points (decathlon). Competitor
,DOB
, andNat
are the competitor’s name, date of birth, and nationality.Pos
is the position that was achieved with this performance in that specific event where it was performed.Venue
andDate
specify where and when the performance was achieved.- World Athletics assigns a score to a performance, that’s what
Results Score
is. Mark [meters or seconds]
is my attempt to parse theMark
into seconds or meters, i.e. afloat
.Event
is the event name.Wind
, if available, tells the wind reading for that performance. This is mostly important for sprinting and jumping.Sex
is either female or male.
2.1 Some basic counts
Below are some basic counts of the data.
Counts for male and female performance:
shape: (2, 2)
Sex | count |
---|---|
str | u32 |
"male" | 236187 |
"female" | 227660 |
Code
Performance count by sex and event, colored:
Event | female | male | |
---|---|---|---|
0 | Hammer Throw | 33647 | 12984 |
1 | 100 Metres | 26983 | 24875 |
2 | 200 Metres | 22067 | 15005 |
3 | Pole Vault | 15670 | 16388 |
4 | 3000 Metres Steeplechase | 15125 | 9665 |
5 | Javelin Throw | 13449 | 7564 |
6 | 800 Metres | 11597 | 7283 |
7 | 400 Metres | 10867 | 8189 |
8 | 1500 Metres | 9456 | 9107 |
9 | 20 Kilometres Race Walk | 9275 | 3773 |
10 | 100 Metres Hurdles | 7585 | nan |
11 | Shot Put | 7123 | 16284 |
12 | 400 Metres Hurdles | 6978 | 15502 |
13 | Triple Jump | 4724 | 5446 |
14 | 5000 Metres | 4563 | 8357 |
15 | Marathon | 4356 | 8496 |
16 | 10000 Metres | 3111 | 4867 |
17 | 3000 Metres | 2924 | 2078 |
18 | High Jump | 2440 | 8980 |
19 | Long Jump | 2343 | 6570 |
20 | Half Marathon | 1955 | 5084 |
21 | 10 Kilometres Race Walk | 1926 | 360 |
22 | Discus Throw | 1774 | 13709 |
23 | Heptathlon | 1124 | nan |
24 | 2000 Metres Steeplechase | 831 | 105 |
25 | 35 Kilometres Race Walk | 699 | 225 |
26 | 5 Kilometres Race Walk | 691 | 49 |
27 | 10 Kilometres | 640 | 1129 |
28 | One Mile | 601 | 1753 |
29 | 5 Kilometres | 587 | 1196 |
30 | 1000 Metres | 516 | 513 |
31 | 50 Kilometres Race Walk | 341 | 1477 |
32 | 5000 Metres Race Walk | 303 | 663 |
33 | 15 Kilometres | 198 | 189 |
34 | 10000 Metres Race Walk | 191 | 496 |
35 | 10 Miles Road | 163 | 143 |
36 | 600 Metres | 159 | 35 |
37 | 20000 Metres Race Walk | 159 | 44 |
38 | 300 Metres | 151 | 843 |
39 | 3000 Metres Race Walk | 102 | 237 |
40 | 20 Kilometres | 95 | 283 |
41 | Two Miles | 92 | 146 |
42 | 2000 Metres | 79 | 223 |
43 | Decathlon | nan | 1790 |
44 | 110 Metres Hurdles | nan | 13970 |
45 | 30 Kilometres Race Walk | nan | 112 |
Code
import altair as alt
from camminapy.plot import altair_theme
print("Count of performances grouped by year (starting 1960):")
altair_theme()
alt.Chart(
df.with_columns(pl.col("Date").dt.year())
.groupby("Date", "Sex")
.count()
.filter(pl.col("Date") > 1960)
.sort("Date")
.to_pandas()
).mark_bar(clip=True).encode(
x=alt.X("Date:N").axis(labelAngle=-90, values=list(range(1960, 2024, 2))),
y="count:Q",
color=alt.Color("Sex:N").scale(
domain=["female", "male"], range=["purple", "green"]
),
).properties(
height=300, width=550
)
Count of performances grouped by year (starting 1960):
Interesting to see COVID pop up in this data as well.
Code
The top 10 events with the most performances:
Event | count | |
---|---|---|
0 | 100 Metres | 51858 |
1 | Hammer Throw | 46631 |
2 | 200 Metres | 37072 |
3 | Pole Vault | 32058 |
4 | 3000 Metres Steeplechase | 24790 |
5 | Shot Put | 23407 |
6 | 400 Metres Hurdles | 22480 |
7 | Javelin Throw | 21013 |
8 | 400 Metres | 19056 |
9 | 800 Metres | 18880 |
Code
print(
"Let's finish with an overview that shows during which months, which events are held:"
)
alt.Chart(
df.with_columns(pl.col("Date").dt.month())
.groupby("Date", "Sex", "Event")
.count()
.filter(pl.col("Event").str.contains("Walk").is_not())
.sort("Date")
.to_pandas()
).mark_bar(clip=True).encode(
x=alt.X("Date:N").title("Month of Year"),
y="count:Q",
color=alt.Color("Sex:N").scale(
domain=["female", "male"], range=["purple", "green"]
),
row="Event:N",
).properties(
height=200, width=550
).resolve_scale(
y="independent"
)
Let's finish with an overview that shows during which months, which events are held: