Commit ae003fa

Merge pull request #6 from com-480-data-visualization/ms2
MS2
2 parents 11f06aa + 9c61539 commit ae003fa

33 files changed

Lines changed: 116653 additions & 147 deletions

.gitignore

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,4 +18,11 @@
1818
.ionide
1919

2020
.DS_Store
21-
data/*
21+
22+
# Data files
23+
data/*
24+
data/**/*
25+
!data/**/
26+
!data/*.md
27+
!data/processed/*.md
28+
!data/**/.gitkeep

README.md

Lines changed: 3 additions & 3 deletions
@@ -17,8 +17,8 @@ Project of Data Visualization (EPFL COM-480) - 2026
1717

1818
## Deliverables
1919

20-
- [Milestone 1](./deliverables/MS1.md)
21-
- [Milestone 2](./deliverables/MS2.md)
22-
- [Milestone 3](./deliverables/MS3.md)
20+
- [Milestone 1](./deliverables/ms1/ms1.md)
21+
- [Milestone 2](./deliverables/ms2/ms2.pdf)
22+
- [Milestone 3](./deliverables/ms3/ms3.md)
2323

2424
**Note**: Each deliverable comes with its associated GitHub release of the repository.

data/README.md

Lines changed: 8 additions & 1 deletion
@@ -6,4 +6,11 @@ This folder contains the datasets that we will be using for our project. We have
66

77
- nba_database/ : Contains player and team statistics for every NBA game from 1947 to the present. It is historical and comprehensive, updated daily. It provides a solid foundation for exploring basketball history, player performance, and team dynamics : https://www.kaggle.com/api/v1/datasets/download/eoinamoore/historical-nba-data-and-player-box-scores
88

9-
- nba_play_by_play_shot_data/ : A large-scale play-by-play and shot-detail dataset covering both NBA and WNBA games, collected from multiple public sources (e.g., official league APIs and stats sites). It provides every in-game event—from period starts, jump balls, fouls, turnovers, rebounds, and field-goal attempts through free throws—along with detailed shot metadata (shot location, distance, result, assisting player, etc.) : https://www.kaggle.com/api/v1/datasets/download/brains14482/nba-playbyplay-and-shotdetails-data-19962021
9+
- nba_play_by_play_shot_data/ : A large-scale play-by-play and shot-detail dataset covering both NBA and WNBA games, collected from multiple public sources (e.g., official league APIs and stats sites). It provides every in-game event—from period starts, jump balls, fouls, turnovers, rebounds, and field-goal attempts through free throws—along with detailed shot metadata (shot location, distance, result, assisting player, etc.) : https://www.kaggle.com/api/v1/datasets/download/brains14482/nba-playbyplay-and-shotdetails-data-19962021
10+
11+
- nba_salary/ : Contains player and team salary data for the NBA from 1990 to 2026, scraped from HoopsHype (https://eu.hoopshype.com/salaries/players/ & https://eu.hoopshype.com/salaries/teams/) using the script in src (scraping each year, merging the results into one CSV, then matching each player to their personId in the player database for future use). It includes player names, seasons, and corresponding salaries : https://drive.google.com/drive/folders/1AI6Z8fIpP7RxhImdtOexrn6dgmB05izz?usp=share_link
12+
13+
- nba_team_colors/ : Contains the main color palette for each NBA team (source: https://www.trucolor.net/portfolio/national-basketball-association-official-colors-franchise-records-1946-1947-through-present/) : https://drive.google.com/drive/folders/1-yfOkoxvzeTDWPNqtZtBHt1_TOxrQw8o?usp=sharing
14+
15+
- processed/ : Contains the processed data files that are used for visualization. The files in this folder are generated from the raw datasets after cleaning, transforming, and aggregating the data to make it suitable for analysis and visualization : https://drive.google.com/drive/folders/1fipeswQbiNC3rnPsH2NQ-axi6x91Rk9n?usp=share_link
16+
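The salary pipeline described for nba_salary/ (scrape per year, merge, match to `personId`) could look roughly like the sketch below. It is not the script in src: the function name, the `name`/`salary` column names, and the matching rule (case-insensitive full name) are all assumptions made for illustration.

```python
import csv
import io


def merge_salaries(yearly_csvs, players):
    """Merge per-season salary tables and attach a personId to each row.

    yearly_csvs: dict mapping season (int) -> CSV text with the
                 hypothetical columns "name" and "salary".
    players:     iterable of (personId, firstName, lastName) tuples.
    """
    # Assumed matching rule: lower-cased "first last" full name.
    by_name = {
        f"{first} {last}".strip().lower(): person_id
        for person_id, first, last in players
    }

    merged = []
    for season in sorted(yearly_csvs):
        reader = csv.DictReader(io.StringIO(yearly_csvs[season]))
        for row in reader:
            name = row["name"].strip()
            merged.append({
                "season": season,
                "name": name,
                "salary": int(row["salary"]),
                # None marks players that could not be matched.
                "personId": by_name.get(name.lower()),
            })
    return merged
```

Unmatched rows keep `personId` as `None`, so they can be reviewed by hand instead of being silently dropped.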

data/nba_salary/.gitkeep

Whitespace-only changes.

data/nba_team_colors/.gitkeep

Whitespace-only changes.

data/processed/README.md

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
1+
# Processed Data
2+
3+
This folder contains the processed data files that are used for visualization.
4+
5+
Can be found here:
6+
7+
## Description of files
8+
9+
### player_seasons.csv
10+
11+
Contains aggregated season-level statistics for players. Each row corresponds to a player's performance in a specific season, with the following columns:
12+
13+
- `season`: The NBA season (e.g., 2020 for the 2020-2021 season)
14+
- `personId`: Unique identifier for the player (NBA player ID)
15+
- `firstName`, `lastName`: Player's first and last name
16+
- `gameType`: Regular Season or Playoffs
17+
- `teamScore`: Average number of points scored by the player's team in the season
18+
- `opponentScore`: Average number of points scored by the opponent teams in the season
19+
- `points`, `assists`, `rebounds`, `blocks`, `steals`, `turnovers`: Average per game general statistics for the player in the season
20+
- `pointsTotal`, `assistsTotal`, `reboundsTotal`, `blocksTotal`, `stealsTotal`, `turnoversTotal`: Sum throughout the season of the corresponding statistics for the player
21+
- `fieldGoalsMade`, `fieldGoalsAttempted`, `threePointersMade`, `threePointersAttempted`, `freeThrowsMade`, `freeThrowsAttempted`: Sum throughout the season of the corresponding shot statistics for the player
22+
- `plusMinusPoints`: Average plus/minus points per game for the player in the season (i.e., the average point differential when the player is on the court)
23+
- `foulsPersonal`: Average personal fouls per game for the player in the season
24+
- `numMinutes`: Average minutes played per game for the player in the season
25+
- `win`: Number of games won by the player's team in the season
26+
- `gamesPlayed`: Number of games played by the player in the season
27+
- `proportionThreePoint`: Proportion of three-point shots attempted out of total field goal attempts for the player in the season
28+
- `fieldGoalsPercentage`, `threePointersPercentage`, `freeThrowsPercentage`: Shooting percentages for the player in the season
29+
- `salary`: Average salary for the player in the season (if available, otherwise NaN)
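As a rough illustration of how the season-level columns above relate to game-level rows, here is a minimal sketch. It assumes the input rows carry the same column names as this file, and it assumes shooting percentages are computed as season totals (made / attempted, as a fraction); the actual processing script may differ on both points.

```python
from collections import defaultdict


def aggregate_player_seasons(games):
    """Aggregate per-game player rows into one row per player/season/gameType."""
    groups = defaultdict(list)
    for game in games:
        key = (game["personId"], game["season"], game["gameType"])
        groups[key].append(game)

    seasons = []
    for (person_id, season, game_type), rows in sorted(groups.items()):
        games_played = len(rows)
        points_total = sum(r["points"] for r in rows)
        made = sum(r["fieldGoalsMade"] for r in rows)
        attempted = sum(r["fieldGoalsAttempted"] for r in rows)
        threes_attempted = sum(r["threePointersAttempted"] for r in rows)
        nan = float("nan")
        seasons.append({
            "personId": person_id,
            "season": season,
            "gameType": game_type,
            "gamesPlayed": games_played,
            "pointsTotal": points_total,                # season sum
            "points": points_total / games_played,      # per-game average
            "fieldGoalsMade": made,
            "fieldGoalsAttempted": attempted,
            # Assumption: ratio of season totals, NaN with no attempts.
            "fieldGoalsPercentage": made / attempted if attempted else nan,
            "proportionThreePoint": threes_attempted / attempted if attempted else nan,
        })
    return seasons
```

Grouping on `gameType` as well keeps regular-season and playoff averages separate, which matches the one-row-per-`gameType` layout described above.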
30+
31+
### player_games.csv
32+
33+
TODO
34+
35+
### shot_events.csv
36+
37+
TODO
38+
39+
### player_metadata.csv
40+
41+
Contains metadata about players (from nba_api), with the following columns:
42+
43+
- `personId`: Unique identifier for the player (NBA player ID)
44+
- `firstName`: Player's first name
45+
- `lastName`: Player's last name
46+
- `birthDate`: Player's birth date (in datetime format)
47+
- `height`: Player's height in centimeters
48+
- `weight`: Player's weight in pounds
49+
- `nbSeasons`: Number of seasons the player has played in the NBA
50+
- `jerseyNumber`: The jersey number the player wore in the NBA (e.g., 23)
51+
- `position`: The position the player played in the NBA (e.g., "Guard", "Forward", "Center")
52+
- `startYear`: The year the player started playing in the NBA
53+
- `endYear`: The last year in which the player played in the NBA
54+
- `draftYear`: The year the player was drafted into the NBA
55+
- `draftRound`: The round in which the player was drafted
56+
- `draftNumber`: The number of the pick in the draft
57+
58+
### team_seasons.csv
59+
60+
Contains aggregated season-level statistics for teams. Each row corresponds to a team's performance in a specific season, with the following columns:
61+
62+
- `season`: The NBA season (e.g., 2020 for the 2020-2021 season)
63+
- `teamId`: Unique identifier for the team (NBA team ID)
64+
- `gameType`: Regular Season or Playoffs
65+
- `teamName`: Name of the team (e.g., "Lakers")
66+
- `teamCity`: City of the team (e.g., "Los Angeles")
67+
- `teamScore`: Average number of points scored by the team in the season
68+
- `opponentScore`: Average number of points scored by the opponent teams in the season
69+
- `assists`, `rebounds`, `blocks`, `steals`, `turnovers`: Average per game general statistics for the team in the season
70+
- `teamScoreTotal`, `opponentScoreTotal`, `assistsTotal`, `reboundsTotal`, `blocksTotal`, `stealsTotal`, `turnoversTotal`: Sum throughout the season of the corresponding statistics for the team
71+
- `fieldGoalsMade`, `fieldGoalsAttempted`, `threePointersMade`, `threePointersAttempted`, `freeThrowsMade`, `freeThrowsAttempted`: Average per game shot statistics for the team in the season
72+
- `fieldGoalsPercentage`, `threePointersPercentage`, `freeThrowsPercentage`: Shooting percentages for the team in the season
73+
- `proportionThreePoint`: Proportion of three-point shots attempted out of total field goal attempts for the team in the season
74+
- `plusMinusPoints`: Average plus/minus points per game for the team in the season (i.e., the average point differential when the team is on the court)
75+
- `foulsPersonal`: Average personal fouls per game for the team in the season
76+
- `numMinutes`: Average minutes played per game for the team in the season
77+
- `win`: Number of games won by the team in the season
78+
- `losses`: Number of games lost by the team in the season
79+
- `gamesPlayed`: Number of games played by the team in the season (should be 82 for regular season, but can be less for older seasons or playoffs)
80+
- `salary`: Average salary for the team in the season (if available, otherwise NaN)
81+
82+
83+
### team_games.csv
84+
85+
Contains detailed game-level statistics for teams. Each row corresponds to a team's performance in a specific game, with the following columns:
86+
87+
- `gameId`: Unique identifier for the game
88+
- `gameDateTimeEst`: Date and time of the game in Eastern Standard Time
89+
- `teamCity`: City of the team (e.g., "Los Angeles")
90+
- `teamName`: Name of the team (e.g., "Lakers")
91+
- `teamId`: Unique identifier for the team (NBA team ID)
92+
- `opponentTeamCity`: City of the opponent team (e.g., "Boston")
93+
- `opponentTeamName`: Name of the opponent team (e.g., "Celtics")
94+
- `opponentTeamId`: Unique identifier for the opponent team (NBA team ID)
95+
- `home`: Boolean indicating if the team was playing at home (0) or away (1)
96+
- `win`: Boolean indicating if the team won (1) or lost (0) the game
97+
- `season`: The NBA season (e.g., 2020 for the 2020-2021 season)
98+
- `gameType`: Regular Season or Playoffs
99+
- `teamScore`: Number of points scored by the team in the game
100+
- `opponentScore`: Number of points scored by the opponent team in the game
101+
- `numMinutes`: Duration of the game in minutes (should be 48 for regular season, but can be more for overtime games)
102+
- `assists`, `rebounds`, `reboundsDefensive`, `reboundsOffensive`, `blocks`, `steals`, `turnovers`: General statistics for the team in the game
103+
- `foulsPersonal`: Number of fouls committed by the team in the game
104+
- `fieldGoalsMade`, `fieldGoalsAttempted`, `threePointersMade`, `threePointersAttempted`, `freeThrowsMade`, `freeThrowsAttempted`: Shot statistics for the team in the game
105+
- `fieldGoalsPercentage`, `threePointersPercentage`, `freeThrowsPercentage`: Shooting percentages for the team in the game
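The `win`, `losses`, and `gamesPlayed` columns of team_seasons.csv can in principle be recomputed from these game-level rows, which is a convenient consistency check. The sketch below assumes `win` is stored as 0/1 (or "0"/"1" when read from CSV); the function name is hypothetical.

```python
from collections import Counter


def season_records(team_games):
    """Count wins, losses and games played per (teamId, season, gameType)."""
    wins, played = Counter(), Counter()
    for game in team_games:
        key = (game["teamId"], game["season"], game["gameType"])
        played[key] += 1
        wins[key] += int(game["win"])  # accepts 0/1, "0"/"1", or booleans
    return {
        key: {
            "win": wins[key],
            "losses": played[key] - wins[key],
            "gamesPlayed": played[key],
        }
        for key in played
    }
```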
106+
107+
108+
### team_metadata.csv
109+
110+
Contains metadata about teams, with the following columns:
111+
112+
- `teamId`: Unique identifier for the team (NBA team ID)
113+
- `teamAbbrev`: Abbreviation of the team name (e.g., "LAL" for Los Angeles Lakers)
114+
- `teamSlug`: Slug version of the team name (e.g., "los-angeles-lakers"), used to build the team logo URL https://i.logocdn.com/nba/{year}/{teamSlug}.svg
115+
- `Color1`, `Color2`, `Color3`, `Color4`, `Color5`: Colors associated with the team (in hexadecimal format, e.g., "#552583")
116+
- `seasonFounded`: The season in which the team was founded (e.g., 1947 for the first NBA season)
117+
- `seasonActiveTill`: The most recent season in which the team was active (e.g., 2100 for the active teams, or a past season for defunct teams)
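A small sketch of how `teamSlug`, `seasonFounded`, and `seasonActiveTill` might be used together. The URL template is the one given above; the helper names are hypothetical, and whether the logo host serves a file for every year has not been verified here.

```python
def team_logo_url(team_slug: str, year: int) -> str:
    """Fill the documented logo URL template with a slug and a year."""
    return f"https://i.logocdn.com/nba/{year}/{team_slug}.svg"


def was_active(season_founded: int, season_active_till: int, season: int) -> bool:
    """Assumed rule: a team is active from seasonFounded through seasonActiveTill.

    Active teams use the sentinel 2100 for seasonActiveTill, so any
    realistic season satisfies the upper bound.
    """
    return season_founded <= season <= season_active_till


url = team_logo_url("los-angeles-lakers", 2020)
```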

deliverables/MS2.md

Lines changed: 0 additions & 3 deletions
This file was deleted.

deliverables/instructions.md

Lines changed: 39 additions & 0 deletions
@@ -38,11 +38,50 @@ Please, fill the following sections about your project.
3838

3939
**10% of the final grade**
4040

41+
Two A4 pages describing the project goal.
42+
43+
- Include sketches of the visualization you want to make in your final product.
44+
- List the tools that you will use for each visualization and which (past or future) lectures you will need.
45+
- Break down your goal into independent pieces to implement. Try to design a core visualization (minimal viable product) that will be required at the end.
46+
- Then list extra ideas (more creative or challenging) that will enhance the visualization but could be dropped without endangering the meaning of the project.
47+
48+
Functional project prototype review.
49+
50+
- You should have an initial website running with the basic skeleton of the visualization/widgets.
4151

4252
## Milestone 3 (29th May, 5pm)
4353

4454
**80% of the final grade**
4555

56+
For the final milestone, you need to create a **cool, interactive, and sufficiently complex D3.js (and other) data viz** on the dataset you chose.
57+
Tell a data story and explain it to the targeted audience.
58+
Create a **process book** that details how you got there, from the original idea to the final product.
59+
60+
You need to deliver the following things:
61+
62+
- 1. GitHub repository with a README
63+
* Host the code and data on GitHub (if data is too big, link to a cloud storage) with your process book as a PDF file.
64+
* Add a README file that explains the technical setup and intended usage.
65+
* Code should be clean, manageable, and use the latest practices.
66+
- 2. Screencast
67+
* Demonstrate what you can do with your viz in a fun, engaging and impactful manner.
68+
* Talk about your main contributions rather than technical details.
69+
* **2 min video (not more).**
70+
- 3. Process book (max 8 pages)
71+
* Describe the path you took to obtain the final result.
72+
* Explain challenges that you faced and design decisions that you took.
73+
* Reuse the sketches/plans that you made for the first milestone, expanding them and explaining the changes.
74+
* Pay attention to the visual design of this report.
75+
* **Peer assessment:** include a breakdown of the parts of the project completed by each team member.
76+
77+
Grading Rubric:
78+
79+
- Visualization: 35%
80+
- Technical Implementation: 15%
81+
- Screencast: 25%
82+
- Process book: 25%
83+
84+
**Note:** Grades may vary across the team members, based on the process book and the peer assessment process. Please provide clear explanations.
4685

4786
## Late policy
4887

Lines changed: 6 additions & 6 deletions
@@ -51,17 +51,17 @@ Even for people who aren't analytics nerds, it will be fun to know how the game
5151
We already explored our three main datasets to get a sense of their structure, coverage, and quality.
5252
Here's what we found.
5353

54-
**Note**: You can check our full [data exploration notebook](../scripts/data_exploration/exploration.ipynb), but we also reported the main results below for your convenience.
54+
**Note**: You can check our full [data exploration notebook](../../scripts/data_exploration/exploration.ipynb), but we also reported the main results below for your convenience.
5555

56-
- **Player statistics**: The dataset contains over 1.6 million rows, covering games from November 1946 all the way to the current 2025-26 season. Each row represents a single player's performance in a single game, with 35 columns including points, assists, rebounds, shooting percentages, and +/- impact. As a quick sanity check, looking at the 2015-16 season, Stephen Curry tops the scoring chart at 30.1 PPG, followed by James Harden and Kevin Durant — which matches historical records perfectly.</br><img src="../scripts/data_exploration/images/2015_players_table.png" width="600">
56+
- **Player statistics**: The dataset contains over 1.6 million rows, covering games from November 1946 all the way to the current 2025-26 season. Each row represents a single player's performance in a single game, with 35 columns including points, assists, rebounds, shooting percentages, and +/- impact. As a quick sanity check, looking at the 2015-16 season, Stephen Curry tops the scoring chart at 30.1 PPG, followed by James Harden and Kevin Durant — which matches historical records perfectly.</br><img src="../../scripts/data_exploration/images/2015_players_table.png" width="600">
5757

58-
- **Team statistics**: Around 145'000 rows, also starting from 1946, with 48 columns per game entry. This includes all the usual box score stats plus richer game level data like fast break points, paint points, biggest lead, and lead changes. Looking at the 2015-16 season, the Warriors season stands out immediately with 73 wins, the all-time regular season record.</br><img src="../scripts/data_exploration/images/2015_teams_table.png" width="600">
58+
- **Team statistics**: Around 145'000 rows, also starting from 1946, with 48 columns per game entry. This includes all the usual box score stats plus richer game level data like fast break points, paint points, biggest lead, and lead changes. Looking at the 2015-16 season, the Warriors season stands out immediately with 73 wins, the all-time regular season record.</br><img src="../../scripts/data_exploration/images/2015_teams_table.png" width="600">
5959

60-
- **Shot Details**: This dataset only starts from the 1996 season (when the NBA began tracking spatial shot data), but it's by far the richest one in terms of granularity. For the 2015-16 season alone there are ~208'000 shot attempts, each with court coordinates (`LOC_X`, `LOC_Y`), shot zone, distance, and whether it went in. Shot locations are centered on the basket, with x coordinates ranging from -250 to 250 (in tenths of a foot) and y from slightly negative (behind the backboard) up to the half-court line.</br><img src="../scripts/data_exploration/images/2015_barplot_shot_details.png" width="600">
60+
- **Shot Details**: This dataset only starts from the 1996 season (when the NBA began tracking spatial shot data), but it's by far the richest one in terms of granularity. For the 2015-16 season alone there are ~208'000 shot attempts, each with court coordinates (`LOC_X`, `LOC_Y`), shot zone, distance, and whether it went in. Shot locations are centered on the basket, with x coordinates ranging from -250 to 250 (in tenths of a foot) and y from slightly negative (behind the backboard) up to the half-court line.</br><img src="../../scripts/data_exploration/images/2015_barplot_shot_details.png" width="600">
6161

62-
- **Awards**: The awards dataset has 3'465 entries across MVP, Defensive Player of the Year, Rookie of the Year, and more, going back to the 1956 season. It also includes vote share, so we can see not just who won but how dominant the win was (e.g. Nikola Jokić's 2021 MVP at 96.1% of votes). The All-Star selections go back to 1951 with 2'058 entries, and the End of Season Teams (All-NBA, All-Defensive, All-Rookie) cover 2'222 entries since 1947.</br><img src="../scripts/data_exploration/images/awards_table.png" width="600">
62+
- **Awards**: The awards dataset has 3'465 entries across MVP, Defensive Player of the Year, Rookie of the Year, and more, going back to the 1956 season. It also includes vote share, so we can see not just who won but how dominant the win was (e.g. Nikola Jokić's 2021 MVP at 96.1% of votes). The All-Star selections go back to 1951 with 2'058 entries, and the End of Season Teams (All-NBA, All-Defensive, All-Rookie) cover 2'222 entries since 1947.</br><img src="../../scripts/data_exploration/images/awards_table.png" width="600">
6363

64-
- **Key plots**: The heatmaps below illustrate the coverage and density of the data nicely. The player heatmap shows career scoring totals by season for the top 50 all-time scorers, with Wilt Chamberlain peaking at 4'029 points in a single season, a staggering number. The team heatmap shows win totals by franchise across every season, making historical dominant runs (the 80s Celtics/Lakers, the 90s Bulls, the mid-2010s Warriors) immediately visible.</br><img src="../scripts/data_exploration/images/heatmap_players.png" width="600"></br><img src="../scripts/data_exploration/images/heatmap_teams.png" width="600">
64+
- **Key plots**: The heatmaps below illustrate the coverage and density of the data nicely. The player heatmap shows career scoring totals by season for the top 50 all-time scorers, with Wilt Chamberlain peaking at 4'029 points in a single season, a staggering number. The team heatmap shows win totals by franchise across every season, making historical dominant runs (the 80s Celtics/Lakers, the 90s Bulls, the mid-2010s Warriors) immediately visible.</br><img src="../../scripts/data_exploration/images/heatmap_players.png" width="600"></br><img src="../../scripts/data_exploration/images/heatmap_teams.png" width="600">
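The shot coordinate convention described in the Shot Details item (origin at the basket, `LOC_X` in tenths of a foot) can be turned into a distance in feet as sketched below. This assumes `LOC_Y` uses the same unit as `LOC_X`, which the text above does not state explicitly; the function name is hypothetical.

```python
import math


def shot_distance_feet(loc_x: float, loc_y: float) -> float:
    """Straight-line distance from the basket, in feet.

    Assumes both coordinates are in tenths of a foot with the basket at (0, 0).
    """
    return math.hypot(loc_x, loc_y) / 10.0


# A shot at the edge of the documented x range, level with the basket:
sideline_distance = shot_distance_feet(-250, 0)
```

With x ranging from -250 to 250, this puts the sidelines 25 feet from the basket, consistent with a 50-foot-wide court.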
6565

6666
Overall, the **data is in great shape** with very **high quality**.
6767
There are some minor typing differences we need to be aware of, but nothing that requires heavy cleaning.
