The datastory: https://epfl-ada.github.io/ada-2025-project-othertagada/
The aim of this project is to understand the events and mechanisms in and around GamerGate, an online harassment campaign against feminism, diversity, and progressivism in video game culture that took place on multiple online platforms, including Reddit. We chose this topic because GamerGate functions as a "model", i.e., a blueprint for the coordinated, polarized hate campaigns we see on social media today. More specifically, we want to examine the evolution of the campaign, using data to uncover how the event unfolded across the platform and how various communities responded to, interacted with, and influenced each other during the escalation. The goal of our project is to better understand hate on the internet and, possibly, to find mitigation strategies.
- To what extent do linguistic patterns and user sentiment diverge between Pro-GamerGate (KiA) and Anti-GamerGate (GiA) communities? How does this compare to other related subreddits?
- Is the hostility and negative sentiment in GamerGate debates widely distributed across the user base, or disproportionately generated by a hyperactive core of "super-participants" who create the majority of posts?
- To what extent do Pro- and Anti-GamerGate communities function as linguistic echo chambers? Can a classifier distinguish between posts from r/KotakuInAction and r/GamerGhazi with high accuracy, indicating a distinct separation in vocabulary and rhetoric?
- Many articles, such as this post, suggest that GamerGate was just a strategy led by extremists to spread their ideas and recruit more people to their cause. Did that work? To what extent was the gaming community involved in the GamerGate controversy drawn towards alt-right spheres?
- What happens to subreddits created to discuss specific events once those events are no longer relevant? Do they become totally inactive? If not, what are the topics of discussion?
Pushshift: this dataset was generated by Pushshift, which uses the Reddit API to extract and save post data, creating a publicly accessible snapshot of Reddit (obtained here). We selected the same time period as the first dataset (Jan 2014 to April 2017). It contains a log and metadata of all posts made during that period, including usernames, titles, post bodies, and more. This enables much more analysis of the text content (for example, keyword analysis), of users (since we can track their posts by username), and of total post volume over time.
To make it easier to work with the Pushshift dataset, we only kept the following attributes:
| Label | Description |
|---|---|
| SUBREDDIT | The subreddit of the post |
| USERNAME | Username of the post author |
| POST_ID | Unique id of the post |
| TIMESTAMP | Time of the post |
| TITLE | Text of the post title |
| BODY_TEXT | Text of the post body |
| NUM_COMMENTS | Number of comments under the post |
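
As an illustration of this attribute selection, here is a minimal sketch. The raw field names on the left of `FIELD_MAP` (`author`, `selftext`, `created_utc`, ...) are assumed to be those of a Pushshift submission record, and `keep_attributes` is a hypothetical helper, not the project's actual script.

```python
from datetime import datetime, timezone

# Assumed mapping from raw Pushshift submission fields to the labels in the table above.
FIELD_MAP = {
    "subreddit": "SUBREDDIT",
    "author": "USERNAME",
    "id": "POST_ID",
    "created_utc": "TIMESTAMP",
    "title": "TITLE",
    "selftext": "BODY_TEXT",
    "num_comments": "NUM_COMMENTS",
}


def keep_attributes(raw_post):
    """Keep only the mapped fields of one raw post and rename them (hypothetical helper)."""
    post = {label: raw_post[field] for field, label in FIELD_MAP.items()}
    # The creation time is assumed to be Unix seconds (UTC); int() also accepts a numeric string.
    post["TIMESTAMP"] = datetime.fromtimestamp(int(post["TIMESTAMP"]), tz=timezone.utc)
    return post


# Toy record with one extra field ("score") that gets dropped.
raw = {
    "subreddit": "KotakuInAction",
    "author": "example_user",
    "id": "abc123",
    "created_utc": 1409529600,
    "title": "Example title",
    "selftext": "Example body",
    "num_comments": 3,
    "score": 10,
}
post = keep_attributes(raw)
```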
To make it easier to work with our datasets and to focus our attention on relevant data:
- Keep only the posts of the top 10 subreddits that interact the most with either r/KotakuInAction (the main pro-GamerGate subreddit) or r/GamerGhazi (its main anti-GamerGate counterpart).
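
One possible way to implement this filter is sketched below. It assumes that "interaction" is measured by counting hyperlinks between subreddits, given as `(source, target)` pairs, and that the two focus subreddits themselves are kept. The function names are hypothetical.

```python
from collections import Counter

FOCUS = {"kotakuinaction", "gamerghazi"}


def top_interacting(links, n=10):
    """Return the n subreddits exchanging the most links with a focus subreddit."""
    counts = Counter()
    for source, target in links:
        source, target = source.lower(), target.lower()
        if source in FOCUS and target not in FOCUS:
            counts[target] += 1
        elif target in FOCUS and source not in FOCUS:
            counts[source] += 1
    return [name for name, _ in counts.most_common(n)]


def filter_posts(posts, links, n=10):
    """Keep posts from the focus subreddits and their top-n interacting subreddits."""
    keep = FOCUS | set(top_interacting(links, n))
    return [p for p in posts if p["SUBREDDIT"].lower() in keep]


# Toy data: "gaming" exchanges two links with KiA, "games" one with GamerGhazi.
links = [
    ("KotakuInAction", "gaming"),
    ("gaming", "KotakuInAction"),
    ("GamerGhazi", "games"),
    ("pics", "funny"),
]
posts = [{"SUBREDDIT": "gaming"}, {"SUBREDDIT": "pics"}, {"SUBREDDIT": "KotakuInAction"}]
kept = filter_posts(posts, links, n=1)
```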
To understand the "players" and their relationships:
- User Similarity: A heatmap of user similarity is generated to measure the overlap of user bases between different subreddits.
- Clustering: Users are clustered using data_gamergate (large) based on behavioral features, including average link sentiment and LIWC.
- Activity Metrics: Histograms of posts per user are compared between KotakuInAction (KiA) and GamerGhazi (GiA), cross-referenced with link sentiment to identify if high-volume users drive negativity.
- Deleted Content Analysis: Comparison of deleted vs. non-deleted users/posts to check for differences using LIWC.
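
The user-similarity item can be illustrated with a short sketch. The similarity measure is not specified above, so the Jaccard index (shared users divided by total distinct users) is used here as an assumption. The resulting nested dictionary is the matrix a heatmap would display.

```python
from itertools import combinations


def user_sets(posts):
    """Group the distinct usernames by subreddit."""
    users = {}
    for p in posts:
        users.setdefault(p["SUBREDDIT"], set()).add(p["USERNAME"])
    return users


def jaccard_matrix(users):
    """Return sim[a][b] = |users_a & users_b| / |users_a | users_b| for every pair."""
    names = sorted(users)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        union = users[a] | users[b]
        score = len(users[a] & users[b]) / len(union) if union else 0.0
        sim[a][b] = sim[b][a] = score
    return sim


# Toy posts: KiA and GamerGhazi share one user out of three distinct users.
posts = [
    {"SUBREDDIT": "KotakuInAction", "USERNAME": "a"},
    {"SUBREDDIT": "KotakuInAction", "USERNAME": "b"},
    {"SUBREDDIT": "GamerGhazi", "USERNAME": "b"},
    {"SUBREDDIT": "GamerGhazi", "USERNAME": "c"},
    {"SUBREDDIT": "gaming", "USERNAME": "a"},
]
sim = jaccard_matrix(user_sets(posts))
```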
To map "how it played out" over time:
- Event Detection: Identification of spikes in total post volume correlated with a timeline of real-world GamerGate events.
- Dynamic Network Visualization: A network graph with a time slider to visualize the structural evolution of the community.
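
A simple way to flag volume spikes is sketched below. The rule used here (a day counts as a spike when it exceeds the mean of the previous `window` days by more than `threshold` standard deviations) is an assumption for illustration, and both parameter defaults are arbitrary.

```python
from statistics import mean, stdev


def detect_spikes(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose volume is far above the trailing window."""
    spikes = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Days following a perfectly flat window (sigma == 0) are skipped.
        if sigma > 0 and (daily_counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Toy series of posts per day with one obvious burst at index 7.
counts = [10, 12, 11, 9, 10, 11, 10, 60, 12, 11]
spike_days = detect_spikes(counts)
```

The flagged indices can then be mapped back to dates and compared with the event timeline.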
Linguistic & Sentiment Analysis

To analyze speech patterns and misogyny:
- Statistical Hypothesis Testing (T-tests): The distribution of LIWC categories (Sexual, Swear, Anger, Sad) in KiA is compared against the global distribution using T-tests (p-values) to quantitatively verify if the discourse in KiA is distinct (e.g., significantly more toxic or misogynistic).
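
The comparison can be sketched as follows, under two assumptions that are not stated above: the unequal-variance (Welch) form of the t-test is used, and the p-value is taken from a normal approximation to the t distribution, which is only reasonable for large samples such as thousands of posts. A statistics library would give the exact t-based p-value.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate two-sided p-value) for two samples."""
    var_a = variance(sample_a) / len(sample_a)
    var_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(var_a + var_b)
    # Normal approximation to the t distribution (large-sample assumption).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Toy per-post scores for one LIWC category (made-up numbers, for illustration only).
kia_scores = [0.9, 1.1, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9]
global_scores = [0.1, 0.2, 0.15, 0.05, 0.1, 0.2, 0.15, 0.1]
t_stat, p_value = welch_t(kia_scores, global_scores)
```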
Topic Modeling

To understand what was being discussed:
- TF-IDF Matrix: Used to weigh word importance within posts.
- Topic Extraction: Identification of "Top Topics" per subreddit and an analysis of topic evolution per month to see how the narrative shifted.
- Topic Classification: A model trained to predict the subreddit based on the topic, with accuracy used as a metric of discourse distinctiveness.
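
To make the TF-IDF weighting concrete, here is a minimal sketch. It assumes whitespace tokenization and the plain variant `tf * log(N / df)`. Library implementations typically use smoothed variants, so their numbers would differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Toy posts: "games" appears everywhere, so it carries no weight.
docs = ["ethics in games journalism", "games are fun", "harassment in games"]
weights = tfidf(docs)
```

Terms that appear in every post get a weight of zero, while terms specific to a few posts rank highest, which is what makes the matrix useful as input for topic extraction.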
| Week | Dates | Tasks |
|---|---|---|
| 1 | 06.11-12.11 | |
| 2 | 13.11-19.11 | |
| 3 | 20.11-26.11 | |
| 4 | 27.11-03.12 | |
| 5 | 04.12-10.12 | |
| 6 | 11.12-17.12 | |
Robin Herberich:
- Readme for P3
- Setting up datastory website, work on layout of website and general features
- Writing datastory introduction and data presentation
- Pushshift data wrangling, exploration, processing scripts, and related hyperlink dataset clean-up
- Extracting the posts-per-day-per-subreddit dataset and plotting posts per day for a few subreddits
Maguette Diouf:
- Prediction of link sentiment
- Analysis of feature importance in logistic regression
- Analysis of Gamergate Speech
Katia Häfliger:
- Writing structure of results notebook
- Analysis of Political implication of users
- Analysis of topics monthly
- Writing a script to create a .txt file from the post dataset for each subreddit
Matteo Simonet:
- Scraping of events data from an online timeline
- Selection of relevant subreddits for analysis
- Visualization of the network and volume of posts during the conflict, visualization of relationships between subreddits
- Analysis of political implications of users
Jérémie de Faveri:
- Styling the website
- Posts per user analysis: Does it follow the power law?
- Analysis: Difference in negativity between power and light users
- Analysis: Which subreddits are more moderated?