Welcome to the Social Media Classification Modeling for Policy Topics repository! This project focuses on collecting, processing, and classifying Reddit data to analyze discussions around various policy topics across different U.S. states. By leveraging advanced data scraping techniques, exploratory data analysis (EDA), manual labeling, and sophisticated classification models, this project aims to provide insightful classifications of social media conversations related to policy issues.
Current Dashboard of Multi-Label Model Output General Analysis can be viewed here: https://policy-ensemble-social-posts-classification.streamlit.app/
- Introduction
- Reddit Data Collection
- Exploratory Data Analysis (EDA)
- Image Handling and Sampling for Label Studio
- Labeling Process with Label Studio
- Classification Model Preprocessing
- Classification Model
- Results
- Contributing
- License
- Contact
- Acknowledgements
In the age of digital information, social media platforms like Reddit serve as rich sources of public opinion and discourse on a myriad of topics, including policy issues. Understanding and categorizing these discussions can provide valuable insights for policymakers, researchers, and the general public.
This project aims to systematically collect and analyze Reddit posts and comments from state-specific subreddits across the United States to classify discussions into predefined policy topics. The workflow encompasses several stages:
- Data Collection: Utilizing custom scripts to efficiently gather large-scale Reddit data while managing API rate limits.
- Exploratory Data Analysis (EDA): Conducting in-depth analysis to assess the suitability of the data for modeling, including trend detection and topic exploration.
- Image Handling and Sampling: Preprocessing images associated with Reddit posts and creating a representative sample for manual labeling using Label Studio.
- Labeling Process: Implementing a structured labeling workflow to categorize posts accurately, ensuring high-quality annotations through team collaboration and quality assurance measures.
- Classification Model Preprocessing: Preparing both labeled and unlabeled datasets for training machine learning models, involving text cleaning, normalization, and label encoding.
- Model Development: Building and fine-tuning multi-label classification models based on transformer architectures (RoBERTa) to accurately categorize posts into multiple policy areas.
- Evaluation and Results: Assessing model performance using robust metrics and presenting comprehensive results to validate the effectiveness of the classification approach.
By integrating these components, the project not only facilitates the classification of social media content but also provides a scalable framework for analyzing policy-related discussions across various platforms. Whether you're a data scientist, policy analyst, or researcher, this repository offers valuable tools and insights to explore the intersection of social media and policy discourse.
This project is licensed under the MIT License.
For any questions or feedback, please reach out to dforcade@gatech.edu.
- Special thanks to all of our annotators!
- Data Collection: Automate the large-scale collection of Reddit data, including posts and comments.
- Exploratory Data Analysis (EDA): Apply advanced techniques to validate and prepare the dataset for classification.
- Model Development: Create and fine-tune multi-label classification models to categorize posts into predefined policy areas.
- Human-in-the-Loop: Incorporate manual labeling and review processes to ensure high-quality datasets and nuanced model outputs.
The Reddit_Data_Scrapers folder contains scripts designed for efficient and large-scale collection of Reddit posts and comments from state-specific subreddits. These scripts utilize multiple Reddit API keys to manage rate limits and optimize asynchronous data fetching. They are configured to pull the top 600 threads/posts from the past year for each of the 50 state subreddits, and all of the comments (including nested comments) from those threads (around 4.5 million comments). The comments script takes around 16 hours to run; the posts script is much faster.
- `redditPostPull.py`
  - Purpose: Retrieves the top posts from specified state subreddits over the past year.
  - Output: Saves collected posts to a CSV file (`reddit_posts.csv`).
- `redditCommentPull.py`
  - Purpose: Fetches all comments for posts collected by `redditPostPull.py`.
  - Output: Saves comments to a CSV file (`reddit_comments.csv`), grouped by state.
  - Note: New York had an issue in data collection and has two specific scripts to append to the created dataframes.
1. Create Reddit Accounts:
   - Sign up for multiple Reddit accounts to obtain multiple API keys.
2. Register Applications:
   - Log in to each Reddit account and navigate to Reddit Apps.
   - Click "Create App" or "Create Another App".
   - Fill in the application name and select "script" as the type.
   - Set the redirect URI to `http://localhost`.
   - Note down the client ID and client secret (API key).
3. Organize API Keys:
   - Create a JSON file named `reddit_api_keys.json` in the `Reddit_Data_Scrapers` folder.
   - Structure the JSON file as follows:

   ```json
   {
     "group1": [
       { "client_id": "your_client_id_1", "api_key": "your_api_secret_1" },
       { "client_id": "your_client_id_2", "api_key": "your_api_secret_2" }
     ],
     "group2": [
       { "client_id": "your_client_id_3", "api_key": "your_api_secret_3" },
       { "client_id": "your_client_id_4", "api_key": "your_api_secret_4" }
     ]
   }
   ```

4. Environment:
   - Python 3.8+.
   - Install dependencies via `pip install -r requirements.txt`.
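As an illustration of how the grouped keys can be consumed for rotation, here is a minimal sketch; the helper names (`load_key_groups`, `key_group_cycle`) are hypothetical, not the actual scraper code:

```python
import itertools
import json

def load_key_groups(path="reddit_api_keys.json"):
    """Load the API key groups from the JSON file described above."""
    with open(path) as f:
        keys = json.load(f)
    return list(keys.values())  # e.g. [[{...}, {...}], [{...}, {...}]]

def key_group_cycle(groups):
    """Cycle through the groups endlessly so each request batch uses the next group."""
    return itertools.cycle(groups)

# Example: rotate between two groups of credentials
groups = [[{"client_id": "a"}], [{"client_id": "b"}]]
rotation = key_group_cycle(groups)
first, second, third = next(rotation), next(rotation), next(rotation)
# `third` wraps back around to the first group
```

Rotating whole groups rather than individual keys lets a batch of asynchronous requests share one group while the next batch moves on, which is one simple way to stay under per-key rate limits.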
Step 1: `redditPostPull.py` must be run first, as `redditCommentPull.py` utilizes the `post_id`s it creates.
- Script: `redditPostPull.py`
- Description:
  - Collects the top posts from state subreddits over the last year.
  - Rotates between multiple API key groups for rate-limited, asynchronous scraping.
- Output:
  - Saves posts to `reddit_posts.csv`
| Column | Description |
|---|---|
| `post_id` | Unique identifier of the Reddit post |
| `state` | Name of the subreddit (state) |
| `title` | Title of the post |
| `selftext` | Body text of the post |
| `created_utc` | UTC timestamp of when the post was created |
| `score` | Score (upvotes - downvotes) of the post |
| `url` | URL of the post |
| `num_comments` | Number of comments on the post |
| `author` | Username of the post's author |
- Script: `redditCommentPull.py`
- Description:
  - Collects all the comments from the top posts produced by `redditPostPull.py`.
  - Rotates between multiple API key groups for rate-limited, asynchronous scraping.
- Output:
  - Saves comments to `reddit_comments.csv`
| Column | Description |
|---|---|
| `post_id` | Identifier of the post to which the comment belongs |
| `state` | Name of the subreddit (state) |
| `comment_id` | Unique identifier of the comment |
| `body` | Text content of the comment |
| `created_utc` | UTC timestamp of when the comment was created |
| `score` | Score of the comment |
| `author` | Username of the comment's author |
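With both CSVs in hand, the two tables join on `post_id`. A quick pandas sketch with toy stand-in rows (not real scraped data):

```python
import pandas as pd

# Toy stand-ins for reddit_posts.csv / reddit_comments.csv
posts = pd.DataFrame({
    "post_id": ["p1", "p2"],
    "state": ["vermont", "montana"],
    "title": ["Foliage season", "Hospital closure"],
})
comments = pd.DataFrame({
    "post_id": ["p1", "p1", "p2"],
    "comment_id": ["c1", "c2", "c3"],
    "body": ["so pretty", "agreed", "sad news"],
})

# Left-join comments onto their parent posts via post_id
merged = comments.merge(posts, on="post_id", how="left")
print(merged[["post_id", "title", "body"]])
```

Because comments are already grouped by state in the output file, the join mainly matters when you need post-level context (title, score) alongside each comment.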
A quick overview of why we decided to invest in manual labeling and advanced classification models
Once the data was successfully scraped and validated, extensive EDA was run using several exploratory methods to determine whether this data would be a good candidate for modeling.
The first step was employing Allotaxonometry-style graphs on several test states to determine whether rough trends and differences could be detected in the data, or whether it was simply too noisy to be worth the trouble. With our EDA allotaxonometry, we were able to detect a downtick in a Marijuana Legalization trend tied to a legislative event that was losing steam, and in Vermont we were able to detect foliage-related terms trending going into the fall:
From there, we used exploratory topic modeling with BERTopic and KMeans clustering. When converting embeddings to t-SNE, we saw some promising results -- but nothing directly usable for our policy classification/modeling task. KMeans was struggling to differentiate in a meaningful way, and when BERTopic clusters were individually investigated, they were too fragmented for usable downstream analysis.
We also ran additional statistical tests on simple sentiment analysis between clusters and groups to determine whether there was validity to our intuition, and the results were statistically significant. With these (and a few more metrics/analyses), we decided that this data was a good candidate for manual labeling and transformer-based classification for our goal of identifying political topic discussion.
Full EDA Modeling report can be found here: EDA Modeling Report
To prepare for manual labeling of the social media data, several preprocessing steps need to be completed, namely image handling and sampling. In our dataset, 35% of Reddit posts contained images, many of which were crucial for determining context. To address this, the script `reddit_post_image_handling.py` uses `reddit_posts.csv` URLs to search for image extensions, retrieves images from Reddit post URLs, and saves them locally. Warning: this will easily be over 20GB of data, and the script will take several hours to run. We moved these to the cloud for hosting, but if you're working locally that's fine too.
reddit_post_image_handling.py
Key Features:
- Normalization of URLs: The script cleans and normalizes the URLs to address inconsistencies such as backslashes or whitespace.
- Filtering Valid Image Links: Only URLs pointing to supported image formats (`.jpg`, `.png`, `.gif`, etc.) are retained.
- Concurrent Downloading: Uses `asyncio` and `aiohttp` to download multiple images simultaneously, significantly reducing runtime.
- Retry Logic with Exponential Backoff: Handles rate limits and transient errors by retrying failed downloads with increasing delays.
Output:
Images are saved in the directory `post_images/`, with filenames corresponding to their respective `post_id` (e.g., `abc123.jpg`).
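The normalization and filtering steps might look something like this minimal sketch (illustrative helpers, not the actual script):

```python
from urllib.parse import urlparse

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}  # supported formats

def normalize_url(url: str) -> str:
    """Strip whitespace and fix backslash inconsistencies."""
    return url.strip().replace("\\", "/")

def is_image_url(url: str) -> bool:
    """Keep only URLs whose path ends in a supported image extension."""
    path = urlparse(normalize_url(url)).path.lower()
    return any(path.endswith(ext) for ext in IMAGE_EXTENSIONS)

print(is_image_url("https://i.redd.it/abc123.jpg"))   # True
print(is_image_url("https://reddit.com/r/vermont"))   # False
```

Checking the parsed path, rather than the raw string, avoids false positives when a query string happens to contain an extension.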
Once the images have been downloaded, you can proceed to `labelStudioSampleCreation.Rmd`, which handles Label Studio preprocessing and generates a stratified, proportional, and constrained sample with prioritization based on engagement metrics within each state. It also performs lemmatization on a dummy column to facilitate a keyword search for each policy topic of interest, to further drive representative sampling for all classes into the manual phase.
- Policy Area Classification
  Each post is classified into one of several predefined policy areas using a keyword-based matching approach.
  - Text Preprocessing: Titles and body text are lemmatized for better keyword matching using the `textstem` R library.
  - Keyword Matching: Policy areas are defined by a set of keywords, such as:
    - Health: `health`, `medicine`, `hospital`, `insurance`, etc.
    - Environment: `climate`, `pollution`, `wildlife`, etc.
  - Posts without a match are classified as Other / Uncategorized.
- State-Specific Sampling
  The sampling ensures a balanced representation across U.S. states while prioritizing relevance:
  - Minimum and Maximum Constraints: Each state contributes at least 90 posts but no more than 350.
  - Weighting by Engagement: Within those limits, sampling is weighted proportionally to the total comments per state.
  - Stratification by Policy Area: Posts are distributed across policy areas to maintain diversity in content.
- Post Selection Criteria
  Posts are prioritized based on engagement:
  - 80th Percentile Thresholds: Posts in the top 20% for each state by `num_comments` or `score` are prioritized for selection.
  - Random Sampling for Remaining Posts: To fill gaps, additional posts are randomly sampled within states, excluding duplicates.
- "Other / Uncategorized" Posts
  An additional 1,000 posts classified as "Other / Uncategorized" are included in the final dataset to ensure representation of general or miscellaneous topics.
- Total Sample Size: 6,000 posts, plus 1,000 Other / Uncategorized posts.
- Balanced Distribution: Ensures proportional representation of states and policy areas while maintaining diversity.
- Output File: The final dataset is saved as `final_sample.csv`.
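A simplified pandas sketch of the engagement-based prioritization (the real logic lives in the Rmd; the function and column names here mirror the post table but the selection is deliberately stripped down):

```python
import pandas as pd

def prioritize_posts(posts: pd.DataFrame, n: int) -> pd.DataFrame:
    """Take up to n posts, preferring the top 20% by num_comments or score."""
    q_comments = posts["num_comments"].quantile(0.80)
    q_score = posts["score"].quantile(0.80)
    high = posts[(posts["num_comments"] >= q_comments) | (posts["score"] >= q_score)]
    rest = posts.drop(high.index)
    if len(high) >= n:
        return high.head(n)
    # Fill the remaining slots with a random sample of lower-engagement posts
    fill = rest.sample(n=n - len(high), random_state=42)
    return pd.concat([high, fill])

posts = pd.DataFrame({
    "post_id": list("abcdefghij"),
    "num_comments": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "score": [10, 20, 30, 40, 50, 60, 70, 80, 90, 15],
})
sample = prioritize_posts(posts, n=4)  # 3 high-engagement posts plus 1 random fill
```

In the actual workflow this runs per state, with the minimum/maximum constraints and policy-area stratification layered on top.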
To verify the dataset's representativeness:
- State Distribution: The number of posts per state is visualized in a bar chart.
- Policy Area Distribution: Policy areas are similarly analyzed to confirm proportional representation.
- Comparison to Original Data: Distributions of the final sample are compared to the original dataset to highlight differences and ensure sampling goals are met.
Visualization Examples:
Graphs comparing the distribution of states and policy areas in the sampled dataset are included to validate the sampling process.
1. Prepare the Input Data
   Ensure `reddit_posts.csv` and `reddit_comments.csv` are available in the working directory.
2. Run Image Download
   Use `download_images.py` to fetch images linked in posts. Save them in `post_images/`.
3. Execute the Sampling Script
   Run `labelStudioSampleCreation.Rmd` to generate the balanced, ready-to-label dataset.
   - Output: `final_sample.csv`
| Column | Description |
|---|---|
| `post_id` | Unique identifier of the Reddit post |
| `state` | Name of the subreddit (state) |
| `title` | Title of the post |
| `selftext` | Body text of the post |
| `policy_area` | Classified policy area of the post |
| `num_comments` | Number of comments on the post |
| `score` | Score (upvotes - downvotes) of the post |
| `image_url` | URL of the image associated with the post |
To prepare the dataset for analysis, we used Label Studio for labeling Reddit posts. This process involved both single-label and multi-label classification tasks. We set ours up on a virtual machine on Google Cloud and uploaded the Reddit images downloaded by the previous script to a bucket on the Cloud. Those images were then efficiently fed into the Label Studio setup so that our annotators could have full context while labeling.
Ours can be seen here: http://34.23.190.214:8080/projects/
- Team Collaboration:
  - We recruited additional annotators and provided training to ensure consistent and high-quality annotations.
  - We provided a setup guide, training, and a reference guide for all labelers (including members of our team), which were displayed every time someone entered a project.
  - The reference guide provided in-depth category definitions and explanations to keep labeling consistent between annotators.
  - Label Studio Info: Label Studio Documentation
- Task Types:
  - Multi-Label Classification: 2,500 posts were labeled with one or more categories, allowing posts to belong to multiple policy areas or topics.
  - Single-Label Classification: 1,000 posts were labeled with exactly one category, simplifying the classification process.
- Data Preparation:
  - The sampled dataset (`final_sample.csv`) was uploaded to Label Studio. Each record included:
    - State: State subreddit the post was made in
    - Post Title: Title of the post
    - Image URL (if applicable): Visual content automatically displayed with posts from the Google Cloud bucket
    - Post Contents: Text the author posted along with the title, if any
- Label Studio Interface:
  - Each labeler was assigned tasks directly in Label Studio.
  - The interface included:
    - Large color annotation buttons for each category, to assist annotators' speed and comfort
    - Automated queue on Submit or Skip
    - Fully random delivery to keep things interesting for annotators and ensure distribution of class balance
- Quality Assurance:
  - An initial training phase allowed labelers to familiarize themselves with the task.
  - Randomly selected posts were reviewed to ensure labeling consistency.
  - Individual annotator results were evaluated with Cohen's Kappa statistics.
- Labeled Dataset:
  - After completion, the labeled data was exported from Label Studio in CSV format for further analysis.
- Label Summary:
  - A summary of labeled categories, including frequency and distribution, was generated for exploratory analysis.
  - Save this file as `Classification_Model/data/raw/labeled_reddit_posts_raw.csv`.
  - Bring in an additional copy of `reddit_posts.csv` and place it at `Classification_Model/data/raw/reddit_posts_raw.csv`.
The classification model preprocessing pipeline is implemented in Python to prepare Reddit post and comment data for classification tasks. This includes handling both labeled and unlabeled datasets, ensuring they are cleaned, normalized, and formatted for multi-label and single-label classification. Only Posts will be covered here due to the large file size and computing requirements for handling the 4.5 million comments.
The preprocessing process is orchestrated by the script Classification_Model/main_preprocessing.py, which utilizes helper modules to automate and modularize tasks.
- Labeled Data Preprocessing:
  - Parses multi-label topics from raw data fields in the Label Studio export.
  - Cleans and combines text fields (title + body text).
  - Converts multi-label topics into a binary label matrix for machine learning.
  - Assigns a primary label to each post for single-label classification using a prioritization strategy (note: no single-label models were moved to production).
  - Outputs a processed dataset for labeled posts (`Classification_Model/data/processed/labeled_reddit_posts_processed.csv`).
- Unlabeled Data Preprocessing:
  - Filters out posts already labeled to avoid duplication.
  - Cleans and combines text fields (title + body text).
  - Initializes placeholders for topic and label fields.
  - Outputs a processed dataset for unlabeled posts (`Classification_Model/data/processed/unlabeled_reddit_posts_processed.csv`).
- Topic Distribution Analysis:
  - Computes and prints the distribution of primary labels.
  - Counts the occurrences of all topics across the dataset (multi-label).
  - Outputs summary statistics to a CSV (`Classification_Model/data/processed/total_topic_occurrences.csv`).
The entry point for the preprocessing pipeline:
- Calls labeled and unlabeled data preprocessing functions.
- Outputs processed datasets and summary statistics.
Utility functions used throughout the pipeline:
- `clean_text()`: Cleans text by removing URLs, special characters, and unnecessary spaces, and converts it to lowercase.
- `assign_single_label()`: Assigns a single primary label to a post, prioritizing non-governmental labels if multiple are present.
- `normalize_topics()`: Ensures consistent formatting for topic labels.
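A minimal sketch of what `clean_text()` might look like — an illustrative regex-based version, not the repository's exact implementation:

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, strip URLs and special characters, collapse whitespace."""
    text = str(text).lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # remove special characters
    text = re.sub(r"\s+", " ", text).strip()             # collapse extra spaces
    return text

print(clean_text("Check THIS out: https://redd.it/abc123 !!"))
# check this out
```

Removing URLs before stripping special characters matters: doing it in the other order would leave URL fragments behind as ordinary tokens.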
Handles preprocessing for the labeled dataset:
- Parses and normalizes multi-label topics.
- Converts topics to a binary matrix using `MultiLabelBinarizer`.
- Cleans text fields and combines title and body.
- Assigns a primary label to each post for single-label classification.
Handles preprocessing for the unlabeled dataset:
- Filters out posts already present in the labeled dataset.
- Cleans text fields and combines title and body.
- Initializes empty labels for future annotation.
- Processed Labeled Dataset (`labeled_reddit_posts_processed.csv`):
  - Includes cleaned text fields, binary topic labels, and primary labels.
- Processed Unlabeled Dataset (`unlabeled_reddit_posts_processed.csv`):
  - Includes cleaned text fields and placeholders for future labeling.
- Summary Statistics:
  - Total Topic Occurrences (`total_topic_occurrences.csv`): Provides the count of each topic across all posts.
  - Primary Label Distribution: Displays the count of posts per primary label (printed to console).
1. Prepare the Raw Data:
   - Place the raw labeled and unlabeled datasets in the `Classification_Model/data/raw/` directory:
     - `labeled_reddit_posts_raw.csv`
     - `reddit_posts_raw.csv`
2. Run the Main Preprocessing Script:
   `python main_preprocessing.py`
3. Outputs:
   - Both processed CSVs are written to the `Classification_Model/data/processed/` directory:
     - `labeled_reddit_posts_processed.csv`
     - `unlabeled_reddit_posts_processed.csv`
We developed two multi-label text classification models to categorize Reddit posts into multiple policy areas -- an ensemble roberta-base and an ensemble fusion roberta-large. The ensemble fusion model is currently still in development and not ready for release at this time. The nonFusion model is performing well on the multi-label policy classification task and is provided here.
To run the model:
1. Make sure you have `labeled_reddit_posts_processed.csv` in the `Classification_Model/data/processed/` directory.
2. Go to `Classification_Model/nonFusion`.
3. Install `requirements.txt`.
4. Run `nonFusion.py`.
- MultiLabelBinarizer: We utilized scikit-learn's `MultiLabelBinarizer` to convert the list of policy area labels for each post into a binary matrix. This matrix is suitable for multi-label classification tasks, where each post can belong to multiple classes.
- Text Cleaning: Combined the post's title and body text into a single string. We performed cleaning steps such as:
  - Removing URLs
  - Removing special characters
- Tokenization: Used `RobertaTokenizer` with a maximum sequence length of 128 tokens to tokenize the cleaned text.
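For illustration, the label-to-matrix conversion works like this (toy labels, not the project's full category set):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each post may carry several policy-area labels
labels = [
    ["Health"],
    ["Environment", "Economy"],
    ["Health", "Economy"],
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

print(mlb.classes_)  # ['Economy' 'Environment' 'Health'] (sorted alphabetically)
print(y)
# [[0 0 1]
#  [1 1 0]
#  [1 0 1]]
```

The fitted `mlb` must be saved alongside the model (the training pipeline pickles it as `mlb_multi_label.pkl`) so that predictions can be decoded back to label names with `inverse_transform`.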
We fine-tuned the pre-trained roberta-base model from Hugging Face's Transformers library for our classification task.
- Model Modification:
  - Set `problem_type="multi_label_classification"` to adapt the model for multi-label outputs.
  - Adjusted the output layer to match the number of policy area classes.
Due to class imbalance in our dataset, we implemented the Focal Loss function to focus training on hard-to-classify examples.
The Focal Loss is defined as:

FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)

where:
- `p_t` is the model's estimated probability for the true class.
- `alpha_t` is the weighting factor for the class (we set `alpha = 0.25`).
- `gamma` is the focusing parameter (we set `gamma = 2`).
We customized the Focal Loss to handle the class imbalance in our dataset. The implementation in PyTorch is as follows:
```python
import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    def __init__(self, gamma=2, alpha=0.25):
        super(FocalLoss, self).__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, logits, labels):
        probs = torch.sigmoid(logits)                                     # Convert logits to probabilities
        ce_loss = nn.BCEWithLogitsLoss(reduction='none')(logits, labels)  # Element-wise binary cross-entropy
        pt = torch.where(labels == 1, probs, 1 - probs)                   # Probability of the true class
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss        # Apply focal loss formula
        return focal_loss.mean()                                          # Mean loss over the batch
```
This loss function reduces the relative loss for well-classified examples and focuses on those that the model struggles with.
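To see the down-weighting numerically, here is a quick per-example check using the same formula in plain Python (alpha = 0.25, gamma = 2, as above; this is a standalone illustration, not part of the training code):

```python
import math

def focal_loss_binary(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Focal loss for a single binary prediction with probability p and label y."""
    pt = p if y == 1 else 1 - p           # probability of the true class
    ce = -math.log(pt)                    # binary cross-entropy
    return alpha * (1 - pt) ** gamma * ce # focal modulation term

easy = focal_loss_binary(p=0.95, y=1)   # well-classified positive
hard = focal_loss_binary(p=0.30, y=1)   # misclassified positive
print(f"easy: {easy:.5f}, hard: {hard:.5f}")
# The hard example contributes orders of magnitude more loss than the easy one
```

With `gamma = 2`, the `(1 - p_t)^2` factor shrinks the well-classified example's loss by a factor of 400 relative to plain cross-entropy, which is exactly what pushes training toward the hard cases.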
- 5-Fold Multilabel Stratified Cross-Validation: Ensured that each fold has a similar distribution of labels, crucial for multi-label datasets.
- Early Stopping: Monitored the validation Micro F1 score to prevent overfitting. Training stops if the performance doesn't improve for a specified number of epochs (patience of 3 epochs).
- Optimizer: Used `AdamW` with weight decay for regularization.
- Layer-wise Learning Rates: Applied different learning rates to different layers:
  - Embeddings and lower layers: `1e-5`
  - Middle layers: `1e-5`
  - Higher layers: `2e-5`
  - Classification head: `3e-5`
- Learning Rate Scheduler: Employed a linear learning rate scheduler with a warm-up phase (10% of total steps).
In multi-label classification, it's essential to determine the optimal threshold to convert predicted probabilities into binary labels.
For each class `c`, we determined the threshold `t_c` that maximizes the chosen metric (e.g., F1 or Micro F1 score) on the validation set.
The process involved:
- For each class, iterate over possible thresholds in the range [0.1, 0.9] with steps of 0.01.
- For each threshold, compute the metric (e.g., F1 score) using the validation data.
- Select the threshold that yields the highest metric value.
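The search described above can be sketched as follows (illustrative, using scikit-learn's `f1_score`; the actual training script may structure this differently):

```python
import numpy as np
from sklearn.metrics import f1_score

def find_optimal_thresholds(y_true: np.ndarray, y_probs: np.ndarray) -> np.ndarray:
    """Per-class threshold search over [0.1, 0.9] in steps of 0.01, maximizing F1."""
    n_classes = y_true.shape[1]
    thresholds = np.full(n_classes, 0.5)
    for c in range(n_classes):
        best_f1 = -1.0
        for t in np.arange(0.1, 0.91, 0.01):
            preds = (y_probs[:, c] >= t).astype(int)
            f1 = f1_score(y_true[:, c], preds, zero_division=0)
            if f1 > best_f1:
                best_f1, thresholds[c] = f1, t
    return thresholds

# Toy validation data: the single class separates cleanly between 0.4 and 0.7
y_true = np.array([[1], [1], [0], [0]])
y_probs = np.array([[0.9], [0.7], [0.4], [0.2]])
print(find_optimal_thresholds(y_true, y_probs))
```

Searching per class (rather than one global threshold) is what lets rare classes with systematically lower probabilities still be predicted.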
```python
import numpy as np

def apply_thresholds_with_limit_dynamic(y_probs, thresholds, max_labels=3):
    binary_preds = np.zeros_like(y_probs)
    for i in range(y_probs.shape[0]):
        probs = y_probs[i]
        # Get indices sorted by probability in descending order
        sorted_indices = np.argsort(probs)[::-1]
        selected_indices = []
        for idx in sorted_indices:
            if probs[idx] >= thresholds[idx]:
                selected_indices.append(idx)
            if len(selected_indices) == max_labels:
                break
        # Assign labels for the selected classes
        binary_preds[i, selected_indices] = 1
    return binary_preds
```
To ensure that each instance is assigned a realistic number of labels, we applied the thresholds while limiting the maximum number of labels per instance to k (we used k=3).
The procedure is:
- For each instance:
  - Obtain the predicted probabilities for all classes.
  - Sort the classes by their predicted probabilities in descending order.
  - Assign labels:
    - Iterate through the sorted classes.
    - Assign a label if the predicted probability exceeds the class threshold.
    - Stop assigning labels once `k` labels have been assigned.
This method ensures that the most confident predictions (above the threshold) are selected, up to the maximum number of labels per instance.
We evaluated our model using several metrics suitable for multi-label classification:
- Micro F1 Score: Measures the F1 score globally by counting the total true positives, false negatives, and false positives.
- Macro F1 Score: Averages the F1 score per class without considering class imbalance.
- Weighted F1 Score: Averages the F1 score per class, weighted by the number of true instances per class.
- Hamming Loss: Fraction of labels that are incorrectly predicted.
- Jaccard Index: Also known as Intersection over Union, measures the similarity between the predicted and true label sets.
- Per-Class Precision, Recall, and F1 Scores: Provides insight into the model's performance on individual classes.
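All of these metrics are available in scikit-learn; a small sketch on toy predictions (the numbers here are illustrative, not our model's output):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, jaccard_score

# Toy multi-label ground truth and predictions (3 samples, 3 classes)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0]])

print("Micro F1:     ", f1_score(y_true, y_pred, average="micro"))
print("Macro F1:     ", f1_score(y_true, y_pred, average="macro"))
print("Weighted F1:  ", f1_score(y_true, y_pred, average="weighted"))
print("Hamming Loss: ", hamming_loss(y_true, y_pred))
print("Jaccard Index:", jaccard_score(y_true, y_pred, average="samples"))
print("Per-class F1: ", f1_score(y_true, y_pred, average=None))
```

The `average` parameter is what distinguishes the micro/macro/weighted variants; `average=None` returns the per-class scores reported in the Results section.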
- The best model per fold was saved based on the highest validation Micro F1 score.
- Optimal thresholds per class were saved for each fold.
- The fold models were added to an ensemble for final evaluation.
- Ensemble Model: Combined the models from each fold using weighted averaging based on their validation performance.
- Threshold Aggregation: The optimal thresholds from each fold were averaged to obtain ensemble thresholds.
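The ensemble combination is, in essence, a weighted average of fold probabilities plus a plain average of fold thresholds. Illustratively (the fold probabilities and validation scores below are made up):

```python
import numpy as np

# Per-fold predicted probabilities for 2 samples x 3 classes (toy values)
fold_probs = np.array([
    [[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]],   # fold 1
    [[0.7, 0.4, 0.8], [0.2, 0.6, 0.5]],   # fold 2
])
# Hypothetical validation Micro F1 per fold, normalized into ensemble weights
val_scores = np.array([0.74, 0.70])
weights = val_scores / val_scores.sum()

# Weighted average over the fold axis, then average the per-fold thresholds
ensemble_probs = np.tensordot(weights, fold_probs, axes=1)
fold_thresholds = np.array([[0.45, 0.50, 0.40], [0.55, 0.40, 0.50]])
ensemble_thresholds = fold_thresholds.mean(axis=0)

print(ensemble_probs.shape)   # (2, 3)
print(ensemble_thresholds)    # [0.5  0.45 0.45]
```

Weighting by validation performance lets stronger folds contribute more, while averaging the thresholds keeps the decision rule consistent with how each fold was calibrated.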
- Test Metrics:
| Metric | Score |
|---|---|
| Micro F1 | 0.7359 |
| Macro F1 | 0.7042 |
| Weighted F1 | 0.7359 |
| Hamming Loss | 0.0637 |
| Jaccard Index | 0.6711 |
This section details the various outputs generated by the model:
- Location: `./artifacts/models/`
- Content:
  - Best-performing model for each fold during k-fold cross-validation.
  - Saved using Hugging Face's `save_pretrained` method, including both the model weights and tokenizer.
- Location: `./artifacts/thresholds/`
- Content:
  - Per-Fold Thresholds:
    - Files: `fold_1_optimal_thresholds.pkl`, `fold_2_optimal_thresholds.pkl`, etc.
    - Description: Optimal thresholds for each class obtained during cross-validation.
  - Ensemble Thresholds:
    - File: `ensemble_optimal_thresholds.pkl`
    - Description: Average thresholds across all folds, used for final test evaluation.
- Location: `./artifacts/metrics/`
- Content:
  - Per-Fold Metrics:
    - Includes metrics like Micro F1, Macro F1, Weighted F1, Hamming Loss, and Jaccard Index for each fold.
  - Test Set Metrics:
    - File: `ensemble_test_metrics.pkl`
    - Description: Metrics calculated on the test set using the ensemble of models.
- Location: `./artifacts/mlb_multi_label.pkl`
- Content:
  - A `MultiLabelBinarizer` object fitted on the training dataset.
  - Used for encoding and decoding multi-label targets.
- Location: `training.log`
- Content:
  - Detailed logs of the training process, including:
    - Data preprocessing steps.
    - Training and validation metrics for each epoch and fold.
    - Final test set evaluation metrics.
- Content:
- Aggregated probabilities and binary predictions from the ensemble models.
- Predictions stored in memory for evaluation metrics and threshold application.
This script uses the ensemble of fine-tuned RoBERTa models to perform multi-label classification on unlabeled Reddit posts and comments. It assigns labels based on confidence thresholds, computes a confidence score for Human-in-the-Loop review, selects high-confidence samples, and visualizes the results.
- Ensemble Predictions: Combines predictions from multiple models using weighted averaging.
- Label Assignment: Assigns multi-label classifications using class-specific thresholds while limiting the number of labels per instance.
- Confidence Scoring: Computes overall confidence scores for predictions using methods such as average or maximum probability.
- High-Confidence Sample Selection: Selects the top percentage of samples based on confidence scores for further analysis or manual labeling.
- t-SNE Visualizations: Reduces prediction probabilities to two dimensions for visualization, including both topic-specific and combined plots.
- Contrastive Word Clouds: Generates word clouds for each predicted topic, highlighting distinctive terms using contrastive TF-IDF scores.
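The confidence scoring and high-confidence selection can be sketched as follows (illustrative helpers; the script's actual function names and settings live in its `CONFIG` dictionary):

```python
import numpy as np

def confidence_scores(probs: np.ndarray, method: str = "mean") -> np.ndarray:
    """Overall confidence per sample: mean or max of its predicted probabilities."""
    return probs.mean(axis=1) if method == "mean" else probs.max(axis=1)

def select_high_confidence(probs: np.ndarray, top_frac: float = 0.25) -> np.ndarray:
    """Indices of the top fraction of samples ranked by confidence score."""
    scores = confidence_scores(probs)
    k = max(1, int(len(scores) * top_frac))
    return np.argsort(scores)[::-1][:k]

probs = np.array([
    [0.9, 0.8, 0.1],    # confident
    [0.5, 0.5, 0.5],    # middling
    [0.2, 0.1, 0.3],    # uncertain
    [0.95, 0.9, 0.7],   # very confident
])
print(select_high_confidence(probs, top_frac=0.5))  # indices of the 2 most confident rows
```

The complementary low-confidence tail of this ranking is what feeds the Human-in-the-Loop review queue.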
- CSV Files:
  - Predictions with assigned labels and confidence scores.
  - High-confidence samples.
  - Data with t-SNE coordinates.
- Visualizations:
  - Topic-specific and combined t-SNE plots.
  - Contrastive word clouds for each topic.
- Log File:
  - Detailed logs of the prediction workflow.
- Pre-trained ensemble models, thresholds, and `MultiLabelBinarizer` saved during training.
- Unlabeled data in `/data/processed/unlabeled_reddit_posts_processed` (from the Classification Preprocessing step).

Usage:
- Configure the paths and settings in the `CONFIG` dictionary within the script.
- Run the script: `python nonFusion_Predict.py`
We've had increasingly competitive performance on both Fusion and Non-Fusion ensemble models, where we have been tuning hyperparameters as well as utilizing Human-In-The-Loop active learning to target both low-confidence predictions and increasing training data for underperforming or under-represented classes. We're genuinely excited by the performance of the models and classification results, and believe we're only scratching the surface in terms of further improvement.
Our t-SNE of the embeddings shows several clear policy clusters -- and serious improvement over the unsupervised methods explored in our EDA process.
Non-Fusion Multi-Label Classification RoBERTa-base Model Per-Class Metrics:

| Class | F1 | Precision | Recall |
|---|---|---|---|
| Class 0 | 0.8077 | 1.0000 | 0.6774 |
| Class 1 | 0.7541 | 0.7077 | 0.8070 |
| Class 2 | 0.7024 | 0.6629 | 0.7468 |
| Class 3 | 0.7273 | 0.6471 | 0.8302 |
| Class 4 | 0.7945 | 0.8529 | 0.7436 |
| Class 5 | 0.7879 | 0.8966 | 0.7027 |
| Class 6 | 0.8333 | 0.8075 | 0.8609 |
| Class 7 | 0.7797 | 0.8519 | 0.7188 |
| Class 8 | 0.5870 | 0.5510 | 0.6279 |
| Class 9 | 0.6818 | 0.7500 | 0.6250 |
| Class 10 | 0.4912 | 0.3889 | 0.6667 |
| Class 11 | 0.5000 | 0.5833 | 0.4375 |
| Class 12 | 0.7368 | 0.8333 | 0.6604 |
Our most difficult classification task lies between Other / Uncategorized and Culture and Recreation. Since we're not specifically filtering for topics or keywords in our scrape, many posts and conversations have nothing to do with policy areas relating to political bill topics, so this is somewhat expected. However, these posts often straddle a thin but important line with Culture and Recreation posts.
https://policy-ensemble-social-posts-classification.streamlit.app/
While the current models demonstrate strong performance, we see several avenues for future development and improvement:
- RoBERTa-Large Fusion Model: Complete the development and integration of the ensemble fusion RoBERTa-Large model, which is anticipated to offer enhanced performance due to its larger capacity and fusion techniques.
- Hyperparameter Optimization: Conduct more extensive hyperparameter tuning for both existing and new models using techniques like Bayesian optimization or more exhaustive grid searches.
- Addressing Class Imbalance and Difficult Distinctions:
- Further refine strategies for handling class imbalance, potentially exploring advanced data augmentation or re-sampling techniques.
- Investigate methods to better distinguish between closely related or ambiguous categories, particularly "Other / Uncategorized" and "Culture and Recreation." This might involve more nuanced feature engineering, hierarchical classification, or targeted data collection for these specific classes.
- Expanding Policy Categories: Explore the possibility of adding new policy categories or refining existing ones based on evolving discourse or specific research needs.
- Temporal Analysis: Incorporate temporal analysis to track how discussions around policy topics evolve over time.
- Comment-Level Classification: Extend the classification capabilities to individual comments, which presents unique challenges due to shorter text and conversational context, but could provide finer-grained insights.
- Scalability and Efficiency: Optimize data processing and model inference pipelines for even larger datasets and real-time or near real-time analysis.
- Enhanced Dashboard Features: Add more interactive visualizations and analytical tools to the Streamlit dashboard, such as trend analysis, comparative views between states, or deeper dives into specific topics.
Description:
Topics related to health, healthcare services, public health initiatives, and medical research.
Example:
"Another public hospital closes in Montana, the third this year."
Description:
Covers armed forces, national defense, homeland security, and military policies.
Example:
"Iβm worried that China may come and steal my goats in the night, is that possible? Do they like goats?"
Description:
Includes crime prevention, law enforcement, policing, and emergency management.
Example:
"Third officer arrested in New York this week on corruption charges."
Description:
Focuses on international relations, foreign trade, diplomacy, and international finance.
Example:
"Vermont tightens border regulations with Canada, will maple syrup prices go up?"
Description:
Topics on government operations, legislation, law, political processes, and congressional matters.
Example:
"State congress motions for unlimited snack budget."
Description:
Encompasses topics related to financial stability, economic growth, labor policies, and trade practices that impact citizensβ day-to-day lives and the overall economy.
Example:
"If our property taxes go up again this year, Iβm moving to the moon. I mean it this time, Elon is really making progress on the moon."
Description:
Covers environmental protection, natural resources, energy, and water resource management.
Example:
"Historic flood washes away brand new solar panel installations."
Description:
Covers education, social welfare, housing, family support, and social sciences.
Example:
"Affordable housing is impossible to find right now in our state!"
Description:
Includes agriculture, farming policies, food production, and food safety.
Example:
"Organic farming takes a big hit this year, due to the wow-crop-delicious insect boom."
Description:
Topics on scientific research, technological advancements, and communication systems.
Example:
"Comcast sues small family-owned telephone maker in Florida."
Description:
Focuses on immigration policies, civil rights, minority issues, and Native American matters.
Example:
"This is crazy, my son canβt even get a job at Fast Food Express due to the recent influx of Swedish Meatball Farmers from Portugal."
Description:
Covers transportation systems, public works, and infrastructure development.
Example:
"I swear to god if they donβt fix these potholes Iβm going to write another strongly written letter."
Description:
Includes arts, culture, religion, sports, recreational activities, and animal-related topics.
Example:
"I love these moose. Iβm so glad we can own 5 now legally."
Description:
Use this label if the content does not fit into any specific category or is uncategorized.
Example:
"The post discusses personal opinions on various unrelated topics without a clear topic focus."






