This project scrapes data from German Rap subreddits, processes the text, trains a Word2Vec model, performs sentiment analysis on mentions of rappers using an LLM, and analyzes the results, including temporal trends.
- Reddit Data Scraping: Fetches posts and comments from specified subreddits using Pushshift and PRAW APIs.
- Text Processing: Cleans raw Reddit text, removing noise like URLs, markdown, and bot messages.
- N-gram Generation: Identifies and creates meaningful bigrams and trigrams (e.g., "kool_savas", "k_i_z").
- Word Embedding: Trains a Word2Vec model on the processed Reddit corpus.
- Rapper Alias Discovery: Uses the Word2Vec model to find potential aliases or related terms for known rappers.
- Sentiment Analysis: Utilizes an LLM (via Ollama) to assign sentiment scores (1-5) to text snippets mentioning specific rappers.
- Data Storage: Stores sentiment results, including timestamps, in an SQLite database.
- Evaluation Framework: Includes tools for evaluating LLM performance against a manually annotated test set and comparing different models/prompts.
- Temporal Analysis: Links sentiment data with original timestamps for time-based analysis.
- Visualization: Generates interactive plots for word embeddings and sentiment analysis results.
Prerequisites:
- Python 3.x
- Git
Clone the Repository:

```bash
git clone <your-repository-url>
cd <repository-directory>
```
Create a Virtual Environment:

```bash
python -m venv .venv

# Activate the environment
# Windows:
.\.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
```
Install Dependencies:

```bash
pip install -r requirements.txt
```
Environment Variables:
- Create a `.env` file in the `Step 1 Reddit Scraper` directory.
- Add the following required variables (obtain credentials from Reddit and Pushshift):

```
# Reddit API Credentials (PRAW)
CLIENT_ID=your_reddit_client_id
CLIENT_SECRET=your_reddit_client_secret
REFRESH_TOKEN=your_reddit_refresh_token
USER_AGENT='YourAppDescription by /u/YourUsername'

# Pushshift API Credentials
PUSHSHIFT_ACCESS_TOKEN=your_pushshift_access_token

# Scraper Configuration (Optional - Defaults shown)
SUBREDDITS=germanrap   # Comma-separated if multiple
LIMIT=1000000
SINCE=2010-01-01
UNTIL=YYYY-MM-DD       # Defaults to current date
```

- You might need another `.env` file in the root or in `Step 3 Sentiment Analysis` for the `DISCORD_WEBHOOK_URL` if you want crash notifications.
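The scripts read these settings from the environment at startup. If you want to sanity-check a `.env` file before running the scraper, here is a minimal stdlib sketch of the `KEY=value` format shown above (the project itself may load it differently, e.g. via python-dotenv):

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Drop an inline comment and surrounding quotes, if any
        value = value.split(" #")[0].strip().strip("'\"")
        env[key.strip()] = value
    return env
```

For example, `parse_env(Path(".env").read_text())` yields a plain dict you can inspect before the scraper runs.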
Ollama:
- Ensure you have Ollama installed and running.
- Pull the required LLM model (e.g., `ollama pull qwen2.5:3b`). The model used for sentiment analysis is specified in `Step 3 Sentiment Analysis/sentiment-analysis.py`.
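The actual prompt and response handling live in `sentiment-analysis.py`; the snippet below is only an illustrative sketch of one part of such a pipeline, turning a free-form LLM reply into a 1-5 score (the `NO_SENTIMENT` sentinel mirrors the value the cleaning script later converts to 3):

```python
import re

def extract_score(reply: str):
    """Pull a 1-5 sentiment score out of a free-form LLM reply.

    Returns the integer score, or "NO_SENTIMENT" when no digit in
    range is found. This assumes the model was prompted to answer
    with a single digit; adjust to the real prompt format.
    """
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else "NO_SENTIMENT"
```

Parsing defensively like this matters because small local models often wrap the score in extra prose.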
The project is structured in sequential steps. Run the main script within each step's directory in order.
Gather Rapper List (Supporting):
- Run scripts in `Supporting - List of Rappers/` (Spotify, Wikipedia) to generate an initial list of artists (e.g., `all_artists.txt`).
Step 1: Scrape Reddit Data:
- Navigate to `Step 1 Reddit Scraper/`.
- Ensure `.env` is configured correctly.
- Run: `python mainscript.py`
- Output: JSON files containing post and comment data in `Step 1 Reddit Scraper/1-posts/`.
Step 2.1: Prepare Text:
- Navigate to `Step 2.1 Prepare Text for Word2Vec/`.
- Run: `python reddit_text_extraction_for_word2vec.py`
- Output: Cleaned sentences in `2_1-processed_sentences.txt`.
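The exact cleaning rules live in `reddit_text_extraction_for_word2vec.py`; as a rough, hedged sketch of the kind of noise removal described (URLs and markdown), not the project's actual code:

```python
import re

def clean_reddit_text(text: str) -> str:
    """Strip URLs and common markdown noise from a Reddit comment."""
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)  # [label](url) -> label
    text = re.sub(r"https?://\S+", " ", text)             # bare URLs
    text = re.sub(r"[*_~^>`]+", " ", text)                # markdown markup
    text = re.sub(r"\s+", " ", text)                      # collapse whitespace
    return text.strip()
```

Note the order: markdown links are unwrapped before bare URLs are removed, so the visible label survives.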
Step 2.2: Create N-grams:
- Navigate to `Step 2.2 Create Bi-and Trigrams for Word2Vec/`.
- Run: `python creating_ngrams.py`
- Output: Sentences with n-grams joined by underscores in `2_2-sentences_with_ngrams.txt`. Also creates `bigrams.txt` and `trigrams.txt`.
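The underlying idea, scoring word pairs that co-occur often and joining them with underscores (so "kool savas" becomes `kool_savas`), can be sketched with the stdlib. This is a crude stand-in using raw counts; a real collocation model (e.g. gensim's `Phrases`, which `creating_ngrams.py` may well use) applies a normalized PMI-style statistic instead:

```python
from collections import Counter

def join_frequent_bigrams(sentences, min_count=2):
    """Join word pairs seen at least min_count times with an underscore.

    sentences: list of token lists. Returns new token lists with
    frequent adjacent pairs fused into single tokens.
    """
    pairs = Counter((a, b) for s in sentences for a, b in zip(s, s[1:]))
    out = []
    for s in sentences:
        joined, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and pairs[(s[i], s[i + 1])] >= min_count:
                joined.append(f"{s[i]}_{s[i + 1]}")
                i += 2  # consume both tokens of the fused pair
            else:
                joined.append(s[i])
                i += 1
        out.append(joined)
    return out
```

Running the same pass twice over already-fused tokens is how trigrams like `k_i_z` emerge from repeated bigram joins.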
Step 2.3: Train Word2Vec & Analyze Aliases:
- Navigate to `Step 2.3 Train Word2Vec/`.
- Run `python add_rappers_no_alias.py` to initialize `rapper_aliases.json` using the list from the Supporting step.
- Run `python train_word2vec.py` to train the model. Output: `word2vec_model.model`.
- Run `python find_rappers.py` to interactively find and confirm potential aliases based on word similarity, updating `rapper_aliases.json`.
- Use `create_interactive_view.py` or `word2vec-visualizer.py` to explore the embeddings.
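Alias discovery rests on nearest neighbours in the embedding space (in gensim this is `most_similar` on the trained model). The core operation is cosine similarity between word vectors; a minimal pure-Python sketch with illustrative toy vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(word, vectors, topn=5):
    """Rank all other words in a {word: vector} dict by cosine similarity."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:topn]
```

If an alias and the canonical name appear in similar contexts on the subreddit, their vectors end up close, which is what `find_rappers.py` exploits.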
Step 3: Sentiment Analysis:
- Navigate to `Step 3 Sentiment Analysis/`.
- (Manual Step): Run `python test-set-creator.py` to launch the GUI and manually annotate samples for evaluation. This creates `test_set.json`.
- Run `python sentiment-analysis.py` to perform sentiment analysis using the configured LLM (requires Ollama running). Output: Results stored in `rapper_sentiments.db`. Progress saved in `sentiment_analysis_progress.json`.
- Use `evaluate_BERT_baseline.py` or `llm_evaluator.py`/`prompt_evaluator.py` to evaluate model performance against `test_set.json`.
- Run `python clean-sentiment-db.py` to clean `ERROR` or `NO_SENTIMENT` entries if needed (converts `NO_SENTIMENT` to `3`).
- Run `python analyser.py` to generate analysis reports and visualizations from the database results (saved to `sentiment_analysis_results/`).
Step 3.1: Add Timestamps to Database:
- This step refines the timestamp association after n-grams have been created.
- Navigate to `Step 3.1 Tack together time data with database/`.
- Run `python reddit_text_extraction_for_word2vec.py` (extracts sentences with timestamps). Output: `2_1-processed_sentences_with_time.txt`.
- Run `python ngrams-with-timestamps-txt.py` (applies n-grams, saves the mapping). Outputs: `2_2-sentences_with_ngrams.txt` (overwritten/same as the Step 2.2 output) and `sentence_timestamps_mapping.txt`.
- Run `python fix-mapping.py` (corrects the mapping using the final n-gram sentences). Output: `corrected_timestamps_mapping.txt`.
- Run `python update-db-txt.py` (updates the database using the corrected mapping). Output: adds/updates the `original_timestamp` column in `rapper_sentiments.db`.
Final Analysis:
- Navigate back to `Step 3 Sentiment Analysis/`.
- Run `python analyser.py` again to generate reports incorporating the timestamp data.
- `Step 1/mainscript.py`: Orchestrates Reddit scraping and token management.
- `Step 2.1/reddit_text_extraction_for_word2vec.py`: Cleans JSON data and extracts sentences.
- `Step 2.2/creating_ngrams.py`: Identifies and applies bigrams/trigrams.
- `Step 2.3/train_word2vec.py`: Trains the Word2Vec model.
- `Step 2.3/find_rappers.py`: Analyzes rapper similarity and helps build the alias list.
- `Step 3/test-set-creator.py`: GUI tool for manual sentiment annotation.
- `Step 3/sentiment-analysis.py`: Performs LLM-based sentiment analysis and saves results to the database.
- `Step 3/evaluate_BERT_baseline.py`: Evaluates model performance against the test set.
- `Step 3.1/update-db-txt.py`: Adds accurate timestamps to the sentiment database.
- `Step 3/analyser.py`: Generates final reports and visualizations from the sentiment database.
- API Keys & Scraper Settings: Configure in `Step 1 Reddit Scraper/.env`.
- LLM Model: The model used for sentiment analysis is set within `Step 3 Sentiment Analysis/sentiment-analysis.py` (currently `qwen2.5:3b`). You may need to adjust this based on the models available in your Ollama installation.
- Rapper List: The initial list is generated by scripts in `Supporting - List of Rappers/`; the final alias mapping is managed in `Step 2.3 Train Word2Vec/rapper_aliases.json`.
All required Python packages are listed in `requirements.txt`.
Note that generated data files (JSONs in `1-posts/`, `.txt`, `.csv`, `.model`, `.db`, `.html`, `.png`), log files, and environment files (`.env`) are ignored by Git.
Happy Analyzing!