# Credit-Risk-Analysis

Repository to reproduce a credit risk data pipeline: data ingestion to HDFS, preprocessing notebooks, feature extraction, modeling, and a streaming demo with Kafka + Spark Streaming.

## Table of Contents
- Project overview
- Prerequisites
- Quick start
- Download data
- Upload data to HDFS
- Run notebooks (preprocessing, EDA, modeling)
- Build preprocessor for new data
- Streaming (Kafka + Spark Streaming)
- Testing the streaming pipeline
- Stopping and cleanup
- Troubleshooting & tips
- Contact & Credits
## Project overview

This repository contains code and notebooks for a complete credit risk analysis pipeline, including data ingestion, preprocessing, exploratory data analysis (EDA), feature extraction, model training, and a streaming demo that simulates events and predictions using Kafka and Spark Streaming.
To give an overview of how the system operates, the diagrams in the repository illustrate the end-to-end pipeline and the real-time monitoring dashboard used in the demo.
The instructions below describe how to set up the project locally using Docker and HDFS, run the notebooks, create the preprocessor artifact required by the model, and start the streaming demo.
## Prerequisites

- Git
- Docker & Docker Compose
- Enough disk space for the dataset and containers
- bash or another POSIX-compatible shell for running the provided scripts

Default ports used:

- Jupyter Notebook: `localhost:8888`
- Web dashboard (monitoring / demo): `localhost:5000`
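If you are not sure whether those ports are free, a quick check from Python is possible (a minimal sketch; it only probes the two default ports listed above):

```python
import socket

# Default host ports used by this project (see list above)
PORTS = {8888: "Jupyter Notebook", 5000: "Web dashboard"}

for port, service in PORTS.items():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # connect_ex returns 0 if something is already listening on the port
        in_use = sock.connect_ex(("127.0.0.1", port)) == 0
        status = "already in use" if in_use else "free"
        print(f"{service} port {port}: {status}")
```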
## Quick start

- Clone the repository and enter the project directory:

  ```bash
  git clone https://github.com/nctrinh/Credit-Risk-Analysis.git
  cd Credit-Risk-Analysis
  ```

- Start the required services with Docker Compose (both variants are accepted):

  ```bash
  # either
  docker compose up -d
  # or
  docker-compose up -d
  ```

## Download data

Download the data files referenced in `data/links_to_data.txt` into the `data/` folder. Ensure the downloaded file is named `loan.csv` and placed in `./data/`.
The repo contains `data/links_to_data.txt`; use it as the source of truth for data URLs.
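If you prefer to script the download, the sketch below fetches every URL listed in `data/links_to_data.txt` into `data/` (a rough sketch that assumes one direct-download URL per line; depending on the host you may still need to rename the result to `loan.csv`):

```python
import os
import urllib.request

LINKS_FILE = "data/links_to_data.txt"
DATA_DIR = "data"

os.makedirs(DATA_DIR, exist_ok=True)

with open(LINKS_FILE) as f:
    # One URL per line is assumed; blank lines and comments are skipped
    urls = [line.strip() for line in f
            if line.strip() and not line.startswith("#")]

for url in urls:
    # Derive a file name from the URL; rename to loan.csv afterwards if needed
    filename = os.path.join(DATA_DIR, url.rsplit("/", 1)[-1] or "download.bin")
    print(f"Downloading {url} -> {filename}")
    urllib.request.urlretrieve(url, filename)
```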
## Upload data to HDFS

After starting the containers, upload the dataset to HDFS so the notebooks and Spark jobs can access it.
Run the following commands (these are meant to be executed on your host machine):
```bash
# Leave safe mode on the namenode
docker exec -it namenode hdfs dfsadmin -safemode leave

# Create a temporary directory on the namenode container and copy the CSV there
docker exec -it namenode mkdir -p /opt/bigdata/data/
docker cp ./data/loan.csv namenode:/opt/bigdata/data/

# Create target directories in HDFS
docker exec -it namenode hdfs dfs -mkdir -p /bigdata/data
docker exec -it namenode hdfs dfs -mkdir -p /bigdata/notebooks
docker exec -it namenode hdfs dfs -mkdir -p /bigdata/data/splitted_data
docker exec -it namenode hdfs dfs -mkdir -p /bigdata/data/processed_data

# Put the CSV into HDFS
docker exec -it namenode hdfs dfs -put /opt/bigdata/data/loan.csv /bigdata/data/
```
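To confirm the upload landed, you can list the HDFS directory from the host; the small sketch below shells out to the same `docker exec` command used above:

```python
import subprocess

# List the HDFS target directory via the namenode container
result = subprocess.run(
    ["docker", "exec", "namenode", "hdfs", "dfs", "-ls", "/bigdata/data"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)

# Fail loudly if loan.csv is not there yet
assert "loan.csv" in result.stdout, "loan.csv not found in /bigdata/data"
```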
## Run notebooks (preprocessing, EDA, modeling)

Open Jupyter Notebook at http://localhost:8888.

Navigate to `work/notebooks` and run the notebooks in order (use "Run All" for each):

1. `data_preprocessing.ipynb`
2. `data_splitting.ipynb`
3. `EDA.ipynb`
4. `feature_extraction.ipynb`
5. `modeling.ipynb`
After the notebooks finish, you can stop the Jupyter Notebook service if you like.
## Build preprocessor for new data

To generate the preprocessor artifact (used to preprocess new incoming data), run the following inside the `jupyter_notebook` container:
```bash
# open a shell inside the jupyter container
docker exec -it jupyter_notebook bash

# activate conda environment and run preprocessor builder
conda activate py37
python3 work/model/preprocessor.py
```

This will produce the preprocessor file used by the streaming consumers / model inference.
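The exact artifact format is defined by `work/model/preprocessor.py`. Purely as an illustration, if the artifact were a pickled scikit-learn-style transformer (an assumption this repo does not confirm; the `preprocessor.pkl` path and the field names below are hypothetical), a consumer could apply it roughly like this:

```python
import pickle

import pandas as pd

# Hypothetical artifact path; check work/model/preprocessor.py for the real one
ARTIFACT_PATH = "work/model/preprocessor.pkl"

with open(ARTIFACT_PATH, "rb") as f:
    preprocessor = pickle.load(f)

# A single incoming record, reshaped into the tabular layout the model expects;
# the field names here are examples, not the project's actual schema
record = pd.DataFrame([{"loan_amnt": 10000, "term": "36 months"}])
features = preprocessor.transform(record)
print(features)
```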
## Streaming (Kafka + Spark Streaming)

To enable streaming for the demo, run the scripts in the `scripts/` folder. Run each command in its own terminal (they are long-running processes):
```bash
bash scripts/create_topics.sh
bash scripts/run_spark_streaming.sh
bash scripts/run_consumer.sh
```

When streaming is running, you can observe the demo/dashboard at http://localhost:5000.
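A quick way to confirm the dashboard is serving (a minimal check, assuming the default port mapping from the Prerequisites section):

```python
import urllib.request

# The demo dashboard defaults to port 5000 (see Prerequisites)
with urllib.request.urlopen("http://localhost:5000", timeout=5) as resp:
    print(f"Dashboard responded with HTTP {resp.status}")
```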
## Testing the streaming pipeline

To generate test events and see end-to-end behavior, open a new terminal and run the test client:

```bash
python3 test_client.py
```

Follow the on-screen menu and choose a number to select a test to run (the script will indicate the available choices).
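For reference, producing an event by hand would look roughly like the sketch below using `kafka-python`. The topic name `loan_events` and the message fields are assumptions (check `scripts/create_topics.sh` for the real topic names), and it assumes Kafka's port 9092 is mapped to the host; `test_client.py` remains the authoritative client:

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical topic name; check scripts/create_topics.sh for the real one
TOPIC = "loan_events"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Example payload; field names are illustrative, not the project's schema
event = {"loan_amnt": 12000, "term": "60 months", "annual_inc": 85000}
producer.send(TOPIC, event)
producer.flush()
print(f"Sent test event to {TOPIC}")
```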
## Stopping and cleanup

To stop the Docker Compose stack:

```bash
# from repo root
docker compose down
# or
docker-compose down
```

(Optional) Remove volumes if you want a clean start (be careful: this deletes persisted container data):

```bash
docker compose down -v
# or
docker-compose down -v
```

## Troubleshooting & tips
- If Jupyter fails to start, check the container logs: `docker logs jupyter_notebook`.
- If HDFS commands fail, ensure the `namenode` container is healthy and `hdfs dfsadmin -safemode leave` has been executed.
- For streaming issues, check that the Kafka topics exist: `docker exec -it kafka kafka-topics --list --bootstrap-server localhost:9092`.
- If ports are already taken, either stop the conflicting service or change the port mapping in `docker-compose.yml`.
## Contact & Credits

This project is maintained by the development team: Nguyễn Công Trình, Phạm Thế Trung, and Nguyễn Công Vinh.
If you have any questions regarding the codebase, please open an issue in this repository.

