This repository is a simplified version of a data processing project written in Python, built around a data processing engine and a medallion-like architecture. It combines data science, written in notebooks, with data engineering, and describes (through code) the process of generating a dataset, from ingestion all the way to model training.
The repository contains:
- data - a local representation of the medallion architecture (bronze/silver/gold layers).
- model - the saved model.
- notebook - notebooks for data analysis and model training.
- data processing jobs - applications that process data using Polars.
Requirements for running the CLI:
- Python 3.11 (>= 3.10, <= 3.12)
- Poetry
- A Python virtual environment (virtualenv, pyenv)
- Install Poetry:

  ```
  curl -sSL https://install.python-poetry.org | python3 -
  ```

- Clone this repository:

  ```
  $ git clone https://github.com/predrag-njegovanovic/library-analysis.git
  ```

- Create a virtual environment and activate it.
- Install the Python libraries:

  ```
  $ poetry install
  ```

- Inside the `data/ingest` directory, place the data files (`*.csv`).
- Inside the `storage` directory, create `bronze`, `silver`, and `gold` folders.
The project is set up so that no additional changes should be needed for a local run. The project configuration is stored in `src/config/settings.toml` and is adapted for the default (local) environment, but there are options for extension.
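As a rough illustration, a layered settings file for this kind of project might look as follows (the keys and paths here are hypothetical — see `src/config/settings.toml` for the real structure):

```toml
# Hypothetical example — not the project's actual configuration.
[default]
ingest_path = "data/ingest"
bronze_path = "storage/bronze"
silver_path = "storage/silver"
gold_path = "storage/gold"
```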
The main application abstractions are defined in the `common.py` module and are used to create the processing pipelines.
There are three pipelines that move data through the storage layers, making it more "usable". Those three pipelines are:
- Ingestion,
- Transformation and
- Aggregation.

The applications are structured accordingly.
The notebooks simulate data science workloads such as data analysis, feature analysis, and model training.
After the Poetry installation, a script with an entry point is installed alongside the project. This creates a 'symlink' used to invoke the CLI.
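With Poetry, such an entry point is typically declared in `pyproject.toml`; a sketch of what that declaration might look like (the module path here is a guess based on the CLI name, not copied from the project):

```toml
# Hypothetical — check the project's pyproject.toml for the real entry.
[tool.poetry.scripts]
lt = "src.cli:main"
```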
Running ingestion:

```
lt ingest
```

Running processing:

```
lt process
```

After these two commands, the data should be in the bronze and silver layers. Only then can the notebooks be run.

To create the dataset, run:

```
lt create-dataset
```

All commands have a `--help` option which shows their arguments.
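A subcommand-style CLI like this can be sketched with the standard library's `argparse` (the real project may use a different CLI library; the command names below are taken from this README, everything else is hypothetical):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the `lt` subcommands described above."""
    parser = argparse.ArgumentParser(prog="lt")
    subparsers = parser.add_subparsers(dest="command", required=True)

    subparsers.add_parser("ingest", help="Load raw CSV files into the bronze layer.")
    subparsers.add_parser("process", help="Transform bronze data into the silver layer.")
    subparsers.add_parser("create-dataset", help="Build the gold-layer training dataset.")

    predict = subparsers.add_parser("predict", help="Predict a late return.")
    predict.add_argument("--customer-id", required=True)
    predict.add_argument("--book-id", required=True)
    return parser
```

Calling the parser with `--help` (or a subcommand with `--help`) prints the generated usage text, matching the behavior described above.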
There is also a command that predicts, for a given customer and book, whether the return will be late. It mimics a real implementation and is a very basic version of model serving.

```
lt predict --customer-id <customer_id> --book-id <book_id>
```

Steps to reproduce the results:
- Make sure all `.csv` files are inside the `ingest` directory,
- Run the `ingest` command,
- Run the `process` command,
- Inspect the `01_library_data_analysis` notebook,
- Run the `create-dataset` command,
- Inspect the `02_prediction_model_training` notebook and
- Try the `predict` command.