This repository is a simplified version of a data processing project written in Python, built around a data processing engine and a medallion-like architecture. It combines data science, written in notebooks, with data engineering, and describes (through code) the process of generating a dataset, from ingestion all the way to model training.
The repository contains:
- data - a local representation of the medallion architecture (bronze/silver/gold layers).
- model - the saved model.
- notebook - notebooks for data analysis and model training.
- data processing jobs - applications that process data using Polars.
Requirements for running the CLI:
- Python 3.11 (>= 3.10, <= 3.12)
- Poetry
- A Python virtual environment (virtualenv, pyenv)
- Install Poetry:

  ```
  curl -sSL https://install.python-poetry.org | python3 -
  ```

- Clone this repository:

  ```
  $ git clone https://github.com/predrag-njegovanovic/library-analysis.git
  ```

- Create a virtual environment and activate it.
- Install the Python libraries:

  ```
  $ poetry install
  ```

- Inside the `data/ingest` directory, place the data files (`*.csv`).
- Inside the `storage` directory, create `bronze`, `silver`, and `gold` folders.
The project is set up so that no additional changes should be needed for a local run. The project configuration is stored in `src/config/settings.toml` and is adapted for the default (local) environment, but there are options for extension.
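As a rough illustration, a layered settings file for this kind of project might look as follows (the keys and paths here are hypothetical — see `src/config/settings.toml` for the real structure):

```toml
# Hypothetical example — not the project's actual configuration.
[default]
ingest_path = "data/ingest"
bronze_path = "storage/bronze"
silver_path = "storage/silver"
gold_path = "storage/gold"
```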
The main application abstractions are defined in the `common.py` module and are used to create the processing pipelines.
There are three pipelines that move data through the storage layers, making it more "usable". Those three pipelines are:
- Ingestion,
- Transformation and
- Aggregation.

The applications are structured accordingly.
The notebooks simulate data science workloads such as data analysis, feature analysis, and model training.
After the Poetry installation, a script with an entry point is installed alongside the project. This creates a 'symlink' used to invoke the CLI.
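With Poetry, such an entry point is typically declared in `pyproject.toml`; a sketch of what that declaration might look like (the module path here is a guess based on the CLI name, not copied from the project):

```toml
# Hypothetical — check the project's pyproject.toml for the real entry.
[tool.poetry.scripts]
lt = "src.cli:main"
```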
Running ingestion:

```
lt ingest
```

Running processing:

```
lt process
```

After these two commands, the data should be in the bronze and silver layers. Only then can the notebooks be run.

To create the dataset, run:

```
lt create-dataset
```

All commands have a `--help` option which shows their arguments.
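A subcommand-style CLI like this can be sketched with the standard library's `argparse` (the real project may use a different CLI library; the command names below are taken from this README, everything else is hypothetical):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the `lt` subcommands described above."""
    parser = argparse.ArgumentParser(prog="lt")
    subparsers = parser.add_subparsers(dest="command", required=True)

    subparsers.add_parser("ingest", help="Load raw CSV files into the bronze layer.")
    subparsers.add_parser("process", help="Transform bronze data into the silver layer.")
    subparsers.add_parser("create-dataset", help="Build the gold-layer training dataset.")

    predict = subparsers.add_parser("predict", help="Predict a late return.")
    predict.add_argument("--customer-id", required=True)
    predict.add_argument("--book-id", required=True)
    return parser
```

Calling the parser with `--help` (or a subcommand with `--help`) prints the generated usage text, matching the behavior described above.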
There is also a command that predicts, for a given customer and book, whether the return will be late. It mimics a real implementation and is a very basic version of model serving.

```
lt predict --customer-id <customer_id> --book-id <book_id>
```

Steps to reproduce the results:
- Make sure all `.csv` files are inside the `ingest` directory,
- Run the `ingest` command,
- Run the `process` command,
- Inspect the `01_library_data_analysis` notebook,
- Run the `create-dataset` command,
- Inspect the `02_prediction_model_training` notebook and
- Try the `predict` command.