The Dataprep Microservice preprocesses data from various sources (structured or unstructured) into text, converts the text to embedding vectors, and stores the vectors in the database.
apt-get update
apt-get install libreoffice

Unstructured data occasionally contains images. To convert image data to text, an LVM can be used to summarize each image. To use an LVM, refer to this readme to start the LVM microservice first, then set the environment variable below before starting any dataprep microservice.
export SUMMARIZE_IMAGE_VIA_LVM=1

For details, please refer to this readme
The following steps are common to all DB backends when running the dataprep microservice in an air-gapped environment (i.e., an environment with no internet access).
- Download the following models, e.g. huggingface-cli download --cache-dir <model data directory> <model>
  - microsoft/table-transformer-structure-recognition
  - timm/resnet18.a1_in1k
  - unstructuredio/yolo_x_layout
- Launch the dataprep microservice with the following settings:
  - mount the model data directory as the /data directory within the dataprep container
  - set environment variable HF_HUB_OFFLINE to 1 when launching the dataprep microservice

e.g. docker run -d -v <model data directory>:/data -e HF_HUB_OFFLINE=1 ... ...
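The download step above can be scripted as a loop over the three listed models. The following is a minimal sketch, not part of the project itself: the MODEL_DIR and DRY_RUN variables and the download_models function are hypothetical names introduced for illustration. By default the script only prints the commands it would run, so it can be checked without network access.

```shell
#!/bin/sh
# Sketch: fetch the models listed above into a local directory.
# MODEL_DIR and DRY_RUN are hypothetical names used only in this example.
set -eu

# Stands in for <model data directory>.
MODEL_DIR="${MODEL_DIR:-./model_data}"
# DRY_RUN=1 (the default) prints the commands; DRY_RUN=0 runs them.
DRY_RUN="${DRY_RUN:-1}"

# The three models named in the steps above, one per line.
MODELS="microsoft/table-transformer-structure-recognition
timm/resnet18.a1_in1k
unstructuredio/yolo_x_layout"

download_models() {
  for model in $MODELS; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "huggingface-cli download --cache-dir $MODEL_DIR $model"
    else
      huggingface-cli download --cache-dir "$MODEL_DIR" "$model"
    fi
  done
}

download_models
```

To perform the downloads, run the script with DRY_RUN=0 on a machine that has internet access and huggingface-cli installed, then make the resulting directory available on the air-gapped host so it can be mounted at /data.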