Oqura's local-datagen-cli is a terminal tool for generating structured datasets from local files like PDFs, Word docs, images, and text. You upload a file and describe the kind of dataset you want. It extracts the content, uses semantic search to understand and gather relevant context, applies your instructions through a generated schema, and outputs clean, structured data. Perfect for converting raw or unstructured local documents into ready-to-use datasets for training, analysis, or experimentation, all without manual formatting.
- give the path to a local directory containing any of the supported file types (PDF, DOCX, JPG, TXT, etc.)
- extracts text from the uploaded document
- splits the content page-wise into smaller chunks
- randomly selects a chunk to use as a reference
- runs a semantic similarity search using Qdrant to find related chunks
- gathers similar chunks to build a context window
- formats the gathered context cleanly
- generates structured data using an instruction query and generated schema
- evolves and improves the dataset iteratively
- combines generated samples into a complete dataset
- exports the final dataset in CSV or JSON format via the terminal
This diagram shows how the tool takes a local file and an instruction, extracts and understands the content, and turns it into a structured dataset.
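The chunking and reference-selection steps above can be sketched roughly as follows. This is a simplified illustration, not the tool's actual implementation — the function names, chunk size, and page representation are all hypothetical:

```python
import random

def chunk_pages(pages, chunk_size=500):
    """Split each page's text into smaller fixed-size chunks (size is illustrative)."""
    chunks = []
    for page in pages:
        for i in range(0, len(page), chunk_size):
            chunks.append(page[i:i + chunk_size])
    return chunks

def pick_reference(chunks, seed=None):
    """Randomly select one chunk to seed the semantic similarity search."""
    rng = random.Random(seed)
    return rng.choice(chunks)

# Example: two "pages" of extracted text
pages = ["First page text " * 40, "Second page text " * 40]
chunks = chunk_pages(pages)
reference = pick_reference(chunks, seed=0)
```

In the real pipeline, the reference chunk would then be embedded and queried against Qdrant to gather similar chunks for the context window.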
Follow these steps to set up and run the project locally.
uv is required to manage the virtual environment and dependencies.
You can download it from the official uv GitHub repository, which includes platform-specific installation instructions.
```bash
git clone https://github.com/Oqura-ai/local-datagen-cli.git
cd local-datagen-cli
```

Use uv to create a virtual environment:

```bash
uv venv
```

Activate the environment depending on your OS:
Windows:

```bash
.venv\Scripts\activate
```

macOS/Linux:

```bash
source .venv/bin/activate
```

Copy the example .env file and add your API keys:

```bash
cp .env.example .env
```

Open the .env file in a text editor and fill in the required fields:
```
OPENAI_API_KEY=your_openai_api_key_here
MISTRAL=your_mistral_api_key_here

# defaults
QDRANT_URL=http://localhost:6333
COLLECTION_NAME=knowledge_base
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
```
These keys are essential for the application to work correctly.
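As a rough sketch of how such settings are typically consumed (hypothetical helper and dictionary keys — not the tool's actual code), the defaults above could be read like this:

```python
import os

def load_settings(env=os.environ):
    """Read the .env values, falling back to the documented defaults."""
    missing = [k for k in ("OPENAI_API_KEY", "MISTRAL") if not env.get(k)]
    return {
        "qdrant_url": env.get("QDRANT_URL", "http://localhost:6333"),
        "collection_name": env.get("COLLECTION_NAME", "knowledge_base"),
        "embedding_model": env.get("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5"),
        "missing_keys": missing,  # keys the user still needs to fill in
    }

settings = load_settings({"OPENAI_API_KEY": "sk-test", "MISTRAL": "key"})
```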
Install required packages using:
```bash
uv pip install -r requirements.txt
```

Make sure you have Docker and Docker Compose installed. Then start the required services (e.g., Qdrant) using:

```bash
docker-compose up --build
```

This will spin up the necessary services in the background.
Once the environment and services are ready, start the application:
```bash
python main.py
```

You're all set! The application will now guide you through the dataset creation process step by step, and the final dataset will be saved in the output_files directory.
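The final export step mentioned above (CSV or JSON into output_files) could look something like this hypothetical helper — the real tool's function names and file names may differ:

```python
import csv
import json
from pathlib import Path

def export_dataset(rows, out_dir="output_files", fmt="json"):
    """Write a list of row dicts to output_files as JSON or CSV."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    if fmt == "json":
        path = out / "dataset.json"
        path.write_text(json.dumps(rows, indent=2))
    else:
        path = out / "dataset.csv"
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
    return path

result = export_dataset([{"question": "What is X?", "answer": "Y"}], fmt="json")
```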
You can customize how the tool behaves using the configuration.py file. It exposes two parameters:
```python
CONFIGURATION = {
    "rows_per_context": 5,  # number of QAs or rows generated per chunk
    "evolution_depth": 1,   # how much transformation/evolution to apply (1 = minimal, 3 = very complex)
}
```

If something here could be improved, please open an issue or submit a pull request.
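To see how these two knobs scale the output, here is a hypothetical back-of-the-envelope calculation (the actual tool may account for rows and evolution passes differently):

```python
CONFIGURATION = {
    "rows_per_context": 5,
    "evolution_depth": 1,
}

def plan_generation(num_chunks, config=CONFIGURATION):
    """Estimate workload: each chunk's context yields rows_per_context rows,
    and each row is evolved evolution_depth times."""
    rows = num_chunks * config["rows_per_context"]
    evolution_passes = rows * config["evolution_depth"]
    return {"rows": rows, "evolution_passes": evolution_passes}
```

For example, 10 chunks at the default settings would produce roughly 50 rows with one evolution pass each.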
This project is licensed under the MIT License. See the LICENSE file for more details.

