Synthetic QA Data Generation with Knowledge Graphs

This folder contains scripts and workflows for generating synthetic QA datasets from random walks over a knowledge graph. The final merged outputs contribute to the official dataset:

Dataset

📦 Step 1: Install Dependencies

Install MongoDB Tools (Ubuntu)

# Add the official MongoDB source
wget -qO - https://www.mongodb.org/static/pgp/server-6.0.asc | apt-key add -
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/6.0 multiverse" \
    | tee /etc/apt/sources.list.d/mongodb-org-6.0.list

# Install the toolkit and verify
apt-get update && apt-get install -y mongodb-database-tools
mongoimport --version

# Install Python dependencies
pip install -r requirements.txt

📚 Step 2: Download Knowledge Graph Source

Download the KILT knowledge source (a Wikipedia snapshot, ~35 GB):

wget http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json

🗄️ Step 3: Start Database & Import Data

Run a MongoDB container:

docker run -d --name mongo -p 27017:27017 -v mongodata:/data/db mongo:6

Import the knowledge source (takes roughly 10–20 minutes):

mongoimport --uri "mongodb://127.0.0.1:27017" \
  --db kilt --collection knowledgesource \
  --file kilt_knowledgesource.json

Create a text index on the title and body for faster queries:

docker exec -it mongo mongosh --eval \
  'db.getSiblingDB("kilt").knowledgesource.createIndex({ wikipedia_title: "text", text: "text" })'

Test connection:

python3 kilt_query.py
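
For a rough idea of what such a connection test might involve, the sketch below builds a `$text` query against the text index and returns matching page titles. It is not the contents of `kilt_query.py`: the helper names `build_text_query`, `find_titles`, and `check_connection` are hypothetical, the sample search phrase is arbitrary, and it assumes `pymongo` is installed and that documents carry a `wikipedia_title` field (as the index definition suggests).

```python
from typing import Any, Dict, List


def build_text_query(phrase: str) -> Dict[str, Any]:
    """Build a MongoDB full-text filter that is served by a text index."""
    return {"$text": {"$search": phrase}}


def find_titles(collection: Any, phrase: str, limit: int = 3) -> List[str]:
    """Return up to `limit` page titles matching `phrase`.

    `collection` is expected to be a pymongo Collection, or any object
    exposing a compatible find(...).limit(...) interface.
    """
    cursor = collection.find(
        build_text_query(phrase),
        {"wikipedia_title": 1, "_id": 0},  # projection: titles only
    ).limit(limit)
    return [doc["wikipedia_title"] for doc in cursor]


def check_connection(uri: str = "mongodb://127.0.0.1:27017") -> List[str]:
    """Ping the server, then run one sample text search (hypothetical helper)."""
    # Imported lazily so the pure helpers above work without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    client.admin.command("ping")  # raises if the server is unreachable
    return find_titles(client["kilt"]["knowledgesource"], "Eiffel Tower")
```

Calling `check_connection()` with the container from this step running should either return a short list of titles or raise a timeout error if MongoDB cannot be reached.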

🔑 Step 4: Configure API Keys

First, create the .env file:

touch .env

Then add your credentials:

OPENAI_API_KEY=sk-your-openai-key-here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENROUTER_API_KEY=sk-your-openrouter-key-here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
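
Before running the later steps, it can help to confirm that the file parses and no value is empty. The following is a minimal sketch using only the standard library; `parse_env_file` and `missing_keys` are hypothetical helpers, not part of this repository, and treating all four variables as required is an assumption, since the README does not say which script reads which key.

```python
from pathlib import Path
from typing import Dict, List, Sequence

# Assumption: every key listed in the README is needed by some script.
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "OPENAI_BASE_URL",
    "OPENROUTER_API_KEY",
    "OPENROUTER_BASE_URL",
)


def parse_env_file(path: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str],
                 required: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

For example, `missing_keys(parse_env_file(".env"))` returns an empty list when every variable is set.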

🎲 Step 5: Random Walk on Knowledge Graph

Sample multi-hop paths:

python3 random_walk_kilt.py

Outputs will be stored in ./random_walk_outputs/.
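
To illustrate the idea of sampling a multi-hop path, here is a small sketch of a random walk over an in-memory adjacency map. It is not the logic of `random_walk_kilt.py`: the toy graph, the `random_walk` function, and the choice to never revisit a node are all assumptions made for illustration. In the real pipeline the neighbours would presumably come from the pages stored in MongoDB.

```python
import random
from typing import Dict, List, Optional

# Toy graph (hypothetical): each page title maps to the titles it links to.
TOY_GRAPH: Dict[str, List[str]] = {
    "Eiffel Tower": ["Paris", "Gustave Eiffel"],
    "Paris": ["France", "Seine"],
    "Gustave Eiffel": ["Dijon", "Eiffel Tower"],
    "France": ["Paris"],
    "Seine": [],
    "Dijon": ["France"],
}


def random_walk(graph: Dict[str, List[str]], start: str, hops: int,
                rng: Optional[random.Random] = None) -> List[str]:
    """Walk up to `hops` edges from `start`, never revisiting a node.

    The walk stops early when the current node has no unvisited neighbours.
    """
    rng = rng or random.Random()
    path = [start]
    visited = {start}
    current = start
    for _ in range(hops):
        candidates = [n for n in graph.get(current, []) if n not in visited]
        if not candidates:
            break  # dead end: return the shorter path
        current = rng.choice(candidates)
        path.append(current)
        visited.add(current)
    return path
```

Passing a seeded generator, such as `random_walk(TOY_GRAPH, "Eiffel Tower", 3, random.Random(0))`, makes a sampled path reproducible. Each consecutive pair in the returned list is one hop that a question can later chain together.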

💬 Step 6: Generate QA Data

Use Gemini-2.5-pro to create QA pairs:

python3 generate_qa.py \
  --input ./random_walk_outputs/random_walk_architecture.jsonl \
  --output ./random_walk_outputs/random_walk_architecture_qa.jsonl
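
Both the input and the output of this step are JSON Lines files. Their record schema is not documented here, so the sketch below assumes only that each non-empty line is a JSON object. It is a hypothetical sanity check (`read_jsonl` and `summarize_keys` are not part of this repository) that counts records and reports which top-level keys appear.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc})") from exc


def summarize_keys(path: str) -> Dict[str, Any]:
    """Return the record count and how often each top-level key occurs."""
    key_counts: Counter = Counter()
    total = 0
    for record in read_jsonl(path):
        total += 1
        if isinstance(record, dict):
            key_counts.update(record.keys())
    return {"records": total, "keys": dict(key_counts)}
```

Running `summarize_keys` on the input and output files gives a quick way to compare their record counts and to spot records that lack a key the others have.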