This folder contains scripts and workflows for generating synthetic QA datasets via random walks on a Knowledge Graph. The final merged outputs contribute to the official dataset.

Install the MongoDB Database Tools (provides `mongoimport`):
```bash
# Add the official MongoDB source
wget -qO - https://www.mongodb.org/static/pgp/server-6.0.asc | apt-key add -
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/6.0 multiverse" \
  | tee /etc/apt/sources.list.d/mongodb-org-6.0.list

# Install the toolkit and verify
apt-get update && apt-get install -y mongodb-database-tools
mongoimport --version
```
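Note that `apt-key` is deprecated on Ubuntu 22.04 and later, so the first command above may fail there. An equivalent setup writes the key to a dedicated keyring instead (the keyring path is conventional, and `focal` should match your release codename):

```bash
# Store the MongoDB 6.0 signing key in its own keyring instead of apt-key
wget -qO - https://www.mongodb.org/static/pgp/server-6.0.asc \
  | gpg --dearmor -o /usr/share/keyrings/mongodb-server-6.0.gpg

# Reference that keyring explicitly via signed-by in the source entry
echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-6.0.gpg ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/6.0 multiverse" \
  | tee /etc/apt/sources.list.d/mongodb-org-6.0.list
```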
```bash
# Install Python dependencies
pip install -r requirements.txt
```

Download the KILT knowledge source (Wikipedia snapshot, ~35 GB):

```bash
wget http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json
```

Run a MongoDB container:

```bash
docker run -d --name mongo -p 27017:27017 -v mongodata:/data/db mongo:6
```

Import the knowledge source (takes 10–20 min):
```bash
mongoimport --uri "mongodb://127.0.0.1:27017" \
  --db kilt --collection knowledgesource \
  --file kilt_knowledgesource.json
```

Create a text index for faster queries:

```bash
docker exec -it mongo mongosh --eval \
  'db.getSiblingDB("kilt").knowledgesource.createIndex({ wikipedia_title: "text", text: "text" })'
```

Test the connection:
```bash
python3 kilt_query.py
```

First, create the `.env` file:

```bash
touch .env
```

then add your credentials:

```
OPENAI_API_KEY=sk-your-openai-key-here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENROUTER_API_KEY=sk-your-openrouter-key-here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
```

Sample multi-hop paths:

```bash
python3 random_walk_kilt.py
```

Outputs are stored in `./random_walk_outputs/`.
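The sampling step walks the Wikipedia hyperlink graph to build multi-hop paths. The exact logic lives in `random_walk_kilt.py`, but the core idea can be sketched over a toy adjacency list (the graph below is invented for illustration; the real script resolves outgoing links from the MongoDB `knowledgesource` collection):

```python
import random

# Toy hyperlink graph standing in for KILT's Wikipedia link structure.
GRAPH = {
    "Gothic architecture": ["Flying buttress", "Notre-Dame de Paris"],
    "Flying buttress": ["Gothic architecture", "Arch"],
    "Notre-Dame de Paris": ["Paris", "Gothic architecture"],
    "Paris": ["France"],
    "Arch": [],
    "France": [],
}

def sample_walk(graph, start, hops, seed=None):
    """Sample a multi-hop path by repeatedly following a random outgoing link."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(hops):
        neighbors = graph.get(path[-1], [])
        if not neighbors:  # dead end: stop the walk early
            break
        path.append(rng.choice(neighbors))
    return path

path = sample_walk(GRAPH, "Gothic architecture", hops=3, seed=0)
print(" -> ".join(path))
```

Each sampled path then becomes the grounding context for one multi-hop question; the actual script's sampling policy (restarts, deduplication, hop count) may differ from this sketch.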
Use Gemini-2.5-pro to create Q&A pairs:
```bash
python3 generate_qa.py \
  --input ./random_walk_outputs/random_walk_architecture.jsonl \
  --output ./random_walk_outputs/random_walk_architecture_qa.jsonl
```
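The output is JSON Lines, one Q&A record per line. A quick way to inspect it is to parse each line independently; note the field names in the sample record below are assumptions, so check one line of the real output first:

```python
import json

# A record shaped like what generate_qa.py plausibly emits; the exact
# schema is an assumption -- inspect the real file to confirm field names.
sample_line = json.dumps({
    "path": ["Gothic architecture", "Flying buttress", "Arch"],
    "question": "Which structural element common in Gothic architecture is a form of arch support?",
    "answer": "Flying buttress",
})

def load_jsonl(lines):
    """Parse an iterable of JSON Lines into a list of dicts, skipping blanks."""
    return [json.loads(line) for line in lines if line.strip()]

records = load_jsonl([sample_line])
print(records[0]["answer"])  # prints "Flying buttress"
```

Parsing line by line (rather than `json.load` on the whole file) is what makes the JSONL format convenient for streaming large generated datasets.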