A practical guide to finding and using the right dataset from the DSFSI collection.
Looking for something specific?
- I need a speech/audio dataset
- I need text data in a South African language
- I need terminology or translations
- I need data for teaching
- I want the most popular datasets
- I'm doing research on low-resource languages
Start here: African Next Voices (ZA-ANV)
- Why: Largest SA multilingual speech dataset (3,000 hours)
- Languages: isiZulu, Setswana, Sesotho sa Leboa, Tshivenda, Xitsonga, isiXhosa, Sesotho
- Includes: 8 pre-trained Whisper ASR models
- Downloads: 451k+
Other options:
- isiXhosa only: Vuk'uzenzele isiXhosa Speech (ViXSD)
- 11 SA languages: NCHLT Speech Corpus
- Multi-language: Lwazi Speech Corpus
Available models (all on HuggingFace under dsfsi-anv):
- Multilingual Whisper v3 Turbo (0.8B params, 7 languages)
- Language-specific models:
  - whisper-large-v3-turbo-anv-[zul|ven|tso|tsn|sot]
  - MMS-1B models (NCHLT & Lwazi variants)
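As a rough sketch of how the language-specific Whisper models listed above might be used: the helper below expands the bracketed naming pattern into one identifier per language code. The full repository ID layout (`dsfsi-anv/<model name>`) is an assumption based on the organisation name given above, and `transcribe` is a hypothetical helper that relies on the `transformers` ASR pipeline; check the model cards before relying on either.

```python
# Language codes taken from the bracketed naming pattern in the list above.
ANV_LANGUAGE_CODES = ("zul", "ven", "tso", "tsn", "sot")


def anv_model_id(code: str) -> str:
    """Expand the naming pattern into a repository ID (assumed layout)."""
    if code not in ANV_LANGUAGE_CODES:
        raise ValueError(f"unknown language code: {code!r}")
    # Assumption: models live under the dsfsi-anv organisation on HuggingFace.
    return f"dsfsi-anv/whisper-large-v3-turbo-anv-{code}"


def transcribe(audio_path: str, code: str) -> str:
    """Hypothetical helper: transcribe one audio file with a language-specific model."""
    # Imported lazily so anv_model_id() works without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=anv_model_id(code))
    return asr(audio_path)["text"]
```

Calling `transcribe("clip.wav", "zul")` would download the model on first use, so it needs network access and enough memory for a Whisper large-v3-turbo checkpoint.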
| Dataset | Type | Size | Platform |
|---|---|---|---|
| African Next Voices | Speech | 3,000 hrs | HuggingFace |
| Vuk'uzenzele Corpus | Parallel text | 2,200+ pairs | GitHub |
| IsiZulu News 2022 | News text | - | GitHub |
| Umsuka Parallel Corpus | EN-ZU parallel | - | Zenodo |
| za-mafoko collections | Terminology | Multiple | HuggingFace |
Best starting point: PuoData
- Downloads: 126k+
- Includes: Pre-trained PuoBERTa model
- Variants: PuoBERTa-News, PuoBERTa-NER, PuoBERTa-POS
Also available:
- African Next Voices (speech)
- Vuk'uzenzele Corpus (text)
- za-mafoko terminology
- African Next Voices (speech)
- ViXSD (speech, dedicated)
- Vuk'uzenzele Corpus (text)
- za-mafoko terminology
Best option: Vuk'uzenzele Corpus
- Languages: English, Afrikaans, isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele, Siswati, Tshivenda, Xitsonga
- Type: Parallel corpus from government magazine
- Size: 2,200+ aligned sentence pairs
- Use case: Machine translation, multilingual NLP
Government data: Gov-ZA Multilingual
- Cabinet statements in multiple languages
See the Datasets by Language section in the main README.
The most comprehensive multilingual terminology resource for SA languages:
| Dataset | Focus | Downloads | Best For |
|---|---|---|---|
| DSAC Terminology | Arts & Culture | 15.6k | Cultural terms |
| StatsSA Terminology | Statistics | 1.16k | Statistical terms |
| UP Glossary | Academic | 1.77k | Academic terms |
| AI Glossary | AI/ML | 85 | Tech terms |
| OERTB Terminology | Education | 5.74k | Educational terms |
Portal: za-mafoko.dsfsi.co.za
- WordNets for SA Languages
- Loughran McDonald-SA-2020 (financial sentiment)
Our top 5 most impactful datasets:
- African Next Voices (ZA-ANV) - 451k downloads
  - Speech recognition in 7 SA languages
- za-mafoko Collection - 100k+ downloads
  - Multilingual terminologies and glossaries
- COVID-19 ZA - 256 stars
  - Comprehensive pandemic data with dashboard
- PuoData - 126k downloads
  - Setswana corpus with BERT model
- Vuk'uzenzele - 7 stars
  - 11-language parallel corpus
- Kinyarwanda & Kirundi: Cross-lingual models
- 40+ African languages: Masakhane MT
- Multiple African languages: Pre-trained Embeddings
- Code-switching: AfroCS-xs
Each of the 11 SA official languages has resources in:
- Vuk'uzenzele parallel corpus
- za-mafoko terminology
- NCHLT speech corpus (where applicable)
- Provincial and district-level data (2020-present)
- Cases, deaths, testing, vaccination, mobility
- Live dashboard
- 64+ contributors, 256 stars
- ZASCA-Sum: Supreme Court judgments and summaries
- State Capture Transcripts: Zondo Commission transcripts
JSE Top 40 Stock Data (Local)
- Years: 2019, 2021, 2022, 2023, 2024
- Daily closing prices (Jan-Apr each year)
- Format: CSV and pickle
- Access: Included in this repository
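Because the stock files cover a non-contiguous set of years, a small path helper can prevent requests for a year that is not in the repository. This is a minimal sketch that assumes every year follows the file naming used for 2024 (`top40_jse_2024_performance.csv` and `.df`); the function name and the naming of the other years are assumptions, so verify them against the `data/stocks/` directory.

```python
from pathlib import Path

# Years listed above; note that 2020 is not part of the collection.
JSE_YEARS = (2019, 2021, 2022, 2023, 2024)

# The pickle files use a .df suffix in this repository's 2024 example.
_SUFFIXES = {"csv": "csv", "pickle": "df"}


def jse_stock_path(year: int, fmt: str = "csv", root: str = "data/stocks") -> Path:
    """Build the expected path of a JSE Top 40 file (assumed naming pattern)."""
    if year not in JSE_YEARS:
        raise ValueError(f"no JSE Top 40 data for {year}")
    if fmt not in _SUFFIXES:
        raise ValueError(f"unsupported format: {fmt!r}")
    return Path(root) / f"top40_jse_{year}_performance.{_SUFFIXES[fmt]}"
```

The returned path can be passed straight to `pandas.read_csv` or `pandas.read_pickle`, depending on the format requested.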
- IsiZulu & Siswati News 2022
- South African News Data 2020
- ZA Fake News 2020 (disinformation)
Datasets curated for teaching and coursework:
- Customer Segmentation Data
- Hypermarket Dataset
- Market Basket Optimisation
- Online Retail II
- AG News Dataset
Location: data/cos781/ and data/cos802/ in this repository
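To see which course files are present in a local checkout, something like the following could work. It is a sketch only: it assumes the course data is stored as CSV files directly inside the two directories named above, and `list_course_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Course directories named above, relative to the repository root.
COURSE_DIRS = ("data/cos781", "data/cos802")


def list_course_files(repo_root: str = ".", pattern: str = "*.csv") -> dict:
    """Map each course directory to the sorted file names matching `pattern`."""
    found = {}
    for rel in COURSE_DIRS:
        folder = Path(repo_root) / rel
        # A missing directory yields an empty list rather than an error.
        found[rel] = sorted(p.name for p in folder.glob(pattern)) if folder.is_dir() else []
    return found
```

Run from the repository root, `list_course_files()` returns a dictionary keyed by course directory.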
Start here: Vuk'uzenzele Corpus
- 11 languages, 2,200+ aligned pairs
Also:
- PuoBERTa-News (Setswana)
- IsiZulu & Siswati News
- AG News Dataset (local)
- PuoBERTa-NER (Setswana)
- PuoBERTa-POS (Setswana)
- PuoData (Setswana)
- African Pre-Trained Embeddings
from datasets import load_dataset
# Load a dataset
dataset = load_dataset("dsfsi/PuoData")
# Load with specific split
train_data = load_dataset("dsfsi-anv/za-african-next-voices", split="train")

# Clone repository
git clone https://github.com/dsfsi/covid19za.git
# Or download specific files
wget https://raw.githubusercontent.com/dsfsi/vukuzenzele-nlp/main/data/file.csv

- Click "Download" on the Zenodo page
- Use DOI for citation in papers
- Some datasets have APIs (check Zenodo docs)
import pandas as pd
# JSE stock data
stocks = pd.read_csv("data/stocks/top40_jse_2024_performance.csv")
# Or pickle format
stocks_df = pd.read_pickle("data/stocks/top40_jse_2024_performance.df")
# Course data
customers = pd.read_csv("data/cos781/customer_segment_data.csv")

Use datasets_index.json for programmatic discovery:
import json
# Load registry
with open('datasets_index.json') as f:
    registry = json.load(f)
# Find all speech datasets
speech = [d for d in registry['datasets']
          if d.get('category') == 'speech']
# Find datasets by language
zulu = [d for d in registry['datasets']
        if 'zu' in d.get('languages', [])]
# Find datasets on HuggingFace
hf_datasets = [d for d in registry['datasets']
               if d.get('platform') == 'huggingface']
# Find by tag
nlp_datasets = [d for d in registry['datasets']
                if 'nlp' in d.get('tags', [])]

When choosing a dataset, consider:
- Language coverage: Does it include your target language(s)?
- Size: Is it large enough for your task?
- License: Can you use it for your purpose (research, commercial, etc.)?
- Quality: Is it curated/validated? What's the error rate?
- Documentation: Is there a paper or detailed documentation?
- Format: Is it in a format you can work with?
- Pre-trained models: Are there models already available?
- Community: Is it actively maintained? Are others using it?
- Citation: Is there proper citation information?
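Some of the checklist items above (language coverage, platform, category) can be checked against `datasets_index.json` in a single pass. The sketch below combines those filters in one hypothetical helper, `find_datasets`. It assumes the registry uses the `category`, `languages`, `platform`, and `tags` fields; the sample entries and their `name` field are invented for illustration and are not real registry records.

```python
def find_datasets(registry, category=None, language=None, platform=None, tag=None):
    """Return registry entries that satisfy every filter that was supplied."""
    matches = []
    for entry in registry.get("datasets", []):
        if category is not None and entry.get("category") != category:
            continue
        if language is not None and language not in entry.get("languages", []):
            continue
        if platform is not None and entry.get("platform") != platform:
            continue
        if tag is not None and tag not in entry.get("tags", []):
            continue
        matches.append(entry)
    return matches


# Hypothetical entries, shaped like the fields the registry is assumed to use.
sample_registry = {
    "datasets": [
        {"name": "example-speech", "category": "speech",
         "languages": ["zu", "xh"], "platform": "huggingface", "tags": ["asr"]},
        {"name": "example-text", "category": "text",
         "languages": ["tn"], "platform": "github", "tags": ["nlp"]},
    ]
}

zulu_speech = find_datasets(sample_registry, category="speech", language="zu")
```

Filters left as `None` are ignored, so the same function covers single-criterion and combined queries. License, quality, and documentation still need a manual look at each dataset page.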
Can't find what you're looking for?
- Check the main README: Complete catalog
- Browse by category: See Datasets by Category
- Browse by language: See Datasets by Language
- Check publications: DSFSI Publications
- Contact us: [email protected]
- Follow us: Twitter/X | LinkedIn | Bluesky
When using DSFSI datasets in research:
- Cite the specific dataset: Use DOI or URL from dataset page
- Cite the registry: For general dataset discovery
- Cite the paper: If there's an associated publication
Example citation for the registry:
@misc{dsfsi-datasets-2025,
title={DSFSI Public Datasets Registry},
author={{Data Science for Social Impact Research Group}},
year={2025},
publisher={University of Pretoria},
url={https://github.com/dsfsi/dsfsi-datasets}
}

Found a dataset we should include? See CONTRIBUTING.md for guidelines on submitting new datasets.
DSFSI Research Group | University of Pretoria | www.dsfsi.co.za