DSFSI Datasets Discovery Guide

A practical guide to finding and using the right dataset from the DSFSI collection.

Quick Start

Looking for something specific?

I need a speech/audio dataset
I need text data in a South African language
I need terminology or translations
I need data for teaching
I want the most popular datasets
I'm doing research on low-resource languages

Finding Speech Datasets

For Automatic Speech Recognition (ASR)

Start here: African Next Voices (ZA-ANV)

Why: Largest SA multilingual speech dataset (3,000 hours)
Languages: isiZulu, Setswana, Sesotho sa Leboa, Tshivenda, Xitsonga, isiXhosa, Sesotho
Includes: 8 pre-trained Whisper ASR models
Downloads: 451k+

Other options:

isiXhosa only: Vuk'uzenzele isiXhosa Speech (ViXSD)
11 SA languages: NCHLT Speech Corpus
Multi-language: Lwazi Speech Corpus

Pre-trained ASR Models

Available models (all on HuggingFace under dsfsi-anv):

Multilingual Whisper v3 Turbo (0.8B params, 7 languages)
Language-specific models: whisper-large-v3-turbo-anv-[zul|ven|tso|tsn|sot]
MMS-1B models (NCHLT & Lwazi variants)
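
The bracketed pattern above expands to one Hugging Face model ID per language code. The following is a minimal sketch, assuming the language-specific models sit under the `dsfsi-anv` organisation with exactly these suffixes (check the model cards to confirm) and that they load through the standard `transformers` speech-recognition pipeline. `anv_model_id` and `transcribe` are illustrative helper names, not part of any DSFSI package.

```python
# Three-letter language codes taken from the model naming pattern listed above.
ANV_LANGUAGE_CODES = ("zul", "ven", "tso", "tsn", "sot")


def anv_model_id(lang_code: str) -> str:
    """Build the assumed Hugging Face model ID for one language-specific model."""
    if lang_code not in ANV_LANGUAGE_CODES:
        raise ValueError(f"No language-specific model listed for {lang_code!r}")
    return f"dsfsi-anv/whisper-large-v3-turbo-anv-{lang_code}"


def transcribe(audio_path: str, lang_code: str) -> str:
    """Transcribe one audio file (the model is downloaded on first use)."""
    # Imported lazily so anv_model_id works without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=anv_model_id(lang_code))
    return asr(audio_path)["text"]
```

For example, `anv_model_id("zul")` gives the ID for the isiZulu model, which `transcribe` then passes to the pipeline together with a path to a local audio file.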

Finding Text Datasets by Language

isiZulu (Zulu)

Dataset	Type	Size	Platform
African Next Voices	Speech	3,000 hrs	HuggingFace
Vuk'uzenzele Corpus	Parallel text	2,200+ pairs	GitHub
IsiZulu News 2022	News text	-	GitHub
Umsuka Parallel Corpus	EN-ZU parallel	-	Zenodo
za-mafoko collections	Terminology	Multiple	HuggingFace

Setswana (Tswana)

Best starting point: PuoData

Downloads: 126k+
Includes: Pre-trained PuoBERTa model
Variants: PuoBERTa-News, PuoBERTa-NER, PuoBERTa-POS

Also available:

African Next Voices (speech)
Vuk'uzenzele Corpus (text)
za-mafoko terminology

isiXhosa (Xhosa)

African Next Voices (speech)
ViXSD (speech, dedicated)
Vuk'uzenzele Corpus (text)
za-mafoko terminology

Multiple SA Languages (11 languages)

Best option: Vuk'uzenzele Corpus

Languages: English, Afrikaans, isiZulu, isiXhosa, Sesotho, Sepedi (Sesotho sa Leboa), Setswana, isiNdebele, Siswati, Tshivenda, Xitsonga
Type: Parallel corpus from government magazine
Size: 2,200+ aligned sentence pairs
Use case: Machine translation, multilingual NLP

Government data: Gov-ZA Multilingual

Cabinet statements in multiple languages

For All Language Options

See the Datasets by Language section in the main README.

Finding Terminology Resources

za-mafoko Collection

The most comprehensive multilingual terminology resource for SA languages:

Dataset	Focus	Downloads	Best For
DSAC Terminology	Arts & Culture	15.6k	Cultural terms
StatsSA Terminology	Statistics	1.16k	Statistical terms
UP Glossary	Academic	1.77k	Academic terms
AI Glossary	AI/ML	85	Tech terms
OERTB Terminology	Education	5.74k	Educational terms

Portal: za-mafoko.dsfsi.co.za

Other Lexicon Resources

WordNets for SA Languages
Loughran McDonald-SA-2020 (financial sentiment)

Featured & Popular Datasets

Our top 5 most impactful datasets:

African Next Voices (ZA-ANV) - 451k downloads
- Speech recognition in 7 SA languages
za-mafoko Collection - 100k+ downloads
- Multilingual terminologies and glossaries
COVID-19 ZA - 256 stars
- Comprehensive pandemic data with dashboard
PuoData - 126k downloads
- Setswana corpus with BERT model
Vuk'uzenzele - 7 stars
- 11-language parallel corpus

Low-Resource Language Datasets

African Languages Beyond SA

Kinyarwanda & Kirundi: Cross-lingual models
40+ African languages: Masakhane MT
Multiple African languages: Pre-trained Embeddings
Code-switching: AfroCS-xs

Language-Specific Resources

Each of the 11 SA official languages has resources in:

Vuk'uzenzele parallel corpus
za-mafoko terminology
NCHLT speech corpus (where applicable)

Domain-Specific Datasets

Public Health

COVID-19 ZA

Provincial and district-level data (2020-present)
Cases, deaths, testing, vaccination, mobility
Live dashboard
64+ contributors, 256 stars

Legal Documents

ZASCA-Sum: Supreme Court judgments and summaries
State Capture Transcripts: Zondo Commission transcripts

Financial Data

JSE Top 40 Stock Data (Local)

Years: 2019, 2021, 2022, 2023, 2024
Daily closing prices (Jan-Apr each year)
Format: CSV and pickle
Access: Included in this repository

News & Media

Educational Datasets

Datasets curated for teaching and coursework:

COS781 (Customer Analytics)

Customer Segmentation Data
Hypermarket Dataset
Market Basket Optimisation
Online Retail II

COS802 (Text Analytics)

AG News Dataset

Location: data/cos781/ and data/cos802/ in this repository

Datasets for Specific Tasks

Machine Translation

Start here: Vuk'uzenzele Corpus

11 languages, 2,200+ aligned pairs

Also:

Text Classification

How to Access Datasets

From HuggingFace

from datasets import load_dataset

# Load a dataset
dataset = load_dataset("dsfsi/PuoData")

# Load with specific split
train_data = load_dataset("dsfsi-anv/za-african-next-voices", split="train")
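
Speech corpora of this size can be impractical to download in full. The sketch below uses the `streaming=True` option of `load_dataset`, which iterates over examples without fetching the whole dataset first. `take_examples` and `preview_anv` are illustrative names, and the dataset may require accepting terms or choosing a configuration on its Hugging Face page, so treat this as a starting point rather than a verified recipe.

```python
from itertools import islice


def take_examples(stream, n=3):
    """Return the first n items from any iterable, such as a streamed dataset."""
    return list(islice(stream, n))


def preview_anv(n=3):
    """Stream a few examples instead of downloading the full corpus."""
    from datasets import load_dataset  # imported lazily; needs network access

    stream = load_dataset(
        "dsfsi-anv/za-african-next-voices", split="train", streaming=True
    )
    return take_examples(stream, n)
```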

From GitHub

# Clone repository
git clone https://github.com/dsfsi/covid19za.git

# Or download a single file (data/file.csv is a placeholder; substitute a real path from the repository)
wget https://raw.githubusercontent.com/dsfsi/vukuzenzele-nlp/main/data/file.csv

From Zenodo

Click "Download" on the Zenodo page
Use DOI for citation in papers
Some datasets have APIs (check Zenodo docs)

Local Datasets (This Repo)

import pandas as pd

# JSE stock data
stocks = pd.read_csv("data/stocks/top40_jse_2024_performance.csv")

# Or pickle format (only load pickle files from sources you trust)
stocks_df = pd.read_pickle("data/stocks/top40_jse_2024_performance.df")

# Course data
customers = pd.read_csv("data/cos781/customer_segment_data.csv")
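
To compare years, the yearly stock files can be stacked into one frame. This is a minimal sketch, assuming every year listed under Financial Data follows the same `top40_jse_<year>_performance.csv` naming as the 2024 file and has compatible columns; neither assumption is confirmed here. `jse_csv_path` and `load_jse_years` are illustrative helpers.

```python
from pathlib import Path

# Years listed in the Financial Data section (2020 is not among them).
JSE_YEARS = (2019, 2021, 2022, 2023, 2024)


def jse_csv_path(year, root="data/stocks"):
    """Assumed file location for one year, modelled on the 2024 file name."""
    return Path(root) / f"top40_jse_{year}_performance.csv"


def load_jse_years(years=JSE_YEARS, root="data/stocks"):
    """Concatenate the yearly CSVs that exist, tagging each row with its year."""
    # Keep only the years whose file actually matches the assumed name.
    existing = [(year, jse_csv_path(year, root)) for year in years]
    existing = [(year, path) for year, path in existing if path.exists()]
    if not existing:
        raise FileNotFoundError(f"No JSE CSV files found under {root}")

    import pandas as pd  # imported here so the path helper works without pandas

    frames = [pd.read_csv(path).assign(year=year) for year, path in existing]
    return pd.concat(frames, ignore_index=True)
```

Years whose file is missing are skipped silently, so check the `year` column of the result to see which files were actually loaded.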

Programmatic Search

Use datasets_index.json for programmatic discovery:

import json

# Load registry
with open('datasets_index.json', encoding='utf-8') as f:
    registry = json.load(f)

# Find all speech datasets
speech = [d for d in registry['datasets']
          if d.get('category') == 'speech']

# Find datasets by language
zulu = [d for d in registry['datasets']
        if 'zu' in d.get('languages', [])]

# Find datasets on HuggingFace
hf_datasets = [d for d in registry['datasets']
               if d.get('platform') == 'huggingface']

# Find by tag
nlp_datasets = [d for d in registry['datasets']
                if 'nlp' in d.get('tags', [])]
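
The individual filters above can be folded into one helper. This is a minimal sketch, assuming each registry entry uses the `category`, `languages`, `platform` and `tags` keys shown in the snippets above. `find_datasets` is an illustrative name, not part of the repository, and the sample entries (including their `name` key) are made up for the demonstration.

```python
def find_datasets(registry, category=None, language=None, platform=None, tag=None):
    """Return registry entries matching every filter that is not None."""
    matches = []
    for entry in registry.get("datasets", []):
        if category is not None and entry.get("category") != category:
            continue
        if language is not None and language not in entry.get("languages", []):
            continue
        if platform is not None and entry.get("platform") != platform:
            continue
        if tag is not None and tag not in entry.get("tags", []):
            continue
        matches.append(entry)
    return matches


# Tiny in-memory registry with made-up entries, for illustration only.
sample_registry = {
    "datasets": [
        {"name": "example-speech", "category": "speech",
         "languages": ["zu", "xh"], "platform": "huggingface", "tags": ["asr"]},
        {"name": "example-text", "category": "text",
         "languages": ["zu"], "platform": "github", "tags": ["nlp"]},
    ]
}

zulu_speech = find_datasets(sample_registry, category="speech", language="zu")
print([d["name"] for d in zulu_speech])  # prints ['example-speech']
```

With the real registry, pass the dictionary loaded from `datasets_index.json` in place of `sample_registry`.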

Dataset Selection Checklist

When choosing a dataset, consider:

Language coverage: Does it include your target language(s)?
Size: Is it large enough for your task?
License: Can you use it for your purpose (research, commercial, etc.)?
Quality: Is it curated/validated? What's the error rate?
Documentation: Is there a paper or detailed documentation?
Format: Is it in a format you can work with?
Pre-trained models: Are there models already available?
Community: Is it actively maintained? Are others using it?
Citation: Is there proper citation information?

Need Help?

Can't find what you're looking for?

Check the main README: Complete catalog
Browse by category: See Datasets by Category
Browse by language: See Datasets by Language
Check publications: DSFSI Publications
Contact us: [email protected]
Follow us: Twitter/X | LinkedIn | Bluesky

Citing Datasets

When using DSFSI datasets in research:

Cite the specific dataset: Use DOI or URL from dataset page
Cite the registry: For general dataset discovery
Cite the paper: If there's an associated publication

Example citation for the registry:

@misc{dsfsi-datasets-2025,
  title={DSFSI Public Datasets Registry},
  author={{Data Science for Social Impact Research Group}},
  year={2025},
  publisher={University of Pretoria},
  url={https://github.com/dsfsi/dsfsi-datasets}
}

Contributing

Found a dataset we should include? See CONTRIBUTING.md for guidelines on submitting new datasets.

DSFSI Research Group | University of Pretoria | www.dsfsi.co.za
