DSFSI Datasets Discovery Guide

A practical guide to finding and using the right dataset from the DSFSI collection.

Quick Start

Looking for something specific?

Finding Speech Datasets

For Automatic Speech Recognition (ASR)

Start here: African Next Voices (ZA-ANV)

  • Why: Largest SA multilingual speech dataset (3,000 hours)
  • Languages: isiZulu, Setswana, Sesotho sa Leboa, Tshivenda, Xitsonga, isiXhosa, Sesotho
  • Includes: 8 pre-trained Whisper ASR models
  • Downloads: 451k+

Other options:

Pre-trained ASR Models

Available models (all on HuggingFace under dsfsi-anv):

  • Multilingual Whisper v3 Turbo (0.8B params, 7 languages)
  • Language-specific models: whisper-large-v3-turbo-anv-[zul|ven|tso|tsn|sot]
  • MMS-1B models (NCHLT & Lwazi variants)
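The naming pattern for the language-specific models can be expanded into full model identifiers. The sketch below assumes the models are published under the `dsfsi-anv` organisation with exactly the suffixes listed above; `anv_model_id` and `load_asr` are hypothetical helper names, and `load_asr` needs the Hugging Face `transformers` library installed.

```python
# Language suffixes taken from the naming pattern listed above.
ANV_SUFFIXES = ("zul", "ven", "tso", "tsn", "sot")


def anv_model_id(suffix, org="dsfsi-anv"):
    """Build a model ID from the documented pattern (assumed to match the Hub names)."""
    if suffix not in ANV_SUFFIXES:
        raise ValueError(f"unknown suffix {suffix!r}; expected one of {ANV_SUFFIXES}")
    return f"{org}/whisper-large-v3-turbo-anv-{suffix}"


def load_asr(suffix):
    """Create a speech-recognition pipeline; the model is downloaded on first use."""
    # Imported lazily so anv_model_id() works without transformers installed.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=anv_model_id(suffix))


# All language-specific model IDs implied by the pattern.
MODEL_IDS = [anv_model_id(s) for s in ANV_SUFFIXES]
```

With a local recording (for example a hypothetical `clip.wav`), `load_asr("zul")("clip.wav")["text"]` would return the transcription, provided the model ID resolves on the Hub.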

Finding Text Datasets by Language

isiZulu (Zulu)

| Dataset | Type | Size | Platform |
|---|---|---|---|
| African Next Voices | Speech | 3,000 hrs | HuggingFace |
| Vuk'uzenzele Corpus | Parallel text | 2,200+ pairs | GitHub |
| IsiZulu News 2022 | News text | - | GitHub |
| Umsuka Parallel Corpus | EN-ZU parallel | - | Zenodo |
| za-mafoko collections | Terminology | Multiple | HuggingFace |

Setswana (Tswana)

Best starting point: PuoData

  • Downloads: 126k+
  • Includes: Pre-trained PuoBERTa model
  • Variants: PuoBERTa-News, PuoBERTa-NER, PuoBERTa-POS

Also available:

  • African Next Voices (speech)
  • Vuk'uzenzele Corpus (text)
  • za-mafoko terminology

isiXhosa (Xhosa)

Multiple SA Languages (11 languages)

Best option: Vuk'uzenzele Corpus

  • Languages: English, Afrikaans, isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele, Siswati, Tshivenda, Xitsonga
  • Type: Parallel corpus from government magazine
  • Size: 2,200+ aligned sentence pairs
  • Use case: Machine translation, multilingual NLP

Government data: Gov-ZA Multilingual

  • Cabinet statements in multiple languages

For All Language Options

See the Datasets by Language section in the main README.

Finding Terminology Resources

za-mafoko Collection

The most comprehensive multilingual terminology resource for SA languages:

| Dataset | Focus | Downloads | Best For |
|---|---|---|---|
| DSAC Terminology | Arts & Culture | 15.6k | Cultural terms |
| StatsSA Terminology | Statistics | 1.16k | Statistical terms |
| UP Glossary | Academic | 1.77k | Academic terms |
| AI Glossary | AI/ML | 85 | Tech terms |
| OERTB Terminology | Education | 5.74k | Educational terms |

Portal: za-mafoko.dsfsi.co.za

Other Lexicon Resources

Featured & Popular Datasets

Our top 5 most impactful datasets:

  1. African Next Voices (ZA-ANV) - 451k+ downloads
    • Speech recognition in 7 SA languages
  2. za-mafoko Collection - 100k+ downloads
    • Multilingual terminologies and glossaries
  3. COVID-19 ZA - 256 stars
    • Comprehensive pandemic data with dashboard
  4. PuoData - 126k+ downloads
    • Setswana corpus with BERT model
  5. Vuk'uzenzele - 7 stars
    • 11-language parallel corpus

Low-Resource Language Datasets

African Languages Beyond SA

Language-Specific Resources

Each of the 11 SA official languages has resources in:

  • Vuk'uzenzele parallel corpus
  • za-mafoko terminology
  • NCHLT speech corpus (where applicable)

Domain-Specific Datasets

Public Health

COVID-19 ZA

  • Provincial and district-level data (2020-present)
  • Cases, deaths, testing, vaccination, mobility
  • Live dashboard
  • 64+ contributors, 256 stars

Legal Documents

Financial Data

JSE Top 40 Stock Data (Local)

  • Years: 2019, 2021, 2022, 2023, 2024
  • Daily closing prices (Jan-Apr each year)
  • Format: CSV and pickle
  • Access: Included in this repository

News & Media

Educational Datasets

Datasets curated for teaching and coursework:

COS781 (Customer Analytics)

  • Customer Segmentation Data
  • Hypermarket Dataset
  • Market Basket Optimisation
  • Online Retail II

COS802 (Text Analytics)

  • AG News Dataset

Location: data/cos781/ and data/cos802/ in this repository

Datasets for Specific Tasks

Machine Translation

Start here: Vuk'uzenzele Corpus

  • 11 languages, 2,200+ aligned pairs

Also:

Text Classification

Named Entity Recognition (NER)

Part-of-Speech Tagging

Embeddings & Pre-training

How to Access Datasets

From HuggingFace

from datasets import load_dataset

# Load a dataset
dataset = load_dataset("dsfsi/PuoData")

# Load with specific split
train_data = load_dataset("dsfsi-anv/za-african-next-voices", split="train")
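Large corpora such as the 3,000-hour speech dataset can be inspected before committing to a full download. The sketch below uses the `streaming=True` option of `load_dataset` to read only the first few records; `take` and `peek` are hypothetical helper names, and a dataset may also require a configuration name that this guide does not list.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


def peek(dataset_id, n=3, split="train"):
    """Stream the first n records of a Hub dataset without downloading it in full."""
    # Imported lazily so take() can be used without the datasets library.
    from datasets import load_dataset

    stream = load_dataset(dataset_id, split=split, streaming=True)
    return take(stream, n)
```

For example, `peek("dsfsi/PuoData")` would return up to three records as dictionaries, assuming the dataset exposes a `train` split.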

From GitHub

# Clone repository
git clone https://github.com/dsfsi/covid19za.git

# Or download a single file (data/file.csv is a placeholder; substitute the real path)
wget https://raw.githubusercontent.com/dsfsi/vukuzenzele-nlp/main/data/file.csv
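The same single-file download can be done from Python with the standard library. This sketch builds a raw-file URL from its parts and saves the file locally; `raw_url` and `download` are hypothetical helper names, and `data/file.csv` is the placeholder path from the command above, not a real file.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve


def raw_url(owner, repo, path, branch="main"):
    """Build a raw.githubusercontent.com URL for one file in a repository."""
    clean_path = quote(path.lstrip("/"))  # percent-encode spaces etc., keep slashes
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{clean_path}"


def download(owner, repo, path, dest_dir=".", branch="main"):
    """Download one file into dest_dir and return the local path."""
    target = Path(dest_dir) / Path(path).name
    urlretrieve(raw_url(owner, repo, path, branch), str(target))
    return target


# Reproduces the URL used in the wget example (placeholder path).
example = raw_url("dsfsi", "vukuzenzele-nlp", "data/file.csv")
```

The default branch is assumed to be `main`; pass `branch=` explicitly for repositories that use a different name.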

From Zenodo

  • Click "Download" on the Zenodo page
  • Use DOI for citation in papers
  • Some datasets have APIs (check Zenodo docs)
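For scripted access, Zenodo DOIs take the form `10.5281/zenodo.<record id>`, and record metadata is served as JSON from `https://zenodo.org/api/records/<record id>`. The sketch below derives that URL from a DOI. The function names are hypothetical, and the layout of the returned JSON is deliberately not assumed here; consult the Zenodo API documentation before relying on specific keys.

```python
import json
import re
from urllib.request import urlopen

_ZENODO_DOI = re.compile(r"^10\.5281/zenodo\.(\d+)$")


def zenodo_record_id(doi):
    """Extract the numeric record ID from a Zenodo DOI."""
    match = _ZENODO_DOI.match(doi.strip())
    if match is None:
        raise ValueError(f"not a Zenodo DOI: {doi!r}")
    return match.group(1)


def zenodo_api_url(doi):
    """Return the REST API URL for the record behind a Zenodo DOI."""
    return f"https://zenodo.org/api/records/{zenodo_record_id(doi)}"


def fetch_record(doi):
    """Fetch the record metadata as a dictionary (requires network access)."""
    with urlopen(zenodo_api_url(doi)) as response:
        return json.load(response)
```

For a made-up DOI such as `10.5281/zenodo.1234567`, `zenodo_api_url` yields `https://zenodo.org/api/records/1234567`.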

Local Datasets (This Repo)

import pandas as pd

# JSE stock data
stocks = pd.read_csv("data/stocks/top40_jse_2024_performance.csv")

# Or pickle format
stocks_df = pd.read_pickle("data/stocks/top40_jse_2024_performance.df")

# Course data
customers = pd.read_csv("data/cos781/customer_segment_data.csv")

Programmatic Search

Use datasets_index.json for programmatic discovery:

import json

# Load registry (run from the repository root, where datasets_index.json lives)
with open('datasets_index.json', encoding='utf-8') as f:
    registry = json.load(f)

# Find all speech datasets
speech = [d for d in registry['datasets']
          if d.get('category') == 'speech']

# Find datasets by language
zulu = [d for d in registry['datasets']
        if 'zu' in d.get('languages', [])]

# Find datasets on HuggingFace
hf_datasets = [d for d in registry['datasets']
               if d.get('platform') == 'huggingface']

# Find by tag
nlp_datasets = [d for d in registry['datasets']
                if 'nlp' in d.get('tags', [])]
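The individual filters can be combined into one helper. This sketch assumes the registry layout used in the examples above (a top-level `datasets` list whose entries may carry `category`, `languages`, `platform` and `tags`); `find_datasets` and the sample entries are hypothetical and exist only to show the behaviour.

```python
def find_datasets(registry, category=None, language=None, platform=None, tag=None):
    """Return the entries that match every criterion that is not None."""
    results = []
    for entry in registry.get("datasets", []):
        if category is not None and entry.get("category") != category:
            continue
        if language is not None and language not in entry.get("languages", []):
            continue
        if platform is not None and entry.get("platform") != platform:
            continue
        if tag is not None and tag not in entry.get("tags", []):
            continue
        results.append(entry)
    return results


# Hypothetical in-memory registry, used only to demonstrate the helper.
sample = {
    "datasets": [
        {"name": "a", "category": "speech", "languages": ["zu", "tn"],
         "platform": "huggingface"},
        {"name": "b", "category": "text", "languages": ["zu"],
         "platform": "github", "tags": ["nlp"]},
    ]
}

zulu_speech = find_datasets(sample, category="speech", language="zu")
print([d["name"] for d in zulu_speech])  # prints ['a']
```

Entries with a missing key are treated as non-matching for that criterion, which mirrors the `d.get(...)` defaults in the list comprehensions.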

Dataset Selection Checklist

When choosing a dataset, consider:

  • Language coverage: Does it include your target language(s)?
  • Size: Is it large enough for your task?
  • License: Can you use it for your purpose (research, commercial, etc.)?
  • Quality: Is it curated/validated? What's the error rate?
  • Documentation: Is there a paper or detailed documentation?
  • Format: Is it in a format you can work with?
  • Pre-trained models: Are there models already available?
  • Community: Is it actively maintained? Are others using it?
  • Citation: Is there proper citation information?

Need Help?

Can't find what you're looking for?

  1. Check the main README: Complete catalog
  2. Browse by category: See Datasets by Category
  3. Browse by language: See Datasets by Language
  4. Check publications: DSFSI Publications
  5. Contact us: [email protected]
  6. Follow us: Twitter/X | LinkedIn | Bluesky

Citing Datasets

When using DSFSI datasets in research:

  1. Cite the specific dataset: Use DOI or URL from dataset page
  2. Cite the registry: For general dataset discovery
  3. Cite the paper: If there's an associated publication

Example citation for the registry:

@misc{dsfsi-datasets-2025,
  title={DSFSI Public Datasets Registry},
  author={{Data Science for Social Impact Research Group}},
  year={2025},
  publisher={University of Pretoria},
  url={https://github.com/dsfsi/dsfsi-datasets}
}
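When several datasets need citing, entries in the same layout can be generated from a handful of fields. This is a minimal sketch: `bibtex_misc` is a hypothetical helper, it emits only the fields shown in the registry citation, and it does not escape special BibTeX characters.

```python
def bibtex_misc(key, title, author, year, publisher, url):
    """Format a @misc entry in the same layout as the registry citation."""
    fields = [
        ("title", title),
        ("author", "{" + author + "}"),  # extra braces keep a group name intact
        ("year", str(year)),
        ("publisher", publisher),
        ("url", url),
    ]
    body = ",\n".join(f"  {name}={{{value}}}" for name, value in fields)
    return f"@misc{{{key},\n{body}\n}}"


# Rebuilds the registry citation from its fields.
entry = bibtex_misc(
    key="dsfsi-datasets-2025",
    title="DSFSI Public Datasets Registry",
    author="Data Science for Social Impact Research Group",
    year=2025,
    publisher="University of Pretoria",
    url="https://github.com/dsfsi/dsfsi-datasets",
)
```

For an individual dataset, replace the fields with the values given on that dataset's page, preferring its DOI-based URL where one exists.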

Contributing

Found a dataset we should include? See CONTRIBUTING.md for guidelines on submitting new datasets.


DSFSI Research Group | University of Pretoria | www.dsfsi.co.za