Scraping WDumps data

To better understand which entity data dump subsets our users are interested in, this repository scrapes all dump subsets listed under "recent dumps". Each scraped entry includes a JSON representation of the filters that were used to generate the dump.

The notebook generates a CSV file that presents the filter data in human-readable form. Each row of the CSV contains the following columns:

dump name
URL
filter (in human-readable form including labels for any items and properties used)
statements included in the dump (in human-readable form)
labels (yes/no)
descriptions (yes/no)
aliases (yes/no)
sitelinks (yes/no)
languages
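
As a rough illustration of the row layout described above, the following sketch writes one record with these columns using Python's standard `csv` module. The header strings, the helper names (`yes_no`, `write_rows`), and the sample record are assumptions made for this example; the notebook's actual column names and values may differ.

```python
import csv
import io

# Hypothetical header strings mirroring the column list above;
# the notebook's real headers may be spelled differently.
COLUMNS = [
    "dump name", "URL", "filter", "statements",
    "labels", "descriptions", "aliases", "sitelinks", "languages",
]


def yes_no(flag: bool) -> str:
    """Render a boolean as the yes/no value used in the CSV."""
    return "yes" if flag else "no"


def write_rows(rows, out) -> None:
    """Write a header line followed by one CSV line per row dict."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# A made-up record for illustration only; it is not real scraped data.
example = {
    "dump name": "example dump",
    "URL": "https://example.org/dump/1",
    "filter": "instance of (P31): human (Q5)",
    "statements": "all",
    "labels": yes_no(True),
    "descriptions": yes_no(True),
    "aliases": yes_no(False),
    "sitelinks": yes_no(False),
    "languages": "en, de",
}

buffer = io.StringIO()
write_rows([example], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of file paths; passing an open file object to `write_rows` instead would produce a file on disk. Values that contain commas, such as the language list here, are quoted automatically by the `csv` module.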

Development

Install Dependencies for Package

pip install -e ".[dev]"
