Initially, the news crawler was seeded with news site URLs taken from DMOZ (see #8 for the procedure). DMOZ is no longer updated, but Wikidata could serve as a replacement source to complete the seed list:
- select all instances of newspaper (news media, or similar) having an official website:
SELECT DISTINCT ?item ?itemLabel ?lang ?url
WHERE
{
  ?item wdt:P31/wdt:P279* wd:Q11032.
  ?item wdt:P856 ?url. # with official website
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,ru,fr,es,it,ja,zh,*" }
  OPTIONAL {
    ?item wdt:P407 ?language.
    ?language wdt:P220 ?lang.
  }
}
LIMIT 50
(execute query on Wikidata query service)
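To turn the query result into seed URLs, something like the following might work. This is only a rough sketch, not the procedure from #8: it assumes the public Wikidata SPARQL endpoint at `https://query.wikidata.org/sparql` and the standard SPARQL JSON results format. The function names (`build_request`, `extract_seeds`, `fetch_seeds`) and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint (assumed to be the right target here).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Same query as above; raise or drop LIMIT for a full seed list.
QUERY = """
SELECT DISTINCT ?item ?itemLabel ?lang ?url
WHERE
{
  ?item wdt:P31/wdt:P279* wd:Q11032.
  ?item wdt:P856 ?url.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,ru,fr,es,it,ja,zh,*" }
  OPTIONAL {
    ?item wdt:P407 ?language.
    ?language wdt:P220 ?lang.
  }
}
LIMIT 50
"""


def build_request(query, endpoint=WDQS_ENDPOINT):
    """Build a GET request asking the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {
        "Accept": "application/sparql-results+json",
        # Placeholder User-Agent; replace with a real contact before use.
        "User-Agent": "news-seed-sketch/0.1 (hypothetical)",
    }
    return urllib.request.Request(endpoint + "?" + params, headers=headers)


def extract_seeds(results):
    """Map each official website URL to the set of language codes seen for it."""
    seeds = {}
    for row in results["results"]["bindings"]:
        url = row["url"]["value"]
        langs = seeds.setdefault(url, set())
        # ?lang comes from the OPTIONAL block, so it may be absent.
        if "lang" in row:
            langs.add(row["lang"]["value"])
    return seeds


def fetch_seeds(query=QUERY):
    """Run the query over the network and return the extracted seeds."""
    with urllib.request.urlopen(build_request(query), timeout=60) as resp:
        return extract_seeds(json.load(resp))
```

Calling `fetch_seeds()` needs network access. `extract_seeds` can be tried offline on a saved JSON response, and the URLs it returns would still need the usual cleanup and filtering before going into the seed list.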