Public risk dataset bundles for LinkProof.
- Manifest:
https://raw.githubusercontent.com/Openclaw-Metis/linkproof-datasets/main/manifest.json - Dataset:
https://raw.githubusercontent.com/Openclaw-Metis/linkproof-datasets/main/scam-datasets.json
scam-datasets.json: LinkProof risk dataset consumed by the mobile apps.manifest.json: Version, checksum, and dataset URL used by app update checks.publications.json: Machine-readable release history for generated dataset bundles.CHANGELOG.md: Human-readable summary of published dataset changes.SOURCES.md: Official source list and normalization rules.THREAT_SOURCE_EVALUATION.md: Licensing and product-fit decisions for current and rejected threat sources.scripts/build_dataset.py: Fetches public threat feeds and rebuildsscam-datasets.json.scripts/fetch_phishtank.py: Fetches and normalizes the PhishTank online-valid phishing feed.scripts/merge_sources.py: Applies cross-source dedupe and priority rules.scripts/normalize_domain.py: Shared dataset-domain and path normalization helper.scripts/spike_urlhaus_overlap.py: Manual URLhaus overlap spike. It is not used by production refresh.scripts/requirements.txt: Python dependencies for dataset builds and tests.scripts/update_manifest.py: Validates the dataset and regeneratesmanifest.json.sources/: Generated per-source normalized records used for auditability.
scam-datasets.json uses dataset schemaVersion 2. Shared risk metadata is stored once in sources; each record references a source by sourceID.
{
"schemaVersion": 2,
"bundleVersion": "2026.05.17.gov2.9a67d8617efc",
"fetchedAt": "2026-05-17T00:00:00Z",
"sources": [
{
"id": "src_example",
"riskLevel": "confirmedScam",
"sourceName": {
"zhTW": "官方來源",
"enUS": "Official source"
},
"sourceURL": "https://data.gov.tw/dataset/example",
"category": {
"zhTW": "涉詐網域",
"enUS": "Scam domain"
}
}
],
"records": [
{
"domain": "example.test",
"sourceID": "src_example",
"datasetDate": "2026-05-17"
}
]
}-
Install dataset build dependencies:
python -m pip install -r scripts/requirements.txt
-
Rebuild the dataset from public sources:
python scripts/build_dataset.py
PhishTank is fetched from its public
online-valid.json.bz2dump by default, withonline-valid.csv.gzas a public fallback when the JSON dump is rate-limited. IfPHISHTANK_API_KEYexists in GitHub Secrets, the scheduled build uses the keyed URL for higher rate-limit headroom. If PhishTank is temporarily unavailable, the build logs a warning and reuses the last cached PhishTank source file when available; without a cache, it still publishes the government-sourced dataset. For explicit government-only local validation, runpython scripts/build_dataset.py --skip-phishtank.The build fails if the generated record count drops more than 20% from the committed dataset, which protects against partial upstream outages. The same guard is applied per source once
sourceStatsexists inpublications.json; use--allow-source-drop <source-id>only for an intentional source reset. -
Regenerate the manifest:
python scripts/update_manifest.py
-
Commit
scam-datasets.json,manifest.json,publications.json,CHANGELOG.md, and changed files insources/. -
Push to
main.
The scheduled Refresh public dataset GitHub Actions workflow runs the same build and manifest commands daily at 02:20 Asia/Taipei and commits only when generated files change. Each changed build prepends a release record to publications.json and regenerates CHANGELOG.md.
Run:
python -m pip install -r scripts/requirements.txt
python -m py_compile scripts/build_dataset.py scripts/update_manifest.py scripts/fetch_phishtank.py scripts/merge_sources.py scripts/normalize_domain.py
python -m pytest tests/
python scripts/update_manifest.py
python -m json.tool publications.json > /dev/null
git diff --exit-code manifest.jsonIf manifest.json changes, commit the updated checksum with the dataset.
The validation workflow verifies that the dataset is structurally valid and that manifest.json contains the current SHA-256 checksum.
URLhaus is intentionally not a production source because its commercial-use terms are not clean enough for LinkProof's possible future monetization path. For a local data-quality spike only, run:
python scripts/spike_urlhaus_overlap.py --urlhaus-text tests/fixtures/urlhaus_text_sample.txtLive URLhaus evaluation requires an explicit opt-in flag:
python scripts/spike_urlhaus_overlap.py --fetch-live --sample-size 100Do not wire spike scripts into .github/workflows/refresh.yml or scripts/build_dataset.py without first updating THREAT_SOURCE_EVALUATION.md.