Bulk metadata processing script using json-schema and strict author ID matching by nschneid · Pull Request #7517 · acl-org/acl-anthology

nschneid · 2026-02-14T14:43:51Z

Branches off of @weissenh's changes in #7395. The schema now allows for a deleted_authors entry for more explicit checking of the mapping between old and new authors.
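
For readers of this thread, here is a minimal sketch of the kind of check an explicit deleted_authors list makes possible. It is not the code in the script: the function name and the use of plain ID strings are assumptions for illustration, and only the deleted_authors field name comes from this PR.

```python
def check_author_mapping(old_ids, new_ids, deleted_ids):
    """Return a list of problems; an empty list means the mapping is consistent.

    Hypothetical helper: every author in the old list must either survive
    into the new list or be named explicitly in deleted_authors.
    """
    old, new, deleted = set(old_ids), set(new_ids), set(deleted_ids)
    problems = []
    # A deletion can only refer to an author who was there before.
    for author in sorted(deleted - old):
        problems.append(f"deleted author {author!r} was not in the original list")
    # An author cannot be deleted and still present.
    for author in sorted(deleted & new):
        problems.append(f"author {author!r} is listed as both deleted and present")
    # Silent disappearance is the case the explicit list is meant to catch.
    for author in sorted(old - new - deleted):
        problems.append(
            f"author {author!r} disappeared without being listed in deleted_authors"
        )
    return problems
```

Newly added authors (IDs in the new list but not the old one) are deliberately allowed in this sketch; only removals have to be declared.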

#7642 is the accompanying front-end change (dialog stores more explicit info in JSON).

closes #7274
closes #6327

github-actions · 2026-02-14T15:03:12Z

Build successful. Some useful links:

Complete site preview: https://preview.aclanthology.org/json-schema
Potential changes of interest:

This preview will be removed when the branch is merged.

nschneid · 2026-02-15T03:53:40Z

I merged master into here so I could test updates against the current database. The only actual changes are to requirements.txt (adding jsonschema) and process_bulk_metadata.py.

- argparse to docopt
- XML validation for abstract and title input
- improved branch switching (stash, index computation)
- simplified logic merge authors
- changed logic match authors (2-step)
- more input validation on JSON data

Still needs testing
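
As an illustration of the "XML validation for abstract and title input" item, a well-formedness check could look roughly like this. This is a sketch using only the standard library; the function name and the wrapping element are assumptions, and the script itself may validate differently (for example, against the Anthology's allowed markup).

```python
import xml.etree.ElementTree as ET


def is_well_formed_fragment(text: str) -> bool:
    """Return True if `text` parses as XML content inside a wrapper element.

    Hypothetical helper: titles and abstracts may contain inline markup, so
    the fragment is wrapped in a dummy root before parsing.
    """
    try:
        ET.fromstring(f"<root>{text}</root>")
    except ET.ParseError:
        return False
    return True
```

This only checks well-formedness (balanced tags, escaped ampersands); it does not check which tags are allowed.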

nschneid · 2026-03-02T19:28:25Z

Renamed and moved to a subdirectory: bin/correct/bulk_process_metadata.py

The script has been working fine for me in --dry-run mode (it makes the commits, but I create the PRs manually). I propose we merge to master and then add other data correction scripts that use the new library.

mbollmann

I re-checked the parts that interact with the library and those LGTM. Importantly, I did not check the author matching logic from the issue JSON, but we probably shouldn’t aim to do that in a code review anyway; that is better covered by writing test cases first and foremost. Maybe we can add those soon?

bin/process_bulk_metadata.py

bin/correct/bulk_process_metadata.py

nschneid · 2026-03-03T15:45:40Z

Merging this version. Agreed that tests would be great to have!

nschneid changed the base branch from master to update-script-process-bulk-metadata February 14, 2026 14:44

nschneid mentioned this pull request Feb 14, 2026

Update bin/process_bulk_metadata.py to use library #7395

Draft

16 tasks

nschneid mentioned this pull request Feb 15, 2026

Bulk corrections 2026-02-14 #7518

Merged

weissenh and others added 21 commits March 2, 2026 14:16

Change from print to stderr to use of logger

6799900

WIP: update bulk processing script to new library

c3a6345

validate with json-schema; explicit deleted_authors list

127976a

black

95d60e3

new logic for author list changes

ac59fb9

fixes

749a2b5

editors; don't switch branch back

eb1d717

remove old code replaced by new logic

fa79f1b

try explicit load after save

3dce08a

add name variant if necessary

931b233

digits allowed in name slug e.g. https://aclanthology.org/people/mohammed-alshakhori1/unverified/

d60d449

don't crash on JSON schema validation error

06deb9e

support old-style anthology IDs

5127202

handle namespec lookup if author is duplicated

6bfdd60

black

2c2cfe8

load anthology AFTER switching branches (before will roll back people.yaml changes if running the script consecutively)

23a8ddc

try without extra reload

3a3b69f

resolve_namespec() -> get_by_namespec()

95ee12b

remove commented code

b72f9b9

anthology.resolve(); unused imports

d31e2b8
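
Two of the commits above relax identifier handling: digits in name slugs and old-style Anthology IDs. The following is a rough sketch of what such checks could look like. The regular expressions and function names are assumptions made for illustration, not the patterns used by the script or the library, and the ID patterns only approximate the two formats.

```python
import re

# Hypothetical patterns, NOT copied from the script or the library.
# Slug: lowercase alphanumeric chunks joined by single hyphens; digits allowed.
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
# Old-style ID, e.g. "P19-1001": letter, two-digit year, hyphen, four digits.
OLD_STYLE_ID_RE = re.compile(r"[A-Z]\d{2}-\d{4}")
# New-style ID, e.g. "2020.acl-main.1": year, venue-volume, paper number.
NEW_STYLE_ID_RE = re.compile(r"\d{4}\.[a-z0-9_]+-[a-z0-9_]+\.\d+")


def is_valid_slug(slug: str) -> bool:
    """Return True if the whole string matches the assumed slug pattern."""
    return SLUG_RE.fullmatch(slug) is not None


def id_style(anthology_id: str):
    """Classify an ID as "old", "new", or None if it matches neither pattern."""
    if OLD_STYLE_ID_RE.fullmatch(anthology_id):
        return "old"
    if NEW_STYLE_ID_RE.fullmatch(anthology_id):
        return "new"
    return None
```

With these assumed patterns, the slug from the commit message (mohammed-alshakhori1) is accepted because digits are allowed alongside lowercase letters.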

nschneid force-pushed the json-schema branch from bd81b7e to d31e2b8 March 2, 2026 19:17

nschneid changed the base branch from update-script-process-bulk-metadata to master March 2, 2026 19:18

nschneid changed the title ~~Change JSON validation to use json-schema~~ Bulk metadata processing script using json-schema and strict author ID matching Mar 2, 2026

rename and move to subdirectory

3e6886e

nschneid marked this pull request as ready for review March 2, 2026 19:33

nschneid mentioned this pull request Mar 2, 2026

scripts for manual author page corrections with new library #7653

Open

mbollmann approved these changes Mar 2, 2026

repo path, logging

9ade3ec

mbollmann reviewed Mar 3, 2026

repo path, add_name() tweaks

ff974bf

nschneid merged commit b84019c into master Mar 3, 2026
2 checks passed

nschneid merged 24 commits into master from json-schema

3 participants
