Port ingestion scripts by mjpost · Pull Request #7539 · acl-org/acl-anthology

mjpost · 2026-02-18T14:43:44Z

No description provided.

github-actions · 2026-02-18T14:56:31Z

Build successful. Some useful links:

Complete site preview: https://preview.aclanthology.org/port-ingestion-scripts
Potential changes of interest: 2005.eamt-1, 2005.iwslt-1, 2007.mtsummit-invited, 2007.mtsummit-papers, 2007.mtsummit-tutorials, 2007.mtsummit-wpt, 2007.mtsummit-cre, 2007.mtsummit-ucnlg, 2007.mtsummit-aptme, 2011.mtsummit-plenaries, 2011.mtsummit-papers, 2011.mtsummit-systems, 2011.mtsummit-tutorials, 2011.mtsummit-wpt, 2020.acl-main, 2020.acl-demos, 2020.acl-srw, 2020.acl-tutorials, 2020.coling-main, 2020.coling-demos, 2020.coling-tutorials, 2020.coling-industry, 2020.emnlp-main, 2020.emnlp-demos, 2020.emnlp-tutorials, 2020.intellang-1, 2020.lrec-1, 2021.acl-long, 2021.acl-short, 2021.acl-srw, 2021.acl-demo, 2021.acl-tutorials, 2021.clicit-1, 2021.cmcl-1, 2021.emnlp-main, 2021.emnlp-demo, 2021.emnlp-tutorials, 2021.findings-acl, 2021.findings-emnlp, 2021.law-1, 2021.naacl-main, 2021.naacl-demos, 2021.naacl-srw, 2021.naacl-tutorials, 2021.naacl-industry, 2021.tacl-1, 2022.aacl-main, 2022.aacl-short, 2022.aacl-srw, 2022.aacl-demo, (plus 632 more...)

This preview will be removed when the branch is merged.

data/yaml/sigs/semitic.yaml

bin/ingest.py

mjpost · 2026-02-18T18:32:31Z

FYI, some time ago we started stripping middle names in the ingestion scripts, since including them led to a flood of corrections. I've reversed this, because stripping them is the wrong solution (we shouldn't be second-guessing the input) and we have better name resolution now.

mjpost · 2026-02-18T20:05:10Z

Okay, I've now merged the two scripts into one. The particular format is abstracted away as (a) reading metadata and (b) producing a stream over incoming papers.
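
A minimal sketch of that abstraction, assuming a hypothetical `IngestionSource` protocol with one method per responsibility. None of the class, method, or field names below are taken from `bin/ingest.py`; they only illustrate how a format can be reduced to "read metadata" plus "stream papers".

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterator, List, Protocol


@dataclass
class IncomingPaper:
    """Hypothetical, format-independent view of one paper to ingest."""

    title: str
    authors: List[str]


class IngestionSource(Protocol):
    """Hypothetical interface: all the shared ingestion code needs from a format."""

    def read_metadata(self) -> Dict[str, Any]: ...

    def papers(self) -> Iterator[IncomingPaper]: ...


class InMemorySource:
    """Toy source standing in for an ACLPUB or aclpub2 reader."""

    def __init__(self, meta: Dict[str, Any], rows: List[Dict[str, Any]]) -> None:
        self._meta = meta
        self._rows = rows

    def read_metadata(self) -> Dict[str, Any]:
        return dict(self._meta)

    def papers(self) -> Iterator[IncomingPaper]:
        # Yield papers one at a time so the driver never sees the raw format.
        for row in self._rows:
            yield IncomingPaper(title=row["title"], authors=list(row["authors"]))


def ingest(source: IngestionSource) -> List[str]:
    """Format-agnostic driver: it only calls the two protocol methods."""
    meta = source.read_metadata()
    return [f"{meta['volume']}: {paper.title}" for paper in source.papers()]
```

With this shape, supporting another format would only mean writing another class that provides the same two methods; the driver stays unchanged.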

mjpost · 2026-02-18T20:05:23Z

I've used this for both types of ingestion. It's very, very nice!

mjpost · 2026-02-18T20:08:06Z

Later we can also merge in #7483.

mbollmann

Cool! There’s a lot going on here, so I have a lot of comments 😅

Mostly these are proposals to move things into the library, or notes on how to make better use of it. I also think there are a lot of small functions in here that seem unnecessary, and much of the code would benefit from some explanation.

bin/ingest.py

mbollmann · 2026-02-21T10:43:15Z

bin/ingest.py

+def add_page_numbers(
+    papers: List[Dict[str, Any]], ingestion_dir: str
+) -> List[Dict[str, Any]]:


This is a side note, but: If we add type annotations to a new/updated script, it would be nice if they were actually typechecked... Currently, the type checker only runs on the library code. I wonder if we could make it selectively check scripts that we mark as having type annotations.

mbollmann · 2026-02-21T10:46:01Z

bin/ingest.py

+
+
+def correct_names(author: Dict[str, Any]) -> Dict[str, Any]:
+    if author.get("middle_name") is not None and author["middle_name"].lower() == "de":


Should this only be "de"? I copied this list from David’s ingestion script:

{"al", "bin", "bint", "da", "de", "del", "di", "dos", "du", "la", "le", "van", "von"}
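
A sketch of what checking against the full set could look like. The diff only shows the condition, so the handling here (moving the particle to the front of the last name) is an assumption, as are the `last_name` key and the function name.

```python
from typing import Dict, Optional

# Particle list quoted in the review comment.
NAME_PARTICLES = {
    "al", "bin", "bint", "da", "de", "del", "di",
    "dos", "du", "la", "le", "van", "von",
}


def move_particle_to_last_name(
    author: Dict[str, Optional[str]],
) -> Dict[str, Optional[str]]:
    """Assumed behavior: a middle name that is only a particle joins the last name."""
    middle = author.get("middle_name")
    if middle is not None and middle.lower() in NAME_PARTICLES:
        fixed = dict(author)  # copy, so the caller's dict is left untouched
        fixed["last_name"] = f"{middle} {author['last_name']}"
        fixed["middle_name"] = None
        return fixed
    return author
```

The membership test replaces the single `== "de"` comparison, so extending the list later does not require touching the logic.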

mbollmann · 2026-02-21T10:47:46Z

bin/ingest.py

+def correct_caps(name: Optional[str]) -> Optional[str]:
+    """
+    Many people submit their names in "ALL CAPS" or "all lowercase".
+    Correct this with heuristics.
+    """
+    if name is None:
+        return None
+
+    if name.islower() or name.isupper():
+        corrected = " ".join(part.capitalize() for part in name.split())


I think name.title() is better as it also capitalizes after hyphens etc.; but see #7570 for a proposal to put this functionality into the library.
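
To illustrate the difference, here is a small comparison of the two approaches on a hyphenated name. The helper names are made up for this example.

```python
def capitalize_parts(name: str) -> str:
    """Approach from the diff: capitalize each whitespace-separated part."""
    return " ".join(part.capitalize() for part in name.split())


def title_case(name: str) -> str:
    """Suggested alternative: str.title() starts a new word after any non-letter."""
    return name.title()


name = "JEAN-PIERRE DUPONT"
print(capitalize_parts(name))  # Jean-pierre Dupont
print(title_case(name))        # Jean-Pierre Dupont
```

Note that `str.title()` also capitalizes after apostrophes and digits, which may or may not be what is wanted for every name.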

mbollmann · 2026-02-21T10:49:23Z

bin/ingest.py

+def trim_orcid(orcid: str) -> str:
+    match = re.match(r".*(\d{4}-\d{4}-\d{4}-\d{3}[\dX]).*", orcid, re.IGNORECASE)
+    if match is not None:
+        return match.group(1).upper()
+    return orcid


I would propose to move that into the library as a "converter" for the orcid field.

Sounds great
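
A rough sketch of the converter idea, using a plain dataclass so it runs without the library. The `Author` class is a hypothetical stand-in, and `__post_init__` only plays the role that a field converter would play in the real class. Unlike the `re.match` version in the diff, this uses `re.search`, which picks the first ORCID-shaped substring rather than the last.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Four groups of four characters; the final one may be the checksum letter X.
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", re.IGNORECASE)


def normalize_orcid(value: str) -> str:
    """Extract a bare ORCID iD from a URL or padded string.

    Returns the input unchanged if no ORCID-shaped substring is found.
    """
    match = ORCID_PATTERN.search(value)
    return match.group(0).upper() if match else value


@dataclass
class Author:
    """Hypothetical stand-in for a library class with an orcid field."""

    name: str
    orcid: Optional[str] = None

    def __post_init__(self) -> None:
        # Normalize once at construction, as a field converter would.
        if self.orcid is not None:
            self.orcid = normalize_orcid(self.orcid)
```

Doing the normalization where the field is set means every script that creates an author gets it for free, instead of each script calling its own trimming helper.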

bin/ingest.py

mbollmann · 2026-02-21T11:17:41Z

bin/ingest.py

+def attachment_reference_from_paths(src_path: str, dest_path: str) -> AttachmentReference:
+    if os.path.exists(dest_path):
+        return AttachmentReference.from_file(dest_path)
+    return AttachmentReference(name=os.path.basename(dest_path))


Same as above. src_path is never used in this function. But if src_path always exists, wouldn’t it be better to just always instantiate the reference from that? (making this function unnecessary)

bin/ingest.py

mbollmann · 2026-02-21T11:29:46Z

bin/ingest.py

+    sig = anthology.sigs[sig_key]
+    if volume_full_id not in sig.meetings:
+        sig.meetings.append(volume_full_id)
+    anthology.sigs.reverse[parse_id(volume_full_id)].add(sig_key)


That feels a bit arcane, and is probably only necessary because I haven’t given that part of the library much care yet (due to wanting to refactor the SIG YAML format first)... SIGs should probably just have .add_meeting() or something that takes care of everything.
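
One possible shape for such a method, sketched with stand-in classes. `Sig`, `SigIndex`, and `add_meeting()` here are hypothetical and not the library's API. For simplicity the method lives on the index and the reverse lookup is keyed by the plain volume ID string rather than a parsed ID.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, Dict, List, Set


@dataclass
class Sig:
    """Hypothetical stand-in for a SIG record."""

    key: str
    meetings: List[str] = field(default_factory=list)


class SigIndex:
    """Hypothetical stand-in for the SIG index with its reverse lookup."""

    def __init__(self) -> None:
        self.sigs: Dict[str, Sig] = {}
        # Volume ID -> keys of the SIGs that list this volume.
        self.reverse: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_meeting(self, sig_key: str, volume_full_id: str) -> None:
        """Register a volume with a SIG and keep the reverse lookup in sync."""
        sig = self.sigs[sig_key]
        if volume_full_id not in sig.meetings:
            sig.meetings.append(volume_full_id)
        self.reverse[volume_full_id].add(sig_key)
```

The ingestion script would then make a single call, and the duplicate check and reverse-index update would no longer need to be repeated at each call site.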

mbollmann · 2026-02-21T11:35:11Z

bin/ingest.py


-        if volume_full_id in volumes:
-            print("Error: ")
+def make_name_spec(person) -> NameSpecification:


I'm a bit confused here: There's already namespec_from_author? I have a feeling (but am not sure) this might be due to there being two different ingestion formats, one using make_name_spec and one using namespec_from_author, but this is very confusing.

One is for ACLPUB format, the other for aclpub2. But I agree there's a missed consolidation here, and the names are not good.

mjpost and others added 10 commits February 17, 2026 15:36

Port ingest.py to library

16fbb3a

Ingest MOL 2025

55d0aa1

Remove XML processing and apply truelist to BibTeX

639759e

Correct names, add to SIG

6ed7021

Remove --dry-run

43ee84c

Merge branch 'master' into ingest-mol

2ebfcd2

Remove unused "tag" and set_disambiguation_ids

f672aee

Add automatic SIG handling!

483d627

Reformat SIG files with library

71214f4

Merge branch 'ingest-mol' of github.com:acl-org/acl-anthology into po…

076d2b7

…rt-ingestion-scripts

mjpost added this to the 2026Q1 milestone Feb 18, 2026

mjpost requested a review from mbollmann February 18, 2026 14:43

mjpost mentioned this pull request Feb 18, 2026

Port ingest.py and ingest MOL 2025 #7533

Closed

17 tasks

Normalize venue formats via library save

35f04f7

mbollmann reviewed Feb 18, 2026

data/yaml/sigs/semitic.yaml

mbollmann reviewed Feb 18, 2026

bin/ingest.py

mjpost added 2 commits February 18, 2026 12:01

Remove normalize_plain_text

8121c4e

Add middle names

9349ecd

mjpost added 8 commits February 18, 2026 13:56

unicodify ndash in page range

064ab1f

Port ingest_aclpub2.py to library

78eb976

First attempt at merging scripts

e5a6b67

Consolidate ingest_aclpub2.py into ingest_aclpub.py

fefa232

Remove particular ingestions

bfc66c7

Revert "Normalize venue formats via library save"

24550d6

This reverts commit 35f04f7.

Revert "Reformat SIG files with library"

7d09707

This reverts commit 71214f4.

Restore master

bb71014

Restore --dry-run

13524e8

mjpost added 2 commits February 18, 2026 15:58

load_all()

a5f7ec9

Minor documentation change

eb6757f

mbollmann reviewed Feb 21, 2026

mjpost and others added 9 commits February 21, 2026 19:41

Fix log()

8ef6883

Remove superfluous call to load()

0a5b5f5

Add documentation, address comments

2c37ed9

os.path -> Path

c8f4690

Rename namespec funcs

1b85c69

Merge branch 'master' into port-ingestion-scripts

6e1b35c

Fix setup_rich_logging import, restore normalize for abstracts

4ac74a3

Merge branch 'master' into port-ingestion-scripts

0b8bfd8

Fix frontmatter path

aee13ef

mjpost mentioned this pull request Mar 4, 2026

Ingestion Request: NEJLT Volume 11 #7657

Open

2 tasks

Merge remote-tracking branch 'origin/master' into port-ingestion-scripts

57a3088


