Skip to content

Port ingestion scripts#7539

Open
mjpost wants to merge 34 commits intomasterfrom
port-ingestion-scripts
Open

Port ingestion scripts#7539
mjpost wants to merge 34 commits intomasterfrom
port-ingestion-scripts

Conversation

@mjpost
Copy link
Member

@mjpost mjpost commented Feb 18, 2026

No description provided.

@mjpost mjpost added this to the 2026Q1 milestone Feb 18, 2026
@mjpost mjpost requested a review from mbollmann February 18, 2026 14:43
@mjpost mjpost mentioned this pull request Feb 18, 2026
17 tasks
@mjpost
Copy link
Member Author

mjpost commented Feb 18, 2026

FYI, we started removing middle names from ingestion scripts some time ago, since it led to a flood of corrections. I've reversed this because it is the wrong solution (we shouldn't be second-guessing the input) and we have better name resolution now.

@mjpost
Copy link
Member Author

mjpost commented Feb 18, 2026

Okay, I've now merged the two scripts into one. The particular format is abstracted away as (a) reading metadata and (b) producing a stream over incoming papers.

@mjpost
Copy link
Member Author

mjpost commented Feb 18, 2026

I've used this for both types of ingestions. It's very, very nice!

@mjpost
Copy link
Member Author

mjpost commented Feb 18, 2026

Later we can also merge in #7483.

Copy link
Member

@mbollmann mbollmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! There’s a lot going on here, so I have a lot of comments 😅

Mostly it is proposals to move stuff into the library, or notes about how to better use the library, but I also felt there are a lot of small functions in here that feel unnecessary to me, and generally there is a lot of code which could profit from some explanations.

Comment on lines +162 to +164
def add_page_numbers(
papers: List[Dict[str, Any]], ingestion_dir: str
) -> List[Dict[str, Any]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a side note, but: If we add type annotations to a new/updated script, it would be nice if they were actually typechecked... Currently, the type checker only runs on the library code. I wonder if we could make it selectively check scripts that we mark as having type annotations.

bin/ingest.py Outdated


def correct_names(author: Dict[str, Any]) -> Dict[str, Any]:
if author.get("middle_name") is not None and author["middle_name"].lower() == "de":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this only be "de"? I copied this list from David’s ingestion script:

{"al", "bin", "bint", "da", "de", "del", "di", "dos", "du", "la", "le", "van", "von"}

Comment on lines +764 to +773
def correct_caps(name: Optional[str]) -> Optional[str]:
"""
Many people submit their names in "ALL CAPS" or "all lowercase".
Correct this with heuristics.
"""
if name is None:
return None

if name.islower() or name.isupper():
corrected = " ".join(part.capitalize() for part in name.split())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think name.title() is better as it also capitalizes after hyphens etc.; but see #7570 for a proposal to put this functionality into the library.

Comment on lines +197 to +201
def trim_orcid(orcid: str) -> str:
match = re.match(r".*(\d{4}-\d{4}-\d{4}-\d{3}[\dX]).*", orcid, re.IGNORECASE)
if match is not None:
return match.group(1).upper()
return orcid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose to move that into the library as a "converter" for the orcid field.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great

bin/ingest.py Outdated
Comment on lines +249 to +252
def attachment_reference_from_paths(src_path: str, dest_path: str) -> AttachmentReference:
if os.path.exists(dest_path):
return AttachmentReference.from_file(dest_path)
return AttachmentReference(name=os.path.basename(dest_path))
Copy link
Member

@mbollmann mbollmann Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. src_path is never used in this function. But if src_path always exists, wouldn’t it be better to just always instantiate the reference from that? (making this function unnecessary)

sig = anthology.sigs[sig_key]
if volume_full_id not in sig.meetings:
sig.meetings.append(volume_full_id)
anthology.sigs.reverse[parse_id(volume_full_id)].add(sig_key)
Copy link
Member

@mbollmann mbollmann Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That feels a bit arcane, and is probably only necessary because I haven’t given that part of the library much care yet (due to wanting to refactor the SIG YAML format first)... SIGs should probably just have .add_meeting() or something that takes care of everything.

bin/ingest.py Outdated

if volume_full_id in volumes:
print("Error: ")
def make_name_spec(person) -> NameSpecification:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused here: There's already namespec_from_author? I have a feeling (but am not sure) this might be due to there being two different ingestion formats, one using make_name_spec and one using namespec_from_author, but this is very confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One is for ACLPUB format, the other for aclpub2. But I agree there's a missed consolidation here, and the names are not good.

@mjpost mjpost mentioned this pull request Mar 4, 2026
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants