Conversation
…rt-ingestion-scripts
|
FYI, we started removing middle names from ingestion scripts some time ago, since it led to a flood of corrections. I've reversed this because it is the wrong solution (we shouldn't be second-guessing the input) and we have better name resolution now. |
|
Okay, I've now merged the two scripts into one. The particular format is abstracted away as (a) reading metadata and (b) producing a stream over incoming papers. |
|
I've used this for both types of ingestions. It's very, very nice! |
|
Later we can also merge in #7483. |
mbollmann
left a comment
There was a problem hiding this comment.
Cool! There’s a lot going on here, so I have a lot of comments 😅
Mostly it is proposals to move stuff into the library, or notes about how to better use the library, but I also felt there are a lot of small functions in here that feel unnecessary to me, and generally there is a lot of code which could profit from some explanations.
| def add_page_numbers( | ||
| papers: List[Dict[str, Any]], ingestion_dir: str | ||
| ) -> List[Dict[str, Any]]: |
There was a problem hiding this comment.
This is a side note, but: If we add type annotations to a new/updated script, it would be nice if they were actually typechecked... Currently, the type checker only runs on the library code. I wonder if we could make it selectively check scripts that we mark as having type annotations.
bin/ingest.py
Outdated
|
|
||
|
|
||
| def correct_names(author: Dict[str, Any]) -> Dict[str, Any]: | ||
| if author.get("middle_name") is not None and author["middle_name"].lower() == "de": |
There was a problem hiding this comment.
Should this only be "de"? I copied this list from David’s ingestion script:
{"al", "bin", "bint", "da", "de", "del", "di", "dos", "du", "la", "le", "van", "von"}
| def correct_caps(name: Optional[str]) -> Optional[str]: | ||
| """ | ||
| Many people submit their names in "ALL CAPS" or "all lowercase". | ||
| Correct this with heuristics. | ||
| """ | ||
| if name is None: | ||
| return None | ||
|
|
||
| if name.islower() or name.isupper(): | ||
| corrected = " ".join(part.capitalize() for part in name.split()) |
There was a problem hiding this comment.
I think name.title() is better as it also capitalizes after hyphens etc.; but see #7570 for a proposal to put this functionality into the library.
| def trim_orcid(orcid: str) -> str: | ||
| match = re.match(r".*(\d{4}-\d{4}-\d{4}-\d{3}[\dX]).*", orcid, re.IGNORECASE) | ||
| if match is not None: | ||
| return match.group(1).upper() | ||
| return orcid |
There was a problem hiding this comment.
I would propose to move that into the library as a "converter" for the orcid field.
bin/ingest.py
Outdated
| def attachment_reference_from_paths(src_path: str, dest_path: str) -> AttachmentReference: | ||
| if os.path.exists(dest_path): | ||
| return AttachmentReference.from_file(dest_path) | ||
| return AttachmentReference(name=os.path.basename(dest_path)) |
There was a problem hiding this comment.
Same as above. src_path is never used in this function. But if src_path always exists, wouldn’t it be better to just always instantiate the reference from that? (making this function unnecessary)
| sig = anthology.sigs[sig_key] | ||
| if volume_full_id not in sig.meetings: | ||
| sig.meetings.append(volume_full_id) | ||
| anthology.sigs.reverse[parse_id(volume_full_id)].add(sig_key) |
There was a problem hiding this comment.
That feels a bit arcane, and is probably only necessary because I haven’t given that part of the library much care yet (due to wanting to refactor the SIG YAML format first)... SIGs should probably just have .add_meeting() or something that takes care of everything.
bin/ingest.py
Outdated
|
|
||
| if volume_full_id in volumes: | ||
| print("Error: ") | ||
| def make_name_spec(person) -> NameSpecification: |
There was a problem hiding this comment.
I'm a bit confused here: There's already namespec_from_author? I have a feeling (but am not sure) this might be due to there being two different ingestion formats, one using make_name_spec and one using namespec_from_author, but this is very confusing.
There was a problem hiding this comment.
One is for ACLPUB format, the other for aclpub2. But I agree there's a missed consolidation here, and the names are not good.
No description provided.