There are two primary ways in which attempted charges are represented in our messy data:
- An `(att)` suffix is appended to the charge, like `430-6/5-2a1 (att)`
- A separate charge code, `720-5/8-4`, is prepended to the charge, like `720-5/8-4 720-570/402`
Pattern 1) is relatively simple to detect, and we have a dedicated label for it in the ILCS parser. Pattern 2) is harder to detect since it comes out looking like two entirely separate charges. In addition, the existence of the two different types of patterns makes it difficult to match against the canonical set, since we can only have one representation of an attempted charge in our canonical set.
There are a few ways I can think to approach this:
- Data preprocessing: Check for the code `720-5/8-4` in the messy data and replace it with the `(att)` suffix. This will work well for cases of pattern 2) where there are two separate charges, but unfortunately sometimes the attempted code is the only recorded charge, and I'm not sure how those cases will behave. It'll also be tricky because the instances of `720-5/8-4` are not all formatted the same way.
- Update the parser to catch the charge prefix: Try to get the parser to parse `720-5/8-4` as a separate label indicating an attempted charge. This is potentially the most semantically correct way to handle things but also seems difficult from a parsing perspective.
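For the preprocessing option, here is a rough sketch of what the rewrite could look like. This is not existing code: `normalize_attempt` and both regexes are hypothetical names, and I'm assuming the formatting differences in `720-5/8-4` are limited to stray whitespace around the separators, which may well be too optimistic. When the attempt code is the only recorded charge, the sketch leaves the string untouched so those cases can be reviewed separately.

```python
import re

# Assumption: the attempt code only varies by whitespace around its separators.
ATTEMPT_CODE = re.compile(r"\b720\s*-\s*5\s*/\s*8\s*-\s*4\b")
# Pattern 1) marker, so we don't append a second "(att)" to a charge.
ATT_SUFFIX = re.compile(r"\(att\)\s*$", re.IGNORECASE)


def normalize_attempt(charge: str) -> str:
    """Rewrite pattern 2) as pattern 1); leave every other charge alone."""
    match = ATTEMPT_CODE.search(charge)
    if match is None:
        return charge.strip()

    # Drop the attempt code and collapse whatever whitespace is left over.
    remainder = charge[:match.start()] + " " + charge[match.end():]
    remainder = " ".join(remainder.split())

    if not remainder:
        # The attempt code is the only recorded charge: nothing to attach
        # the suffix to, so return the original for manual review.
        return charge.strip()
    if ATT_SUFFIX.search(remainder):
        return remainder
    return f"{remainder} (att)"
```

With this, `normalize_attempt("720-5/8-4 720-570/402")` gives `720-570/402 (att)`, which matches the pattern 1) representation, while a bare `720-5/8-4` passes through unchanged.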
What do you think @fgregg?