KeyError exception caused by a Corporation label returned by probablepeople #3

@mlollo

I get a KeyError exception while dedupe is calculating the threshold. One of my records is wrong and contains a corporation name, but that shouldn't cause an exception in parseratorvariable. (I'm using the Person Name FieldType.)

This exception is raised:

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/core.py", line 76, in __call__
    filtered_pairs = self.fieldDistance(record_pairs)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/core.py", line 101, in fieldDistance
    distances = self.data_model.distances(records)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/datamodel.py", line 82, in distances
    record_2[field])
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/parseratorvariable/__init__.py", line 90, in comparator
    variable_type = self.variable_types[variable_type_1]
KeyError: 'Corporation'

Either parseratorvariable does not handle the case where probablepeople returns an unexpected label, or probablepeople does not raise an error when the label doesn't correspond to the type 'person'.

For more, see datamade/probablepeople#74.
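
To make the failure concrete, here is a minimal sketch that does not import parseratorvariable at all. The `variable_types` dict and both helper functions are hypothetical stand-ins I made up for illustration; only the unguarded `dict[label]` lookup mirrors the line shown in the traceback.

```python
# Hypothetical stand-in for ParseratorType.variable_types: it only contains
# the labels this field type knows how to compare.
variable_types = {
    "Person": {"offset": 0},
}


def unguarded_lookup(label):
    # Same pattern as the failing line in the traceback: a plain dict
    # lookup, which raises KeyError for any label that is not registered.
    return variable_types[label]


def guarded_lookup(label):
    # Returns None for unknown labels so the caller can fall back to a
    # plain string comparison instead of crashing.
    return variable_types.get(label)


try:
    unguarded_lookup("Corporation")
    raised = False
except KeyError:
    raised = True

fallback = guarded_lookup("Corporation")
```

With a label such as 'Corporation' that is missing from the dict, the unguarded lookup raises KeyError while the guarded one returns None, which is the situation a fallback has to cover.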

For anyone who wants to patch this with a workaround: I replaced the comparator method in parseratorvariable/__init__.py (line 54).
Add these lines wherever you use the dedupe library:

import dedupe
import numpy
import parseratorvariable
from probableparsing import RepeatedLabelError

def comparator(self, field_1, field_2):
    distances = numpy.zeros(self.expanded_size)
    i = 0

    if not (field_1 and field_2):
        return distances

    distances[i] = 1
    i += 1

    try:
        parsed_variable_1, variable_type_1 = self.tagger(field_1)
        parsed_variable_2, variable_type_2 = self.tagger(field_2)
    except RepeatedLabelError as e:
        if self.log_file:
            import csv
            with open(self.log_file, 'a') as f:
                writer = csv.writer(f)
                writer.writerow([e.original_string.encode('utf8')])
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances

    if 'Ambiguous' in (variable_type_1, variable_type_2):
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
    elif variable_type_1 != variable_type_2:
        distances[i:3] = [0, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
    elif variable_type_1 == variable_type_2:
        distances[i:3] = [0, 1]

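    # Workaround: the tagger returned a label (e.g. 'Corporation') that is not
    # in self.variable_types, so fall back to a plain string comparison
    # instead of raising a KeyError on the lookup below.
    if variable_type_1 not in self.variable_types:
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
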
    i += 2

    variable_type = self.variable_types[variable_type_1]

    distances[i:i + self.n_type_indicators] = variable_type['indicator']
    i += self.n_type_indicators

    i += variable_type['offset']
    for j, dist in enumerate(variable_type['compare'](parsed_variable_1,
                                                      parsed_variable_2),
                             i):
        distances[j] = dist

    unobserved_parts = numpy.isnan(distances[i:j + 1])
    distances[i:j + 1][unobserved_parts] = 0
    unobserved_parts = (~unobserved_parts).astype(int)
    distances[(i + self.n_parts):(j + 1 + self.n_parts)] = unobserved_parts

    return distances

parseratorvariable.ParseratorType.comparator = comparator

After that, you can use dedupe as usual.
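
The final assignment in the workaround is ordinary Python monkey-patching: assigning a function to a class attribute replaces the method for every instance of that class, including instances created before the assignment. A minimal sketch with a hypothetical `FakeType` class (a stand-in, not part of parseratorvariable) shows the effect:

```python
class FakeType:
    """Hypothetical stand-in for parseratorvariable.ParseratorType."""

    def comparator(self, field_1, field_2):
        return "original"


existing = FakeType()  # instance created before the patch


def patched_comparator(self, field_1, field_2):
    # Replacement method; `self` is bound as usual when it is called
    # on an instance.
    return "patched"


# Same pattern as the last line of the workaround.
FakeType.comparator = patched_comparator

result_existing = existing.comparator("a", "b")
result_new = FakeType().comparator("a", "b")
```

Both calls go through the replacement, so the assignment only needs to run once, before dedupe starts comparing records. A subclass that defines its own comparator would not be affected by patching the base class.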
