I get a KeyError exception while calculating the threshold in dedupe. One of my records is wrong and contains a corporation name, but that shouldn't cause an exception in parseratorvariable. (I'm using the Person Name field type.)
The following exception is raised:
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/core.py", line 76, in __call__
    filtered_pairs = self.fieldDistance(record_pairs)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/core.py", line 101, in fieldDistance
    distances = self.data_model.distances(records)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/datamodel.py", line 82, in distances
    record_2[field])
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/parseratorvariable/__init__.py", line 90, in comparator
    variable_type = self.variable_types[variable_type_1]
KeyError: 'Corporation'
Either parseratorvariable is not handling the case where probablepeople returns an unexpected label, or probablepeople is not raising an error when the label doesn't correspond to the type 'person'.
For more, see datamade/probablepeople#74.
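A minimal sketch of the failure mode, assuming the comparator keeps a dict of known types and indexes it with whatever label the tagger returns. The table contents and the `tagged_label` value here are hypothetical stand-ins, not the library's actual data:

```python
# Hypothetical stand-in for self.variable_types; the real keys and
# values in parseratorvariable may differ.
variable_types = {"Person": {"offset": 0}}

# Hypothetical label, standing in for what the tagger returned for
# the bad record.
tagged_label = "Corporation"

try:
    # Same shape as the failing line in the traceback: a plain dict
    # index with a label that is not a key of the table.
    variable_type = variable_types[tagged_label]
    raised = False
except KeyError as exc:
    raised = True
    missing_key = exc.args[0]

# A membership test is enough to detect the problem before indexing.
is_known = tagged_label in variable_types
```

Both records have to be tagged with the same unexpected label to reach that line, since differing labels return earlier in the comparator.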
For those who want to patch this with a workaround, what I have done is replace the comparator method in parseratorvariable/__init__ (line 54).
Add these lines wherever you use the dedupe library:
import dedupe
import numpy
import parseratorvariable
from probableparsing import RepeatedLabelError


def comparator(self, field_1, field_2):
    distances = numpy.zeros(self.expanded_size)
    i = 0

    # Missing field: return the all-zero vector.
    if not (field_1 and field_2):
        return distances

    distances[i] = 1
    i += 1

    try:
        parsed_variable_1, variable_type_1 = self.tagger(field_1)
        parsed_variable_2, variable_type_2 = self.tagger(field_2)
    except RepeatedLabelError as e:
        if self.log_file:
            import csv
            with open(self.log_file, 'a') as f:
                writer = csv.writer(f)
                writer.writerow([e.original_string.encode('utf8')])
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances

    if 'Ambiguous' in (variable_type_1, variable_type_2):
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
    elif variable_type_1 != variable_type_2:
        distances[i:3] = [0, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
    elif variable_type_1 == variable_type_2:
        distances[i:3] = [0, 1]

    # Workaround: both fields share a type this variable does not know
    # (e.g. 'Corporation'), so fall back to plain string comparison
    # instead of raising KeyError on the lookup below.
    if variable_type_1 not in self.variable_types:
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances

    i += 2

    variable_type = self.variable_types[variable_type_1]

    distances[i:i + self.n_type_indicators] = variable_type['indicator']
    i += self.n_type_indicators

    i += variable_type['offset']
    for j, dist in enumerate(variable_type['compare'](parsed_variable_1,
                                                      parsed_variable_2),
                             i):
        distances[j] = dist

    unobserved_parts = numpy.isnan(distances[i:j + 1])
    distances[i:j + 1][unobserved_parts] = 0
    unobserved_parts = (~unobserved_parts).astype(int)
    distances[(i + self.n_parts):(j + 1 + self.n_parts)] = unobserved_parts

    return distances


# Replace the library's method with the patched version.
parseratorvariable.ParseratorType.comparator = comparator
After that, you can use dedupe as usual.
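The workaround depends on rebinding a method on the class at runtime. A minimal sketch of that mechanism with a stand-in class follows; `FakeParseratorType`, its table, and the string return values are hypothetical and only illustrate the lookup behaviour, not the library's real code:

```python
class FakeParseratorType:
    """Hypothetical stand-in for parseratorvariable.ParseratorType."""

    variable_types = {"Person": "person-comparison"}

    def comparator(self, label):
        # Unguarded lookup, same shape as the line that raised KeyError.
        return self.variable_types[label]


def patched_comparator(self, label):
    # Guarded lookup: unknown labels get a generic fallback result.
    if label not in self.variable_types:
        return "string-comparison-fallback"
    return self.variable_types[label]


# Instance created before the patch is applied.
instance = FakeParseratorType()

# Rebinding the attribute on the class changes what every instance
# resolves, because methods are looked up on the class at call time.
FakeParseratorType.comparator = patched_comparator

result_known = instance.comparator("Person")
result_unknown = instance.comparator("Corporation")
```

One caveat under the same assumptions: a subclass that defines its own comparator would not pick up a patch applied to the base class, so the assignment may need to target the class actually in use.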