… continuous pipeline
…eature/trimming_count Update with main branch
is_outlier = (signal_model.w_soln < 0.1).astype(int)
inlier_pct = signal_model.inlier_pct
target_outlier_count = int(np.ceil((1.0 - inlier_pct) * data.num_obs))
If we want, I think we can directly use this
https://github.com/ihmeuw-msca/limetr/blob/0.0.8/src/limetr/__init__.py#L205
It will be signal_model.lt.num_outliers
I realize that MRBeRT might not have this. Please see below for the alternative.
inlier_indices = np.where(is_outlier_arr == 0)[0]
additional_outlier_indices = np.argsort(signal_model.w_soln[inlier_indices])[:num_to_add]
is_outlier_arr[inlier_indices[additional_outlier_indices]] = 1
is_outlier = is_outlier_arr
Here the logic is more complicated than expected. I think we can simplify it as
trimming_weights = signal_model.w_soln
sub_lt_model = signal_model.sub_models[0].lt
num_outliers = sub_lt_model.num_outliers
outlier_indices = np.argsort(trimming_weights)[:num_outliers]
is_outlier = np.zeros(sub_lt_model.N, dtype=int)
is_outlier[outlier_indices] = 1
Let me know what you think!
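The simplification above can be tried on a toy array. This is a minimal sketch, not bopforge code: `mark_lowest_weights` is a hypothetical helper, and the plain `weights` list and `num_outliers` integer are stand-ins for `signal_model.w_soln` and `sub_lt_model.num_outliers`.

```python
import numpy as np


def mark_lowest_weights(weights, num_outliers):
    """Flag exactly num_outliers observations with the smallest weights.

    Hypothetical helper for illustration; `weights` stands in for
    signal_model.w_soln and `num_outliers` for sub_lt_model.num_outliers.
    """
    weights = np.asarray(weights, dtype=float)
    # A stable sort keeps the original order among tied weights.
    outlier_indices = np.argsort(weights, kind="stable")[:num_outliers]
    is_outlier = np.zeros(weights.size, dtype=int)
    is_outlier[outlier_indices] = 1
    return is_outlier


# Toy weights: the two smallest values sit at positions 5 and 1.
weights = [1.0, 0.05, 0.9, 0.3, 1.0, 0.0]
flags = mark_lowest_weights(weights, 2)
```

Because the cut is by rank rather than by threshold, the number of flagged points always equals `num_outliers`, regardless of how many weights fall below 0.1.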
Hi Peng – this makes sense and is much cleaner! One question: I think it's technically possible (if unlikely) for us to have more than num_outliers trimmed under the old approach of
is_outlier = (signal_model.w_soln < 0.1).astype(int)
If that were to be the case, do we want to allow this 'more than num_outliers are trimmed' scenario? We could build in something like
trimming_weights = ...
sub_lt_model = ...
num_outliers = int(sub_lt_model.num_outliers)
consensus_outliers = np.where(signal_model.w_soln < 0.1)[0]
if len(consensus_outliers) < num_outliers:
    outlier_indices = np.argsort(trimming_weights)[:num_outliers]
else:
    outlier_indices = consensus_outliers
is_outlier = ...
is_outlier[outlier_indices] = 1
if we do want to do this, or leave it at the strict cap of trimming num_outliers using your approach above.
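To make the two scenarios concrete, here is a rough sketch of the "at least num_outliers" variant on toy data. It is an illustration only: `flag_outliers` is a hypothetical helper, the 0.1 threshold is taken from the comment above, and the plain arrays stand in for `signal_model.w_soln` and `sub_lt_model.num_outliers`.

```python
import numpy as np


def flag_outliers(weights, num_outliers, threshold=0.1):
    """Flag points below the threshold, topping up to num_outliers if needed.

    Hypothetical helper for illustration; `weights` stands in for
    signal_model.w_soln and `num_outliers` for sub_lt_model.num_outliers.
    """
    weights = np.asarray(weights, dtype=float)
    consensus_outliers = np.where(weights < threshold)[0]
    if len(consensus_outliers) < num_outliers:
        # Too few points under the threshold: take the lowest weights by rank.
        # These necessarily include every point already under the threshold.
        outlier_indices = np.argsort(weights, kind="stable")[:num_outliers]
    else:
        # Enough (or more than enough) points under the threshold: keep all.
        outlier_indices = consensus_outliers
    is_outlier = np.zeros(weights.size, dtype=int)
    is_outlier[outlier_indices] = 1
    return is_outlier


# One point under the threshold, so a second is added by rank.
few = flag_outliers([1.0, 0.05, 0.9, 0.3, 1.0], num_outliers=2)
# Three points under the threshold, so all three are flagged.
many = flag_outliers([0.0, 0.05, 0.02, 0.3, 1.0], num_outliers=2)
```

The second call shows the case in question: three points are trimmed even though `num_outliers` is 2, which the strict cap would not allow.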
[project]
name = "bopforge"
-version = "0.2.2"
+version = "0.2.3"
Let me know if this version bump should be something else!
zhengp0 left a comment
This looks good, thanks @n-gilbertson!
Add a trimming count to ensure we always trim outlier_pct = (1 - inlier_pct)*100% of datapoints in the continuous model. If the submodels do not agree and fewer than outlier_pct of points fall below the w_soln < 0.1 threshold, this guarantees we still trim outlier_pct of datapoints. The count uses the slightly more aggressive ceiling strategy, rounding up to the nearest integer number of points to trim, to be consistent with the number of inliers being set with floor at the limetr level.
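The floor/ceiling pairing described above can be checked with a small calculation. This is a sketch under stated assumptions: `split_counts` is a hypothetical helper, and the floor rule for inliers is taken from the description rather than from limetr's source. In exact arithmetic, floor(p * n) + ceil((1 - p) * n) = n, so the two counts partition the data.

```python
import math


def split_counts(inlier_pct, num_obs):
    """Return (num_inliers, num_outliers) using floor and ceiling rounding.

    Hypothetical helper for illustration; the floor rule for inliers is
    assumed from the PR description.
    """
    num_inliers = math.floor(inlier_pct * num_obs)
    num_outliers = math.ceil((1.0 - inlier_pct) * num_obs)
    return num_inliers, num_outliers


# 0.75 is exactly representable in binary floating point, so the
# products below are exact: 7.5 inliers and 2.5 outliers before rounding.
inliers, outliers = split_counts(0.75, 10)
```

The example uses fractions that are exact in floating point; for other values of `inlier_pct`, the product can carry rounding error before the floor or ceiling is applied.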
Note: needs eventual version bump.
If there are any suggestions for how to handle the extra outlier identification better or more safely, in a way that doesn't use indexing, happy to incorporate those!