
Ensuring Reproducibility of Deconvolution #19

Open
gordonkoehn wants to merge 20 commits into dev from gordon/test_repro

Conversation

@gordonkoehn (Contributor) commented Jul 8, 2025

This PR ensures reproducibility of lollipop.deconvolute. It was recently noticed that the abundance results and confidence intervals vary between runs; this PR investigates the exact source of that variation.

It introduces consistent ordering, limits the precision to which results are quoted, and re-establishes reproducibility of the results.

See the thread below for a full explanation.

@gordonkoehn (Contributor, Author)

Other potential issues, identified by automated ("vibe") code analysis and still to be verified:

1. PRIMARY ISSUE: RNG Seeding Incompatibility (Critical)

Location: lollipop/cli/deconvolute.py:156

Problem: The wrapper uses NumPy's modern np.random.default_rng(), but the resampling function still uses the legacy np.random.randint().

```python
# Current (broken) code:
def _deconvolute_bootstrap_wrapper(args: DeconvBootstrapsArgs):
    child_seed = args.pop("child_seed")
    np.random.default_rng(child_seed)  # ❌ creates a Generator but discards it; legacy functions are unaffected!
    return _deconvolute_bootstrap(**args)

# In confints.py:
def resample_mutations(df_city1, mutations, namefield="mutations"):
    rand_idcs = np.random.randint(0, high=int(len(mutations) / 2), ...)  # ❌ uses the legacy global RNG
```

Result: The bootstrap resampling is effectively unseeded and therefore non-deterministic across runs.
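A minimal sketch of the fix, assuming the resampling function is changed to accept an explicit Generator (the signature and names here are illustrative, not lollipop's actual API):

```python
import numpy as np

def resample_mutations(mutations, rng):
    """Resample half of the mutations using an explicit Generator.

    Passing `rng` (a np.random.Generator) replaces the module-level
    legacy np.random.randint call, so each worker's child seed
    actually controls the resampling.
    """
    n = int(len(mutations) / 2)
    rand_idcs = rng.integers(0, high=n, size=n)
    return [mutations[i] for i in rand_idcs]

# Identical seeds now yield identical resamples:
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
assert resample_mutations(list("ABCDEFGH"), rng_a) == resample_mutations(list("ABCDEFGH"), rng_b)
```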

2. SECONDARY ISSUE: Numerical Precision in Optimization (Medium)

Location: lollipop/regressors.py:48

Problem: scipy.optimize.least_squares is called without explicit tolerance parameters.

```python
# Current code lacks precision control:
ls = least_squares(
    lambda beta: (np.expand_dims(k, 1) * X).dot(beta) - (k * y),
    b0,
    bounds=(0, 1),
    loss=self.loss_type,  # no tolerance specified;
    f_scale=self.f_scale,  # falls back to scipy defaults
)
```

Result: Slight numerical differences in optimization convergence.
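A hedged sketch of how explicit tolerances could pin the stopping point; the tolerance values, the simplified residual, and the `fit` helper are illustrative, not the code in this PR:

```python
import numpy as np
from scipy.optimize import least_squares

def fit(X, y, b0):
    # Pinning xtol/ftol/gtol (and max_nfev) removes dependence on
    # scipy's version-specific defaults for the stopping criteria.
    res = least_squares(
        lambda beta: X.dot(beta) - y,
        b0,
        bounds=(0, 1),
        xtol=1e-10,
        ftol=1e-10,
        gtol=1e-10,
        max_nfev=10_000,
    )
    return res.x

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0.2, 0.3, 0.5])
b1 = fit(X, y, np.array([0.5, 0.5]))
b2 = fit(X, y, np.array([0.5, 0.5]))
assert np.array_equal(b1, b2)  # identical runs, bit-identical result
```

Note this only fixes the run-to-run behaviour on one machine; cross-machine bit-identity is a harder problem.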

3. TERTIARY ISSUE: Confidence Interval Calculations (Low)

Location: lollipop/confints.py

Problem: Matrix inversions and floating-point operations lack proper error handling.

```python
try:
    inv_info = np.linalg.inv(linprob_info)
except np.linalg.LinAlgError:
    inv_info = linprob_info * np.nan  # non-deterministic NaN handling
```
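One deterministic alternative, sketched under the assumption that a clean all-NaN matrix is the desired fallback for a singular information matrix (the helper name is made up):

```python
import numpy as np

def safe_inverse(info):
    """Deterministic fallback for a singular information matrix.

    Instead of multiplying the input by np.nan, emit a fully defined
    NaN matrix of the same shape, independent of the values (and sign
    bits) that happened to be in `info`.
    """
    try:
        return np.linalg.inv(info)
    except np.linalg.LinAlgError:
        return np.full_like(info, np.nan)

singular = np.array([[1.0, 2.0], [2.0, 4.0]])  # det = 0, inversion fails
assert np.isnan(safe_inverse(singular)).all()
```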

@gordonkoehn (Contributor, Author) commented Jul 9, 2025

Are the same seeds assigned to the same process when rerun? Are sets and dictionaries ordered differently each time? [YES]

So we need a sorted list.
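A small illustration of the fix (the location names are made up): iteration order of a set of strings depends on hash randomization (PYTHONHASHSEED) and can differ between interpreter runs, while sorting pins the order deterministically.

```python
# Set iteration order for strings can change between interpreter runs;
# sorted() makes it deterministic on every run and machine.
locations = {"Zürich (ZH)", "Genève (GE)", "Lugano (TI)"}
location_list = sorted(locations)
assert location_list == ["Genève (GE)", "Lugano (TI)", "Zürich (ZH)"]
```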

@gordonkoehn (Contributor, Author) commented Jul 9, 2025

Alright, 5c4e5df introduces a comprehensive, yet still relatively fast, reproducibility test.

It runs the deconvolution on 3 cores, for three locations, and compares the outputs at exact precision. This should catch any variation we don't want.
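The shape of such a test can be sketched roughly as follows, with a trivial seeded stand-in for the actual pipeline (`run_deconvolution` and its arguments are illustrative, not lollipop's API):

```python
import numpy as np
import pandas as pd

def run_deconvolution(seed, n_locations=3):
    # Stand-in for the real pipeline: any fully seeded computation works
    # to demonstrate the test pattern.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({"proportion": rng.random(n_locations)})

def test_deconvolution_is_reproducible():
    out1 = run_deconvolution(seed=42)
    out2 = run_deconvolution(seed=42)
    # exact precision: no tolerance, every float must match bit for bit
    pd.testing.assert_frame_equal(out1, out2, check_exact=True)

test_deconvolution_is_reproducible()
```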

@gordonkoehn (Contributor, Author) commented Jul 10, 2025

Actually, these tests still don't include bootstraps.

We should add bootstraps, and then make it the standard behaviour to round to 2-3 decimal places without bootstrapping and to significant digits with bootstrapping.
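Rounding to significant digits (as opposed to decimal places) could look like this illustrative helper; the name and default are assumptions, not the code added in this PR:

```python
import math

def round_sig(x, sig=3):
    # Illustrative helper, not lollipop's actual implementation:
    # round to `sig` significant digits rather than decimal places.
    if x == 0 or math.isnan(x):
        return x
    return round(x, sig - 1 - math.floor(math.log10(abs(x))))

assert round_sig(0.071234) == 0.0712  # 3 significant digits
assert round_sig(0.5039) == 0.504
```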

@gordonkoehn (Contributor, Author)

The full test is implemented now, and it fails, not due to numerical precision but due to something else. This test checks the very setup we run in production; see the very readable error logs.

We sadly see variation up to the second significant digit, i.e. 0.7X %.

I have already added general rounding to three significant digits in proportion, proportionLower, and proportionUpper.

For reference, WiseDashboard and CovSpectrum display results to the second decimal, i.e. XX.XX %, for consistency. So quoting to the 4th significant digit should be enough here in any case. Strictly, rigorous scientific quoting would allow only one significant digit given errors this large (see the experimental physicists' error bible, pg. 34), so this is the most practical choice here.

@gordonkoehn (Contributor, Author)

Added reproducible seed assignment, but this does not seem to be the cause. Nevertheless, it is good to keep around; it is certainly a potential avenue of error.
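The reproducible seed-assignment pattern can be sketched with NumPy's SeedSequence: for a fixed parent seed, child i always receives the same seed, so each worker's random stream is reproducible across reruns.

```python
import numpy as np

# One parent SeedSequence spawns statistically independent child seeds,
# one per worker process.
parent = np.random.SeedSequence(42)
children = parent.spawn(4)

# Re-spawning from the same parent seed reproduces the same streams:
again = np.random.SeedSequence(42).spawn(4)
draws1 = np.random.default_rng(children[2]).integers(0, 100, 5)
draws2 = np.random.default_rng(again[2]).integers(0, 100, 5)
assert np.array_equal(draws1, draws2)
```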

@gordonkoehn (Contributor, Author)

The other potential cause is the ordering of sets, which may result in processes, and the calculations within each process, running in different orders with different seeds. I have now sorted all the lists that were previously sets.

The issue persists.

@gordonkoehn (Contributor, Author)

YES!!! This is reproducible now, perhaps for the first time in Lollipop history ;) After adding in a few more sorts, we got it.

Bottom line, this issue most likely was a combination of two issues:

1. Inconsistent usage of NumPy random number generators

When I introduced the parallel processing we needed for on-time reporting, I switched the random number generation to numpy.random.SeedSequence, i.e. np.random.Generator, to give each process its own seed. I didn't dig deep enough to see that the resampling actually still used NumPy's legacy randomness, i.e. np.random.RandomState(), and I did not carefully propagate the new generator through to the resampling. Clearly my mistake; we should have caught this in code review and added better tests from the start.

2. Use of sets, adding non-determinism through unordered collections

Nevertheless, even if the seeding had been correct, it would still have failed. Above, we ensured a reproducible sequence of seeds, so that process 1 gets seed 1, seed 2 goes to process 2, and so on.

Yet we did not ensure that process 2 always processes the same data. That is because deconvolute.py uses set() for everything: location_list, variants_list, ...

Sets don't guarantee order in Python. Another thing we should have noticed.

Note that this issue has been around since inception, so while a distinct process may only have become affected since the parallelisation, intermediate computations for a single location may also have varied. Yet I am not absolutely certain here.

Further non-determinism not addressed here lies in scipy.optimize.least_squares: see diff_step, which depends on the machine the optimization is run on. I hope the difference is small enough that we round it away anyway.


The test I built, which models exactly the production settings, succeeds, so I am confident we have reproducibility, at least on the same machine.

@gordonkoehn gordonkoehn marked this pull request as ready for review July 10, 2025 15:45
@gordonkoehn gordonkoehn requested a review from dr-david July 10, 2025 15:45
@gordonkoehn gordonkoehn requested review from DrYak, Copilot and dr-david and removed request for DrYak, Copilot and dr-david July 10, 2025 15:45
@gordonkoehn gordonkoehn changed the title from "Adding Tests for Ensuring Reproducibility with Random Numbers" to "Ensuring Reproducibility of Deconvolution" Jul 10, 2025
@kirschen-k

Really nice catch here, great job digging into this!

@gordonkoehn (Contributor, Author) commented Jul 11, 2025

Another thought: since single-digit percentages may still occasionally change with different initial random seeds, yet we report to XX.XX % precision, we should consider increasing the number of bootstraps. That is certainly a separate issue, though. We currently run 100 iterations; the original paper recommends 1000. Did we do an analysis of some kind at some point? This would also warrant practical considerations in our setup: we deconvolve all the data, and the runtime may become too high if we do that.

@dr-david (Collaborator)

Amazing work, love it! I think with these tests, and this fix we should be good.

@dr-david (Collaborator)

> Another thought: since single-digit percentages may still occasionally change with different initial random seeds, yet we report to XX.XX % precision, we should consider increasing the number of bootstraps. [...] We currently run 100 iterations; the original paper recommends 1000.

I would think 1000 resamples would make the runtime a bit too prohibitive ... for not much added precision. Unless we want to further parallelise this too ^^ But I would say it's probably not worth the extra compute.

@DrYak (Member) commented Jul 28, 2025

Good catch on the order of variants (depending on how it's handled, it could affect the order of the columns in the matrix and thus have some tiny impact on the exact binary reproducibility of results).
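A tiny illustration of why column (variant) order can matter at the bit level: floating-point addition is not associative, so summing the same terms in a different order can change the result.

```python
# Same four numbers, two summation orders, different results:
vals = [1e16, 1.0, -1e16, 1.0]
a = sum(vals)                                  # the first 1.0 is absorbed by 1e16
b = sum([vals[0], vals[2], vals[1], vals[3]])  # cancellation happens first
assert a != b  # a == 1.0, b == 2.0
```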

@gordonkoehn (Contributor, Author)

@DrYak let's merge this.


Development

Successfully merging this pull request may close these issues.

Limit the Variant Abundance Estimate Outputs to Meaningful Accuracy

4 participants