Generalize data-dependent processes for simulations #1413

adrien-laposta · 2025-10-10T14:30:41Z

The purpose of this PR is to change the implementation of data-dependent filters in the current pipeline.
At the moment, only T-to-P leakage deprojection required a specific treatment outside of the processing pipeline structure.

This PR solves the issue for T-to-P leakage and for any future process that needs access to time-ordered data that cannot be stored in the preprocessing database.

The intended use is:

Before filtering a set of simulations, we first load the data AxisManager to have access to time ordered data.

data_aman = pu.multilayer_load_and_preprocess(
    obs_id,
    configs_init,
    configs_proc,
    meta=meta,
    logger=logger,
    stop_for_sims=True
)

The new argument stop_for_sims makes the function return a dict of AxisManagers, each holding the data as it is just before a step that is needed to filter simulations. As an example, for T-to-P leakage, data_aman will be

{(13, 'subtract_t2p'): AxisManager(timestamps[samps], ancil*[samps],..., bin_az_samps:IndexAxis(1351))}

This dictionary is then passed to the multilayer_load_and_preprocess_sim function, which uses data_aman for every process that sets the associated flag, use_data_aman, in its configuration.
As explained above, this solution introduces a new boolean flag in the configuration file structure, use_data_aman, to indicate which processes need access to the data. How the data is used is then specified in each process definition; I have already implemented the T2P leakage case.
Finally, I also changed the behavior of the skip_on_sims argument, which no longer has a default value (previously False), as this default caused some confusion in the past.
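A minimal, self-contained sketch of the mechanism described above, with plain dicts standing in for AxisManagers. Only the use_data_aman flag and the (step, name) keys come from this PR; `Step`, `snapshot_before_flagged`, `run_sim` and the two toy filters are hypothetical names and are not the sotodlib API. The toy T2P step mirrors the general idea of taking the temperature signal from the data while fitting and subtracting on the simulated Q.

```python
import copy
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    func: Callable  # called as func(aman, data_aman); modifies aman in place
    use_data_aman: bool = False  # flag name taken from the PR


def subtract_mean(aman, data_aman=None):
    # Data-independent toy filter: remove the mean of Q.
    mean_q = sum(aman["Q"]) / len(aman["Q"])
    aman["Q"] = [q - mean_q for q in aman["Q"]]


def subtract_t2p(aman, data_aman=None):
    # Data-dependent toy filter: take T from the data snapshot when given,
    # fit Q = lam * T by least squares, and subtract lam * T from Q.
    t_sig = aman["T"] if data_aman is None else data_aman["T"]
    lam = sum(q * t for q, t in zip(aman["Q"], t_sig)) / sum(t * t for t in t_sig)
    aman["Q"] = [q - lam * t for q, t in zip(aman["Q"], t_sig)]


def snapshot_before_flagged(aman, steps):
    # Run the pipeline on data, saving a copy just before each flagged step.
    snapshots = {}
    for i, step in enumerate(steps):
        if step.use_data_aman:
            snapshots[(i, step.name)] = copy.deepcopy(aman)
        step.func(aman, None)
    return snapshots


def run_sim(sim_aman, steps, data_amans):
    # Run the same pipeline on a simulation, handing flagged steps their snapshot.
    for i, step in enumerate(steps):
        data_aman = data_amans[(i, step.name)] if step.use_data_aman else None
        step.func(sim_aman, data_aman)
    return sim_aman


steps = [
    Step("subtract_mean", subtract_mean),
    Step("subtract_t2p", subtract_t2p, use_data_aman=True),
]
data = {"T": [1.0, 2.0, 3.0, 4.0], "Q": [0.1, 0.2, 0.3, 0.4]}
data_amans = snapshot_before_flagged(data, steps)

sim = {"T": [0.0] * 4, "Q": [2.0, 4.0, 6.0, 8.0]}
filtered = run_sim(sim, steps, data_amans)
```

The snapshot is computed once and can be reused for every simulation seed, which is the point of separating the two calls.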

kwolz · 2025-10-27T10:19:06Z

Working example on a single signal atomic with v3 preprocessing and fp_thin=8 as usual for sims. This compares the old solution (T2P template rerunning until end of preprocessing_init, before applying azss subtraction) with the new solution. Shown here are Q_old, Q_new, U_old, U_new, Q_diff, U_diff. This difference is expected to come from building T2P from data with azss subtracted (new, correct) vs without. It is at the percent level.

kwolz

Looks very good overall. Maybe @msilvafe and @mmccrackan have more comments.

kwolz · 2025-10-27T10:49:13Z

sotodlib/preprocess/preprocess_util.py

+                out_amans = {}
+                loc_aman = aman.copy()
+                for (step, name), pipe in pipes.items():
+                    pipe.run(loc_aman, aman.preprocess, select=False)


Here you ensure that no pipeline step is run twice, avoiding computational overhead. Nice!
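The idea behind this hunk can be sketched independently of sotodlib: split the step list into consecutive segments, each ending just before a flagged step, and run the segments in turn on one running copy so that every step executes at most once. `split_at_flagged` and the step names below are hypothetical and only illustrate the bookkeeping.

```python
def split_at_flagged(steps):
    """Map (index, name) of each flagged step to the step indices to run
    since the previous stop, up to but excluding the flagged step itself.

    `steps` is a list of (name, flagged) tuples.
    """
    segments = {}
    start = 0
    for i, (name, flagged) in enumerate(steps):
        if flagged:
            segments[(i, name)] = list(range(start, i))
            start = i  # the flagged step itself opens the next segment
    return segments


# Hypothetical pipeline with two data-dependent steps.
steps = [
    ("detrend", False),
    ("azss", False),
    ("subtract_t2p", True),
    ("fourier_filter", False),
    ("other_data_dependent", True),
]
segments = split_at_flagged(steps)

# Running the segments in insertion order touches each step index once.
executed = [i for segment in segments.values() for i in segment]
```

Like the loop over `pipes.items()` in the hunk, this relies on dicts preserving insertion order.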

msilvafe · 2025-11-04T20:31:52Z

What is the actual amplitude of the residuals here? The colorbar is 1%, but it seems like it's not rescaled (i.e. the residuals are a fraction of the full cbar range). And where does the difference come from?

kwolz · 2025-11-05T13:58:02Z

The difference comes from the fact that before we didn't apply azss subtraction on the AxisManager we then used for T-to-P template subtraction, now we do. This is needed because we apply azss and t2p in this order on simulations. The relative difference in map amplitude is 1% for Q and ~0.4% for U in this particular case. @adrien-laposta has done transfer function tests and found no difference.

msilvafe · 2025-11-05T14:46:56Z

Ok, but a fair comparison is probably an atomic map on exactly the same preprocessing with and without these commits. Can you make such a comparison? I'm finishing the rest of the review now.

kwolz · 2025-11-05T15:24:51Z

I could do that by inserting the azss step "by hand" on the data_aman after T-to-P. Probably the cleanest way would be to use a custom yaml_init_t2p that does that on the template only. I can do that.

msilvafe

Some inline questions/comments.

msilvafe · 2025-11-05T15:43:30Z

sotodlib/preprocess/pcore.py

            has ingested flags and other information into ``proc_aman``.
+        data_amans: dict (Optional)
+            A dictionary of AxisManagers with keys (step, process.name)
+            filled with AxisManager processed up to step-1. This is used


Unclear what "step-1" means here.

Sorry! Here step is the index of the process in a given config file. step-1 is then the process preceding it. (i.e. the step before getting the T2P template for example)

msilvafe · 2025-11-05T15:47:06Z

sotodlib/preprocess/preprocess_util.py

+    stop_for_sims: bool
+        Optional. If True, will stop before each step of the pipeline
+        with the flag `use_data_aman` set to True. The intended use is
+        to prepare all necessary data products that cannot be stored in
+        the preprocessing database, to process simulations.


The data aman holds a full copy of the data for every preprocess step that specifies use_data_aman. This could balloon really quickly if the config file is improperly configured. Even 2 extra copies of the data get pretty big. Perhaps there should be a check: if more than 2 or 3 steps request the data_aman, warn the user and force them to acknowledge that they're about to launch a job with ~2-3x the normal memory usage?

Yes, I only tested it for a couple of stops in the pipe. I think this is related to your other comments about only keeping what's necessary in the AxisManagers, I will propose something more memory efficient
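A check of the kind suggested here could look roughly like the sketch below. The function name `count_and_warn` and the threshold `MAX_SNAPSHOTS` are hypothetical; it only uses the standard-library `warnings` module and does not reproduce whatever the PR ends up implementing.

```python
import warnings

MAX_SNAPSHOTS = 2  # hypothetical threshold, not a value taken from the PR


def count_and_warn(flags, max_snapshots=MAX_SNAPSHOTS):
    """Count steps with use_data_aman set and warn if too many copies are kept.

    `flags` is an iterable with one boolean per pipeline step.
    """
    n_flagged = sum(bool(flag) for flag in flags)
    if n_flagged > max_snapshots:
        warnings.warn(
            f"{n_flagged} steps set use_data_aman; this keeps {n_flagged} "
            "extra full copies of the data in memory.",
            UserWarning,
        )
    return n_flagged


# Example: three flagged steps out of four exceed the threshold of two.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    n_flagged = count_and_warn([True, True, True, False])
```

A warning alone does not force an acknowledgement; raising an error unless the user passes an explicit opt-in would be the stricter variant.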

msilvafe · 2025-11-05T16:17:22Z

sotodlib/preprocess/preprocess_util.py

+            if stop_for_sims:
+                batch_idx = [
+                    (step, process.name)
+                    for step, process in enumerate(pipe_proc)
+                    if process.use_data_aman


Why does this only include pipe_proc and ignore pipe_init?

This can be generalized to pipe_init. This adaptation was motivated by the way we were dealing with T2P (in proc)

msilvafe · 2025-11-05T16:33:39Z

sotodlib/preprocess/preprocess_util.py

+                for (step, name), pipe in pipes.items():
+                    pipe.run(loc_aman, aman.preprocess, select=False)
+                    out_amans[step, name] = loc_aman.copy()
+                return out_amans


Could this just be a dictionary of numpy arrays or restricted AxisManagers that only include the fields you need to reduce memory overhead?

I was thinking of doing this but fields to keep will depend on the filter and it will need to be defined somewhere. I find it a bit too verbose to be defined in the config files, I'm happy to take suggestions on this one. Also, this implementation follows the existing behaviour, i.e. loading a full data AxisManager in the filtering script and looping over all seeds to get filtered atomics.
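One option that avoids extra verbosity in the config files, sketched here purely as an assumption, is for each data-dependent process to declare the fields it needs as a class attribute, and to build the snapshot from those fields only. `required_fields`, `SubtractT2P` and `restrict` are hypothetical names, plain dicts stand in for AxisManagers, and the choice of `dsT` as the only required field is an illustration based on the removed code, not a statement about what T2P actually needs.

```python
class SubtractT2P:
    """Toy stand-in for a data-dependent process."""

    name = "subtract_t2p"
    # Hypothetical attribute: fields of the data this process reads on sims.
    required_fields = ("dsT",)


def restrict(aman, fields):
    """Return a copy of `aman` containing only `fields`."""
    missing = [field for field in fields if field not in aman]
    if missing:
        raise KeyError(f"fields missing from data: {missing}")
    return {field: list(aman[field]) for field in fields}


# Toy data with more fields than the process needs.
aman = {
    "dsT": [1.0, 2.0, 3.0],
    "demodQ": [0.1, 0.2, 0.3],
    "demodU": [0.3, 0.2, 0.1],
    "signal": [5.0, 6.0, 7.0],
}
snapshot = restrict(aman, SubtractT2P.required_fields)
```

With this layout the memory cost of a snapshot scales with the declared fields rather than with the full AxisManager.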

msilvafe · 2025-11-05T16:35:42Z

sotodlib/preprocess/preprocess_util.py

-
-            if t2ptemplate_aman is not None:
-                # Replace Q,U with simulated timestreams
-                t2ptemplate_aman.wrap("demodQ", aman.demodQ, [(0, 'dets'), (1, 'samps')], overwrite=True)
-                t2ptemplate_aman.wrap("demodU", aman.demodU, [(0, 'dets'), (1, 'samps')], overwrite=True)
-
-                t2p_aman = t2pleakage.get_t2p_coeffs(
-                    t2ptemplate_aman,
-                    merge_stats=False
-                )
-                t2pleakage.subtract_t2p(
-                    aman,
-                    t2p_aman,
-                    T_signal=t2ptemplate_aman.dsT
-                )


Is this now called in your run script? Do you have examples of such scripts in another repo which I can look at?

Before, I was forced to hardcode it there for T2P leakage. The more general implementation in this PR allows me to write this piece of code in processes.py (see here), which means that the data-dependent part has to be implemented in each corresponding process.

kwolz · 2025-11-08T14:08:20Z

Here is a map-level diff of the three methods mentioned: ISOv3 (ignoring AZsub completely in the T2P template), ALP (this PR), KW (ISOv3 w/ ad-hoc AZsub before T2P estimation). The diff ALP-KW is another ~2 orders smaller than the original bias reported above. Not sure where that comes from. See also this repo folder where you can rerun all of this.

adrien-laposta added 3 commits October 10, 2025 05:33

compute and use data from different pipeline stages in sim filtering

0175b8e

make skip_on_sim a required field

d305941

move skip_on_sim warning to run()

4b9ac4a

adrien-laposta requested review from mmccrackan and msilvafe October 10, 2025 14:30

adrien-laposta marked this pull request as draft October 10, 2025 14:48

kwolz approved these changes Oct 27, 2025

msilvafe reviewed Nov 5, 2025

adrien-laposta added 2 commits November 27, 2025 04:51

warning when saving multiple instances of axisman

61e1e2a

extend to init pipe

5fca3c8
