-
Notifications
You must be signed in to change notification settings - Fork 22
Generalize data-dependent processes for simulations #1413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
kwolz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good overall. Maybe @msilvafe and @mmccrackan have more comments.
| out_amans = {} | ||
| loc_aman = aman.copy() | ||
| for (step, name), pipe in pipes.items(): | ||
| pipe.run(loc_aman, aman.preprocess, select=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you ensure that no pipeline step is run twice, avoiding computational overhead. Nice!
|
The difference comes from the fact that before we didn't apply azss subtraction on the AxisManager we then used for T-to-P template subtraction, now we do. This is needed because we apply azss and t2p in this order on simulations. The relative difference in map amplitude is 1% for Q and ~0.4% for U in this particular case. @adrien-laposta has done transfer function tests and found no difference. |
Ok, but a fair comparison is probably an atomic map on exactly the same preprocessing with and without these commits. Can you make such a comparison. I'm finishing the rest of the review now. |
|
I could do that by inserting the azss step "by hand" on the data_aman after T-to-P. Probably the cleanest way would be to use a custom yaml_init_t2p that does that on the template only. Can do that |
msilvafe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some inline questions/comments.
| has injested flags and other information into ``proc_aman``. | ||
| data_amans: dict (Optional) | ||
| A dictionary of AxisManagers with keys (step, process.name) | ||
| filled with AxisManager processed up to step-1. This is used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unclear what "step-1" means here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry! Here step is the index of the process in a given config file. step-1 is then the process preceding it. (i.e. the step before getting the T2P template for example)
| stop_for_sims: bool | ||
| Optinal. If True, will stop before each step of the pipeline | ||
| with the flag `use_data_aman` set to True. The intended use is | ||
| to prepare all necessary data products that cannot be stored in | ||
| the preprocessing database, to process simulations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data aman holds in the axismanager a full copy of the data for every preprocess step which has been specified with use_data_aman. This seems like it could balloon really quickly if the config file is improperly configured. Even 2 extra copies of the data gets pretty big. Perhaps there should be a check for if there's more than 2 or 3 steps which specify the data_aman that you warn the user and force them to acknowledge they're about to launch a job with ~2-3x in the normal memory usage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I only tested it for a couple of stops in the pipe. I think this is related to your other comments about only keeping what's necessary in the AxisManagers, I will propose something more memory efficient
| if stop_for_sims: | ||
| batch_idx = [ | ||
| (step, process.name) | ||
| for step, process in enumerate(pipe_proc) | ||
| if process.use_data_aman |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this only include pipe_proc and ignore pipe_init?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be generalized to pipe_init. This adaptation was motivated by the way we were dealing with T2P (in proc)
| for (step, name), pipe in pipes.items(): | ||
| pipe.run(loc_aman, aman.preprocess, select=False) | ||
| out_amans[step, name] = loc_aman.copy() | ||
| return out_amans |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this just be a dictionary of numpy arrays or restricted AxisManagers that only include the fields you need to reduce memory overhead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of doing this but fields to keep will depend on the filter and it will need to be defined somewhere. I find it a bit too verbose to be defined in the config files, I'm happy to take suggestions on this one. Also, this implementation follows the existing behaviour, i.e. loading a full data AxisManager in the filtering script and looping over all seeds to get filtered atomics.
|
|
||
| if t2ptemplate_aman is not None: | ||
| # Replace Q,U with simulated timestreams | ||
| t2ptemplate_aman.wrap("demodQ", aman.demodQ, [(0, 'dets'), (1, 'samps')], overwrite=True) | ||
| t2ptemplate_aman.wrap("demodU", aman.demodU, [(0, 'dets'), (1, 'samps')], overwrite=True) | ||
|
|
||
| t2p_aman = t2pleakage.get_t2p_coeffs( | ||
| t2ptemplate_aman, | ||
| merge_stats=False | ||
| ) | ||
| t2pleakage.subtract_t2p( | ||
| aman, | ||
| t2p_aman, | ||
| T_signal=t2ptemplate_aman.dsT | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this now called in your run script? Do you have examples of such scripts in another repo which I can look at?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before, I was forced to hardcode it there for T2P leakage. The more general implementation in this PR allows me to write this piece of code in processes.py (see here). Which means that for each data-dependent process, we'll need to implement it in the corresponding process.
Here is a map-level diff of the three methods mentioned: ISOv3 (ignoring AZsub completely in the T2P template), ALP (this PR), KW (ISOv3 w/ ad-hoc AZsub before T2P estimation). The diff ALP-KW is another ~2 orders smaller than the original bias reported above. Not sure where that comes from. See also this repo folder where you can rerun all of this. |








The purpose of this PR is to change the implementation of data-dependent filters in the current pipeline.
At the moment, only T-to-P leakage deprojection required a specific treatment outside of the processing pipeline structure.
This PR solves the issue for T-to-P leakage and for any other future processes for which we would need to have access to time ordered data that cannot be stored in the preprocessing database.
The intended use is:
AxisManagerto have access to time ordered data.The new argument
stop_for_simsallows to return adictofAxisManagerwhich saves the data before every steps needed to filter simulations. As an example for T-to-P leakage,data_amanwill be{(13, 'subtract_t2p'): AxisManager(timestamps[samps], ancil*[samps],..., bin_az_samps:IndexAxis(1351))}multilayer_load_and_preprocess_simfunction, which will then usedata_amanfor processes having the associated flag in the process configuration:use_data_aman.use_data_amanto indicate which processes require to access the data. The way of using data is then specified in individual processes definition, where I already implemented the T2P leakage case.skip_on_simsargument which no longer has a default value (False) as it was the cause of some confusion in the past.