|
| 1 | +************************** |
| 2 | +Dealing with Burn Failures |
| 3 | +************************** |
| 4 | + |
| 5 | +Sometimes the ODE integration of a reaction network will fail. Here |
| 6 | +we summarize some wisdom on how to avoid integrator failures. |
| 7 | + |
| 8 | +Error codes |
| 9 | +=========== |
| 10 | + |
| 11 | +The error code that is output when the integrator fails can be |
| 12 | +interpreted from the ``enum`` in ``integrator_data.H``. See |
| 13 | +:ref:`sec:error_codes` for a list of the possible codes. |
| 14 | + |
| 15 | +The most common errors are ``-2`` (timestep underflow) and ``-4`` (too |
| 16 | +many steps required). A timestep underflow usually means that |
| 17 | +something has gone really wrong in the integration and the integrator |
| 18 | +keeps trying to cut the timestep to be able to advance. Too many |
| 19 | +steps means that the integrator hit the cap imposed by |
| 20 | +``integrator.ode_max_steps``. This should never be made too |
| 21 | +large---usually if you need more than a few 1000 steps, then the some |
| 22 | +of the solutions discussed below can help. |
| 23 | + |
| 24 | + |
| 25 | + |
| 26 | +Why does the integrator struggle? |
| 27 | +================================= |
| 28 | + |
| 29 | +There are a few common reasons why the integrator might encounter trouble: |
| 30 | + |
| 31 | +* The integrators don't know that the mass fractions should stay in |
| 32 | + $[0, 1]$. |
| 33 | + |
| 34 | + Note: they should ensure that the mass fractions sum to 1 as long as |
| 35 | + $\sum_i \dot{\omega}_k = 0$ in the righthand side function. |
| 36 | + |
| 37 | + This is a place where adjusting ``integrator.atol_spec`` can help. |
| 38 | + |
| 39 | +* The state wants to enter nuclear statistical equilibrium. As we |
| 40 | + approach equilibrium, the integrator will see large, oppositely |
| 41 | + signed flows from one step to the next, which should cancel, but |
| 42 | + numerically, the cancellation is not perfect. |
| 43 | + |
| 44 | + For some networks, the solution here is to use the NSE solver. |
| 45 | + |
| 46 | +* The Jacobian is not good enough to allow the nonlinear solver to |
| 47 | + converge. |
| 48 | + |
| 49 | + The Jacobians used in our networks are approximate (even the |
| 50 | + analytic one). For example, the analytic Jacobian neglects the |
| 51 | + composition dependence in the screening functions. In the analytic we |
| 52 | + neglect the composition influence in screening. |
| 53 | + |
| 54 | + |
| 55 | +Making the integration robust |
| 56 | +============================= |
| 57 | + |
| 58 | +.. index:: integrator.atol_spec, integrator.species_failure_tolerance, integrator.use_jacobian_caching, integrator.do_corrector_validation, integrator.use_burn_retry, integrator.X_reject_buffer |
| 59 | + |
| 60 | +Some tips for helping the integrator: |
| 61 | + |
| 62 | +* Use a tight absolute tolerance for the species |
| 63 | + (``integrator.atol_spec``). This ensures that even small species |
| 64 | + are tracked well, and helps prevent generating negative mass |
| 65 | + fractions. |
| 66 | + |
| 67 | + In general, ``atol_spec`` should be picked to be the magnitude of |
| 68 | + the smallest mass fraction you care about. You should never set |
| 69 | + it to be larger that ``rtol_spec``. See :ref:`sec:tolerances` |
| 70 | + for some discussion. |
| 71 | + |
| 72 | +* Adjust the step-rejection logic in the integrator. This is a |
| 73 | + check on the state after the step that rejects the advance |
| 74 | + if we find unphysical species. Not every integrator supports |
| 75 | + all of these options. |
| 76 | + |
| 77 | + * Reduce ``integrator.species_failure_tolerance``. This is used to |
| 78 | + determine whether a step should be rejected. If any of the mass |
| 79 | + fractions are more than ``integrator.species_failure_tolerance`` |
| 80 | + outside of $[0, 1]$, then the integrator will reject the step and |
| 81 | + try again (this is implemented in ``VODE`` and ``BackwardEuler``). |
| 82 | + |
| 83 | + If the value is too large (like ``0.01``), then this could allow the |
| 84 | + mass fractions to grow too much outside of $[0, 1]$, and then a |
| 85 | + single step rejection is not enough to recover. Setting this value to |
| 86 | + be close to the typical scale of a mass fraction that we care about |
| 87 | + (closer to ``integrator.atol_spec``) can help. |
| 88 | + |
| 89 | + * Increase ``integrator.X_reject_buffer``. This is used in the |
| 90 | + check on how much the species changed from one step to the next |
| 91 | + (inside of the integrator). We limit the change to a factor of 4 |
| 92 | + per step. We only check the species that are larger than |
| 93 | + ``X_reject_buffer * atol_spec``, so if trace species are causing |
| 94 | + problems, ``X_reject_buffer`` can be used to ignore them. |
| 95 | + |
| 96 | +* Try the numerical Jacobian (``integrator.jacobian=2``). The |
| 97 | + analytic Jacobian is faster to evaluate, but it approximates some |
| 98 | + terms. The numerical Jacobian sometimes works better. |
| 99 | + |
| 100 | +* Use the burn retry logic. By setting |
| 101 | + ``integrator.use_burn_retry=1``, the burn will immediately be |
| 102 | + retried if it fails, with a slightly different configuration. |
| 103 | + |
| 104 | + By default the type of Jacobian is swapped (if we were analytic, |
| 105 | + then do numerical, and vice vera). This is controlled by |
| 106 | + ``integrator.retry_swap_jacobian``. The tolerances can also be |
| 107 | + changed on retry. |
| 108 | + |
| 109 | + See :ref:`sec:retry` for the full list of options. |
| 110 | + |
| 111 | +* Use the right integrator. In general, VODE is the best choice. |
| 112 | + But if the network is only mildly stiff, then RKC can work well |
| 113 | + (typically, it works when the temperatures are below $10^9~\mathrm{K}$. |
| 114 | + |
| 115 | +* If you are near NSE, then use the NSE solver. This is described |
| 116 | + in :ref:`self_consistent_nse`. |
| 117 | + |
| 118 | + Note: not every network is compatible with the self-consistent |
| 119 | + NSE solver. |
| 120 | + |
| 121 | +* Use Jacobian-caching. If you build on GPUs, this is disabled by |
| 122 | + default. You can re-enable it by building with |
| 123 | + ``USE_JACOBIAN_CACHING=TRUE``. Also make sure that |
| 124 | + ``integrator.use_jacobian_caching=1`` (this is the default). |
| 125 | + |
| 126 | + By reducing the number of times the Jacobian is evaluated, we also |
| 127 | + reduce the possibility of trying to evaluate it with a bad state. |
| 128 | + |
| 129 | +* Use the corrector validation (``integrator.do_corrector_validation``). |
| 130 | + |
| 131 | + This checks to make sure the state is valid inside of the corrector |
| 132 | + loop, and if not, it bails out of the corrector, forcing the |
| 133 | + integrator to retry the entire step. |
| 134 | + |
| 135 | +Things we no longer recommend: |
| 136 | + |
| 137 | +.. index:: integrator.do_species_clip, integrator.renormalize_abundances |
| 138 | + |
| 139 | +* Clipping the species (``integrator.do_species_clip``) can lead to |
| 140 | + instabilities. This changes the integration state directly in the |
| 141 | + righthand side function, outside of the control of the integrator. |
| 142 | + While it sometimes may work, it often leads to problems. A better |
| 143 | + way to deal with keeping the species in $[0, 1]$ is through the |
| 144 | + absolute tolerance. |
| 145 | + |
| 146 | + The `SUNDIALS CVODE documentation <https://sundials.readthedocs.io/en/latest/cvode/Usage/index.html#advice-on-controlling-unphysical-negative-values>`_ has |
| 147 | + a good discussion on this numerical instability. |
| 148 | + |
| 149 | +* Renormalizing the species during the integration / righthand side call |
| 150 | + (``integrator.renormalize_abundances``). Like clipping, this can |
| 151 | + cause numerical instabilities. |
| 152 | + |
| 153 | +Debugging a burn failure |
| 154 | +======================== |
| 155 | + |
| 156 | +When a burn fails, the entire ``burn_t`` state will be output to |
| 157 | +stdout (CPU runs only). This state can then be used with |
| 158 | +``burn_cell_sdc`` to reproduce the burn failure outside of a |
| 159 | +simulation. For SDC, see the discussion at :ref:`sec:redo_burn_fail`. |
0 commit comments