Snesim#72

Open
mgravey wants to merge 19 commits into master from snesim

@mgravey mgravey commented Mar 9, 2026

No description provided.

Introduce a configurable g2s_path_index_t type and switch path/index-related variables to it.

- Add include/pathIndexType.hpp defining g2s_path_index_t and defaulting to unsigned, allowing override via G2S_PATH_INDEX_TYPE.
- Update build/Makefile to add G2S_PATH_INDEX_TYPE option and pass -DG2S_PATH_INDEX_TYPE to CFLAGS/CXXFLAGS.
- Update headers and sources (simulation.hpp, simulationAugmentedDim.hpp, src/ds-l.cpp, src/dsk.cpp, src/qs.cpp) to include pathIndexType.hpp and replace unsigned path/index variables, loops, allocations and related calculations with g2s_path_index_t. Adjust displayRatio and posterioryPath allocations accordingly.
- Add change_odc.md to .gitignore and fix trailing newlines in a couple of files.

This change makes the path/index integer width configurable (e.g. 32 vs 64-bit), improving flexibility for large datasets or memory-optimized builds.
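
To illustrate why the index width matters, here is a minimal sketch that checks whether every flat cell index of a grid fits in an unsigned integer of a given width. The function name `fits_in_index_type` is hypothetical and not part of the PR, and the sketch assumes the default `unsigned` is 32 bits wide, which is common but platform-dependent.

```python
def fits_in_index_type(dims, bits):
    """Return True if every flat cell index of a grid with the given
    dimensions fits in an unsigned integer that is `bits` wide."""
    cell_count = 1
    for d in dims:
        cell_count *= d
    # The largest flat index is cell_count - 1.
    return cell_count - 1 <= (1 << bits) - 1


big_grid = (2000, 2000, 2000)  # 8e9 cells
print(fits_in_index_type(big_grid, 32))  # False
print(fits_in_index_type(big_grid, 64))  # True
```

A grid of this size would presumably need a 64-bit `G2S_PATH_INDEX_TYPE`, while smaller grids can keep a narrower type to save memory.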
Introduce distributed job utilities and make interfaces accept matrix/array job grids. Added build/qs_prepare.sh and build/qs_decentralized.py, registered them in build/algosName.config, and updated c++/intel Makefiles to create/install/clean these distributed-tools (symlink, chmod, and copy to libexec). Updated .gitignore to avoid tracking the symlink. Extended MATLAB and Python3 interfaces to normalize -job_grid/-job_grid_json/-jg inputs: convert numeric arrays and nested sequences into compact JSON job grids (with checks for finite integer identifiers) before running communication. These changes enable decentralized job preparation/launching and allow passing job grids from MATLAB/Python as native arrays.
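
The job-grid normalization described above can be sketched as follows. This is an illustration only: `encode_job_grid` is a hypothetical name, and the exact rules the MATLAB and Python3 interfaces apply may differ. The sketch accepts nested sequences, converts integer-like floats to integers, rejects non-finite or fractional values, and emits compact JSON.

```python
import json
import math


def encode_job_grid(grid):
    """Convert a nested sequence of job identifiers into compact JSON."""

    def normalize(node):
        if isinstance(node, (list, tuple)):
            return [normalize(child) for child in node]
        if isinstance(node, bool):
            return int(node)  # logical values become 0/1
        if isinstance(node, int):
            return node
        if isinstance(node, float):
            if not math.isfinite(node) or not node.is_integer():
                raise ValueError(f"job id must be a finite integer, got {node!r}")
            return int(node)
        raise TypeError(f"unsupported job grid entry: {node!r}")

    # separators=(",", ":") removes the spaces json.dumps adds by default.
    return json.dumps(normalize(grid), separators=(",", ":"))


print(encode_job_grid([[0, 1.0], (2, 3)]))  # [[0,1],[2,3]]
```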
Add an optional g2s_path_index_t* inputPosterioryPath parameter to simulation, simulationFull and simulationAD so callers can provide a preallocated posterior-path buffer. If no buffer is provided, the functions allocate, initialize (mark valid non-NaN entries) and populate the local posterioryPath, and only free it when it was locally allocated. This reduces redundant allocations and avoids freeing externally-owned buffers.
Introduce optional distributed-memory support and domain-padding utilities.

- Makefile: add G2S_QS_DISTRIBUTED build flag, QS_DISTRIBUTED_LIBS, and link libs when enabled; propagate QS_DISTRIBUTED_LIBS into c++ and intel build rules.
- New headers: include/qsDistributedUtils.hpp (parses -jg/-job_grid payloads, resolves unique job position) and include/qsPaddingUtils.hpp (helpers to compute padded dims, pad/crop DataImage, and remap simulation paths).
- src/qs.cpp: parse distributed CLI args, validate/resolve job position when built with G2S_QS_DISTRIBUTED (and error out when payload provided but build lacks support), compute spatial halo from kernels and optionally pad input images, remap simulation paths to padded domain, generate a posterior-path lookup and pass it to simulation routines, and update autosave/final save to crop padded domain before writing.

These changes enable distributed-job handling and safe halo-padding of the simulation domain while preserving backwards compatibility when distributed support is not compiled in.
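
As a rough illustration of the path remapping, the sketch below maps a flat index in the original domain to the matching flat index in a padded domain. It assumes a symmetric halo on each axis and a row-major layout with the last axis varying fastest. Both are assumptions made for this example, and the helper names are hypothetical rather than those in include/qsPaddingUtils.hpp.

```python
def padded_dims(dims, halo):
    """Dimensions after adding `halo[i]` cells on both sides of axis i."""
    return tuple(d + 2 * h for d, h in zip(dims, halo))


def remap_to_padded(flat_index, dims, halo):
    """Map a row-major flat index of the original domain to the flat
    index of the same cell inside the padded domain."""
    # Decompose the flat index into per-axis coordinates (last axis fastest).
    coords = []
    for d in reversed(dims):
        coords.append(flat_index % d)
        flat_index //= d
    coords.reverse()
    # Shift each coordinate by the halo and recompose with padded sizes.
    padded_index = 0
    for c, h, pd in zip(coords, halo, padded_dims(dims, halo)):
        padded_index = padded_index * pd + (c + h)
    return padded_index


print(padded_dims((3, 4), (1, 1)))          # (5, 6)
print(remap_to_padded(0, (3, 4), (1, 1)))   # 7
print(remap_to_padded(11, (3, 4), (1, 1)))  # 22
```

Cropping before a save is the inverse operation: only cells whose padded coordinates fall inside `[h, h + d)` on every axis are written out.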
Add full support for distributed job grids and optional di-grid metadata: parse -di_grid_json, store gridDims, local coordinates, flattened job IDs and DI names. Introduce robust grid flattening and value decoding helpers, and new APIs to map coordinates ↔ row-major indices. Wire distributed metadata into resolveDistributedJobPosition and use baseSeed + row-major position for per-job seeds. Implement posterior path stride/offset for distributed runs and add logic to load neighbor DIs and merge neighbor halos into padded domains (with numerous safety checks and logging). Also add useExternalPosteriorPath/currentPathOrder to threaded simulation routines to correctly compare posterior ordering across distributed setups and remove an unused zmq include.
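
The coordinate ↔ row-major mapping and the per-job seed rule (baseSeed + row-major position) can be sketched as below. The function names are hypothetical, and the sketch assumes the last grid axis varies fastest.

```python
def coords_to_index(coords, grid_dims):
    """Row-major flat position of `coords` in a grid of shape `grid_dims`."""
    index = 0
    for c, d in zip(coords, grid_dims):
        if not 0 <= c < d:
            raise IndexError(f"coordinate {c} outside [0, {d})")
        index = index * d + c
    return index


def index_to_coords(index, grid_dims):
    """Inverse of coords_to_index."""
    coords = []
    for d in reversed(grid_dims):
        coords.append(index % d)
        index //= d
    return tuple(reversed(coords))


def job_seed(base_seed, coords, grid_dims):
    """Per-job seed: base seed plus the job's row-major position."""
    return base_seed + coords_to_index(coords, grid_dims)


print(coords_to_index((1, 2), (2, 3)))  # 5
print(index_to_coords(5, (2, 3)))       # (1, 2)
print(job_seed(100, (1, 2), (2, 3)))    # 105
```

Because each job in the grid has a distinct row-major position, this rule gives each job a distinct seed for a given base seed.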
Introduce a simulation update callback mechanism and distributed update propagation via ZMQ (guarded by G2S_QS_DISTRIBUTED).

- Add new header include/simulationUpdateCallback.hpp defining g2s_simulation_update_kind and g2s_simulation_update_callback_t.
- Extend simulation(), simulationFull(), and simulationAD() signatures to accept an update callback and user data, and invoke the callback at appropriate points (vector/full updates).
- Parse new endpoint grid CLI (-eg / -endpoint_grid_json) and store flattenedEndpointNames in QsDistributedOptions (qsDistributedUtils updates).
- Implement distributed send/receive logic in src/qs.cpp (when G2S_QS_DISTRIBUTED): ZMQ XSUB/XPUB sockets, publisher/receiver threads, a send queue/context, payload packing/unpacking helpers, precomputed lookup tables for mapping global->local indices, and applying received vector/full updates to the DataImage.
- Wire up callback to enqueue outgoing updates and set up ZMQ endpoints, neighbor connections, and graceful shutdown/cleanup.
- Add necessary includes and small utility functions for byte serialization and padded-domain checks.

These changes enable runtime propagation of partial/full simulation updates across worker endpoints for distributed simulations.
Introduce a virtual encodeJobGridMatrixToJsonString(...) in InterfaceTemplate and add normalizeJobGridParameter to convert -job_grid/-jg into JSON before communication. Implement matrix->JSON encoders for MATLAB and Python backends (handling numeric/logical arrays and integer-like floats) and remove duplicated per-backend normalization code. Call normalization in runStandardCommunication and tidy includes. Also add the -ld_classic linker flag to the macOS MATLAB build invocation.
Issue:
In distributed-memory runs, receiver lookup table construction skipped entries with globalPathIndex==0, assuming 0 always meant hard data. However with the current global indexing, 0 is also a valid simulated path value (owner 0, first simulated rank). Under specific seeds/topologies this dropped required halo updates, leaving some cells unresolved and causing long waits/holds.

Fix:
- Build lookup tables by filtering hard-data from DI content instead of filtering by global path value.
  - full-sim: skip entries where target scalar is already non-NaN (unless forceSimulation).
  - vector-sim: skip entries where full target cell is already known (unless forceSimulation).
- Keep only invalid sentinel filtering on maxPosteriorValue; do not discard globalPathIndex==0.
- Wire forceSimulation into lookup table construction to preserve forced overwrite semantics.
- Add inline code comments documenting the root cause and the filter rule.

Result:
Receiver no longer drops legitimate updates for global index 0, removing the observed distributed wait/hold behavior for the reproducing seed.
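
The filter rule from this fix can be sketched for the full-sim (scalar) case as follows. This is an illustration under assumptions, not the code in src/qs.cpp: `build_lookup` and its parameters are hypothetical, and the sentinel check is assumed to be an equality test against maxPosteriorValue.

```python
import math


def build_lookup(global_path, target_values, max_posterior_value,
                 force_simulation=False):
    """Map global path index -> local cell for cells that should accept
    remote updates."""
    lookup = {}
    for local_cell, global_index in enumerate(global_path):
        if global_index == max_posterior_value:
            continue  # invalid sentinel: never simulated
        if not force_simulation and not math.isnan(target_values[local_cell]):
            continue  # hard data already present in the DI content
        # Note: global_index == 0 is a valid simulated value and is kept.
        lookup[global_index] = local_cell
    return lookup


SENTINEL = 2**32 - 1
path = [0, 5, SENTINEL, 7]
values = [float("nan"), 1.0, float("nan"), float("nan")]
print(build_lookup(path, values, SENTINEL))        # {0: 0, 7: 3}
print(build_lookup(path, values, SENTINEL, True))  # {0: 0, 5: 1, 7: 3}
```

The entry for global index 0 survives in both calls, which is the behavior the fix restores.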
Remove verbose fprintf-based debug logging in distributed send/receive codepaths and replace them with a single comment indicating debug prints are disabled. The message parsing, validation and statistics updates remain unchanged; only the flockfile/fprintf/funlockfile debug outputs in src/qs.cpp were removed.
Detect and upload inline JSON payloads: add trimWhitespaceCopy, isInlineJsonPayload and uploadJson helpers in interfaceTemplate.hpp and wire them into lookForUpload so JSON strings are uploaded to the server and replaced with "/tmp/G2S/data/<hash>.json". Server-side storeJson parsing condition adjusted (inSize check) and UPLOAD_JSON now stores JSON as uncompressed (changed compressed flag to false). These changes ensure inline JSON parameters are recognized, uploaded, and referenced consistently by hash-backed temporary files.
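
One plausible way to recognize an inline JSON payload is sketched below. The names mirror trimWhitespaceCopy and isInlineJsonPayload, but the logic is an assumption: the real helpers in interfaceTemplate.hpp may rely on a cheaper heuristic than the full parse used here.

```python
import json


def trim_whitespace_copy(text):
    """Return a copy of `text` without leading or trailing whitespace."""
    return text.strip()


def is_inline_json_payload(value):
    """True if `value` looks like a JSON object or array rather than a
    file path or plain parameter."""
    trimmed = trim_whitespace_copy(value)
    if len(trimmed) < 2:
        return False
    closing = {"{": "}", "[": "]"}
    # The first character must open a container that the last one closes.
    if closing.get(trimmed[0]) != trimmed[-1]:
        return False
    try:
        json.loads(trimmed)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return False
    return True


print(is_inline_json_payload("  [[0,1],[2,3]] "))  # True
print(is_inline_json_payload("/tmp/grid.json"))    # False
```

A value that passes this check would then be uploaded and replaced by its hash-backed path, while anything else is passed through unchanged.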
Add algoNames to default targets in both c++ and intel Makefiles. Refactor normalizeJobGridParameter in interfaceTemplate.hpp: collect values from -jg, -job_grid_json and -job_grid into a single vector, canonicalize inserts to use the -jg key, include -jg in the JSON-parameter upload set, and update the error message and comment. In src/jobTasking.cpp replace execv with execvp so executed commands are searched on PATH.
Parse progress percentages from qs log output and report aggregated progress with ETA. Introduces a PROGRESS_PATTERN regex, polling/report intervals, and helper functions parse_progress_from_line and read_progress_from_log to read log files incrementally and extract progress values (including handling partial lines and finalization). Commands now include explicit log paths and internal data structures were extended to track per-job log offsets, fragments, and progress; processes list and all_cmds were updated accordingly. The main loop now non-blockingly polls child processes, updates per-job progress, finalizes log reads on exit, aggregates progress across jobs, and emits periodic progress reports.
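
The incremental parsing and aggregation can be sketched as below. The regex is an assumption (the actual PROGRESS_PATTERN and the qs log format are not shown in this PR text), and `consume_log_chunk` and `aggregate` are hypothetical stand-ins for read_progress_from_log and the main-loop aggregation. They operate on strings instead of file offsets to keep the example self-contained.

```python
import re

# Assumed log format: a number followed by a percent sign, e.g. "12.5%".
PROGRESS_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def parse_progress_from_line(line):
    """Return the last percentage found on the line, or None."""
    matches = PROGRESS_PATTERN.findall(line)
    if not matches:
        return None
    return min(100.0, float(matches[-1]))


def consume_log_chunk(chunk, fragment, progress, final=False):
    """Feed newly read log text. Returns (fragment, progress), where
    fragment is the trailing partial line kept for the next call."""
    lines = (fragment + chunk).split("\n")
    # Unless finalizing, the last piece may be an incomplete line.
    fragment = "" if final else lines.pop()
    for line in lines:
        value = parse_progress_from_line(line)
        if value is not None:
            progress = max(progress, value)
    return fragment, progress


def aggregate(progress_by_job, elapsed_seconds):
    """Mean progress across jobs and a linear ETA in seconds (or None)."""
    mean = sum(progress_by_job) / len(progress_by_job)
    eta = elapsed_seconds * (100.0 - mean) / mean if mean > 0 else None
    return mean, eta


frag, prog = consume_log_chunk("progress 10.5%\nprogress 2", "", 0.0)
frag, prog = consume_log_chunk("0%\n", frag, prog)
print(prog)                             # 20.0
print(aggregate([prog, 60.0], 40.0))    # (40.0, 60.0)
```

Keeping the fragment between polls prevents a line that is split across two reads, here "progress 2" followed by "0%", from being parsed as 2% or 0%.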