Conversation


@RasmusOrsoe commented Dec 9, 2025

This PR extends the list of supported backends to include lmdb, thereby addressing #834 and closing #820.

The main benefits of LMDB are threefold: It requires roughly half the space of SQLite, it has significantly faster random access for larger events than SQLite, and it provides a generic way of pre-computing data representations.

The downsides are subjective: there is no SQL syntax, and accessing large subsets of the dataset in one go is slow.

Major Changes

  • Adds LMDBWriter:
    The writer outputs .lmdb databases, where entries are key-value pairs. Keys are created from the index column (similar to the primary key in sqlite), and the value associated with each entry is all extracted data for the given event. The values are serialized, and several common serialization methods (json, pickle, etc.) are supported.
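
    For illustration, here is a minimal sketch of the storage scheme described above, written against the lmdb and pickle packages directly. This is not the LMDBWriter implementation; the event payload and file name are made up:

    import lmdb
    import pickle

    # Illustrative event payload: one dict per event, one sub-dict per table.
    event = {
        "truth": {"event_no": 1, "energy": 42.0},
        "pulsemap": {"dom_x": [0.1, 0.2], "dom_y": [0.3, 0.4]},
    }

    env = lmdb.open("example.lmdb", map_size=2**30)
    with env.begin(write=True) as txn:
        # The key is the index-column value (here event_no); the value is
        # the serialized blob with all extracted data for that event.
        txn.put(b"1", pickle.dumps(event))
    env.close()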

    To make the files self-contained, the databases contain a __meta__ entry with information on the serialization method used, and utility functions are added that will identify the correct method and use it for deserialization on queries. As a result, the user doesn't need to know the serialization method in order to read the files. Below is an example of a query:

    from graphnet.data.utilities.lmdb_utilities import query_database
    
    lmdb_path = "~/merged/merged.lmdb"
    event_no = 1
    
    # The query function automatically detects the serializer used and will deserialize the blob
    # Result is a dict with all table entries for a single event
    # Every entry in the dict is a `table`. E.g. result["truth"] 
    
    result = query_database(lmdb_path, event_no)

    I profiled query speeds versus event size in a typical data-loading scenario and found the following relationship:

    [Figure: query speed vs. event size for lmdb and sqlite]

    The query includes both the event-level truth and the (deserialized) pulsemap, and is repeated 100 times per event. Real-time computation of representations and the overhead of establishing connections are not included. The figure shows that for large events, lmdb offers a significant speed-up.

    Additionally, the LMDBWriter accepts a list of DataRepresentations; if provided, the representations are calculated and stored in the file alongside the other extracted data. Another meta field, containing the config files of the representations, is written to the files, allowing users to re-instantiate the data representation modules used to compute the representations. A utility function for retrieving these is added. As such, this PR also closes #781 (Graph construction before training). An example of retrieving a data representation from the metadata can be seen below:

    from graphnet.data.utilities.lmdb_utilities import query_database, get_data_representation_from_metadata
    
    event_no = 0
    lmdb_path = "~/merged/merged.lmdb"
    # List the available representations
    query_database(database=lmdb_path, index=event_no)["data_representations"].keys()
    
    # Returns e.g. dict_keys(['KNNGraph', 'GraphDefinition'])
    
    # Get the data representation for 'KNNGraph' from the metadata
    data_representation = get_data_representation_from_metadata(lmdb_path, "KNNGraph")

    It is assumed that the data representation used is part of the user's graphnet installation - i.e. exotic representations that are not yet part of the library will fail. There is no robust way around this.

  • Adds LMDBDataset
    The dataset is compatible with the .lmdb files and is largely identical to the existing SQLiteDataset. It supports str-selections and has a "pre-computed" mode, where the user may choose to query pre-computed data representations instead of calculating them in real time. A sketch of possible usage is shown below.
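
    This is a hypothetical usage sketch: the argument names mirror the existing SQLiteDataset, and the import path and the pre-computed-mode flag are assumptions rather than the actual API:

    from graphnet.data.dataset import LMDBDataset  # import path assumed

    dataset = LMDBDataset(
        path="~/merged/merged.lmdb",
        pulsemaps="pulsemap",                  # table holding the pulses
        features=["dom_x", "dom_y", "dom_z"],
        truth=["energy"],
        selection="event_no < 1000",           # str-selections are supported
        use_precomputed=True,                  # hypothetical flag: query the stored
                                               # representations instead of computing them
    )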

  • Adds SQLiteToLMDBConverter
    A pre-configured converter that converts existing sqlite databases to lmdb format, similar to our ParquetToSQLiteConverter. This converter also accepts a list of data representations, allowing you to export to lmdb alongside pre-computed representations.
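
    A hypothetical sketch of the conversion; the import path, arguments, and call signature are assumptions modeled on the existing pre-configured converters:

    from graphnet.data.pre_configured import SQLiteToLMDBConverter  # path assumed

    converter = SQLiteToLMDBConverter(
        # Assumed argument: pre-compute representations during conversion.
        # `my_knn_graph` stands in for a previously constructed DataRepresentation.
        data_representations=[my_knn_graph],
    )
    converter("~/databases/my_database.db")  # call signature assumed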

Minor Changes

  • Expanded the test suite to include the lmdb backend for unit testing
  • Replaced deprecated converters in the relevant test suite
  • Icetray conversion example adjusted to include the lmdb backend (which is now the default format in the example)
  • Added utility functions (graphnet.data.utilities.lmdb_utilities) for querying events, etc.
  • Minor updates to the documentation, kept small to avoid bloating the PR further.
  • Updated the GraphNeTDataModule to support the lmdb backend

Tagging @astrojarred @giogiopg @Aske-Rosted @sevmag and @pweigel as we've all discussed various aspects of this in the past.

@RasmusOrsoe RasmusOrsoe changed the title Lmdb pr Add lmdb as alternative file format Dec 10, 2025
@RasmusOrsoe RasmusOrsoe requested a review from giogiopg December 10, 2025 10:32

@Aske-Rosted left a comment


I found two small things, one of which might not be specific to this PR and could be filed as an issue/improvement for the future, since it also occurs for the parquet file format.

With a contribution this large it is hard for me to determine, just by reading the code, whether all the logic in e.g. the new writer is correct, but it seems to me that the tests cover the new functionality.

loss_weight_column: Optional[str] = None,
loss_weight_default_value: Optional[float] = None,
seed: Optional[int] = None,
labels: Optional[Dict[str, Any]] = None,

Can we not do something smarter with kwargs, so that we avoid duplicating code and also don't have to worry about forgetting to add new arguments to the LMDB class when new functionality is added to the dataset class?
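
A minimal sketch of the kind of kwargs forwarding being suggested; the class names are illustrative stand-ins, not the actual graphnet classes:

from typing import Any, Optional


class Dataset:
    """Stand-in for the existing base dataset class."""

    def __init__(
        self,
        seed: Optional[int] = None,
        labels: Optional[dict] = None,
    ) -> None:
        self._seed = seed
        self._labels = labels


class LMDBDataset(Dataset):
    """Declares only LMDB-specific arguments; everything else is forwarded."""

    def __init__(self, lmdb_path: str, **kwargs: Any) -> None:
        # New base-class arguments are picked up automatically via **kwargs,
        # so they never need to be re-declared here.
        super().__init__(**kwargs)
        self._lmdb_path = lmdb_path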


I think there might be a point to repeating some args and code, so that non-expert users don't have to dig into the native functions where all the args are coming from. Just my opinion.

try:
    label = fn(data)
except KeyError:
    raise KeyError(f"Key {key} not found in data.")

@Aske-Rosted commented Jan 23, 2026


I don't believe this KeyError is correct: what is going wrong is that the function does not find the expected values in the data from which to create the label, but the key returned in the error is the name of the label that the function was supposed to create.
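
A possible fix, sketched below: chain the underlying exception so the error names both the label and the field that was actually missing:

try:
    label = fn(data)
except KeyError as e:
    # `e` names the field missing from `data`; `key` is the name of
    # the label that was supposed to be created.
    raise KeyError(
        f"Could not compute label '{key}': missing field {e} in data."
    ) from e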


@giogiopg left a comment


This PR is impressive. I would only suggest adding, either to this PR or a new one, an example in the examples folder (or modifying examples/01_icetray/01_convert_i3_files.py) that includes a conversion to lmdb with a precomputed graph representation that can later be called. I believe the documentation examples in docs/source/datasets/datasets.rst should be enough to show how to load precomputed data representations, so there may be no need for an example of training from an lmdb file, but I would still go for the data-conversion example that includes a precomputed data representation.
