Add lmdb as alternative file format
#852
Conversation
Aske-Rosted left a comment
I found two small things, one of which might not be specific to this PR and could just be an issue/improvement for the future, since it also occurs for the parquet file format.

With this large a contribution it is hard for me to determine just by reading the code whether all the logic in, for example, the new writer is correct, but it seems to me that the tests should cover the new functionality.
```python
loss_weight_column: Optional[str] = None,
loss_weight_default_value: Optional[float] = None,
seed: Optional[int] = None,
labels: Optional[Dict[str, Any]] = None,
```
Can we not do something smarter with kwargs, such that we both avoid duplicate code and don't have to worry about forgetting to add new arguments to the LMDB class when new functionality is added to the dataset class?
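One way to implement this suggestion, sketched below with hypothetical class names (not the actual graphnet classes): the LMDB-specific class names only its own arguments and forwards everything else to the parent via `**kwargs`, so new parent arguments never need to be mirrored by hand.

```python
from typing import Any, Optional


class Dataset:
    """Stand-in for the existing dataset base class (hypothetical)."""

    def __init__(
        self,
        path: str,
        seed: Optional[int] = None,
        loss_weight_column: Optional[str] = None,
    ) -> None:
        self.path = path
        self.seed = seed
        self.loss_weight_column = loss_weight_column


class LMDBDataset(Dataset):
    """LMDB-specific subclass that only declares its own arguments."""

    def __init__(self, path: str, map_size: int = 2**30, **kwargs: Any) -> None:
        # Forward all remaining keyword arguments to the parent, so any
        # argument added to `Dataset` later is automatically accepted here.
        super().__init__(path, **kwargs)
        self.map_size = map_size


ds = LMDBDataset("events.lmdb", seed=42, loss_weight_column="w")
print(ds.seed, ds.loss_weight_column, ds.map_size)
```

The trade-off is discoverability: the subclass signature no longer documents the inherited options, which is the point raised in the reply below.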
I think there might be a point to repeating some args and code, so that non-expert users don't have to dig into the native functions where all the args are coming from. Just my opinion.
```python
try:
    label = fn(data)
except KeyError:
    raise KeyError(f"Key {key} not found in data.")
```
I don't believe this KeyError to be correct: what is actually going wrong is that the function does not find the expected values in the data from which to create the label, but the key reported in the error is the name of the label that the function was supposed to create.
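A minimal sketch of one way to address this (names are illustrative, not the PR's actual code): chain the original exception with `raise ... from`, so the error message carries both the label name and the field that was genuinely missing from the data.

```python
from typing import Any, Callable, Dict


def compute_label(
    fn: Callable[[Dict[str, Any]], Any],
    data: Dict[str, Any],
    label_name: str,
) -> Any:
    """Apply a label-producing function, surfacing the real missing key."""
    try:
        return fn(data)
    except KeyError as exc:
        # `exc` names the field that was actually absent from `data`;
        # `label_name` is only the output label the function produces.
        raise KeyError(
            f"Could not compute label '{label_name}': "
            f"missing field {exc} in data."
        ) from exc


data = {"energy": 1.5}
try:
    compute_label(lambda d: d["zenith"] * 2, data, "direction")
except KeyError as err:
    print(err)
```

This keeps the full traceback of the underlying lookup failure while still telling the user which label could not be built.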
giogiopg left a comment
This PR is impressive. I would only suggest adding, in this PR or a new one, an example in the examples folder (or modifying examples/01_icetray/01_convert_i3_files.py) that includes a conversion to lmdb with a precomputed graph representation that can be loaded later. I believe the documentation examples in docs/source/datasets/datasets.rst should be enough to show how to load precomputed data representations, so there may be no need for an example of training from an lmdb file in the examples folder, but I would still go for the example of the data conversion including a precomputed data representation.
This PR extends the list of supported backends to include lmdb, thereby addressing #834 and closing #820.

The main benefits of LMDB are threefold: it requires roughly half the space of SQLite, it has significantly faster random access for large events than SQLite, and it provides a generic way of pre-computing data representations. The downsides are subjective: there is no SQL syntax, and accessing large subsets of the dataset in one go is slow.
Major Changes
Adds `LMDBWriter`: The writer outputs .lmdb databases, where entries are key-value pairs. The keys are created from the index column (similar to the primary key in SQLite), and the value associated with an entry is all extracted data for the given event. The values are serialized, and several common serialization methods (`json`, `pickle`, etc.) are supported.

To make the files self-contained, the databases contain a `__meta__` entry with information on the serialization method used, and utility functions are added that identify the correct method and use it for deserialization on queries. As a result, the user doesn't need to know the serialization method in order to read the files. Below is an example of a query.

I profiled the query speed in a usual data-loading scenario against the event size, and found the following relationship:

The query speed includes both event-level truth and the pulsemap (deserialized), and each query is repeated 100 times per event. Real-time computation of representations and the computational overhead of establishing connections are not included. From the figure it can be seen that for large events, lmdb offers a significant speed-up.

Additionally, the `LMDBWriter` accepts a list of `DataRepresentation`s; if provided, the representations are calculated and stored in the file alongside the other extracted data. Another meta field written to the files contains the config files of the representations, allowing users to re-instantiate the data representation modules used to compute them, and a utility function for retrieving these is added. As such, this PR also closes #781 (Graph construction before training). An example of retrieving the data representation from the metadata can be seen below.

It is assumed that the data representation used is part of the user's graphnet installation, i.e. exotic representations that are not yet part of the library will fail. There is no robust way around this.
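The self-describing `__meta__` mechanism described above can be sketched as follows. For brevity a plain dict stands in for the LMDB environment (a real file would go through `lmdb` transactions), and the key and field names are illustrative, not graphnet's actual schema:

```python
import json
import pickle

# Stand-in for an LMDB environment: byte-keyed, byte-valued storage.
# With the real library this would be a transaction (txn.put / txn.get).
db = {}

SERIALIZERS = {
    "json": lambda obj: json.dumps(obj).encode(),
    "pickle": pickle.dumps,
}
DESERIALIZERS = {
    "json": lambda raw: json.loads(raw.decode()),
    "pickle": pickle.loads,
}

# Writing: record the chosen serialization method in a __meta__ entry,
# then store each event under its index-column key.
method = "json"
db[b"__meta__"] = json.dumps({"serialization": method}).encode()
db[b"42"] = SERIALIZERS[method]({"dom_x": [1.0, 2.0], "energy": 3.1})

# Reading: the file is self-contained, so the reader first looks up
# __meta__ and then dispatches to the matching deserializer -- the user
# never has to know which method was used at write time.
meta = json.loads(db[b"__meta__"].decode())
event = DESERIALIZERS[meta["serialization"]](db[b"42"])
print(event["energy"])
```

The same lookup-then-dispatch pattern would apply to the stored representation configs: read the meta field, then re-instantiate the corresponding module.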
Adds `LMDBDataset`: The dataset is compatible with the .lmdb files and largely identical to the existing `SQLiteDataset`. It supports str-selections and has a "pre-computed" mode, in which the user may query pre-computed data representations instead of calculating them in real-time.

Adds `SQLiteToLMDBConverter`: A pre-configured converter that converts existing SQLite databases to lmdb format, similar to our `ParquetToSQLiteConverter`. This converter also accepts a list of data representations, allowing you to export to lmdb alongside pre-computed representations.

Minor Changes

Extends `GraphNeTDataModule` to support the lmdb backend.

Tagging @astrojarred @giogiopg @Aske-Rosted @sevmag and @pweigel, as we've all discussed various aspects of this in the past.