Hello everybody !
As some of you might know, I have been generating ttbar events using the ODD for a while now to train the Seeding Transformer I have been developing.
Today I realised that the ActsExamples::RandomNumbers we use to generate event-based seeds is not random at all...
It uses hash_combine (https://www.boost.org/doc/libs/1_36_0/doc/html/hash/reference.html#boost.hash_combine) to combine the global run seed with the event number to create a unique event seed. Unfortunately, the result is not unique at all if you perform multiple simulation runs with adjacent seeds. So what is the issue with the hash_combine implementation?
seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
If v (for us, the event number) is an int, then hash_value is the identity. What this means is that the new seed can be written (for seed>0):
seed ^= event_id + 0x9e3779b9 + seed*64 + int(seed / 4);
This will lead to many collisions; for example, the three following seed-event pairs all result in seed 2655077414: (10110, 64), (10111, 1), (10112, 13)
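The collision above is easy to reproduce in a few lines of Python, mimicking boost::hash_combine with a 32-bit unsigned seed and hash_value(int) being the identity:

```python
MASK32 = 0xFFFFFFFF

def hash_combine(seed: int, v: int) -> int:
    """Mimics boost::hash_combine for an integer v, where hash_value(v) == v,
    with all arithmetic reduced modulo 2**32 as for a 32-bit unsigned seed."""
    seed ^= (v + 0x9E3779B9 + ((seed << 6) & MASK32) + (seed >> 2)) & MASK32
    return seed & MASK32

# The three seed-event pairs from above all land on the same event seed:
for run_seed, event in [(10110, 64), (10111, 1), (10112, 13)]:
    print(run_seed, event, "->", hash_combine(run_seed, event))
# each pair prints ... -> 2655077414
```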
Right now, to avoid the issue, I try to make sure my seeds are not consecutive (using seed^2 works nicely). But I think we should either use something other than hash_combine (I didn't check what approaches exist) or maybe use a type for which hash_value is not the identity.
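Since the root problem is that hash_value on an int does nothing to mix the bits, one option, just a sketch of the idea and not something I checked against the ACTS code base, would be to run the (run seed, event number) pair through a proper 64-bit mixer such as splitmix64 before seeding the generator:

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def splitmix64(x: int) -> int:
    """SplitMix64 finalizer: a bijection on 64-bit words, so distinct
    inputs are guaranteed to produce distinct outputs."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

def event_seed(run_seed: int, event_id: int) -> int:
    # Pack run seed and event number into one 64-bit word before mixing;
    # distinct (run_seed, event_id) pairs map to distinct packed words,
    # and splitmix64 being bijective preserves that distinctness.
    return splitmix64(((run_seed & MASK64 >> 32) << 32) | (event_id & 0xFFFFFFFF))

# The three colliding pairs from above now give three distinct seeds:
seeds = {event_seed(s, e) for s, e in [(10110, 64), (10111, 1), (10112, 13)]}
print(len(seeds))  # -> 3
```

This guarantees no collisions across (run seed, event number) pairs as long as both fit in 32 bits, since the packing and the mixer are both injective.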
To loop back to what I am doing right now: I simulated 50k ttbar events (500 runs of 100 events) with seeds from 1000 to 1500. Results: 37% seed collisions and only 30k unique events. More problematic, my training and testing events are all mixed :(
If people want to recreate the issue, I am attaching a small Python script that shows this exact effect.
test_seed.py