
lgritz (Collaborator) commented Jan 6, 2026

For IBA::resample() when bilinear interpolation is used, almost all of the expense was due to its reliance on ImageBuf::interppixel, which is simple but constructs a new ImageBuf::ConstIterator EVERY TIME it is called, and that is very expensive.

Reimplement in a way that reuses a single iterator. This typically speeds up IBA::resample by 20x or more.
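A minimal C++ sketch of the structural change, under stated assumptions: `Image` is a hypothetical single-channel float image and `Cursor` is a hypothetical stand-in for `ImageBuf::ConstIterator` (neither is the OIIO API). The half-integer pixel-center and clamp-to-edge conventions are choices made for this sketch. The point it shows is that `resample_bilinear` builds one cursor before the pixel loop and repositions it for each of the four bilinear taps, instead of constructing a fresh accessor per sample.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-channel float image (NOT OIIO's ImageBuf).
struct Image {
    int width = 0, height = 0;
    std::vector<float> data;
};

// Hypothetical cursor standing in for an iterator that is costly to build.
// Here construction is trivial; the sketch only shows where it happens.
class Cursor {
public:
    explicit Cursor(const Image& img) : img_(img) {}
    // Reposition the existing cursor (clamp-to-edge) instead of rebuilding it.
    void pos(int x, int y) {
        x_ = std::clamp(x, 0, img_.width - 1);
        y_ = std::clamp(y, 0, img_.height - 1);
    }
    float value() const { return img_.data[std::size_t(y_) * img_.width + x_]; }
private:
    const Image& img_;
    int x_ = 0, y_ = 0;
};

// Bilinear sample at continuous pixel coords (s, t), pixel centers at i + 0.5
// (a convention chosen for this sketch), reusing the caller's cursor.
float bilinear(Cursor& c, float s, float t) {
    float fx = s - 0.5f, fy = t - 0.5f;
    int x0 = int(std::floor(fx)), y0 = int(std::floor(fy));
    float ax = fx - x0, ay = fy - y0;
    c.pos(x0, y0);         float v00 = c.value();
    c.pos(x0 + 1, y0);     float v10 = c.value();
    c.pos(x0, y0 + 1);     float v01 = c.value();
    c.pos(x0 + 1, y0 + 1); float v11 = c.value();
    float top = v00 + ax * (v10 - v00);
    float bot = v01 + ax * (v11 - v01);
    return top + ay * (bot - top);
}

// Resample src to dw x dh; ONE cursor is constructed outside the pixel loop.
Image resample_bilinear(const Image& src, int dw, int dh) {
    Image dst{dw, dh, std::vector<float>(std::size_t(dw) * dh)};
    Cursor cursor(src);  // built once, reused for every output sample
    for (int y = 0; y < dh; ++y)
        for (int x = 0; x < dw; ++x) {
            float s = (x + 0.5f) * src.width / dw;
            float t = (y + 0.5f) * src.height / dh;
            dst.data[std::size_t(y) * dw + x] = bilinear(cursor, s, t);
        }
    return dst;
}
```

Because the toy `Cursor` is cheap to construct, this only illustrates the shape of the change; the speedup reported in this PR comes from no longer paying the real iterator's construction cost on every sample.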

Also refactor resample to move the handling of deep images out of the main inner loop and into a separate helper function, and add some benchmarking.
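Moving deep handling out of the inner loop amounts to deciding "deep or flat" once, before the pixel loops, instead of testing it per pixel. A sketch under stated assumptions: `Img`, `resample`, `resample_flat`, `resample_deep`, and `src_coord` are hypothetical names, not OIIO functions, and both paths use crude point sampling purely to keep the example short; they are not meant to reproduce OIIO's sampling.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical image type (NOT the OIIO API): either flat, with one float per
// pixel, or "deep", with a variable-length list of samples per pixel.
struct Img {
    int width = 0, height = 0;
    bool deep = false;
    std::vector<float> flat;                  // used when !deep
    std::vector<std::vector<float>> samples;  // used when deep
};

// Map a destination index to a source index by integer scaling
// (point sampling, a simplification made only for this sketch).
static int src_coord(int d, int src_size, int dst_size) {
    return d * src_size / dst_size;
}

// Flat path: tight loops with no deep/flat test inside them.
static void resample_flat(Img& dst, const Img& src) {
    for (int y = 0; y < dst.height; ++y) {
        int sy = src_coord(y, src.height, dst.height);
        for (int x = 0; x < dst.width; ++x) {
            int sx = src_coord(x, src.width, dst.width);
            dst.flat[std::size_t(y) * dst.width + x] =
                src.flat[std::size_t(sy) * src.width + sx];
        }
    }
}

// Deep path, kept in its own helper: copies whole per-pixel sample lists.
static void resample_deep(Img& dst, const Img& src) {
    for (int y = 0; y < dst.height; ++y) {
        int sy = src_coord(y, src.height, dst.height);
        for (int x = 0; x < dst.width; ++x) {
            int sx = src_coord(x, src.width, dst.width);
            dst.samples[std::size_t(y) * dst.width + x] =
                src.samples[std::size_t(sy) * src.width + sx];
        }
    }
}

// The deep/flat decision is made once here, not once per pixel.
Img resample(const Img& src, int dw, int dh) {
    Img dst;
    dst.width = dw;
    dst.height = dh;
    dst.deep = src.deep;
    if (src.deep) {
        dst.samples.resize(std::size_t(dw) * dh);
        resample_deep(dst, src);
    } else {
        dst.flat.resize(std::size_t(dw) * dh);
        resample_flat(dst, src);
    }
    return dst;
}
```

Keeping the rare deep case in its own function also keeps the common flat loop small and branch-free, which is what makes it a good target for further optimization.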

Signed-off-by: Larry Gritz <[email protected]>
ssh4net (Contributor) commented Jan 7, 2026

I have added your fixes to the hwy branch and run a comparison. It now looks more fair :)

Type       | Scalar(ms) |   SIMD(ms) | Speedup

[ Resample 75% ]
uint8      |       9.43 |       1.67 |   5.65x
uint16     |       9.17 |       1.81 |   5.07x
uint32     |      26.92 |      19.57 |   1.38x
float      |       9.55 |       1.87 |   5.11x
half       |      11.30 |       4.69 |   2.41x
double     |      32.98 |      30.87 |   1.07x

[ Resample 50% ]
uint8      |       4.20 |       0.79 |   5.31x
uint16     |       4.19 |       0.89 |   4.70x
uint32     |      17.40 |      17.01 |   1.02x
float      |       4.33 |       0.79 |   5.48x
half       |       5.36 |       2.25 |   2.38x
double     |      20.27 |      18.89 |   1.07x

[ Resample 25% ]
uint8      |       1.24 |       0.24 |   5.05x
uint16     |       1.19 |       0.26 |   4.52x
uint32     |      11.98 |      11.16 |   1.07x
float      |       1.08 |       0.20 |   5.39x
half       |       1.29 |       0.55 |   2.37x
double     |      16.50 |      13.57 |   1.22x

jessey-git (Contributor) commented:

Ran some tests locally and can also confirm the perf goodness (on boring uint8 pngs) as well as successful idiff comparisons of main vs the PR.

Times in ms, averaged over 5 runs:

Input size | Scale | Output size | main (ms) |  PR (ms)
4096x4096  | 25%   | 1024x1024   |  282.5024 |  16.9484
4096x4096  | 33%   | 1351x1351   |  484.7668 |  29.4051
4096x4096  | 50%   | 2048x2048   | 1126.7262 |  61.9360
4096x4096  | 67%   | 2744x2744   | 2021.3136 |  96.1518
4096x4096  | 75%   | 3072x3072   | 2529.3684 | 124.1163
3840x2160  | 25%   | 960x540     |  139.5478 |   7.0885
3840x2160  | 33%   | 1267x712    |  243.2302 |  12.2748
3840x2160  | 50%   | 1920x1080   |  558.0922 |  25.5491
3840x2160  | 67%   | 2572x1447   |  996.6915 |  47.6503
3840x2160  | 75%   | 2880x1620   | 1247.9340 |  62.8001
2160x3840  | 25%   | 540x960     |  139.3882 |   8.7678
2160x3840  | 33%   | 712x1267    |  244.5914 |  16.3722
2160x3840  | 50%   | 1080x1920   |  557.2218 |  25.6951
2160x3840  | 67%   | 1447x2572   |  994.1367 |  45.7704
2160x3840  | 75%   | 1620x2880   | 1246.6437 |  60.2098
116x2052   | 25%   | 29x513      |    4.1224 |   0.3348
116x2052   | 33%   | 38x677      |    7.1067 |   0.4689
116x2052   | 50%   | 58x1026     |   16.1875 |   0.8663
116x2052   | 67%   | 77x1374     |   28.1018 |   1.4981
116x2052   | 75%   | 87x1539     |   36.0132 |   2.4759

lgritz (Collaborator, Author) commented Jan 7, 2026

> I have added your fixes to the hwy branch and run a comparison. It now looks more fair :)

Yeah, around a 5x speedup is what I'd expect (or at least hope for) from a good scalar implementation vs. a good AVX2 implementation. A result very far from that in either direction tells me that one of them has something wrong with it, probably something easily fixable.

lgritz (Collaborator, Author) commented Jan 8, 2026

Approval on this?

ssh4net (Contributor) commented Jan 8, 2026

If there's no regression in the results, then it's definitely ready :)

@lgritz lgritz merged commit 774368d into AcademySoftwareFoundation:main Jan 8, 2026
29 checks passed
@lgritz lgritz deleted the lg-resamplespeed branch January 8, 2026 19:48
lgritz added a commit to lgritz/OpenImageIO that referenced this pull request Jan 8, 2026
…wareFoundation#4993)
