-
Notifications
You must be signed in to change notification settings - Fork 72
Interleave spherical harmonics coefficients #146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Thanks. This is awesome. Happy to see that precision up! |
Did some profiling of the draw call that evaluates the spherical harmonics using Changing the representation to separate textures sampled at the same location had a considerably smaller impact. So with 3 Interestingly enough the time it saves while loading/parsing splat files is more than it takes to split the coefficients per texture. Going to rework this PR to support both old and new mode (specified through |
@mrxz I like the general idea of increasing the precision of the SH coefficients to be uniformly 8 bits, and it seems from #145 and from some other comments we saw on our Discord on SPZ file format's bit reduction that we get some banding as a result. I hope that 8 bits would be enough, but I'm worried that even that wouldn't do it. I don't think we can afford float16 from a memory/performance perspective... but should we allocate more than 8 bits possibly? Of maybe we need some off-line tool for people to analyze their PLY and get recommendations for shNmin/max? Not entirely a serious idea, but I wonder if a "random dithered rounding" could make the banding basically imperceptible, or if it would just look noisy :). So basically when rounding to sint8 (or lower bits) we add (rand() - 0.5) to the values and then round those. My impression with RGB-style textures is that they are generally discouraged because some hardware isn't equipped to deal with data stride 3 efficiently. I wonder if that's what was going on with Quest 3 / Android? Another theory I have is that the pseudo-compute shader presumably executes over "tiles", and if those tile sizes happen to match the tile size of the texture lookups, then it's almost like we are guaranteed good texture cache efficiency. But if the texture size doesn't match then the texture access pattern gets spread over multiple tiles in a strange way. This was the mental model I had when I originally coded up the sh1-3 textures. The most annoying thing is SH1 with 9 coefficients... 1 more than a power of 2! I can't think of a great way to pack things such that we have 8 bits per coefficient and don't waste too much memory (and bandwidth) :/ I think if we can be okay with If you can make the PR support both old and new modes then that would be great from a user's perspective - they can choose between slightly more memory efficient, or higher bit precision. Should we be worried about the code complexity from all these options?? I guess there are other places in Spark that are getting pretty complex :) Anyway, great work, I support this line of inquiry, as long as there are options so that Android/Quest users won't be impacted with no recourse! |
I would assume 8 bits should suffice given how many of the formats seem to use that as well (SOGS, Playcanvas Compressed .ply, SPZ). We should try to make the most out of these 8 bits, as there are two critical points:
Instead of an offline tool, I've been wondering if maybe there should be a mechanism for the loaders to suggest suitable But the current
That's what I understand to be the main reason as well, though I've also heard that this only applies to 8-bit channels. Either way I prefer trying things instead of just assuming, as there are many outdated best-practices floating around. And it never hurts to get an idea/quantify the impact it has.
The GPU in the Quest supports two modes: binned and direct. For the fullscreen passes it uses the direct mode, so they aren't binned into tiles. Though the fragment shader is of course still processed in (at least) 2x2 blocks. Clearly the separate textures perform better, whether due to caching or a more predictable access pattern for pre-fetching or something else. It's possible that laying out the pixels in a different way in a single texture might yield similar performance, but at that point there wouldn't really be any benefit.
Indeed, it's a lot more straightforward to simply use RGBA32UI textures throughout. And while it's suboptimal for the
There are definitely some worries about the increasing complexity the more options and configurations we add. It does increase the cost of maintenance, makes it harder for users to know what they should use and when investigating user-reported issues there are more variables/unknowns, etc.... |
Did some more testing/profiling with a singular texture. Instead of using Both of these devices give highly varying times from run to run. At one point the Quest 3 even reported a time that was 50% of the baseline. Trying to figure out what is causing this behaviour, I believe that the (async) Readback ends up getting capture in the timer query or otherwise impacting it. Looking at the timeline there seems to be a suboptimal pattern arising for these two devices. Since they can't reach the target frame-rate they'll have pending sort operations. Once the worker is done sorting it immediately kicks of the pending sort, which starts up a GPU workload (Readback). This happens right before the rAF and carries over into it, to the point that it delays the render work for the frame. Ideally the pending sort should only kick-off after the rAF. Now it naively attempts two sorts in between rAF callbacks. I don't think there's anyway to cleanly estimate if there's time for it, so heuristically allowing only one seems like a good middle-ground. It would also give the rAF code the opportunity to replace the pending sort with a more up-to-date one. This pattern seems to converge pretty quickly, which seems to be (partially) caused by the 4ms polling rate of Either way, going to try to see if deferring the pending sort to after rAF helps, as well as lowering the polling interval. Hopefully that concentrates the GPU workload right after the rAF and not right before with a period of idle. At the very least I hope this makes the measurements more consistent/reliable, though I could also try smaller splats for testing. |
This PR changes the way spherical harmonics are stored in
PackedSplats.extra
and how they are evaluated. Instead of keeping the coefficients for the different orders separate, they are interleaved in a single buffer. Their representation is unified to besint8
.Main motivation was to address the remaining quality issue reported in #145 (see comparison: #142 (comment)). The current representations (
sint7
,sint8
andsint6
) were chosen such that they fit in a single texel. Simply upping this tosint8
would require multiple texture fetches per order and introduces a lot of padding. By interleaving them the padding is minimized, keeping the amount of texture fetches required low.RGB
(3 * 4 = 12)RGB
(2 * 3 * 4 = 24)RGBA
(3 * 4 * 4 = 48)With the above there's only a single
sh
texture. Since the exact layout of the buffer and the resulting texture depends on the number of degrees in the source file, themaxSh
property of theSplatMesh
only changes which orders are evaluated, not how they are read. This means multiple combinations exist, which are handled using#define/#if
directives in the GLSL code.There are a couple of upsides:
Additionally I've added a
maxSh
property toSplatEncoding
. This allows someone to skip loading and encoding the sh coefficients for orders they don't intend on using. While not fully utilized in all loaders, this can further improve loading/parsing speeds as well as save memory usage.I haven't done extensive performance testing. In particular the RGB texture format might perform poorly on certain hardware, but it does allow for the least amount of padding. Not sure if in those cases having 3x
RG
formaxSh = 2
would perform better or worse. But before diving into that, let's first establish if this approach is the way to go.