Skip to content

Conversation

mrxz
Copy link
Collaborator

@mrxz mrxz commented Aug 1, 2025

This PR changes the way spherical harmonics are stored in PackedSplats.extra and how they are evaluated. Instead of keeping the coefficients for the different orders separate, they are interleaved in a single buffer. Their representation is unified to be sint8.

Main motivation was to address the remaining quality issue reported in #145 (see comparison: #142 (comment)). The current representations (sint7, sint8 and sint6) were chosen such that they fit in a single texel. Simply upping this to sint8 would require multiple texture fetches per order and introduces a lot of padding. By interleaving them the padding is minimized, keeping the amount of texture fetches required low.

SH Degrees Coefficients (cumulative) Texels / Format Padding
1 9 RGB (3 * 4 = 12) 3
2 24 (9 + 15) 2 x RGB (2 * 3 * 4 = 24) 0
3 45 (9 + 15 + 21) 3 x RGBA (3 * 4 * 4 = 48) 3

With the above there's only a single sh texture. Since the exact layout of the buffer and the resulting texture depends on the number of degrees in the source file, the maxSh property of the SplatMesh only changes which orders are evaluated, not how they are read. This means multiple combinations exist, which are handled using #define/#if directives in the GLSL code.

There are a couple of upsides:

Additionally I've added a maxSh property to SplatEncoding. This allows someone to skip loading and encoding the sh coefficients for orders they don't intend on using. While not fully utilized in all loaders, this can further improve loading/parsing speeds as well as save memory usage.

I haven't done extensive performance testing. In particular the RGB texture format might perform poorly on certain hardware, but it does allow for the least amount of padding. Not sure if in those cases having 3xRG for maxSh = 2 would perform better or worse. But before diving into that, let's first establish if this approach is the way to go.

@dmarcos
Copy link
Contributor

dmarcos commented Aug 1, 2025

Thanks. This is awesome. Happy to see that precision up!

@mrxz mrxz marked this pull request as draft August 4, 2025 14:50
@mrxz
Copy link
Collaborator Author

mrxz commented Aug 4, 2025

Did some profiling of the draw call that evaluates the spherical harmonics using EXT_disjoint_timer_query_webgl2. On desktop the difference was negligible, but the impact on an Android smartphone and Quest 3 was considerably larger (> 40%).

Changing the representation to separate textures sampled at the same location had a considerably smaller impact. So with 3 RGBA32UI textures holding the 45 coefficients, the impact was ~8% on the smartphone and ~10% on the Quest 3.

Interestingly enough the time it saves while loading/parsing splat files is more than it takes to split the coefficients per texture. Going to rework this PR to support both old and new mode (specified through splatEncoding) and try to retain the parsing speed gains.

@asundqui
Copy link
Contributor

asundqui commented Aug 4, 2025

@mrxz I like the general idea of increasing the precision of the SH coefficients to be uniformly 8 bits, and it seems from #145 and from some other comments we saw on our Discord on SPZ file format's bit reduction that we get some banding as a result. I hope that 8 bits would be enough, but I'm worried that even that wouldn't do it. I don't think we can afford float16 from a memory/performance perspective... but should we allocate more than 8 bits possibly? Of maybe we need some off-line tool for people to analyze their PLY and get recommendations for shNmin/max?

Not entirely a serious idea, but I wonder if a "random dithered rounding" could make the banding basically imperceptible, or if it would just look noisy :). So basically when rounding to sint8 (or lower bits) we add (rand() - 0.5) to the values and then round those.

My impression with RGB-style textures is that they are generally discouraged because some hardware isn't equipped to deal with data stride 3 efficiently. I wonder if that's what was going on with Quest 3 / Android? Another theory I have is that the pseudo-compute shader presumably executes over "tiles", and if those tile sizes happen to match the tile size of the texture lookups, then it's almost like we are guaranteed good texture cache efficiency. But if the texture size doesn't match then the texture access pattern gets spread over multiple tiles in a strange way. This was the mental model I had when I originally coded up the sh1-3 textures.

The most annoying thing is SH1 with 9 coefficients... 1 more than a power of 2! I can't think of a great way to pack things such that we have 8 bits per coefficient and don't waste too much memory (and bandwidth) :/ I think if we can be okay with maxSh=1 being very inefficient (likely much less common than maxSh>=2), and just used RGBA32UI for all textures instead of RG32UI for SH1? Then we can support up to SH3 with Sint8 coefficients like you described in your testing today.

If you can make the PR support both old and new modes then that would be great from a user's perspective - they can choose between slightly more memory efficient, or higher bit precision. Should we be worried about the code complexity from all these options?? I guess there are other places in Spark that are getting pretty complex :)

Anyway, great work, I support this line of inquiry, as long as there are options so that Android/Quest users won't be impacted with no recourse!

@mrxz
Copy link
Collaborator Author

mrxz commented Aug 8, 2025

I hope that 8 bits would be enough, but I'm worried that even that wouldn't do it. I don't think we can afford float16 from a memory/performance perspective... but should we allocate more than 8 bits possibly? Of maybe we need some off-line tool for people to analyze their PLY and get recommendations for shNmin/max?

I would assume 8 bits should suffice given how many of the formats seem to use that as well (SOGS, Playcanvas Compressed .ply, SPZ). We should try to make the most out of these 8 bits, as there are two critical points:

  • The loaders parse the coefficients as floating point numbers which are then encoded again. Mismatch in shNmin/max can cause loss of precision
  • When evaluating the spherical harmonics the end result is encoded into the RGB values. This can introduce clamping and loss of precision.

Instead of an offline tool, I've been wondering if maybe there should be a mechanism for the loaders to suggest suitable SplatEncoding values. Obviously there's a bit of a chicken-and-egg problem here as the loaders also perform the encoding. And it might be difficult to find a clean way to propagate that encoding to the SparkRenderer.

But the current SplatEncoding seems to provide a technical solution, but not a practical one for many users. An offline tool could help, but it'll only be a matter of time until people would want to run it online (as part of their app/experience).

My impression with RGB-style textures is that they are generally discouraged because some hardware isn't equipped to deal with data stride 3 efficiently. I wonder if that's what was going on with Quest 3 / Android?

That's what I understand to be the main reason as well, though I've also heard that this only applies to 8-bit channels. Either way I prefer trying things instead of just assuming, as there are many outdated best-practices floating around. And it never hurts to get an idea/quantify the impact it has.

Another theory I have is that the pseudo-compute shader presumably executes over "tiles", and if those tile sizes happen to match the tile size of the texture lookups, then it's almost like we are guaranteed good texture cache efficiency. But if the texture size doesn't match then the texture access pattern gets spread over multiple tiles in a strange way. This was the mental model I had when I originally coded up the sh1-3 textures.

The GPU in the Quest supports two modes: binned and direct. For the fullscreen passes it uses the direct mode, so they aren't binned into tiles. Though the fragment shader is of course still processed in (at least) 2x2 blocks. Clearly the separate textures perform better, whether due to caching or a more predictable access pattern for pre-fetching or something else.

It's possible that laying out the pixels in a different way in a single texture might yield similar performance, but at that point there wouldn't really be any benefit.

The most annoying thing is SH1 with 9 coefficients... 1 more than a power of 2! I can't think of a great way to pack things such that we have 8 bits per coefficient and don't waste too much memory (and bandwidth) :/ I think if we can be okay with maxSh=1 being very inefficient (likely much less common than maxSh>=2), and just used RGBA32UI for all textures instead of RG32UI for SH1? Then we can support up to SH3 with Sint8 coefficients like you described in your testing today.

Indeed, it's a lot more straightforward to simply use RGBA32UI textures throughout. And while it's suboptimal for the maxSh=1 case, it doesn't seem very common. And if it does become common/requested we can always consider adding a different/dedicated approach for it.

If you can make the PR support both old and new modes then that would be great from a user's perspective - they can choose between slightly more memory efficient, or higher bit precision. Should we be worried about the code complexity from all these options?? I guess there are other places in Spark that are getting pretty complex :)

There are definitely some worries about the increasing complexity the more options and configurations we add. It does increase the cost of maintenance, makes it harder for users to know what they should use and when investigating user-reported issues there are more variables/unknowns, etc....
So it's a trade-off, as long as we avoid a combinatorial explosion. In this case the spherical harmonics evaluation is pretty self-contained within SplatMesh so it should be okay IMO.

@mrxz
Copy link
Collaborator Author

mrxz commented Aug 11, 2025

Did some more testing/profiling with a singular texture. Instead of using splatIndex * 3 + i, computing the coordinate once and scaling the x component by 3. So using a texture of (width*3, height), and offsetting using texelFetchOffset. The performance of this approach was better than the 3 separate textures on PC and my smartphone. However, testing on an older smartphone and Quest 3 showed mixed results.

Both of these devices give highly varying times from run to run. At one point the Quest 3 even reported a time that was 50% of the baseline. Trying to figure out what is causing this behaviour, I believe that the (async) Readback ends up getting capture in the timer query or otherwise impacting it.

Looking at the timeline there seems to be a suboptimal pattern arising for these two devices. Since they can't reach the target frame-rate they'll have pending sort operations. Once the worker is done sorting it immediately kicks of the pending sort, which starts up a GPU workload (Readback). This happens right before the rAF and carries over into it, to the point that it delays the render work for the frame.

Ideally the pending sort should only kick-off after the rAF. Now it naively attempts two sorts in between rAF callbacks. I don't think there's anyway to cleanly estimate if there's time for it, so heuristically allowing only one seems like a good middle-ground. It would also give the rAF code the opportunity to replace the pending sort with a more up-to-date one.
(Additionally if we'd expect the second one to complete before the next rAF, there'd be no need to actually sort the first readback)

This pattern seems to converge pretty quickly, which seems to be (partially) caused by the 4ms polling rate of readRenderTargetPixelsAsync.

Either way, going to try to see if deferring the pending sort to after rAF helps, as well as lowering the polling interval. Hopefully that concentrates the GPU workload right after the rAF and not right before with a period of idle. At the very least I hope this makes the measurements more consistent/reliable, though I could also try smaller splats for testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants