GPU Streaming Part 2

The Bad

In the previous post, GPU Streaming and Instancing, I mentioned the poor performance of my initial solution for GPU Streaming: Hardware Instancing with Instance Streams. The tests I ran at the time suggested up to ~2x the performance cost of the approach currently used in v0.5, which builds one large object out of all the data (I call it Merged Mesh for short).

Unfortunately, upon closer inspection, I found my testing methodology to be faulty: it didn't correct for several other factors potentially affecting performance (mostly post-processing and some lighting). After accounting for these confounding variables, re-running the benchmarks and re-compiling the results, the average performance cost of using Instance Streams turns out to be ~3.3x that of the Merged Mesh.

The Good

Now for the good news. I have been experimenting with different ways to optimize performance, and one solution I found is to use Structured Buffers instead of Instance Streams to pass the instance data. This proved to be superior by ~50%; however, it's still a long way from the Merged Mesh approach.

Comparison of average performance cost to render 1,000,000 points using different techniques. In most situations, the number of visible points should stay within 1.3 - 1.7 million. Results obtained with an i7-5960X @ 4.4 GHz and an NVIDIA GeForce GTX 980 Ti.

In terms of VRAM requirements, both streaming options consume, on average, ~80 MB per 1,000,000 visible points, which is an improvement of several orders of magnitude over the Merged Mesh.
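
To put the ~80 MB figure in context, here is a rough back-of-the-envelope sketch. It assumes the cost scales linearly with the number of visible points and that "MB" means decimal megabytes; neither is something I have verified, and the function names are made up for illustration.

```python
# Average figure quoted above: ~80 MB per 1,000,000 visible points.
MB_PER_MILLION_POINTS = 80
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def estimate_streaming_vram_mb(visible_points):
    """Estimate VRAM use in MB, assuming linear scaling with point count."""
    return visible_points * MB_PER_MILLION_POINTS / 1_000_000


def average_bytes_per_point():
    """Average per-point footprint implied by the quoted figure."""
    return MB_PER_MILLION_POINTS * BYTES_PER_MB / 1_000_000


if __name__ == "__main__":
    for points in (1_300_000, 1_700_000):
        print(points, "points ->", estimate_streaming_vram_mb(points), "MB")
    print("bytes per point:", average_bytes_per_point())
```

Under these assumptions, the typical 1.3 - 1.7 million visible points would need roughly 104 - 136 MB, or about 80 bytes per point on average.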

There is also an accidental advantage of the Structured Buffer approach. Because it is bottlenecked by the number of instances and not polycount, it’s only marginally affected by the complexity of the sprite. In other words, you can get better quality sprites for essentially no extra cost.
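
To illustrate the idea behind the Structured Buffer approach, here is a minimal CPU-side sketch in Python of what the vertex stage conceptually does: every vertex of a sprite looks up its point's data by instance index and offsets itself from there. This is not the plugin's shader code. The record layout (`PointInstance`), the function names and the axis-aligned (not camera-facing) expansion are all simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass
class PointInstance:
    # Hypothetical per-point record; the real buffer layout is not shown in this post.
    x: float
    y: float
    z: float
    size: float


# Corner offsets of a simple quad sprite. A more detailed sprite would
# just have more entries here; the instance buffer stays the same.
QUAD_CORNERS = [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)]


def expand_vertex(buffer, instance_id, vertex_id, sprite=QUAD_CORNERS):
    """Emulate a vertex shader reading instance data by index."""
    inst = buffer[instance_id]  # lookup keyed by the instance index
    ox, oy = sprite[vertex_id]  # offset of this vertex within the sprite
    # Simplification: expand in the XY plane instead of facing the camera.
    return (inst.x + ox * inst.size, inst.y + oy * inst.size, inst.z)


def expand_all(buffer, sprite=QUAD_CORNERS):
    """Expand every instance into its sprite vertices."""
    return [
        expand_vertex(buffer, i, v, sprite)
        for i in range(len(buffer))
        for v in range(len(sprite))
    ]
```

Note that the instance buffer holds one record per point no matter how many vertices the sprite has, which is in line with the sprite complexity having little effect on the cost.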

Next Steps

For the time being, I will attempt to provide and support both implementations in the plugin. The idea is to let users optionally switch over to Streaming once they start reaching their VRAM ceiling, while still being able to enjoy the faster rendering for smaller clouds.

There are a few other approaches I am currently considering:

  • Using Geometry Shader instead of Hardware Instancing

  • Using Dynamic Textures in place of the Structured Buffer
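
For the Dynamic Texture option, the instance data would presumably be packed into a 2D texture and fetched by converting the instance index into texel coordinates. I have not implemented this yet, so the sketch below is only an assumption about how the indexing could work; the fixed width of 1024 and the function names are hypothetical.

```python
def texture_size_for(point_count, width=1024):
    """Return (width, height) of a texture with one texel per point."""
    # Round up so a partially filled last row still fits.
    height = max(1, (point_count + width - 1) // width)
    return width, height


def texel_for_instance(instance_id, width=1024):
    """Map a linear instance index to (column, row) texel coordinates."""
    return instance_id % width, instance_id // width


if __name__ == "__main__":
    print(texture_size_for(1_000_000))
    print(texel_for_instance(1025))
```

With these assumptions, 1,000,000 points at one texel each would fit in a 1024 x 977 texture. Records larger than a single texel would need several texels or several textures per point.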

Other than that, I will soon be working on a complete LOD system overhaul, which will (hopefully) result in better quality with fewer visible points required. This should help mitigate the increased cost of rendering.