GPU Streaming and Instancing

GPU Streaming

Currently, the largest limitation of the plugin is its reliance on storing all of the data in the memory of the graphics card - the VRAM. To demonstrate why this is such a limiting factor, let’s run through the process.

Since the vast majority of people prefer to visualise their clouds as Sprites, this is the scenario I’ll concentrate on.

To render a single sprite we need 4 vertices, each requiring a 3D location (3 floats - 12 bytes) and a color (8-bit format - 4 bytes). Together, this gives a memory requirement of 4 x (12 + 4) = 64 bytes per point. In other words, rendering 100,000,000 points would take approximately 6 GB of VRAM just to store the cloud, on top of everything else required for the scene.
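
To put that arithmetic into code form, here’s a minimal sketch (the struct layout and names are purely illustrative, not the plugin’s actual vertex format):

```cpp
#include <cstdint>
#include <cstdio>

// Rough sketch of the per-point maths above; the layout is illustrative,
// not the plugin's actual vertex format.
struct SpriteVertex
{
    float   Position[3]; // 3D location: 3 floats = 12 bytes
    uint8_t Color[4];    // 8-bit RGBA   = 4 bytes
};                       // 16 bytes per vertex, 4 vertices per sprite

int main()
{
    const uint64_t NumPoints     = 100000000ULL;             // 100,000,000 points
    const uint64_t BytesPerPoint = 4 * sizeof(SpriteVertex); // 64 bytes per point
    printf("%.1f GB of VRAM\n", NumPoints * BytesPerPoint / 1e9); // ~6.4 GB
    return 0;
}
```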

And while most people looking into Point Cloud visualisation in a serious manner will probably own top-level hardware (the likes of Titans or Quadros), that will only help so much. What happens when you need a cloud of 1,000,000,000 points or more? I’ve already had companies asking to visualise close to 2 billion points using a single cloud file - to put that into perspective, you would need around 120 GB of VRAM for that - staggering!

The above didn’t even touch on another, rather important matter - the Loading Times. Prefetching that amount of data may cause long delays and stuttering - both of which degrade the end-user’s experience. With a streaming approach, only the relevant portion of the cloud is transmitted at any given time, and there is no need to pre-load large data sets.
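
To illustrate the idea, here’s a hypothetical sketch of a per-frame streaming budget - the Chunk type and the UploadToGPU call are placeholders of my own, not the plugin’s actual API:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of a per-frame streaming budget - only a bounded amount
// of point data is sent to the GPU each frame. Chunk and UploadToGPU are
// placeholders, not the plugin's actual types.
struct Chunk
{
    std::vector<uint8_t> Data;   // packed point data for one region of the cloud
    bool bResident = false;      // already uploaded to VRAM?
};

void StreamVisibleChunks(std::vector<Chunk*>& VisibleChunks, uint64_t BudgetBytesPerFrame)
{
    uint64_t Uploaded = 0;
    for (Chunk* C : VisibleChunks)
    {
        if (C->bResident)
            continue;                                        // already on the GPU

        if (Uploaded + C->Data.size() > BudgetBytesPerFrame)
            break;                                           // defer the rest to the next frame

        // UploadToGPU(C->Data);                             // placeholder for the real buffer update
        C->bResident = true;
        Uploaded += C->Data.size();
    }
}
```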

Finally, many of you have asked about the ability to efficiently update the data at runtime - be it for live data feeds from devices like Kinect and LiDAR, or to execute some form of CPU-driven animation. Streaming makes this process significantly easier, more maintainable and more expandable.

Instancing

As some of you might have noticed in the example above, a sprite uses 4 vertices to display a single data point of the cloud - this means we are effectively duplicating the same per-point data four times.

Here’s where Hardware Instancing comes in. In essence, it allows us to generate only a single sprite, then instruct the graphics card to render it any number of times, at given locations. As a result, we no longer have to waste precious memory and bandwidth on unnecessary, duplicated data.
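
To show where the savings come from, here’s a rough comparison using an illustrative per-instance layout (again, these are my own names, not the plugin’s actual structures):

```cpp
#include <cstdint>
#include <cstdio>

// Rough, illustrative comparison - the quad geometry is stored once, while each
// point only contributes its per-instance data.
struct InstanceData
{
    float   Position[3]; // 12 bytes
    uint8_t Color[4];    // 4 bytes
};                       // 16 bytes per point instead of 64

int main()
{
    const uint64_t NumPoints = 100000000ULL; // 100,000,000 points

    const double NonInstanced = NumPoints * 64.0 / 1e9;                 // 4 expanded vertices per point
    const double Instanced    = NumPoints * sizeof(InstanceData) / 1e9; // shared quad + 16 bytes per point

    printf("Non-instanced: ~%.1f GB, instanced: ~%.1f GB\n", NonInstanced, Instanced); // ~6.4 GB vs ~1.6 GB
    return 0;
}
```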

This, in turn, will allow for much smoother operation and more point updates between individual frames.

So, what’s the problem?

TL;DR

The benchmarks show a drop in performance of up to 50% compared to the current approach. I have a few potential ideas, but nothing really concrete yet.

Full Version

Initially, I thought (well, hoped, really) that my implementation was simply incorrect and required some minor corrections. However, after many hours of tedious comparisons with Epic’s existing code and some consultations, I started entertaining the idea that the true problem lay somewhere else.

I’ve quickly ruled out memory bandwidth as the bottleneck, since the problem persisted even when there were no data updates.

Polycount was next on the suspect list, so I ran the same test using different polygonal representations - from a simple triangle to an octagon (6 triangles). Unfortunately, the variance between the two extremes was only ~10%, meaning I had to look elsewhere.

The first signs of success started showing after I began experimenting with batching instances (rendering more than 1 sprite per instance). It seemed the system didn’t quite enjoy rendering hundreds of thousands of copies of the same object, which made sense, as there aren’t really many scenarios where you would need that anyway. Strangely though, the results weren’t very consistent.
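
To clarify what I mean by batching, here’s a hypothetical sketch rather than the plugin’s actual code - each instance covers several quads, and the point index is reconstructed from the instance and vertex IDs in the shader:

```cpp
#include <cstdint>

// Hypothetical sketch of the batching idea - not the plugin's actual code.
// Instead of one instance per point, each instance covers BatchSize quads and
// the vertex shader derives the point index from the instance and vertex IDs.
struct DrawParams
{
    uint32_t VerticesPerInstance;
    uint32_t InstanceCount;
};

DrawParams ComputeBatchedDraw(uint64_t NumPoints, uint32_t BatchSize)
{
    DrawParams Params;
    Params.VerticesPerInstance = BatchSize * 6;                                       // 2 triangles (6 vertices) per quad
    Params.InstanceCount       = (uint32_t)((NumPoints + BatchSize - 1) / BatchSize); // ceil(NumPoints / BatchSize)
    return Params;
}

// Inside the vertex shader (pseudocode):
//   PointIndex = InstanceID * BatchSize + VertexID / 6;
//   then fetch that point's location and color from the instance data buffer.
```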

Finally, during one of the tests, I accidentally forgot to re-enable the color stream, which resulted in a significant performance increase. Following this discovery, I soon found what seems to be the culprit - iterating over the instance data buffer (without even using the data itself) appears to be responsible for nearly the entire performance loss.