Renderer cache
Qt 3D is built around two data structures:
Scene Graph - describes the content of a scene;
Frame Graph - describes how to render a Scene Graph.
Every time we render a frame, we have to do a lot of work to convert the abstract descriptions in the scene graph and frame graph into low-level draw calls and transmit them to the GPU. In short, the steps are as follows:
Traverse the frame graph and identify each rendering stage. Each stage specifies the render target (the screen or an FBO), which camera to use, which window to use, which parts of the scene graph to draw, and any specific GPU state to set (for example, disabling depth testing or depth writes, or enabling stencil testing).
For each rendering stage identified in the first step, filter the scene graph down to the entities we care about.
Select the shader for each entity and the current rendering stage. An entity can use different shaders in different stages: for example, a simple shader for an early Z-fill pass or for generating shadow maps, and a full lighting shader for the final on-screen result.
Gather the uniform variables (the values used to parameterize the shaders).
Bundle all of this information into RenderCommands.
Once all stages are complete, we submit the RenderCommands to a dedicated OpenGL submission thread, because OpenGL, owing to its long history, is very picky about threads.
The OpenGL submission thread iterates through each rendering stage and the commands it contains, and translates them from our intermediate description into raw OpenGL function calls.
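The middle of this pipeline (filtering, shader selection, gathering uniforms, and bundling) can be sketched in a few lines of C++. This is only an illustration under simplifying assumptions: Entity, RenderStage, RenderCommand, and buildCommands are hypothetical stand-ins rather than Qt 3D classes, filtering is reduced to a layer-name match, and the frame graph traversal is assumed to have already produced a flat list of stages.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are not Qt 3D classes.
struct Entity {
    std::string name;
    std::string layer;                           // used for filtering
    std::map<std::string, std::string> shaders;  // pass name -> shader name
    std::map<std::string, float> uniforms;       // per-entity shader parameters
};

struct RenderStage {
    std::string target;                     // "screen" or the name of an FBO
    std::string layer;                      // which part of the scene to draw
    std::string pass;                       // e.g. "zfill" or "forward"
    std::map<std::string, float> uniforms;  // per-stage values, e.g. from the camera
};

struct RenderCommand {
    std::string target;
    std::string entity;
    std::string shader;
    std::map<std::string, float> uniforms;
};

// For each stage: filter entities, select a shader, gather uniforms,
// and bundle the result into a command.
std::vector<RenderCommand> buildCommands(const std::vector<RenderStage>& stages,
                                         const std::vector<Entity>& scene) {
    std::vector<RenderCommand> commands;
    for (const RenderStage& stage : stages) {
        assert(!stage.pass.empty());  // every stage must name the pass it renders
        for (const Entity& e : scene) {
            if (e.layer != stage.layer) continue;      // filter entities
            auto shader = e.shaders.find(stage.pass);  // select shader for this pass
            if (shader == e.shaders.end()) continue;   // entity is not drawn in this pass
            std::map<std::string, float> uniforms = stage.uniforms;
            for (const auto& u : e.uniforms)
                uniforms[u.first] = u.second;          // entity values override stage values
            commands.push_back({stage.target, e.name, shader->second, uniforms});
        }
    }
    return commands;
}
```

Every call repeats all of this work from scratch, even when nothing in the scene has changed since the previous frame, which is exactly the cost discussed next.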
All of this makes Qt 3D very flexible, but at a cost in runtime performance. The usual way to improve performance significantly is to cache work so that it is not needlessly repeated, and in theory we can gain a lot by caching some of the intermediate results. In practice, however, there is a great deal to consider, such as how caching interacts with dynamic rendering modes, and a renderer cache is genuinely hard to implement.
Many things can affect the appearance of the rendered scene, and all of them need to be tracked. It is also important to work out the minimum set of tasks that must be redone when particular properties change between frames. We added some of this tracking during the Qt 5 series, but doing it completely requires a larger refactoring.
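One common way to approach the "minimum set of tasks" problem is dirty-flag tracking. The following is a minimal sketch, not Qt 3D's implementation: DirtyFlag, FrameWork, and RendererCache are hypothetical names, and the dependency chain (refiltering forces shader reselection, which forces uniforms to be gathered again) is an assumption made for the example.

```cpp
#include <cassert>

// Hypothetical change categories; a real renderer would track many more.
enum DirtyFlag : unsigned {
    EntitiesDirty  = 1u << 0,  // entities were added, removed, or re-layered
    MaterialsDirty = 1u << 1,  // a material or shader changed
    UniformsDirty  = 1u << 2,  // only parameter values changed
    AllDirty       = EntitiesDirty | MaterialsDirty | UniformsDirty
};

// Which pipeline steps have to be redone for the next frame.
struct FrameWork {
    bool refilterEntities;
    bool reselectShaders;
    bool regatherUniforms;
};

class RendererCache {
public:
    // Called whenever the application changes something in the scene.
    void markDirty(unsigned flags) {
        assert((flags & ~static_cast<unsigned>(AllDirty)) == 0);  // known flags only
        m_dirty |= flags;
    }

    // Decide the minimum set of steps to redo, then clear the flags.
    FrameWork planFrame() {
        FrameWork work;
        work.refilterEntities = (m_dirty & EntitiesDirty) != 0;
        work.reselectShaders  = work.refilterEntities || (m_dirty & MaterialsDirty) != 0;
        work.regatherUniforms = work.reselectShaders || (m_dirty & UniformsDirty) != 0;
        m_dirty = 0;
        return work;
    }

private:
    unsigned m_dirty = AllDirty;  // the first frame has to build everything
};
```

With this scheme, a frame in which nothing changed redoes no steps, and a frame in which only a parameter value changed regathers uniforms while reusing the cached entity filtering and shader selection.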
Before providing a detailed description of our work in this area, let me first discuss another issue:
Modern graphics APIs
So far, Qt Quick has been built almost entirely on top of OpenGL (or OpenGL ES), and the same is mostly true of Qt 3D. Although OpenGL has served graphics engineers well for a long time, it is a very old API with some inherent structural problems that cannot be solved without introducing a new API. In addition, OpenGL has been extended and retrofitted over the years in an attempt to keep up with how modern GPUs actually work and with the ever-increasing amount of data that artists demand. This has led to impressive improvements, but OpenGL remains limited, especially by its threading model and by the heuristics in driver implementations, where drivers try to guess what application developers intend.
Inside the driver, OpenGL works in much the same way as Qt 3D, as described in the previous section. When you make a series of OpenGL function calls, they are converted into commands and stored in a command buffer, which is submitted to the hardware for processing at some point in time (chosen by the driver as its best guess).
Once the hardware has processed the commands in the command buffer, we must issue the OpenGL function calls all over again for the next frame. The same process repeats frame after frame, which can be very wasteful.
Creating commands is an expensive operation inside the driver, and in OpenGL all of it is confined to a single thread, so throwing the command buffer away after each frame is wasteful. The GPU vendors who write the drivers have added all sorts of heuristics that try to predict what library and application developers actually intend, in order to cache as much data as possible and optimize operations. This makes drivers larger, more complex, and harder to maintain, and in some cases leads to significant performance differences between GPU vendors.
OpenGL's threading model is essentially single-threaded. Yes, multiple threads can be used through mechanisms such as shared contexts, but calls are still serialized inside the driver. Given that OpenGL is more than 20 years old, this is not surprising.
The deprecation of OpenGL is another issue. Apple has announced that it is deprecating OpenGL and will focus solely on Metal as its graphics API. At some point in the future, we may find that OpenGL disappears from macOS and iOS. Even before then, the OpenGL libraries on these platforms will not gain any new features (in fact, they have not been updated for many years).
What can we do about these issues? Over the past few years, modern graphics APIs have emerged to address these and other problems. Vulkan, Metal, and DirectX 12 are all popular APIs that give much more direct control over the GPU than OpenGL does.
That sounds great, but there is a trade-off. Much of the work that the OpenGL driver used to do now becomes the responsibility of library or application developers. That may sound scary at first, and to some extent it is. However, we can use our high-level knowledge of how the application works to extract more performance from the GPU. Alternatively, we can do the same work in less time, letting the CPU and GPU sleep or enter a power-saving mode, which improves battery life. This is a big win on both mobile devices and desktops.
An OpenGL driver discards its command buffers every frame, even though they are expensive to create. With Vulkan or a similar API, we as application developers know when it is safe to keep command buffers and resubmit them in the next frame.
You may wonder what the benefit is. Surely submitting the same command buffer just puts exactly the same content on the screen as in the previous frame? If so, what is the point?
This is a good question. Even if we repeatedly submit the same command buffer to the GPU, the resources it references can contain different data: not only vertex buffers and textures, but also the uniform buffer objects commonly used to store material properties and camera transformation matrices. If we track what has changed in the scene, we can determine whether the same commands can be resubmitted to the GPU, which saves a great deal of work.
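The idea that an unchanged command buffer can still produce a different frame can be shown with a small simulation. This is not Vulkan code: UniformBuffer, DrawCommand, CommandBuffer, and submit are hypothetical stand-ins, and the "GPU" here simply computes a screen-space x coordinate from a mesh position and a camera position.

```cpp
#include <cassert>
#include <vector>

// Stand-in for GPU memory holding uniform data such as the camera transform.
struct UniformBuffer {
    float cameraX = 0.0f;
};

// A recorded command stores which resources to use, not their contents.
struct DrawCommand {
    float meshX;                    // position baked into the recorded command
    const UniformBuffer* uniforms;  // resource referenced by the command
};

using CommandBuffer = std::vector<DrawCommand>;

// Stand-in for the GPU executing a submitted command buffer: the referenced
// uniform data is read at execution time, not at recording time.
std::vector<float> submit(const CommandBuffer& commands) {
    std::vector<float> screenX;
    for (const DrawCommand& cmd : commands) {
        assert(cmd.uniforms != nullptr);
        screenX.push_back(cmd.meshX - cmd.uniforms->cameraX);
    }
    return screenX;
}
```

Recording a buffer once, submitting it, changing cameraX in the referenced UniformBuffer, and submitting the same buffer again yields different screen positions without re-recording any commands.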
There is more icing on the cake! Vulkan has the concepts of primary and secondary command buffers. A primary command buffer is what we submit to the GPU, and it may contain calls to secondary command buffers. A common usage is to pre-record the drawing commands for certain entities into secondary command buffers.
When we want to draw the whole scene, our renderer can create a primary command buffer that calls the secondary command buffers of the visible entities. When visibility changes (for example, because the camera or some entities move), we only need to re-record the primary command buffer. That is great too.
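The split between primary and secondary command buffers can be sketched as follows. Again, this is a plain C++ simulation rather than Vulkan code: Renderer, PrimaryCommandBuffer, SecondaryCommandBuffer, and execute are hypothetical names, and commands are represented as strings for readability.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Pre-recorded draw commands for one entity (commands shown as strings).
using SecondaryCommandBuffer = std::vector<std::string>;

// The buffer we would submit: it only references secondaries.
struct PrimaryCommandBuffer {
    std::vector<const SecondaryCommandBuffer*> calls;
};

class Renderer {
public:
    // Record an entity's draw commands once, up front.
    void addEntity(const std::string& name) {
        m_secondaries[name] = {"bind pipeline for " + name, "draw " + name};
        ++m_secondaryRecordings;
    }

    // Re-record only the small primary buffer when visibility changes.
    PrimaryCommandBuffer recordPrimary(const std::vector<std::string>& visible) const {
        PrimaryCommandBuffer primary;
        for (const std::string& name : visible) {
            auto it = m_secondaries.find(name);
            assert(it != m_secondaries.end());  // entity must have been added
            primary.calls.push_back(&it->second);
        }
        return primary;
    }

    int secondaryRecordings() const { return m_secondaryRecordings; }

private:
    std::map<std::string, SecondaryCommandBuffer> m_secondaries;
    int m_secondaryRecordings = 0;
};

// Stand-in for submission: flatten the primary buffer into the commands run.
std::vector<std::string> execute(const PrimaryCommandBuffer& primary) {
    std::vector<std::string> executed;
    for (const SecondaryCommandBuffer* secondary : primary.calls)
        executed.insert(executed.end(), secondary->begin(), secondary->end());
    return executed;
}
```

Calling recordPrimary with a different list of visible entities produces a new primary buffer while the per-entity recordings are reused untouched, which is why secondaryRecordings() does not grow when only visibility changes.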
Even more icing on the cake! With Vulkan, we can also record command buffers on different threads! We are then responsible for submitting the command buffers to the GPU queues and for synchronizing work between the different GPU queues (graphics, compute, transfer, and so on) as well as between the GPU and the CPU.
As you can see, we can gain more control over the operations and hardware involved, but we must do more work. Overall, this is a huge opportunity for performance improvement.
Back to Qt 3D in Qt 6
Looking at the Qt 6 development timeline, we are actively researching both of these major directions. As the descriptions above show, both involve a lot of work on tracking what the user has changed in the scene graph and frame graph, and on working out what Qt 3D must do as a consequence. This includes how we ultimately cache command buffers and other intermediate state between frames to avoid unnecessary repeated work.
As you may already know, Qt Quick and Qt Quick 3D will be rebuilt on top of the QRhi layer, which supports Vulkan, Metal, DirectX 11, and OpenGL. We are still investigating whether QRhi can reasonably be extended to meet the functional and multi-threading needs of Qt 3D, or whether another way of integrating the graphics APIs is needed so that Qt 3D can still work well with the Qt Quick and Qt Widgets modules.
There is still a lot of work to be done in this area, but the preliminary results look very promising. We tested a scene containing approximately 1000 entities on a mid-range desktop platform and achieved a rendering speed of 600 frames per second (without tearing) when trying to maximize GPU utilization, or 1% CPU load when capped at 60 fps! And this uses only a single CPU core! We are currently validating some ideas for improving the multi-threaded architecture further and going beyond the limits of the Qt 5 series.
As a byproduct of this work, we have also developed the next iteration of the frame graph. It is a natural evolution of the current one that is easier to understand and easier for Qt 3D users to modify.
Summary
As you can see, we will be doing a lot of work behind the scenes to improve Qt 3D in the Qt 5.x cycle and beyond. We will also look for ways to improve the public API, but we do not expect significant changes there. Instead, we will clean up some less-than-ideal function and property names.
All of these improvements will also benefit Kuesa, which is built on Qt 3D, and any other 3D application developed with Qt 3D. These changes will help build a solid foundation, allowing us to add more exciting extensions in Qt 6.