A few days ago, results from Oxide’s Ashes of the Singularity benchmark revealed that NVIDIA's cards were underperforming in DirectX 12 mode. After digging around, the development team discovered that GeForce 9xx GPUs don't support a DX 12 feature called Async Compute, despite claims to the contrary.
After initially trying to sweep this information under the rug, NVIDIA is now working with Oxide to add full Async Compute support to its Maxwell graphics cards through driver updates.
"We actually just chatted with Nvidia about Async Compute, indeed the driver hasn’t fully implemented it yet, but it appeared like it was" said Oxide’s developer, Kollock. "We are working closely with them as they fully implement Async Compute. We’ll keep everyone posted as we learn more."
In a discussion on Overclock.net, Mahigan provided the following technical outline of the solution:
"The Asynchronous Warp Schedulers are in the hardware. Each SMM (which is a shader engine in GCN terms) holds four AWSs. Unlike GCN, the scheduling aspect is handled in software for Maxwell 2. In the driver there’s a Grid Management Queue which holds pending tasks and assigns the pending tasks to another piece of software which is the work distributor. The work distributor then assigns the tasks to available Asynchronous Warp Schedulers. It’s quite a few different "parts" working together. A software and a hardware component if you will.
With GCN the developer sends work to a particular queue (Graphic/Compute/Copy) and the driver just sends it to the Asynchronous Compute Engine (for Async compute) or Graphic Command Processor (Graphic tasks but can also handle compute), DMA Engines (Copy). The queues, for pending Async work, are held within the ACEs (8 deep each)… and ACEs handle assigning Async tasks to available compute units.
Simplified…
Maxwell 2: Queues in Software, work distributor in software (context switching), Asynchronous Warps in hardware, DMA Engines in hardware, CUDA cores in hardware.
GCN: Queues/Work distributor/Asynchronous Compute engines (ACEs/Graphic Command Processor) in hardware, Copy (DMA Engines) in hardware, CUs in hardware."
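To make the difference easier to picture, here is a toy Python model of the two dispatch paths Mahigan describes. It is only a sketch under loud assumptions: the names `Scheduler`, `software_dispatch`, `hardware_enqueue` and `ACE_QUEUE_DEPTH` are invented for illustration, the round-robin placement is a guess, and nothing here reflects actual driver or hardware code from NVIDIA or AMD.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

ACE_QUEUE_DEPTH = 8  # "8 deep each", taken from the quote's wording


@dataclass
class Scheduler:
    """Hypothetical stand-in for one hardware warp scheduler (an AWS)."""
    name: str
    task: Optional[str] = None  # None means the scheduler is idle


def software_dispatch(pending, schedulers):
    """Maxwell 2 path per the quote: queue and work distributor in software.

    Returns the (scheduler, task) pairs assigned and the tasks still waiting.
    """
    queue = deque(pending)  # models the driver-side Grid Management Queue
    assigned = []
    for sched in schedulers:  # models the software work distributor
        if not queue:
            break
        if sched.task is None:
            sched.task = queue.popleft()
            assigned.append((sched.name, sched.task))
    return assigned, list(queue)


def hardware_enqueue(pending, num_aces):
    """GCN path per the quote: pending work is held inside the ACEs.

    Round-robin placement is an assumption made for this sketch.
    Returns the per-ACE queues and any tasks that did not fit.
    """
    aces = [deque() for _ in range(num_aces)]
    overflow = []
    for i, task in enumerate(pending):
        ace = aces[i % num_aces]
        if len(ace) < ACE_QUEUE_DEPTH:
            ace.append(task)
        else:
            overflow.append(task)
    return aces, overflow


if __name__ == "__main__":
    # One SMM with four schedulers, as described in the quote.
    smm = [Scheduler(f"AWS{i}") for i in range(4)]
    done, waiting = software_dispatch([f"t{i}" for i in range(6)], smm)
    print(done, waiting)
```

In this model the Maxwell 2 path does its queueing and assignment in ordinary driver-side code before anything reaches a scheduler, while the GCN path simply drops work into queues that the hardware drains on its own. The sketch says nothing about how much time either path costs.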
Of course, this software-based solution will always be slower than a native hardware implementation of the feature. The real question is: how large will the performance impact be?