Bob Miller's findings from Dax tests and shared memory

Findings on DAX performance

Test Setup

The test environment runs four implementations of each test: VTK, DAX (initial), DAX (modified), and a hand-tuned CUDA version.

Each implementation is given an identical set of input elements, which each test must copy into its respective API's data structures. The results from each implementation must be stored in an output array for later verification.

To prevent CPU caching from skewing the results, the tests are run in the following order: Hand-tuned, DAX (modified), DAX (initial), VTK.

After all tests have been run, the results are compared against each other for verification.

Each implementation is timed separately. For the CUDA tests, the time spent in the kernel is measured in addition to the total run time.
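
For reference, kernel-only times of this kind can be gathered with CUDA events. The following is a minimal sketch of that style of measurement, not the actual test harness; the kernel name and launch parameters are placeholders.

// Sketch of kernel-only timing with CUDA events. "testKernel" and its launch
// configuration are placeholders for whichever test kernel is being measured.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
testKernel<<<numBlocks, threadsPerBlock>>>(d_input, d_output, numElements);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float kernelMilliseconds = 0.0f;
cudaEventElapsedTime(&kernelMilliseconds, start, stop); // elapsed time in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);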

So far, I have implemented two tests:

Pass-through test: Each input element is simply copied to an output array. For VTK, this is accomplished by copying a vtkFloatArray into a vtkDoubleArray.

Gradient test: A cube of equally spaced points with side length floor((#Elements)^(1/3)), each point carrying an associated scalar value, is passed as input. The gradient vector at each point is calculated and passed back as output.
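
As an illustration of how such an input can be constructed (a hedged sketch with made-up names and an arbitrary scalar function, not the actual test code):

#include <cmath>
#include <vector>

// Builds a dim x dim x dim grid of point scalars, dim = floor(numElements^(1/3)).
std::vector<float> MakeGradientTestInput(long numElements)
{
  long dim = static_cast<long>(std::floor(std::pow(static_cast<double>(numElements), 1.0/3.0)));
  std::vector<float> values(dim * dim * dim);
  for (long z = 0; z < dim; z++)
    for (long y = 0; y < dim; y++)
      for (long x = 0; x < dim; x++)
        {
        // Any smooth function of position will do for verification purposes.
        values[(z * dim + y) * dim + x] = static_cast<float>(x + y + z);
        }
  return values;
}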

Timing Results

These results are from tests where 10,000,000 and 100,000,000 elements were processed. The tests were run on a 2.66 GHz Intel Xeon CPU with 16 GB of RAM and a Tesla C2070 (compute capability 2.0), running RedHat Linux 6.0.

Pass-through test:

10,000,000 elements:

  • Time for VTK version: 562 ms.
  • Time for unmodified DAX version: 417 ms.
  • Kernel time for unmodified DAX version: 4 ms.
  • Time for modified DAX version: 420 ms.
  • Kernel time for modified DAX version: 4 ms.
  • Time for hand-tuned version: 220 ms.
  • Kernel time for hand-tuned version: 5 ms.

100,000,000 elements:

  • Time for VTK version: 5539 ms.
  • Time for unmodified DAX version: 4152 ms.
  • Kernel time for unmodified DAX version: 49 ms.
  • Time for modified DAX version: 4153 ms.
  • Kernel time for modified DAX version: 49 ms.
  • Time for hand-tuned version: 2134 ms.
  • Kernel time for hand-tuned version: 56 ms.

Analysis:

No modifications to DAX were deemed necessary for the pass-through test, so the unmodified and modified DAX results are essentially identical. It is not clear why the kernel time for the hand-tuned version is slightly slower than that of the DAX version.

As expected, the kernel time for a CUDA pass-through test is negligible; the tests are dominated by memory transfer time. The hand-tuned version runs in roughly half the time of DAX because DAX requires an additional memory transfer: the user first copies memory into the DAX data objects, and the DAX data object then copies the data to the device, whereas the hand-tuned version copies the memory directly from the input to the device. Since this would seem to be a common operation, perhaps DAX data objects could be altered to allow direct initialization of device memory upon construction.

Dax arrays certainly should be able to shallow copy array pointers provided by the client application. In fact, I thought that was already possible. Someone should check. --Kenneth Moreland 18:17, 13 July 2011 (EDT)
It's possible I'm missing something, but I don't see any implemented way to accomplish this. I suspect it's simply something that Utkarsh hasn't had time for so far. It's not especially important since I also have the numbers for the kernel times without the memory transfers. If it does become important for testing, it would be simple for me to implement. --Robert Miller 11:11, 14 July 2011 (EDT)
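
To make the two transfer paths concrete, here is a hedged sketch (the DAX-side staging is shown with hypothetical names, not the real API) contrasting the single copy used by the hand-tuned version with the staged copy currently forced by the DAX data objects:

// Hand-tuned path: one host-to-device transfer straight from the caller's buffer.
float* d_input = NULL;
cudaMalloc(&d_input, numElements * sizeof(float));
cudaMemcpy(d_input, hostInput, numElements * sizeof(float), cudaMemcpyHostToDevice);

// DAX-style path as it stands (hypothetical names): the caller first copies its
// data into a host-side array object, and that object later performs its own
// host-to-device copy, so the data crosses host memory twice.
float* hostStaging = new float[numElements];                  // inside the DAX data object
memcpy(hostStaging, hostInput, numElements * sizeof(float));  // extra host-side copy
cudaMemcpy(d_input, hostStaging, numElements * sizeof(float), cudaMemcpyHostToDevice);
delete [] hostStaging;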

Gradient test:

10,000,000 elements:

  • Time for VTK version: 10592 ms.
  • Time for unmodified DAX version: 436 ms.
  • Kernel time for unmodified DAX version: 50 ms.
  • Time for modified DAX version: 375 ms.
  • Kernel time for modified DAX version: 27 ms.
  • Time for hand-tuned version: 216 ms.
  • Kernel time for hand-tuned version: 5 ms.

100,000,000 elements:

  • Time for VTK version: 106645 ms.
  • Time for unmodified DAX version: 4309 ms.
  • Kernel time for unmodified DAX version: 534 ms.
  • Time for modified DAX version: 3712 ms.
  • Kernel time for modified DAX version: 307 ms.
  • Time for hand-tuned version: 2158 ms.
  • Kernel time for hand-tuned version: 66 ms.

Analysis:

As with the pass-through test, the time taken for memory transfer dominates the rest of the test, with approximately 90% of the test consisting of memory transfers to and from the device. This is likely because the gradient computation is relatively simple. With longer computational pipelines, I would expect the memory transfer overhead to become negligible in comparison to computation.

Even ignoring the memory transfer time, there is a clear disparity between DAX's performance and that of a hand-tuned kernel: DAX initially took approximately 10x longer than the hand-tuned version for the gradient computation. Through extensive use of the NVIDIA compute profiler, the slowdown was traced to three causes:

DAX Slowdown cause #1: unnecessary temporary and non-unrolled loop in DaxCell::Derivative()

The tests show a 230 ms (42%) improvement in kernel time between the unmodified and modified DAX versions on the 100,000,000-element gradient test. The only modifications made to DAX for this test took place in DaxCell::Derivative(). Two changes were made. First, the unnecessary temporary array used to store the partial sums of functionDerivs * values was removed. Second, the derivative function loop was partially unrolled. Testing indicates that unrolling the loop manually or using the "#pragma unroll" directive yields approximately the same result. The loop unroll accounts for 70% of the improvement, while the removal of the temporary variable accounts for the other 30%.

Although this was the only change I made to DAX that actually increased performance, it is also the least important. Many such micro-optimizations will be necessary before the DAX release, but there is no need to make them until much later in development. This change was primarily necessary so that I could identify the other current sources of slowdown in DAX relative to the hand-tuned version.


The relevant code changes are given below:

Unmodified DaxCell::Derivative()
DaxVector3 output;
...
DaxScalar derivs[3];
for (DaxId j=0; j < 3; j++) //loop over derivative directions
{
    DaxScalar sum = 0.0;
    for (DaxId i=0; i < 8; i++) //loop over interp. function derivatives
    {
        sum += functionDerivs[8*j + i] * values[i];
    }
    derivs[j] = sum;
}
output = make_DaxVector3(derivs[0]/spacing.x, derivs[1]/spacing.y, derivs[2]/spacing.z);

Modified DaxCell::Derivative()
DaxVector3 output = make_DaxVector3(0,0,0);
...
//An alternative to unrolling the loop manually would be the use of the #pragma unroll directive.
for (int i=0; i < 8; i++) //loop over interp. function derivatives
{
    output.x += functionDerivs[8 * 0 + i] * values[i];
    output.y += functionDerivs[8 * 1 + i] * values[i];
    output.z += functionDerivs[8 * 2 + i] * values[i];
}
output.x /= spacing.x;
output.y /= spacing.y;
output.z /= spacing.z;

DAX Slowdown cause #2: index recomputation in DaxStructuredConnectivity::GetConnectedElement()

78% of the difference between the hand-tuned version and the modified DAX version appears to come from the DaxStructuredConnectivity::GetConnectedElement() function. Specifically, the hand-tuned version only needs to compute the 'cell-based' index once, and then simply loads all 8 of the neighboring values. Because the GetConnectedElement function is static, there is no way to store this computation across calls. I briefly modified DAX to have this value passed in directly from DaxCell::Derivative(). This caused a 49% performance increase. Tests with the NVIDIA compute profiler seem to show that another 7% of the performance difference is caused by the recomputation of the neighbor offsets in GetConnectedElement().

Perhaps the cell-based index could be computed before the worklet is called and stored in one of the passed in objects (probably the work object). If the worklet has asked for a point-based field, there is a very high chance that the computed cell-based index will be used multiple times. --Kenneth Moreland 18:17, 13 July 2011 (EDT)
I'll hack this into the modified DAX version right now to test if it makes a significant difference. --Robert Miller 11:07, 14 July 2011 (EDT)
This hack was pretty severe. I originally couldn't figure out how to store it in the work object because that object has no knowledge of the structured connectivity nature of the array. The reason we have that knowledge in the Derivative function is because it knows it is operating on a voxel. I just circumvented the issue by adding a public version of GetConnectedElement to DaxArrayStructuredConnectivity. This provided a change from 2700 ms to 1900 ms (30%). It's not as significant as I would have thought based on earlier tests but it's still a marginal increase in performance. --Robert Miller 12:48, 14 July 2011 (EDT)
Additional note: This change somehow reduced the instruction count by 20%, and cut the L2 reads and writes almost in half. I realized just now that this change also bypasses a switch statement, which could allow NVCC to do extra optimizations. This may not be ideal since a real solution might not bypass the switch. --Robert Miller 13:30, 14 July 2011 (EDT)
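
To make the comparison in the analysis above concrete, here is a hedged sketch (hypothetical names and corner ordering, not the actual DAX or hand-tuned code) of the access pattern the hand-tuned kernel effectively uses: the cell-based index is decomposed once, and the 8 connected point values are then loaded from fixed offsets.

// Hypothetical sketch: pointDims are the point dimensions of the structured
// grid and cellId identifies the voxel being processed.
__device__ void LoadVoxelPointValues(const float* pointScalars,
                                     int3 pointDims, int cellId,
                                     float values[8])
{
  int3 cellDims = make_int3(pointDims.x - 1, pointDims.y - 1, pointDims.z - 1);

  // Decompose the cell id into i/j/k exactly once.
  int k = cellId / (cellDims.x * cellDims.y);
  int j = (cellId / cellDims.x) % cellDims.y;
  int i = cellId % cellDims.x;

  // Index of the first point of the voxel in the point array.
  int base = (k * pointDims.y + j) * pointDims.x + i;

  // The 8 voxel corners are fixed offsets from the base index.
  int dy = pointDims.x;
  int dz = pointDims.x * pointDims.y;
  values[0] = pointScalars[base];
  values[1] = pointScalars[base + 1];
  values[2] = pointScalars[base + dy];
  values[3] = pointScalars[base + dy + 1];
  values[4] = pointScalars[base + dz];
  values[5] = pointScalars[base + dz + 1];
  values[6] = pointScalars[base + dz + dy];
  values[7] = pointScalars[base + dz + dy + 1];
}
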
DAX Slowdown cause #3: switch statements necessary for generic (structured/unstructured/etc) array types?

According to the NVIDIA compute profiler, there is still a 22% difference between the DAX and hand-tuned performance that is unaccounted for. I could not verify this to my satisfaction, but it appears that most of this difference may be due to the lookups of type information and corresponding switch statements and static function calls in DAX. This may be approximately the difference we can expect between DAX and optimal once the other performance bottlenecks have been worked out.

We might be able to get around this via (hidden) templates. In essence, the compiler will simply create separate versions for the array types. --Kenneth Moreland 18:19, 13 July 2011 (EDT)
I agree that this might be an interesting idea to test. Because of the optimizations performed by NVCC, it is sometimes difficult to identify exact sources of slowdowns, but I note that in almost all cases, the addition of the switch statement is the 'trigger' that prevents further optimization by the compiler. --Robert Miller 11:05, 14 July 2011 (EDT)
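
A minimal sketch of the hidden-template idea, with hypothetical names rather than the real DaxArray interface: the array kind becomes a template parameter of the kernel, so GetScalar is resolved statically and NVCC can inline it without a runtime switch.

// Hypothetical concrete array type; the real DAX arrays are more elaborate.
struct StructuredPointsArray
{
  const float* Data;
  __device__ float GetScalar(int index) const { return this->Data[index]; }
};

// The kernel is templated on the array type, so the compiler generates a
// specialized version per array kind instead of switching on a type tag.
template <class ArrayType>
__global__ void PassThroughKernel(ArrayType input, float* output, int numValues)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index < numValues)
    {
    output[index] = input.GetScalar(index); // statically resolved, no switch
    }
}

The host side would pick the instantiation matching the concrete array, which is presumably what "hidden" means here: the templates stay inside the DAX control layer rather than being exposed to the worklet author.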

Findings on DAX Shared Caching vs Lazy Evaluation

The initial version of DAX is, according to the NVIDIA compute profiler, approximately balanced between compute time and memory fetches. As a result, using shared-memory caching techniques did not significantly affect performance for this version. After adding the optimization outlined above in the gradient performance analysis (the loop unroll), the DAX kernel became memory-bound, so I expected shared-memory techniques to help. However, I was unable to achieve a measurable performance increase in practice.

What about the hand-tuned kernel? Was that memory-bound? If not, why not? The previous section suggests that Dax is doing more computation and should therefore be less memory-bound. --Kenneth Moreland 18:34, 13 July 2011 (EDT)
The hand-tuned kernel isn't memory-bound; it's compute-bound. I will provide some more of my findings from the profiler here:
  • DAX does indeed perform more computation: With 1,000,000 elements, my modified DAX version performs 2,509,941 instructions, while the hand-tuned version only takes 604,422, so DAX performs about 4x the computation. I believe this extra computation may have something to do with the nature of the switch statements/static function call design of DAX, because when I attempt to mimic that structure in the hand-tuned version, the number of instructions explodes.
  • DAX also seems to perform about 6 times the necessary global memory loads (43,294,832 vs 7,529,536), although the number of global memory stores is identical (both have 2,823,576 global stores). I have not yet been able to determine the source of these extra global memory reads.
  • L1 cache hit percentages are approximately equal between the two kernels, although as noted earlier DAX has more memory accesses for some reason.
  • Finally, and actually most distressing, I now note that the DAX version requires 22 times the number of L2 reads as the hand-tuned version (8,292,744 vs 375,452), and roughly twice the number of L2 writes (2,190,913 vs 1,058,841). I suspect that the extra L2 accesses are mostly due to register spilling, but I'm unsure as to how to reduce the number of registers used to confirm this.
--Robert Miller 11:03, 14 July 2011 (EDT)
I think the global memory loads are probably the most immediate concern. I would be surprised to learn that register spilling could bump enough out of cache to affect global memory reads. I'd really like to know where those are coming from.
OK I'll try to track down where those are coming from. I don't believe they're due to register spilling either, but I'm not sure what IS causing them. --Robert Miller 13:03, 14 July 2011 (EDT)
This was a pain to track down. They seem to be related to the class/static function nature of DAX, because you can remove them by bypassing static function calls (especially with switch statements) where the underlying call accesses a DaxArray (like GetScalar). On the bright side, I was able to eliminate most of the remaining difference between the runtime of DAX vs the runtime of my hand-tuned version. (On 1,000,000 elements, it is now 683us for DAX vs 560us for the hand-tuned version). The change I made was to bypass the switch statement inside GetScalar, which leaves the extra global reads but amazingly reduces the runtime. Most of the changes I've made to my modified version of DAX can be imported directly into the real branch, but I'm not certain whether my solution for precomputation of the cell indices is really the approach we want to take (maybe there are better locations to put it, I just store it in DaxCell). Also, I bypassed DAX's switch statements to achieve this performance :(. We'll have to discuss some of this in person when you get back. --Robert Miller 11:38, 15 July 2011 (EDT)
I'm not sure there is much we can do about register spilling. Dax is going to need extra data per thread to hide CUDA indexing. We might remove some by holding as few indices in memory as possible and computing them where necessary (the opposite of one of the other suggestions), in an "assume compute is free" attitude. What is the hit that is taken by going to L2 cache instead of registers? --Kenneth Moreland 11:51, 14 July 2011 (EDT)
I'm having trouble finding hard numbers, but the interwebs seem to say that it's almost as bad as going to global memory. It really seems odd to me that all of these would be caused by register spilling but that is what the compute profiler suggests is likely. I'll see if I can get a little more information on this too. --Robert Miller 13:03, 14 July 2011 (EDT)

Part of the reason I may have been unable to achieve performance improvements with shared memory could be my inexperience with CUDA. However, I strongly suspect that the primary reason I have been unable to improve performance in this manner is that the DAX L1 cache hit rate is already quite high (~75-90%). Since L1 cache and shared memory are actually the same memory space on the hardware, I suspect I was just moving the memory accesses from one area to the other.

Most of my experimentation with caching has assumed a 'staged' approach, where all of the calculations for one stage of the pipeline are computed at once and stored, and then the subsequent stages of the pipeline are considered.

At the last technical meeting, it was suggested that I could use shared memory to implement a series of arrays to prevent recalculation of the same values multiple times within a block. Initial experimentation with this has not been promising. Even for the relatively simple Input->PD2CD->Gradient pipeline, this requires 12K of shared memory (out of a maximum of 16K on compute capability 1.x), and performance does not appear to differ significantly. For longer pipelines, the use of this technique would force smaller thread counts per block (shared memory usage = (thread count per block) * (per-thread shared storage required by the kernel)). The problem with this is that DAX's performance appears to be best when using at least 256 threads per block and at least 128 concurrent blocks, so decreasing the number of threads per block to allow longer pipelines with this technique would likely reduce performance significantly.

However, as noted in the timing section above, a significant part of DAX's slowdown appears to come from recalculation of indices and the like. I have not been able to determine how to cache such results using this technique because the calculations occur underneath the worklet rather than within the executive; caching those results in some way may yield further performance improvements. Furthermore, I have not attempted a 'hybrid' technique, which would perform this type of caching for portions of the pipeline, then store the results in global GPU memory and exit the CUDA kernel, so that another CUDA kernel could use shared caching on the next portion of the pipeline. This would resolve the concerns about shared memory, and such a technique will likely be necessary at some point anyway to circumvent the CUDA requirement that all kernels exit within 5 seconds.
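
A minimal sketch of the staged approach under discussion (the kernel and stage bodies are hypothetical placeholders): each thread stashes its intermediate result for one pipeline stage in shared memory, the block synchronizes, and the next stage reads staged values instead of recomputing them. The shared-memory footprint grows with each additional staged quantity, which is what limits pipeline length.

__global__ void StagedPipelineKernel(const float* input, float* output, int numElements)
{
  // One staged intermediate per thread; with 256 threads per block this costs
  // 256 * sizeof(float) = 1K of shared memory for every staged quantity.
  __shared__ float stage[256];

  int index = blockIdx.x * blockDim.x + threadIdx.x;
  bool active = (index < numElements);

  // Stage 1 (e.g. PD2CD): compute the intermediate and stash it in shared memory.
  stage[threadIdx.x] = active ? 0.5f * input[index] : 0.0f;
  __syncthreads();

  // Stage 2 (e.g. Gradient): consume the staged value. In a real pipeline,
  // neighboring threads in the block would also read each other's staged
  // values here, which is where the reuse comes from.
  if (active)
    {
    output[index] = 2.0f * stage[threadIdx.x];
    }
}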

Other ideas and conclusions

Use texture memory for input data - Texture memory has its own cache and is optimized for 2D access, but is read-only. I suspect that it could be used efficiently for the input data for DAX.
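
A hedged sketch of this idea using the texture reference API (the names are illustrative): the read-only input is bound to a 1D texture so that loads go through the texture cache.

// Read-only input bound to a 1D texture so loads go through the texture cache.
texture<float, 1, cudaReadModeElementType> inputTex;

__global__ void PassThroughTexKernel(float* output, int numValues)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index < numValues)
    {
    output[index] = tex1Dfetch(inputTex, index);
    }
}

// Host-side setup, somewhere before the kernel launch:
// cudaBindTexture(0, inputTex, d_input, numValues * sizeof(float));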

Constant memory is problematic for use in DAX because DAX is expected to be operating on very large data, and constant memory only provides 64K for use. Perhaps it could be used to store invariant parameters for individual pipeline stages?
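
A small sketch of that suggestion (names hypothetical): per-stage invariants such as grid spacing and dimensions fit comfortably in constant memory even though the bulk data does not.

// Invariant per-stage parameters kept in constant memory (well under the 64K limit).
__constant__ float3 c_Spacing;
__constant__ int3   c_PointDims;

// Host-side setup, once per pipeline stage:
// cudaMemcpyToSymbol(c_Spacing, &spacing, sizeof(float3));
// cudaMemcpyToSymbol(c_PointDims, &pointDims, sizeof(int3));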

The NVIDIA compute profiler consistently complains about 'uncoalesced memory accesses'. I have been unable to devise kernels that do not trigger this warning in the profiler, even when following the guidelines in the CUDA Programming Guide. Furthermore, my reading on the topic suggests that this is much less of an issue on compute capability 2.0 and is expected to become even less important as the hardware continues to develop.

Final conclusions

Shared memory vs lazy evaluation doesn't seem to be a huge issue for the pipelines implemented so far. My recommendation is to set the amount of memory available for L1 cache to its maximum (48K) and let the cache handle it.

Switch statements seem to be really, really bad for performance. Templates or polymorphism may be the best approach to replace the functionality that we are currently achieving through the switch statements.

We should be conscious of when a calculation, such as an index computation, is likely to be needed many times. Note that removing the switch statements may alleviate this, because NVCC was often able to optimize away the recalculations once the switch statements were gone.

Acknowledgements

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.


SAND 2011-5772P