From Daxtoolkit

This document describes the intent and design of ScheduleRemoveCell.

Control Changes

Schedule Remove Cell

The Schedule Remove Cell was designed to be a general-purpose solution framework for algorithms that reduce the topology but do not modify the cell type. It works by having the inherited classes implement the worklet that marks which cells need to be removed, while the parent class does the actual reduction of the topology and points. This behaviour is implemented using the CRTP (curiously recurring template pattern) to allow compile-time polymorphism. The derived classes only need to implement the following methods to be a valid child of ScheduleRemoveCell:

We might be able to generalize this a bit without significantly changing the worklet type implementation or adding new operations to the device adapter. Rather than have a worklet type that specifically removed cells, which can basically be applied to a threshold and that's about it, we could have a worklet type that generated new cells using only points existing in the input. From this we could implement both threshold and triangulation. With the extra bit to identify duplicates, we could also implement extract faces, extract edges, external faces, and feature edges. --Kenneth Moreland 17:09, 7 March 2012 (EST)
I like the idea and we should talk about it more today. If we move to this system will the user have to define two worklet functions, the first that classifies the cell, and the second that generates the new cell topology? --Robert Maynard 11:02, 8 March 2012 (EST)
Yes, except the first worklet function will identify how many cells the second one will output. As per the current implementation of unstructured grids, there can be only one type of cell output, and that cell type will be known by the input cell type. --Kenneth Moreland 14:31, 8 March 2012 (EST)

template<typename WorkType>
Parameters GenerateParameters(const GridType& grid);

template<typename WorkType>
void GenerateOutputFields();

Unstructured Grid

The control unstructured grid is currently implemented to handle only a single cell type, given as a template parameter. To act like ArrayHandle, the topology and coordinates for the unstructured grid are defined by the user by passing in ArrayHandles that hold the needed information. Since the topology of the unstructured grid is currently uniform, the topology list doesn't need to start each cell with the number of points in the cell. Instead, we have a convention that the cell type must provide a static int NUM_POINTS that describes the number of points in the cell.

Execution Changes


The improvements to the work and cell types for accessing point field information were initially done by passing C arrays. This was considered to break the convention of how Dax passes data. The solution was to create a generic tuple class that can store any type with an arbitrary size. The result was the creation of dax::Tuple, which is now the parent class for dax::Vector3, dax::Vector4, and dax::Id3.

Now when you query a cell work object for all the point field values, you are returned a dax::Tuple of the field type with a length equal to the number of points in the cell.

// compare the value at each point with the threshold min/max
dax::Tuple<FieldType, CellType::NUM_POINTS> fieldValues =
    work.GetFieldValues(inField);

DoThreshold<CellType::NUM_POINTS, FieldType> threshold(thresholdMin,
                                                       thresholdMax);


To get a generic StreamCompact, Dax needed to be able to identify whether a value is equal to the default-constructed value of a given type. The solution was to implement a predicate functor that returns true when a value is not equal to the default. This code has been placed in the Functional header.

namespace dax {

/// Predicate that takes a single argument \c x and returns
/// true if it is not the identity (default-constructed value) of type \p T.
template<typename T>
struct not_default_constructor
{
  DAX_EXEC_CONT_EXPORT bool operator()(const T &x) const
  {
    return x != T();
  }
};

} // namespace dax


To support the new ScheduleRemoveCell framework we needed to add a new work type that understands the concept of removing a cell. The work type that I designed is called WorkRemoveCell, and it tracks which cells and points need to be removed after the execution of the worklet. This ability is implemented by having the work type accept arrays that hold the point and cell status.

To remove a cell, the user calls SetRemoveCell(char dead) with the status of the cell. Currently a value of 0 means alive, while 1 means dead. Each point used by the cell is also marked dead if the cell is; otherwise nothing happens.

Device Adapter Changes

Stream Compact

Stream compaction is the parallel primitive that removes unwanted elements from an array. It supports both compacting with a stencil and compacting a single array.

This function is needed to determine what subset of the topology, coordinates, or fields is required when doing ScheduleRemoveCell. For example, to get the reduced form of a point field you call StreamCompact with the stencil set to the point mask that was generated in ScheduleRemoveCell.


// warning: this is pseudo code
int stencil[10] = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0};
int result[5];
StreamCompact(stencil, result);
// result is now 0, 2, 4, 6, 8

float values[5] = {1.0, 0.5, 0.25, 0.125, 0.0625};
char stencil2[5] = {1, 0, 1, 0, 1};
float result2[3];
StreamCompact(values, stencil2, result2);
// result2 is now 1.0, 0.25, 0.0625
In the first example, I think the array named value is better named stencil. That array acts like a stencil and the values are implicitly taken as the indices. --Kenneth Moreland 17:14, 7 March 2012 (EST)


Copy

Given an ArrayHandle, Copy will copy the values from one array into another. I debated whether this should be a method on ArrayHandle, but no matter how we expose the function it should go through the device adapter so we can use cudaMemcpy, etc., instead of a slow manual copy functor.


I presume the whole purpose of Copy is to preserve arrays when calling other functions like Sort and Unique that are implemented to modify data in-place in arrays. I expect copy to be a relatively inefficient operation: lots of memory access with very little computation. It could be more efficient to make functions like copy and unique write their results to a separate array. Is that feasible? --Kenneth Moreland 20:08, 10 March 2012 (EST)
You are correct the need for the copy operation is for usage before calling functions like Sort and Unique. I have looked at the efficiency, but I do know that it boils down to a memmove or memcopy depending on the device. The bigger design issue that I am having is what does it mean to copy an ArrayHandle. Does that mean I need to copy the control environment if the source has it defined, or should I only copy the execution contents. As far as adding options to Sort and Unique to not work inline it is doable for Unique easily enough, but Sort would require a manual copy under the covers. I wanted to implement Copy as a unique function as that offers the most re-use. --Robert Maynard 09:18, 12 March 2012 (EDT)
That's unfortunate about sort. I'm sure in-place vs. copy is an implementation detail, but appears that pretty much every implementation does in place. I've also checked the implementation for CUDPP and TBB; both do in place. Oh, well. I guess we're stuck with that unless we want to implement our own, and I really really don't want to do that. --Kenneth Moreland 12:08, 12 March 2012 (EDT)
As far as ArrayHandle is concerned, I would avoid spending too much time worrying about it. We really need to redesign that thing. For now, the current implementation allows you to have a nil control array, which will prevent it from being copied back. For now can we just do that? --Kenneth Moreland 12:08, 12 March 2012 (EDT)


Sort

Given an ArrayHandle, Sort will sort all the values contained inside it into ascending order.


float values[5] = {0.5, 1, 0.25, 125, 0.0625};

//values is now 0.0625, 0.25, 0.5, 1, 125


Unique

Unique removes duplicate consecutive elements from the range [first, last). All elements that compare equal to the element immediately preceding them are removed, so only the first element in each group of consecutive equal elements is kept. The ArrayHandle is re-sized to contain only the result of the operation.


int values[9] = {1,1,1,0,0,0,1,0,1};
//values is now 1,0,1,0,1


Lower Bounds

For each item in values, the output receives the index of the first element in the sorted input array that is greater than or equal to that item.


int input[5]={-1,0,2,3,4};
int values[2] = {1,3};
int output[2];
//output is  2,3

Schedule Map

After the discussion on better ways of doing the subset-schedule functionality, and given that I only needed the subset function to do a mapping schedule worklet, I have replaced all of that code with the following class (thanks, Ken).

namespace dax {
namespace exec {
namespace kernel {
namespace internal {

template<class Functor>
struct ScheduleMappingAdapter
{
  DAX_CONT_EXPORT ScheduleMappingAdapter(const Functor& functor,
                                         dax::internal::DataArray<dax::Id> lookup)
    : Function(functor), LookupTable(lookup) { }

  template<class Parameters, class ErrorHandler>
  DAX_EXEC_EXPORT void operator()(Parameters parameters, dax::Id index,
                                  ErrorHandler& errorHandler)
  {
    // send the index as the key, and the LookupTable[index] as value
  }

  Functor Function;
  dax::internal::DataArray<dax::Id> LookupTable;
};

} // namespace dax::exec::kernel::internal

namespace dax {
namespace cont {
namespace internal {

template<class Functor, class Parameters, class DeviceAdapter>
DAX_CONT_EXPORT void ScheduleMap(
    Functor functor,
    Parameters parameters,
    dax::cont::ArrayHandle<dax::Id, DeviceAdapter> values)
{
  // package up the ids to extract so we can do valid lookups
  const dax::Id size(values.GetNumberOfEntries());

  dax::exec::kernel::internal::ScheduleMappingAdapter<Functor> mapFunctor(
      /* ... */);
  // ...
}

} // namespace internal
} // namespace cont
} // namespace dax



Here are the benchmarks for release builds of the Dax Threshold algorithm on a uniform grid of size 256.

Device        Time (s)
DaxCpu        5.03747842105263
DaxCuda       2.028389
DaxOpenMP     1.206308
PistonCuda    0.168151
PistonOpenMP  0.13462935
VTK           2.79259

We can see that while Dax is faster than VTK, it hasn't reached the true potential of the GPU, as Piston is able to do the same threshold in roughly 1/12 of the time. We should also notice that the DaxOpenMP implementation is significantly faster than the CUDA implementation, meaning that the bottleneck is registers and/or non-coalesced memory access.

The next step is to time each independent component of the ScheduleRemoveCell algorithm and see where the actual time is being spent.

robert@Dejagore:bin$ ./ThresholdTimingCuda --size=256
Pipeline #1
Running pipeline 1: Elevation -> Threshold
Schedule Worklet time: 1.69
Start GenerateOuput
Stream Compact 1 time: 0.05
Stream Compact 2 time: 0.04
ExtractTopology: 0.19
Extract Coordinates: 0.01
UpdateHandles: 0.05
End GenerateOuput
Generate Output time: 0.34
Total RemoveCell time is: 2.03
original GetNumberOfCells: 16581375
threshold GetNumberOfCells: 511775
Elapsed time: 2.02866 seconds.

So we see that the two biggest bottlenecks are the Worklet and ExtractTopology calls, both of which access all the points of a given cell.

Performance Improvements Round 1

The previous benchmarks showed that querying point information was a bottleneck. The solution was to reduce the number of times we call into work and cell for a point index, by adding an API that lets the worklet get back all the field values for the points of a cell in a single call.

The results of this change produced the following time:

Device        Time (s)
DaxCpu        1.603977
DaxCuda       0.9084353
DaxOpenMP     1.0151116
PistonCuda    0.168151
PistonOpenMP  0.13462935
VTK           2.79259

robert@Dejagore:bin$ ./ThresholdTimingCuda --size=256
Pipeline #1
Running pipeline 1: Elevation -> Threshold
Schedule Worklet time: 0.59
Start GenerateOuput
Stream Compact 1 time: 0.04
Stream Compact 2 time: 0.04
ExtractTopology: 0.16
ExtractCoordinates: 0.02
UpdateHandles time: 0.04
End GenerateOuput
Generate Output time: 0.3
Total RemoveCell time is: 0.89

original GetNumberOfCells: 16581375
threshold GetNumberOfCells: 511775
Elapsed time: 0.900044 seconds.

The results are amazing: we more than halved the runtime of the algorithm merely by reducing the number of calls and temporary variables.

Performance Improvements Round 2

The second round of performance tweaks reduced the number of unneeded writes to shared memory. Instead of updating the status of each point in a cell regardless of whether the cell was alive or dead, we only update the points if the cell is dead.

Device        Time (s)
DaxCpu        1.4285795
DaxCuda       0.5270713
DaxOpenMP     0.80193205
PistonCuda    0.168151
PistonOpenMP  0.13462935
VTK           2.79259

robert@Dejagore:releaseBuild$ ./bin/ThresholdTimingCuda --size=256
Pipeline #1
Running pipeline 1: Elevation -> Threshold
Schedule Worklet time: 0.2
Start GenerateOutput
Stream Compact 1 time: 0.05
Stream Compact 2 time: 0.05
ExtractTopology: 0
ExtractTopology - Weld: 0.16
ExtractCoordinates: 0.02
UpdateHandles time: 0.05
End Generate Output
Generate Output time: 0.33
Total RemoveCell time is: 0.53

original GetNumberOfCells: 16581375
threshold GetNumberOfCells: 511775
Elapsed time: 0.527453 seconds.

Performance Improvements Round 3

Hypothesis: The worklet performance could still be bad when a high ratio of cells needs to be removed and we have to access the point field from numerous cores at the same time. The solution is to move the creation of the point mask into a separate algorithm.

Reality: By extracting the point mask work from WorkRemoveCell it was expected that performance would improve. The actual result is that performance decreased. The reason seems to be that the bottleneck in the worklet isn't actually the non-coalesced memory access but something else; I am thinking the actual issue is the number of registers being used.

The timing difference is that the worklet still takes 0.2 seconds to run, but now we add 0.08 seconds on top of that for the point mask creation. The sad part is that the sort/unique version that generates the mask actually scales worse than writing to global memory.

see git sha1: 29ba857de223a8eb4f2eb4fcb

The point mask generation remains a separate call because it is not clear that we will always need one. Based on this, I have made the generation of the point mask a separate call, but kept the original algorithm of writing to global memory and then doing a stream compaction instead of a sort/unique.

Performance Improvements Round 4

Benchmarking the threshold worklet has shown that the iteration over the point values to create the valid flag is not efficient. If we manually unroll the iteration we gain 0.04-0.05 seconds, shaving the worklet time down to 0.15-0.16 seconds.

Performance Improvements Round 5

The mask that we use for the point and cell status was being stored in a char array. From the CUDA Programming Guide:

To coalesce, the memory request for a half-warp must satisfy the following conditions:
The size of the words accessed by the threads must be 4, 8, or 16 bytes;
If this size is:
  4, all 16 words must lie in the same 64-byte segment,
  8, all 16 words must lie in the same 128-byte segment,
  16, the first 8 words must lie in the same 128-byte segment and the last 8 words in the following 128-byte segment.
If the half-warp does not meet these requirements, 16 separate 32-byte memory transactions are issued.

Also from later on:

The compiler must on occasion insert conversion instructions, introducing additional execution cycles. This is the case for functions operating on char or short whose operands generally need to be converted to an int.

Both of these points make char arrays a poor choice for the mask: on compute 1.1 cards they cause uncoalesced access. If we move to compute 2.0 and later cards in the future, we can revisit whether the conversion overhead causes a measurable performance decrease.

So back to Dax: switching to a dax::Id mask moves the time for the StreamCompact from 0.05s down to 0.02s, bringing us even closer to matching Piston's time for doing the cell generation and stream compaction.

The following results benchmark just the computation of the threshold status for each cell followed by a stream compaction on that result. We can see that DaxCuda has reached 93% of the efficiency of Piston for this operation.

Device      Time (s)
DaxCpu      0.79271935
DaxCuda     0.1595175
DaxOpenMP   0.25779675
PistonCuda  0.1493415
VTK         2.79259

More Performance Improvements

I am still curious how Piston is able to run its OpenMP implementation faster than its CUDA one. I think they are using a special container detector so that when compiling for OpenMP they don't have to create both a host and a device vector; they just create a device vector.

I concur that this could be the problem. I've previously noticed that changing the OpenMP device adapter to use ArrayContainerExecutionCPU instead of ArrayContainerExecutionThrust actually speeds things up even though you still have two copies. I think the issue is that when thrust copies, it tries to do a multithreaded copy and runs into memory affinity problems, so it runs even slower than a serial copy. I suspect that if we were smart about using only one array we would do even better. --Kenneth Moreland 18:17, 7 March 2012 (EST)