Dax Technical Meeting 2011-05-25
There was a gathering for a ParaView development meeting at Kitware on May 24. We took advantage of the gathering to plan an impromptu meeting around that time. Here is an informal list of agenda items.
- Worklet Data Model
- CUDA vs OpenCL
- Executive sub elements
- Connectivity analysis
- Data Access
- Communication/Execution Patterns
- UC Davis Student
CUDA vs OpenCL
We voted to move to CUDA.
Worklet Data Model
There was lots of spirited discussion on the worklet data model. One interesting point was how to implement marching cubes, which would have to be a two-pass algorithm. The first pass identifies the number of points and cells to be generated. In between the passes, a prefix sum over those counts determines the offsets at which the second pass writes its results. The second pass generates the actual geometry: an array of ids defining the topology (which is constrained to a single type of cell) and the field values. Because the worklet must be able to generate field values on the new cells for fields it knows nothing about, it must get collections of fields that it can work with in bulk. The prototypes for the two worklets work something like the following.
DAX_WORKLET void marchingCubesSizePass( DAX_IN daxWorkGenerateTopologySizePass *work, DAX_IN daxPointField *contour_field, DAX_OUT daxIdType *num_points, DAX_OUT daxIdType *num_cells)
Somewhere the type of cell must also be specified (which determines how many point ids there are per cell). It's unclear how to do that right now, although as a first step we will probably support only one cell type.
DAX_WORKLET void marchingCubesGeneratePass( DAX_IN daxWorkGenerateTopologyGeneratePass *work, DAX_IN daxFieldCollection *in_fields, DAX_OUT daxCellArray *out_cell_array, DAX_OUT daxFieldCollection *out_fields)
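The size pass / prefix sum / generate pass structure can be illustrated on the host with a minimal sketch. This is not Dax code and not marching cubes: it uses a hypothetical 1D stand-in in which each "cell" is the segment between two consecutive samples and emits one point where the field crosses the contour value. The names sizePass, generatePass, and contour1D are made up for this illustration, and the loops run serially where a GPU would run one thread per cell.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 1D stand-in for marching cubes (an assumption, not Dax API):
// cell i is the segment between samples i and i+1.

// Pass 1 (size pass): number of output points generated by one cell.
std::size_t sizePass(const std::vector<double>& field, std::size_t cell, double iso)
{
  const bool lowBelow = field[cell] < iso;
  const bool highBelow = field[cell + 1] < iso;
  return (lowBelow != highBelow) ? 1 : 0;
}

// Pass 2 (generate pass): write the crossing position at a precomputed offset.
// Only called for cells whose size pass returned 1, so a != b here.
void generatePass(const std::vector<double>& field, std::size_t cell, double iso,
                  std::size_t offset, std::vector<double>& outPositions)
{
  const double a = field[cell];
  const double b = field[cell + 1];
  const double t = (iso - a) / (b - a); // linear interpolation parameter
  outPositions[offset] = static_cast<double>(cell) + t;
}

std::vector<double> contour1D(const std::vector<double>& field, double iso)
{
  const std::size_t numCells = field.empty() ? 0 : field.size() - 1;

  // Size pass: independent per cell.
  std::vector<std::size_t> counts(numCells);
  for (std::size_t c = 0; c < numCells; ++c)
    counts[c] = sizePass(field, c, iso);

  // Exclusive prefix sum: offsets[c] is where cell c starts writing.
  std::vector<std::size_t> offsets(numCells);
  std::size_t total = 0;
  for (std::size_t c = 0; c < numCells; ++c)
  {
    offsets[c] = total;
    total += counts[c];
  }

  // Generate pass: each cell writes into its own slot, so no two cells collide.
  std::vector<double> outPositions(total);
  for (std::size_t c = 0; c < numCells; ++c)
    if (counts[c] > 0)
      generatePass(field, c, iso, offsets[c], outPositions);
  return outPositions;
}
```

For the samples {0, 2, 2, 0} and a contour value of 1, the size pass yields counts {1, 0, 1}, the prefix sum yields offsets {0, 1, 1}, and the generate pass writes two crossing positions without any cell needing to know what the others produced.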
None of this solves the problem of finding coincident points. Berk suggested we punt and say that Dax only supports some subset. Ken suggested that the implementation leave the links unfound but also provide a heavyweight operation that rebuilds them. The operation is similar to the collection operation in MapReduce, so perhaps we could borrow from that implementation.
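One way such a heavyweight rebuild could look is a sort-then-group pass, loosely analogous to the way MapReduce brings equal keys together before reducing them. The sketch below is an assumption about the approach, not something decided at the meeting: the names Point, MergeResult, and mergeCoincidentPoints are hypothetical, it runs serially on the host, and it treats points as coincident only when their coordinates compare exactly equal (a real implementation might key on edge ids or use a tolerance instead).

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <numeric>
#include <vector>

using Point = std::array<double, 3>;

struct MergeResult
{
  std::vector<Point> uniquePoints;
  std::vector<std::size_t> remap; // remap[oldId] = index into uniquePoints
};

// Hypothetical helper: merge points whose coordinates are exactly equal.
MergeResult mergeCoincidentPoints(const std::vector<Point>& points)
{
  // Grouping step: sort point ids by coordinate so duplicates become adjacent.
  std::vector<std::size_t> order(points.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(),
            [&points](std::size_t a, std::size_t b) { return points[a] < points[b]; });

  // Reduce step: each run of equal coordinates collapses to one output point.
  MergeResult result;
  result.remap.resize(points.size());
  for (std::size_t i = 0; i < order.size(); ++i)
  {
    const std::size_t oldId = order[i];
    if (i == 0 || points[oldId] != points[order[i - 1]])
      result.uniquePoints.push_back(points[oldId]);
    result.remap[oldId] = result.uniquePoints.size() - 1;
  }
  return result;
}
```

The remap array is what a topology array would be passed through afterwards: every old point id in a cell is replaced by remap[oldId], so cells that generated the same point independently end up sharing it.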
We identified a short list of useful algorithms.
- Marching cubes
- Critical point finding
- Vortex cores
The current flow is very lazily evaluated. The problem with that is that you end up with a lot of repeated calculations and, even worse, repeated memory fetches. The idea to get around that is to be less lazy: based on the work type, assume that all memory requested is used, and load those fields before the worklet ever starts. If you do it that way, you can organize a thread block to load fields into shared memory, and then shared points/cells need only be loaded once.
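The saving from eager loading can be made concrete with a small host-side simulation. This is only a sketch under stated assumptions: GlobalField, cellAveragesLazy, and cellAveragesEager are hypothetical names, a std::unordered_map stands in for a thread block's shared memory, and a counter stands in for global-memory traffic. Every cell is assumed to have at least one point.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Simulated global memory holding a point field; counts every read.
struct GlobalField
{
  explicit GlobalField(std::vector<double> v) : values(std::move(v)) {}
  double fetch(std::size_t id) const { ++fetchCount; return values[id]; }

  std::vector<double> values;
  mutable std::size_t fetchCount = 0;
};

using Cells = std::vector<std::vector<std::size_t>>; // point ids per cell

// Lazy: each cell fetches its own point values, so shared points are re-read.
std::vector<double> cellAveragesLazy(const GlobalField& field, const Cells& cells)
{
  std::vector<double> out;
  out.reserve(cells.size());
  for (const auto& cell : cells)
  {
    double sum = 0.0;
    for (std::size_t id : cell)
      sum += field.fetch(id);
    out.push_back(sum / static_cast<double>(cell.size()));
  }
  return out;
}

// Eager: the "block" first loads every point it will need exactly once into a
// local cache (stand-in for shared memory), then the cells read from the cache.
std::vector<double> cellAveragesEager(const GlobalField& field, const Cells& cells)
{
  std::unordered_map<std::size_t, double> shared;
  for (const auto& cell : cells)
    for (std::size_t id : cell)
      if (shared.find(id) == shared.end())
        shared.emplace(id, field.fetch(id));

  std::vector<double> out;
  out.reserve(cells.size());
  for (const auto& cell : cells)
  {
    double sum = 0.0;
    for (std::size_t id : cell)
      sum += shared.at(id);
    out.push_back(sum / static_cast<double>(cell.size()));
  }
  return out;
}
```

For two quads that share an edge, the lazy version performs eight simulated global fetches while the eager version performs six, one per distinct point, and both produce the same cell averages.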
Division of Work
- Different algorithms
- Data model within the device and host (hidden from interface)
- Control environment (API to connect modules together)
- Executive (creating CUDA kernels, etc.)
- Merge points on GPU with MapReduce
- Prototyping the shared caching of thread data for group data locality
- Handling unstructured data
- Worklets mapping fields is easy
- Worklets operating on cells is naively easy, but how do you share data for shared points?
- How do you implement worklets operating on points that look at neighboring cells (or neighborhoods in general)?