Stream Compaction Logo

Parallel Primitives Library:

  • a.k.a. chag::pp
  • Based on our paper on Efficient Stream Compaction.
  • Authors: Markus Billeter, Ola Olsson, Ulf Assarsson
  • Download: Source
  • License: MIT

Introduction

Our parallel primitivies library, a.k.a. chag::pp, is based on the paper Efficient Stream Compaction on Wide-SIMD Many-Core Architectures by myself, Ola Olsson and Ulf Assarsson. The paper can be downloaded here. The paper and its results were presented at the High Performance Graphics conference in 2009; presentation slides are available here.

The chag::pp library provides efficient implementations of

  • reduction
  • prefix operations (scan)
  • compaction
  • radix sort

Using chag::pp requires CUDA 2.3. We have tested the code on the following platforms:

  • CUDA 3.1, Linux 64bit, GCC 4.5.0, GTX480
  • CUDA 3.1, Linux 64bit, GCC 4.5.9, GTX260
  • CUDA 2.3, Linux 64bit, GCC 4.3.3, GTX280
  • CUDA 2.3, Linux 64bit, GCC 4.3.3, GTX260
  • CUDA 2.3, WinXP 64bit, Visual Studio 2005 SP1, GTX280

(Please update to the 20100811 version if you're using CUDA 3.1 on a 64-bit machine.)

News

Update @ 2011-09-05

  • WARNING Apparently, there's a bug/problem in sort() when compiled on Fermi-level cards. The problem is not yet resolved. :-(
  • Added inclusive prefix sum operation, chag::pp::prefix_inclusive(). It has the same parameters as the exclusive (old) prefix function.
  • Fixed bug where a custom operator in chag::pp::prefix() would cause incorrect results. (Thanks to C. Allen Waddle for reporting this!)
  • Workaround for a NVCC code generation bug in release 3.1, V0.2.1221. This file demonstrates the bug (and a workaround) in ~50LOC.

Update: 2009-10-14

  • chag::pp is now available under the MIT license

Download

The library consists of headers only, no need to build/link against any binaries.

Examples

For more complete documentation, see the section Documentation below.

The following code demonstrates the compact() method with a user-provided predicate function. Thrust is used for host-device communication; chag::pp only deals with device memory.

-- {File: compact.cu - Language: CUDA} --  1 #include <chag/pp/compact.cuh>         /* chag::pp::compact() */  2 namespace pp = chag::pp;  3  4 #include <thrust/device_malloc.h>      /* device_malloc() */  5 namespace t = thrust;  6  7 #include <cstdio>  8  9 /* Predicate function */ 10 struct Predicate 11 { 12     __device__ bool operator() (uint2 value) const 13     { 14         return value.x > value.y; 15     } 16 }; 17 18 int main( void ) 19 { 20     enum { ELEMENTS = 4*1024*1024 + 4321 }; 21 22     t::device_ptr<uint2> src = t::device_malloc<uint2>(ELEMENTS); 23     t::device_ptr<uint2> out = t::device_malloc<uint2>(ELEMENTS); 24 25     t::device_ptr<pp::SizeType> count = t::device_malloc<pp::SizeType>(1); 26 27     /* Generate data for 'src' */ 28 29     // Compaction 30     pp::compact(  31         src.get(),              /* Input start pointer */ 32         src.get()+ELEMENTS,     /* Input end pointer */ 33         out.get(),              /* Output start pointer */ 34         count.get(),            /* Storage for valid element count */ 35         Predicate()             /* Predicate */ 36     ); 37 38     // Output 39     printf( "Valid elements: %d\n"int(*count) ); 40 41     return 0; 42 }

CUDA-related code, e.g. memory-allocations, are omitted. The library only deals with CUDA device memory, i.e. all pointers are assumed to point to device memory.

TODO

BibTeX

Please use the following BibTeX-entry for citations:
@inproceedings{1572795,
	 author = {Billeter, Markus and Olsson, Ola and Assarsson, Ulf},
	 title = {Efficient stream compaction on wide SIMD many-core 
		architectures},
	 booktitle = {HPG '09: Proceedings of the Conference on High 
		Performance Graphics 2009},
	 year = {2009},
	 isbn = {978-1-60558-603-8},
	 pages = {159--166},
	 location = {New Orleans, Louisiana},
	 doi = {http://doi.acm.org/10.1145/1572769.1572795},
	 publisher = {ACM},
	 address = {New York, NY, USA},
}

Documentation

chag::pp provides three kinds of interfaces:

  • STL-like algorithms
  • Class-based API
  • In-kernel utility functions

The first two interfaces expose the same kind of functionality; the class-based API allows for more fine-tuning. The in-kernel utility functions are aimed at people who develop their own CUDA-kernels (they are also used to build the kernels involved in the first two interfaces).

TODO: create & upload doxygen documentation

STL-like interface

Operations:
  • Reduction (Header: <chag/pp/reduce.cuh>)

    -- {Language: CUDA} -- 1 template< typename T > 2 inline void reduce( const T* aStart, const T* aEnd, T* aOutput ); 3 template< typename T, class Op > 4 inline void reduce( const T* aStart, const T* aEnd, T* aOutput, const Op& aOperator );

    Parallel reduction with operator+ or user specified operator. Parameters:

    • aStart, aEnd: Input range [aStart, aEnd)
    • aOutput: Output, single element
    • aOperator: User specified operator. See below.
  • Prefix Sum (Header: <chag/pp/prefix.cuh>)

    -- {Language: CUDA} -- 1 template< typename T > 2 inline void prefix( const T* aStart, const T* aEnd, T* aOutput, T* aTotal = 0 ); 3 template< typename T, class Op > 4 inline void prefix( const T* aStart, const T* aEnd, T* aOutput, T* aTotal, const Op& aOperator );

    Exclusive prefix sum (scan). Parameters:

    • aStart, aEnd: Input range [aStart, aEnd)
    • aOutput: Output, (aEnd-aStart) elements
    • aTotal: Output total
    • aOperator: User specified operator. See below.
  • Compaction (Header: <chag/pp/compact.cuh>)

    -- {Language: CUDA} -- 1 template< typename T > 2 inline void compact( const T* aStart, const T* aEnd, T* aOutput,  3     SizeType* aNumValid = 0 ); 4 template< typename T, class Predicate > 5 inline void compact( const T* aStart, const T* aEnd, T* aOutput,  6     SizeType* aNumValid, const Predicate& aPredicate );

    Compaction (stream reduction). Parameters:

    • aStart, aEnd: Input range [aStart, aEnd)
    • aOutput: Output, at most (aEnd-aStart) elements
    • aTotal: Output number of elements in aOutput
    • aPredicate: User specified predicate. See below.
  • Split (Header: <chag/pp/split.cuh>)

    -- {Language: CUDA} -- 1 template< typename T > 2 inline void split( const T* aStart, const T* aEnd, T* aOutput, 3     SizeType* aNumValid = 0 ); 4 template< typename T, class Predicate > 5 inline void split( const T* aStart, const T* aEnd, T* aOutput, 6     SizeType* aNumValid, const Predicate& aPredicate );

    Split. Parameters:

    • aStart, aEnd: Input range [aStart, aEnd)
    • aOutput: Output, (aEnd-aStart) elements
    • aTotal: Output number of valid elements
    • aPredicate: User specified predicate. See below.
  • Radix Sort (Header: <chag/pp/sort.cuh>)

    -- {Language: CUDA} -- 1 template< typename T > 2 inline T* sort( const T* aStart, const T* aEnd, T* aPing, T* aPong ); 3   4 template< 5     unsigned Low, unsigned High, template <unsigned,typename> class Pred, typename T 6 > inline T* sort( const T* aStart, const T* aEnd, T* aPing, T* aPong );

    Radix sort. Parameters:

    • aStart, aEnd: Input range [aStart, aEnd)
    • aPing: output buffer A, (aEnd-aStart) elements
    • aPing: output buffer B, (aEnd-aStart) elements
    • return value: Buffer that holds output data, i.e. on of aPing or aPong.

    A split operation is performed for each value in the range [Low ... High), using the predicate Pred<Value,T>. The method returns either aPing or aPong, depending on the number of splits that were performed. The aPong may alias the aStart buffer.

Todo

  • Documentation
  • Upload examples and tests
  • Updated performance graphs
  • Tweaks. Several minor optimizations have not been implemented in the new code.