CuPBoP: CUDA for Parallelized and Broad-range Processors
Introduction
CuPBoP is a framework that supports executing unmodified CUDA source code on non-NVIDIA devices. Currently, CuPBoP supports several CPU backends, including x86, AArch64, and RISC-V. Support for Vortex (a RISC-V GPU) is a work in progress.
Install
Prerequisites
- Linux system
- LLVM 14.0.1
- CUDA Toolkit
Although CuPBoP does not require NVIDIA GPUs, it needs CUDA to compile the source programs to NVVM/LLVM IR. The CUDA Toolkit can be installed on machines without NVIDIA GPUs. To install the CUDA Toolkit, please refer to https://developer.nvidia.com/cuda-downloads.
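Before building, it can help to confirm that the tools used in the later steps are on the PATH. The following is a minimal sketch, not part of CuPBoP: the helper name have_tool and the variable CUDA_DIR are hypothetical, and the fallback CUDA location is simply the example path used in this README.

```shell
#!/bin/sh
# Hypothetical helper (not part of CuPBoP): report whether a command is on PATH.
have_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Tools invoked by the build and example commands in this README.
for tool in git cmake make clang++ llc llvm-config g++; do
    have_tool "$tool" || true
done

# Print the installed LLVM version so it can be compared with the one listed above.
if command -v llvm-config >/dev/null 2>&1; then
    echo "LLVM version: $(llvm-config --version)"
fi

# Assumption: fall back to the example CUDA location when CUDA_PATH is unset.
CUDA_DIR="${CUDA_PATH:-/usr/local/cuda-11.7}"
if [ -d "$CUDA_DIR" ]; then
    echo "CUDA toolkit directory: $CUDA_DIR"
else
    echo "CUDA toolkit directory not found: $CUDA_DIR"
fi
```

The script only reports what it finds and never aborts, so it is safe to run on a machine where some prerequisites are still missing.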
Installation
- Clone from GitHub
git clone --recursive https://github.com/drcut/CuPBoP
cd CuPBoP
export CuPBoP_PATH=`pwd`
export LD_LIBRARY_PATH=$CuPBoP_PATH/build/runtime:$CuPBoP_PATH/build/runtime/threadPool:$LD_LIBRARY_PATH
export CUDA_PATH=/usr/local/cuda-11.7 # set to your own location
- Build CuPBoP
mkdir build && cd build
# set -DDEBUG=ON for debugging
cmake .. \
-DLLVM_CONFIG_PATH=`which llvm-config` \
-DCUDA_PATH=$CUDA_PATH
make
- (Optional) Use CuPBoP to execute the Hetero-Mark benchmark for verification
make test
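After the build finishes, a quick check that the outputs used by the later commands exist can save some debugging. This is a sketch under assumptions: the helper missing_outputs is hypothetical, and the paths it checks are only those referenced elsewhere in this README (the two translators and the two runtime library directories).

```shell
#!/bin/sh
# Hypothetical helper (not part of CuPBoP): print each expected build output
# that is missing under the checkout given as $1.
missing_outputs() {
    root="$1"
    # Translator executables used in the vector addition example.
    for exe in build/compilation/kernelTranslator build/compilation/hostTranslator; do
        [ -x "$root/$exe" ] || echo "$exe"
    done
    # Runtime library directories placed on LD_LIBRARY_PATH during installation.
    for dir in build/runtime build/runtime/threadPool; do
        [ -d "$root/$dir" ] || echo "$dir"
    done
}

# Assumption: use the current directory when CuPBoP_PATH is not set.
missing=$(missing_outputs "${CuPBoP_PATH:-.}")
if [ -z "$missing" ]; then
    echo "all expected build outputs are present"
else
    echo "missing build outputs:"
    echo "$missing"
fi
```

An empty result only shows that the files exist. Running make test remains the way to verify correctness.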
Run Vector Addition example
In this section, we provide an example of how to use CuPBoP to execute a CUDA program.
cd examples/vecadd
# Compile CUDA source code (both host and kernel) to bitcode files
clang++ -std=c++11 vecadd.cu \
-I../.. --cuda-path=$CUDA_PATH \
--cuda-gpu-arch=sm_50 -L$CUDA_PATH/lib64 \
-lcudart_static -ldl -lrt -pthread -save-temps -v || true
# Apply compilation transformations on the kernel bitcode file
$CuPBoP_PATH/build/compilation/kernelTranslator \
vecadd-cuda-nvptx64-nvidia-cuda-sm_50.bc kernel.bc
# Apply compilation transformations on the host bitcode file
$CuPBoP_PATH/build/compilation/hostTranslator \
vecadd-host-x86_64-unknown-linux-gnu.bc host.bc
# Generate object files
llc --relocation-model=pic --filetype=obj kernel.bc
llc --relocation-model=pic --filetype=obj host.bc
# Link with runtime libraries and generate the executable file
g++ -o vecadd -fPIC -no-pie \
-I$CuPBoP_PATH/runtime/threadPool/include \
-L$CuPBoP_PATH/build/runtime \
-L$CuPBoP_PATH/build/runtime/threadPool \
host.o kernel.o \
-I../.. -lc -lx86Runtime -lthreadPool -lpthread
# Execute
./vecadd
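The translator steps above take the bitcode files that clang++ leaves behind with -save-temps. When adapting the commands to another source file, GPU architecture, or host, those file names change. The sketch below derives them by copying the naming pattern seen in the vecadd example. The helper names kernel_bc_name and host_bc_name are hypothetical, and the pattern is an assumption for other inputs, so check the files actually produced in your directory.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CuPBoP): derive the -save-temps bitcode
# names for a CUDA source file, following the pattern of the vecadd example.

kernel_bc_name() {
    # $1: CUDA source file (e.g. vecadd.cu), $2: GPU arch (e.g. sm_50)
    echo "$(basename "$1" .cu)-cuda-nvptx64-nvidia-cuda-$2.bc"
}

host_bc_name() {
    # $1: CUDA source file, $2: host target triple (e.g. x86_64-unknown-linux-gnu)
    echo "$(basename "$1" .cu)-host-$2.bc"
}

# Reproduces the two file names passed to the translators in the example.
kernel_bc_name vecadd.cu sm_50
host_bc_name vecadd.cu x86_64-unknown-linux-gnu
```

On a non-x86 host, the triple in the host bitcode name is likely to differ from the one shown here.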
How to contribute?
Contributions of any kind are welcome. Please refer to Contribution.md for more details.
Related publications
If you refer to CuPBoP in your projects, please cite the related papers:
- COX: Exposing CUDA Warp-Level Functions to CPUs
- CuPBoP: CUDA for Parallelized and Broad-range Processors
Contributors
- Ruobing Han
- Jun Chen
- Bhanu Garg
- Xule Zhou
- John Lu
- Chihyo Ahn
- Haotian Sheng
- Blaise Tine
- Hyesoon Kim
Acknowledgements
- POCL is an open-source OpenCL implementation based on LLVM. We reuse some code from it (e.g., for applying optimizations and loading/storing LLVM IR).
- Hetero-Mark and Rodinia Benchmark are two benchmark suites for heterogeneous system computation. CuPBoP uses them as integration tests to verify correctness.
- moodycamel::ConcurrentQueue (https://github.com/cameron314/concurrentqueue/tree/master) is a fast multi-producer, multi-consumer lock-free concurrent queue for C++11. CuPBoP uses it as the task queue for launching and executing kernels.