Profiling CUDA through Python with NVVP

For those of you who don't know, I'm currently pursuing my Master's in Computer Science. Part of this undertaking requires me to complete a thesis.

For my thesis, I decided to add a framework to the Vapoursynth video filtering project that will allow it to run as many video operations as possible on a CUDA-enabled GPU.

As a result, I've been writing and profiling a lot of CUDA kernels recently. The problem, of course, is that Vapoursynth works through a Python script wrapper around a C/C++ core. So, whenever I want to profile my filters, I have to insert their associated calls into a Python script and obtain their timings through its execution.

Of course, this is a less-than-ideal solution. What I've (finally) been able to do is profile my CUDA filters through their Python script files using NVIDIA's NVVP (NVIDIA Visual Profiler) program. It took a few tricks to get it working correctly, especially considering that these scripts output video data directly to standard out (stdout), which NVVP attempts to display, poorly, on the Console tab of the profiler.

Let's get to the tricks:

  1. Make sure that a hash-bang (shebang) line is at the very top of your script, so the system knows which interpreter to run it with. Ex:

    #!/usr/bin/env python3
    
  2. Make sure that instead of sending the video output to stdout, we send it to /dev/null on Linux, or NUL on Windows (Python's os.devnull resolves to the right one for the platform). This prevents NVVP from exploding on the massive amount of video data. Ex:

    import os
    # Open in binary mode: the y4m stream is raw bytes, not text.
    with open(os.devnull, 'wb') as f:
        clip.output(f, y4m=True)
    
  3. Be sure to make the script executable; otherwise NVVP won't be able to, well, execute it. Ex:

    example@example-desktop:~/src/testing $ chmod +x test.py
    
  4. Start a new profiling session in NVVP and load your target Python script under the File path, optionally setting the Working Directory to the same directory as your script in case you need to load any external files. After that, you're done! Go ahead and profile your application like you would any other CUDA executable. It may not be able to run all ~28 profiling passes, but it will run most of them and give you a nice execution timeline to boot.
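
Tricks 1 and 3 are easy to forget when a script gets copied around, so a quick pre-flight check can save a confusing failed launch. Here is a minimal Python sketch of one. The helper names (has_shebang, make_executable) are made up for illustration and are not part of NVVP or Vapoursynth, and the permission handling assumes a POSIX system such as Linux:

```python
import os
import stat


def has_shebang(path):
    """Return True if the file starts with the two bytes '#!'."""
    with open(path, "rb") as f:
        return f.read(2) == b"#!"


def make_executable(path):
    """Add execute permission for user, group, and others.

    Roughly what `chmod a+x path` does; existing permission bits are kept.
    """
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

For example, calling make_executable("test.py") has roughly the same effect as the chmod +x above, and has_shebang("test.py") should return True once trick 1 is in place.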

So, to sum up: give your target Python script a hash-bang line and make it executable, redirect any large output to /dev/null (or NUL on Windows) instead of stdout, and then run NVVP like you would for a normal CUDA program and enjoy the wealth of analysis tools it has to offer.
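
If the same script also needs to emit real video when it isn't being profiled, one option is to pick the output target at run time. The sketch below uses a hypothetical helper, open_output, invented here for illustration; it is not part of Vapoursynth or NVVP. It hands back either the null device or the binary stdout stream:

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def open_output(discard):
    """Yield a binary, writable file object.

    discard=True  -> the platform's null device (/dev/null or NUL)
    discard=False -> the raw byte stream underneath sys.stdout
    """
    if discard:
        # Opened here, so it is closed again when the with-block exits.
        with open(os.devnull, "wb") as sink:
            yield sink
    else:
        # stdout is not ours to close, so just hand it over.
        yield sys.stdout.buffer
```

In a Vapoursynth script, the clip.output(f, y4m=True) call from trick 2 would sit inside a "with open_output(True) as f:" block when profiling under NVVP, and inside "with open_output(False) as f:" for normal use.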