Enhance the GPU documentation

  1. Add A40 and GH200 to the table and constraint lists
  2. Remove the list of CUDA versions and instead give the command that generates it, so we don't need to keep the list up to date (or add it to CI/CD, but I think it just looks messy)
  3. Add a section on "How to choose a GPU", suggesting starting with the lowest-end GPU if unsure.
    1. Document nvitop (installable with pip) as a convenient way to monitor usage.
    2. Document common error messages such as "torch.cuda.OutOfMemoryError: CUDA out of memory" and what to do about them
  4. Add a section on partition selection that points out the difference between priority and general partitions and recommends selecting multiple partitions when possible
  5. Suggest checkpointing when using gpu-preempt to access GPU types that may only be available in a priority queue.
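
For the checkpointing item, the docs could include a small framework-agnostic example along these lines. This is only a sketch under assumptions: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `stop_after` argument (which merely simulates a preemption) are all hypothetical, and a real training job would save model and optimizer state with its framework's own tools instead.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a preemption mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic on the same filesystem: readers see either
        # the old checkpoint or the new one, never a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, num_steps, checkpoint_every=10, stop_after=None):
    """Resume from the last checkpoint and work until num_steps is reached.

    stop_after is a hypothetical knob that simulates the job being
    preempted partway through.
    """
    state = load_checkpoint(path)
    while state["step"] < num_steps:
        state["total"] += state["step"]  # placeholder for one unit of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0 or state["step"] == num_steps:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            break  # simulated preemption: progress since the last save is lost
    return state
```

When the job is requeued after preemption, calling `run` again with the same path picks up from the most recent saved step rather than from zero, so at most `checkpoint_every` steps of work are repeated.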