Enhance the GPU documentation
- Add A40 and GH200 to the table and constraint lists
- Remove the list of CUDA versions and instead give the command that generates it, so we don't need to keep the list up to date (we could generate it in CI/CD instead, but I think that just looks messy)
- Add a section on "How to choose a GPU", suggesting users start with the lowest-end GPU if unsure.
- Document `nvitop` (installable with `pip`) as a convenient way to monitor usage.
- Document common error messages like "torch.cuda.OutOfMemoryError: CUDA out of memory" and what to do about them
- Add a section on partition selection, pointing out the difference between priority and general partitions and recommending selecting multiple partitions when possible
- Suggest checkpointing when using `gpu-preempt` to access GPU types that may only be in a priority queue.
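
For the checkpointing item, the docs could include something like the following minimal sketch. It uses only the Python standard library and assumes the scheduler sends `SIGTERM` to a preempted job before killing it, which depends on site configuration. The file name `checkpoint.pkl`, the function names, and the placeholder "work" in the loop are all hypothetical; a real training job would save model and optimizer state instead.

```python
import os
import pickle
import signal
import tempfile

CHECKPOINT_PATH = "checkpoint.pkl"  # hypothetical file name

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: ask the main loop to stop at the next step boundary."""
    global stop_requested
    stop_requested = True


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write to a temporary file, then rename, so a kill mid-write cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run(num_steps, path=CHECKPOINT_PATH, save_every=10):
    """Resume from the last checkpoint and continue until done or asked to stop."""
    state = load_checkpoint(path)
    while state["step"] < num_steps and not stop_requested:
        state["total"] += state["step"]  # placeholder for one unit of real work
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # final save, also reached after a stop request
    return state


if __name__ == "__main__":
    # Assumption: preemption is announced with SIGTERM.
    signal.signal(signal.SIGTERM, request_stop)
    print(run(100))
```

Resubmitting the same job after preemption picks up from the last saved step rather than starting over.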