RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

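This problem took me a whole day to solve.

Training code that had been working fine started throwing this error as soon as I moved it to a different machine.

I first assumed CUDA 11 was the cause and worried that the CUDA and PyTorch versions did not match, so I reinstalled everything. That did not fix it.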
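
Before reinstalling anything, it may be quicker to print which versions PyTorch actually sees. The following is only a minimal sketch: the helper name `describe_environment` is my own, and it only reports versions, it does not decide whether they are compatible.

```python
def describe_environment():
    """Collect the version info that matters when suspecting a CUDA/PyTorch mismatch."""
    try:
        import torch  # imported lazily so the helper also runs where torch is missing
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        # CUDA version this PyTorch build was compiled against (None on CPU-only builds)
        "cuda_build": torch.version.cuda,
        # cuDNN version PyTorch loaded (None if cuDNN is not available)
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running this on both the old and the new machine and comparing the output shows at a glance what actually differs between them.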

Symptoms:

Traceback (most recent call last):
  File "train.py", line 100, in <module>
    main(opt)
  File "train.py", line 71, in main

……

  File "/home/xxxx/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0xaa030590
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 80, 144,
    strideA = 737280, 11520, 144, 1,
output: TensorDescriptor 0xaa0d6560
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 1, 64, 80, 144,
    strideA = 737280, 11520, 144, 1,
weight: FilterDescriptor 0xaa0d0360
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 64, 64, 3, 3,
Pointer addresses:
    input: 0x567e50000
    output: 0x568120000
    weight: 0x550a2da00

Solution:

Save the repro snippet suggested in the error message to a file:

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

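Then run it with python. It should raise the same error. Pick one of the switches, change it, and run again to see whether the error still occurs.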
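
The trial-and-error step can also be automated. Below is a rough sketch, not what I actually ran: the names `BASELINE`, `variants`, `build_script` and `sweep` are made up for illustration. It flips one switch at a time and runs each variant in a fresh Python process, on the assumption that a cuDNN failure may leave the CUDA context of the current process unusable for the next attempt.

```python
import subprocess
import sys

# The four switches exactly as they appear in the repro snippet.
BASELINE = {
    "torch.backends.cuda.matmul.allow_tf32": True,
    "torch.backends.cudnn.benchmark": True,
    "torch.backends.cudnn.deterministic": False,
    "torch.backends.cudnn.allow_tf32": True,
}

# The rest of the repro snippet, unchanged.
REPRO_BODY = """\
data = torch.randn([1, 64, 80, 144], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
"""


def variants(baseline):
    """Return (label, flags) pairs: the baseline, then each switch flipped alone."""
    result = [("baseline", dict(baseline))]
    for name in baseline:
        flipped = dict(baseline)
        flipped[name] = not flipped[name]
        result.append((f"{name} = {flipped[name]}", flipped))
    return result


def build_script(flags):
    """Build the source of one repro script with the given switch values."""
    lines = ["import torch"]
    lines += [f"{name} = {value}" for name, value in flags.items()]
    return "\n".join(lines) + "\n" + REPRO_BODY


def sweep(baseline=BASELINE):
    """Run every variant in its own interpreter and print whether it crashed."""
    for label, flags in variants(baseline):
        proc = subprocess.run(
            [sys.executable, "-c", build_script(flags)],
            capture_output=True,
            text=True,
        )
        status = "ok" if proc.returncode == 0 else "FAILED"
        print(f"{status:6}  {label}")
```

Calling `sweep()` on the affected machine prints one line per variant. A variant that reports `ok` while the baseline reports `FAILED` points to the switch worth changing in the real training script.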

For my code, changing just this one setting made it work:

torch.backends.cudnn.benchmark = False

Then put this line in the training script, before the code that fails, and that is all it takes.

For related discussion, see: F.conv2d() causes RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR · Issue #45769 · pytorch/pytorch · GitHub
