torch_amp


AMP

Automatic Mixed Precision (AMP) only requires PyTorch 1.6 or later and a CUDA-capable GPU.

torch.cuda.amp provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half). Some ops, like linear layers and convolutions, are much faster in float16. Other ops, like reductions, often require the dynamic range of float32. Mixed precision tries to match each op to its appropriate datatype, which can reduce your network's runtime and memory footprint.

Mixed precision primarily benefits Tensor Core-enabled architectures (Volta, Turing, Ampere). On these architectures, this recipe should show significant (2-3X) speedups. On earlier architectures (Kepler, Maxwell, Pascal), you may observe a modest speedup. Run nvidia-smi to display your GPU's architecture.
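
If you prefer to check from Python, the device's compute capability also tells you whether Tensor Cores are present (Volta and newer report a major version of 7 or higher). A small check along these lines (not part of the recipe):

import torch

# Volta (7.0), Turing (7.5), and Ampere (8.x) GPUs have Tensor Cores.
major, minor = torch.cuda.get_device_capability()
print(torch.cuda.get_device_name(), (major, minor))
print("Tensor Cores available:", major >= 7)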

batch_size, in_size, out_size, and num_layers are chosen to be large enough to saturate the GPU with work. Typically, mixed precision provides the greatest speedup when the GPU is saturated; small networks may instead be CPU bound, in which case mixed precision won't improve performance. Sizes are also chosen so that the linear layers' participating dimensions are multiples of 8, to permit Tensor Core usage on Tensor Core-capable GPUs.
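
The snippets below refer to net, opt, loss_fn, data, and targets without defining them in this post. A minimal compatible setup, with illustrative sizes (the exact values are assumptions, chosen to be large and multiples of 8), could look like this:

import torch

def make_model(in_size, out_size, num_layers):
    # A stack of Linear + ReLU layers: large, matmul-heavy, and Tensor Core-friendly.
    layers = []
    for _ in range(num_layers - 1):
        layers += [torch.nn.Linear(in_size, in_size), torch.nn.ReLU()]
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*layers).cuda()

# Illustrative sizes; what matters is that they saturate the GPU and that the
# linear layers' participating dimensions are multiples of 8.
batch_size, in_size, out_size, num_layers = 512, 4096, 4096, 3
num_batches, epochs = 50, 3

data = [torch.randn(batch_size, in_size, device="cuda") for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size, device="cuda") for _ in range(num_batches)]

loss_fn = torch.nn.MSELoss().cuda()
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)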

torch.cuda.amp.autocast

Instances of torch.cuda.amp.autocast serve as context managers that allow regions of your script to run in mixed precision. In these regions, CUDA ops run in a dtype chosen by autocast to improve performance while maintaining accuracy.

for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        # Runs the forward pass under autocast.
        with torch.cuda.amp.autocast():
            output = net(input)
            # output is float16 because linear layers autocast to float16.
            assert output.dtype is torch.float16

            loss = loss_fn(output, target)
            # loss is float32 because mse_loss layers autocast to float32.
            assert loss.dtype is torch.float32

        # Exits autocast before backward().
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        loss.backward()
        opt.step()
        opt.zero_grad() # set_to_none=True here can modestly improve performance

torch.cuda.amp.GradScaler

Gradient scaling helps prevent gradients with small magnitudes from flushing to zero ("underflow") when training with mixed precision. Because float16 has a much narrower representable range than float32, gradients flowing backward through float16 ops may underflow, and the corresponding parameter updates would be lost. Multiplying the loss by a scale factor before calling backward() scales every gradient by the same factor, pushing them into float16's representable range; each parameter's .grad is then unscaled before the optimizer updates the parameters.
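
A small illustration of the underflow problem itself (not part of the recipe):

import torch

# float16 cannot represent magnitudes below roughly 6e-8; they flush to zero.
tiny = torch.tensor(1e-8)
print(tiny.half())                  # tensor(0., dtype=torch.float16)

# Scaling first keeps the value representable in float16; unscaling in float32 recovers it.
scaled = (tiny * 65536.0).half()    # ~6.55e-4, comfortably representable in float16
print(scaled.float() / 65536.0)     # approximately 1e-08 again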

# Constructs scaler once, at the beginning of the convergence run, using default args.
# If your network fails to converge with default GradScaler args, please file an issue.
# The same GradScaler instance should be used for the entire convergence run.
# If you perform multiple convergence runs in the same script, each run should use
# a dedicated fresh GradScaler instance.  GradScaler instances are lightweight.
scaler = torch.cuda.amp.GradScaler()

for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)

        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)

        # Updates the scale for next iteration.
        scaler.update()

        opt.zero_grad() # set_to_none=True here can modestly improve performance

All together

use_amp = True

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
end_timer_and_print("Mixed precision:")
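
start_timer() and end_timer_and_print() are helper functions from the full recipe that this post does not reproduce. A minimal sketch of what they could look like (an assumption, not the recipe's exact code):

import gc
import time
import torch

start_time = None

def start_timer():
    # Clear cached memory and reset peak-memory stats so the measurement only
    # reflects the region being timed.
    global start_time
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start_time = time.time()

def end_timer_and_print(local_msg):
    torch.cuda.synchronize()
    end_time = time.time()
    print(local_msg)
    print("Total execution time = {:.3f} sec".format(end_time - start_time))
    print("Max memory used by tensors = {} bytes".format(torch.cuda.max_memory_allocated()))

Running the same loop with use_amp = False gives a default-precision baseline for comparison, since enabled=False turns both autocast and GradScaler into no-ops.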

Inspecting/modifying gradients

All gradients produced by scaler.scale(loss).backward() are scaled. If you wish to modify or inspect the parameters' .grad attributes between backward() and scaler.step(optimizer), you should first unscale them with scaler.unscale_(optimizer).

for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(opt)

        # Since the gradients of optimizer's assigned params are now unscaled, clips as usual.
        # You may use the same value for max_norm here as you would without gradient scaling.
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.1)

        scaler.step(opt)
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance

Saving/Resuming

To save and resume Amp-enabled runs with bitwise accuracy, save and restore the scaler's state via scaler.state_dict() and scaler.load_state_dict(), alongside the usual model and optimizer state dicts.

checkpoint = {"model": net.state_dict(),
              "optimizer": opt.state_dict(),
              "scaler": scaler.state_dict()}
# Write checkpoint as desired, e.g.,
# torch.save(checkpoint, "filename")

# Read checkpoint as desired, e.g.,
# dev = torch.cuda.current_device()
# checkpoint = torch.load("filename",
#                         map_location = lambda storage, loc: storage.cuda(dev))
net.load_state_dict(checkpoint["model"])
opt.load_state_dict(checkpoint["optimizer"])
scaler.load_state_dict(checkpoint["scaler"])
  • If your checkpoint was created without Amp and you want to resume training with Amp, load the model and optimizer states from the checkpoint as usual. The checkpoint won't contain a saved scaler state, so use a fresh GradScaler instance (see the sketch below).
  • If your checkpoint was created with Amp and you want to resume training without Amp, load the model and optimizer states from the checkpoint as usual, and ignore the saved scaler state.
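
A sketch of the first case, reusing the placeholder "filename" from the snippet above:

# Resuming with Amp from a checkpoint that was saved without Amp: the checkpoint
# has no "scaler" entry, so a fresh GradScaler instance is used instead.
checkpoint = torch.load("filename")
net.load_state_dict(checkpoint["model"])
opt.load_state_dict(checkpoint["optimizer"])
scaler = torch.cuda.amp.GradScaler()  # new instance; nothing to restore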

Troubleshooting

Speedup with Amp is minor

  1. Your network may fail to saturate the GPU(s) with work, and is therefore CPU bound. Amp’s effect on GPU performance won’t matter.
    • A rough rule of thumb to saturate the GPU is to increase batch and/or network size(s) as much as you can without running OOM.
    • Try to avoid excessive CPU-GPU synchronization (.item() calls, or printing values from CUDA tensors).
    • Try to avoid sequences of many small CUDA ops (coalesce these into a few large CUDA ops if you can).
  2. Your network may be GPU compute bound (lots of matmuls/convolutions) but your GPU does not have Tensor Cores. In this case a reduced speedup is expected.
  3. Matmul dimensions are not Tensor Core-friendly. Make sure matmuls’ participating sizes are multiples of 8. (For NLP models with encoders/decoders, this can be subtle. Also, convolutions used to have similar size constraints for Tensor Core use, but for cuDNN versions 7.3 and later, no such constraints exist. See here for guidance.)

Loss is inf/NaN

First, check if your network fits an advanced use case. See also Prefer binary_cross_entropy_with_logits over binary_cross_entropy.

If you’re confident your Amp usage is correct, you may need to file an issue, but before doing so, it’s helpful to gather the following information:

  1. Disable autocast or GradScaler individually (by passing enabled=False to their constructor) and see if infs/NaNs persist.
  2. If you suspect part of your network (e.g., a complicated loss function) overflows, run that forward region in float32 and see if infs/NaNs persist. The autocast docstring’s last code snippet shows how to force a subregion to run in float32 (by locally disabling autocast and casting the subregion’s inputs); a sketch follows this list.
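
A sketch of forcing a subregion to run in float32 inside an autocast region (the tensor names and sizes are illustrative placeholders):

import torch

a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with torch.cuda.amp.autocast():
    # torch.mm autocasts to float16 here.
    e_float16 = torch.mm(a_float32, b_float32)

    # Locally disable autocast to force the suspect subregion to run in float32.
    with torch.cuda.amp.autocast(enabled=False):
        # Tensors produced under autocast may be float16, so cast them back explicitly.
        f_float32 = torch.mm(c_float32, e_float16.float())

    # Back in the autocast-enabled region; no manual casts are needed.
    g_float16 = torch.mm(d_float32, f_float32)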

Type mismatch error (may manifest as CUDNN_STATUS_BAD_PARAM)

Autocast tries to cover all ops that benefit from or require casting. Ops that receive explicit coverage are chosen based on numerical properties, but also on experience. If you see a type mismatch error in an autocast-enabled forward region or a backward pass following that region, it’s possible autocast missed an op.

Please file an issue with the error backtrace. Setting export TORCH_SHOW_CPP_STACKTRACES=1 before running your script provides fine-grained information on which backend op is failing.
