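Notes on GPU Mode Lecture 1

Profilers

Computer performance is a trade-off between time and memory. Because compute devices are expensive, time is usually the first thing to worry about.

Why use a profiler?

- CUDA is asynchronous, so the Python time module cannot measure GPU work reliably: a kernel launch returns before the kernel has finished.
- Profilers are more powerful: they attribute time to individual operators and kernels.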

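Tools

There are three profilers:

- autograd profiler: numerical
- PyTorch profiler: visual
- NVIDIA Nsight Compute

The autograd profiler workflow in this lecture times GPU code with torch.cuda.Event(), which records timestamps on the CUDA stream.

The PyTorch profiler is used through profile(), a context manager in torch.profiler.
The results can be exported to a .json file and loaded into chrome://tracing/ for visualization.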

Demo

The lecture provides a simple program that uses the autograd profiler to compare three ways of squaring a tensor:

- with torch.square()
- with the ** operator
- with the * operator

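    import torch

    def time_pytorch_function(func, input):
        # CUDA is asynchronous, so the Python time module can't be used here
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        # Warm up so one-time setup cost is not measured
        for _ in range(5):
            func(input)

        start.record()
        func(input)
        end.record()
        # Wait for the GPU to finish before reading the timer
        torch.cuda.synchronize()
        return start.elapsed_time(end)  # milliseconds

    # The notes do not show these definitions; they are assumed here
    # from the "a * a" and "a ** 2" labels in the results.
    def square_2(a):
        return a * a

    def square_3(a):
        return a ** 2

    b = torch.randn(10000, 10000).cuda()  # assumed input size

    time_pytorch_function(torch.square, b)
    time_pytorch_function(square_2, b)
    time_pytorch_function(square_3, b)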

The results below were measured on an NVIDIA T4 GPU.

    Profiling torch.square:
    Self CPU time total: 10.577ms
    Self CUDA time total: 3.266ms

    Profiling a * a:
    Self CPU time total: 5.417ms
    Self CUDA time total: 3.276ms

    Profiling a ** 2:
    Self CPU time total: 6.183ms
    Self CUDA time total: 3.274ms
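Totals like these come with a per-operator breakdown. The following is a minimal sketch of how that breakdown might be inspected with torch.profiler. The helper name dispatched_ops and the tensor size are illustrative assumptions, not part of the lecture code, and the sketch falls back to the CPU when no GPU is present.

```python
import torch
from torch.profiler import profile, ProfilerActivity


def dispatched_ops(func, x):
    """Return the names of all events the profiler records while running func(x).

    Hypothetical helper for illustration; not part of the lecture code.
    """
    activities = [ProfilerActivity.CPU]
    if x.is_cuda:
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities) as prof:
        func(x)
        if x.is_cuda:
            # Make sure the kernels finish inside the profiled region
            torch.cuda.synchronize()
    return {event.key for event in prof.key_averages()}


# Use the GPU when available, otherwise run the same comparison on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1000, 1000, device=device)  # arbitrary size for the sketch

variants = {
    "torch.square": torch.square,
    "a * a": lambda t: t * t,
    "a ** 2": lambda t: t ** 2,
}
ops_by_variant = {label: dispatched_ops(fn, a) for label, fn in variants.items()}

for label, ops in ops_by_variant.items():
    aten_ops = sorted(name for name in ops if name.startswith("aten::"))
    print(f"{label}: {aten_ops}")
```

Only the operator names are compared here; for timings, print prof.key_averages().table() instead.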

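The results show that:

- In all three runs, total CUDA time (about 3.27 ms) is lower than total CPU time.
- The * operator runs aten::multiply (listed as aten::mul in profiler tables) rather than aten::pow, and the former is faster (5.417 ms vs 6.183 ms of CPU time). This is probably because multiplication is used far more often than pow, so more developer time has gone into optimizing it.
- The performance difference on CUDA is negligible.
- In terms of CPU time, torch.square is the slowest.
- aten::square is a call to aten::pow.
- All three methods launch a kernel named native::vectorized_elementwise_kernel.

Integrating CUDA kernels in PyTorch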

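There are several ways to do this:

- Use load_inline from torch.utils.cpp_extension.
- Use Numba, a compiler that turns decorated Python functions into machine code that runs on the CPU and GPU.
- Use Triton.

With load_inline from torch.utils.cpp_extension, a CUDA kernel can be loaded as a PyTorch extension by calling load_inline(name, cpp_sources, cuda_sources, functions, with_cuda, build_directory).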

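    import os
    import torch
    from torch.utils.cpp_extension import load_inline

    # cpp_source and cuda_source are strings holding the C++ declaration and
    # the CUDA kernel; their contents are not shown in these notes.
    build_dir = './load_inline_cuda'
    os.makedirs(build_dir, exist_ok=True)  # the build directory must already exist

    square_matrix_extension = load_inline(
        name='square_matrix_extension',
        cpp_sources=cpp_source,
        cuda_sources=cuda_source,
        functions=['square_matrix'],
        with_cuda=True,
        extra_cuda_cflags=["-O2"],
        build_directory=build_dir,
        # extra_cuda_cflags=['--expt-relaxed-constexpr']
    )

    a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device='cuda')
    print(square_matrix_extension.square_matrix(a))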
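The notes do not show what cpp_source and cuda_source contain. A minimal sketch of what they might look like for an elementwise square of a 2D float32 matrix follows. The kernel body, the 16x16 thread block, and the helper name build_square_matrix_extension are assumptions made for illustration. Building needs a CUDA toolkit (nvcc) and a GPU, so the build is wrapped in a function and only runs when CUDA is available.

```python
import os

import torch
from torch.utils.cpp_extension import load_inline

# CUDA side: the kernel plus a host function that launches it (assumed contents).
cuda_source = r"""
__global__ void square_matrix_kernel(const float* matrix, float* result,
                                     int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < height && col < width) {
        int idx = row * width + col;
        result[idx] = matrix[idx] * matrix[idx];
    }
}

torch::Tensor square_matrix(torch::Tensor matrix) {
    const auto height = matrix.size(0);
    const auto width = matrix.size(1);
    auto result = torch::empty_like(matrix);

    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks((width + threads_per_block.x - 1) / threads_per_block.x,
                          (height + threads_per_block.y - 1) / threads_per_block.y);

    square_matrix_kernel<<<number_of_blocks, threads_per_block>>>(
        matrix.data_ptr<float>(), result.data_ptr<float>(), width, height);
    return result;
}
"""

# C++ side: only the declaration, so load_inline can generate the Python binding.
cpp_source = "torch::Tensor square_matrix(torch::Tensor matrix);"


def build_square_matrix_extension(build_directory="./load_inline_cuda"):
    """Compile the sources above into a PyTorch extension (hypothetical helper)."""
    os.makedirs(build_directory, exist_ok=True)
    return load_inline(
        name="square_matrix_extension",
        cpp_sources=cpp_source,
        cuda_sources=cuda_source,
        functions=["square_matrix"],
        with_cuda=True,
        extra_cuda_cflags=["-O2"],
        build_directory=build_directory,
    )


if __name__ == "__main__" and torch.cuda.is_available():
    ext = build_square_matrix_extension()
    a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device="cuda")
    print(ext.square_matrix(a))
```

The sketch assumes a contiguous float32 input already on the GPU; a more defensive version would check the dtype and device before launching the kernel.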

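Hands-on practice

Using the autograd profiler on the mean operation

When using the autograd profiler, remember to:

- Warm up the GPU before recording so that it reaches a steady state.
- Average over multiple runs to get more reliable results.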

    import torch

    # Method 1: use torch.mean()
    def mean_all_by_torch(input_tensor):
        return torch.mean(input_tensor)

    # Method 2: use the tensor's mean() method
    def mean_all_by_tensor(input_tensor):
        return input_tensor.mean()

    # Method 3: combine torch.sum() and tensor.numel()
    def mean_all_by_combination(input_tensor):
        return torch.sum(input_tensor) / input_tensor.numel()

    def time_pytorch_function(func, input_tensor, warmup=5, runs=100):
        # Warm up so the GPU reaches a steady state
        for _ in range(warmup):
            func(input_tensor)

        times = []
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        for _ in range(runs):
            start.record()
            func(input_tensor)
            end.record()
            torch.cuda.synchronize()
            times.append(start.elapsed_time(end))
        # Average time per run, in milliseconds
        return sum(times) / len(times)

    input_tensor = torch.randn(10000, 10000).cuda()

    print("torch.mean() time:", time_pytorch_function(mean_all_by_torch, input_tensor))
    print("tensor.mean() time:", time_pytorch_function(mean_all_by_tensor, input_tensor))
    print("manual mean time:", time_pytorch_function(mean_all_by_combination, input_tensor))

    # Profile each method and print its per-operator table
    for func in (mean_all_by_torch, mean_all_by_tensor, mean_all_by_combination):
        with torch.profiler.profile() as prof:
            func(input_tensor)
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Using the PyTorch profiler on the mean operation

    import torch
    from torch.profiler import profile, ProfilerActivity

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(10):
            mean_tensor = torch.mean(torch.randn(10000, 10000).cuda())

    # Load the exported file in chrome://tracing/ to view the timeline
    prof.export_chrome_trace("mean_trace.json")
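The exported trace is plain JSON, so it can also be summarized without the browser. The sketch below assumes the file follows the Chrome Trace Event layout, with a top-level "traceEvents" list in which complete events have "ph" equal to "X" and a "dur" field in microseconds. The helper names and the tiny hand-written trace are illustrative only.

```python
import json
from collections import defaultdict


def summarize_trace(trace, top=10):
    """Sum the duration (microseconds) of complete events per event name."""
    # The trace is either an object with "traceEvents" or a bare list of events
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        # Only complete events ("ph" == "X") carry a duration
        if event.get("ph") == "X" and "dur" in event:
            totals[event["name"]] += event["dur"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


def summarize_trace_file(path, top=10):
    """Load a trace such as mean_trace.json from disk and summarize it."""
    with open(path) as f:
        return summarize_trace(json.load(f), top)


# Hand-written trace in the same layout, used here only for illustration
example_trace = {
    "traceEvents": [
        {"name": "aten::mean", "ph": "X", "ts": 0, "dur": 120.0},
        {"name": "aten::randn", "ph": "X", "ts": 130, "dur": 900.0},
        {"name": "aten::mean", "ph": "X", "ts": 1100, "dur": 80.0},
        {"name": "process_name", "ph": "M"},  # metadata event, ignored
    ]
}

for name, total_us in summarize_trace(example_trace):
    print(f"{name}: {total_us:.1f} us")
```

Calling summarize_trace_file("mean_trace.json") would apply the same summary to the file written by the profiler run.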

Implementing torch.mean() in Triton

    import triton
    import triton.language as tl
    import torch

    @triton.jit
    def mean_kernel(
        x_ptr,       # pointer to input tensor
        output_ptr,  # pointer to output tensor (one value per block)
        n_elements,  # total number of elements
        BLOCK_SIZE: tl.constexpr,  # number of elements per block
    ):
        pid = tl.program_id(0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        # Mask out the positions past the end of the input in the last block
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
        # Each block contributes its partial sum divided by the total count
        block_mean = tl.sum(x, axis=0) / n_elements
        # Store result
        tl.store(output_ptr + pid, block_mean)

    # Wrapper function
    def triton_mean(x: torch.Tensor) -> torch.Tensor:
        x = x.contiguous().view(-1)
        n_elements = x.numel()
        BLOCK_SIZE = 1024
        grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
        output = torch.empty(grid[0], device=x.device, dtype=x.dtype)
        mean_kernel[grid](
            x_ptr=x,
            output_ptr=output,
            n_elements=n_elements,
            BLOCK_SIZE=BLOCK_SIZE,
        )
        # Adding up the per-block contributions gives the overall mean
        return output.sum()

    # Example usage:
    if __name__ == "__main__":
        # Create test tensor
        x = torch.randn(1000000, device='cuda')
        # Compare results
        torch_mean = torch.mean(x)
        triton_mean_result = triton_mean(x)
        print(f"PyTorch mean: {torch_mean}")
        print(f"Triton mean: {triton_mean_result}")
        print(f"Difference: {abs(torch_mean - triton_mean_result)}")
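The kernel stores, for each block, that block's sum divided by the total element count, and the wrapper adds those values up. The arithmetic can be checked without a GPU. The pure-Python sketch below mirrors that two-stage reduction; blockwise_mean is a hypothetical helper, and the data size is arbitrary.

```python
import math
import random


def blockwise_mean(values, block_size):
    """Mean computed as the sum of per-block (partial sum / total count)."""
    n = len(values)
    partials = []
    for start in range(0, n, block_size):
        # The last block may be shorter, like the bounds check in the kernel
        block = values[start:start + block_size]
        partials.append(sum(block) / n)
    # Second stage: add the per-block contributions (output.sum() in the wrapper)
    return sum(partials)


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]

reference = sum(data) / len(data)
result = blockwise_mean(data, block_size=1024)
print(f"reference mean: {reference}")
print(f"blockwise mean: {result}")
print(f"difference:     {abs(reference - result)}")
```

The two results agree up to floating-point rounding because dividing each partial sum by n and then adding is algebraically the same as dividing the full sum by n.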

