news 2026/5/25 5:39:05

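【CUDA】Triton

1. Basic differences between CUDA and Triton

CUDA programming model:

In traditional CUDA programming, a kernel is a scalar program with blocked threads.

Scalar program: you write the operations for a single thread directly, and each thread typically handles one data element.

Blocked threads: to process large datasets efficiently, threads are organized into thread blocks. Computation in different thread blocks is isolated, while threads inside one block cooperate through shared memory.

In CUDA the programmer works directly with threads and thread blocks. You write the thread-level details yourself: how data is loaded, how results are stored, how shared memory is managed. CUDA programmers have to master these low-level details in order to make full use of the GPU hardware.

Triton programming model:

Triton is a higher-level abstraction over the same GPU execution model that CUDA exposes. A Triton kernel is a blocked program with scalar threads.

Blocked program: you no longer operate on individual threads. You describe the work of one whole block, and the compiler lowers that description to low-level code.

Scalar threads: the programmer no longer manages thread-level details such as memory access patterns. The compiler takes care of data loading, storing, and shared memory use.

2. Triton vs CUDA: high-level abstraction vs low-level control

In CUDA the programmer has to master thread-level operations, including how threads are organized and how shared memory and synchronization are managed. The programmer usually has to think about how work is divided among thread blocks and how to exploit hardware features such as shared memory and registers. In short, CUDA offers fairly low-level control and lets the programmer hand-optimize every detail.

In Triton the programmer focuses on block-level operations, which suits standard deep learning operations such as convolution and matrix multiplication. The Triton compiler handles low-level details such as data loading, storing, memory scheduling, and synchronization between threads. Triton makes GPU programming for deep learning feel more like programming in a high-level language and removes much of the tedious low-level tuning.

An intuitive way to see it:

CUDA: the programmer writes a scalar program and manages the computation through thread blocks. You hand-optimize the behavior of each thread and the coordination of threads within a block, which requires a solid understanding of the GPU hardware architecture.

Triton: the programmer writes a blocked program that operates on whole blocks of data, with scalar threads underneath. The Triton compiler handles the thread-level operations, so the programmer can concentrate on the higher-level operation instead of the details of each thread.

3. Triton's advantage: simpler deep learning programming

An important feature of Triton is that it gives deep learning programmers a higher level of abstraction, which makes complex GPU programming more concise. The programmer does not need to manage threads or tune memory access by hand and can write efficient GPU kernels (convolution, matrix multiplication, and so on) in Python-like code. This is a big advantage for deep learning researchers, whose work centers on model design and algorithmic optimization rather than on the low-level details of GPU programming.

For example, efficient deep learning libraries such as cuBLAS and cuDNN have traditionally been built on CUDA, which requires a deep understanding of GPU architecture and memory management strategy. Triton lets Python programmers use a more intuitive API to write kernels whose performance can be comparable to those libraries in many cases.

4. Why you cannot simply skip CUDA and use Triton

Triton targets the same hardware and execution model as CUDA, so understanding some basic CUDA concepts remains important. Triton abstracts away many complex low-level details, but its performance still depends on the underlying GPU hardware features.

CUDA concepts are the foundation of Triton: the Triton compiler does not emit CUDA C++ source. On NVIDIA GPUs it lowers the kernel through LLVM to PTX, which then runs on the GPU just like a compiled CUDA kernel. Thread organization, the memory hierarchy, and the other basics of the CUDA model therefore still matter when you optimize performance.

Custom optimization: in some scenarios the programmer may need to optimize by hand or tune for a particular hardware architecture. That can mean writing a CUDA kernel directly, or using detailed knowledge of the CUDA model to improve on the code Triton generates.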
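The contrast between the two models in sections 1 and 2 can be made concrete with a small simulation. The following is only a sketch in plain Python, with lists standing in for GPU memory and loops standing in for parallel execution. The function names `cuda_style_add` and `triton_style_add` are hypothetical and are not part of CUDA or Triton.

```python
# Illustrative simulation only: lists play the role of GPU memory and
# sequential loops play the role of parallel hardware. All names are hypothetical.


def cuda_style_add(x, y, block_dim):
    """Scalar program, blocked threads: the innermost body is written for ONE thread."""
    n = len(x)
    out = [0.0] * n
    grid_dim = (n + block_dim - 1) // block_dim  # number of thread blocks
    for block_idx in range(grid_dim):  # blocks would run in parallel on a GPU
        for thread_idx in range(block_dim):  # so would the threads of a block
            i = block_idx * block_dim + thread_idx  # global index of this thread
            if i < n:  # per-thread bounds check
                out[i] = x[i] + y[i]
    return out


def triton_style_add(x, y, block_size):
    """Blocked program, scalar threads: the body is written for ONE whole block."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size
    for pid in range(num_programs):  # one "program" per block of data
        offsets = [pid * block_size + k for k in range(block_size)]
        mask = [off < n for off in offsets]
        # Masked loads: out-of-range lanes receive a dummy value instead of reading memory.
        xs = [x[off] if ok else 0.0 for off, ok in zip(offsets, mask)]
        ys = [y[off] if ok else 0.0 for off, ok in zip(offsets, mask)]
        result = [a + b for a, b in zip(xs, ys)]
        # Masked store: only in-range lanes write back.
        for off, ok, value in zip(offsets, mask, result):
            if ok:
                out[off] = value
    return out


x = [float(i) for i in range(10)]
y = [2.0 * i for i in range(10)]
# Both styles compute the same vector; only the unit the programmer writes for differs.
assert cuda_style_add(x, y, block_dim=4) == triton_style_add(x, y, block_size=4)
```

In the first function the programmer reasons about one index `i` at a time. In the second the programmer reasons about a whole block of offsets and a mask, which is the shape a Triton kernel takes.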
5. Learning resources

Triton documentation: the official documentation covers Triton's programming model, API, and best practices in detail.

OpenAI blog: explains Triton's design ideas, its implementation, and its relationship to CUDA.

GitHub: the Triton source code, example programs, and open-source projects show how to write efficient deep learning kernels with Triton.

Summary

CUDA is a low-level GPU programming framework in which the programmer handles thread scheduling, memory access, and other low-level optimization details.

Triton provides a higher level of abstraction that simplifies GPU programming for deep learning, so the programmer can focus on algorithm-level development rather than low-level hardware details.

Triton builds on the same execution model as CUDA, so knowing the basic CUDA concepts is important for understanding Triton in depth and for tuning its performance.

Code

The short programs below show how Triton is used and how its abstraction level compares with CUDA.

vec_add.py

This program implements vector addition (x + y) with Triton and with PyTorch and compares their performance. Triton is a tool for writing efficient GPU kernels, similar to CUDA but simpler. The core of the program is the Triton kernel add_kernel, which processes the data block by block and guards against out-of-bounds access. The benchmark compares the throughput (GB/s) of the Triton kernel with PyTorch's native addition and plots the result.

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def add_kernel(
        x_ptr,  # Pointer to the first input vector.
        y_ptr,  # Pointer to the second input vector.
        output_ptr,  # Pointer to the output vector.
        n_elements,  # Size of the vector.
        BLOCK_SIZE: tl.constexpr,  # Number of elements each program should process.
        # NOTE: constexpr so it can be used as a shape value.
    ):
        # There are multiple "programs" processing different data.
        # Identify which program this is.
        pid = tl.program_id(axis=0)  # We use a 1D launch grid, so the axis is 0.
        # This program processes inputs that are offset from the start of the data.
        # For a vector of length 256 and a block size of 64, the programs access
        # the elements [0:64], [64:128], [128:192] and [192:256] respectively.
        # offsets is a block of element indices; adding it to a base pointer
        # gives a block of pointers.
        block_start = pid * BLOCK_SIZE
        offsets = block_start + tl.arange(0, BLOCK_SIZE)
        # Create a mask to guard memory operations against out-of-bounds accesses.
        mask = offsets < n_elements
        # Load x and y from DRAM, masking out the extra elements in case the
        # input size is not a multiple of the block size.
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        output = x + y
        # Write x + y back to DRAM.
        tl.store(output_ptr + offsets, output, mask=mask)


    def add(x: torch.Tensor, y: torch.Tensor):
        # We need to preallocate the output.
        output = torch.empty_like(x)
        assert x.is_cuda and y.is_cuda and output.is_cuda
        n_elements = output.numel()
        # The SPMD (single program, multiple data) launch grid gives the number
        # of kernel instances that run in parallel. It is analogous to a CUDA
        # launch grid and can be either Tuple[int] or
        # Callable(metaparameters) -> Tuple[int].
        # Here we use a 1D grid whose size is the number of blocks.
        grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )
        # NOTE:
        # - Each torch.Tensor is implicitly converted into a pointer to its first element.
        # - A triton.jit function can be indexed with a launch grid to obtain a
        #   callable GPU kernel.
        # - Don't forget to pass meta-parameters as keyword arguments.
        add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
        # We return a handle to output but, since torch.cuda.synchronize() has
        # not been called, the kernel may still be running asynchronously here.
        return output


    torch.manual_seed(0)
    size = 2**25
    x = torch.rand(size, device='cuda')
    y = torch.rand(size, device='cuda')


    @triton.testing.perf_report(
        triton.testing.Benchmark(
            x_names=['size'],  # Argument names to use as the x-axis of the plot.
            x_vals=[2**i for i in range(12, 28, 1)],  # Values for the x-axis argument.
            x_log=True,  # The x-axis is logarithmic.
            line_arg='provider',  # Argument whose value selects a line in the plot.
            line_vals=['triton', 'torch'],  # Possible values for line_arg.
            line_names=['Triton', 'Torch'],  # Labels for the lines.
            styles=[('blue', '-'), ('green', '-')],  # Line styles.
            ylabel='GB/s',  # Label for the y-axis.
            plot_name='vector-add-performance',  # Plot name, also used as the file name.
            args={},  # Values for function arguments not in x_names or line_arg.
        ))
    def benchmark(size, provider):
        x = torch.rand(size, device='cuda', dtype=torch.float32)
        y = torch.rand(size, device='cuda', dtype=torch.float32)
        quantiles = [0.5, 0.2, 0.8]
        if provider == 'torch':
            ms, min_ms, max_ms = triton.testing.do_bench(lambda: x + y, quantiles=quantiles)
        if provider == 'triton':
            ms, min_ms, max_ms = triton.testing.do_bench(lambda: add(x, y), quantiles=quantiles)
        # The factor 3 counts the memory traffic of x + y: read x from GPU
        # memory, read y from GPU memory, write the result back to GPU memory.
        # bytes / ms * 1e-6 converts to GB/s.
        gbps = lambda ms: 3 * x.numel() * x.element_size() / ms * 1e-6
        return gbps(ms), gbps(max_ms), gbps(min_ms)


    benchmark.run(print_data=True, show_plots=True, save_path='./')

The results show that the two perform almost identically.

softmax.cu

    #include <cuda_runtime.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void softmax_cuda(float* input, float* output, int B, int N) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int bid = blockIdx.y;
        if (tid < N && bid < B) {
            // Naive version: every thread of a row redundantly recomputes the
            // max, the exponential sum and the softmax of that entire row.
            int offset = bid * N;
            float max_val = input[offset];
            for (int i = 1; i < N; i++) {
                max_val = fmaxf(max_val, input[offset + i]);
            }
            float sum = 0.0f;
            for (int i = 0; i < N; i++) {
                sum += expf(input[offset + i] - max_val);
            }
            for (int i = 0; i < N; i++) {
                output[offset + i] = expf(input[offset + i] - max_val) / sum;
            }
        }
    }

    // CPU reference: in-place softmax over one row of length N.
    void softmax(float* x, int N) {
        float max_val = x[0];
        for (int i = 1; i < N; i++) {
            if (x[i] > max_val) max_val = x[i];
        }
        float sum = 0.0f;
        for (int i = 0; i < N; i++) {
            x[i] = expf(x[i] - max_val);
            sum += x[i];
        }
        for (int i = 0; i < N; i++) {
            x[i] /= sum;
        }
    }

    int main() {
        const int B = 32;    // batch size
        const int N = 1024;  // row length
        float* x_cpu = (float*)malloc(B * N * sizeof(float));
        float* x_gpu = (float*)malloc(B * N * sizeof(float));
        float *d_input, *d_output;
        for (int i = 0; i < B * N; i++) {
            x_cpu[i] = (float)rand() / RAND_MAX;
            x_gpu[i] = x_cpu[i];
        }
        cudaMalloc((void**)&d_input, B * N * sizeof(float));
        cudaMalloc((void**)&d_output, B * N * sizeof(float));
        cudaMemcpy(d_input, x_gpu, B * N * sizeof(float), cudaMemcpyHostToDevice);
        int threads_per_block = 256;
        int blocks_per_grid_x = (N + threads_per_block - 1) / threads_per_block;
        dim3 grid_dim(blocks_per_grid_x, B);  // 2D grid, 1D blocks
        softmax_cuda<<<grid_dim, threads_per_block>>>(d_input, d_output, B, N);
        cudaMemcpy(x_gpu, d_output, B * N * sizeof(float), cudaMemcpyDeviceToHost);
        // Only the first row is checked against the CPU reference.
        softmax(x_cpu, N);
        float max_diff = 0.0f;
        for (int i = 0; i < N; i++) {
            float diff = fabsf(x_cpu[i] - x_gpu[i]);
            if (diff > max_diff) {
                max_diff = diff;
            }
        }
        printf("Maximum difference between CPU and GPU results (first batch): %e\n", max_diff);
        // Clean up
        free(x_cpu);
        free(x_gpu);
        cudaFree(d_input);
        cudaFree(d_output);
        return 0;
    }

softmax.py

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def softmax_kernel(
        output_ptr, input_ptr, input_row_stride, output_row_stride, n_cols,
        BLOCK_SIZE: tl.constexpr,
    ):
        # Get the program ID; each program handles one row.
        row_idx = tl.program_id(axis=0)
        # Compute the memory offsets for this row.
        row_start_ptr = input_ptr + row_idx * input_row_stride
        out_row_start_ptr = output_ptr + row_idx * output_row_stride
        # Load the row into SRAM.
        # other=-float('inf') fills the masked-out lanes. The max is subtracted
        # before exponentiation for numerical stability, and exp(-inf) is 0, so
        # the padding lanes do not affect the max or the normalization sum.
        row = tl.load(row_start_ptr + tl.arange(0, BLOCK_SIZE),
                      mask=tl.arange(0, BLOCK_SIZE) < n_cols, other=-float('inf'))
        # Compute max for numerical stability.
        row_max = tl.max(row, axis=0)
        # Subtract max from row and exponentiate.
        numerator = tl.exp(row - row_max)
        # Compute sum for normalization.
        denominator = tl.sum(numerator, axis=0)
        # Normalize.
        softmax_output = numerator / denominator
        # Store the output.
        tl.store(out_row_start_ptr + tl.arange(0, BLOCK_SIZE), softmax_output,
                 mask=tl.arange(0, BLOCK_SIZE) < n_cols)


    def triton_softmax(x):
        n_rows, n_cols = x.shape
        output = torch.empty_like(x)
        # Determine the block size.
        BLOCK_SIZE = triton.next_power_of_2(n_cols)
        # NOTE: the kernel handles a row in a single block, so with this cap
        # only rows of at most 1024 columns are processed completely.
        BLOCK_SIZE = min(BLOCK_SIZE, 1024)
        # Launch the Triton kernel, one program per row.
        grid = (n_rows,)
        softmax_kernel[grid](output, x,
                             x.stride(0), output.stride(0),
                             n_cols, BLOCK_SIZE=BLOCK_SIZE)
        return output


    # Set up the input tensor.
    torch.manual_seed(0)
    x = torch.randn(256, 1000, device='cuda')
    # x = torch.tensor([[1.0, 2.0, 3.0]], device='cuda')

    # Compute softmax using PyTorch.
    torch_result = torch.softmax(x, dim=1)

    # Compute softmax using Triton.
    triton_result = triton_softmax(x)

    # Compare results.
    max_diff = torch.max(torch.abs(torch_result - triton_result))
    print(f"Maximum difference between PyTorch and Triton results: {max_diff:.2e}")

    # Check if results are close.
    is_close = torch.allclose(torch_result, triton_result, rtol=1e-5, atol=1e-5)
    print(f"Results are close: {is_close}")

Reference: https://github.com/Infatoshi/cuda-course/tree/master/08_Triton
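The softmax kernel in softmax.py pads each row up to BLOCK_SIZE with negative infinity (the `other` argument of `tl.load`). The sketch below reproduces that idea on the CPU in plain Python to show why the padding is harmless. It is an illustration under stated assumptions, not the Triton implementation, and the helper names `padded_softmax` and `reference_softmax` are hypothetical.

```python
import math


def reference_softmax(row):
    """Numerically stable softmax over a plain list of floats."""
    row_max = max(row)
    numerator = [math.exp(v - row_max) for v in row]
    denominator = sum(numerator)
    return [v / denominator for v in numerator]


def padded_softmax(row, block_size):
    """Softmax computed on a row padded to block_size with -inf (hypothetical helper)."""
    n_cols = len(row)
    assert block_size >= n_cols, "the whole row must fit in one block"
    # Padding lanes hold -inf, mirroring a masked load with other=-inf.
    padded = list(row) + [-math.inf] * (block_size - n_cols)
    row_max = max(padded)  # -inf never wins the max
    numerator = [math.exp(v - row_max) for v in padded]  # exp(-inf) is 0.0
    denominator = sum(numerator)  # padding lanes add nothing
    # "Masked store": keep only the first n_cols lanes.
    return [v / denominator for v in numerator[:n_cols]]


row = [1.0, 2.0, 3.0]
plain = reference_softmax(row)
padded = padded_softmax(row, block_size=8)
assert all(math.isclose(a, b) for a, b in zip(plain, padded))
assert math.isclose(sum(padded), 1.0)
```

The assertion in `padded_softmax` corresponds to the requirement that BLOCK_SIZE be at least n_cols. If the block were smaller than the row, the trailing columns would simply never be loaded.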

http://www.eeditor.cn/news/124748/