Accelerating Gaussian Blur with Group Shared Memory

Updated April 5, 2023

Gaussian blur revisited: a more practical Gaussian blur.

Why use a Compute Shader for Gaussian blur

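In an earlier post I said that the advantage of a Compute Shader is speed, because a GPU is far better at parallel computation than a CPU. But compared with a Fragment Shader, which also runs on the GPU, does a Compute Shader have no advantage at all? It does. The DX11 documentation describes Thread Group Shared Memory: every thread in a Compute Shader thread group can read the data in that group's shared memory very quickly, faster than sampling a texture. For a computation such as Gaussian blur, which needs a large number of texture samples, it is therefore much more efficient to cache the texture data in Group Shared Memory first and then read from Group Shared Memory repeatedly.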
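To put a rough number on that claim, here is a minimal Python sketch that counts texture reads per output pixel for a single 1D pass. It assumes the figures used later in this post (64 threads per group, 32 extra pixels on each side, 17 taps that each read two neighbouring texels for manual interpolation). The function names are made up for illustration, and the count ignores texture caching and hardware filtering, so it is an estimate of the saving rather than a measurement.

```python
def direct_reads_per_pixel(taps: int = 17, texels_per_tap: int = 2) -> int:
    """Texture reads per output pixel when every tap goes straight to the texture.

    texels_per_tap = 2 models manual interpolation between two neighbours;
    pass 1 to model a single hardware-filtered sample per tap.
    """
    return taps * texels_per_tap


def shared_reads_per_pixel(group_size: int = 64, max_radius: int = 32) -> float:
    """Texture reads per output pixel when the group first fills a shared tile.

    The group loads group_size + 2 * max_radius texels once and then
    serves every tap from that tile.
    """
    return (group_size + 2 * max_radius) / group_size


if __name__ == "__main__":
    print(direct_reads_per_pixel())  # 34
    print(shared_reads_per_pixel())  # 2.0
```

Under these assumptions each pixel costs 2 texture reads instead of 34; the remaining reads are served from shared memory.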

Some related references can be found in NVIDIA's slides.

Implementing Gaussian blur in a Compute Shader

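The half-resolution optimisation is not used for now. The goal is to blur the RenderTexture identified by SrcIden and store the result in DestIden, which requires a temporary RenderTexture of the same size, BlurIden.
Decide the maximum blur radius in pixels, MAX_RADIUS. A larger radius needs more Group Shared Memory, and the amount of Group Shared Memory is capped (32768 bytes for cs_5_0). There is also little point in running a very wide Gaussian blur at full resolution. Here I set the maximum to 32 pixels, meaning the distance between the current pixel and its farthest sample must not exceed 32 (one pixel less when the samples are bilinearly interpolated).
Gaussian blur is usually done with a horizontal and a vertical 1D kernel, which calls for two Compute Shader kernels (they can be written as one, but that is harder to reason about). Here there are two kernels, one for the horizontal pass and one for the vertical pass. The horizontal kernel uses [numthreads(64, 1, 1)] (the maximum is 1024 threads per group) and the vertical kernel uses [numthreads(1, 64, 1)].
Because the blur reads the pixels to the left and right (or above and below) of each pixel, the shared array has to extend past the group's threads by MAX_RADIUS on both sides to hold the extra pixel data. Its size is therefore numthreads + 2 * MAX_RADIUS elements, which in this post is 64 + 2 * 32 float3 values.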
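The sizing rule in the last step can be checked with a few lines of arithmetic. The Python sketch below is only an illustration: the helper names are hypothetical, and it assumes a float3 occupies three 32-bit floats (12 bytes) with no padding in shared memory.

```python
FLOAT3_BYTES = 12            # assumption: three 32-bit floats, no padding
CS_5_0_SHARED_LIMIT = 32768  # bytes of group shared memory, as quoted in the text


def tile_length(group_size: int, max_radius: int) -> int:
    """Number of elements in the shared array: the group plus both aprons."""
    return group_size + 2 * max_radius


def tile_bytes(group_size: int, max_radius: int) -> int:
    """Shared memory used by a float3 tile of that length."""
    return tile_length(group_size, max_radius) * FLOAT3_BYTES


def tile_index(thread_index: int, offset: int, max_radius: int) -> int:
    """Slot that holds the pixel `offset` pixels away from the thread's own pixel."""
    return thread_index + max_radius + offset


if __name__ == "__main__":
    print(tile_length(64, 32))     # 128
    print(tile_bytes(64, 32))      # 1536
    print(tile_index(0, -32, 32))  # 0
    print(tile_index(63, 32, 32))  # 127
```

With 64 threads and MAX_RADIUS = 32 the tile takes 1536 bytes, far below the 32768-byte limit, and the farthest offsets of the first and last thread land exactly on the first and last slot.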

GaussianBlur.cs

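The Unity SRP pass setup is omitted here; only the blur-related code is shown. halfRes is a field set elsewhere in the pass and has to hold the pixel size of the textures being blurred, because it feeds both _TextureSize and the dispatch size. The 17 taps are spaced blurRadius / 8 pixels apart, so when blurRadius is too large the individual samples become clearly visible in the result.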

private void DoGaussianBlurHorizontal(CommandBuffer cmd, RenderTargetIdentifier srcid, RenderTargetIdentifier dstid, ComputeShader computeShader, float blurRadius)
{
    int gaussianBlurKernel = computeShader.FindKernel("GaussianBlurHorizontalMain");

    computeShader.GetKernelThreadGroupSizes(gaussianBlurKernel, out uint x, out uint y, out uint z);
    cmd.SetComputeTextureParam(computeShader, gaussianBlurKernel, "_InputTexture", srcid);
    cmd.SetComputeTextureParam(computeShader, gaussianBlurKernel, "_OutputTexture", dstid);
    cmd.SetComputeFloatParam(computeShader, "_BlurRadius", blurRadius);
    cmd.SetComputeVectorParam(computeShader, "_TextureSize", new Vector4(halfRes.x, halfRes.y, 1f / halfRes.x, 1f / halfRes.y));
    cmd.DispatchCompute(computeShader, gaussianBlurKernel,
        Mathf.CeilToInt((float)halfRes.x / x),
        Mathf.CeilToInt((float)halfRes.y / y),
        1);
}

private void DoGaussianBlurVertical(CommandBuffer cmd, RenderTargetIdentifier srcid, RenderTargetIdentifier dstid, ComputeShader computeShader, float blurRadius)
{
    int gaussianBlurKernel = computeShader.FindKernel("GaussianBlurVerticalMain");

    computeShader.GetKernelThreadGroupSizes(gaussianBlurKernel, out uint x, out uint y, out uint z);
    cmd.SetComputeTextureParam(computeShader, gaussianBlurKernel, "_InputTexture", srcid);
    cmd.SetComputeTextureParam(computeShader, gaussianBlurKernel, "_OutputTexture", dstid);
    cmd.SetComputeFloatParam(computeShader, "_BlurRadius", blurRadius);
    cmd.SetComputeVectorParam(computeShader, "_TextureSize", new Vector4(halfRes.x, halfRes.y, 1f / halfRes.x, 1f / halfRes.y));
    cmd.DispatchCompute(computeShader, gaussianBlurKernel,
        Mathf.CeilToInt((float)halfRes.x / x),
        Mathf.CeilToInt((float)halfRes.y / y),
        1);
}

public override void Execute(ScriptableRenderContext context, ref RenderingData renderingData)
{
    CommandBuffer cmd = CommandBufferPool.Get(profilerTag); 
    context.ExecuteCommandBuffer(cmd);
    cmd.Clear();

    using (new ProfilingScope(cmd, gaussianBlurSampler))
    {
        DoGaussianBlurHorizontal(cmd, cameraColorIden, blurIden, depthOfFieldComputeShader, blurRadius);
        DoGaussianBlurVertical(cmd, blurIden, destIden, depthOfFieldComputeShader, blurRadius);
        cmd.Blit(destIden, cameraColorIden);
    }
 
    context.ExecuteCommandBuffer(cmd);
    cmd.Clear();
    CommandBufferPool.Release(cmd);
}
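The group counts passed to DispatchCompute above are a ceiling division of the texture size by the kernel's thread group size. As a sanity check, here is a small Python sketch of the same arithmetic. The 1920 x 1080 size is only an example, not a value from this project, and group_counts is a hypothetical helper.

```python
def group_counts(width: int, height: int, threads_x: int, threads_y: int) -> tuple:
    """Thread groups needed to cover width x height, rounding up on both axes."""
    groups_x = (width + threads_x - 1) // threads_x
    groups_y = (height + threads_y - 1) // threads_y
    return groups_x, groups_y


if __name__ == "__main__":
    # Horizontal kernel, numthreads(64, 1, 1)
    print(group_counts(1920, 1080, 64, 1))  # (30, 1080)
    # Vertical kernel, numthreads(1, 64, 1)
    print(group_counts(1920, 1080, 1, 64))  # (1920, 17)
```

In the vertical case 17 groups cover 1088 rows, so the last group has 8 threads that fall below the texture. This is why the kernels clamp every position they read.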

GaussianBlur.compute

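In a Compute Shader, clamping needs particular care: a pixel that falls outside the screen should take the colour of the nearest pixel on the screen edge. Also, because of subtractions such as currentPosition - int2(MAX_RADIUS, 0), uint variables cannot be used carelessly; an unsigned value that wraps around produces unexpected results that are hard to track down.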
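The wrap-around mentioned above is easy to reproduce on the CPU. The Python sketch below imitates 32-bit unsigned subtraction with a bit mask; it is a stand-in for the shader's integer behaviour, not shader code, and both function names are hypothetical.

```python
UINT32_MASK = 0xFFFFFFFF


def left_apron_x_unsigned(x: int, max_radius: int, width: int) -> int:
    """Left apron coordinate if the subtraction were done on uint values."""
    wrapped = (x - max_radius) & UINT32_MASK  # a negative result wraps to a huge value
    return min(max(wrapped, 0), width - 1)


def left_apron_x_signed(x: int, max_radius: int, width: int) -> int:
    """Left apron coordinate with signed arithmetic, as in the shader below."""
    return min(max(x - max_radius, 0), width - 1)


if __name__ == "__main__":
    print(left_apron_x_unsigned(5, 32, 1920))  # 1919
    print(left_apron_x_signed(5, 32, 1920))    # 0
```

With unsigned arithmetic the clamp picks the right edge of the image instead of the left edge, so the left side of the blur would silently pull in colours from the opposite side of the screen.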

#pragma kernel GaussianBlurHorizontalMain
#pragma kernel GaussianBlurVerticalMain

float _BlurRadius;
float4 _TextureSize;

Texture2D<float4> _InputTexture;
RWTexture2D<float4> _OutputTexture;

// 17 normalised tap weights; they sum to 1.
static const float gaussian17[17] =
{
    0.00002611081194810,
    0.00021522769030413,
    0.00133919168719865,
    0.00628987509902766,
    0.02229954363469697,
    0.05967667338326389,
    0.12055019394312867,
    0.18381709484250766,
    0.21157217927735517,
    0.18381709484250766,
    0.12055019394312867,
    0.05967667338326389,
    0.02229954363469697,
    0.00628987509902766,
    0.00133919168719865,
    0.00021522769030413,
    0.00002611081194810,
};

#define MAX_RADIUS 32
groupshared float3 gs_Color[64 + 2 * MAX_RADIUS];

[numthreads(64, 1, 1)]
void GaussianBlurHorizontalMain(uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex, uint3 dispatchThreadID : SV_DispatchThreadID)
{
    int2 currentPosition = dispatchThreadID.xy;
    int2 tempPosition = clamp(currentPosition, 0, _TextureSize.xy - 1);
    gs_Color[groupIndex + MAX_RADIUS] = _InputTexture.Load(uint3(tempPosition, 0)).rgb;

    if (groupIndex < MAX_RADIUS)
    {
        int2 extraSample = currentPosition - int2(MAX_RADIUS, 0);
        extraSample = clamp(extraSample, 0, _TextureSize.xy - 1);
        gs_Color[groupIndex] = _InputTexture.Load(uint3(extraSample, 0)).rgb;
    }

    if (groupIndex >= 64 - MAX_RADIUS)
    {
        int2 extraSample = currentPosition + int2(MAX_RADIUS, 0);
        extraSample = clamp(extraSample, 0, _TextureSize.xy - 1);
        gs_Color[groupIndex + 2 * MAX_RADIUS] = _InputTexture.Load(uint3(extraSample, 0)).rgb;
    }
    GroupMemoryBarrierWithGroupSync();

    float3 color = 0;
    for (uint i = 0; i < 17; i++)
    {
        float weight = gaussian17[i];
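        // Cap the radius at MAX_RADIUS - 1: the interpolation below also reads floorInt + 1,
        // which would otherwise run one element past the end of gs_Color.
        float sampleOffset = ((float)i - 8) * min(_BlurRadius, MAX_RADIUS - 1) * 0.125;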
        int floorInt = floor(sampleOffset);
        float lerpValue = sampleOffset - floorInt;
        float3 sampleColorFloor = gs_Color[groupIndex + MAX_RADIUS + floorInt];
        float3 sampleColorCeil = gs_Color[groupIndex + MAX_RADIUS + floorInt + 1];
        float3 sampleColor = lerp(sampleColorFloor, sampleColorCeil, lerpValue);
        color += sampleColor * weight;
    }

    _OutputTexture[dispatchThreadID.xy] = float4(color, 1);
}

[numthreads(1, 64, 1)]
void GaussianBlurVerticalMain(uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex, uint3 dispatchThreadID : SV_DispatchThreadID)
{
    int2 currentPosition = dispatchThreadID.xy;
    int2 tempPosition = clamp(currentPosition, 0, _TextureSize.xy - 1);
    gs_Color[groupIndex + MAX_RADIUS] = _InputTexture.Load(uint3(tempPosition, 0)).rgb;

    if (groupIndex < MAX_RADIUS)
    {
        int2 extraSample = currentPosition - int2(0, MAX_RADIUS);
        extraSample = clamp(extraSample, 0, _TextureSize.xy - 1);
        gs_Color[groupIndex] = _InputTexture.Load(uint3(extraSample, 0)).rgb;
    }

    if (groupIndex >= 64 - MAX_RADIUS)
    {
        int2 extraSample = currentPosition + int2(0, MAX_RADIUS);
        extraSample = clamp(extraSample, 0, _TextureSize.xy - 1);
        gs_Color[groupIndex + 2 * MAX_RADIUS] = _InputTexture.Load(uint3(extraSample, 0)).rgb;
    }
    GroupMemoryBarrierWithGroupSync();

    float3 color = 0;
    for (uint i = 0; i < 17; i++)
    {
        float weight = gaussian17[i];
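        // Cap the radius at MAX_RADIUS - 1: the interpolation below also reads floorInt + 1,
        // which would otherwise run one element past the end of gs_Color.
        float sampleOffset = ((float)i - 8) * min(_BlurRadius, MAX_RADIUS - 1) * 0.125;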
        int floorInt = floor(sampleOffset);
        float lerpValue = sampleOffset - floorInt;
        float3 sampleColorFloor = gs_Color[groupIndex + MAX_RADIUS + floorInt];
        float3 sampleColorCeil = gs_Color[groupIndex + MAX_RADIUS + floorInt + 1];
        float3 sampleColor = lerp(sampleColorFloor, sampleColorCeil, lerpValue);
        color += sampleColor * weight;
    }

    _OutputTexture[dispatchThreadID.xy] = float4(color, 1);
}
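When debugging kernels like the two above, a CPU reference can help. The following Python sketch mirrors the horizontal kernel for a single row: it fills a tile per group with clamped loads, then accumulates the 17 taps with manual interpolation. It is a simplification under stated assumptions: it works on scalar values rather than float3 colours, blur_row is a hypothetical name, and it rejects radii above MAX_RADIUS - 1 instead of reading out of range. It is not how the original code was tested.

```python
import math

GROUP_SIZE = 64
MAX_RADIUS = 32
GAUSSIAN17 = [
    0.00002611081194810, 0.00021522769030413, 0.00133919168719865,
    0.00628987509902766, 0.02229954363469697, 0.05967667338326389,
    0.12055019394312867, 0.18381709484250766, 0.21157217927735517,
    0.18381709484250766, 0.12055019394312867, 0.05967667338326389,
    0.02229954363469697, 0.00628987509902766, 0.00133919168719865,
    0.00021522769030413, 0.00002611081194810,
]


def blur_row(row, blur_radius):
    """Blur one row of scalars the way the horizontal kernel does, group by group."""
    if not 0 <= blur_radius <= MAX_RADIUS - 1:
        raise ValueError("blur_radius must be within [0, MAX_RADIUS - 1]")
    width = len(row)
    out = [0.0] * width
    for group_start in range(0, width, GROUP_SIZE):
        # Slot s holds pixel group_start - MAX_RADIUS + s, clamped to the row.
        tile = [
            row[min(max(group_start - MAX_RADIUS + s, 0), width - 1)]
            for s in range(GROUP_SIZE + 2 * MAX_RADIUS)
        ]
        for t in range(GROUP_SIZE):
            x = group_start + t
            if x >= width:
                break  # threads past the edge have nothing to write
            acc = 0.0
            for i, weight in enumerate(GAUSSIAN17):
                offset = (i - 8) * blur_radius * 0.125
                lo = math.floor(offset)
                frac = offset - lo
                a = tile[t + MAX_RADIUS + lo]
                b = tile[t + MAX_RADIUS + lo + 1]
                acc += (a + (b - a) * frac) * weight
            out[x] = acc
    return out


if __name__ == "__main__":
    flat = blur_row([0.5] * 100, 8.0)
    # A constant row should stay constant because the weights sum to 1.
    print(max(abs(v - 0.5) for v in flat) < 1e-6)  # True
```

Feeding the same row to the shader and to this function and comparing the results is one way to catch indexing mistakes around group boundaries and image edges.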

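Limitations and thoughts

Group Shared Memory really is central to optimising Compute Shaders, but using it makes the code somewhat harder to read, and the sight of GroupMemoryBarrierWithGroupSync() in particular brings on a vague sense of dread. It does widen the range of possible designs, though, a bit like glimpsing a whole new landscape through a small hole. Gaussian blur also raises the question of bilinear sampling: with a 2D kernel in particular, the bilinear interpolation has to be written by hand, which is not easy to debug.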
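For the 2D case mentioned above, hand-written bilinear interpolation over a cached tile could look roughly like the Python sketch below. It is an illustration only: the tile is a plain list of rows of scalars, bilinear is a hypothetical helper, and the caller is assumed to keep one extra row and column available beyond the sampled position, the same one-pixel margin the 1D kernels need.

```python
import math


def bilinear(tile, x: float, y: float) -> float:
    """Interpolate a 2D tile (list of rows) at the fractional position (x, y)."""
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = tile[y0][x0] * (1 - fx) + tile[y0][x0 + 1] * fx
    bottom = tile[y0 + 1][x0] * (1 - fx) + tile[y0 + 1][x0 + 1] * fx
    return top * (1 - fy) + bottom * fy


if __name__ == "__main__":
    tile = [[0.0, 1.0],
            [2.0, 3.0]]
    print(bilinear(tile, 0.5, 0.5))    # 1.5
    print(bilinear(tile, 0.25, 0.75))  # 1.75
```

Checking a helper like this against a tile that holds a linear ramp, where the expected value is known in closed form, is a cheap way to catch swapped axes or off-by-one indices.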
