Updated April 5, 2023

Gaussian blur revisited: a more practical Gaussian blur.

Why use a Compute Shader for Gaussian blur

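In an earlier post I said that the advantage of a Compute Shader is speed, because the GPU is far better at parallel computation than the CPU. But compared with something like a Fragment Shader, which also runs on the GPU, does a Compute Shader have no advantage at all? The answer is no. The DX11 documentation describes a concept called Thread Group Shared Memory: every Thread in a Compute Shader Thread Group can access the data in its Group Shared Memory very quickly, faster than sampling a texture. So for something like a Gaussian blur, which needs a large number of texture samples, it is much more efficient to cache the texture data in Group Shared Memory first and then read from Group Shared Memory many times.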
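
To put rough numbers on this, here is a back-of-the-envelope count in C#. It assumes the setup used later in this post: 64 threads per group, MAX_RADIUS = 32, and 17 taps that each read two neighbouring texels. The class and method names are made up for illustration, and the count ignores the texture cache, so treat it as an upper bound on the saving rather than a measurement.

```csharp
using System;

// Hypothetical helper, not part of the original project.
public static class FetchCountSketch
{
    // Texture reads per thread group when every tap reads the texture directly.
    public static int DirectReads(int threadsPerGroup, int taps, int texelsPerTap)
        => threadsPerGroup * taps * texelsPerTap;

    // Texture reads per thread group when the group first fills shared memory:
    // one texel per thread plus maxRadius extra texels on each side.
    public static int SharedMemoryReads(int threadsPerGroup, int maxRadius)
        => threadsPerGroup + 2 * maxRadius;

    public static void Main()
    {
        int direct = DirectReads(64, 17, 2);    // 64 * 17 * 2 = 2176
        int cached = SharedMemoryReads(64, 32); // 64 + 2 * 32 = 128
        Console.WriteLine($"direct: {direct}, via shared memory: {cached}");
    }
}
```

After the initial 128 loads, every tap is served from Group Shared Memory instead of the texture unit.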

Some related references can be found in NVIDIA's slides.

How to do a Gaussian blur with a Compute Shader

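  1. For now the half-resolution optimization is not used. The goal is to blur the RenderTexture behind SrcIden and store the result in DestIden, which requires one temporary RenderTexture of the same size, BlurIden.
  2. Decide the maximum blur width in pixels, MAX_RADIUS. A wider blur needs more Group Shared Memory, and Group Shared Memory has an upper limit (32768 bytes for cs_5_0). There is also no real need for a very wide blur at full resolution. Here I set the maximum to 32 pixels, meaning the distance between the current pixel and its farthest sample cannot exceed 32 (with bilinear sampling it has to shrink by one more pixel).
  3. A Gaussian blur is usually done with a horizontal and a vertical kernel, which calls for two Compute Shader Kernels (they can be merged into one, but that is harder to reason about). Here there are two Kernels, one for the horizontal pass and one for the vertical pass. The horizontal Kernel uses [numthreads(64, 1, 1)] (the thread count per group can go up to 1024), and the vertical Kernel uses [numthreads(1, 64, 1)].
  4. Because the blur samples pixels to the left and right (or above and below) of each pixel, the Group Shared Memory has to extend beyond the group's threads by MAX_RADIUS on each side to hold the extra pixels. The required size is numthreads + 2 * MAX_RADIUS elements, which in this post is 64 + 2 * 32 float3 values.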
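
The size arithmetic from the list above can be checked with a small C# sketch. It assumes a float3 takes 12 bytes (three 32-bit floats) of Group Shared Memory; the class name and constants are hypothetical and only mirror the numbers quoted in this post.

```csharp
using System;

// Hypothetical helper, not part of the original project.
public static class SharedMemoryBudgetSketch
{
    public const int Cs50LimitBytes = 32768;          // limit quoted above for cs_5_0
    public const int Float3Bytes = 3 * sizeof(float); // assumed: 12 bytes per float3

    // Elements needed: one per thread plus maxRadius extra on each side.
    public static int ElementCount(int threadsPerGroup, int maxRadius)
        => threadsPerGroup + 2 * maxRadius;

    public static int Bytes(int threadsPerGroup, int maxRadius)
        => ElementCount(threadsPerGroup, maxRadius) * Float3Bytes;

    public static void Main()
    {
        int elements = ElementCount(64, 32); // 128
        int bytes = Bytes(64, 32);           // 128 * 12 = 1536
        Console.WriteLine($"{elements} float3 values, {bytes} of {Cs50LimitBytes} bytes");
    }
}
```

Under these assumptions the array uses 1536 bytes, far below the limit, so the choice of MAX_RADIUS = 32 is driven by how wide a blur is useful rather than by memory.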

GaussianBlur.cs

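The Unity SRP pass setup is omitted here; only the operations related to the Gaussian blur are shown. In the code, halfRes is the field that holds the size of the textures being processed. When blurRadius gets too large, the individual samples become clearly visible as artifacts.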

private void DoGaussianBlurHorizontal(CommandBuffer cmd, RenderTargetIdentifier srcid, RenderTargetIdentifier dstid, ComputeShader computeShader, float blurRadius)
{
    int gaussianBlurKernel = computeShader.FindKernel("GaussianBlurHorizontalMain");

    // Read the [numthreads(x, y, z)] declared on the kernel instead of hard-coding it.
    computeShader.GetKernelThreadGroupSizes(gaussianBlurKernel, out uint x, out uint y, out uint z);
    cmd.SetComputeTextureParam(computeShader, gaussianBlurKernel, "_InputTexture", srcid);
    cmd.SetComputeTextureParam(computeShader, gaussianBlurKernel, "_OutputTexture", dstid);
    cmd.SetComputeFloatParam(computeShader, "_BlurRadius", blurRadius);
    // xy: texture size in pixels, zw: size of one texel.
    cmd.SetComputeVectorParam(computeShader, "_TextureSize", new Vector4(halfRes.x, halfRes.y, 1f / halfRes.x, 1f / halfRes.y));
    // Round up so that partially filled groups at the right and bottom edges are still dispatched.
    cmd.DispatchCompute(computeShader, gaussianBlurKernel,
        Mathf.CeilToInt((float)halfRes.x / x),
        Mathf.CeilToInt((float)halfRes.y / y),
        1);
}

// Same as the horizontal pass, except for the kernel name.
private void DoGaussianBlurVertical(CommandBuffer cmd, RenderTargetIdentifier srcid, RenderTargetIdentifier dstid, ComputeShader computeShader, float blurRadius)
{
    int gaussianBlurKernel = computeShader.FindKernel("GaussianBlurVerticalMain");

    computeShader.GetKernelThreadGroupSizes(gaussianBlurKernel, out uint x, out uint y, out uint z);
    cmd.SetComputeTextureParam(computeShader, gaussianBlurKernel, "_InputTexture", srcid);
    cmd.SetComputeTextureParam(computeShader, gaussianBlurKernel, "_OutputTexture", dstid);
    cmd.SetComputeFloatParam(computeShader, "_BlurRadius", blurRadius);
    cmd.SetComputeVectorParam(computeShader, "_TextureSize", new Vector4(halfRes.x, halfRes.y, 1f / halfRes.x, 1f / halfRes.y));
    cmd.DispatchCompute(computeShader, gaussianBlurKernel,
        Mathf.CeilToInt((float)halfRes.x / x),
        Mathf.CeilToInt((float)halfRes.y / y),
        1);
}

public override void Execute(ScriptableRenderContext context, ref RenderingData renderingData)
{
    CommandBuffer cmd = CommandBufferPool.Get(profilerTag); 
    context.ExecuteCommandBuffer(cmd);
    cmd.Clear();

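    using (new ProfilingScope(cmd, gaussianBlurSampler))
    {
        // Horizontal pass: camera color -> temporary blur texture.
        DoGaussianBlurHorizontal(cmd, cameraColorIden, blurIden, depthOfFieldComputeShader, blurRadius);
        // Vertical pass: temporary blur texture -> destination.
        DoGaussianBlurVertical(cmd, blurIden, destIden, depthOfFieldComputeShader, blurRadius);
        // Copy the blurred result back to the camera color target.
        cmd.Blit(destIden, cameraColorIden);
    }
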
    context.ExecuteCommandBuffer(cmd);
    cmd.Clear();
    CommandBufferPool.Release(cmd);
}
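
To make the dispatch sizes concrete, here is a small plain C# sketch of the same rounding-up calculation, using integer arithmetic instead of Mathf.CeilToInt. The 1920 x 1080 resolution is only an assumed example, and the helper name is hypothetical.

```csharp
using System;

// Hypothetical helper, not part of the original project.
public static class DispatchSizeSketch
{
    // Integer ceiling division: the number of groups needed to cover `pixels`
    // when each group covers `groupSize` pixels along that axis.
    public static int GroupCount(int pixels, int groupSize)
        => (pixels + groupSize - 1) / groupSize;

    public static void Main()
    {
        int width = 1920, height = 1080; // assumed example resolution

        // Horizontal kernel: numthreads(64, 1, 1)
        Console.WriteLine($"horizontal: {GroupCount(width, 64)} x {GroupCount(height, 1)} groups");
        // Vertical kernel: numthreads(1, 64, 1)
        Console.WriteLine($"vertical:   {GroupCount(width, 1)} x {GroupCount(height, 64)} groups");
    }
}
```

With these numbers the vertical pass needs 17 groups per column, and the last group reaches 8 rows past the bottom of the texture, which is one reason the kernels clamp every position before loading.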

GaussianBlur.compute

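In the Compute Shader, pay close attention to clamp: pixels outside the screen should use the color of the pixel at the screen edge. Also, because of subtractions such as currentPosition - int2(MAX_RADIUS, 0), uint variables cannot be used carelessly; otherwise, when something unexpected shows up, it is very hard to track down.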
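
The uint pitfall is easy to reproduce on the CPU. The sketch below is a C# stand-in for the shader arithmetic, not shader code: unsigned subtraction wraps around in C# the same way it does for uint in HLSL. The names and the 1920-pixel width are assumptions for illustration.

```csharp
using System;

// Hypothetical helper, not part of the original project.
public static class UintUnderflowSketch
{
    public const int MaxRadius = 32;

    // Unsigned version: the subtraction wraps around before the clamp runs,
    // so a pixel near the left edge ends up reading the right edge.
    public static uint LeftSampleUnsigned(uint x, uint width)
    {
        uint shifted = unchecked(x - (uint)MaxRadius);
        return Math.Min(shifted, width - 1);
    }

    // Signed version: the negative result is clamped back to 0 as intended.
    public static int LeftSampleSigned(int x, int width)
    {
        int shifted = x - MaxRadius;
        return Math.Max(0, Math.Min(shifted, width - 1));
    }

    public static void Main()
    {
        Console.WriteLine(LeftSampleUnsigned(5, 1920)); // 1919: wrong edge
        Console.WriteLine(LeftSampleSigned(5, 1920));   // 0: left edge
    }
}
```

This is why the kernels convert dispatchThreadID to int2 before doing any subtraction.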

#pragma kernel GaussianBlurHorizontalMain
#pragma kernel GaussianBlurVerticalMain

float _BlurRadius;
float4 _TextureSize;

Texture2D<float4> _InputTexture;
RWTexture2D<float4> _OutputTexture;

static float gaussian17[] =
{
    0.00002611081194810,
    0.00021522769030413,
    0.00133919168719865,
    0.00628987509902766,
    0.02229954363469697,
    0.05967667338326389,
    0.12055019394312867,
    0.18381709484250766,
    0.21157217927735517,
    0.18381709484250766,
    0.12055019394312867,
    0.05967667338326389,
    0.02229954363469697,
    0.00628987509902766,
    0.00133919168719865,
    0.00021522769030413,
    0.00002611081194810,
};

#define MAX_RADIUS 32
// 64 texels for the group's own threads plus MAX_RADIUS extra texels on each side.
groupshared float3 gs_Color[64 + 2 * MAX_RADIUS];

[numthreads(64, 1, 1)]
void GaussianBlurHorizontalMain(uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex, uint3 dispatchThreadID : SV_DispatchThreadID)
{
    // Signed coordinates, so the subtraction below can go negative before it is clamped.
    int2 currentPosition = dispatchThreadID.xy;
    int2 tempPosition = clamp(currentPosition, 0, _TextureSize.xy - 1);
    // Each thread caches its own texel in the middle part of the shared array.
    gs_Color[groupIndex + MAX_RADIUS] = _InputTexture.Load(uint3(tempPosition, 0)).rgb;

    // The first MAX_RADIUS threads also cache the texels to the left of the group.
    if (groupIndex < MAX_RADIUS)
    {
        int2 extraSample = currentPosition - int2(MAX_RADIUS, 0);
        extraSample = clamp(extraSample, 0, _TextureSize.xy - 1);
        gs_Color[groupIndex] = _InputTexture.Load(uint3(extraSample, 0)).rgb;
    }

    // The last MAX_RADIUS threads also cache the texels to the right of the group.
    if (groupIndex >= 64 - MAX_RADIUS)
    {
        int2 extraSample = currentPosition + int2(MAX_RADIUS, 0);
        extraSample = clamp(extraSample, 0, _TextureSize.xy - 1);
        gs_Color[groupIndex + 2 * MAX_RADIUS] = _InputTexture.Load(uint3(extraSample, 0)).rgb;
    }
    // Wait until the whole group has finished writing shared memory.
    GroupMemoryBarrierWithGroupSync();

    // Limit the radius so that floorInt + 1 never reads past the cached range.
    float radius = clamp(_BlurRadius, 0, MAX_RADIUS - 1);
    float3 color = 0;
    for (uint i = 0; i < 17; i++)
    {
        float weight = gaussian17[i];
        // The 17 taps are spread evenly over [-radius, radius].
        float sampleOffset = ((float)i - 8) * radius * 0.125;
        // Manual linear interpolation between the two nearest cached texels.
        int floorInt = floor(sampleOffset);
        float lerpValue = sampleOffset - floorInt;
        float3 sampleColorFloor = gs_Color[groupIndex + MAX_RADIUS + floorInt];
        float3 sampleColorCeil = gs_Color[groupIndex + MAX_RADIUS + floorInt + 1];
        float3 sampleColor = lerp(sampleColorFloor, sampleColorCeil, lerpValue);
        color += sampleColor * weight;
    }

    _OutputTexture[dispatchThreadID.xy] = float4(color, 1);
}

[numthreads(1, 64, 1)]
void GaussianBlurVerticalMain(uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex, uint3 dispatchThreadID : SV_DispatchThreadID)
{
    // Signed coordinates, so the subtraction below can go negative before it is clamped.
    int2 currentPosition = dispatchThreadID.xy;
    int2 tempPosition = clamp(currentPosition, 0, _TextureSize.xy - 1);
    // Each thread caches its own texel in the middle part of the shared array.
    gs_Color[groupIndex + MAX_RADIUS] = _InputTexture.Load(uint3(tempPosition, 0)).rgb;

    // The first MAX_RADIUS threads also cache the texels above the group.
    if (groupIndex < MAX_RADIUS)
    {
        int2 extraSample = currentPosition - int2(0, MAX_RADIUS);
        extraSample = clamp(extraSample, 0, _TextureSize.xy - 1);
        gs_Color[groupIndex] = _InputTexture.Load(uint3(extraSample, 0)).rgb;
    }

    // The last MAX_RADIUS threads also cache the texels below the group.
    if (groupIndex >= 64 - MAX_RADIUS)
    {
        int2 extraSample = currentPosition + int2(0, MAX_RADIUS);
        extraSample = clamp(extraSample, 0, _TextureSize.xy - 1);
        gs_Color[groupIndex + 2 * MAX_RADIUS] = _InputTexture.Load(uint3(extraSample, 0)).rgb;
    }
    // Wait until the whole group has finished writing shared memory.
    GroupMemoryBarrierWithGroupSync();

    // Limit the radius so that floorInt + 1 never reads past the cached range.
    float radius = clamp(_BlurRadius, 0, MAX_RADIUS - 1);
    float3 color = 0;
    for (uint i = 0; i < 17; i++)
    {
        float weight = gaussian17[i];
        // The 17 taps are spread evenly over [-radius, radius].
        float sampleOffset = ((float)i - 8) * radius * 0.125;
        // Manual linear interpolation between the two nearest cached texels.
        int floorInt = floor(sampleOffset);
        float lerpValue = sampleOffset - floorInt;
        float3 sampleColorFloor = gs_Color[groupIndex + MAX_RADIUS + floorInt];
        float3 sampleColorCeil = gs_Color[groupIndex + MAX_RADIUS + floorInt + 1];
        float3 sampleColor = lerp(sampleColorFloor, sampleColorCeil, lerpValue);
        color += sampleColor * weight;
    }

    _OutputTexture[dispatchThreadID.xy] = float4(color, 1);
}
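
For anyone who wants a different tap count, the gaussian17 table can be regenerated. As far as I can tell from the numbers, the values are consistent with sampling exp(-(3t)^2) at t = k / 8 for k = -8..8 and normalizing so the weights sum to 1. That formula is an inference from the table, not something taken from a reference, and the helper below with its parameter names is only a sketch.

```csharp
using System;

// Hypothetical helper, not part of the original project.
public static class GaussianWeightsSketch
{
    // Normalized weights exp(-(falloff * t)^2), sampled at t = k / halfTaps
    // for k = -halfTaps..halfTaps.
    public static double[] Compute(int halfTaps = 8, double falloff = 3.0)
    {
        int count = 2 * halfTaps + 1;
        var weights = new double[count];
        double sum = 0.0;
        for (int i = 0; i < count; i++)
        {
            double t = (i - halfTaps) / (double)halfTaps;
            weights[i] = Math.Exp(-(falloff * t) * (falloff * t));
            sum += weights[i];
        }
        // Normalize so that the blur preserves overall brightness.
        for (int i = 0; i < count; i++)
        {
            weights[i] /= sum;
        }
        return weights;
    }

    public static void Main()
    {
        foreach (double w in Compute())
        {
            Console.WriteLine(w.ToString("F17"));
        }
    }
}
```

A larger falloff concentrates the weight near the center; a smaller one gives the outer taps more influence, at the cost of a harder cutoff at the edge of the radius.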

Some limitations and thoughts

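Group Shared Memory really does play a key role in optimizing Compute Shaders, but once you use it, the code becomes somewhat less readable; seeing GroupMemoryBarrierWithGroupSync() in particular brings on a vague sense of dread. It does broaden the ways you can design the code, though, a bit like glimpsing a whole new world through a small hole. Gaussian blur also brings the bilinear sampling problem: especially with a 2D kernel, you have to do the bilinear sampling yourself, which is not easy to debug.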
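
One thing that may help with debugging is a CPU reference for a single row that can be compared against the GPU output. The C# sketch below mirrors the tap loop of the kernels above (evenly spread taps, manual interpolation between the two nearest texels, clamping at the edges) on a single-channel row. It is a hypothetical helper written for this post, and it assumes blurRadius stays within what the kernel supports.

```csharp
using System;

// Hypothetical CPU reference, not part of the original project.
public static class BlurReferenceSketch
{
    // Clamp an index to the valid range, like the clamp before Load in the shader.
    static int ClampIndex(int index, int length)
        => Math.Max(0, Math.Min(index, length - 1));

    public static float[] BlurRow(float[] row, float[] weights, float blurRadius)
    {
        int half = weights.Length / 2; // 8 for the 17-tap table
        var result = new float[row.Length];
        for (int x = 0; x < row.Length; x++)
        {
            float color = 0f;
            for (int i = 0; i < weights.Length; i++)
            {
                // Taps are spread evenly over [-blurRadius, blurRadius].
                float sampleOffset = (i - half) * blurRadius / half;
                int floorInt = (int)Math.Floor(sampleOffset);
                float lerpValue = sampleOffset - floorInt;
                float lower = row[ClampIndex(x + floorInt, row.Length)];
                float upper = row[ClampIndex(x + floorInt + 1, row.Length)];
                color += weights[i] * (lower + (upper - lower) * lerpValue);
            }
            result[x] = color;
        }
        return result;
    }

    public static void Main()
    {
        // A single bright pixel, blurred with a tiny 3-tap kernel for readability.
        float[] row = { 0f, 0f, 4f, 0f, 0f };
        float[] weights = { 0.25f, 0.5f, 0.25f };
        Console.WriteLine(string.Join(", ", BlurRow(row, weights, 1f)));
    }
}
```

Running the same input through the horizontal kernel and comparing it against this function row by row narrows down whether a problem lies in the shared memory indexing or in the interpolation.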