Real-Time GPU Skinning with Animator Controller Support

Why Skin on the GPU

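For a SkinnedMeshRenderer, skinning works per vertex. For each of the four bones that influence the vertex, the bone-space-to-object-space matrix \(M_{bone\_localtoobject}\) is computed first, and then \(M_{bone\_localtoobject} * M_{bone\_bindpose} * Vertex_{objectspace}\) gives the vertex position and tangent as moved by that bone's translation, rotation, and scale; normals use the inverse transpose of the same matrix. The four resulting positions, normals, and tangents are then blended by their weights to produce the final skinned vertex. In the usual rendering path all of this runs on the CPU, and the finished vertex data is handed to the GPU for rendering. When the vertex count is high and the work is mostly matrix math, the CPU is less efficient at skinning than the highly parallel GPU, which is why moving skinning to the GPU is worth considering.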
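The blend described above can be written compactly. This is a sketch of standard linear blend skinning in this post's notation, assuming the four weights are normalized; the symbols \(M_i\), \(w_i\), \(v\), \(n\), \(v'\), and \(n'\) are introduced here for illustration only:

```latex
\[
M_i = M_{bone\_localtoobject,\,i} \, M_{bone\_bindpose,\,i}, \qquad
v' = \sum_{i=1}^{4} w_i \, M_i \, v, \qquad
n' = \operatorname{normalize}\Big( \sum_{i=1}^{4} w_i \, (M_i^{-1})^{T} \, n \Big), \qquad
\sum_{i=1}^{4} w_i = 1
\]
```

Here \(v\) is the object-space vertex position in homogeneous form and \(n\) the object-space normal; tangents are blended like positions but use only the rotation and scale part of each \(M_i\).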

Some Thoughts on GPU Skinning

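From the above, the following data has to go from the CPU to the GPU. First, the products \(M_{bone\_localtoobject} * M_{bone\_bindpose}\), one float4x4 matrix per bone; since the last row is always (0, 0, 0, 1), they can be sent as float3x4 matrices instead, and they must be uploaded every frame. Second, the bone indices and bone weights of each vertex. The index, an integer, selects a matrix from the bone matrix array, and the weight is a fraction in [0, 1], so the two can be combined into a single float as \(BoneIndex + BoneWeight * 0.5\). Each vertex then carries its four indices and weights in one float4, which can be stored in a UV channel or passed to the GPU as an array; this per-vertex float4 data only needs to be uploaded once. Third, the model's own vertex positions, normals, and tangents, which the engine passes to the GPU automatically.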
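The index-plus-weight packing described above can be sketched in plain C#, independent of Unity. The class and member names (BoneWeightPacking, Encode, Decode, WeightScale) are hypothetical and exist only for this illustration:

```csharp
using System;

// Hypothetical helper for illustration; not part of the scripts in this post.
public static class BoneWeightPacking
{
    // Scaling the weight by 0.5 keeps the fractional part within [0, 0.5],
    // so a weight of exactly 1 can never spill over into the integer bone index.
    public const float WeightScale = 0.5f;

    // Packs an integer bone index and a weight in [0, 1] into one float.
    public static float Encode(int boneIndex, float weight)
    {
        return boneIndex + weight * WeightScale;
    }

    // Splits a packed value back into its bone index and weight.
    public static void Decode(float packed, out int boneIndex, out float weight)
    {
        float whole = (float)Math.Floor(packed);
        boneIndex = (int)whole;                    // integer part is the index
        weight = (packed - whole) / WeightScale;   // fractional part, rescaled to [0, 1]
    }
}
```

Encoding index 7 with weight 1 gives 7.5, which decodes back to index 7 and weight 1. For indices below 256, a 32-bit float still leaves roughly 15 bits for the weight, which is usually enough for skinning.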

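In practice, the approaches usually found online bake the animation into a texture or a custom data structure. One option stores the vertex data directly, so no skinning is needed on the GPU at all, but the storage grows quickly with the vertex count. The other stores the bone transform matrices and skins on the GPU, which takes far less space. I do not think either is a good way to do GPU skinning, though: baking the animation into a texture or data structure gives up much of what Animator Controller offers, such as blending between two animations and animation events, and for animation that can cost more than it gains. So my goal is to keep Animator Controller's features, send the bone data to the GPU in real time, and do the skinning there.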

Implementing GPU Skinning

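My idea is to extract the bone IDs and weights from the SkinnedMeshRenderer offline, read each bone's matrix at runtime from the bones driven by the Animator Controller, and pass everything to an ordinary MeshRenderer so that skinning happens on the GPU. There is one small pitfall: for the same model in Unity, both the SkinnedMeshRenderer and the MeshRenderer expose boneweight and bindpose, but the bone order of the two sometimes differs. The safest approach is to discard both orderings and use the bone order in the Hierarchy, which guarantees that the boneindex, boneweight, and bonematrix passed to the GPU all agree.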
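Stripped of Unity types, the remapping idea looks roughly like the sketch below. It assumes bones can be identified by name and that the shared list already contains only bones used for skinning; BoneRemap and BuildBindIndices are hypothetical names introduced for illustration:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; bones are identified by name here.
public static class BoneRemap
{
    // hierarchyOrder: bone names in Hierarchy traversal order (the shared ordering).
    // rendererBones:  bone names in one renderer's own order.
    // Returns, for each renderer-local bone index, the index in the shared ordering,
    // or -1 when the bone is missing from the shared list.
    public static int[] BuildBindIndices(IList<string> hierarchyOrder, IList<string> rendererBones)
    {
        var sharedIndex = new Dictionary<string, int>();
        for (int i = 0; i < hierarchyOrder.Count; i++)
        {
            sharedIndex[hierarchyOrder[i]] = i;
        }

        int[] bindIndices = new int[rendererBones.Count];
        for (int i = 0; i < rendererBones.Count; i++)
        {
            int index;
            bindIndices[i] = sharedIndex.TryGetValue(rendererBones[i], out index) ? index : -1;
        }
        return bindIndices;
    }
}
```

With a shared order of Hips, Spine, Head and a renderer whose own order is Head, Hips, the result is {2, 0}: a vertex weight that refers to the renderer's bone 0 is rewritten to refer to shared bone 2.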

The model and animation used here are the hip hop dancing asset from mixamo.

BoneMatchInfo.cs

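This script runs offline. It combines all SkinnedMeshRenderers under a root GameObject with the bones in the Hierarchy and saves the result as an Asset for real-time GPU skinning. The Asset holds two kinds of information: BoneMatchNode records the name and bindpose of each bone in the Hierarchy bone list, and BindList (through its bindIndexs array) records, for every SkinnedMeshRenderer, where each of its bones sits in the Hierarchy bone list.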

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEditor;
using System;
using System.IO;

namespace GPUSkinning
{
    [System.Serializable]
    public class BoneMatchNode
    {
        //[HideInInspector]
        public string boneName;
        //boneIndex is set and used when looking up this bone's position among all Transforms
        public int boneIndex = 0;
        public Matrix4x4 bindPose;

        public BoneMatchNode(string _boneName)
        {
            boneName = _boneName;
            bindPose = Matrix4x4.identity;
        }
    }

    [System.Serializable]
    public class BindList
    {
        //[HideInInspector]
        public string skinnedMeshName;
        public int[] bindIndexs;

        public BindList(string _skinnedMeshName)
        {
            skinnedMeshName = _skinnedMeshName;
        }
    }


    public class BoneMatchInfo : ScriptableObject
    {   
        public BoneMatchNode[] boneMatchNodes;
        public BindList[] bindLists;
    }


    public class GenerateBoneMatchInfo : EditorWindow
    {
        public Transform rootBone;
        public Transform skinnedParent;
        public int skinnedMeshCount = 0;
        public SkinnedMeshRenderer[] smrArray;
        public BoneMatchInfo boneMatchInfo;

        private Rect topToolBarRect
        {
            get { return new Rect(20, 10, position.width - 40, position.height - 20); }
        }

        [MenuItem("zznewclear13/Generate Bone Match Info")]
        public static GenerateBoneMatchInfo GetWindow()
        {
            GenerateBoneMatchInfo window = GetWindow<GenerateBoneMatchInfo>();
            window.titleContent = new GUIContent("Generate Bone Match Info");
            window.Focus();
            window.Repaint();
            return window;
        }

        void OnGUI()
        {
            TopToolBar(topToolBarRect);
        }

        private void TopToolBar(Rect rect)
        {
            GUILayout.BeginArea(rect);
            rootBone = (Transform)EditorGUILayout.ObjectField("Root Bone Transform", rootBone, typeof(Transform), true);
            skinnedParent = (Transform)EditorGUILayout.ObjectField("Skinned Parent Transform", skinnedParent, typeof(Transform), true);

            if(skinnedParent!=null)
            {
                smrArray = skinnedParent.GetComponentsInChildren<SkinnedMeshRenderer>();
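                //GetComponentsInChildren can return an empty array, so guard smrArray[0] below
                if(smrArray != null && smrArray.Length > 0)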
                {
                    using (new EditorGUI.DisabledGroupScope(true))
                    {
                        EditorGUILayout.ObjectField("Skinned Mesh Renderers", smrArray[0], typeof(SkinnedMeshRenderer), false);
                        for (int i = 1; i < smrArray.Length; i++)
                        {
                            EditorGUILayout.ObjectField(" ", smrArray[i], typeof(SkinnedMeshRenderer), false);
                        }
                    }

                    using (new EditorGUI.DisabledGroupScope(smrArray.Length <= 0))
                    {
                        if (GUILayout.Button("Generate Animator Map", new GUILayoutOption[] { GUILayout.Height(30f) }))
                        {                           
                            boneMatchInfo = CompareBones();
                            //LogBindPoses();
                            Save();
                        }
                    }
                }
            }

            GUILayout.EndArea();
        }

        private BoneMatchInfo CompareBones()
        {
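            //ScriptableObjects must be created with CreateInstance rather than new
            BoneMatchInfo tempInfo = ScriptableObject.CreateInstance<BoneMatchInfo>();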

            Transform[] boneTrans = rootBone.GetComponentsInChildren<Transform>();
            List<BoneMatchNode> boneMatchNodeList = new List<BoneMatchNode>();
            List<BindList> bindLists = new List<BindList>();
            List<int[]> tempIntLists = new List<int[]>();

            List<Transform[]> smrBoneTransList = new List<Transform[]>();
            List<Matrix4x4[]> smrBindPoseList = new List<Matrix4x4[]>();
            foreach (SkinnedMeshRenderer smr in smrArray)
            {
                Transform[] smrBoneTrans = smr.bones;
                Matrix4x4[] smrBindPos = smr.sharedMesh.bindposes;
                smrBoneTransList.Add(smrBoneTrans);
                smrBindPoseList.Add(smrBindPos);
                tempIntLists.Add(new int[smr.bones.Length]);
            }

            int boneTranIndex = 0;
            foreach (Transform boneTran in boneTrans)
            {
                BoneMatchNode bmn = new BoneMatchNode(boneTran.name);
                bool isInSMRBones = false;
                for (int i = 0; i < smrBoneTransList.Count; i++)
                {
                    int index = Array.IndexOf(smrBoneTransList[i], boneTran);
                    if (index >= 0)
                    {
                        isInSMRBones = true;
                        bmn.bindPose = smrBindPoseList[i][index];
                        tempIntLists[i][index] = boneTranIndex;
                    }
                }

                if (isInSMRBones)
                {
                    bmn.boneIndex = boneTranIndex;
                    boneMatchNodeList.Add(bmn);
                    boneTranIndex++;
                }
            }

            for (int i = 0; i < smrArray.Length; i++)
            {
                bindLists.Add(new BindList(smrArray[i].name));
                bindLists[i].bindIndexs = tempIntLists[i];
            }

            tempInfo.boneMatchNodes = boneMatchNodeList.ToArray();
            tempInfo.bindLists = bindLists.ToArray();
            return tempInfo;
        }

        private void LogBindPoses()
        {
            using(StreamWriter sw = new StreamWriter("Assets/GPUSkinning/BindPoses.txt"))
            {
                foreach (SkinnedMeshRenderer smr in smrArray)
                {
                    Transform[] smrBoneTrans = smr.bones;
                    Matrix4x4[] smrBindPos = smr.sharedMesh.bindposes;
                    for (int j = 0; j < smrBoneTrans.Length; j++)
                    {
                        sw.WriteLine(smr.name + "\t" + smrBoneTrans[j].name + "\r\n" + smrBindPos[j].ToString());
                    }
                }
            }
        }

        private void Save()
        {
            AssetDatabase.CreateAsset(boneMatchInfo, "Assets/GPUSkinning/BoneMatchInfo.asset");
            AssetDatabase.Refresh();
            Debug.Log("<color=blue>Bone Match Info has been saved to Assets/GPUSkinning/BoneMatchInfo.asset.</color>");
        }
    }
}

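The resulting BoneMatchInfo.asset relates to the Hierarchy as shown in the figure; bones that take no part in the actual skinning do not need to be recorded in BoneMatchInfo.asset:

[Figure: BoneMatchInfo]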

BoneGPUSkinning.cs

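BoneGPUSkinning.cs has two jobs: write the bone indices and weights into a UV channel, which happens only once, and send the product of each bone matrix and its bindpose to the GPU, which happens every frame; I compute that product in a compute shader. As described earlier, we need \(M_{bone\_localtoobject}\), which is equivalent to \(M_{object\_worldtolocal} * M_{bone\_localtoworld}\). In practice, however, reading a bone's \(M_{bone\_localtoworld}\) matrix triggers extra computation, and that severely limits the efficiency of GPU skinning. Perhaps I am not familiar enough with the Unity API, or perhaps Unity simply does not expose the relevant API. In principle, when Unity applies the Animator's translation, rotation, and scale animation to every bone, it has already computed each bone's localToWorldMatrix, so reading it should be free. With no better option, I tried to speed up reading localToWorldMatrix with Unity Jobs and Unity Burst. In my tests this was at least about 50% faster than a plain for loop (I do not remember the exact number), but relative to the whole GPU skinning process the cost is still too high.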
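To make the split of work explicit, the full skinning matrix of one bone can be written as below. This only restates the paragraph above in the post's notation; \(M_{skin}\) is a name introduced here for illustration. The product in parentheses is what the compute shader evaluates once per bone, and the remaining multiplication by \(M_{object\_worldtolocal}\) is done later in the vertex shader:

```latex
\[
M_{skin} = M_{bone\_localtoobject} \, M_{bone\_bindpose}
         = M_{object\_worldtolocal} \, \big( M_{bone\_localtoworld} \, M_{bone\_bindpose} \big)
\]
```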

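This code was written quite a while ago and I have not gone back to clean it up. [ExecuteInEditMode] will spam errors until all references are assigned; once the references are correct, re-enabling the script makes the errors go away. Mishandling it also seems to be able to leak memory, though that has not been a serious problem.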

using System.Collections.Generic;
using UnityEngine;
using System;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine.Jobs;
using Unity.Burst;


namespace GPUSkinning
{

    [ExecuteInEditMode]
    public class BoneGPUSkinning : MonoBehaviour
    {
        public const int BONE_WEIGHT_DECODE_VALUE = 2;
        public const float BONE_WEIGHT_INVERSE_DECODE_VALUE = 0.5f;

        public ComputeShader computeShader;
        public Transform rootBone;
        public BoneMatchInfo boneMatchInfo;
        public List<MeshRenderer> meshRenderers;

        [SerializeField]
        private int boneSize;
        private Transform[] minBoneTrans;

        private TransformAccessArray transformAccessArray;

        #region ComputeShader
        int kernel;

        private Matrix4x4[] bindPosesArray;
        private Matrix4x4[] LTWMatrixArray;

        readonly int bindPoseID = Shader.PropertyToID("_BoneBindPoseBuffer");
        readonly int LTWMatrixID = Shader.PropertyToID("_BoneLTWMatrixBuffer");
        readonly int outputBufferID = Shader.PropertyToID("_BoneOutputBuffer");
        private ComputeBuffer outputBuffer;
        private ComputeBuffer bindPoseBuffer;
        private ComputeBuffer ltwMatrixBuffer;
        #endregion

        #region InitializeFunction

        void OnEnable()
        {
            Initialize();
        }

        /// <summary>
        /// Runs all initialization steps
        /// </summary>
        public void Initialize()
        {

            InitializeBoneTrans();

            InitializeBoneUV(1);

            InitializeComputeShader();

        }

        /// <summary>
        /// Collects the bones bound by all meshes and finds minBoneTrans among the children of rootBone
        /// </summary>
        private void InitializeBoneTrans()
        {
            //Map bone names to Transforms so each mesh's bones can be looked up
            Transform[] allTrans = rootBone.GetComponentsInChildren<Transform>();
            Dictionary<string, Transform> allTransDict = new Dictionary<string, Transform>();
            foreach (Transform tran in allTrans)
            {
                allTransDict.Add(tran.name, tran);
            }
            boneSize = boneMatchInfo.boneMatchNodes.Length;
            minBoneTrans = new Transform[boneSize];

            for (int i = 0; i < boneSize; i++)
            {
                Transform tempTran = allTransDict[boneMatchInfo.boneMatchNodes[i].boneName];
                boneMatchInfo.boneMatchNodes[i].boneIndex = i;
                minBoneTrans[i] = tempTran;
            }

            transformAccessArray = new TransformAccessArray(minBoneTrans);
        }

        /// <summary>
        /// Writes the bone indices and weights into the UV channel given by targetUVIndex
        /// </summary>
        /// <param name="targetUVIndex"></param>
        public void InitializeBoneUV(int targetUVIndex)
        {
            Dictionary<string, BindList> bindListDict = new Dictionary<string, BindList>();
            for (int i = 0; i < boneMatchInfo.bindLists.Length; i++)
            {
                bindListDict.Add(boneMatchInfo.bindLists[i].skinnedMeshName, boneMatchInfo.bindLists[i]);
            }
            for (int i = 0; i < meshRenderers.Count; i++)
            {
                BindList tempBindList;
                bool hasBindList = bindListDict.TryGetValue(meshRenderers[i].name, out tempBindList);
                if(!hasBindList)
                {
                    //Report the renderer that failed the lookup; bindLists[i] may not exist or match
                    throw new ArgumentException(String.Format("No BindList named {0} was found in BoneMatchInfo!",
                                                meshRenderers[i].name));
                }

                Mesh mesh = meshRenderers[i].GetComponent<MeshFilter>().sharedMesh;
                BoneWeight[] boneWeights = mesh.boneWeights;

                List<Vector4> boneAndWeights = new List<Vector4>();
                int[] bindIndexes = tempBindList.bindIndexs;
                foreach (BoneWeight weight in boneWeights)
                {
                    Vector4 boneAndWeight = new Vector4(0, 0, 0, 0);

                    //In the shader, every BoneUV component is resolved against the global bone index
                    boneAndWeight.x = bindIndexes[weight.boneIndex0] + weight.weight0 * BONE_WEIGHT_INVERSE_DECODE_VALUE;
                    boneAndWeight.y = bindIndexes[weight.boneIndex1] + weight.weight1 * BONE_WEIGHT_INVERSE_DECODE_VALUE;
                    boneAndWeight.z = bindIndexes[weight.boneIndex2] + weight.weight2 * BONE_WEIGHT_INVERSE_DECODE_VALUE;
                    boneAndWeight.w = bindIndexes[weight.boneIndex3] + weight.weight3 * BONE_WEIGHT_INVERSE_DECODE_VALUE;
                    boneAndWeights.Add(boneAndWeight);
                }
                mesh.SetUVs(targetUVIndex, boneAndWeights.ToArray());
            }
        }


        private void EnsureComputeBuffer(ref ComputeBuffer buffer, int count, int stride)
        {
            if (buffer != null)
            {
                buffer.Release();
            }

            buffer = new ComputeBuffer(count, stride);
        }

        /// <summary>
        /// Initializes the compute shader that multiplies each bone's matrix by its bindpose
        /// </summary>
        private void InitializeComputeShader()
        {
            bindPosesArray = new Matrix4x4[boneSize];
            LTWMatrixArray = new Matrix4x4[boneSize];
            Debug.Log(LTWMatrixArray.Length);
            for (int i = 0; i < boneSize; i++)
            {
                bindPosesArray[i] = boneMatchInfo.boneMatchNodes[i].bindPose;
            }

            kernel = computeShader.FindKernel("MatCompute");

            EnsureComputeBuffer(ref outputBuffer, boneSize, sizeof(float) * 16);
            EnsureComputeBuffer(ref bindPoseBuffer, boneSize, sizeof(float) * 16);
            EnsureComputeBuffer(ref ltwMatrixBuffer, boneSize, sizeof(float) * 16);

            bindPoseBuffer.SetData(bindPosesArray);
            computeShader.SetBuffer(kernel, bindPoseID, bindPoseBuffer);

            computeShader.SetBuffer(kernel, outputBufferID, outputBuffer);
        }

        #endregion

        #region Update Function
        void Update()
        {
            InvokeComputeShader();
            PassMeshRendererMatrix();
        }

        [BurstCompile(CompileSynchronously = true)]
        struct GetLocalToWorldMatrixStructJob : IJobParallelForTransform
        {
            public NativeArray<Matrix4x4> matArray;

            public void Execute(int i, TransformAccess transform)
            {
                matArray[i] = transform.localToWorldMatrix;
            }
        }


        private void InvokeComputeShader()
        {
            if(computeShader)
            {
#if true
                //NativeList<JobHandle> jobHandleList = new NativeList<JobHandle>(Allocator.Temp);
                //The array is disposed in the same frame, so TempJob is the appropriate allocator
                NativeArray<Matrix4x4> matArray = new NativeArray<Matrix4x4>(boneSize, Allocator.TempJob);

                GetLocalToWorldMatrixStructJob job = new GetLocalToWorldMatrixStructJob
                {
                    matArray = matArray
                };
                JobHandle jobHandle = IJobParallelForTransformExtensions.Schedule(job, transformAccessArray);// job.Schedule(transformAccessArray);
                jobHandle.Complete();

                ltwMatrixBuffer.SetData(matArray);
                matArray.Dispose();
#else
                for (int i = 0; i < boneSize; i++)
                {
                    LTWMatrixArray[i] = minBoneTrans[i].localToWorldMatrix;
                }
                ltwMatrixBuffer.SetData(LTWMatrixArray);
#endif

                computeShader.SetBuffer(kernel, LTWMatrixID, ltwMatrixBuffer);
                int dispatchCount = Mathf.CeilToInt(boneSize / 64f);
                computeShader.Dispatch(kernel, dispatchCount, 1, 1);
                Shader.SetGlobalBuffer("_BoneMatArray", outputBuffer);
            }
        }

        private void PassMeshRendererMatrix()
        {
            for (int i = 0; i < meshRenderers.Count; i++)
            {
                foreach (Material mat in meshRenderers[i].sharedMaterials)
                {
                    mat.SetMatrix("_BoneTransformMatrix", meshRenderers[i].transform.worldToLocalMatrix);
                }
            }

        }

#endregion

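        private void OnDisable()
        {
            //Release every buffer created in InitializeComputeShader; the null checks
            //also cover the case where initialization failed part-way
            outputBuffer?.Release();
            outputBuffer = null;
            bindPoseBuffer?.Release();
            bindPoseBuffer = null;
            ltwMatrixBuffer?.Release();
            ltwMatrixBuffer = null;
            if (transformAccessArray.isCreated)
            {
                transformAccessArray.Dispose();
            }
        }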
    }

}

BoneComputeShader.compute

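This compute shader does nothing but matrix multiplication and may not even be faster than doing it on the CPU, but I used a compute shader for it anyway. One small optimization remains: float4x4 could be changed to float3x4, at the cost of somewhat messier CPU-side code.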
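The CPU side of the float3x4 idea could look roughly like this. It is a plain C# sketch on row-major float arrays rather than Unity's Matrix4x4, and AffinePacking, Pack3x4, and Unpack4x4 are hypothetical names introduced for illustration:

```csharp
using System;

// Hypothetical helper for illustration; matrices are row-major float arrays (m[row * 4 + col]).
public static class AffinePacking
{
    // Drops the constant last row (0, 0, 0, 1) of an affine 4x4 matrix,
    // shrinking each bone from 16 floats (64 bytes) to 12 floats (48 bytes).
    public static float[] Pack3x4(float[] m)
    {
        if (m == null || m.Length != 16)
        {
            throw new ArgumentException("Expected a 4x4 matrix with 16 elements.");
        }
        float[] packed = new float[12];
        Array.Copy(m, 0, packed, 0, 12);   // rows 0..2 are kept as they are
        return packed;
    }

    // Rebuilds the full 4x4 matrix by restoring the implicit last row.
    public static float[] Unpack4x4(float[] packed)
    {
        if (packed == null || packed.Length != 12)
        {
            throw new ArgumentException("Expected a 3x4 matrix with 12 elements.");
        }
        float[] m = new float[16];         // new arrays are zero-initialized
        Array.Copy(packed, 0, m, 0, 12);
        m[15] = 1f;                        // last row becomes (0, 0, 0, 1)
        return m;
    }
}
```

The compute buffers would then use a stride of sizeof(float) * 12 instead of sizeof(float) * 16, and the shader side would need matching float3x4 buffer types.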

#pragma kernel MatCompute

StructuredBuffer<float4x4> _BoneBindPoseBuffer;
StructuredBuffer<float4x4> _BoneLTWMatrixBuffer;

RWStructuredBuffer<float4x4> _BoneOutputBuffer;

[numthreads(64, 1, 1)]
void MatCompute(uint3 id : SV_DispatchThreadID)
{
    _BoneOutputBuffer[id.x] = mul(_BoneLTWMatrixBuffer[id.x], _BoneBindPoseBuffer[id.x]);
}

BoneGPUSkinning.hlsl

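The flow is as follows. Each vertex fills a VertexInputStructure with the position, normal, and tangent from the MeshRenderer's original data plus the bone indices and weights we wrote in. Each index selects an entry of _BoneMatArray, which holds the product \(M_{bone\_localtoworld} * M_{bone\_bindpose}\) computed by the compute shader; that entry is then left-multiplied by _BoneTransformMatrix, the \(M_{object\_worldtolocal}\) mentioned earlier. The resulting matrix is used to compute the skinned position, tangent, and normal, where the normal needs the inverse transpose. Finally the four position, tangent, and normal results are blended by weight to give the final vertex position, tangent, and normal.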
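Why the normal needs the inverse transpose can be shown in a few steps. The sketch below assumes \(M\) is the upper-left 3x3 part of the skinning matrix and is invertible; \(t\), \(n\), \(n'\), and \(\operatorname{cof}\) are symbols introduced here for illustration. A surface direction \(t\) transforms as \(Mt\), and the transformed normal must stay perpendicular to it:

```latex
\[
n'^{T} (M t) = 0 \ \text{ whenever } \ n^{T} t = 0
\;\Longrightarrow\;
n' \propto (M^{-1})^{T} n,
\qquad \text{since } \big((M^{-1})^{T} n\big)^{T} M t = n^{T} M^{-1} M t = n^{T} t .
\]
\[
(M^{-1})^{T} = \frac{\operatorname{cof}(M)}{\det M},
\qquad
\operatorname{cof}(M)_{ij} = (-1)^{i+j} \det\big( M \text{ with row } i \text{ and column } j \text{ removed} \big)
\]
```

The cofactor form is what the InverseTranspose function evaluates entry by entry, so no general matrix inverse is needed in the shader.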

#ifndef BONE_GPU_SKINNING
#define BONE_GPU_SKINNING

#define BONE_WEIGHT_DECODE_VALUE 2

StructuredBuffer<float4x4> _BoneMatArray;
float4x4 _BoneTransformMatrix;

struct VertexInputStructure
{
    float3 positionOS;
    float3 normalOS;
    float4 tangentOS;
    float4 boneUV;
};

struct VertexOutputStructure
{
    float3 positionOS;
    float3 normalOS;
    float4 tangentOS;
};

inline float3x3 InverseTranspose(float3x3 mat)
{

    float determinant = mat._m00 * (mat._m11 * mat._m22 - mat._m12 * mat._m21)
                        - mat._m01 * (mat._m10 * mat._m22 - mat._m12 * mat._m20)
                        + mat._m02 * (mat._m10 * mat._m21 - mat._m11 * mat._m20);
    float3 vec0 = float3(mat._m11 * mat._m22 - mat._m12 * mat._m21,
                        mat._m12 * mat._m20 - mat._m10 * mat._m22,
                        mat._m10 * mat._m21 - mat._m11 * mat._m20);
    float3 vec1 = float3(mat._m02 * mat._m21 - mat._m01 * mat._m22,
                        mat._m00 * mat._m22 - mat._m02 * mat._m20,
                        mat._m01 * mat._m20 - mat._m00 * mat._m21);
    float3 vec2 = float3(mat._m01 * mat._m12 - mat._m02 * mat._m11,
                        mat._m02 * mat._m10 - mat._m00 * mat._m12,
                        mat._m00 * mat._m11 - mat._m01 * mat._m10);
    float3x3 returnMat;
    returnMat._m00_m01_m02 = vec0;
    returnMat._m10_m11_m12 = vec1;
    returnMat._m20_m21_m22 = vec2;
    return returnMat / determinant;
}

inline float3x3 InverseTransposeVec(float3 vec0, float3 vec1, float3 vec2)
{
    float3x3 mat;
    mat._m00_m01_m02 = vec0;
    mat._m10_m11_m12 = vec1;
    mat._m20_m21_m22 = vec2;
    return InverseTranspose(mat);
}

inline float4x4 ReadBoneInfos(uint boneIndex)
{
    return mul(_BoneTransformMatrix, _BoneMatArray[boneIndex]);
}

inline VertexOutputStructure BlendBonesPosNormalTangent(VertexInputStructure input)
{
    float4 positionOS = float4(input.positionOS, 1);
    float3 normalOS = input.normalOS;
    float4 tangentOS = input.tangentOS;
    float4 boneUV = input.boneUV;

    uint boneIndexOne = floor(boneUV.x);
    float boneWeightOne = BONE_WEIGHT_DECODE_VALUE * frac(boneUV.x);

    uint boneIndexTwo = floor(boneUV.y);
    float boneWeightTwo = BONE_WEIGHT_DECODE_VALUE * frac(boneUV.y);

    uint boneIndexThree = floor(boneUV.z);
    float boneWeightThree = BONE_WEIGHT_DECODE_VALUE * frac(boneUV.z);

    uint boneIndexFour = floor(boneUV.w);
    float boneWeightFour = BONE_WEIGHT_DECODE_VALUE * frac(boneUV.w);

    float4x4 matOne = ReadBoneInfos(boneIndexOne);
    float4x4 matTwo = ReadBoneInfos(boneIndexTwo);
    float4x4 matThree = ReadBoneInfos(boneIndexThree);
    float4x4 matFour = ReadBoneInfos(boneIndexFour);

    //blend position
    float3 posOne = mul(matOne, positionOS).xyz;
    float3 posTwo = mul(matTwo, positionOS).xyz;
    float3 posThree = mul(matThree, positionOS).xyz;
    float3 posFour = mul(matFour, positionOS).xyz;

    float3 returnPos = posOne * boneWeightOne
        + posTwo * boneWeightTwo
        + posThree * boneWeightThree
        + posFour * boneWeightFour;

    //blend normal
    float3x3 newMatOne = InverseTransposeVec(matOne._m00_m01_m02, matOne._m10_m11_m12, matOne._m20_m21_m22);
    float3x3 newMatTwo = InverseTransposeVec(matTwo._m00_m01_m02, matTwo._m10_m11_m12, matTwo._m20_m21_m22);
    float3x3 newMatThree = InverseTransposeVec(matThree._m00_m01_m02, matThree._m10_m11_m12, matThree._m20_m21_m22);
    float3x3 newMatFour = InverseTransposeVec(matFour._m00_m01_m02, matFour._m10_m11_m12, matFour._m20_m21_m22);

    float3 normalOne = mul(newMatOne, normalOS).xyz;
    float3 normalTwo = mul(newMatTwo, normalOS).xyz;
    float3 normalThree = mul(newMatThree, normalOS).xyz;
    float3 normalFour = mul(newMatFour, normalOS).xyz;

    float3 returnNormal = normalOne * boneWeightOne
        + normalTwo * boneWeightTwo
        + normalThree * boneWeightThree
        + normalFour * boneWeightFour;

    returnNormal = normalize(returnNormal);

    //blend tangent
    float3 tangentOne = mul((float3x3)matOne, tangentOS.xyz);
    float3 tangentTwo = mul((float3x3)matTwo, tangentOS.xyz);
    float3 tangentThree = mul((float3x3)matThree, tangentOS.xyz);
    float3 tangentFour = mul((float3x3)matFour, tangentOS.xyz);

    float3 tempTangent = tangentOne * boneWeightOne
                        + tangentTwo * boneWeightTwo
                        + tangentThree * boneWeightThree
                        + tangentFour * boneWeightFour;
    tempTangent = normalize(tempTangent);
    float4 returnTangent = float4(tempTangent, tangentOS.w);

    VertexOutputStructure output;
    output.positionOS = returnPos;
    output.normalOS = returnNormal;
    output.tangentOS = returnTangent;
    return output;
}
#endif

BoneGPUSkinningShader.shader

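The shader used for rendering calls the functions in BoneGPUSkinning.hlsl from its vertex shader to get the skinned position, tangent, and normal. I use fairly simple shading here, just enough to give the model some basic lighting. Note that correct shadows would require computing the skinned vertex position again in a ShadowCaster pass, which is left out here.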

Shader "zznewclear13/BoneGPUSkinningShader"
{
    Properties
    {
        _Color ("Color", Color) = (1,1,1,1)
        _MainTex ("Albedo (RGB)", 2D) = "white" {}
    }
    SubShader
    {
        Tags { "RenderType" = "Opaque" "RenderPipeline" = "UniversalPipeline" "IgnoreProjector" = "True" }

        HLSLINCLUDE

#include "Packages/com.unity.render-pipelines.universal/ShaderLibrary/Core.hlsl"
#include "Assets/GPUSkinning/BoneGPUSkinning.hlsl"

        sampler2D _MainTex;
        CBUFFER_START(UnityPerMaterial)
        float4 _Color;
        CBUFFER_END

        struct a2v
        {
            float4 vertex   : POSITION;
            float2 uv       : TEXCOORD0;
            float4 boneUV   : TEXCOORD1;
            float3 normal   : NORMAL;
            float4 tangent  : TANGENT;
        };

        struct v2f
        {
            float4 pos          : SV_POSITION;
            float2 uv           : TEXCOORD0;
            float4 tempColor    : TEXCOORD1;
            float3 normalWS     : TEXCOORD2;
            float4 tangentWS    : TEXCOORD3;
            float3 eyeVec       : TEXCOORD4;
        };

        v2f animVert(a2v v)
        {
            v2f o;
            VertexInputStructure inputStructure;
            inputStructure.positionOS = v.vertex.xyz;
            inputStructure.normalOS = v.normal;
            inputStructure.tangentOS = v.tangent;
            inputStructure.boneUV = v.boneUV;

            VertexOutputStructure outputStructure = BlendBonesPosNormalTangent(inputStructure);
            float3 vertexPos = outputStructure.positionOS;
            float3 vertexNormal = outputStructure.normalOS;
            float3 vertexTangent = outputStructure.tangentOS.xyz;

            o.pos = TransformObjectToHClip(vertexPos);
            o.normalWS = TransformObjectToWorldNormal(vertexNormal);
            o.tangentWS = float4(TransformObjectToWorldDir(vertexTangent), v.tangent.w);

            o.uv = v.uv;
            o.tempColor = float4(1, 1, 1, 1);
            float3 worldPos = TransformObjectToWorld(vertexPos);
            o.eyeVec = GetCameraPositionWS() - worldPos;

            return o;
        }

        float4 animFrag(v2f i) : SV_TARGET
        {
            float3 viewDir = normalize(i.eyeVec);
            float3 lightDir = normalize(float3(1, 1, 1));
            float3 halfVec = normalize(lightDir + viewDir);

            float3 normalWS = normalize(i.normalWS);
            float NdotL = dot(normalWS, lightDir) * 0.6 + 0.4;
            float NdotH = saturate(dot(normalWS, halfVec));

            float3 diffuseColor = _Color.xyz * NdotL;
            float3 specularColor = pow(NdotH, 30);
            return float4(diffuseColor + specularColor, 1);
        }

        ENDHLSL
        pass
        {
            Tags{ "LightMode" = "UniversalForward" }

            HLSLPROGRAM
            #pragma vertex animVert
            #pragma fragment animFrag
            ENDHLSL
        }
    }
}

Closing Thoughts

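Overall, the steps above largely meet the requirements of GPU skinning and work correctly together with the Animator component. There is still room for optimization, for example updating every mesh that needs GPU skinning from a single Update. The remaining flaw is the cost of reading localToWorldMatrix mentioned earlier, which makes this method less efficient than Unity's built-in GPU skinning. The built-in GPU skinning presumably relies on low-level optimizations inside Unity, but its core operation is probably much the same as what I do here.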
