+ Ray-Traced Shadows +

Shadows are a critical element in video game rendering. Shadow maps are a clever solution, but they often introduce artifacts such as shadow acne and aliasing. Fixing these issues requires a mix of simple and complex techniques. I wanted to try something different, so I experimented with ray-traced shadows. Here is what I did.

Code available on GitHub or download ZIP.

+ Overview +

This is a basic prototype with four main stages: depth pre-pass, shadow ray tracing, blur pass, and final lighting using the shadow map.

  1. Early Depth Testing
    • Render all geometry into a depth buffer.
    • Alpha-tested surfaces are currently rendered (shadows will be incorrect for these).
  2. Shadow Ray Tracing
    • Sample depth buffer and reconstruct world-space position.
    • Cast a ray toward the sun to test for occlusion.
    • Selectable ray density modes:
      • Full resolution (1 ray/pixel)
      • FullX_HalfY (alternating lines)
      • Half or quarter resolution using a Bayer dither pattern
    • Accumulate results over 1, 2, 4, or 16 frames to build up the final shadow map.
    • Write a binary occlusion value into the shadow render target.
  3. Blur Pass
    • Apply a two-pass, depth-aware Gaussian blur to the shadow render target.
    • Used for antialiasing the shadow edges.
    • The kernel size adapts based on depth, so distant pixels receive little to no blur to preserve detail.
  4. Color Pass
    • Sample the blurred shadow map to get the per-pixel shadow factor (0 = fully lit, 1 = fully shadowed).

+ Temporal Accumulation +

The render target that stores the shadows matches the swapchain size, but ray tracing can be performed at a lower resolution to improve performance. To maintain visual quality, the results are accumulated over multiple frames using a Bayer pattern. For full-resolution tracing, there is no accumulation. For FullX_HalfY, results are accumulated over 2 frames; for Half, over 4 frames; and for Quarter, over 16 frames. This approach allows the system to gradually fill in missing shadow information, trading off temporal stability for performance at lower resolutions. The dithering and accumulation logic are handled in the shader using frame indices.

One major issue with temporal accumulation is that it depends on the frame rate: the lower the FPS, the longer a full accumulation cycle takes, and the more visible the ghosting on shadows becomes. This is especially problematic with VSync, which caps the frame rate. For example, at 60 frames per second with quarter-resolution shadows accumulating over 16 frames, the ghosting is very noticeable. That's why I use FullX_HalfY by default: accumulating over 2 frames is good enough even when locked at 60 FPS.
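
To put numbers on this, here is a minimal sketch of the arithmetic, assuming one dither slot is traced per frame as described above. The function name RefreshLatencyMs is hypothetical and not part of the project.

```cpp
#include <cassert>
#include <cmath>

// Time (in milliseconds) until every pixel of the shadow render target has
// been re-traced once. framesToAccumulate is 1 (Full), 2 (FullX_HalfY),
// 4 (Half), or 16 (Quarter), as listed in the article.
double RefreshLatencyMs(int framesToAccumulate, double fps)
{
    assert(framesToAccumulate > 0 && fps > 0.0);
    return 1000.0 * framesToAccumulate / fps;
}

// True if two latencies match within a small tolerance (helper for checks).
bool NearlyEqual(double a, double b, double eps = 0.01)
{
    return std::fabs(a - b) < eps;
}
```

At 60 FPS this gives about 33.3 ms for FullX_HalfY and about 266.7 ms for Quarter, so a quarter-resolution shadow can lag more than a quarter of a second behind the scene.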

+ Implementation +

Please note that this implementation is a prototype. There is plenty of room for improvement, both in the algorithms and in the implementation, in terms of execution time as well as memory usage. I may address some of these after writing this article.

- Raytracing Pipeline -

Here is how the ray tracing pipeline is created. The recursion depth is set to one because only the ray generation shader traces rays: each ray starts at the surface position reconstructed from the depth buffer and is shot toward the sun.

CD3DX12_STATE_OBJECT_DESC raytracingPipeline{D3D12_STATE_OBJECT_TYPE_RAYTRACING_PIPELINE};
CD3DX12_DXIL_LIBRARY_SUBOBJECT* lib = raytracingPipeline.CreateSubobject<CD3DX12_DXIL_LIBRARY_SUBOBJECT>();

const SharedPtr<Shader> raytracingShader = LoadShader(IE_SHADER_TYPE_LIB, L"rtShadows.hlsl");

const D3D12_SHADER_BYTECODE libdxil{raytracingShader->bytecodes[AlphaMode_Opaque].pShaderBytecode, raytracingShader->bytecodes[AlphaMode_Opaque].BytecodeLength};
lib->SetDXILLibrary(&libdxil);
lib->DefineExport(L"Raygen");
lib->DefineExport(L"ClosestHit");
lib->DefineExport(L"Miss");

CD3DX12_HIT_GROUP_SUBOBJECT* hitGroup = raytracingPipeline.CreateSubobject<CD3DX12_HIT_GROUP_SUBOBJECT>();
hitGroup->SetClosestHitShaderImport(L"ClosestHit");
hitGroup->SetHitGroupExport(L"HitGroup");
hitGroup->SetHitGroupType(D3D12_HIT_GROUP_TYPE_TRIANGLES);

CD3DX12_RAYTRACING_SHADER_CONFIG_SUBOBJECT* shaderConfig = raytracingPipeline.CreateSubobject<CD3DX12_RAYTRACING_SHADER_CONFIG_SUBOBJECT>();
u32 payloadSize = 1 * sizeof(u32);   // struct RayPayload is just one uint
u32 attributeSize = 2 * sizeof(f32); // BuiltInTriangleIntersectionAttributes
shaderConfig->Config(payloadSize, attributeSize);

CD3DX12_GLOBAL_ROOT_SIGNATURE_SUBOBJECT* globalRootSignature = raytracingPipeline.CreateSubobject<CD3DX12_GLOBAL_ROOT_SIGNATURE_SUBOBJECT>();
globalRootSignature->SetRootSignature(m_RaytracingGlobalRootSignature.Get());

CD3DX12_RAYTRACING_PIPELINE_CONFIG_SUBOBJECT* pipelineConfig = raytracingPipeline.CreateSubobject<CD3DX12_RAYTRACING_PIPELINE_CONFIG_SUBOBJECT>();
constexpr u32 maxRecursionDepth = 1;
pipelineConfig->Config(maxRecursionDepth);

IE_Check(m_Device->CreateStateObject(raytracingPipeline, IID_PPV_ARGS(&m_DxrStateObject)));

- Bottom-Level Acceleration Structure (BLAS) -

Each BLAS is built with its geometry description (index and vertex formats, counts, GPU buffer addresses, etc.) and the flag to prefer fast trace over fast build.

I currently don't check for duplicate primitive geometry that could share the same BLAS; this is something I still need to implement. As a result, I currently have one BLAS per primitive.

Once the BLAS is built, I create an instance descriptor for it. The descriptor contains the object-to-world transform matrix and the GPU address of the BLAS. After generating all the instance descriptors, I upload them to the GPU to be used when building the TLAS.

const D3D12_RAYTRACING_GEOMETRY_DESC geometryDesc = {
  .Type = D3D12_RAYTRACING_GEOMETRY_TYPE_TRIANGLES,
  .Flags = D3D12_RAYTRACING_GEOMETRY_FLAG_OPAQUE,
  .Triangles =
  {
    .Transform3x4 = 0,
    .IndexFormat = DXGI_FORMAT_R32_UINT,
    .VertexFormat = DXGI_FORMAT_R32G32B32_FLOAT,
    .IndexCount = prim->m_Indices.Size(),
    .VertexCount = prim->m_Vertices.Size(),
    .IndexBuffer = prim->m_IndexBuffer->GetGPUVirtualAddress(),
    .VertexBuffer =
    {
      .StartAddress = prim->m_VertexBuffer->GetGPUVirtualAddress(),
      .StrideInBytes = sizeof(Vertex),
    },
  },
};

D3D12_RAYTRACING_ACCELERATION_STRUCTURE_PREBUILD_INFO blasPrebuildInfo;
const D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INPUTS blasInputs = {
    .Type = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL,
    .Flags = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PREFER_FAST_TRACE,
    .NumDescs = 1,
    .DescsLayout = D3D12_ELEMENTS_LAYOUT_ARRAY,
    .pGeometryDescs = &geometryDesc,
};
m_Device->GetRaytracingAccelerationStructurePrebuildInfo(&blasInputs, &blasPrebuildInfo);
IE_Assert(blasPrebuildInfo.ResultDataMaxSizeInBytes > 0);

AllocateUAVBuffer(static_cast<u32>(blasPrebuildInfo.ResultDataMaxSizeInBytes), prim->m_BLAS, D3D12_RESOURCE_STATE_RAYTRACING_ACCELERATION_STRUCTURE, L"BLAS");
AllocateUAVBuffer(static_cast<u32>(blasPrebuildInfo.ScratchDataSizeInBytes), prim->m_ScratchResource, D3D12_RESOURCE_STATE_UNORDERED_ACCESS, L"ScratchResource");

const D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC blasBuildDesc = {
    .DestAccelerationStructureData = prim->m_BLAS->GetGPUVirtualAddress(),
    .Inputs = blasInputs,
    .ScratchAccelerationStructureData = prim->m_ScratchResource->GetGPUVirtualAddress(),
};
cmdList->BuildRaytracingAccelerationStructure(&blasBuildDesc, 0, nullptr);

const float4x4 transform = prim->GetTransform();
const D3D12_RAYTRACING_INSTANCE_DESC instanceDesc = {
  .Transform =
  {
    {transform[0][0], transform[1][0], transform[2][0], transform[3][0]},
    {transform[0][1], transform[1][1], transform[2][1], transform[3][1]},
    {transform[0][2], transform[1][2], transform[2][2], transform[3][2]}
  },
  .InstanceMask = 1,
  .AccelerationStructure = prim->m_BLAS->GetGPUVirtualAddress()
};
instanceDescs.Add(instanceDesc);
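
The transposed indexing in the instance descriptor above can be sketched in isolation. This assumes the engine's float4x4 is indexed as [row][column] with the translation stored in row 3 (row-vector convention), which is what the transpose suggests but is not confirmed by the source. Float4x4 and ToInstanceTransform are hypothetical stand-ins. D3D12_RAYTRACING_INSTANCE_DESC::Transform itself is a row-major 3x4 matrix with the translation in the last column.

```cpp
#include <cassert>

// Hypothetical stand-in for the engine's float4x4: indexed m[row][col],
// row-vector convention, so the translation lives in m[3][0..2].
struct Float4x4
{
    float m[4][4];
};

// Fills the 3x4 layout used by D3D12_RAYTRACING_INSTANCE_DESC::Transform.
// The source matrix is transposed and its last column (0, 0, 0, 1) is dropped,
// so the translation ends up in dst[0..2][3].
void ToInstanceTransform(const Float4x4& src, float dst[3][4])
{
    for (int row = 0; row < 3; ++row)
    {
        for (int col = 0; col < 4; ++col)
        {
            dst[row][col] = src.m[col][row];
        }
    }
}
```

With an identity rotation and a translation of (5, 6, 7) stored in row 3, the output rows are {1, 0, 0, 5}, {0, 1, 0, 6}, and {0, 0, 1, 7}.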

- Top-Level Acceleration Structure (TLAS) -

The TLAS is easy to build; I currently only have one for my scenes. I set up its inputs to point to the GPU buffer of instance descriptors and also request fast-trace performance.

const D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INPUTS topLevelInputs = {
    .Type = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_TOP_LEVEL,
    .Flags = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PREFER_FAST_TRACE,
    .NumDescs = instanceDescs.Size(),
    .DescsLayout = D3D12_ELEMENTS_LAYOUT_ARRAY,
    .InstanceDescs = m_InstanceDescs->GetGPUVirtualAddress(),
};
D3D12_RAYTRACING_ACCELERATION_STRUCTURE_PREBUILD_INFO topLevelPrebuildInfo;
m_Device->GetRaytracingAccelerationStructurePrebuildInfo(&topLevelInputs, &topLevelPrebuildInfo);
IE_Assert(topLevelPrebuildInfo.ResultDataMaxSizeInBytes > 0);

AllocateUAVBuffer(static_cast<u32>(topLevelPrebuildInfo.ScratchDataSizeInBytes), m_ScratchResource, D3D12_RESOURCE_STATE_UNORDERED_ACCESS, L"ScratchResource");
AllocateUAVBuffer(static_cast<u32>(topLevelPrebuildInfo.ResultDataMaxSizeInBytes), m_TLAS, D3D12_RESOURCE_STATE_RAYTRACING_ACCELERATION_STRUCTURE, L"TLAS");

const D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC topLevelBuildDesc = {
    .DestAccelerationStructureData = m_TLAS->GetGPUVirtualAddress(),
    .Inputs = topLevelInputs,
    .ScratchAccelerationStructureData = m_ScratchResource->GetGPUVirtualAddress(),
};
cmdList->BuildRaytracingAccelerationStructure(&topLevelBuildDesc, 0, nullptr);

- Shader Table -

For the shader tables, each shader (ray generation, miss, and hit group) gets its own single-record table that contains only its shader identifier.

ComPtr<ID3D12StateObjectProperties> stateObjectProperties;
IE_Check(m_DxrStateObject.As(&stateObjectProperties));

constexpr u32 shaderRecordSize = (D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES + (D3D12_RAYTRACING_SHADER_RECORD_BYTE_ALIGNMENT - 1)) & ~(D3D12_RAYTRACING_SHADER_RECORD_BYTE_ALIGNMENT - 1);
const CD3DX12_RESOURCE_DESC bufferDesc = CD3DX12_RESOURCE_DESC::Buffer(shaderRecordSize);
const CD3DX12_HEAP_PROPERTIES uploadHeapProperties = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_UPLOAD);

// Ray gen shader table
IE_Check(m_Device->CreateCommittedResource(&uploadHeapProperties, D3D12_HEAP_FLAG_NONE, &bufferDesc, D3D12_RESOURCE_STATE_GENERIC_READ, nullptr, IID_PPV_ARGS(&m_RayGenShaderTable)));
IE_Check(m_RayGenShaderTable->SetName(L"RayGenShaderTable"));
SetResourceBufferData(m_RayGenShaderTable, stateObjectProperties->GetShaderIdentifier(L"Raygen"), D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES, 0);

// Miss shader table
IE_Check(m_Device->CreateCommittedResource(&uploadHeapProperties, D3D12_HEAP_FLAG_NONE, &bufferDesc, D3D12_RESOURCE_STATE_GENERIC_READ, nullptr, IID_PPV_ARGS(&m_MissShaderTable)));
IE_Check(m_MissShaderTable->SetName(L"MissShaderTable"));
SetResourceBufferData(m_MissShaderTable, stateObjectProperties->GetShaderIdentifier(L"Miss"), D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES, 0);

// Hit group shader table
IE_Check(m_Device->CreateCommittedResource(&uploadHeapProperties, D3D12_HEAP_FLAG_NONE, &bufferDesc, D3D12_RESOURCE_STATE_GENERIC_READ, nullptr, IID_PPV_ARGS(&m_HitGroupShaderTable)));
IE_Check(m_HitGroupShaderTable->SetName(L"HitGroupShaderTable"));
SetResourceBufferData(m_HitGroupShaderTable, stateObjectProperties->GetShaderIdentifier(L"HitGroup"), D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES, 0);
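
The record size computation above is a standard power-of-two round-up. Here is a small sketch of it, with the two constants mirrored from d3d12.h so that it compiles without the Windows SDK. AlignUp and ShaderRecordSize are hypothetical helper names, and the local root arguments parameter is only there to show how the size would grow; the prototype uses identifier-only records.

```cpp
#include <cassert>
#include <cstdint>

// Values mirrored from d3d12.h (assumption: kept in sync manually).
constexpr uint32_t kShaderIdentifierSize  = 32; // D3D12_SHADER_IDENTIFIER_SIZE_IN_BYTES
constexpr uint32_t kShaderRecordAlignment = 32; // D3D12_RAYTRACING_SHADER_RECORD_BYTE_ALIGNMENT

// Rounds value up to the next multiple of alignment.
// alignment must be a power of two.
constexpr uint32_t AlignUp(uint32_t value, uint32_t alignment)
{
    return (value + (alignment - 1)) & ~(alignment - 1);
}

// Size of one shader record: the identifier followed by optional
// local root arguments, padded to the record alignment.
constexpr uint32_t ShaderRecordSize(uint32_t localRootArgsSize)
{
    return AlignUp(kShaderIdentifierSize + localRootArgsSize, kShaderRecordAlignment);
}

static_assert(ShaderRecordSize(0) == 32, "identifier-only record");
static_assert(ShaderRecordSize(8) == 64, "8 bytes of root arguments round up to 64");
```

Since the identifier size already equals the record alignment, the round-up is a no-op here, but it becomes necessary as soon as a record carries local root arguments.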

- Raytracing shader -

Here is the raytracing shader code. Nothing too fancy: it's exactly what is explained in the Overview section.

#include "data/shaders/common.hlsli"

RaytracingAccelerationStructure Scene : register(t0, space0);
RWTexture2D<half> ShadowRenderTarget : register(u0);
Texture2D<float> Depth : register(t1);
SamplerState DepthSampler : register(s0);
cbuffer RtRootConstants : register(b1)
{
    row_major float4x4 invViewProj;

    float zNear;
    float zFar;
    float pad0;
    uint resolutionType;

    float3 sunDir;
    uint frameIndex;

    float3 cameraPos;
    float pad1;
};

static const uint invBayer2[4] = { 0, 3, 1, 2 };
static const uint invBayer4[16] = { 0,10,2,8, 5,15,7,13, 1,11,3,9, 4,14,6,12 };

static const uint2 rtFactors[4] =
{
    uint2(1, 1),   // full
    uint2(1, 2),   // fullX_halfY
    uint2(2, 2),   // half
    uint2(4, 4)    // quarter
};
static const uint rtTileCount[4] =
{
    1,  // full
    2,  // fullX_halfY
    4,  // half
    16  // quarter
};

struct RayPayload
{
    uint hitCount;
};

float3 ReconstructWorldPos(float2 uv)
{
    float d = Depth.SampleLevel(DepthSampler, uv, 0).r;
    float2 ndc0  = uv * 2 - 1;
    ndc0.y = -ndc0.y; // D3D
    float4 clip0   = float4(ndc0, d, 1);
    float4 worldH = mul(clip0, invViewProj);
    return worldH.xyz / worldH.w;
}

void GetDitherOffset(out uint2 fullPxOut, out float2 fullDimInvOut)
{
    uint2 px = DispatchRaysIndex().xy;
    float2 dim = DispatchRaysDimensions().xy;

    uint2 factor = rtFactors[resolutionType];
    uint tileCount = rtTileCount[resolutionType];

    uint slot = frameIndex % tileCount;

    uint idx;
    if (tileCount < 4)
    {
        idx = slot;
    }
    else if (tileCount == 4)
    {
        idx = invBayer2[slot];
    }
    else
    {
        idx = invBayer4[slot];
    }

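    // factor.x is 1, 2, or 4, so factor.x >> 1 equals log2(factor.x): 0, 1, or 2.
    uint shift = factor.x >> 1;
    uint mask = factor.x - 1;

    // Split the linear tile index into (column, row) within the factor.x-wide tile.
    uint2 ditherOffset = uint2(idx & mask, idx >> shift);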

    fullPxOut = px * factor + ditherOffset;
    fullDimInvOut = 1.f / (dim * factor); 
}

[shader("raygeneration")]
void Raygen()
{
    uint2 fullPx;
    float2 fullDimInv;
    GetDitherOffset(fullPx, fullDimInv);

    RayPayload payload = { 0 };

    float2 centerUV = (float2(fullPx) + 0.5f) * fullDimInv;
    float3 worldPos = ReconstructWorldPos(centerUV);

    RayDesc ray;
    ray.Origin = worldPos + normalize(cameraPos - worldPos) * 0.025;
    ray.Direction = sunDir;
    ray.TMin = 0.01f;
    ray.TMax = 1e6f;

    TraceRay(Scene, RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH, 0xFF, 0, 1, 0, ray, payload);

    ShadowRenderTarget[fullPx] = payload.hitCount;
}

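[shader("closesthit")]
void ClosestHit(inout RayPayload payload, in BuiltInTriangleIntersectionAttributes attr)
{
    payload.hitCount++;
}

// The pipeline also exports "Miss", which is not part of this listing.
// Assumed minimal version: an unoccluded ray leaves hitCount at 0.
[shader("miss")]
void Miss(inout RayPayload payload)
{
}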
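
The dither logic in GetDitherOffset is easy to sanity-check on the CPU. The following sketch ports the offset computation to C++ using the same tables as the shader, and adds a hypothetical helper, CoversTileExactlyOnce, that verifies that one full cycle of frames writes every pixel of a tile exactly once. It is a test aid under those assumptions, not code from the project.

```cpp
#include <cassert>
#include <cstdint>

struct UInt2
{
    uint32_t x, y;
};

// Same tables as in the shader.
static const uint32_t kInvBayer2[4]   = {0, 3, 1, 2};
static const uint32_t kInvBayer4[16]  = {0, 10, 2, 8, 5, 15, 7, 13, 1, 11, 3, 9, 4, 14, 6, 12};
static const UInt2    kRtFactors[4]   = {{1, 1}, {1, 2}, {2, 2}, {4, 4}};
static const uint32_t kRtTileCount[4] = {1, 2, 4, 16};

// CPU port of the per-frame offset computed in GetDitherOffset.
UInt2 DitherOffset(uint32_t resolutionType, uint32_t frameIndex)
{
    assert(resolutionType < 4);
    const UInt2 factor = kRtFactors[resolutionType];
    const uint32_t tileCount = kRtTileCount[resolutionType];
    const uint32_t slot = frameIndex % tileCount;

    uint32_t idx = slot;
    if (tileCount == 4)
        idx = kInvBayer2[slot];
    else if (tileCount == 16)
        idx = kInvBayer4[slot];

    const uint32_t shift = factor.x >> 1; // log2(factor.x) for 1, 2, 4
    const uint32_t mask = factor.x - 1;
    return {idx & mask, idx >> shift};
}

// Returns true if, over one full cycle of frames, every pixel of the
// factor.x * factor.y tile is written exactly once.
bool CoversTileExactlyOnce(uint32_t resolutionType)
{
    const UInt2 factor = kRtFactors[resolutionType];
    const uint32_t tileCount = kRtTileCount[resolutionType];
    uint32_t hits[16] = {};
    for (uint32_t frame = 0; frame < tileCount; ++frame)
    {
        const UInt2 o = DitherOffset(resolutionType, frame);
        if (o.x >= factor.x || o.y >= factor.y)
            return false;
        ++hits[o.y * factor.x + o.x];
    }
    for (uint32_t i = 0; i < factor.x * factor.y; ++i)
    {
        if (hits[i] != 1)
            return false;
    }
    return true;
}
```

For the Half mode, for instance, the four frames of a cycle visit the offsets (0,0), (1,1), (1,0), and (0,1), so consecutive frames land on opposite corners of the 2x2 tile.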

- Blur shader -

Again, not much to say: it follows the Blur Pass description in the Overview section.

Texture2D<float> g_Input : register(t0);
Texture2D<float> g_Depth : register(t1);
RWTexture2D<float> g_Output : register(u0);
cbuffer CameraParams : register(b1)
{
    float zNear;
    float zFar;
};

static const int radius = 1;
static const float fadeStart = 5.f;
static const float fadeEnd = 50.f;
static const float blurW[3] = {0.25, 0.5, 0.25};
static const float idW[3] = {0.0, 1.0, 0.0};
static const float depthDiffThreshold = 0.01f;

float LinearizeDepth(float d, float zNear, float zFar)
{
    float zN = d * 2.0 - 1.0;
    return (2.0 * zNear * zFar) / (zFar + zNear - zN * (zFar - zNear));
}

void Blur(int2 uv, int2 axis)
{
    uint width, height;
    g_Input.GetDimensions(width, height);
    
    float rawD = g_Depth.Load(int3(uv,0));
    float cd = LinearizeDepth(rawD, zNear, zFar);
    float t = saturate((cd - fadeStart) / (fadeEnd - fadeStart));

    float w[3];
    w[0] = lerp(blurW[0], idW[0], t);
    w[1] = lerp(blurW[1], idW[1], t);
    w[2] = lerp(blurW[2], idW[2], t);

    float sum = 0;
    float wsum = 0;

    [unroll]
    for (int i = -radius; i <= radius; i++)
    {
        float weight = w[i + radius];
        int2 suv = uv + axis * i;
        suv = clamp(suv, int2(0,0), int2(width-1, height-1));

        float sd = LinearizeDepth(g_Depth.Load(int3(suv,0)), zNear, zFar);
        if (abs(sd - cd) <= depthDiffThreshold)
        {
            sum += weight * g_Input.Load(int3(suv,0));
            wsum += weight;
        }
    }

    g_Output[uv] = (wsum > 0) ? sum / wsum : g_Input.Load(int3(uv,0));
}
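
Two details of the blur shader can be checked on the CPU. First, LinearizeDepth looks like the OpenGL formula, but algebraically it reduces to zNear * zFar / (zFar - d * (zFar - zNear)), the usual expression for a [0, 1] depth range (assuming a standard, non-reversed projection). Second, the kernel fades from {0.25, 0.5, 0.25} to the identity {0, 1, 0} between fadeStart and fadeEnd. The sketch below ports both; LinearizeDepthSimplified, Kernel, and BlurWeights are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// CPU port of LinearizeDepth from the blur shader.
float LinearizeDepth(float d, float zNear, float zFar)
{
    const float zN = d * 2.0f - 1.0f;
    return (2.0f * zNear * zFar) / (zFar + zNear - zN * (zFar - zNear));
}

// Algebraically equivalent form for a [0, 1] depth range.
float LinearizeDepthSimplified(float d, float zNear, float zFar)
{
    return (zNear * zFar) / (zFar - d * (zFar - zNear));
}

struct Kernel
{
    float w[3];
};

// Kernel weights for a pixel at the given linear depth, using the same
// constants as the shader (fadeStart = 5, fadeEnd = 50).
Kernel BlurWeights(float linearDepth)
{
    const float fadeStart = 5.0f;
    const float fadeEnd = 50.0f;
    const float blurW[3] = {0.25f, 0.5f, 0.25f};
    const float idW[3] = {0.0f, 1.0f, 0.0f};

    // saturate((cd - fadeStart) / (fadeEnd - fadeStart))
    const float t = std::min(std::max((linearDepth - fadeStart) / (fadeEnd - fadeStart), 0.0f), 1.0f);

    Kernel k;
    for (int i = 0; i < 3; ++i)
    {
        k.w[i] = blurW[i] + (idW[i] - blurW[i]) * t; // lerp(blurW[i], idW[i], t)
    }
    return k;
}
```

With zNear = 1 and zFar = 100, a raw depth of 0 maps to 1 and a raw depth of 1 maps to 100. Pixels closer than 5 units get the full 3-tap blur, and pixels beyond 50 units are passed through unchanged.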

+ Timings +

These timings were measured on a GeForce RTX 5080 at a resolution of 2560 x 1440 in the Bistro Exterior scene.

Shadow Resolution | TraceRays (ms) | Blur X (ms) | Blur Y (ms)
Full              | 1.23           | 0.04        | 0.03
FullX_HalfY       | 0.49           | 0.04        | 0.03
Half              | 0.26           | 0.04        | 0.03
Quarter           | 0.09           | 0.04        | 0.03

Note: The blur is performed on the accumulated shadow render target, which always matches the swapchain size. The blur timings therefore do not depend on the shadow resolution, which is why the numbers are the same in all cases.
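
To relate these timings to the workload, here is a small sketch that computes the number of rays traced per frame in each mode. It assumes the dispatch dimensions are the swapchain size divided by the per-mode factors from rtFactors, which is what the shader's full-resolution reconstruction implies. RaysPerFrame is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Downscale factors per mode, in the same order as rtFactors in the shader:
// Full, FullX_HalfY, Half, Quarter.
struct Factor
{
    uint32_t x, y;
};
constexpr Factor kFactors[4] = {{1, 1}, {1, 2}, {2, 2}, {4, 4}};

// Number of rays dispatched in one frame for the given target size and mode.
constexpr uint64_t RaysPerFrame(uint32_t width, uint32_t height, uint32_t mode)
{
    return static_cast<uint64_t>(width / kFactors[mode].x) * (height / kFactors[mode].y);
}

static_assert(RaysPerFrame(2560, 1440, 0) == 3686400, "Full");
static_assert(RaysPerFrame(2560, 1440, 3) == 230400, "Quarter");
```

At 2560 x 1440 this gives 3,686,400 rays for Full, 1,843,200 for FullX_HalfY, 921,600 for Half, and 230,400 for Quarter. Going from Full to Quarter traces 16 times fewer rays, while the measured TraceRays time drops from 1.23 ms to 0.09 ms, roughly 14 times less.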

+ Potential Improvements +

Here are some ideas worth exploring. I may experiment with them in the future.

+ Key Takeaways +

+ Screenshots +

+ References +

  1. DirectX Raytracing (DXR) Functional Spec - Microsoft. microsoft.github.io/DirectX-Specs/d3d/Raytracing.html

  2. Direct3D 12 raytracing samples - Microsoft. learn.microsoft.com/en-us/samples/microsoft/directx-graphics-samples/d3d12-raytracing-samples-win32

  3. Compute shaders in graphics: Gaussian blur - lisyarus. lisyarus.github.io/blog/posts/compute-blur.html