Brief guide to Vulkan layers

Vulkan has a lot of really nice concepts, but one that hasn't had as much attention until now is the layer system that's built into the API ecosystem.

For other APIs you generally link directly against a runtime, which is entirely opaque and calls into the driver for any of its work. This means that for tools and add-in functionality there are two options: either use platform-specific interception hacks to slide themselves between the application and the API, or be built by the platform-holder or IHV and write in hooks inside the runtime or driver itself.

This might seem like a niche use-case but it really isn't - all developers use the API's own validation and checking features from time to time and of course for newer explicit APIs like Vulkan this becomes a critical timesaver. Likewise many different tools from debuggers to profilers will need to have their own access to the API. On top of this there are other cases like launcher overlays, performance measurement and video capture that all want access.

Having all of these systems fighting to insert themselves with no way of co-operating only leads to trouble, so having a system built into the API for adding these different things saves everyone a lot of hassle.

While most developers will only ever need to use provided layers for validation, profiling and debugging, etc, some out there might be interested in how the system actually works and a handful will want to write their own layers. There is a detailed specification available of how exactly all of the pieces fit together, which you should absolutely read, but as it can be a bit dense I've put together a more friendly introduction.

- baldurk

The Loader

The loader is the central arbiter in the Vulkan runtime. The application talks directly to the loader and only to the loader, which then deals with enumerating and validating the layers requested, enumerating ICDs and organising them and presenting a unified interface to the application.

The loader - as with much in the Vulkan ecosystem - is open source so you can see exactly how it's implemented. It's a good idea to have the source open and trace through some of the concepts I'll talk about here. The specification I linked earlier is also available and is the precise definition of how the loader, layers and ICD interact.

When an application links against vulkan-1.dll or libvulkan.so.1 it is linking against the loader. The loader exports all core (non-extension) Vulkan functions like vkCreateImage or vkCmdDraw. When an application calls one of these, it is calling into the loader rather than directly into a Vulkan driver.

The first thing an application will do is query for instance extensions and layers, and then call vkCreateInstance. To do this, the loader already needs to know what layers are available, and how it finds them depends on the platform. On Windows, layers are registered in a registry key (HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\Vulkan) and on Linux there are several system-wide and user-specific search paths. The files registered are .json manifests which describe what layer is available and where to find the actual library.

All of the queries are satisfied by reading the json manifests, without ever loading the layer modules themselves. The extensions that each layer provides are detailed in the manifest, as well as the name and version that are retrieved with the queries. This means that any layers which aren't activated by the application never even get loaded into memory.

The loader does have to load ICDs though to determine which instance extensions they expose. ICDs are registered in a very similar way with manifest files. At this point we can create an instance with whichever layers the application requested, and our layer now starts to see some real action.

I'm only covering the loader in use on desktop platforms, Windows and Linux. The Android loader is more limited and constrained so I won't be covering the extra hoops and limitations to deal with in detail. The overall principles are still the same.

Dispatch Chains

At this point we need to talk about how a single function call (like vkCreateInstance) is propagated to the loader, the ICD or ICDs, and many different layers.

When the application calls into any function that the loader statically exports, it calls into a trampoline function. This trampoline function then calls into what I'll refer to as a dispatch chain.

A dispatch chain is a chain of function pointers along which execution flows. It begins with the loader's trampoline entry point, then the trampoline calls into the first layer, then the first layer calls into the second. This chain continues until it finally reaches the endpoint in the ICD to really do the work.
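To make that flow a bit more concrete, here is a toy model of a dispatch chain in plain C++. This is only a sketch and not loader code: Link, LayerA_Func, LayerB_Func, ICD_Func and Trampoline are names I've made up for illustration, and a real chain uses the actual Vulkan function signatures rather than this simplified one.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry in a dispatch chain. Real layers store
// real Vulkan function pointers instead of this toy signature.
struct Link;
typedef void (*ChainFunc)(const Link *self, std::vector<std::string> &trace);

struct Link
{
  ChainFunc func;    // the function implemented at this point in the chain
  const Link *next;  // where execution goes afterwards, nullptr at the end
};

// A layer does its own work, then passes execution on to the next link.
void LayerA_Func(const Link *self, std::vector<std::string> &trace)
{
  assert(self->next != nullptr);
  trace.push_back("layer A");
  self->next->func(self->next, trace);
}

void LayerB_Func(const Link *self, std::vector<std::string> &trace)
{
  assert(self->next != nullptr);
  trace.push_back("layer B");
  self->next->func(self->next, trace);
}

// The endpoint does the real work and calls no further.
void ICD_Func(const Link *, std::vector<std::string> &trace)
{
  trace.push_back("ICD");
}

// The trampoline is what the application calls; it only starts the chain.
std::vector<std::string> Trampoline(const Link *first)
{
  std::vector<std::string> trace;
  trace.push_back("trampoline");
  first->func(first, trace);
  return trace;
}

// Builds the chain trampoline -> A -> B -> ICD and sends one call through it.
std::vector<std::string> RunExampleChain()
{
  Link icd = {ICD_Func, nullptr};
  Link b = {LayerB_Func, &icd};
  Link a = {LayerA_Func, &b};
  return Trampoline(&a);
}
```

RunExampleChain records each stop in order, so the returned trace shows execution passing through the trampoline, both layers and finally the endpoint.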

To be less vague, vkCreateInstance is one of these special functions. In the trampoline function right at the start of the dispatch chain, it first validates that the layers and extensions requested are valid. Once everything looks good, it allocates the Vulkan instance and dispatch chain and then calls into the vkCreateInstance function in the first layer. That first layer will initialise itself and any internal structures, then pass along execution to vkCreateInstance in the next layer, and so on.

Now Vulkan instances from the application's point of view are mostly a loader concept. They represent everything all together for its use of Vulkan, but they also combine all the available ICDs into one unified front. This means that when we reach the end of the dispatch chain for vkCreateInstance, we can't end up in an ICD. There could be several in use at the same time, and ICDs don't know about each other to chain together.

The loader solves this by putting its own terminator function on the end of the dispatch chain, for the final layer to call. This terminator function then calls vkCreateInstance on each available ICD in turn and stores all of them for later use. This diagram from the loader spec illustrates that:

As we were going along, the layers were initialising themselves and preparing for their place in the dispatch chain. In particular, every layer uses vkGetInstanceProcAddr to find all of the entry points that it wants to be able to call in the next layer on. Each layer calls vkGetInstanceProcAddr in the next layer and stores these in a dispatch table. This is just a structure full of function pointers.

By doing this up front at creation it means that when the loader calls the entry point of the first layer in the chain, it will already know where to pass on to next. This also enables a useful feature - layers don't have to hook every Vulkan function, just the ones they're interested in.

This happens when the loader and each layer were calling vkGetInstanceProcAddr to find the next function in the dispatch chain. If a layer doesn't want to intercept a function call, it doesn't have to return its own little stub function. Instead it can just forward the vkGetInstanceProcAddr call to the next layer and return the result. As long as the dispatch table at each point knows the next function to call, it doesn't matter if they skip a layer or two.
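Here is a small sketch of that forwarding idea with the Vulkan types stripped out. Everything in it is a made-up stand-in: VoidFunc and GetProcAddrFunc play the role of the real function pointer types, and vkFunctionA/B/C are placeholder names rather than real entry points. Layer A hooks only vkFunctionA, layer B hooks only vkFunctionB, and the ICD implements all three.

```cpp
#include <cassert>
#include <cstring>

typedef void (*VoidFunc)();                         // stand-in for PFN_vkVoidFunction
typedef VoidFunc (*GetProcAddrFunc)(const char *);  // stand-in for a GetProcAddr function

// Hypothetical entry points. The bodies are empty, only the addresses matter.
void ICD_FunctionA() {}
void ICD_FunctionB() {}
void ICD_FunctionC() {}
void LayerA_FunctionA() {}
void LayerB_FunctionB() {}

// The ICD is the end of the chain and implements everything.
VoidFunc ICD_GetProcAddr(const char *name)
{
  assert(name != nullptr);
  if(!std::strcmp(name, "vkFunctionA")) return &ICD_FunctionA;
  if(!std::strcmp(name, "vkFunctionB")) return &ICD_FunctionB;
  if(!std::strcmp(name, "vkFunctionC")) return &ICD_FunctionC;
  return nullptr;
}

// Layer B intercepts only vkFunctionB and forwards every other query onwards.
GetProcAddrFunc layerB_next = &ICD_GetProcAddr;
VoidFunc LayerB_GetProcAddr(const char *name)
{
  assert(name != nullptr);
  if(!std::strcmp(name, "vkFunctionB")) return &LayerB_FunctionB;
  return layerB_next(name);
}

// Layer A intercepts only vkFunctionA and forwards every other query to layer B.
GetProcAddrFunc layerA_next = &LayerB_GetProcAddr;
VoidFunc LayerA_GetProcAddr(const char *name)
{
  assert(name != nullptr);
  if(!std::strcmp(name, "vkFunctionA")) return &LayerA_FunctionA;
  return layerA_next(name);
}
```

Asking layer A (the first in the chain) for vkFunctionB hands back layer B's function directly, and asking for vkFunctionC goes all the way to the ICD, so neither call ever passes through a layer that didn't ask for it.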

Dispatch chains for functions that skip some layers

In this example, neither Layer A nor Layer B intercept all the functions. The loader's dispatch table will go partly to A and partly to B, likewise A's dispatch table will go partly to B and partly directly to the ICD. For simplicity, we assume all these functions have no loader terminator.

You can think of this either as one large dispatch chain which is sparsely populated at each layer, or instead as a dispatch chain per function which may be longer or shorter. Compare the dispatch chain for vkFunctionB to the chain for vkFunctionC.

It's important to note though that just because a layer doesn't participate in the dispatch chain from the application for a given function, that doesn't mean it can't still call that function if it wants to. In the above example, Layer A may still want to call vkFunctionD as part of its functionality even if it has no interest in intercepting that function.

Think for example of a performance overlay which may only want to intercept the call to vkQueuePresentKHR so that it can do its drawing. However, to do that drawing it will still want to be able to call all of the command buffer recording functions. In this way, Layer A would get a pointer to Layer B's implementation, just the same as the loader does.

Layers calling functions that they don't hook

There's one final thing to mention about dispatch chains, and it is that there are in fact two dispatch chains in use during any non-trivial Vulkan application.

The reason becomes clear when you think about it. As we talked about above, Vulkan instances bundle up all the ICDs together, so any function calls going through the instance's dispatch chain need to eventually come out the other side and go to one or more ICDs. However once you have created a VkDevice you have selected a particular ICD (the one that provides that VkPhysicalDevice), and all of your calls are going to that ICD in particular. On top of that, if you create several devices, one for each physical device, you can have different chains going to different ICDs!

So in the layer specification and anywhere that discusses these dispatch chains, you will see the instance chain and the device chain discussed separately. The instance chain is the one used for any function calls on instances, physical devices, and vkCreateDevice. Once you have created a device, all calls on it and any of its children are on that device's device chain.

Implementation

We now actually have all of the concepts explained about how the layers work - this primary concept of dispatch tables and dispatch chains allows layers to be strung along arbitrarily and gives us a well-defined way of including many different functions all together without worrying about clashes in function hooking.

Let's dive in and look at the nitty-gritty of how this actually gets implemented, and in the process we'll build up an example layer that tracks some simple drawcall statistics for each command buffer.

JSON Manifest

The first thing we'll look at is that json manifest file that we mentioned way back when talking about how the loader finds layers on the system. Since it's straightforward, let's look at a basic manifest for our sample layer:

{
  "file_format_version" : "1.0.0",
  "layer" : {
    "name": "VK_LAYER_SAMPLE_SampleLayer",
    "type": "GLOBAL",
    "library_path": ".\\sample_layer.dll",
    "api_version": "1.0.0",
    "implementation_version": "1",
    "description": "Sample layer - https://renderdoc.org/vulkan-layer-guide.html"
  }
}

Example JSON manifest for our layer.

Many of the fields map directly to those that you retrieve in VkLayerProperties when enumerating. The file_format_version is a version number for the manifest format, and the library_path is either a relative path (relative to the JSON location), absolute path, or plain filename. In the last case, the normal search path logic is used as with any other dynamic module load - searching system paths and so on, as defined by the OS. On Windows this would be the LoadLibrary search path, and on Linux it would be the path from ld.so.conf and LD_LIBRARY_PATH used in dlopen.

The type field is a bit of backwards compatibility. In the original Vulkan spec, layers could be instance-only, device-only, or both. This field then would be INSTANCE, DEVICE, or GLOBAL for both. Now layers are always considered to be on both, although they don't have to be on the device chain if they don't want to be. Most will use both chains though, as that's where most of the functionality lives.

This manifest is enough, and with it the loader will look for entry points vkGetInstanceProcAddr and vkGetDeviceProcAddr to use in constructing the dispatch chain. These are the only entry points the module needs to export. Since it can be awkward or inconvenient to have to export functions with the exact same name as the API functions, you can instead export functions like SampleLayer_GetInstanceProcAddr and SampleLayer_GetDeviceProcAddr. You can do this as long as you add a mapping to the manifest:

{
  "file_format_version" : "1.0.0",
  "layer" : {
    "name": "VK_LAYER_SAMPLE_SampleLayer",
    "type": "GLOBAL",
    "library_path": ".\\sample_layer.dll",
    "api_version": "1.0.0",
    "implementation_version": "1",
    "description": "Sample layer - https://renderdoc.org/vulkan-layer-guide.html",
    "functions": {
      "vkGetInstanceProcAddr": "SampleLayer_GetInstanceProcAddr",
      "vkGetDeviceProcAddr": "SampleLayer_GetDeviceProcAddr"
    }
  }
}

Example JSON manifest with function remapping.

Now you will find your layer being enumerated, and your GetProcAddr functions will be called when the user enables the layer.

vkGetInstanceProcAddr

These functions have the exact same signature as is listed in vulkan.h, and perform just as you would expect - you can strcmp the pName parameter sequentially against each function you export, and then return the address of your entry point. You might use code-generation of some description for this if you're intercepting a lot of functions. In our sample we only intercept a few functions, so a simple macro is sufficient:

#define GETPROCADDR(func) if(!strcmp(pName, "vk" #func)) return (PFN_vkVoidFunction)&SampleLayer_##func;

VK_LAYER_EXPORT PFN_vkVoidFunction VKAPI_CALL SampleLayer_GetInstanceProcAddr(VkInstance instance, const char *pName)
{
  // instance chain functions we intercept
  GETPROCADDR(GetInstanceProcAddr);
  GETPROCADDR(EnumerateInstanceLayerProperties);
  GETPROCADDR(EnumerateInstanceExtensionProperties);
  GETPROCADDR(CreateInstance);
  GETPROCADDR(DestroyInstance);

  return NULL; // ???
}

The beginnings of our vkGetInstanceProcAddr.

Except, oh dear... After we decided how nice it was not to have to intercept every entry point, we've run into a problem - what do we return when it's some other function that we don't care about? We don't yet have any idea what the next layer is or where its vkGetInstanceProcAddr is.

In order to finish this implementation we will need to look at how vkCreateInstance works first.

vkCreateInstance

vkCreateInstance is where we get our initialisation and construct our dispatch table. Even if your layer doesn't care about the instance itself, this is still where you need to perform your initialisation code and it is required that all layers implement the function. Likewise if you want to intercept any device functions at all, you must implement vkCreateDevice and initialise. The function implementations end up almost identical, because the dispatch chain concept is the same in each.

When our layer's vkCreateInstance is called, the loader has inserted extra data that we can use for initialisation. In particular, it makes use of Vulkan's sType/pNext extensibility. In the VkInstanceCreateInfo struct, it begins in exactly the same way as many other Vulkan structures with a VkStructureType sType; and const void *pNext. These two elements define a linked list of extra extension information. You can iterate through the list even if you don't recognise all of the entries in the list because they all start with the sType/pNext.
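As a minimal sketch of that iteration, here is a generic chain walk using made-up types. StructType, BaseHeader, DebugInfo and LoaderInfo are all hypothetical stand-ins for the real definitions in vulkan.h; the only thing that matters is that every struct starts with the same two members in the same order.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type tags; real code uses VkStructureType values.
enum StructType : uint32_t
{
  TYPE_APP_INFO = 1,
  TYPE_DEBUG_INFO = 2,
  TYPE_LOADER_INFO = 3,
};

// Every chained struct begins with these two members, in this order.
struct BaseHeader
{
  StructType sType;
  const void *pNext;
};

// Two unrelated structs that happen to be linked into the same chain.
struct DebugInfo
{
  StructType sType;
  const void *pNext;
  int verbosity;
};

struct LoaderInfo
{
  StructType sType;
  const void *pNext;
  const char *payload;
};

// Walk the chain from 'head' and return the first struct of the wanted type,
// or nullptr if there isn't one. Structs of other types are simply skipped,
// which works because we only ever read their sType and pNext.
const BaseHeader *FindInChain(const void *head, StructType wanted)
{
  const BaseHeader *cur = (const BaseHeader *)head;
  while(cur && cur->sType != wanted)
    cur = (const BaseHeader *)cur->pNext;
  return cur;
}

// Usage: a DebugInfo we don't care about sits in front of the struct we want.
void ExampleWalk()
{
  LoaderInfo loader = {TYPE_LOADER_INFO, nullptr, "link"};
  DebugInfo debug = {TYPE_DEBUG_INFO, &loader, 3};
  assert(FindInChain(&debug, TYPE_LOADER_INFO) == (const BaseHeader *)&loader);
}
```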

The loader inserts an extra element in the pNext chain which has the type VK_STRUCTURE_TYPE_LOADER_INSTANCE_CREATE_INFO. This struct is defined in vk_layer.h which is distributed in the same SDK as vulkan.h or on github here. This struct contains the next layer's vkGetInstanceProcAddr which is all that we need to initialise. That header also contains the VkLayerDispatchTable struct which contains all of the function pointers for unextended Vulkan. You don't have to use it, but it's convenient.

Because each layer calls directly into the next with the same creation info struct, we also need a little linked list within this VkLayerInstanceCreateInfo. When we retrieve the next layer's vkGetInstanceProcAddr we essentially pop off the front of this list - take the function pointer, and advance the list head so that the next layer gets the information for the third layer in the chain, and so on.
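The pop-and-advance step can be modelled in a few lines. This is only a sketch with made-up types: LayerLink and LinkInfo stand in for the structures in vk_layer.h, NextFunc stands in for PFN_vkGetInstanceProcAddr, and LayerB_Entry and Terminator_Entry are placeholder functions.

```cpp
#include <cassert>

typedef int (*NextFunc)();  // stand-in for PFN_vkGetInstanceProcAddr

// Hypothetical list node: one per layer that still has to initialise itself.
struct LayerLink
{
  LayerLink *pNext;   // the node meant for the layer after this one
  NextFunc pfnNext;   // where the layer reading this node should call next
};

// Hypothetical stand-in for the loader's create info, which owns the list head.
struct LinkInfo
{
  LayerLink *pLayerInfo;
};

// Placeholder "next" functions so the pointers have something to point at.
int LayerB_Entry() { return 2; }
int Terminator_Entry() { return 3; }

// Take the function pointer meant for the current layer and advance the head,
// so that the next layer to run finds its own node at the front of the list.
NextFunc PopLink(LinkInfo *info)
{
  assert(info != nullptr && info->pLayerInfo != nullptr);
  NextFunc next = info->pLayerInfo->pfnNext;
  info->pLayerInfo = info->pLayerInfo->pNext;
  return next;
}
```

With two layers A and B, the first call to PopLink (made by A) yields B's entry point, and the second call (made by B) yields whatever sits after B, which in this toy setup is the terminator.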

The loader actually inserts two different structs with this type in the chain with different values of VkLayerFunction - one is for the vkGetInstanceProcAddr. The other contains a function used for initialising dispatchable objects. When creating a dispatchable object within a layer, you need to call into a loader callback to initialise the dispatch table properly - see later in this post about object wrapping. This work is normally done by the trampoline function.

The details of how it works are outside the scope of this post, and it's not needed for the sample layer so we just need to skip it. You can read the specification for precise details on how all of this is set up.

Putting all of that together we can now implement SampleLayer_CreateInstance:

VK_LAYER_EXPORT VkResult VKAPI_CALL SampleLayer_CreateInstance(
    const VkInstanceCreateInfo*                 pCreateInfo,
    const VkAllocationCallbacks*                pAllocator,
    VkInstance*                                 pInstance)
{
  VkLayerInstanceCreateInfo *layerCreateInfo = (VkLayerInstanceCreateInfo *)pCreateInfo->pNext;

  // step through the chain of pNext until we get to the link info
  while(layerCreateInfo && (layerCreateInfo->sType != VK_STRUCTURE_TYPE_LOADER_INSTANCE_CREATE_INFO ||
                            layerCreateInfo->function != VK_LAYER_LINK_INFO))
  {
    layerCreateInfo = (VkLayerInstanceCreateInfo *)layerCreateInfo->pNext;
  }

  if(layerCreateInfo == NULL)
  {
    // No loader instance create info
    return VK_ERROR_INITIALIZATION_FAILED;
  }

  PFN_vkGetInstanceProcAddr gpa = layerCreateInfo->u.pLayerInfo->pfnNextGetInstanceProcAddr;
  // move chain on for next layer
  layerCreateInfo->u.pLayerInfo = layerCreateInfo->u.pLayerInfo->pNext;

  PFN_vkCreateInstance createFunc = (PFN_vkCreateInstance)gpa(VK_NULL_HANDLE, "vkCreateInstance");

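  VkResult ret = createFunc(pCreateInfo, pAllocator, pInstance);
  // if creation failed further down the chain, *pInstance isn't valid - pass the error back up
  if(ret != VK_SUCCESS)
    return ret;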

  // fetch our own dispatch table for the functions we need, into the next layer
  VkLayerInstanceDispatchTable dispatchTable;
  dispatchTable.GetInstanceProcAddr = (PFN_vkGetInstanceProcAddr)gpa(*pInstance, "vkGetInstanceProcAddr");
  dispatchTable.DestroyInstance = (PFN_vkDestroyInstance)gpa(*pInstance, "vkDestroyInstance");
  dispatchTable.EnumerateDeviceExtensionProperties = (PFN_vkEnumerateDeviceExtensionProperties)gpa(*pInstance, "vkEnumerateDeviceExtensionProperties");

  return VK_SUCCESS;
}

There's one piece of the puzzle missing - now that we've created our dispatch table with onward function pointers, where do we actually store it? We can't store it globally because that breaks as soon as multiple instances are created.

There are two main ways to implement this that I know of. The first is just to create a look-up map, from instance to dispatch table, with a global lock to prevent threaded access causing problems. That's the one we'll use in our implementation because it's nice and simple. I will side-track very briefly to mention the other: object wrapping.

We'll see as we finish the implementation that the burden of all these locks grows quite quickly, and we're going to end up forcing a lot of our Vulkan use to be serial. That kind of defeats the purpose of having a highly parallelisable API. For some layers, maybe you'll consider this an acceptable cost, but there is an alternative.

Object Wrapping

The handles returned to us by creation functions - the *pInstance in the example above - are entirely opaque, and importantly we can return something different up the chain if we want. In fact, there's nothing to stop us from allocating some new bit of memory, storing our dispatch table in there, returning that pointer further up the chain and pretending it's the real deal. Then whenever we want the dispatch table again we can just look at our custom structure and pull it out.

Well, there's almost nothing. The first problem is that while the handles are opaque they are quite important to someone. If you wrap the handle and return something different up the chain towards the application, you had better catch it every single time it comes back down the chain towards the ICD. You have to replace it with the original before you pass it along, because if you don't bad things happen. In practice this means that you will store the original inside your custom memory, and that you must now implement every Vulkan function that uses that type of object so that you can unwrap it again. No more skipping uninteresting functions for your layer.

In practice most layers that wrap some objects will wrap them all, so this quickly becomes an all-or-nothing approach. It also becomes doubly tricky when you consider that extensions you didn't know about at implementation time could also use the object. You have no way of intercepting those extension function calls to unwrap. This means that object wrapping layers are not compatible with any extensions that add functions.

The second thing is more of an implementation note - the loader doesn't quite treat these handles as opaque. For dispatchable handles - VkInstances, VkPhysicalDevices, VkDevices, VkQueues, VkCommandBuffers - the memory they point to must contain the loader's dispatch table in the first sizeof(void*) bytes. In practice this just means that when you're allocating your custom memory, you now have to first copy the loader's dispatch table into the first pointer, then copy the original object into the second pointer, then store whatever information you want after that.
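To illustrate the memory layout described here, this is a rough sketch with stand-in types. RealObject pretends to be the memory behind a dispatchable handle and WrappedObject is a hypothetical wrapper; neither is a real Vulkan or loader type, and a real layer would also respect the application's allocation callbacks.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the memory behind a real dispatchable handle: its first
// pointer-sized field is the loader's dispatch table pointer.
struct RealObject
{
  void *loaderTable;
  int driverData;
};

// Hypothetical wrapper that we hand up the chain instead of the real object.
struct WrappedObject
{
  void *loaderTable;  // first: a copy of the loader's dispatch table pointer
  RealObject *real;   // second: the original handle, to pass down the chain
  int drawCount;      // after that: whatever the layer wants to track
};

static_assert(offsetof(WrappedObject, loaderTable) == 0,
              "the loader's table pointer must be the first thing in the wrapper");

// Allocate a wrapper around 'real', preserving the loader's table pointer.
WrappedObject *Wrap(RealObject *real)
{
  assert(real != nullptr);
  WrappedObject *wrapped = new WrappedObject;
  wrapped->loaderTable = real->loaderTable;
  wrapped->real = real;
  wrapped->drawCount = 0;
  return wrapped;
}

// Recover the original object before passing a call further down the chain.
RealObject *Unwrap(WrappedObject *wrapped)
{
  assert(wrapped != nullptr);
  return wrapped->real;
}

// Free the wrapper only; the real object belongs to whoever created it.
void DestroyWrapper(WrappedObject *wrapped)
{
  delete wrapped;
}
```

Anything that reads the first pointer of the wrapper sees the same value it would have read from the real object, which is the property the loader relies on.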

This concept might sound familiar - it's very close to the idea of a vtable in C++, which is used to determine where virtual function calls on an object go, even when calling through a pointer to the base class. The requirement above in effect means that the vtable must be preserved when you wrap an object.

While there are downsides, object wrapping has the big advantage of not requiring any global locking to access object-specific data or the dispatch table. If you're writing a small layer you might not want to go this route, but any layer which intercepts a lot of functions anyway will probably find object wrapping suitable. Just be aware of the responsibilities that object wrapping has.

We can now define a global map and lock, and store the dispatch table for our instance:

std::mutex global_lock;
typedef std::lock_guard<std::mutex> scoped_lock;
std::map<void *, VkLayerInstanceDispatchTable> instance_dispatch;

// in SampleLayer_CreateInstance store the table by key
{
	scoped_lock l(global_lock);
	instance_dispatch[GetKey(*pInstance)] = dispatchTable;
}

I've also snuck in a little detail here - for reasons that will become clear later, we use the loader's dispatch table pointer as the key in our map, not the instance handle itself. The GetKey function just returns that pointer as void *.
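A minimal sketch of what such a GetKey could look like is below. It leans on the fact mentioned in the object wrapping section, that a dispatchable handle points at memory whose first pointer-sized value is the loader's dispatch table pointer. FakeDispatchable is a made-up stand-in so the snippet compiles without vulkan.h; with the real headers you would pass a VkInstance or VkDevice instead.

```cpp
#include <cassert>

// Hypothetical stand-in for the memory behind a dispatchable handle.
struct FakeDispatchable
{
  void *loaderTable;  // the loader's dispatch table pointer comes first
  int payload;        // anything else the implementation keeps in there
};

// Read the first pointer-sized value that the handle points at and use it
// as the map key. Only valid for dispatchable (pointer-like) handles.
template <typename DispatchableType>
void *GetKey(DispatchableType handle)
{
  assert(handle != nullptr);
  return *(void **)handle;
}
```

Two different objects that carry the same table pointer produce the same key, and objects with different tables produce different keys.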

Now that we have our dispatch table created, we can also go back to SampleLayer_GetInstanceProcAddr and finish off the code that forwards to the next layer for any other functions that we don't intercept:

{
  scoped_lock l(global_lock);
  return instance_dispatch[GetKey(instance)].GetInstanceProcAddr(instance, pName);
}

With that, we now have our layer on the instance chain! Most layers will want to initialise themselves on the device chain as well because most interesting functions are there, but it's not strictly required.

As you might have noticed in our vkGetInstanceProcAddr implementation, layers also need to export the enumeration functions for layer and extension properties - these either forward themselves on, or return the results for themselves if pLayerName matches their own name. I won't post the code here since it's very simple, but you can see the full details in the sample code at the end. Likewise the implementation of SampleLayer_DestroyInstance just erases the map element in instance_dispatch.
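To give a flavour of the forward-or-answer logic without the real headers, here is a simplified sketch. Result, ExtensionProps and EnumerateFunc are hypothetical stand-ins for VkResult, VkExtensionProperties and the real function pointer type, Next_EnumerateExtensions pretends to be whatever comes next in the chain, and I'm assuming the layer exposes no extensions of its own.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

enum Result { SUCCESS, INCOMPLETE };  // stand-in for VkResult

struct ExtensionProps  // stand-in for VkExtensionProperties
{
  char name[64];
  uint32_t specVersion;
};

typedef Result (*EnumerateFunc)(const char *layerName, uint32_t *count, ExtensionProps *props);

static const char *kLayerName = "VK_LAYER_SAMPLE_SampleLayer";

// Pretend next-in-chain that reports a single made-up extension, following the
// usual "query the count first, then fill the array" calling pattern.
Result Next_EnumerateExtensions(const char *, uint32_t *count, ExtensionProps *props)
{
  if(props == nullptr)
  {
    *count = 1;
    return SUCCESS;
  }
  if(*count < 1)
    return INCOMPLETE;
  std::strcpy(props[0].name, "VK_FAKE_extension");
  props[0].specVersion = 1;
  *count = 1;
  return SUCCESS;
}

EnumerateFunc nextEnumerate = &Next_EnumerateExtensions;

Result SampleLayer_EnumerateExtensions(const char *layerName, uint32_t *count, ExtensionProps *props)
{
  assert(count != nullptr);

  // the query is about another layer or the driver: forward it on unchanged
  if(layerName == nullptr || std::strcmp(layerName, kLayerName) != 0)
    return nextEnumerate(layerName, count, props);

  // the query is about us, and in this sketch we expose no extensions
  *count = 0;
  return SUCCESS;
}
```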

vkGetDeviceProcAddr & vkCreateDevice

Implementing the SampleLayer_GetDeviceProcAddr and SampleLayer_CreateDevice functions follows almost exactly the same pattern as the instance versions, because all of the concepts still apply exactly the same. The only difference is one careful little gotcha - while VkDevice and all of its children are on the device chain, the vkCreateDevice function is being called on the VkInstance and so it lives on the instance chain:

std::map<void *, VkLayerDispatchTable> device_dispatch;

VK_LAYER_EXPORT PFN_vkVoidFunction VKAPI_CALL SampleLayer_GetDeviceProcAddr(VkDevice device, const char *pName)
{
  // device chain functions we intercept
  GETPROCADDR(GetDeviceProcAddr);
  GETPROCADDR(EnumerateDeviceLayerProperties);
  GETPROCADDR(EnumerateDeviceExtensionProperties);
  GETPROCADDR(CreateDevice);
  GETPROCADDR(DestroyDevice);
  GETPROCADDR(BeginCommandBuffer);
  GETPROCADDR(CmdDraw);
  GETPROCADDR(CmdDrawIndexed);
  GETPROCADDR(EndCommandBuffer);
  
  {
    scoped_lock l(global_lock);
    return device_dispatch[GetKey(device)].GetDeviceProcAddr(device, pName);
  }
}

VK_LAYER_EXPORT VkResult VKAPI_CALL SampleLayer_CreateDevice(
    VkPhysicalDevice                            physicalDevice,
    const VkDeviceCreateInfo*                   pCreateInfo,
    const VkAllocationCallbacks*                pAllocator,
    VkDevice*                                   pDevice)
{
  VkLayerDeviceCreateInfo *layerCreateInfo = (VkLayerDeviceCreateInfo *)pCreateInfo->pNext;

  // step through the chain of pNext until we get to the link info
  while(layerCreateInfo && (layerCreateInfo->sType != VK_STRUCTURE_TYPE_LOADER_DEVICE_CREATE_INFO ||
                            layerCreateInfo->function != VK_LAYER_LINK_INFO))
  {
    layerCreateInfo = (VkLayerDeviceCreateInfo *)layerCreateInfo->pNext;
  }

  if(layerCreateInfo == NULL)
  {
    // No loader device create info
    return VK_ERROR_INITIALIZATION_FAILED;
  }
  
  PFN_vkGetInstanceProcAddr gipa = layerCreateInfo->u.pLayerInfo->pfnNextGetInstanceProcAddr;
  PFN_vkGetDeviceProcAddr gdpa = layerCreateInfo->u.pLayerInfo->pfnNextGetDeviceProcAddr;
  // move chain on for next layer
  layerCreateInfo->u.pLayerInfo = layerCreateInfo->u.pLayerInfo->pNext;

  PFN_vkCreateDevice createFunc = (PFN_vkCreateDevice)gipa(VK_NULL_HANDLE, "vkCreateDevice");

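  VkResult ret = createFunc(physicalDevice, pCreateInfo, pAllocator, pDevice);
  // if creation failed further down the chain, *pDevice isn't valid - pass the error back up
  if(ret != VK_SUCCESS)
    return ret;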
  
  // fetch our own dispatch table for the functions we need, into the next layer
  VkLayerDispatchTable dispatchTable;
  dispatchTable.GetDeviceProcAddr = (PFN_vkGetDeviceProcAddr)gdpa(*pDevice, "vkGetDeviceProcAddr");
  dispatchTable.DestroyDevice = (PFN_vkDestroyDevice)gdpa(*pDevice, "vkDestroyDevice");
  dispatchTable.BeginCommandBuffer = (PFN_vkBeginCommandBuffer)gdpa(*pDevice, "vkBeginCommandBuffer");
  dispatchTable.CmdDraw = (PFN_vkCmdDraw)gdpa(*pDevice, "vkCmdDraw");
  dispatchTable.CmdDrawIndexed = (PFN_vkCmdDrawIndexed)gdpa(*pDevice, "vkCmdDrawIndexed");
  dispatchTable.EndCommandBuffer = (PFN_vkEndCommandBuffer)gdpa(*pDevice, "vkEndCommandBuffer");
  
  // store the table by key
  {
    scoped_lock l(global_lock);
    device_dispatch[GetKey(*pDevice)] = dispatchTable;
  }

  return VK_SUCCESS;
}

Now we have ourselves initialised on the instance and device chains, and we've filled and stored the dispatch tables we need for calling further on in the dispatch chains. Note that this layer is entirely passive - it doesn't call any Vulkan functions beyond forwarding on - so our dispatch tables are pretty much just the functions that we intercept. However in a real layer, we might want to populate the whole table.

Command buffer statistics

So far we've built a layer that doesn't actually do anything, so let's add some functionality. We'll count the drawcalls, instances and vertices recorded in each command buffer and print it out when command buffer recording ends.

First we'll declare a structure globally, and a map to store the command buffer data:

struct CommandStats
{
  uint32_t drawCount = 0, instanceCount = 0, vertCount = 0;
};

std::map<VkCommandBuffer, CommandStats> commandbuffer_stats;

We'll protect this by the same global lock as before to ensure multithreaded recording doesn't cause problems (see why the lock + map approach starts to have issues?). We'll also just allow this map to stay around, rather than intercepting the command buffer freeing and command pool create/destroy functions to properly track command buffer lifetimes.

In vkBeginCommandBuffer, we'll reset the stats to 0 (in case the command buffer was previously recorded and then reset). In vkEndCommandBuffer we'll print them out. We've also intercepted the vkCmdDraw and vkCmdDrawIndexed commands, to record the data. In each case, after doing our work we forward along to the next layer in the chain:

VK_LAYER_EXPORT VkResult VKAPI_CALL SampleLayer_BeginCommandBuffer(VkCommandBuffer commandBuffer, const VkCommandBufferBeginInfo* pBeginInfo)
{
  scoped_lock l(global_lock);
  commandbuffer_stats[commandBuffer] = CommandStats();
  return device_dispatch[GetKey(commandBuffer)].BeginCommandBuffer(commandBuffer, pBeginInfo);
}

VK_LAYER_EXPORT void VKAPI_CALL SampleLayer_CmdDraw(
    VkCommandBuffer                             commandBuffer,
    uint32_t                                    vertexCount,
    uint32_t                                    instanceCount,
    uint32_t                                    firstVertex,
    uint32_t                                    firstInstance)
{
  scoped_lock l(global_lock);

  commandbuffer_stats[commandBuffer].drawCount++;
  commandbuffer_stats[commandBuffer].instanceCount += instanceCount;
  commandbuffer_stats[commandBuffer].vertCount += instanceCount*vertexCount;

  device_dispatch[GetKey(commandBuffer)].CmdDraw(commandBuffer, vertexCount, instanceCount, firstVertex, firstInstance);
}

VK_LAYER_EXPORT void VKAPI_CALL SampleLayer_CmdDrawIndexed(
    VkCommandBuffer                             commandBuffer,
    uint32_t                                    indexCount,
    uint32_t                                    instanceCount,
    uint32_t                                    firstIndex,
    int32_t                                     vertexOffset,
    uint32_t                                    firstInstance)
{
  scoped_lock l(global_lock);

  commandbuffer_stats[commandBuffer].drawCount++;
  commandbuffer_stats[commandBuffer].instanceCount += instanceCount;
  commandbuffer_stats[commandBuffer].vertCount += instanceCount*indexCount;

  device_dispatch[GetKey(commandBuffer)].CmdDrawIndexed(commandBuffer, indexCount, instanceCount, firstIndex, vertexOffset, firstInstance);
}

VK_LAYER_EXPORT VkResult VKAPI_CALL SampleLayer_EndCommandBuffer(VkCommandBuffer commandBuffer)
{
  scoped_lock l(global_lock);

  CommandStats &s = commandbuffer_stats[commandBuffer];
  printf("Command buffer %p ended with %u draws, %u instances and %u vertices\n", (void *)commandBuffer, s.drawCount, s.instanceCount, s.vertCount);

  return device_dispatch[GetKey(commandBuffer)].EndCommandBuffer(commandBuffer);
}

Here we can see why we used the loader's dispatch table pointer as the key in our instance_dispatch and device_dispatch maps, instead of using the handle itself. We can take advantage of the fact that there are only two dispatch chains - one for the instance and one for the device and its children. When we have a function on a command buffer and we want to look up the forward pointer in our dispatch table, the pointer for all of the command buffers is exactly the same as the one for the device. This is because they are all on the same device chain!

With that, believe it or not, our layer is actually complete. Aside from some simple destroy functions and the layer enumeration/query functions, I've posted all of the code - all told the layer weighs in at 300 lines.

Explicit and Implicit layers

There's one last concept I want to mention briefly here. Everything I've detailed above will give you what's known as an "explicit" layer. An explicit layer is one that is only ever activated if the user explicitly requests it in vkCreateInstance. The manifest is registered under the list of explicit layers and it is enumerated as normal.

In some cases though, you might want to have a layer that is activated without having to modify your application's code to look for it and enable it. Think of those tools like the debugger or profiler which you use to run your application. You don't want to have to figure out what the name of their layer is and compile it in and out.

This change in activation doesn't make any difference to the code, you just need to change how your layer is registered. When registering your JSON manifest you'll be able to choose whether to register it in the explicit search path or the implicit search path. If you register it as implicit you have to define two extra entries in the manifest:

{
  "file_format_version" : "1.0.0",
  "layer" : {
    "name": "VK_LAYER_SAMPLE_SampleLayer",
    "type": "GLOBAL",
    "library_path": ".\\sample_layer.dll",
    "api_version": "1.0.0",
    "implementation_version": "1",
    "description": "Sample layer - https://renderdoc.org/vulkan-layer-guide.html",
    "functions": {
      "vkGetInstanceProcAddr": "SampleLayer_GetInstanceProcAddr",
      "vkGetDeviceProcAddr": "SampleLayer_GetDeviceProcAddr"
    },
    "enable_environment": {
      "ENABLE_SAMPLE_LAYER": "1"
    },
    "disable_environment": {
      "DISABLE_SAMPLE_LAYER": "1"
    }
  }
}

Example JSON manifest for an implicit layer.

The two extra entries - enable_environment and disable_environment - become the triggers for enabling and disabling your layer. A layer can be considered opt-out (i.e. always activated unless disabled) if enable_environment is omitted. Otherwise the layer is only present when that environment variable is set.

Every implicit layer must have an 'off-switch': an environment variable which can overrule the implicit layer and prevent it from being loaded. This is useful if an application finds that there's an incompatibility with an implicit layer, and wants to stop it loading. If both variables are set, the 'off-switch' has priority and the layer will not be loaded.
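The decision described in the last two paragraphs can be written down as a tiny function. This is a simplified model and not the loader's actual code: ImplicitLayerEnv, Environment and ShouldActivate are names I've made up, the environment is passed in as a map so the logic is easy to test, and the sketch only checks whether a variable is set rather than looking at its value.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model of the two manifest entries for an implicit layer.
struct ImplicitLayerEnv
{
  std::string enableVar;   // variable named in enable_environment, "" if omitted
  std::string disableVar;  // variable named in disable_environment (required)
};

// Simplified environment: variable name -> value.
typedef std::map<std::string, std::string> Environment;

bool ShouldActivate(const ImplicitLayerEnv &layer, const Environment &env)
{
  // an implicit layer without an off-switch isn't valid
  assert(!layer.disableVar.empty());

  // the off-switch always wins, even if the enable variable is also set
  if(env.count(layer.disableVar) != 0)
    return false;

  // no enable_environment means the layer is opt-out: on unless disabled
  if(layer.enableVar.empty())
    return true;

  // otherwise the layer is opt-in: only on when the enable variable is set
  return env.count(layer.enableVar) != 0;
}
```

For the manifest above, that means the layer stays off by default, turns on when ENABLE_SAMPLE_LAYER is set, and turns off again as soon as DISABLE_SAMPLE_LAYER is set as well.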

Conclusion

Hopefully after reading this post you have a better idea of how the loader works, how it initialises and chains together layers, and how to implement a layer yourself. Even if you don't want to implement a layer, you now have an understanding of, for instance, the difference between calling vkGetInstanceProcAddr and vkGetDeviceProcAddr in application code.

Since the application has nothing to do with either dispatch chain, the real difference is that where possible vkGetDeviceProcAddr returns a function pointer directly to the first entry in the dispatch chain - which if no layers are present is often the function pointer directly into the ICD. vkGetInstanceProcAddr may return the same trampoline function that the loader exports which fetches the dispatch table from the dispatchable handle and jumps to it.
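As a rough model of that difference, the sketch below contrasts a trampoline, which looks up the dispatch table from the handle on every call, with a pointer that goes straight to the first function in the chain. All of the names here (DispatchTable, FakeDevice, Trampoline_Draw, Fake_GetInstanceProcAddr, Fake_GetDeviceProcAddr) are invented for illustration and the signatures are heavily simplified.

```cpp
#include <cassert>

// Hypothetical one-entry dispatch table.
struct DispatchTable
{
  int (*Draw)(void *handle, int vertexCount);
};

// Hypothetical dispatchable object: the table pointer is the first member.
struct FakeDevice
{
  const DispatchTable *table;
};

typedef int (*DrawFunc)(void *handle, int vertexCount);

// First real function in the device chain (a layer, or the ICD if no layers).
int FirstInChain_Draw(void *, int vertexCount)
{
  return vertexCount;
}

// What the loader exports: fetch the table from the handle, then jump to it.
int Trampoline_Draw(void *handle, int vertexCount)
{
  assert(handle != nullptr);
  const DispatchTable *table = *(const DispatchTable **)handle;
  return table->Draw(handle, vertexCount);
}

// Model of the instance-level query: it may hand back the exported trampoline.
DrawFunc Fake_GetInstanceProcAddr()
{
  return &Trampoline_Draw;
}

// Model of the device-level query: it can hand back the first chain entry.
DrawFunc Fake_GetDeviceProcAddr(const FakeDevice *device)
{
  assert(device != nullptr);
  return device->table->Draw;
}
```

Both pointers end up doing the same work, but the one from the device-level query skips the extra table lookup on every call.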

The source for the final layer is up on github for you to take a look at.

If you have any questions feel free to poke me on twitter or over email and I'll do my best to answer, or update this post as necessary.