Welcome to GPUPerfAPI’s documentation!
Introduction
The GPU Performance API (GPUPerfAPI, or GPA) is a powerful library to help analyze the performance and execution characteristics of applications using the GPU.
This API is designed to:
- Support Vulkan™, DirectX™ 12, DirectX 11, OpenGL, and OpenCL™ on GCN, RDNA, and RDNA2-based Radeon™ graphics cards and APUs.
- Support Microsoft Windows® and Linux®.
- Provide derived counters based on raw hardware performance counters.
- Provide a fine-grained way to collect performance data for an application.
Usage
This page provides an overview of the GPUPerfAPI library. Please refer to the API Reference Page for more information.
Loading the GPUPerfAPI Library
GPUPerfAPI binary releases include separate library files for each supported API. The following table lists the library file names for each API:
API | Library Names |
---|---|
Vulkan | 64-bit Windows: GPUPerfAPIVK-x64.dll; 32-bit Windows: GPUPerfAPIVK.dll; 64-bit Linux: libGPUPerfAPIVK.so |
DirectX 12 | 64-bit Windows: GPUPerfAPIDX12-x64.dll; 32-bit Windows: GPUPerfAPIDX12.dll |
DirectX 11 | 64-bit Windows: GPUPerfAPIDX11-x64.dll; 32-bit Windows: GPUPerfAPIDX11.dll |
OpenGL | 64-bit Windows: GPUPerfAPIGL-x64.dll; 32-bit Windows: GPUPerfAPIGL.dll; 64-bit Linux: libGPUPerfAPIGL.so |
OpenCL | 64-bit Windows: GPUPerfAPICL-x64.dll; 32-bit Windows: GPUPerfAPICL.dll |
To use the GPUPerfAPI library:
- Include the header file gpu_performance_api/gpu_perf_api.h. For Vulkan, include gpu_performance_api/gpu_perf_api_vk.h.
- Declare a variable of type GpaGetFuncTablePtrType.
- Load the GPUPerfAPI library:
  - On Windows, use LoadLibrary on the GPUPerfAPI DLL for your chosen API (see the table above).
  - On Linux, use dlopen on the GPUPerfAPI shared library for your chosen API (see the table above).
- Get the address of the GpaGetFuncTable function:
  - On Windows, use GetProcAddress.
  - On Linux, use dlsym.
- Call GpaGetFuncTable to get a table of function pointers for each API.
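The manual loading steps can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the library's own loader: `GetFuncTableFn` is a stand-in for `GpaGetFuncTablePtrType` (the real typedef comes from gpu_perf_api.h), and `LoadGetFuncTable` is a hypothetical helper name. The library file name is supplied by the caller, taken from the table above.

```cpp
#include <cstdio>

#ifdef _WIN32
#include <windows.h>
#else
#include <dlfcn.h>
#endif

// Stand-in for GpaGetFuncTablePtrType so the sketch compiles on its own.
// Assumption: the real entry point takes a pointer to the function table.
using GetFuncTableFn = int (*)(void* gpa_function_table);

// Hypothetical helper: loads the given GPUPerfAPI library and looks up the
// GpaGetFuncTable entry point. Returns nullptr if either step fails.
// The library handle is deliberately kept open, because the returned
// function pointer is only valid while the library stays loaded.
GetFuncTableFn LoadGetFuncTable(const char* library_name)
{
#ifdef _WIN32
    HMODULE module = LoadLibraryA(library_name);
    if (module == nullptr)
    {
        return nullptr;
    }
    return reinterpret_cast<GetFuncTableFn>(GetProcAddress(module, "GpaGetFuncTable"));
#else
    void* module = dlopen(library_name, RTLD_NOW);
    if (module == nullptr)
    {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return nullptr;
    }
    return reinterpret_cast<GetFuncTableFn>(dlsym(module, "GpaGetFuncTable"));
#endif
}
```

A caller would pass a name such as libGPUPerfAPIVK.so or GPUPerfAPIVK-x64.dll, check the result against nullptr, and then call the returned function to fill in the function table.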
All of the above can be simplified using the gpu_perf_api_interface_loader.h C++ header file. This header file simplifies the loading and initialization of the GPA entrypoints. The following code shows how to use this header file to load and initialize the DirectX 12 version of GPA:
    #include "gpu_performance_api/gpu_perf_api_interface_loader.h"

    // The static instance pointer of GpaApiManager must be defined in exactly one source file.
    #ifdef __cplusplus
    GpaApiManager* GpaApiManager::gpa_api_manager_ = nullptr;
    #endif

    GpaFuncTableInfo* gpa_function_table_info = nullptr;
    GpaFunctionTable* gpa_function_table = nullptr;

    bool InitializeGpa()
    {
        bool ret_val = false;

        // Load the DirectX 12 version of the GPUPerfAPI library.
        if (kGpaStatusOk == GpaApiManager::Instance()->LoadApi(kGpaApiDirectx12))
        {
            gpa_function_table = GpaApiManager::Instance()->GetFunctionTable(kGpaApiDirectx12);

            if (nullptr != gpa_function_table)
            {
                // Initialize GPA through the function table that was just retrieved.
                ret_val = kGpaStatusOk == gpa_function_table->GpaInitialize(kGpaInitializeDefaultBit);
            }
        }

        return ret_val;
    }
Registering a Logging Callback
An entrypoint is available for registering a callback function which GPUPerfAPI will use to report back additional information about errors and general API usage. It is recommended that all GPUPerfAPI clients register a logging callback for error messages at a minimum. Any time a GPUPerfAPI function returns an error, it will output a log message with more information about the condition that caused the error.
In order to use this feature, you must define a static function with the following signature:
void MyLoggingFunction(GpaLoggingType message_type, const char* message)
The function is registered using the GpaRegisterLoggingCallback entrypoint. The registered function receives callbacks for the message types specified at registration. The message type is passed into the logging function so that different message types can be handled differently if desired. For instance, errors could be output to stderr or be used to raise an assert, while messages and trace information could be output to an application’s or tool’s normal log file. A tool may also want to prefix log messages with a string representation of the log type before writing the message. The messages passed into the logging function will not have a newline at the end, allowing for more flexible handling of the message.
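A logging function along these lines might look like the sketch below. It is illustrative only: `LogType` is a stand-in for `GpaLoggingType` with made-up enumerator names (the real ones are defined in gpu_perf_api.h), and `FormatLogLine` is a hypothetical helper. The sketch prefixes each message with its type, appends the newline that GPUPerfAPI leaves out, and routes errors to stderr.

```cpp
#include <cstdio>
#include <string>

// Stand-in for GpaLoggingType so the sketch compiles on its own. The real
// enumerators come from gpu_perf_api.h; these names are illustrative only.
enum class LogType
{
    kError,
    kMessage,
    kTrace
};

// Builds the line to log: a prefix naming the message type, the message
// itself, and the trailing newline that the incoming message lacks.
std::string FormatLogLine(LogType message_type, const char* message)
{
    const char* prefix = "GPA trace: ";
    if (message_type == LogType::kError)
    {
        prefix = "GPA error: ";
    }
    else if (message_type == LogType::kMessage)
    {
        prefix = "GPA message: ";
    }
    return std::string(prefix) + message + "\n";
}

// Same shape as the MyLoggingFunction signature above: errors go to stderr,
// everything else to stdout (standing in for an application's log file).
void MyLoggingFunction(LogType message_type, const char* message)
{
    std::FILE* stream = (message_type == LogType::kError) ? stderr : stdout;
    std::fputs(FormatLogLine(message_type, message).c_str(), stream);
}
```

In real code the first parameter would be GpaLoggingType, and the function would be passed to GpaRegisterLoggingCallback together with the message types of interest.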
Initializing and Destroying a GPUPerfAPI Instance
GPUPerfAPI must be initialized before the rendering context or device is created, so that the driver can be prepared for accessing hardware data. In the case of DirectX 12 or Vulkan, initialization must be done before a queue is created. Once you are done using GPUPerfAPI, you should destroy the GPUPerfAPI instance. In the case of DirectX 12, destruction must be done before the device is destroyed.
The following methods can be used to initialize and destroy GPUPerfAPI:
GPA Initialization/Destruction Method | Brief Description |
---|---|
GpaInitialize | Initializes the driver so that counters are exposed. |
GpaDestroy | Undoes any initialization to ensure proper behavior in applications that are not being profiled. |
An example of the code used to initialize a GPUPerfAPI instance can be seen above in the GpaInterfaceLoader sample code.
Opening and Closing a Context
After initializing a GPUPerfAPI instance and after the necessary API-specific construct has been created, a context can be opened using the GpaOpenContext function. Once a context is open, you can query the available performance counters and create and begin a session. After you are done using GPUPerfAPI, you should close the context.
The following methods can be used to open and close contexts:
Context Handling Method | Brief Description |
---|---|
GpaOpenContext | Opens the counters in the specified context for reading. |
GpaCloseContext | Closes the counters in the specified context. |
When calling GpaOpenContext, the type of the supplied context differs depending on which API is being used. The table below lists the type that must be passed to GpaOpenContext:
API | GpaOpenContext context Parameter Type |
---|---|
Vulkan | GpaVkContextOpenInfo* (defined in gpu_perf_api_vk.h) |
DirectX 12 | ID3D12Device* |
DirectX 11 | ID3D11Device* |
OpenGL | Windows: HGLRC; Linux: GLXContext |
OpenCL | cl_command_queue* |
Querying a Context and Counters
After opening a context, you can use the returned GpaContextId to query information about the context and the performance counters exposed by the context.
The following methods can be used to query information about the context:
Context Query Method | Brief Description |
---|---|
GpaGetSupportedSampleTypes | Gets a mask of the sample types supported by the specified context. |
GpaGetDeviceAndRevisionId | Gets the GPU device and revision id associated with the specified context. |
GpaGetDeviceName | Gets the device name of the GPU associated with the specified context. |
GpaGetDeviceGeneration | Gets the device generation of the GPU associated with the specified context. |
The following methods can be used to query information about performance counters:
Counter Query Method | Brief Description |
---|---|
GpaGetNumCounters | Gets the number of counters available. |
GpaGetCounterName | Gets the name of the specified counter. |
GpaGetCounterIndex | Gets the index of a counter given its name (case insensitive). |
GpaGetCounterGroup | Gets the group of the specified counter. |
GpaGetCounterDescription | Gets the description of the specified counter. |
GpaGetCounterDataType | Gets the data type of the specified counter. |
GpaGetCounterUsageType | Gets the usage type of the specified counter. |
GpaGetCounterUuid | Gets the UUID of the specified counter. |
GpaGetCounterSampleType | Gets the supported sample type of the specified counter. |
GpaGetDataTypeAsStr | Gets a string with the name of the specified counter data type. |
GpaGetUsageTypeAsStr | Gets a string with the name of the specified counter usage type. |
Creating and Using a Session
After opening a context, a session can be created. A session is the container for enabling counters, sampling GPU workloads, and storing results.
The following methods can be used to manage sessions:
Session Handling Method | Brief Description |
---|---|
GpaCreateSession | Creates a session. |
GpaDeleteSession | Deletes a session object. |
GpaBeginSession | Begins sampling with the currently enabled set of counters. |
GpaEndSession | Ends sampling with the currently enabled set of counters. |
Enabling Counters on a Session
Counters must be enabled after the session has been created with GpaCreateSession but before sampling is started on it with GpaBeginSession.
The following methods can be used to enable/disable counters on a session:
Counter Enable/Disable Method | Brief Description |
---|---|
GpaEnableCounter | Enables a specified counter. |
GpaDisableCounter | Disables a specified counter. |
GpaEnableCounterByName | Enables a specified counter using the counter name (case insensitive). |
GpaDisableCounterByName | Disables a specified counter using the counter name (case insensitive). |
GpaEnableAllCounters | Enables all counters. |
GpaDisableAllCounters | Disables all counters. |
Querying Enabled Counters and Counter Scheduling
A session can also be queried for information about which counters are enabled, as well as the number of passes required for the current set of enabled counters.
The following methods can be used to query enabled counters and counter scheduling on a session:
Counter Scheduling Query Method | Brief Description |
---|---|
GpaGetPassCount | Gets the number of passes required for the currently enabled set of counters. |
GpaGetNumEnabledCounters | Gets the number of enabled counters. |
GpaGetEnabledIndex | Gets the counter index for an enabled counter. |
GpaIsCounterEnabled | Checks whether or not a counter is enabled. |
Creating and Managing Samples
After counters are enabled on a session and the session has been started, GPA command lists and samples can be created. A sample is the GPU workload for which performance counters are to be collected. All enabled counters will be collected for each sample. For DirectX 12 and Vulkan, samples can start on one command list and end on another. There is also special handling needed for DirectX 12 bundles and Vulkan secondary command buffers.
The following methods can be used to create and manage samples on a session:
Sample Handling Method | Brief Description |
---|---|
GpaBeginCommandList | Begins command list for sampling. |
GpaEndCommandList | Ends command list for sampling. |
GpaBeginSample | Begins a sample in a command list. |
GpaEndSample | Ends a sample in a command list. |
GpaContinueSampleOnCommandList | Continues a primary command list sample on another primary command list. |
GpaCopySecondarySamples | Copies a set of samples from a secondary command list back to the primary command list that executed the secondary command list. |
GpaGetSampleCount | Returns the number of samples created for the specified session. |
Querying Results
Once sampling is complete and the session has been ended, the sample results can be read. For DirectX 12 and Vulkan, the command list or command buffer which contains the samples must have been fully executed before results will be available.
The following methods can be used to check if results are available and to read the results for samples:
Results Querying Method | Brief Description |
---|---|
GpaIsPassComplete | Checks whether or not a pass has finished. |
GpaIsSessionComplete | Checks if results for all samples within a session are available. |
GpaGetSampleResultSize | Gets the result size for a given sample. |
GpaGetSampleResult | Gets the result data for a given sample. |
Displaying Status/Error
All GPUPerfAPI functions return a GpaStatus code to indicate success or failure. A simple string representation of the status or error codes can be retrieved using the following method:
Status/Error Helper Method | Brief Description |
---|---|
GpaGetStatusAsStr | Gets a string representation of a GpaStatus value. |
Multi-pass Counter Collection
Collection of some individual counters and some combinations of counters will require more than one pass. After enabling counters, you can query the number of passes required. If the number of passes is greater than one, you will need to execute an identical GPU workload once for each pass. For DirectX 12 and Vulkan, this typically means recording the same command list or command buffer more than once, calling GpaBeginCommandList on each command list for each pass, and beginning and ending samples for the same workloads within the command lists. For other graphics and compute APIs, this means making the same draw calls or dispatching the same kernels in the same sequence multiple times. The same sample id must be found in every pass, and that sample id must be used for the same workload within each pass. If it is impossible or impractical to repeat the operations to be profiled, select a counter set requiring only a single pass. For sets requiring more than one pass, results are available only after all passes are complete.
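The shape of a multi-pass collection loop can be sketched as below. This is a self-contained illustration, not GPUPerfAPI code: `RunAllPasses` and `SampleLog` are hypothetical names, the workloads are plain callbacks, and the comments mark where the real GpaBeginCommandList, GpaBeginSample, GpaEndSample, and GpaEndCommandList calls would go. The point it shows is that every pass replays the same workloads in the same order with the same sample ids.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// One (pass index, sample id) pair per sample begun; stands in for the
// begin/end sample calls a real application would make.
using SampleLog = std::vector<std::pair<std::uint32_t, std::uint32_t>>;

// Hypothetical driver loop: replays the same workloads once per required
// pass, reusing the same sample id for the same workload in every pass.
SampleLog RunAllPasses(std::uint32_t pass_count,
                       const std::vector<std::function<void()>>& workloads)
{
    SampleLog log;
    for (std::uint32_t pass = 0; pass < pass_count; ++pass)
    {
        // Real code: begin a command list for this pass (GpaBeginCommandList).
        for (std::uint32_t sample_id = 0; sample_id < workloads.size(); ++sample_id)
        {
            log.emplace_back(pass, sample_id);  // real code: GpaBeginSample
            workloads[sample_id]();             // identical draw or dispatch in each pass
            // Real code: GpaEndSample.
        }
        // Real code: end the command list (GpaEndCommandList) and submit it.
    }
    return log;
}
```

The pass count passed in would come from GpaGetPassCount after the counters have been enabled.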
Specific Usage Note for Vulkan
In order to enable counter collection in the Vulkan driver, several Vulkan extensions are required. The application being profiled with GPUPerfAPI will need to request those extensions as part of the Vulkan instance and device initialization. GPUPerfAPI simplifies this by defining three macros in the gpu_performance_api/gpu_perf_api_vk.h header file: AMD_GPA_REQUIRED_INSTANCE_EXTENSION_NAME_LIST for the required instance extensions, AMD_GPA_REQUIRED_DEVICE_EXTENSION_NAME_LIST for the required device extensions, and AMD_GPA_OPTIONAL_DEVICE_EXTENSION_NAME_LIST for optional, but recommended, device extensions. The extensions defined in AMD_GPA_REQUIRED_INSTANCE_EXTENSION_NAME_LIST should be included in the VkInstanceCreateInfo structure that is passed to the vkCreateInstance function. Similarly, the extensions defined in AMD_GPA_REQUIRED_DEVICE_EXTENSION_NAME_LIST and AMD_GPA_OPTIONAL_DEVICE_EXTENSION_NAME_LIST should be included in the VkDeviceCreateInfo structure that is passed to the vkCreateDevice function.
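One way to combine an application's own extension list with the GPUPerfAPI lists is sketched below. It is a hedged illustration: `AppendUnique` and `BuildDeviceExtensionList` are hypothetical helpers, and the `VK_EXAMPLE_...` strings are placeholders rather than real extension names. In real code the extra lists would be built from the macros in gpu_perf_api_vk.h, on the assumption that each expands to a comma-separated list of extension name strings.

```cpp
#include <algorithm>
#include <cstring>
#include <vector>

// Appends every name from `extra` that is not already in `names`, so the
// application's own extension list and GPUPerfAPI's lists can be combined
// without duplicates.
void AppendUnique(std::vector<const char*>& names, const std::vector<const char*>& extra)
{
    for (const char* candidate : extra)
    {
        const bool already_listed =
            std::any_of(names.begin(), names.end(), [candidate](const char* existing) {
                return std::strcmp(existing, candidate) == 0;
            });
        if (!already_listed)
        {
            names.push_back(candidate);
        }
    }
}

// Placeholder names only; real code would use the lists from gpu_perf_api_vk.h.
std::vector<const char*> BuildDeviceExtensionList()
{
    std::vector<const char*> names = {"VK_KHR_swapchain"};  // requested by the application

    const std::vector<const char*> gpa_required = {"VK_EXAMPLE_gpa_a"};                      // hypothetical
    const std::vector<const char*> gpa_optional = {"VK_KHR_swapchain", "VK_EXAMPLE_gpa_b"};  // hypothetical

    AppendUnique(names, gpa_required);
    AppendUnique(names, gpa_optional);
    return names;
}
```

The resulting vector would supply ppEnabledExtensionNames and enabledExtensionCount in VkDeviceCreateInfo. Since Vulkan rejects device creation when an enabled extension is unsupported, an application may want to filter the optional names against the extensions the device actually reports before adding them.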
Specific Usage Note for Bundles (DirectX 12) and Secondary Command Buffers (Vulkan)
While samples within a Bundle or Secondary Command Buffer (both referred to here as “secondary command lists”) are supported by GPUPerfAPI, they require special handling. Both the primary and secondary command list must be started using GpaBeginCommandList. Samples can be created on both types of command lists; however, the samples on the secondary command list must be copied back to the primary command list. This is done using the GpaCopySecondarySamples function. Once samples are copied back to the primary command list, results will be available after the primary command list has been executed. Bundles or secondary command buffers must be re-recorded for each counter pass. This also means that extra GpaCommandListId instances must be created (one per pass for each bundle or secondary command buffer) in order to support copying the results from the bundles or secondary command buffers after execution.
Specific Usage Note for Samples that Start and End on Different Command Lists
For DirectX 12 and Vulkan, GPUPerfAPI supports starting a sample on one command list and ending it on another. For this to work properly, the command lists must be executed in the correct order by the application – the command list which ends the sample must be executed after the command list which begins the sample. Both the command list where the sample starts and the command list where the sample ends must be started using GpaBeginCommandList. After the sample has been started on the first command list using GpaBeginSample, it can be continued on another command list by calling GpaContinueSampleOnCommandList. After it has been continued, the sample can be ended using GpaEndSample and specifying the second command list.
Deploying GPUPerfAPI
To deploy an application that uses GPUPerfAPI, simply make sure that the necessary GPUPerfAPI library is available and can be loaded using the normal library search mechanism for the host operating system (i.e. in the PATH on Windows and LD_LIBRARY_PATH on Linux).
When deploying the DirectX 11 version on Windows, you will also need to deploy GPUPerfAPIDXGetAMDDeviceInfo.dll or GPUPerfAPIDXGetAMDDeviceInfo-x64.dll, if you need to support systems with multiple AMD GPUs. This library is used by GPA to determine which GPU is being used for rendering at runtime. For single-GPU systems, this library is not required.
Performance Counters
The performance counters exposed through GPU Performance API are organized into groups to help provide clarity and organization to all the available data. Below is a collective list of counters from all the supported hardware generations. Some of the counters may not be available depending on the hardware being profiled. To view which GPUs belong to which hardware generations, the best reference is the gs_cardInfo array in the common-src-DeviceInfo repository on GitHub. You can see how the various cards map to hardware generations by looking at the GDT_HW_GENERATION enum.
For Graphics workloads, it is recommended that you initially profile with counters from the Timing group to determine whether the profiled calls are worth optimizing (based on the GPUTime value), and which parts of the pipeline are performing the most work. Note that because the GPU is highly parallelized, various parts of the pipeline can be active at the same time; thus, the “Busy” counters are likely to sum to more than 100 percent. After identifying one or more stages to investigate further, enable the corresponding counter groups for more information on the stage and whether or not potential optimizations exist.
Pipeline-Based Counter Groups
On Vega, RDNA, RDNA2, and RDNA3 hardware, certain use cases allow the driver to make optimizations by combining two shader stages together. For example, in a Vertex + Geometry + Pixel Shader pipeline (VS-GS-PS), the Vertex and Geometry Shaders get combined and GPUPerfAPI exposes them in the “VertexGeometry” group (counters with the “VsGs” prefix). In pipelines that use tessellation, the Vertex and Hull Shaders are combined and exposed as the “PreTessellation” group (with “PreTess” prefix), and the Domain and Geometry Shaders (if GS is used) are combined into the “PostTessellation” group (with “PostTess” prefix). Pixel Shaders and Compute Shaders are always exposed as their respective types. The table below may help to visualize the mapping between the API-level shaders (across the top), and which prefixes to look for in the GPUPerfAPI counters.
Pipeline | Vertex | Hull | Domain | Geometry | Pixel | Compute |
---|---|---|---|---|---|---|
VS-PS | VsGs | | | | PS | |
VS-GS-PS | VsGs | | | VsGs | PS | |
VS-HS-DS-PS | PreTess | PreTess | PostTess | PostTess | PS | |
VS-HS-DS-GS-PS | PreTess | PreTess | PostTess | PostTess | PS | |
CS | | | | | | CS |
A Note About Third-Party Applications
Several third-party applications (such as RenderDoc and Microsoft PIX) integrate GPUPerfAPI as part of their profiling feature set. These applications may choose to expose only a subset of the counters supported by GPUPerfAPI, especially in cases where the counters do not support the design goals of the application. Specifically, it is known that the counters reporting a percentage are not exposed. This is due to the way that these tools collect and report aggregate performance counter values for groups of draw calls. For instance, if a set of draw calls is grouped together by a User Marker, a tool may report performance counter values for the User Marker by simply summing up the counter values for the individual draw calls. While this may be valid for many counters, it does not work well for percentage-based counters. Even if the tools were to perform a simple average of the percent values, it still may not provide an accurate reflection of the actual performance. For most of the percentage-based counters, GPUPerfAPI also exposes counters representing the components used to calculate the percentage. One example of this is the cache hit counters – these are exposed both as a cache hit percentage and as individual counters representing the number of cache requests, the number of hits and the number of misses. Please reference the Usage column of the tables below to know which counters will not be exposed by these applications.
Counters Exposed for Graphics Performance Analysis
The following tables show the set of counters exposed for analysis of GPU Graphics workloads, as well as the family of GPUs and APUs on which each counter is available:
RDNA3 Counters
Timing Group
Counter Name | Usage | Brief Description |
---|---|---|
GPUTime | Nanoseconds | Time this API command took to execute on the GPU in nanoseconds from the time the previous command reached the bottom of the pipeline (BOP) to the time this command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel. |
ExecutionDuration | Nanoseconds | GPU command execution duration in nanoseconds, from the time the command enters the top of the pipeline (TOP) to the time the command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel. |
ExecutionStart | Nanoseconds | GPU command execution start time in nanoseconds. This is the time the command enters the top of the pipeline (TOP). |
ExecutionEnd | Nanoseconds | GPU command execution end time in nanoseconds. This is the time the command reaches the bottom of the pipeline (BOP). |
GPUBusy | Percentage | The percentage of time the GPU command processor was busy. |
GPUBusyCycles | Cycles | Number of GPU cycles that the GPU command processor was busy. |
TessellatorBusy | Percentage | The percentage of time the tessellation engine is busy. |
TessellatorBusyCycles | Cycles | Number of GPU cycles that the tessellation engine is busy. |
VsGsBusy | Percentage | The percentage of time the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline. |
VsGsBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline. |
VsGsTime | Nanoseconds | Time VS or GS are busy in nanoseconds in a VS-[GS-]PS pipeline. |
PreTessellationBusy | Percentage | The percentage of time the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation. |
PreTessellationBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation. |
PreTessellationTime | Nanoseconds | Time VS and HS are busy in nanoseconds in a pipeline that uses tessellation. |
PostTessellationBusy | Percentage | The percentage of time the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation. |
PostTessellationBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation. |
PostTessellationTime | Nanoseconds | Time DS or GS are busy in nanoseconds in a pipeline that uses tessellation. |
PSBusy | Percentage | The percentage of time the ShaderUnit has pixel shader work to do. |
PSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has pixel shader work to do. |
PSTime | Nanoseconds | Time pixel shaders are busy in nanoseconds. |
CSBusy | Percentage | The percentage of time the ShaderUnit has compute shader work to do. |
CSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has compute shader work to do. |
CSTime | Nanoseconds | Time compute shaders are busy in nanoseconds. |
PrimitiveAssemblyBusy | Percentage | The percentage of GPUTime that primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate pixel shader or output buffer bottleneck. |
PrimitiveAssemblyBusyCycles | Cycles | Number of GPU cycles the primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate pixel shader or output buffer bottleneck. |
TexUnitBusy | Percentage | The percentage of GPUTime the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account. |
TexUnitBusyCycles | Cycles | Number of GPU cycles the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account. |
DepthStencilTestBusy | Percentage | Percentage of time GPU spent performing depth and stencil tests relative to GPUBusy. |
DepthStencilTestBusyCycles | Cycles | Number of GPU cycles spent performing depth and stencil tests. |
VertexGeometry Group
Counter Name | Usage | Brief Description |
---|---|---|
VsGsVerticesIn | Items | The number of unique vertices processed by the VS and GS. |
VsGsPrimsIn | Items | The number of primitives passed into the GS. |
PreTessellation Group
Counter Name | Usage | Brief Description |
---|---|---|
PreTessVerticesIn | Items | The number of unique vertices processed by the VS and HS when using tessellation. |
PostTessellation Group
Counter Name | Usage | Brief Description |
---|---|---|
PostTessPrimsOut | Items | The number of primitives output by the DS and GS when using tessellation. |
PrimitiveAssembly Group
Counter Name | Usage | Brief Description |
---|---|---|
PrimitivesIn | Items | The number of primitives received by the hardware. This includes primitives generated by tessellation. |
CulledPrims | Items | The number of culled primitives. Typical reasons include scissor, the primitive having zero area, and back or front face culling. |
ClippedPrims | Items | The number of primitives that required one or more clipping operations due to intersecting the view volume or user clip planes. |
PAStalledOnRasterizer | Percentage | Percentage of GPUTime that primitive assembly waits for rasterization to be ready to accept data. This roughly indicates for what percentage of time the pipeline is bottlenecked by pixel operations. |
PAStalledOnRasterizerCycles | Cycles | Number of GPU cycles the primitive assembly waits for rasterization to be ready to accept data. Indicates the number of GPU cycles the pipeline is bottlenecked by pixel operations. |
PixelShader Group
Counter Name | Usage | Brief Description |
---|---|---|
PSPixelsOut | Items | Pixels exported from shader to color buffers. Does not include killed or alpha tested pixels; if there are multiple render targets, each render target receives one export, so this will be 2 for 1 pixel written to two RTs. |
PSExportStalls | Percentage | Pixel shader output stalls. Percentage of GPUBusy. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer. |
PSExportStallsCycles | Cycles | Number of GPU cycles the pixel shader output stalls. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer. |
ComputeShader Group
Counter Name | Usage | Brief Description |
---|---|---|
CSThreadGroups | Items | Total number of thread groups. |
CSWavefronts | Items | The total number of wavefronts used for the CS. |
CSThreads | Items | The number of CS threads processed by the hardware. |
CSThreadGroupSize | Items | The number of CS threads within each thread group. |
CSMemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
CSMemUnitBusyCycles | Cycles | Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. |
CSMemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
CSMemUnitStalledCycles | Cycles | Number of GPU cycles the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. |
CSWriteUnitStalled | Percentage | The percentage of GPUTime the write unit is stalled. |
CSWriteUnitStalledCycles | Cycles | Number of GPU cycles the write unit is stalled. |
CSALUStalledByLDS | Percentage | The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad). |
CSALUStalledByLDSCycles | Cycles | Number of GPU cycles the ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. |
CSLDSBankConflict | Percentage | The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
CSLDSBankConflictCycles | Cycles | Number of GPU cycles the LDS is stalled by bank conflicts. Value range: 0 (optimal) to GPUBusyCycles (bad). |
TextureUnit Group
Counter Name | Usage | Brief Description |
---|---|---|
TexTriFilteringPct | Percentage | Percentage of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified). |
TexTriFilteringCount | Items | Count of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified). |
NoTexTriFilteringCount | Items | Count of pixels that did not receive trilinear filtering. |
TexVolFilteringPct | Percentage | Percentage of pixels that received volume filtering. |
TexVolFilteringCount | Items | Count of pixels that received volume filtering. |
NoTexVolFilteringCount | Items | Count of pixels that did not receive volume filtering. |
TexAveAnisotropy | Items | The average degree of anisotropy applied. A number between 1 and 16. The anisotropic filtering algorithm only applies samples where they are required (e.g. there will be no extra anisotropic samples if the view vector is perpendicular to the surface) so this can be much lower than the requested anisotropy. |
DepthAndStencil Group
Counter Name | Usage | Brief Description |
---|---|---|
HiZTilesAccepted | Percentage | Percentage of tiles accepted by HiZ that will be rendered to the depth or color buffers. |
HiZTilesAcceptedCount | Items | Count of tiles accepted by HiZ that will be rendered to the depth or color buffers. |
HiZTilesRejectedCount | Items | Count of tiles not accepted by HiZ. |
PreZTilesDetailCulled | Percentage | Percentage of tiles rejected because the associated primitive had no contributing area. |
PreZTilesDetailCulledCount | Items | Count of tiles rejected because the associated primitive had no contributing area. |
PreZTilesDetailSurvivingCount | Items | Count of tiles surviving because the associated primitive had contributing area. |
HiZQuadsCulled | Percentage | Percentage of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized. |
HiZQuadsCulledCount | Items | Count of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized. |
HiZQuadsAcceptedCount | Items | Count of quads that did continue on in the pipeline after HiZ. |
PreZQuadsCulled | Percentage | Percentage of quads rejected based on the detailZ and earlyZ tests. |
PreZQuadsCulledCount | Items | Count of quads rejected based on the detailZ and earlyZ tests. |
PreZQuadsSurvivingCount | Items | Count of quads surviving detailZ and earlyZ tests. |
PostZQuads | Percentage | Percentage of quads for which the pixel shader will run and may be postZ tested. |
PostZQuadCount | Items | Count of quads for which the pixel shader will run and may be postZ tested. |
PreZSamplesPassing | Items | Number of samples tested for Z before shading and passed. |
PreZSamplesFailingS | Items | Number of samples tested for Z before shading and failed stencil test. |
PreZSamplesFailingZ | Items | Number of samples tested for Z before shading and failed Z test. |
PostZSamplesPassing | Items | Number of samples tested for Z after shading and passed. |
PostZSamplesFailingS | Items | Number of samples tested for Z after shading and failed stencil test. |
PostZSamplesFailingZ | Items | Number of samples tested for Z after shading and failed Z test. |
ZUnitStalled | Percentage | The percentage of GPUTime the depth buffer spends waiting for the color buffer to be ready to accept data. High figures here indicate a bottleneck in color buffer operations. |
ZUnitStalledCycles | Cycles | Number of GPU cycles the depth buffer spends waiting for the color buffer to be ready to accept data. Larger numbers indicate a bottleneck in color buffer operations. |
DBMemRead | Bytes | Number of bytes read from the depth buffer. |
DBMemWritten | Bytes | Number of bytes written to the depth buffer. |
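The three PreZ sample counters (and likewise the three PostZ sample counters) can be combined into a pass rate. The following is a minimal sketch, assuming that passing, stencil-failing, and Z-failing samples together account for every tested sample. The struct and function names are hypothetical helpers and are not part of GPUPerfAPI.

```cpp
#include <cstdint>

// Hypothetical container for one set of sample counters read back from a
// profiling session (for example PreZSamplesPassing, PreZSamplesFailingS,
// and PreZSamplesFailingZ).
struct ZSampleCounts {
    std::uint64_t passing;    // samples that passed
    std::uint64_t failing_s;  // samples that failed the stencil test
    std::uint64_t failing_z;  // samples that failed the Z test
};

// Returns the percentage of tested samples that passed.
// Returns 0 when no samples were tested, to avoid dividing by zero.
double SamplePassPercentage(const ZSampleCounts& counts) {
    const std::uint64_t tested =
        counts.passing + counts.failing_s + counts.failing_z;
    if (tested == 0) {
        return 0.0;
    }
    return 100.0 * static_cast<double>(counts.passing) /
           static_cast<double>(tested);
}
```

For example, 75 passing samples with 5 stencil failures and 20 Z failures gives a pass rate of 75%.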
ColorBuffer Group¶
Counter Name | Usage | Brief Description |
---|---|---|
CBMemRead | Bytes | Number of bytes read from the color buffer. |
CBColorAndMaskRead | Bytes | Total number of bytes read from the color and mask buffers. |
CBMemWritten | Bytes | Number of bytes written to the color buffer. |
CBColorAndMaskWritten | Bytes | Total number of bytes written to the color and mask buffers. |
MemoryCache Group¶
Counter Name | Usage | Brief Description |
---|---|---|
L0CacheHit | Percentage | The percentage of read requests that hit the data in the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
L0CacheRequestCount | Items | The number of read requests made to the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. |
L0CacheHitCount | Items | The number of read requests which result in a cache hit from the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. |
L0CacheMissCount | Items | The number of read requests which result in a cache miss from the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. |
ScalarCacheHit | Percentage | The percentage of read requests made from executing shader code that hit the data in the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
ScalarCacheRequestCount | Items | The number of read requests made from executing shader code to the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. |
ScalarCacheHitCount | Items | The number of read requests made from executing shader code which result in a cache hit from the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. |
ScalarCacheMissCount | Items | The number of read requests made from executing shader code which result in a cache miss from the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. |
InstCacheHit | Percentage | The percentage of read requests made that hit the data in the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
InstCacheRequestCount | Items | The number of read requests made to the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. |
InstCacheHitCount | Items | The number of read requests which result in a cache hit from the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. |
InstCacheMissCount | Items | The number of read requests which result in a cache miss from the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. |
L1CacheHit | Percentage | The percentage of read or write requests that hit the data in the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
L1CacheRequestCount | Items | The number of read or write requests made to the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. |
L1CacheHitCount | Items | The number of read or write requests which result in a cache hit from the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. |
L1CacheMissCount | Items | The number of read or write requests which result in a cache miss from the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. |
L2CacheHit | Percentage | The percentage of read or write requests that hit the data in the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
L2CacheMiss | Percentage | The percentage of read or write requests that miss the data in the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. Value range: 0% (optimal) to 100% (all miss). |
L2CacheRequestCount | Items | The number of read or write requests made to the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. |
L2CacheHitCount | Items | The number of read or write requests which result in a cache hit from the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. |
L2CacheMissCount | Items | The number of read or write requests which result in a cache miss from the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. |
L0TagConflictReadStalledCycles | Items | The number of cycles read operations from the L0 cache are stalled due to tag conflicts. |
L0TagConflictWriteStalledCycles | Items | The number of cycles write operations to the L0 cache are stalled due to tag conflicts. |
L0TagConflictAtomicStalledCycles | Items | The number of cycles atomic operations on the L0 cache are stalled due to tag conflicts. |
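The hit, miss, and request counters in this group can be turned into a hit rate and an approximate amount of requested data. The sketch below assumes that the hit percentage is the hit count divided by the request count, and uses the request sizes quoted in the descriptions above (128 bytes for L0, L1, and L2; 64 bytes for the Scalar and Instruction caches). The function and constant names are hypothetical and are not part of GPUPerfAPI.

```cpp
#include <cstdint>

// Request sizes taken from the counter descriptions in this group.
constexpr std::uint64_t kVectorCacheRequestBytes = 128;  // L0, L1, L2
constexpr std::uint64_t kScalarOrInstRequestBytes = 64;  // Scalar, Instruction

// Hit rate in percent, assuming hit rate = hits / requests.
// Returns 0 when there were no requests.
double CacheHitPercentage(std::uint64_t hit_count,
                          std::uint64_t request_count) {
    if (request_count == 0) {
        return 0.0;
    }
    return 100.0 * static_cast<double>(hit_count) /
           static_cast<double>(request_count);
}

// Total bytes requested from a cache, given its fixed request size.
std::uint64_t RequestedBytes(std::uint64_t request_count,
                             std::uint64_t bytes_per_request) {
    return request_count * bytes_per_request;
}
```

With these helpers, 150 hits out of 200 L2 requests is a 75% hit rate, and 1000 L2 requests correspond to 128000 requested bytes.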
GlobalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
FetchSize | Bytes | The total bytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
WriteSize | Bytes | The total bytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
MemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
MemUnitBusyCycles | Cycles | Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalledCycles). This is measured with all extra fetches and writes and any cache or memory effects taken into account. |
MemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
MemUnitStalledCycles | Cycles | Number of GPU cycles the memory unit is stalled. |
WriteUnitStalled | Percentage | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
WriteUnitStalledCycles | Cycles | Number of GPU cycles the Write unit is stalled. |
LocalVidMemBytes | Bytes | Number of bytes read from or written to local video memory. |
PcieBytes | Bytes | Number of bytes sent and received over the PCIe bus. |
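Byte counters such as FetchSize and WriteSize become easier to compare across draw calls when divided by GPUTime from the Timing group. Because GPUTime is reported in nanoseconds, bytes per nanosecond is numerically equal to gigabytes per second (with 1 GB taken as 10^9 bytes). The function below is a hypothetical helper, not part of GPUPerfAPI, and it assumes both values were collected for the same sample.

```cpp
#include <cstdint>

// Average memory traffic in GB/s (1 GB = 1e9 bytes) for one sample.
// bytes       : a byte counter such as FetchSize or WriteSize
// gpu_time_ns : GPUTime for the same sample, in nanoseconds
// Returns 0 when the time is not positive.
double BandwidthGBPerSec(std::uint64_t bytes, double gpu_time_ns) {
    if (gpu_time_ns <= 0.0) {
        return 0.0;
    }
    // bytes / ns == (bytes / 1e9) / (ns / 1e9) == GB / s
    return static_cast<double>(bytes) / gpu_time_ns;
}
```

For example, 2,000,000 bytes fetched during a GPUTime of 1,000,000 ns (1 ms) corresponds to an average of 2 GB/s.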
RayTracing Group¶
Counter Name | Usage | Brief Description |
---|---|---|
RayTriTests | Items | The number of ray triangle intersection tests. |
RayBoxTests | Items | The number of ray box intersection tests. |
TotalRayTests | Items | Total number of ray intersection tests, including both box and triangle intersections. |
RayTestsPerWave | Items | The number of ray intersection tests per wave. |
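Since TotalRayTests is described as covering both box and triangle intersections, the two individual counters can be summed and compared. The sketch below assumes the total is exactly the sum of RayTriTests and RayBoxTests. The struct and function names are hypothetical and are not part of GPUPerfAPI.

```cpp
#include <cstdint>

// Hypothetical container for the two ray test counters of one sample.
struct RayTestCounts {
    std::uint64_t tri_tests;  // RayTriTests
    std::uint64_t box_tests;  // RayBoxTests
};

// Sum of box and triangle tests; expected to match TotalRayTests
// under the assumption stated above.
std::uint64_t TotalTests(const RayTestCounts& counts) {
    return counts.tri_tests + counts.box_tests;
}

// Box tests performed per triangle test.
// Returns 0 when no triangle tests were performed.
double BoxTestsPerTriTest(const RayTestCounts& counts) {
    if (counts.tri_tests == 0) {
        return 0.0;
    }
    return static_cast<double>(counts.box_tests) /
           static_cast<double>(counts.tri_tests);
}
```

For a sample with 250 triangle tests and 1000 box tests, the total is 1250 tests and the ratio is 4 box tests per triangle test.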
RDNA2 Counters¶
Timing Group¶
Counter Name | Usage | Brief Description |
---|---|---|
GPUTime | Nanoseconds | Time this API command took to execute on the GPU in nanoseconds from the time the previous command reached the bottom of the pipeline (BOP) to the time this command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel. |
ExecutionDuration | Nanoseconds | GPU command execution duration in nanoseconds, from the time the command enters the top of the pipeline (TOP) to the time the command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel. |
ExecutionStart | Nanoseconds | GPU command execution start time in nanoseconds. This is the time the command enters the top of the pipeline (TOP). |
ExecutionEnd | Nanoseconds | GPU command execution end time in nanoseconds. This is the time the command reaches the bottom of the pipeline (BOP). |
GPUBusy | Percentage | The percentage of time the GPU command processor was busy. |
GPUBusyCycles | Cycles | Number of GPU cycles that the GPU command processor was busy. |
TessellatorBusy | Percentage | The percentage of time the tessellation engine is busy. |
TessellatorBusyCycles | Cycles | Number of GPU cycles that the tessellation engine is busy. |
VsGsBusy | Percentage | The percentage of time the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline. |
VsGsBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline. |
VsGsTime | Nanoseconds | Time VS or GS are busy in nanoseconds in a VS-[GS-]PS pipeline. |
PreTessellationBusy | Percentage | The percentage of time the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation. |
PreTessellationBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation. |
PreTessellationTime | Nanoseconds | Time VS and HS are busy in nanoseconds in a pipeline that uses tessellation. |
PostTessellationBusy | Percentage | The percentage of time the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation. |
PostTessellationBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation. |
PostTessellationTime | Nanoseconds | Time DS or GS are busy in nanoseconds in a pipeline that uses tessellation. |
PSBusy | Percentage | The percentage of time the ShaderUnit has pixel shader work to do. |
PSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has pixel shader work to do. |
PSTime | Nanoseconds | Time pixel shaders are busy in nanoseconds. |
CSBusy | Percentage | The percentage of time the ShaderUnit has compute shader work to do. |
CSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has compute shader work to do. |
CSTime | Nanoseconds | Time compute shaders are busy in nanoseconds. |
PrimitiveAssemblyBusy | Percentage | The percentage of GPUTime that primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate pixel shader or output buffer bottleneck. |
PrimitiveAssemblyBusyCycles | Cycles | Number of GPU cycles the primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate pixel shader or output buffer bottleneck. |
TexUnitBusy | Percentage | The percentage of GPUTime the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account. |
TexUnitBusyCycles | Cycles | Number of GPU cycles the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account. |
DepthStencilTestBusy | Percentage | Percentage of time the GPU spent performing depth and stencil tests, relative to GPUBusy. |
DepthStencilTestBusyCycles | Cycles | Number of GPU cycles spent performing depth and stencil tests. |
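The difference between GPUTime (previous BOP to this BOP) and ExecutionDuration (TOP to BOP) matters when commands overlap in the pipeline. The sketch below illustrates both intervals using ExecutionStart and ExecutionEnd values. It follows the wording of the descriptions above and is not necessarily how the library computes these counters; the struct and function names are hypothetical.

```cpp
// Hypothetical record of one command's ExecutionStart and ExecutionEnd,
// both in nanoseconds.
struct CommandTiming {
    double start_ns;  // time the command enters the top of the pipeline
    double end_ns;    // time the command reaches the bottom of the pipeline
};

// TOP-to-BOP interval of a single command (cf. ExecutionDuration).
double TopToBottomNs(const CommandTiming& cmd) {
    return cmd.end_ns - cmd.start_ns;
}

// BOP-to-BOP interval between two consecutive commands (cf. GPUTime).
// Time during which cur overlapped with prev is not counted.
double BottomToBottomNs(const CommandTiming& prev, const CommandTiming& cur) {
    return cur.end_ns - prev.end_ns;
}
```

If the previous command runs from 0 ns to 100 ns and the current command runs from 60 ns to 150 ns, the TOP-to-BOP interval of the current command is 90 ns, while the BOP-to-BOP interval is only 50 ns because 40 ns of its execution overlapped with the previous command.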
VertexGeometry Group¶
Counter Name | Usage | Brief Description |
---|---|---|
VsGsVerticesIn | Items | The number of unique vertices processed by the VS and GS. |
VsGsPrimsIn | Items | The number of primitives passed into the VS and GS. |
GSVerticesOut | Items | The number of vertices output by the GS. |
PreTessellation Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PreTessVerticesIn | Items | The number of vertices processed by the VS and HS when using tessellation. |
PostTessellation Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PostTessPrimsOut | Items | The number of primitives output by the DS and GS when using tessellation. |
PrimitiveAssembly Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PrimitivesIn | Items | The number of primitives received by the hardware. This includes primitives generated by tessellation. |
CulledPrims | Items | The number of culled primitives. Typical reasons include scissor, the primitive having zero area, and back or front face culling. |
ClippedPrims | Items | The number of primitives that required one or more clipping operations due to intersecting the view volume or user clip planes. |
PAStalledOnRasterizer | Percentage | Percentage of GPUTime that primitive assembly waits for rasterization to be ready to accept data. This roughly indicates for what percentage of time the pipeline is bottlenecked by pixel operations. |
PAStalledOnRasterizerCycles | Cycles | Number of GPU cycles the primitive assembly waits for rasterization to be ready to accept data. Indicates the number of GPU cycles the pipeline is bottlenecked by pixel operations. |
PixelShader Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PSPixelsOut | Items | Pixels exported from shader to color buffers. Does not include killed or alpha tested pixels; if there are multiple render targets, each render target receives one export, so this will be 2 for 1 pixel written to two RTs. |
PSExportStalls | Percentage | Pixel shader output stalls, as a percentage of GPUBusy. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer. |
PSExportStallsCycles | Cycles | Number of GPU cycles the pixel shader output stalls. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer. |
ComputeShader Group¶
Counter Name | Usage | Brief Description |
---|---|---|
CSThreadGroups | Items | Total number of thread groups. |
CSWavefronts | Items | The total number of wavefronts used for the CS. |
CSThreads | Items | The number of CS threads processed by the hardware. |
CSThreadGroupSize | Items | The number of CS threads within each thread group. |
CSMemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
CSMemUnitBusyCycles | Cycles | Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. |
CSMemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
CSMemUnitStalledCycles | Cycles | Number of GPU cycles the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. |
CSWriteUnitStalled | Percentage | The percentage of GPUTime the write unit is stalled. |
CSWriteUnitStalledCycles | Cycles | Number of GPU cycles the write unit is stalled. |
CSGDSInsts | Items | The average number of GDS read or GDS write instructions executed per work-item (affected by flow control). |
CSLDSInsts | Items | The average number of LDS read/write instructions executed per work-item (affected by flow control). |
CSALUStalledByLDS | Percentage | The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad). |
CSALUStalledByLDSCycles | Cycles | Number of GPU cycles the ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. |
CSLDSBankConflict | Percentage | The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
CSLDSBankConflictCycles | Cycles | Number of GPU cycles the LDS is stalled by bank conflicts. Value range: 0 (optimal) to GPUBusyCycles (bad). |
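The thread group counters can be cross-checked against each other with simple arithmetic. The sketch below assumes that every thread group has the same size and that each group is split into whole wavefronts, with the wavefront size (for example 32 or 64 lanes) supplied by the caller. All names are hypothetical and are not part of GPUPerfAPI.

```cpp
#include <cstdint>

// Expected CSThreads value for a dispatch with uniform group size:
// CSThreadGroups * CSThreadGroupSize.
std::uint64_t ExpectedThreads(std::uint64_t thread_groups,
                              std::uint64_t group_size) {
    return thread_groups * group_size;
}

// Wavefronts needed for one thread group, rounding up.
// wave_size is an assumption supplied by the caller; returns 0 if it is 0.
std::uint64_t WavesPerGroup(std::uint64_t group_size,
                            std::uint64_t wave_size) {
    if (wave_size == 0) {
        return 0;
    }
    return (group_size + wave_size - 1) / wave_size;
}

// Fraction of wavefront lanes occupied by real threads in one group.
// Values below 1.0 mean the last wavefront of each group is partly empty.
double LaneUtilization(std::uint64_t group_size, std::uint64_t wave_size) {
    const std::uint64_t waves = WavesPerGroup(group_size, wave_size);
    if (waves == 0) {
        return 0.0;
    }
    return static_cast<double>(group_size) /
           static_cast<double>(waves * wave_size);
}
```

For example, 10 groups of 96 threads give 960 threads. With an assumed wavefront size of 64, each group needs 2 wavefronts and fills 75% of their lanes; with a wavefront size of 32, each group needs 3 fully occupied wavefronts.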
TextureUnit Group¶
Counter Name | Usage | Brief Description |
---|---|---|
TexTriFilteringPct | Percentage | Percentage of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified). |
TexTriFilteringCount | Items | Count of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified). |
NoTexTriFilteringCount | Items | Count of pixels that did not receive trilinear filtering. |
TexVolFilteringPct | Percentage | Percentage of pixels that received volume filtering. |
TexVolFilteringCount | Items | Count of pixels that received volume filtering. |
NoTexVolFilteringCount | Items | Count of pixels that did not receive volume filtering. |
TexAveAnisotropy | Items | The average degree of anisotropy applied. A number between 1 and 16. The anisotropic filtering algorithm only applies samples where they are required (e.g. there will be no extra anisotropic samples if the view vector is perpendicular to the surface) so this can be much lower than the requested anisotropy. |
DepthAndStencil Group¶
Counter Name | Usage | Brief Description |
---|---|---|
HiZTilesAccepted | Percentage | Percentage of tiles accepted by HiZ that will be rendered to the depth or color buffers. |
HiZTilesAcceptedCount | Items | Count of tiles accepted by HiZ that will be rendered to the depth or color buffers. |
HiZTilesRejectedCount | Items | Count of tiles not accepted by HiZ. |
PreZTilesDetailCulled | Percentage | Percentage of tiles rejected because the associated primitive had no contributing area. |
PreZTilesDetailCulledCount | Items | Count of tiles rejected because the associated primitive had no contributing area. |
PreZTilesDetailSurvivingCount | Items | Count of tiles surviving because the associated primitive had contributing area. |
HiZQuadsCulled | Percentage | Percentage of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized. |
HiZQuadsCulledCount | Items | Count of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized. |
HiZQuadsAcceptedCount | Items | Count of quads that did continue on in the pipeline after HiZ. |
PreZQuadsCulled | Percentage | Percentage of quads rejected based on the detailZ and earlyZ tests. |
PreZQuadsCulledCount | Items | Count of quads rejected based on the detailZ and earlyZ tests. |
PreZQuadsSurvivingCount | Items | Count of quads surviving detailZ and earlyZ tests. |
PostZQuads | Percentage | Percentage of quads for which the pixel shader will run and may be postZ tested. |
PostZQuadCount | Items | Count of quads for which the pixel shader will run and may be postZ tested. |
PreZSamplesPassing | Items | Number of samples tested for Z before shading and passed. |
PreZSamplesFailingS | Items | Number of samples tested for Z before shading and failed stencil test. |
PreZSamplesFailingZ | Items | Number of samples tested for Z before shading and failed Z test. |
PostZSamplesPassing | Items | Number of samples tested for Z after shading and passed. |
PostZSamplesFailingS | Items | Number of samples tested for Z after shading and failed stencil test. |
PostZSamplesFailingZ | Items | Number of samples tested for Z after shading and failed Z test. |
ZUnitStalled | Percentage | The percentage of GPUTime the depth buffer spends waiting for the color buffer to be ready to accept data. High figures here indicate a bottleneck in color buffer operations. |
ZUnitStalledCycles | Cycles | Number of GPU cycles the depth buffer spends waiting for the color buffer to be ready to accept data. Larger numbers indicate a bottleneck in color buffer operations. |
DBMemRead | Bytes | Number of bytes read from the depth buffer. |
DBMemWritten | Bytes | Number of bytes written to the depth buffer. |
ColorBuffer Group¶
Counter Name | Usage | Brief Description |
---|---|---|
CBMemRead | Bytes | Number of bytes read from the color buffer. |
CBColorAndMaskRead | Bytes | Total number of bytes read from the color and mask buffers. |
CBMemWritten | Bytes | Number of bytes written to the color buffer. |
CBColorAndMaskWritten | Bytes | Total number of bytes written to the color and mask buffers. |
CBSlowPixelPct | Percentage | Percentage of pixels written to the color buffer using a half-rate or quarter-rate format. |
CBSlowPixelCount | Items | Number of pixels written to the color buffer using a half-rate or quarter-rate format. |
MemoryCache Group¶
Counter Name | Usage | Brief Description |
---|---|---|
L0CacheHit | Percentage | The percentage of read requests that hit the data in the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
L0CacheRequestCount | Items | The number of read requests made to the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. |
L0CacheHitCount | Items | The number of read requests which result in a cache hit from the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. |
L0CacheMissCount | Items | The number of read requests which result in a cache miss from the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. |
ScalarCacheHit | Percentage | The percentage of read requests made from executing shader code that hit the data in the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
ScalarCacheRequestCount | Items | The number of read requests made from executing shader code to the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. |
ScalarCacheHitCount | Items | The number of read requests made from executing shader code which result in a cache hit from the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. |
ScalarCacheMissCount | Items | The number of read requests made from executing shader code which result in a cache miss from the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. |
InstCacheHit | Percentage | The percentage of read requests made that hit the data in the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
InstCacheRequestCount | Items | The number of read requests made to the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. |
InstCacheHitCount | Items | The number of read requests which result in a cache hit from the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. |
InstCacheMissCount | Items | The number of read requests which result in a cache miss from the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. |
L1CacheHit | Percentage | The percentage of read or write requests that hit the data in the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
L1CacheRequestCount | Items | The number of read or write requests made to the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. |
L1CacheHitCount | Items | The number of read or write requests which result in a cache hit from the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. |
L1CacheMissCount | Items | The number of read or write requests which result in a cache miss from the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. |
L2CacheHit | Percentage | The percentage of read or write requests that hit the data in the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
L2CacheMiss | Percentage | The percentage of read or write requests that miss the data in the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. Value range: 0% (optimal) to 100% (all miss). |
L2CacheRequestCount | Items | The number of read or write requests made to the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. |
L2CacheHitCount | Items | The number of read or write requests which result in a cache hit from the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. |
L2CacheMissCount | Items | The number of read or write requests which result in a cache miss from the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. |
L0TagConflictReadStalledCycles | Items | The number of cycles read operations from the L0 cache are stalled due to tag conflicts. |
L0TagConflictWriteStalledCycles | Items | The number of cycles write operations to the L0 cache are stalled due to tag conflicts. |
L0TagConflictAtomicStalledCycles | Items | The number of cycles atomic operations on the L0 cache are stalled due to tag conflicts. |
GlobalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
FetchSize | Bytes | The total bytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
WriteSize | Bytes | The total bytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
MemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
MemUnitBusyCycles | Cycles | Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalledCycles). This is measured with all extra fetches and writes and any cache or memory effects taken into account. |
MemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
MemUnitStalledCycles | Cycles | Number of GPU cycles the memory unit is stalled. |
WriteUnitStalled | Percentage | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
WriteUnitStalledCycles | Cycles | Number of GPU cycles the Write unit is stalled. |
LocalVidMemBytes | Bytes | Number of bytes read from or written to local video memory. |
PcieBytes | Bytes | Number of bytes sent and received over the PCIe bus. |
RayTracing Group¶
Counter Name | Usage | Brief Description |
---|---|---|
RayTriTests | Items | The number of ray triangle intersection tests. |
RayBoxTests | Items | The number of ray box intersection tests. |
TotalRayTests | Items | Total number of ray intersection tests, including both box and triangle intersections. |
RayTestsPerWave | Items | The number of ray intersection tests per wave. |
RDNA Counters¶
Timing Group¶
Counter Name | Usage | Brief Description |
---|---|---|
GPUTime | Nanoseconds | Time this API command took to execute on the GPU in nanoseconds from the time the previous command reached the bottom of the pipeline (BOP) to the time this command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel. |
ExecutionDuration | Nanoseconds | GPU command execution duration in nanoseconds, from the time the command enters the top of the pipeline (TOP) to the time the command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel. |
ExecutionStart | Nanoseconds | GPU command execution start time in nanoseconds. This is the time the command enters the top of the pipeline (TOP). |
ExecutionEnd | Nanoseconds | GPU command execution end time in nanoseconds. This is the time the command reaches the bottom of the pipeline (BOP). |
GPUBusy | Percentage | The percentage of time the GPU command processor was busy. |
GPUBusyCycles | Cycles | Number of GPU cycles that the GPU command processor was busy. |
TessellatorBusy | Percentage | The percentage of time the tessellation engine is busy. |
TessellatorBusyCycles | Cycles | Number of GPU cycles that the tessellation engine is busy. |
VsGsBusy | Percentage | The percentage of time the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline. |
VsGsBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline. |
VsGsTime | Nanoseconds | Time VS or GS are busy in nanoseconds in a VS-[GS-]PS pipeline. |
PreTessellationBusy | Percentage | The percentage of time the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation. |
PreTessellationBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation. |
PreTessellationTime | Nanoseconds | Time VS and HS are busy in nanoseconds in a pipeline that uses tessellation. |
PostTessellationBusy | Percentage | The percentage of time the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation. |
PostTessellationBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation. |
PostTessellationTime | Nanoseconds | Time DS or GS are busy in nanoseconds in a pipeline that uses tessellation. |
PSBusy | Percentage | The percentage of time the ShaderUnit has pixel shader work to do. |
PSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has pixel shader work to do. |
PSTime | Nanoseconds | Time pixel shaders are busy in nanoseconds. |
CSBusy | Percentage | The percentage of time the ShaderUnit has compute shader work to do. |
CSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has compute shader work to do. |
CSTime | Nanoseconds | Time compute shaders are busy in nanoseconds. |
PrimitiveAssemblyBusy | Percentage | The percentage of GPUTime that primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate pixel shader or output buffer bottleneck. |
PrimitiveAssemblyBusyCycles | Cycles | Number of GPU cycles the primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate pixel shader or output buffer bottleneck. |
TexUnitBusy | Percentage | The percentage of GPUTime the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account. |
TexUnitBusyCycles | Cycles | Number of GPU cycles the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account. |
DepthStencilTestBusy | Percentage | Percentage of time GPU spent performing depth and stencil tests relative to GPUBusy. |
DepthStencilTestBusyCycles | Cycles | Number of GPU cycles spent performing depth and stencil tests. |
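The timestamp counters above can be combined by hand when only ExecutionStart and ExecutionEnd were collected. The sketch below computes the TOP-to-BOP duration from the two timestamps. The `CommandTimestamps` struct and the helper names are hypothetical, and the code assumes both values come from the same sample and the same clock.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pair of timestamps read back for one command, in nanoseconds.
struct CommandTimestamps {
    std::uint64_t execution_start_ns;  // ExecutionStart: command enters the top of the pipeline
    std::uint64_t execution_end_ns;    // ExecutionEnd: command reaches the bottom of the pipeline
};

// TOP-to-BOP duration in nanoseconds. Assumes the end timestamp is never
// earlier than the start timestamp.
std::uint64_t DurationNs(const CommandTimestamps& t) {
    assert(t.execution_end_ns >= t.execution_start_ns);
    return t.execution_end_ns - t.execution_start_ns;
}

// Convenience conversion from nanoseconds to milliseconds for reporting.
double NsToMs(std::uint64_t ns) {
    return static_cast<double>(ns) / 1.0e6;
}
```

Because neighboring commands can overlap in the pipeline, durations computed this way for individual commands should not be expected to add up to the frame time.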
VertexGeometry Group¶
Counter Name | Usage | Brief Description |
---|---|---|
VsGsVerticesIn | Items | The number of unique vertices processed by the VS and GS. |
VsGsPrimsIn | Items | The number of primitives passed into the VS and GS. |
GSVerticesOut | Items | The number of vertices output by the GS. |
VsGsVALUInstCount | Items | Average number of vector ALU instructions executed for the VS and GS in a VS-[GS-]PS pipeline. Affected by flow control. |
VsGsSALUInstCount | Items | Average number of scalar ALU instructions executed for the VS and GS. Affected by flow control. |
VsGsVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed for the VS and GS. |
VsGsVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed for the VS and GS. |
VsGsSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed for the VS and GS. |
VsGsSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed for the VS and GS. |
PreTessellation Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PreTessVerticesIn | Items | The number of vertices processed by the VS and HS when using tessellation. |
PreTessVALUInstCount | Items | Average number of vector ALU instructions executed for the VS and HS in a pipeline that uses tessellation. Affected by flow control. |
PreTessSALUInstCount | Items | Average number of scalar ALU instructions executed for the VS and HS in a pipeline that uses tessellation. Affected by flow control. |
PreTessVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation. |
PreTessVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation. |
PreTessSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation. |
PreTessSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation. |
PostTessellation Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PostTessPrimsOut | Items | The number of primitives output by the DS and GS when using tessellation. |
PostTessVALUInstCount | Items | Average number of vector ALU instructions executed for the DS and GS in a pipeline that uses tessellation. Affected by flow control. |
PostTessSALUInstCount | Items | Average number of scalar ALU instructions executed for the DS and GS in a pipeline that uses tessellation. Affected by flow control. |
PostTessVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation. |
PostTessVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation. |
PostTessSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation. |
PostTessSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation. |
PrimitiveAssembly Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PrimitivesIn | Items | The number of primitives received by the hardware. This includes primitives generated by tessellation. |
CulledPrims | Items | The number of culled primitives. Typical reasons include scissor, the primitive having zero area, and back or front face culling. |
ClippedPrims | Items | The number of primitives that required one or more clipping operations due to intersecting the view volume or user clip planes. |
PAStalledOnRasterizer | Percentage | Percentage of GPUTime that primitive assembly waits for rasterization to be ready to accept data. This roughly indicates for what percentage of time the pipeline is bottlenecked by pixel operations. |
PAStalledOnRasterizerCycles | Cycles | Number of GPU cycles the primitive assembly waits for rasterization to be ready to accept data. Indicates the number of GPU cycles the pipeline is bottlenecked by pixel operations. |
PixelShader Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PSPixelsOut | Items | Pixels exported from shader to color buffers. Does not include killed or alpha tested pixels. If there are multiple render targets, each render target receives one export, so one pixel written to two render targets counts as 2. |
PSExportStalls | Percentage | Percentage of GPUBusy during which pixel shader output is stalled. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer. |
PSExportStallsCycles | Cycles | Number of GPU cycles the pixel shader output stalls. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer. |
PSVALUInstCount | Items | Average number of vector ALU instructions executed in the PS. Affected by flow control. |
PSSALUInstCount | Items | Average number of scalar ALU instructions executed in the PS. Affected by flow control. |
PSVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed by the PS. |
PSVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed by the PS. |
PSSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed by the PS. |
PSSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed by the PS. |
ComputeShader Group¶
Counter Name | Usage | Brief Description |
---|---|---|
CSThreadGroups | Items | Total number of thread groups. |
CSWavefronts | Items | The total number of wavefronts used for the CS. |
CSThreads | Items | The number of CS threads processed by the hardware. |
CSThreadGroupSize | Items | The number of CS threads within each thread group. |
CSVALUInsts | Items | The average number of vector ALU instructions executed per work-item (affected by flow control). |
CSVALUUtilization | Percentage | The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of the wave size. Value range: 0% (bad) to 100% (ideal, no thread divergence). |
CSSALUInsts | Items | The average number of scalar ALU instructions executed per work-item (affected by flow control). |
CSVFetchInsts | Items | The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). |
CSSFetchInsts | Items | The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control). |
CSVWriteInsts | Items | The average number of vector write instructions to the video memory executed per work-item (affected by flow control). |
CSVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
CSVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are processed. |
CSSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
CSSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are processed. |
CSMemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
CSMemUnitBusyCycles | Cycles | Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. |
CSMemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
CSMemUnitStalledCycles | Cycles | Number of GPU cycles the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. |
CSWriteUnitStalled | Percentage | The percentage of GPUTime the write unit is stalled. |
CSWriteUnitStalledCycles | Cycles | Number of GPU cycles the write unit is stalled. |
CSGDSInsts | Items | The average number of GDS read or GDS write instructions executed per work-item (affected by flow control). |
CSLDSInsts | Items | The average number of LDS read/write instructions executed per work-item (affected by flow control). |
CSALUStalledByLDS | Percentage | The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad). |
CSALUStalledByLDSCycles | Cycles | Number of GPU cycles the ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. |
CSLDSBankConflict | Percentage | The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
CSLDSBankConflictCycles | Cycles | Number of GPU cycles the LDS is stalled by bank conflicts. Value range: 0 (optimal) to GPUBusyCycles (bad). |
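The CSVALUUtilization description notes that a work-group size that is not a multiple of the wave size lowers utilization. The sketch below estimates that effect in isolation: it computes how many waves one thread group occupies and the best-case share of active lanes, ignoring thread divergence. The function names are hypothetical, and the wave size is passed in as a parameter because it depends on the hardware and compiler (for example 32 or 64).

```cpp
#include <cassert>
#include <cstdint>

// Number of waves needed to hold one thread group (rounds up).
std::uint32_t WavesPerGroup(std::uint32_t group_size, std::uint32_t wave_size) {
    assert(group_size > 0 && wave_size > 0);
    return (group_size + wave_size - 1) / wave_size;
}

// Best-case percentage of active lanes for one thread group. This is only an
// upper bound on what CSVALUUtilization could report, since it ignores
// divergence caused by flow control.
double MaxLaneUtilizationPercent(std::uint32_t group_size, std::uint32_t wave_size) {
    const std::uint32_t lanes = WavesPerGroup(group_size, wave_size) * wave_size;
    return 100.0 * static_cast<double>(group_size) / static_cast<double>(lanes);
}
```

For example, a group of 48 threads with a wave size of 32 occupies two waves (64 lanes), so at most 75% of the lanes can be active, whereas a group of 64 threads fills both waves completely.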
TextureUnit Group¶
Counter Name | Usage | Brief Description |
---|---|---|
TexTriFilteringPct | Percentage | Percentage of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified). |
TexTriFilteringCount | Items | Count of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified). |
NoTexTriFilteringCount | Items | Count of pixels that did not receive trilinear filtering. |
TexVolFilteringPct | Percentage | Percentage of pixels that received volume filtering. |
TexVolFilteringCount | Items | Count of pixels that received volume filtering. |
NoTexVolFilteringCount | Items | Count of pixels that did not receive volume filtering. |
TexAveAnisotropy | Items | The average degree of anisotropy applied. A number between 1 and 16. The anisotropic filtering algorithm only applies samples where they are required (e.g. there will be no extra anisotropic samples if the view vector is perpendicular to the surface) so this can be much lower than the requested anisotropy. |
DepthAndStencil Group¶
Counter Name | Usage | Brief Description |
---|---|---|
HiZTilesAccepted | Percentage | Percentage of tiles accepted by HiZ that will be rendered to the depth or color buffers. |
HiZTilesAcceptedCount | Items | Count of tiles accepted by HiZ that will be rendered to the depth or color buffers. |
HiZTilesRejectedCount | Items | Count of tiles not accepted by HiZ. |
PreZTilesDetailCulled | Percentage | Percentage of tiles rejected because the associated primitive had no contributing area. |
PreZTilesDetailCulledCount | Items | Count of tiles rejected because the associated primitive had no contributing area. |
PreZTilesDetailSurvivingCount | Items | Count of tiles surviving because the associated primitive had contributing area. |
HiZQuadsCulled | Percentage | Percentage of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized. |
HiZQuadsCulledCount | Items | Count of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized. |
HiZQuadsAcceptedCount | Items | Count of quads that did continue on in the pipeline after HiZ. |
PreZQuadsCulled | Percentage | Percentage of quads rejected based on the detailZ and earlyZ tests. |
PreZQuadsCulledCount | Items | Count of quads rejected based on the detailZ and earlyZ tests. |
PreZQuadsSurvivingCount | Items | Count of quads surviving detailZ and earlyZ tests. |
PostZQuads | Percentage | Percentage of quads for which the pixel shader will run and may be postZ tested. |
PostZQuadCount | Items | Count of quads for which the pixel shader will run and may be postZ tested. |
PreZSamplesPassing | Items | Number of samples tested for Z before shading and passed. |
PreZSamplesFailingS | Items | Number of samples tested for Z before shading and failed stencil test. |
PreZSamplesFailingZ | Items | Number of samples tested for Z before shading and failed Z test. |
PostZSamplesPassing | Items | Number of samples tested for Z after shading and passed. |
PostZSamplesFailingS | Items | Number of samples tested for Z after shading and failed stencil test. |
PostZSamplesFailingZ | Items | Number of samples tested for Z after shading and failed Z test. |
ZUnitStalled | Percentage | The percentage of GPUTime the depth buffer spends waiting for the color buffer to be ready to accept data. High figures here indicate a bottleneck in color buffer operations. |
ZUnitStalledCycles | Cycles | Number of GPU cycles the depth buffer spends waiting for the color buffer to be ready to accept data. Larger numbers indicate a bottleneck in color buffer operations. |
DBMemRead | Bytes | Number of bytes read from the depth buffer. |
DBMemWritten | Bytes | Number of bytes written to the depth buffer. |
ColorBuffer Group¶
Counter Name | Usage | Brief Description |
---|---|---|
CBMemRead | Bytes | Number of bytes read from the color buffer. |
CBColorAndMaskRead | Bytes | Total number of bytes read from the color and mask buffers. |
CBMemWritten | Bytes | Number of bytes written to the color buffer. |
CBColorAndMaskWritten | Bytes | Total number of bytes written to the color and mask buffers. |
CBSlowPixelPct | Percentage | Percentage of pixels written to the color buffer using a half-rate or quarter-rate format. |
CBSlowPixelCount | Items | Number of pixels written to the color buffer using a half-rate or quarter-rate format. |
MemoryCache Group¶
Counter Name | Usage | Brief Description |
---|---|---|
L0CacheHit | Percentage | The percentage of read requests that hit the data in the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
L0CacheRequestCount | Items | The number of read requests made to the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. |
L0CacheHitCount | Items | The number of read requests which result in a cache hit from the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. |
L0CacheMissCount | Items | The number of read requests which result in a cache miss from the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. |
ScalarCacheHit | Percentage | The percentage of read requests made from executing shader code that hit the data in the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
ScalarCacheRequestCount | Items | The number of read requests made from executing shader code to the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. |
ScalarCacheHitCount | Items | The number of read requests made from executing shader code which result in a cache hit from the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. |
ScalarCacheMissCount | Items | The number of read requests made from executing shader code which result in a cache miss from the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. |
InstCacheHit | Percentage | The percentage of read requests made that hit the data in the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
InstCacheRequestCount | Items | The number of read requests made to the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. |
InstCacheHitCount | Items | The number of read requests which result in a cache hit from the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. |
InstCacheMissCount | Items | The number of read requests which result in a cache miss from the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. |
L1CacheHit | Percentage | The percentage of read or write requests that hit the data in the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
L1CacheRequestCount | Items | The number of read or write requests made to the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. |
L1CacheHitCount | Items | The number of read or write requests which result in a cache hit from the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. |
L1CacheMissCount | Items | The number of read or write requests which result in a cache miss from the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. |
L2CacheHit | Percentage | The percentage of read or write requests that hit the data in the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal). |
L2CacheMiss | Percentage | The percentage of read or write requests that miss the data in the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. Value range: 0% (optimal) to 100% (all miss). |
L2CacheRequestCount | Items | The number of read or write requests made to the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. |
L2CacheHitCount | Items | The number of read or write requests which result in a cache hit from the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. |
L2CacheMissCount | Items | The number of read or write requests which result in a cache miss from the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. |
L0TagConflictReadStalledCycles | Items | The number of cycles read operations from the L0 cache are stalled due to tag conflicts. |
L0TagConflictWriteStalledCycles | Items | The number of cycles write operations to the L0 cache are stalled due to tag conflicts. |
L0TagConflictAtomicStalledCycles | Items | The number of cycles atomic operations on the L0 cache are stalled due to tag conflicts. |
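The hit percentages above can be cross-checked against the hit and miss counts. The sketch below is a minimal illustration with hypothetical function names. It assumes that every request is counted as either a hit or a miss, so that the request count equals the sum of the two, and it uses the per-request sizes stated in the table to estimate the volume of data requested.

```cpp
#include <cassert>
#include <cstdint>

// Request sizes as stated in the counter descriptions.
constexpr std::uint64_t kVectorRequestBytes = 128;  // L0, L1, and L2 caches
constexpr std::uint64_t kScalarRequestBytes = 64;   // Scalar and Instruction caches

// Hit percentage from a hit count and a miss count.
// Assumption: requests = hits + misses. Returns 0 when there were no requests.
double HitPercent(std::uint64_t hit_count, std::uint64_t miss_count) {
    const std::uint64_t requests = hit_count + miss_count;
    assert(requests >= hit_count);  // guards against 64-bit wraparound in the sum
    if (requests == 0) return 0.0;
    return 100.0 * static_cast<double>(hit_count) / static_cast<double>(requests);
}

// Approximate number of bytes covered by a given number of requests.
std::uint64_t RequestedBytes(std::uint64_t request_count, std::uint64_t bytes_per_request) {
    return request_count * bytes_per_request;
}
```

With 750 hits and 250 misses, `HitPercent` yields 75%, and 1000 L0 requests correspond to 128000 requested bytes.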
GlobalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
FetchSize | Bytes | The total bytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
WriteSize | Bytes | The total bytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
MemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
MemUnitBusyCycles | Cycles | Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalledCycles). This is measured with all extra fetches and writes and any cache or memory effects taken into account. |
MemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
MemUnitStalledCycles | Cycles | Number of GPU cycles the memory unit is stalled. |
WriteUnitStalled | Percentage | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
WriteUnitStalledCycles | Cycles | Number of GPU cycles the Write unit is stalled. |
LocalVidMemBytes | Bytes | Number of bytes read from or written to local video memory. |
PcieBytes | Bytes | Number of bytes sent and received over the PCIe bus. |
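One way to put FetchSize and WriteSize in context is to divide them by GPUTime to get an effective memory bandwidth figure. The sketch below does this with a hypothetical helper. It assumes that all three values were collected for the same sample, and it relies on the fact that one byte per nanosecond equals one gigabyte (10^9 bytes) per second.

```cpp
#include <cassert>
#include <cstdint>

// Effective video memory bandwidth in GB/s (10^9 bytes per second).
// fetch_bytes and write_bytes correspond to FetchSize and WriteSize;
// gpu_time_ns corresponds to GPUTime. Returns 0 when no time elapsed.
double EffectiveBandwidthGBps(std::uint64_t fetch_bytes,
                              std::uint64_t write_bytes,
                              double gpu_time_ns) {
    assert(gpu_time_ns >= 0.0);  // a negative GPUTime would indicate bad input
    if (gpu_time_ns == 0.0) return 0.0;
    // bytes per nanosecond is numerically equal to GB per second
    return static_cast<double>(fetch_bytes + write_bytes) / gpu_time_ns;
}
```

The result is only an estimate for the measured command and should be compared against the theoretical bandwidth of the specific GPU, which this page does not list.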
Vega Counters¶
Timing Group¶
Counter Name | Usage | Brief Description |
---|---|---|
GPUTime | Nanoseconds | Time this API command took to execute on the GPU in nanoseconds from the time the previous command reached the bottom of the pipeline (BOP) to the time this command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel. |
ExecutionDuration | Nanoseconds | GPU command execution duration in nanoseconds, from the time the command enters the top of the pipeline (TOP) to the time the command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel. |
ExecutionStart | Nanoseconds | GPU command execution start time in nanoseconds. This is the time the command enters the top of the pipeline (TOP). |
ExecutionEnd | Nanoseconds | GPU command execution end time in nanoseconds. This is the time the command reaches the bottom of the pipeline (BOP). |
GPUBusy | Percentage | The percentage of time the GPU command processor was busy. |
GPUBusyCycles | Cycles | Number of GPU cycles that the GPU command processor was busy. |
TessellatorBusy | Percentage | The percentage of time the tessellation engine is busy. |
TessellatorBusyCycles | Cycles | Number of GPU cycles that the tessellation engine is busy. |
VsGsBusy | Percentage | The percentage of time the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline. |
VsGsBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline. |
VsGsTime | Nanoseconds | Time VS or GS are busy in nanoseconds in a VS-[GS-]PS pipeline. |
PreTessellationBusy | Percentage | The percentage of time the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation. |
PreTessellationBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation. |
PreTessellationTime | Nanoseconds | Time VS and HS are busy in nanoseconds in a pipeline that uses tessellation. |
PostTessellationBusy | Percentage | The percentage of time the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation. |
PostTessellationBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation. |
PostTessellationTime | Nanoseconds | Time DS or GS are busy in nanoseconds in a pipeline that uses tessellation. |
PSBusy | Percentage | The percentage of time the ShaderUnit has pixel shader work to do. |
PSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has pixel shader work to do. |
PSTime | Nanoseconds | Time pixel shaders are busy in nanoseconds. |
CSBusy | Percentage | The percentage of time the ShaderUnit has compute shader work to do. |
CSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has compute shader work to do. |
CSTime | Nanoseconds | Time compute shaders are busy in nanoseconds. |
PrimitiveAssemblyBusy | Percentage | The percentage of GPUTime that primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate pixel shader or output buffer bottleneck. |
PrimitiveAssemblyBusyCycles | Cycles | Number of GPU cycles the primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate pixel shader or output buffer bottleneck. |
TexUnitBusy | Percentage | The percentage of GPUTime the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account. |
TexUnitBusyCycles | Cycles | Number of GPU cycles the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account. |
DepthStencilTestBusy | Percentage | Percentage of time GPU spent performing depth and stencil tests relative to GPUBusy. |
DepthStencilTestBusyCycles | Cycles | Number of GPU cycles spent performing depth and stencil tests. |
VertexGeometry Group¶
Counter Name | Usage | Brief Description |
---|---|---|
VsGsVerticesIn | Items | The number of unique vertices processed by the VS and GS. |
VsGsPrimsIn | Items | The number of primitives passed into the VS and GS. |
GSVerticesOut | Items | The number of vertices output by the GS. |
VsGsVALUInstCount | Items | Average number of vector ALU instructions executed in the VS and GS in a VS-[GS-]PS pipeline. Affected by flow control. |
VsGsSALUInstCount | Items | Average number of scalar ALU instructions executed in the VS and GS in a VS-[GS-]PS pipeline. Affected by flow control. |
VsGsVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed by the VS and GS in a VS-[GS-]PS pipeline. |
VsGsVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed by the VS and GS in a VS-[GS-]PS pipeline. |
VsGsSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed by the VS and GS in a VS-[GS-]PS pipeline. |
VsGsSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed by the VS and GS in a VS-[GS-]PS pipeline. |
PreTessellation Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PreTessVerticesIn | Items | The number of vertices processed by the VS and HS when using tessellation. |
PreTessVALUInstCount | Items | Average number of vector ALU instructions executed for the VS and HS in a pipeline that uses tessellation. Affected by flow control. |
PreTessSALUInstCount | Items | Average number of scalar ALU instructions executed for the VS and HS in a pipeline that uses tessellation. Affected by flow control. |
PreTessVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation. |
PreTessVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation. |
PreTessSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation. |
PreTessSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation. |
PostTessellation Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PostTessPrimsOut | Items | The number of primitives output by the DS and GS when using tessellation. |
PostTessVALUInstCount | Items | Average number of vector ALU instructions executed for the DS and GS in a pipeline that uses tessellation. Affected by flow control. |
PostTessSALUInstCount | Items | Average number of scalar ALU instructions executed for the DS and GS in a pipeline that uses tessellation. Affected by flow control. |
PostTessVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation. |
PostTessVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation. |
PostTessSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation. |
PostTessSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation. |
PrimitiveAssembly Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PrimitivesIn | Items | The number of primitives received by the hardware. This includes primitives generated by tessellation. |
CulledPrims | Items | The number of culled primitives. Typical reasons include scissor, the primitive having zero area, and back or front face culling. |
ClippedPrims | Items | The number of primitives that required one or more clipping operations due to intersecting the view volume or user clip planes. |
PAStalledOnRasterizer | Percentage | Percentage of GPUTime that primitive assembly waits for rasterization to be ready to accept data. This roughly indicates for what percentage of time the pipeline is bottlenecked by pixel operations. |
PAStalledOnRasterizerCycles | Cycles | Number of GPU cycles the primitive assembly waits for rasterization to be ready to accept data. Indicates the number of GPU cycles the pipeline is bottlenecked by pixel operations. |
PixelShader Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PSPixelsOut | Items | Pixels exported from shader to color buffers. Does not include killed or alpha tested pixels; if there are multiple render targets, each render target receives one export, so this will be 2 for 1 pixel written to two RTs. |
PSExportStalls | Percentage | Pixel shader output stalls. Percentage of GPUBusy. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer. |
PSExportStallsCycles | Cycles | Number of GPU cycles the pixel shader output stalls. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer. |
PSVALUInstCount | Items | Average number of vector ALU instructions executed in the PS. Affected by flow control. |
PSSALUInstCount | Items | Average number of scalar ALU instructions executed in the PS. Affected by flow control. |
PSVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed by the PS. |
PSVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed by the PS. |
PSSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed by the PS. |
PSSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed by the PS. |
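The PSPixelsOut description counts one export per render target, which is easy to misread when comparing against an expected pixel count. The C sketch below is only an illustration of that arithmetic and is not part of GPUPerfAPI; the function name `expected_ps_exports` and its parameters are hypothetical, and it assumes every surviving pixel is written to every bound render target.

```c
#include <assert.h>
#include <stdint.h>

/* Expected export count for pixels that survive kill and alpha test.
   Assumption: each surviving pixel is exported once per render target. */
static uint64_t expected_ps_exports(uint64_t surviving_pixels,
                                    uint32_t render_target_count)
{
    /* Guard against 64-bit overflow in the multiplication below. */
    assert(render_target_count == 0u ||
           surviving_pixels <= UINT64_MAX / render_target_count);
    return surviving_pixels * (uint64_t)render_target_count;
}
```

With one pixel and two render targets this gives 2, matching the case described in the table.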
ComputeShader Group¶
Counter Name | Usage | Brief Description |
---|---|---|
CSThreadGroups | Items | Total number of thread groups. |
CSWavefronts | Items | The total number of wavefronts used for the CS. |
CSThreads | Items | The number of CS threads processed by the hardware. |
CSThreadGroupSize | Items | The number of CS threads within each thread group. |
CSVALUInsts | Items | The average number of vector ALU instructions executed per work-item (affected by flow control). |
CSVALUUtilization | Percentage | The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence). |
CSSALUInsts | Items | The average number of scalar ALU instructions executed per work-item (affected by flow control). |
CSVFetchInsts | Items | The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). |
CSSFetchInsts | Items | The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control). |
CSVWriteInsts | Items | The average number of vector write instructions to the video memory executed per work-item (affected by flow control). |
CSFlatVMemInsts | Items | The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch. |
CSVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
CSVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are processed. |
CSSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
CSSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are processed. |
CSMemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
CSMemUnitBusyCycles | Cycles | Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. |
CSMemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
CSMemUnitStalledCycles | Cycles | Number of GPU cycles the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. |
CSWriteUnitStalled | Percentage | The percentage of GPUTime the write unit is stalled. |
CSWriteUnitStalledCycles | Cycles | Number of GPU cycles the write unit is stalled. |
CSGDSInsts | Items | The average number of GDS read or GDS write instructions executed per work item (affected by flow control). |
CSLDSInsts | Items | The average number of LDS read/write instructions executed per work-item (affected by flow control). |
CSFlatLDSInsts | Items | The average number of FLAT instructions that read from or write to LDS executed per work item (affected by flow control). |
CSALUStalledByLDS | Percentage | The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad). |
CSALUStalledByLDSCycles | Cycles | Number of GPU cycles the ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. |
CSLDSBankConflict | Percentage | The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
CSLDSBankConflictCycles | Cycles | Number of GPU cycles the LDS is stalled by bank conflicts. Value range: 0 (optimal) to GPUBusyCycles (bad). |
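The CSVALUUtilization note about work-group sizes can be illustrated with a little arithmetic. The following C sketch is not part of GPUPerfAPI; the helper names `waves_per_group` and `max_lane_utilization_pct` are hypothetical, and it assumes threads are packed into fixed-width wavefronts (64 lanes, matching the "multiple of 64" wording above). It gives only the upper bound imposed by the group size and ignores thread divergence.

```c
#include <assert.h>
#include <stdint.h>

/* Number of wavefronts needed to hold one thread group (rounds up). */
static uint32_t waves_per_group(uint32_t group_size, uint32_t wave_size)
{
    assert(wave_size > 0u);
    return (group_size + wave_size - 1u) / wave_size;
}

/* Best-case percentage of active lanes for a given group size.
   Divergence inside the shader can only lower this figure. */
static double max_lane_utilization_pct(uint32_t group_size, uint32_t wave_size)
{
    uint32_t waves = waves_per_group(group_size, wave_size);
    if (waves == 0u) {
        return 0.0; /* empty group: nothing to utilize */
    }
    return 100.0 * (double)group_size / ((double)waves * (double)wave_size);
}
```

For example, a 96-thread group on 64-wide wavefronts needs two wavefronts and can fill at most 75% of their lanes, while a 64-thread or 128-thread group can reach 100%.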
TextureUnit Group¶
Counter Name | Usage | Brief Description |
---|---|---|
TexTriFilteringPct | Percentage | Percentage of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified). |
TexTriFilteringCount | Items | Count of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified). |
NoTexTriFilteringCount | Items | Count of pixels that did not receive trilinear filtering. |
TexVolFilteringPct | Percentage | Percentage of pixels that received volume filtering. |
TexVolFilteringCount | Items | Count of pixels that received volume filtering. |
NoTexVolFilteringCount | Items | Count of pixels that did not receive volume filtering. |
TexAveAnisotropy | Items | The average degree of anisotropy applied. A number between 1 and 16. The anisotropic filtering algorithm only applies samples where they are required (e.g. there will be no extra anisotropic samples if the view vector is perpendicular to the surface) so this can be much lower than the requested anisotropy. |
DepthAndStencil Group¶
Counter Name | Usage | Brief Description |
---|---|---|
HiZTilesAccepted | Percentage | Percentage of tiles accepted by HiZ that will be rendered to the depth or color buffers. |
HiZTilesAcceptedCount | Items | Count of tiles accepted by HiZ that will be rendered to the depth or color buffers. |
HiZTilesRejectedCount | Items | Count of tiles not accepted by HiZ. |
PreZTilesDetailCulled | Percentage | Percentage of tiles rejected because the associated primitive had no contributing area. |
PreZTilesDetailCulledCount | Items | Count of tiles rejected because the associated primitive had no contributing area. |
PreZTilesDetailSurvivingCount | Items | Count of tiles surviving because the associated primitive had contributing area. |
HiZQuadsCulled | Percentage | Percentage of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized. |
HiZQuadsCulledCount | Items | Count of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized. |
HiZQuadsAcceptedCount | Items | Count of quads that did continue on in the pipeline after HiZ. |
PreZQuadsCulled | Percentage | Percentage of quads rejected based on the detailZ and earlyZ tests. |
PreZQuadsCulledCount | Items | Count of quads rejected based on the detailZ and earlyZ tests. |
PreZQuadsSurvivingCount | Items | Count of quads surviving detailZ and earlyZ tests. |
PostZQuads | Percentage | Percentage of quads for which the pixel shader will run and may be postZ tested. |
PostZQuadCount | Items | Count of quads for which the pixel shader will run and may be postZ tested. |
PreZSamplesPassing | Items | Number of samples tested for Z before shading and passed. |
PreZSamplesFailingS | Items | Number of samples tested for Z before shading and failed stencil test. |
PreZSamplesFailingZ | Items | Number of samples tested for Z before shading and failed Z test. |
PostZSamplesPassing | Items | Number of samples tested for Z after shading and passed. |
PostZSamplesFailingS | Items | Number of samples tested for Z after shading and failed stencil test. |
PostZSamplesFailingZ | Items | Number of samples tested for Z after shading and failed Z test. |
ZUnitStalled | Percentage | The percentage of GPUTime the depth buffer spends waiting for the color buffer to be ready to accept data. High figures here indicate a bottleneck in color buffer operations. |
ZUnitStalledCycles | Cycles | Number of GPU cycles the depth buffer spends waiting for the color buffer to be ready to accept data. Larger numbers indicate a bottleneck in color buffer operations. |
DBMemRead | Bytes | Number of bytes read from the depth buffer. |
DBMemWritten | Bytes | Number of bytes written to the depth buffer. |
ColorBuffer Group¶
Counter Name | Usage | Brief Description |
---|---|---|
CBMemRead | Bytes | Number of bytes read from the color buffer. |
CBColorAndMaskRead | Bytes | Total number of bytes read from the color and mask buffers. |
CBMemWritten | Bytes | Number of bytes written to the color buffer. |
CBColorAndMaskWritten | Bytes | Total number of bytes written to the color and mask buffers. |
CBSlowPixelPct | Percentage | Percentage of pixels written to the color buffer using a half-rate or quarter-rate format. |
CBSlowPixelCount | Items | Number of pixels written to the color buffer using a half-rate or quarter-rate format. |
MemoryCache Group¶
Counter Name | Usage | Brief Description |
---|---|---|
L0TagConflictReadStalledCycles | Items | The number of cycles read operations from the L0 cache are stalled due to tag conflicts. |
L0TagConflictWriteStalledCycles | Items | The number of cycles write operations to the L0 cache are stalled due to tag conflicts. |
L0TagConflictAtomicStalledCycles | Items | The number of cycles atomic operations on the L0 cache are stalled due to tag conflicts. |
GlobalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
FetchSize | Bytes | The total bytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
WriteSize | Bytes | The total bytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
L1CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data in L1 cache. Value range: 0% (no hit) to 100% (optimal). |
L1CacheHitCount | Items | Count of fetch, write, atomic, and other instructions that hit the data in L1 cache. |
L1CacheMissCount | Items | Count of fetch, write, atomic, and other instructions that miss the data in L1 cache. |
L2CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the L2 cache. Value range: 0% (no hit) to 100% (optimal). |
L2CacheMiss | Percentage | The percentage of fetch, write, atomic, and other instructions that miss the L2 cache. Value range: 0% (optimal) to 100% (all miss). |
L2CacheHitCount | Items | Count of fetch, write, atomic, and other instructions that hit the L2 cache. |
L2CacheMissCount | Items | Count of fetch, write, atomic, and other instructions that miss the L2 cache. |
MemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
MemUnitBusyCycles | Cycles | Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalledCycles). This is measured with all extra fetches and writes and any cache or memory effects taken into account. |
MemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
MemUnitStalledCycles | Cycles | Number of GPU cycles the memory unit is stalled. |
WriteUnitStalled | Percentage | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
WriteUnitStalledCycles | Cycles | Number of GPU cycles the Write unit is stalled. |
LocalVidMemBytes | Bytes | Number of bytes read from or written to local video memory. |
PcieBytes | Bytes | Number of bytes sent and received over the PCIe bus. |
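Several counters in this group come in pairs of hit and miss counts alongside a percentage. As a rough sketch of how such a percentage can be cross-checked from the counts, the C helper below assumes the percentage is hits divided by hits plus misses. That relationship is consistent with the descriptions above but is not confirmed by them, and the name `hit_percentage` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Percentage of accesses that hit, given separate hit and miss counts.
   Returns 0.0 when no accesses were recorded, to avoid dividing by zero. */
static double hit_percentage(uint64_t hit_count, uint64_t miss_count)
{
    double total = (double)hit_count + (double)miss_count;
    if (total == 0.0) {
        return 0.0;
    }
    double pct = 100.0 * (double)hit_count / total;
    assert(pct >= 0.0 && pct <= 100.0); /* sanity check on the result */
    return pct;
}
```

Passing L2CacheHitCount and L2CacheMissCount would, under this assumption, give a value comparable to L2CacheHit, and 100 minus that value would be comparable to L2CacheMiss.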
Graphics IP v8 Counters¶
Timing Group¶
Counter Name | Usage | Brief Description |
---|---|---|
GPUTime | Nanoseconds | Time this API command took to execute on the GPU in nanoseconds from the time the previous command reached the bottom of the pipeline (BOP) to the time this command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel. |
ExecutionDuration | Nanoseconds | GPU command execution duration in nanoseconds, from the time the command enters the top of the pipeline (TOP) to the time the command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel. |
ExecutionStart | Nanoseconds | GPU command execution start time in nanoseconds. This is the time the command enters the top of the pipeline (TOP). |
ExecutionEnd | Nanoseconds | GPU command execution end time in nanoseconds. This is the time the command reaches the bottom of the pipeline (BOP). |
GPUBusy | Percentage | The percentage of time GPU was busy. |
GPUBusyCycles | Cycles | Number of GPU cycles that the GPU was busy. |
TessellatorBusy | Percentage | The percentage of time the tessellation engine is busy. |
TessellatorBusyCycles | Cycles | Number of GPU cycles that the tessellation engine is busy. |
VSBusy | Percentage | The percentage of time the ShaderUnit has vertex shader work to do. |
VSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has vertex shader work to do. |
VSTime | Nanoseconds | Time vertex shaders are busy in nanoseconds. |
HSBusy | Percentage | The percentage of time the ShaderUnit has hull shader work to do. |
HSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has hull shader work to do. |
HSTime | Nanoseconds | Time hull shaders are busy in nanoseconds. |
DSBusy | Percentage | The percentage of time the ShaderUnit has domain shader work to do. |
DSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has domain shader work to do. |
DSTime | Nanoseconds | Time domain shaders are busy in nanoseconds. |
GSBusy | Percentage | The percentage of time the ShaderUnit has geometry shader work to do. |
GSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has geometry shader work to do. |
GSTime | Nanoseconds | Time geometry shaders are busy in nanoseconds. |
PSBusy | Percentage | The percentage of time the ShaderUnit has pixel shader work to do. |
PSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has pixel shader work to do. |
PSTime | Nanoseconds | Time pixel shaders are busy in nanoseconds. |
CSBusy | Percentage | The percentage of time the ShaderUnit has compute shader work to do. |
CSBusyCycles | Cycles | Number of GPU cycles that the ShaderUnit has compute shader work to do. |
CSTime | Nanoseconds | Time compute shaders are busy in nanoseconds. |
PrimitiveAssemblyBusy | Percentage | The percentage of GPUTime that primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate pixel shader or output buffer bottleneck. |
PrimitiveAssemblyBusyCycles | Cycles | Number of GPU cycles the primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate pixel shader or output buffer bottleneck. |
TexUnitBusy | Percentage | The percentage of GPUTime the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account. |
TexUnitBusyCycles | Cycles | Number of GPU cycles the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account. |
DepthStencilTestBusy | Percentage | Percentage of time GPU spent performing depth and stencil tests relative to GPUBusy. |
DepthStencilTestBusyCycles | Cycles | Number of GPU cycles spent performing depth and stencil tests. |
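The timing counters above relate to each other in simple ways that can be checked after collection. The C sketch below is illustrative only: the function names are hypothetical, the values are taken as doubles in nanoseconds, and the second helper assumes that a per-stage time such as VSTime corresponds to that stage's busy percentage applied to GPUTime, which the table does not state explicitly.

```c
#include <assert.h>

/* ExecutionDuration as described above: from the time the command enters
   the top of the pipeline (ExecutionStart) to the time it reaches the
   bottom of the pipeline (ExecutionEnd). */
static double execution_duration_ns(double execution_start_ns,
                                    double execution_end_ns)
{
    assert(execution_end_ns >= execution_start_ns);
    return execution_end_ns - execution_start_ns;
}

/* Assumed relationship: stage time = busy percentage of GPUTime. */
static double stage_time_ns(double busy_pct, double gpu_time_ns)
{
    assert(busy_pct >= 0.0 && busy_pct <= 100.0);
    return gpu_time_ns * busy_pct / 100.0;
}
```

For instance, a command that starts at 1000 ns and ends at 1500 ns has a 500 ns duration, and a stage that is busy 25% of a 2000 ns GPUTime would account for 500 ns under the stated assumption.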
VertexShader Group¶
Counter Name | Usage | Brief Description |
---|---|---|
VSVerticesIn | Items | The number of vertices processed by the VS. |
VSVALUInstCount | Items | Average number of vector ALU instructions executed in the VS. Affected by flow control. |
VSSALUInstCount | Items | Average number of scalar ALU instructions executed in the VS. Affected by flow control. |
VSVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed by the VS. |
VSVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed by the VS. |
VSSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed by the VS. |
VSSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed by the VS. |
HullShader Group¶
Counter Name | Usage | Brief Description |
---|---|---|
HSPatches | Items | The number of patches processed by the HS. |
HSVALUInstCount | Items | Average number of vector ALU instructions executed in the HS. Affected by flow control. |
HSSALUInstCount | Items | Average number of scalar ALU instructions executed in the HS. Affected by flow control. |
HSVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed by the HS. |
HSVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed by the HS. |
HSSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed by the HS. |
HSSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed by the HS. |
DomainShader Group¶
Counter Name | Usage | Brief Description |
---|---|---|
DSVerticesIn | Items | The number of vertices processed by the DS. |
DSVALUInstCount | Items | Average number of vector ALU instructions executed in the DS. Affected by flow control. |
DSSALUInstCount | Items | Average number of scalar ALU instructions executed in the DS. Affected by flow control. |
DSVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed by the DS. |
DSVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed by the DS. |
DSSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed by the DS. |
DSSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed by the DS. |
GeometryShader Group¶
Counter Name | Usage | Brief Description |
---|---|---|
GSPrimsIn | Items | The number of primitives passed into the GS. |
GSVerticesOut | Items | The number of vertices output by the GS. |
GSVALUInstCount | Items | Average number of vector ALU instructions executed in the GS. Affected by flow control. |
GSSALUInstCount | Items | Average number of scalar ALU instructions executed in the GS. Affected by flow control. |
GSVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed by the GS. |
GSVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed by the GS. |
GSSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed by the GS. |
GSSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed by the GS. |
PrimitiveAssembly Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PrimitivesIn | Items | The number of primitives received by the hardware. This includes primitives generated by tessellation. |
CulledPrims | Items | The number of culled primitives. Typical reasons include scissor, the primitive having zero area, and back or front face culling. |
ClippedPrims | Items | The number of primitives that required one or more clipping operations due to intersecting the view volume or user clip planes. |
PAStalledOnRasterizer | Percentage | Percentage of GPUTime that primitive assembly waits for rasterization to be ready to accept data. This roughly indicates for what percentage of time the pipeline is bottlenecked by pixel operations. |
PAStalledOnRasterizerCycles | Cycles | Number of GPU cycles the primitive assembly waits for rasterization to be ready to accept data. Indicates the number of GPU cycles the pipeline is bottlenecked by pixel operations. |
PixelShader Group¶
Counter Name | Usage | Brief Description |
---|---|---|
PSPixelsOut | Items | Pixels exported from shader to color buffers. Does not include killed or alpha tested pixels; if there are multiple render targets, each render target receives one export, so this will be 2 for 1 pixel written to two RTs. |
PSExportStalls | Percentage | Pixel shader output stalls. Percentage of GPUBusy. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer. |
PSExportStallsCycles | Cycles | Number of GPU cycles the pixel shader output stalls. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer. |
PSVALUInstCount | Items | Average number of vector ALU instructions executed in the PS. Affected by flow control. |
PSSALUInstCount | Items | Average number of scalar ALU instructions executed in the PS. Affected by flow control. |
PSVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are being processed by the PS. |
PSVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are being processed by the PS. |
PSSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are being processed by the PS. |
PSSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are being processed by the PS. |
ComputeShader Group¶
Counter Name | Usage | Brief Description |
---|---|---|
CSThreadGroups | Items | Total number of thread groups. |
CSWavefronts | Items | The total number of wavefronts used for the CS. |
CSThreads | Items | The number of CS threads processed by the hardware. |
CSThreadGroupSize | Items | The number of CS threads within each thread group. |
CSVALUInsts | Items | The average number of vector ALU instructions executed per work-item (affected by flow control). |
CSVALUUtilization | Percentage | The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence). |
CSSALUInsts | Items | The average number of scalar ALU instructions executed per work-item (affected by flow control). |
CSVFetchInsts | Items | The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). |
CSSFetchInsts | Items | The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control). |
CSVWriteInsts | Items | The average number of vector write instructions to the video memory executed per work-item (affected by flow control). |
CSFlatVMemInsts | Items | The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch. |
CSVALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
CSVALUBusyCycles | Cycles | Number of GPU cycles where vector ALU instructions are processed. |
CSSALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
CSSALUBusyCycles | Cycles | Number of GPU cycles where scalar ALU instructions are processed. |
CSMemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
CSMemUnitBusyCycles | Cycles | Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. |
CSMemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
CSMemUnitStalledCycles | Cycles | Number of GPU cycles the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. |
CSWriteUnitStalled | Percentage | The percentage of GPUTime the write unit is stalled. |
CSWriteUnitStalledCycles | Cycles | Number of GPU cycles the write unit is stalled. |
CSGDSInsts | Items | The average number of GDS read or GDS write instructions executed per work item (affected by flow control). |
CSLDSInsts | Items | The average number of LDS read/write instructions executed per work-item (affected by flow control). |
CSFlatLDSInsts | Items | The average number of FLAT instructions that read from or write to LDS executed per work item (affected by flow control). |
CSALUStalledByLDS | Percentage | The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad). |
CSALUStalledByLDSCycles | Cycles | Number of GPU cycles the ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. |
CSLDSBankConflict | Percentage | The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
CSLDSBankConflictCycles | Cycles | Number of GPU cycles the LDS is stalled by bank conflicts. Value range: 0 (optimal) to GPUBusyCycles (bad). |
TextureUnit Group¶
Counter Name | Usage | Brief Description |
---|---|---|
TexTriFilteringPct | Percentage | Percentage of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified). |
TexTriFilteringCount | Items | Count of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified). |
NoTexTriFilteringCount | Items | Count of pixels that did not receive trilinear filtering. |
TexVolFilteringPct | Percentage | Percentage of pixels that received volume filtering. |
TexVolFilteringCount | Items | Count of pixels that received volume filtering. |
NoTexVolFilteringCount | Items | Count of pixels that did not receive volume filtering. |
TexAveAnisotropy | Items | The average degree of anisotropy applied. A number between 1 and 16. The anisotropic filtering algorithm only applies samples where they are required (e.g. there will be no extra anisotropic samples if the view vector is perpendicular to the surface) so this can be much lower than the requested anisotropy. |
DepthAndStencil Group¶
Counter Name | Usage | Brief Description |
---|---|---|
HiZTilesAccepted | Percentage | Percentage of tiles accepted by HiZ that will be rendered to the depth or color buffers. |
HiZTilesAcceptedCount | Items | Count of tiles accepted by HiZ that will be rendered to the depth or color buffers. |
HiZTilesRejectedCount | Items | Count of tiles not accepted by HiZ. |
PreZTilesDetailCulled | Percentage | Percentage of tiles rejected because the associated primitive had no contributing area. |
PreZTilesDetailCulledCount | Items | Count of tiles rejected because the associated primitive had no contributing area. |
PreZTilesDetailSurvivingCount | Items | Count of tiles surviving because the associated primitive had contributing area. |
HiZQuadsCulled | Percentage | Percentage of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized. |
HiZQuadsCulledCount | Items | Count of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized. |
HiZQuadsAcceptedCount | Items | Count of quads that did continue on in the pipeline after HiZ. |
PreZQuadsCulled | Percentage | Percentage of quads rejected based on the detailZ and earlyZ tests. |
PreZQuadsCulledCount | Items | Count of quads rejected based on the detailZ and earlyZ tests. |
PreZQuadsSurvivingCount | Items | Count of quads surviving detailZ and earlyZ tests. |
PostZQuads | Percentage | Percentage of quads for which the pixel shader will run and may be postZ tested. |
PostZQuadCount | Items | Count of quads for which the pixel shader will run and may be postZ tested. |
PreZSamplesPassing | Items | Number of samples tested for Z before shading and passed. |
PreZSamplesFailingS | Items | Number of samples tested for Z before shading and failed stencil test. |
PreZSamplesFailingZ | Items | Number of samples tested for Z before shading and failed Z test. |
PostZSamplesPassing | Items | Number of samples tested for Z after shading and passed. |
PostZSamplesFailingS | Items | Number of samples tested for Z after shading and failed stencil test. |
PostZSamplesFailingZ | Items | Number of samples tested for Z after shading and failed Z test. |
ZUnitStalled | Percentage | The percentage of GPUTime the depth buffer spends waiting for the color buffer to be ready to accept data. High figures here indicate a bottleneck in color buffer operations. |
ZUnitStalledCycles | Cycles | Number of GPU cycles the depth buffer spends waiting for the color buffer to be ready to accept data. Larger numbers indicate a bottleneck in color buffer operations. |
DBMemRead | Bytes | Number of bytes read from the depth buffer. |
DBMemWritten | Bytes | Number of bytes written to the depth buffer. |
ColorBuffer Group¶
Counter Name | Usage | Brief Description |
---|---|---|
CBMemRead | Bytes | Number of bytes read from the color buffer. |
CBColorAndMaskRead | Bytes | Total number of bytes read from the color and mask buffers. |
CBMemWritten | Bytes | Number of bytes written to the color buffer. |
CBColorAndMaskWritten | Bytes | Total number of bytes written to the color and mask buffers. |
CBSlowPixelPct | Percentage | Percentage of pixels written to the color buffer using a half-rate or quarter-rate format. |
CBSlowPixelCount | Items | Number of pixels written to the color buffer using a half-rate or quarter-rate format. |
MemoryCache Group¶
Counter Name | Usage | Brief Description |
---|---|---|
L0TagConflictReadStalledCycles | Items | The number of cycles read operations from the L0 cache are stalled due to tag conflicts. |
L0TagConflictWriteStalledCycles | Items | The number of cycles write operations to the L0 cache are stalled due to tag conflicts. |
L0TagConflictAtomicStalledCycles | Items | The number of cycles atomic operations on the L0 cache are stalled due to tag conflicts. |
GlobalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
FetchSize | Bytes | The total bytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
WriteSize | Bytes | The total bytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data cache. Value range: 0% (no hit) to 100% (optimal). |
CacheMiss | Percentage | The percentage of fetch, write, atomic, and other instructions that miss the data cache. Value range: 0% (optimal) to 100% (all miss). |
CacheHitCount | Items | Count of fetch, write, atomic, and other instructions that hit the data cache. |
CacheMissCount | Items | Count of fetch, write, atomic, and other instructions that miss the data cache. |
MemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
MemUnitBusyCycles | Cycles | Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalledCycles). This is measured with all extra fetches and writes and any cache or memory effects taken into account. |
MemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
MemUnitStalledCycles | Cycles | Number of GPU cycles the memory unit is stalled. |
WriteUnitStalled | Percentage | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
WriteUnitStalledCycles | Cycles | Number of GPU cycles the Write unit is stalled. |
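Several percentage counters in these tables have a matching pair of count counters (for example CacheHit with CacheHitCount and CacheMissCount). As a rough sketch of how such a pair relates, and assuming the percentage is simply hits divided by hits plus misses (the exact derivation GPA uses is not stated in this section), the hypothetical helper below converts two counts into a percentage:

```cpp
#include <cstdint>

// Hypothetical helper, not part of GPA: percentage that `part` represents
// of `part + rest`. Returns 0.0 when both counts are zero so that an idle
// sample does not cause a division by zero.
double PercentOf(std::uint64_t part, std::uint64_t rest) {
    const std::uint64_t total = part + rest;
    if (total == 0) {
        return 0.0;
    }
    return 100.0 * static_cast<double>(part) / static_cast<double>(total);
}
```

For example, 75 hits and 25 misses give a hit percentage of 75 under this assumption.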
Counters Exposed for Compute Performance Analysis¶
The following tables show the set of counters exposed for analysis of GPU Compute workloads, as well as the family of GPUs and APUs on which each counter is available:
RDNA3 Counters¶
General Group¶
Counter Name | Usage | Brief Description |
---|---|---|
Wavefronts | Items | Total wavefronts. |
VALUInsts | Items | The average number of vector ALU instructions executed per work-item (affected by flow control). |
SALUInsts | Items | The average number of scalar ALU instructions executed per work-item (affected by flow control). |
VFetchInsts | Items | The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory. |
SFetchInsts | Items | The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control). |
VWriteInsts | Items | The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory. |
GDSInsts | Items | The average number of GDS read or GDS write instructions executed per work item (affected by flow control). |
VALUUtilization | Percentage | The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of the wave size. Value range: 0% (bad), 100% (ideal - no thread divergence). |
VALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
SALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
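The VALUUtilization description notes that a work-group size that is not a multiple of the wave size lowers the value. The sketch below illustrates only that effect: it computes the best-case percentage of active lanes when a work-group is split into whole waves. The function name is hypothetical, the wave size is supplied by the caller (RDNA hardware can run waves of 32 or 64 threads), and thread divergence would reduce a measured value further.

```cpp
#include <cstdint>

// Best-case percentage of active lanes when `work_group_size` work-items are
// packed into waves of `wave_size` lanes. The last wave is only partially
// filled unless work_group_size is a multiple of wave_size.
double BestCaseValuUtilization(std::uint32_t work_group_size,
                               std::uint32_t wave_size) {
    if (work_group_size == 0 || wave_size == 0) {
        return 0.0;
    }
    // Round up: a partially filled wave still occupies a whole wave.
    const std::uint32_t waves = (work_group_size + wave_size - 1) / wave_size;
    return 100.0 * work_group_size / (static_cast<double>(waves) * wave_size);
}
```

A work-group of 96 work-items with a wave size of 64 occupies two waves (128 lanes), so at most 75% of the lanes can be active.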
LocalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
LDSInsts | Items | The average number of LDS read or LDS write instructions executed per work item (affected by flow control). |
LDSBankConflict | Percentage | The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
GlobalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
FetchSize | Kilobytes | The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
WriteSize | Kilobytes | The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
L0CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data in L0 cache. Value range: 0% (no hit) to 100% (optimal). |
L1CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data in L1 cache. Writes and atomics always miss this cache. Value range: 0% (no hit) to 100% (optimal). |
L2CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal). |
MemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
MemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
WriteUnitStalled | Percentage | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
RDNA2 Counters¶
General Group¶
Counter Name | Usage | Brief Description |
---|---|---|
Wavefronts | Items | Total wavefronts. |
VALUInsts | Items | The average number of vector ALU instructions executed per work-item (affected by flow control). |
SALUInsts | Items | The average number of scalar ALU instructions executed per work-item (affected by flow control). |
VFetchInsts | Items | The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory. |
SFetchInsts | Items | The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control). |
VWriteInsts | Items | The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory. |
GDSInsts | Items | The average number of GDS read or GDS write instructions executed per work item (affected by flow control). |
VALUUtilization | Percentage | The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of the wave size. Value range: 0% (bad), 100% (ideal - no thread divergence). |
VALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
SALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
LocalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
LDSInsts | Items | The average number of LDS read or LDS write instructions executed per work item (affected by flow control). |
LDSBankConflict | Percentage | The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
GlobalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
FetchSize | Kilobytes | The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
WriteSize | Kilobytes | The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
L0CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data in L0 cache. Value range: 0% (no hit) to 100% (optimal). |
L1CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data in L1 cache. Writes and atomics always miss this cache. Value range: 0% (no hit) to 100% (optimal). |
L2CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal). |
MemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
MemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
WriteUnitStalled | Percentage | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
RDNA Counters¶
General Group¶
Counter Name | Usage | Brief Description |
---|---|---|
Wavefronts | Items | Total wavefronts. |
VALUInsts | Items | The average number of vector ALU instructions executed per work-item (affected by flow control). |
SALUInsts | Items | The average number of scalar ALU instructions executed per work-item (affected by flow control). |
VFetchInsts | Items | The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory. |
SFetchInsts | Items | The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control). |
VWriteInsts | Items | The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory. |
GDSInsts | Items | The average number of GDS read or GDS write instructions executed per work item (affected by flow control). |
VALUUtilization | Percentage | The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of the wave size. Value range: 0% (bad), 100% (ideal - no thread divergence). |
VALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
SALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
LocalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
LDSInsts | Items | The average number of LDS read or LDS write instructions executed per work item (affected by flow control). |
LDSBankConflict | Percentage | The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
GlobalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
FetchSize | Kilobytes | The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
WriteSize | Kilobytes | The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
L0CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data in L0 cache. Value range: 0% (no hit) to 100% (optimal). |
L1CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data in L1 cache. Writes and atomics always miss this cache. Value range: 0% (no hit) to 100% (optimal). |
L2CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal). |
MemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
MemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
WriteUnitStalled | Percentage | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
Vega Counters¶
General Group¶
Counter Name | Usage | Brief Description |
---|---|---|
Wavefronts | Items | Total wavefronts. |
VALUInsts | Items | The average number of vector ALU instructions executed per work-item (affected by flow control). |
SALUInsts | Items | The average number of scalar ALU instructions executed per work-item (affected by flow control). |
VFetchInsts | Items | The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory. |
SFetchInsts | Items | The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control). |
VWriteInsts | Items | The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory. |
FlatVMemInsts | Items | The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch. |
GDSInsts | Items | The average number of GDS read or GDS write instructions executed per work item (affected by flow control). |
VALUUtilization | Percentage | The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence). |
VALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
SALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
LocalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
LDSInsts | Items | The average number of LDS read or LDS write instructions executed per work item (affected by flow control). Excludes FLAT instructions that read from or write to LDS. |
FlatLDSInsts | Items | The average number of FLAT instructions that read from or write to LDS executed per work item (affected by flow control). |
LDSBankConflict | Percentage | The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
GlobalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
FetchSize | Kilobytes | The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
WriteSize | Kilobytes | The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
L1CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data in L1 cache. Value range: 0% (no hit) to 100% (optimal). |
L2CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal). |
MemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
MemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
WriteUnitStalled | Percentage | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
Graphics IP v8 Counters¶
General Group¶
Counter Name | Usage | Brief Description |
---|---|---|
Wavefronts | Items | Total wavefronts. |
VALUInsts | Items | The average number of vector ALU instructions executed per work-item (affected by flow control). |
SALUInsts | Items | The average number of scalar ALU instructions executed per work-item (affected by flow control). |
VFetchInsts | Items | The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory. |
SFetchInsts | Items | The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control). |
VWriteInsts | Items | The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory. |
FlatVMemInsts | Items | The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch. |
GDSInsts | Items | The average number of GDS read or GDS write instructions executed per work item (affected by flow control). |
VALUUtilization | Percentage | The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence). |
VALUBusy | Percentage | The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
SALUBusy | Percentage | The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
LocalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
LDSInsts | Items | The average number of LDS read or LDS write instructions executed per work item (affected by flow control). Excludes FLAT instructions that read from or write to LDS. |
FlatLDSInsts | Items | The average number of FLAT instructions that read from or write to LDS executed per work item (affected by flow control). |
LDSBankConflict | Percentage | The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
GlobalMemory Group¶
Counter Name | Usage | Brief Description |
---|---|---|
FetchSize | Kilobytes | The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
WriteSize | Kilobytes | The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
CacheHit | Percentage | The percentage of fetch, write, atomic, and other instructions that hit the data cache. Value range: 0% (no hit) to 100% (optimal). |
MemUnitBusy | Percentage | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
MemUnitStalled | Percentage | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
WriteUnitStalled | Percentage | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
API Reference¶
Functions¶
GpaBeginCommandList¶
Syntax¶
GpaStatus GpaBeginCommandList(
GpaSessionId session_id,
GpaUInt32 pass_index,
void* command_list,
GpaCommandListType command_list_type,
GpaCommandListId* command_list_id);
Description¶
Begins a command list for sampling. You will be unable to create samples on a command list or command buffer before GpaBeginCommandList is called. The session must have been previously created and started before starting a command list. For multi-pass counter collection, you must call this function for each command list once per pass.
Parameters¶
Name | Description |
---|---|
session_id | Unique identifier of a previously-created session. |
pass_index | The zero-based index of the pass. |
command_list | API-specific command list on which to begin sampling. For DirectX 12, this should be an ID3D12GraphicsCommandList. For Vulkan, this should be a VkCommandBuffer. For all other APIs, this should be kGpaCommandListNone. |
command_list_type | The type of the command_list parameter. For DirectX 12 and Vulkan, this should be either kGpaCommandListPrimary or kGpaCommandListSecondary. Secondary command lists are either bundles (DirectX 12) or secondary command buffers (Vulkan). For all other APIs, this should be kGpaCommandListNone. |
command_list_id | On successful execution of this function, this parameter will be set to a GPA-generated unique command list identifier. This value can subsequently be passed to any GPA function taking a GpaCommandListId as an input parameter. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The command list was successfully started. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. The supplied command_list parameter is NULL and command_list_type is not kGpaCommandListNone. The supplied command_list_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorInvalidParameter | The command_list_type parameter has an invalid value. The supplied command_list parameter is not NULL and the command_list_type parameter is kGpaCommandListNone. |
kGpaStatusErrorSessionNotStarted | The supplied GPA Session object has not yet been started. Call GpaBeginSession before GpaBeginCommandList. |
kGpaStatusErrorCommandListAlreadyStarted | The supplied command list has already been started. |
kGpaStatusErrorFailed | The command list could not be started. |
kGpaStatusErrorException | Exception occurred. |
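The pointer-related rows of the table above can be read as a small decision procedure. The following is a minimal sketch of that reading, not GPA's implementation: the enums are local stand-ins so the code compiles without the GPA headers, CheckBeginCommandListArgs is a hypothetical name, and the order in which the checks run is an assumption.

```cpp
namespace sketch {

// Stand-ins for the GPA enums, declared locally so the sketch compiles
// without the GPA headers. Only the values used below are listed, and their
// numeric values are not meant to match the real headers.
enum GpaCommandListType {
    kGpaCommandListNone,
    kGpaCommandListPrimary,
    kGpaCommandListSecondary
};

enum GpaStatus {
    kGpaStatusOk,
    kGpaStatusErrorNullPointer,
    kGpaStatusErrorInvalidParameter
};

// Hypothetical pre-flight check mirroring the documented pointer rules.
GpaStatus CheckBeginCommandListArgs(const void* command_list,
                                    GpaCommandListType command_list_type,
                                    const void* command_list_id) {
    if (command_list_id == nullptr) {
        return kGpaStatusErrorNullPointer;  // nowhere to store the new id
    }
    if (command_list == nullptr && command_list_type != kGpaCommandListNone) {
        return kGpaStatusErrorNullPointer;  // a typed list needs a real object
    }
    if (command_list != nullptr && command_list_type == kGpaCommandListNone) {
        return kGpaStatusErrorInvalidParameter;  // a real object needs a type
    }
    return kGpaStatusOk;
}

}  // namespace sketch
```

A check of this kind can be handy in debug builds to catch a mismatched command_list and command_list_type pair before the call reaches the library.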
GpaBeginSample¶
Syntax¶
GpaStatus GpaBeginSample(
GpaUInt32 sample_id,
GpaCommandListId command_list_id);
Description¶
Begins a sample in a command list. A sample is a particular workload for which counters will be collected. If the owning session was created with kGpaSessionSampleTypeDiscreteCounter and one or more counters have been enabled, then those counters will be collected for this sample. Each sample must be associated with a GPA command list, and the command list must have been started before a sample is started on it. Samples can be created by multiple threads, provided no two threads are creating samples on the same command list. You must provide a unique id for every new sample. When performing multiple passes, every sample id must exist in every pass. You may create as many samples as needed. However, nesting of samples is not allowed: each sample must be wrapped in a GpaBeginSample/GpaEndSample pair before another one is started. A sample can be started in one primary command list and continued/ended on another primary command list; see GpaContinueSampleOnCommandList.
Parameters¶
Name | Description |
---|---|
sample_id | A unique sample identifier. |
command_list_id | Unique identifier of a previously-created command list. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The sample was successfully started. |
kGpaStatusErrorNullPointer | The supplied command_list_id parameter is NULL. |
kGpaStatusErrorCommandListNotFound | The supplied command_list_id parameter was not recognized as a previously-created command list identifier. |
kGpaStatusErrorIndexOutOfRange | The specified command list’s pass index is out of range. |
kGpaStatusErrorFailed | The sample could not be started. |
kGpaStatusErrorException | Exception occurred. |
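Two of the rules in the description, unique sample ids and no nesting, are easy to violate by accident. The class below is a small bookkeeping sketch that an application could keep next to its GpaBeginSample/GpaEndSample calls. It is hypothetical and not part of GPA, and it assumes one tracker covers one command list in one pass; an application sampling on several command lists would also need to keep ids unique across them.

```cpp
#include <cstdint>
#include <set>

// Hypothetical bookkeeping helper; it does not call into GPA.
class SampleTracker {
public:
    // Returns true if a sample with this id may be started now.
    bool Begin(std::uint32_t sample_id) {
        if (open_) {
            return false;  // nesting of samples is not allowed
        }
        if (!used_ids_.insert(sample_id).second) {
            return false;  // this id was already used for an earlier sample
        }
        open_ = true;
        return true;
    }

    // Returns true if there was an open sample to end.
    bool End() {
        if (!open_) {
            return false;
        }
        open_ = false;
        return true;
    }

private:
    std::set<std::uint32_t> used_ids_;  // ids seen so far on this tracker
    bool open_ = false;                 // true between Begin and End
};
```

The tracker rejects a second Begin before End (nesting) and a Begin that reuses an id, which matches the two constraints stated above.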
GpaBeginSession¶
Syntax¶
GpaStatus GpaBeginSession(
GpaSessionId session_id);
Description¶
Begins sampling with the currently enabled set of counters. A session must have been created using GpaCreateSession before it can be started. A session must be started before creating any samples. The set of enabled counters for a session cannot be changed after the session has started.
Parameters¶
Name | Description |
---|---|
session_id | Unique identifier of a previously-created session. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The session was successfully started. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorSessionAlreadyStarted | The session has already been started. |
kGpaStatusErrorNoCountersEnabled | There are no counters enabled. |
kGpaStatusErrorFailed | The session could not be started. |
kGpaStatusErrorOtherSessionActive | Another session is active. |
kGpaStatusErrorException | Exception occurred. |
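The ordering constraints described above (counters are enabled first, the set is frozen once the session starts, and a session cannot be started twice) can be pictured as a small state model. The sketch below is only an illustration of those rules: SessionModel is a hypothetical class, the status enum is a local stand-in for the GPA values named in the table, and the boolean result of EnableCounter is an assumption rather than the real return type of any GPA function.

```cpp
#include <cstdint>
#include <set>

namespace sketch {

// Local stand-ins for the GPA status values used in this sketch; the
// numeric values are not meant to match the real headers.
enum GpaStatus {
    kGpaStatusOk,
    kGpaStatusErrorSessionAlreadyStarted,
    kGpaStatusErrorNoCountersEnabled
};

// Hypothetical model of a session's state; not GPA's implementation.
class SessionModel {
public:
    // Returns false once the session has started: the enabled set is frozen.
    bool EnableCounter(std::uint32_t counter_index) {
        if (started_) {
            return false;
        }
        enabled_.insert(counter_index);
        return true;
    }

    GpaStatus Begin() {
        if (started_) {
            return kGpaStatusErrorSessionAlreadyStarted;
        }
        if (enabled_.empty()) {
            return kGpaStatusErrorNoCountersEnabled;
        }
        started_ = true;
        return kGpaStatusOk;
    }

private:
    std::set<std::uint32_t> enabled_;  // indices of enabled counters
    bool started_ = false;             // true after a successful Begin
};

}  // namespace sketch
```

In this model, starting with no counters enabled fails, starting after enabling at least one counter succeeds, and any later attempt to enable a counter or start again is rejected.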
GpaCloseContext¶
Syntax¶
GpaStatus GpaCloseContext(
GpaContextId context_id);
Description¶
Closes the specified context, which ends access to GPU performance counters. After closing a context, GPA functions should not be called again until the counters are reopened with GpaOpenContext.
Parameters¶
Name | Description |
---|---|
context_id | Unique identifier of a previously-opened context. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The context was successfully closed. |
kGpaStatusErrorNullPointer | The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorInvalidParameter | The API type of the supplied context does not match GPA’s API type. |
kGpaStatusErrorFailed | The context could not be closed. |
kGpaStatusErrorException | Exception occurred. |
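Because a context should be closed on every exit path, C++ applications may find it convenient to tie the close call to scope exit. The guard below is one possible sketch of that idea. ScopedClose is a hypothetical helper, not part of GPA, and it takes a generic callable so that it compiles without the GPA headers; in an application the callable would typically wrap a call to GpaCloseContext with the context_id that was opened.

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII guard: runs `close` once when the guard is destroyed.
class ScopedClose {
public:
    explicit ScopedClose(std::function<void()> close)
        : close_(std::move(close)) {}

    ~ScopedClose() {
        if (close_) {
            close_();  // e.g. a lambda that calls GpaCloseContext
        }
    }

    // Non-copyable so the close action cannot run twice.
    ScopedClose(const ScopedClose&) = delete;
    ScopedClose& operator=(const ScopedClose&) = delete;

private:
    std::function<void()> close_;
};
```

The guard should be declared after the context has been opened successfully, so that the close action only runs for a context that actually exists.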
GpaContinueSampleOnCommandList¶
Syntax¶
GpaStatus GpaContinueSampleOnCommandList(
GpaUInt32 src_sample_id,
GpaCommandListId primary_command_list_id);
Description¶
Continues a sample from one primary command list on to another primary command list. This function is only supported for DirectX 12 and Vulkan. Normally samples must be started and ended on the same command list. Using this function, samples can be started on one primary command list and continued/ended on another primary command list, allowing a single sample to span more than one command list.
Parameters¶
Name | Description |
---|---|
src_sample_id | The sample id of the sample being continued on a different command list. |
primary_command_list_id | Unique identifier of a previously-created primary command list on which the sample is continuing. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The sample was successfully continued on the specified command list. |
kGpaStatusErrorNullPointer | The supplied primary_command_list_id parameter is NULL. |
kGpaStatusErrorCommandListNotFound | The supplied primary_command_list_id parameter was not recognized as a previously-created command list identifier. |
kGpaStatusErrorSampleNotFound | The specified sample was not found. |
kGpaStatusErrorApiNotSupported | This function is not supported for the current API. Only DirectX 12 and Vulkan support this API. |
kGpaStatusErrorFailed | The sample could not be continued on the specified command list. |
kGpaStatusErrorException | Exception occurred. |
GpaCopySecondarySamples¶
Syntax¶
GpaStatus GpaCopySecondarySamples(
GpaCommandListId secondary_command_list_id,
GpaCommandListId primary_command_list_id,
GpaUInt32 num_samples,
GpaUInt32* new_sample_ids);
Description¶
Copies a set of samples from a secondary command list back to the primary command list that executed the secondary command list. This function is only supported for DirectX 12 and Vulkan. You cannot collect data for samples created on secondary command lists unless they are first copied to a new set of samples on the primary command list.
Parameters¶
Name | Description |
---|---|
secondary_command_list_id | Unique identifier of a previously-created command list. This represents the secondary command list that acts as the source of the samples being copied. |
primary_command_list_id | Unique identifier of a previously-created command list. This represents the primary command list that acts as the destination of the samples being copied. |
num_samples | The number of samples to copy. |
new_sample_ids | An array of the sample ids which should be copied from the secondary command list to the primary command list. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The samples were successfully copied. |
kGpaStatusErrorNullPointer | The supplied secondary_command_list_id parameter is NULL. The supplied primary_command_list_id parameter is NULL. |
kGpaStatusErrorCommandListNotFound | The supplied secondary_command_list_id parameter was not recognized as a previously-created command list identifier. The supplied primary_command_list_id parameter was not recognized as a previously-created command list identifier. |
kGpaStatusErrorApiNotSupported | This function is not supported for the current API. Only DirectX 12 and Vulkan support this API. |
kGpaStatusErrorFailed | The samples could not be copied. |
kGpaStatusErrorException | Exception occurred. |
GpaCreateSession¶
Syntax¶
GpaStatus GpaCreateSession(
GpaContextId context_id,
GpaSessionSampleType sample_type,
GpaSessionId* session_id);
Description¶
Creates a session on the specified context. A unique session identifier is returned; it is used to enable counters, to measure samples, and to access the stored results of the profile. The caller specifies the sample type for the session, and the requested sample type must be supported by the supplied context. Use GpaGetSupportedSampleTypes to determine which sample types are supported by a context.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
sample_type |
The sample type which will be created for this session. |
session_id |
On successful execution of this function, this parameter will be set to a GPA-generated unique session identifier. This value can subsequently be passed to any GPA function taking a GpaSessionId as an input parameter. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The session was successfully created. |
kGpaStatusErrorNullPointer | The supplied context_id parameter is NULL. The supplied session_id parameter is NULL. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorInvalidParameter | The sample_type parameter has an invalid value. |
kGpaStatusErrorIncompatibleSampleTypes | The sample_type is incompatible with the context’s supported sample type. |
kGpaStatusErrorFailed | The session could not be created. |
kGpaStatusErrorException | Exception occurred. |
GpaDeleteSession¶
Syntax¶
GpaStatus GpaDeleteSession(
GpaSessionId session_id);
Description¶
Deletes the specified session object, along with all counter results associated with the session.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The session was successfully deleted. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorFailed | The session could not be deleted. |
kGpaStatusErrorException | Exception occurred. |
GpaDestroy¶
Syntax¶
GpaStatus GpaDestroy();
Description¶
Undoes any initialization to ensure proper behavior in applications that are not being profiled. This function must be called before the rendering context or device is released / destroyed.
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | GPA was destroyed. |
kGpaStatusErrorGpaNotInitialized | GpaInitialize was never called. |
kGpaStatusErrorException | Exception occurred. |
GpaDisableAllCounters¶
Syntax¶
GpaStatus GpaDisableAllCounters(
GpaSessionId session_id);
Description¶
Disables all counters. Subsequent sampling sessions will not provide values for any disabled counters. Initially all counters are disabled, and must explicitly be enabled.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | All counters were successfully disabled. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorCannotChangeCountersWhenSampling | The set of enabled counters cannot be changed after GpaBeginSession is called. |
kGpaStatusErrorContextNotOpen | The supplied session’s parent context is not currently open. |
kGpaStatusErrorFailed | An error occurred while trying to disable all counters. |
kGpaStatusErrorException | Exception occurred. |
GpaDisableCounter¶
Syntax¶
GpaStatus GpaDisableCounter(
GpaSessionId session_id,
GpaUInt32 index);
Description¶
Disables the specified counter. Subsequent sampling sessions will not provide values for any disabled counters. Initially all counters are disabled, and must explicitly be enabled.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
index |
The index of the counter to disable. Must lie between 0 and (GpaGetNumCounters result - 1). |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The counter was successfully disabled. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorCannotChangeCountersWhenSampling | The set of enabled counters cannot be changed after GpaBeginSession is called. |
kGpaStatusErrorIndexOutOfRange | The specified index is out of range. |
kGpaStatusErrorNotEnabled | The specified counter is not currently enabled. |
kGpaStatusErrorFailed | An error occurred while trying to disable the counter. |
kGpaStatusErrorException | Exception occurred. |
GpaDisableCounterByName¶
Syntax¶
GpaStatus GpaDisableCounterByName(
GpaSessionId session_id,
const char* counter_name);
Description¶
Disables the counter with the specified counter name (case insensitive). Subsequent sampling sessions will not provide values for any disabled counters. Initially all counters are disabled, and must explicitly be enabled.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
counter_name |
The name of the counter to disable. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The specified counter was successfully disabled. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorCannotChangeCountersWhenSampling | The set of enabled counters cannot be changed after GpaBeginSession is called. |
kGpaStatusErrorContextNotOpen | The supplied session’s parent context is not currently open. |
kGpaStatusErrorNotEnabled | The specified counter is not currently enabled. |
kGpaStatusErrorCounterNotFound | The specified counter name is not valid. |
kGpaStatusErrorFailed | An error occurred while trying to disable the counter. |
kGpaStatusErrorException | Exception occurred. |
GpaEnableAllCounters¶
Syntax¶
GpaStatus GpaEnableAllCounters(
GpaSessionId session_id);
Description¶
Enables all counters. Subsequent sampling sessions will provide values for all counters. Initially all counters are disabled, and must explicitly be enabled.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | All counters were successfully enabled. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorCannotChangeCountersWhenSampling | The set of enabled counters cannot be changed after GpaBeginSession is called. |
kGpaStatusErrorContextNotOpen | The supplied session’s parent context is not currently open. |
kGpaStatusErrorIncompatibleSampleTypes | The supplied session was not created with a GpaSessionSampleType value that supports counter collection. |
kGpaStatusErrorFailed | An error occurred while trying to enable all counters. |
kGpaStatusErrorException | Exception occurred. |
GpaEnableCounter¶
Syntax¶
GpaStatus GpaEnableCounter(
GpaSessionId session_id,
GpaUInt32 index);
Description¶
Enables the specified counter. Subsequent sampling sessions will provide values for any enabled counters. Initially all counters are disabled, and must explicitly be enabled.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
index |
The index of the counter to enable. Must lie between 0 and (GpaGetNumCounters result - 1). |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The counter was successfully enabled. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorCannotChangeCountersWhenSampling | The set of enabled counters cannot be changed after GpaBeginSession is called. |
kGpaStatusErrorIndexOutOfRange | The specified index is out of range. |
kGpaStatusErrorAlreadyEnabled | The specified counter was already enabled. |
kGpaStatusErrorIncompatibleSampleTypes | The supplied session was not created with a GpaSessionSampleType value that supports counter collection. |
kGpaStatusErrorFailed | An error occurred while trying to enable the counter. |
kGpaStatusErrorException | Exception occurred. |
GpaEnableCounterByName¶
Syntax¶
GpaStatus GpaEnableCounterByName(
GpaSessionId session_id,
const char* counter_name);
Description¶
Enables the counter with the specified counter name (case insensitive). Subsequent sampling sessions will provide values for any enabled counters. Initially all counters are disabled, and must explicitly be enabled.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
counter_name |
The name of the counter to enable. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The specified counter was successfully enabled. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorCannotChangeCountersWhenSampling | The set of enabled counters cannot be changed after GpaBeginSession is called. |
kGpaStatusErrorContextNotOpen | The supplied session’s parent context is not currently open. |
kGpaStatusErrorAlreadyEnabled | The specified counter was already enabled. |
kGpaStatusErrorIncompatibleSampleTypes | The supplied session was not created with a GpaSessionSampleType value that supports counter collection. |
kGpaStatusErrorCounterNotFound | The specified counter name is not valid. |
kGpaStatusErrorFailed | An error occurred while trying to enable the counter. |
kGpaStatusErrorException | Exception occurred. |
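The enable calls described above are typically issued as a batch before sampling begins. The following is a minimal sketch of that pattern, not code taken from the GPA samples. `FakeApi`, `EnableCounters`, the counter names, and the numeric status values are hypothetical stand-ins so that the block compiles without the GPA headers; in a real application the types and status codes come from gpu_perf_api.h, and `api` is assumed to be an object (such as the function table) exposing a member named GpaEnableCounterByName. The sketch treats kGpaStatusErrorAlreadyEnabled as harmless and collects the names that could not be enabled.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Stand-ins for declarations that normally come from gpu_perf_api.h.
// The numeric values are placeholders chosen for this sketch.
enum GpaStatus
{
    kGpaStatusOk                   = 0,
    kGpaStatusErrorAlreadyEnabled  = -1,
    kGpaStatusErrorCounterNotFound = -2
};
using GpaSessionId = void*;

// Hypothetical fake that mimics the documented behaviour of
// GpaEnableCounterByName: names are case insensitive, all counters start
// disabled, and enabling a counter twice reports "already enabled".
struct FakeApi
{
    std::set<std::string> known{"gputime", "gpubusy"};
    std::set<std::string> enabled;

    static std::string Lower(std::string s)
    {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        return s;
    }

    GpaStatus GpaEnableCounterByName(GpaSessionId, const char* counter_name)
    {
        const std::string key = Lower(counter_name);
        if (known.count(key) == 0) return kGpaStatusErrorCounterNotFound;
        if (!enabled.insert(key).second) return kGpaStatusErrorAlreadyEnabled;
        return kGpaStatusOk;
    }
};

// Enables each requested counter and returns the names that failed.
// An "already enabled" result is treated as success.
template <typename Api>
std::vector<std::string> EnableCounters(Api& api, GpaSessionId session,
                                        const std::vector<std::string>& names)
{
    std::vector<std::string> failed;
    for (const std::string& name : names)
    {
        const GpaStatus status = api.GpaEnableCounterByName(session, name.c_str());
        if (status != kGpaStatusOk && status != kGpaStatusErrorAlreadyEnabled)
        {
            failed.push_back(name);
        }
    }
    return failed;
}
```

Because the set of enabled counters cannot be changed after GpaBeginSession is called, a helper like this would be run before the session is started.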
GpaEndCommandList¶
Syntax¶
GpaStatus GpaEndCommandList(
GpaCommandListId command_list_id);
Description¶
Ends a command list for sampling. You will be unable to create samples on the specified command list after GpaEndCommandList is called. For DirectX 12, GpaEndCommandList should be called before the application calls Close on the underlying command list. For Vulkan, it should be called before the application calls vkEndCommandBuffer on the underlying command buffer.
Parameters¶
Name | Description |
---|---|
command_list_id |
Unique identifier of a previously-created command list. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The command list was successfully ended. |
kGpaStatusErrorNullPointer | The supplied command_list_id parameter is NULL. |
kGpaStatusErrorCommandListNotFound | The supplied command_list_id parameter was not recognized as a previously-created command list identifier. |
kGpaStatusErrorCommandListAlreadyEnded | The supplied command list has already been ended. |
kGpaStatusErrorFailed | The command list could not be ended. |
kGpaStatusErrorException | Exception occurred. |
GpaEndSample¶
Syntax¶
GpaStatus GpaEndSample(
GpaCommandListId command_list_id);
Description¶
Ends a sample in a command list. A sample is a particular workload for which counters will be collected. If the owning session was created with kGpaSessionSampleTypeDiscreteCounter and one or more counters have been enabled, then those counters will be collected for this sample. Each sample must be associated with a GPA command list. Samples can be created from multiple threads, provided no two threads are creating samples on the same command list. You must provide a unique Id for every new sample. You may create as many samples as needed. However, nesting of samples is not allowed. Each sample must be wrapped in a sequence of GpaBeginSample/GpaEndSample before starting another one. A sample can be started in one primary command list and continued/ended on another primary command list; see GpaContinueSampleOnCommandList.
Parameters¶
Name | Description |
---|---|
command_list_id |
Unique identifier of a previously-created command list. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The sample was successfully ended. |
kGpaStatusErrorNullPointer | The supplied command_list_id parameter is NULL. |
kGpaStatusErrorCommandListNotFound | The supplied command_list_id parameter was not recognized as a previously-created command list identifier. |
kGpaStatusErrorIndexOutOfRange | The specified command list’s pass index is out of range. |
kGpaStatusErrorFailed | The sample could not be ended. |
kGpaStatusErrorException | Exception occurred. |
GpaEndSession¶
Syntax¶
GpaStatus GpaEndSession(
GpaSessionId session_id);
Description¶
Ends sampling with the currently enabled set of counters. This will end the sampling process. A session must be ended before results for that session can be queried.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The session was successfully ended. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorVariableNumberOfSamplesInPasses | There were an inconsistent number of samples created in each pass of the session. |
kGpaStatusErrorNotEnoughPasses | There were not enough passes created in the session. |
kGpaStatusErrorSessionNotStarted | The session has not been started. |
kGpaStatusErrorFailed | The session could not be ended. |
kGpaStatusErrorOtherSessionActive | Another session is active. |
kGpaStatusErrorException | Exception occurred. |
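Two of the errors listed above concern how samples are spread across passes. The sketch below is an illustrative model of that rule only, under the assumption that every pass must contain the same number of samples and that at least as many passes as reported by GpaGetPassCount must be recorded. The helper name `CheckPasses`, the placeholder status values, and the order of the checks are assumptions; GPA performs its own validation internally and may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for status codes that normally come from gpu_perf_api.h.
// The numeric values are placeholders chosen for this sketch.
enum GpaStatus
{
    kGpaStatusOk                                   = 0,
    kGpaStatusErrorNotEnoughPasses                 = -1,
    kGpaStatusErrorVariableNumberOfSamplesInPasses = -2
};

// samples_per_pass[i] is the number of samples the application created in
// pass i; required_passes is the value reported by GpaGetPassCount.
inline GpaStatus CheckPasses(const std::vector<std::size_t>& samples_per_pass,
                             std::size_t                      required_passes)
{
    if (samples_per_pass.size() < required_passes)
    {
        return kGpaStatusErrorNotEnoughPasses;
    }
    for (std::size_t i = 1; i < samples_per_pass.size(); ++i)
    {
        if (samples_per_pass[i] != samples_per_pass[0])
        {
            return kGpaStatusErrorVariableNumberOfSamplesInPasses;
        }
    }
    return kGpaStatusOk;
}
```

A check of this kind can be useful in application code as a debugging aid before calling GpaEndSession, since it points at the pass whose workload differed.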
GpaGetCounterDataType¶
Syntax¶
GpaStatus GpaGetCounterDataType(
GpaContextId context_id,
GpaUInt32 index,
GpaDataType* counter_data_type);
Description¶
Gets the data type of the specified counter.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
index |
The index of the counter whose data type is needed. Must lie between 0 and (GpaGetNumCounters result - 1). |
counter_data_type |
The address which will hold the data type upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The counter data type was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied counter_data_type parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorIndexOutOfRange | The specified index is out of range. |
kGpaStatusErrorFailed | The counter data type could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetCounterDescription¶
Syntax¶
GpaStatus GpaGetCounterDescription(
GpaContextId context_id,
GpaUInt32 index,
const char** description);
Description¶
Gets the description of the specified counter.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
index |
The index of the counter whose description is needed. Must lie between 0 and (GpaGetNumCounters result - 1). |
description |
The address which will hold the description upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The counter description was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied description parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorIndexOutOfRange | The specified index is out of range. |
kGpaStatusErrorFailed | The counter description could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetCounterGroup¶
Syntax¶
GpaStatus GpaGetCounterGroup(
GpaContextId context_id,
GpaUInt32 index,
const char** group);
Description¶
Gets the group of the specified counter.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
index |
The index of the counter whose group is needed. Must lie between 0 and (GpaGetNumCounters result - 1). |
group |
The address which will hold the group upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The counter group was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied group parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorIndexOutOfRange | The specified index is out of range. |
kGpaStatusErrorFailed | The counter group could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetCounterIndex¶
Syntax¶
GpaStatus GpaGetCounterIndex(
GpaContextId context_id,
const char* counter_name,
GpaUInt32* index);
Description¶
Gets the index of a counter given its name (case insensitive).
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
counter_name |
The name of the counter whose index is needed. |
index |
The address which will hold the index upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The counter index was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied counter_name parameter is NULL. The supplied index parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorCounterNotFound | The specified counter could not be found. |
kGpaStatusErrorException | Exception occurred. |
GpaGetCounterName¶
Syntax¶
GpaStatus GpaGetCounterName(
GpaContextId context_id,
GpaUInt32 index,
const char** name);
Description¶
Gets the name of the specified counter.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
index |
The index of the counter name to query. Must lie between 0 and (GpaGetNumCounters result - 1). |
name |
The address which will hold the name upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The counter name was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied name parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorIndexOutOfRange | The specified index is out of range. |
kGpaStatusErrorFailed | The counter name could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
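The index-based counter queries are usually driven by a loop from 0 to (GpaGetNumCounters result - 1). The following is a minimal sketch of such a loop for GpaGetCounterName. `FakeApi`, `ListCounterNames`, the counter names, and the numeric status values are hypothetical stand-ins so that the block compiles without the GPA headers; `api` is assumed to be an object (such as the function table) exposing members named after the functions documented here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-ins for declarations that normally come from gpu_perf_api.h.
// The numeric values are placeholders chosen for this sketch.
enum GpaStatus
{
    kGpaStatusOk                   = 0,
    kGpaStatusErrorIndexOutOfRange = -1
};
using GpaUInt32    = unsigned int;
using GpaContextId = void*;

// Hypothetical fake exposing two counters.
struct FakeApi
{
    std::vector<const char*> names{"GPUTime", "GPUBusy"};

    GpaStatus GpaGetNumCounters(GpaContextId, GpaUInt32* count)
    {
        *count = static_cast<GpaUInt32>(names.size());
        return kGpaStatusOk;
    }

    GpaStatus GpaGetCounterName(GpaContextId, GpaUInt32 index, const char** name)
    {
        if (index >= names.size()) return kGpaStatusErrorIndexOutOfRange;
        *name = names[index];
        return kGpaStatusOk;
    }
};

// Returns the names of all counters exposed by the context.
// Counters whose name cannot be retrieved are skipped.
template <typename Api>
std::vector<std::string> ListCounterNames(Api& api, GpaContextId context)
{
    std::vector<std::string> result;
    GpaUInt32                count = 0;
    if (api.GpaGetNumCounters(context, &count) != kGpaStatusOk)
    {
        return result;
    }
    for (GpaUInt32 i = 0; i < count; ++i)
    {
        const char* name = nullptr;
        if (api.GpaGetCounterName(context, i, &name) == kGpaStatusOk && name != nullptr)
        {
            result.emplace_back(name);
        }
    }
    return result;
}
```

The same loop shape applies to the other per-index queries in this section, such as GpaGetCounterDescription and GpaGetCounterGroup.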
GpaGetCounterSampleType¶
Syntax¶
GpaStatus GpaGetCounterSampleType(
GpaContextId context_id,
GpaUInt32 index,
GpaCounterSampleType* counter_sample_type);
Description¶
Gets the sample type of the specified counter.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
index |
The index of the counter whose sample type is needed. Must lie between 0 and (GpaGetNumCounters result - 1). |
counter_sample_type |
The address which will hold the sample type upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The counter sample type was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied counter_sample_type parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorIndexOutOfRange | The specified index is out of range. |
kGpaStatusErrorFailed | The counter sample type could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetCounterUsageType¶
Syntax¶
GpaStatus GpaGetCounterUsageType(
GpaContextId context_id,
GpaUInt32 index,
GpaUsageType* counter_usage_type);
Description¶
Gets the usage type of the specified counter. The usage type indicates the units used for the counter.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
index |
The index of the counter whose usage type is needed. Must lie between 0 and (GpaGetNumCounters result - 1). |
counter_usage_type |
The address which will hold the usage type upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The counter usage type was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied counter_usage_type parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorIndexOutOfRange | The specified index is out of range. |
kGpaStatusErrorFailed | The counter usage type could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetCounterUuid¶
Syntax¶
GpaStatus GpaGetCounterUuid(
GpaContextId context_id,
GpaUInt32 index,
GpaUuid* counter_uuid);
Description¶
Gets the UUID of the specified counter. The UUID can be used to uniquely identify the counter. A counter’s unique identifier can change from one version of GPA to the next.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
index |
The index of the counter whose UUID is needed. Must lie between 0 and (GpaGetNumCounters result - 1). |
counter_uuid |
The address which will hold the UUID upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The counter UUID was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied counter_uuid parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorIndexOutOfRange | The specified index is out of range. |
kGpaStatusErrorFailed | The counter UUID could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetDataTypeAsStr¶
Syntax¶
GpaStatus GpaGetDataTypeAsStr(
GpaDataType counter_data_type,
const char** type_str);
Description¶
Gets a string representation of the specified counter data type. This could be used to display counter types along with their name or value. For example, the kGpaDataTypeUint64 counter_data_type would return gpa_uint64.
Parameters¶
Name | Description |
---|---|
counter_data_type |
The data type whose string representation is needed. |
type_str |
The address which will hold the string representation upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The string representation was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied type_str parameter is NULL. |
kGpaStatusErrorInvalidParameter | The counter_data_type parameter has an invalid value. |
kGpaStatusErrorException | Exception occurred. |
GpaGetDeviceAndRevisionId¶
Syntax¶
GpaStatus GpaGetDeviceAndRevisionId(
GpaContextId context_id,
GpaUInt32* device_id,
GpaUInt32* revision_id);
Description¶
Gets the GPU device id and revision id associated with the specified context.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
device_id |
The value that will be set to the device id upon successful execution. |
revision_id |
The value that will be set to the device revision id upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The device id and revision id were successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied device_id parameter is NULL. The supplied revision_id parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorFailed | The device id and revision id could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetDeviceGeneration¶
Syntax¶
GpaStatus GpaGetDeviceGeneration(
GpaContextId context_id,
GpaHwGeneration* hardware_generation);
Description¶
Gets the device generation of the GPU associated with the specified context.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
hardware_generation |
The value that will be set to the device generation upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The device generation was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied hardware_generation parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorFailed | The device generation could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetDeviceName¶
Syntax¶
GpaStatus GpaGetDeviceName(
GpaContextId context_id,
const char** device_name);
Description¶
Gets the device name of the GPU associated with the specified context.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
device_name |
The value that will be set to the device name upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The device name was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied device_name parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorFailed | The device name could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetEnabledIndex¶
Syntax¶
GpaStatus GpaGetEnabledIndex(
GpaSessionId session_id,
GpaUInt32 enabled_number,
GpaUInt32* enabled_counter_index);
Description¶
Gets the counter index for an enabled counter. This is meant to be used with GpaGetNumEnabledCounters. Once you determine the number of enabled counters, you can use GpaGetEnabledIndex to determine which counters are enabled.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
enabled_number |
The number of the enabled counter to get the counter index for. Must lie between 0 and (GpaGetNumEnabledCounters result - 1). |
enabled_counter_index |
The value that will hold the index of the counter upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The counter index was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. The supplied enabled_counter_index parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorContextNotOpen | The supplied session’s parent context is not currently open. |
kGpaStatusErrorIndexOutOfRange | The supplied enabled_number is out of range. |
kGpaStatusErrorFailed | The counter index could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
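The pairing of GpaGetNumEnabledCounters and GpaGetEnabledIndex described above can be sketched as follows. This is an illustration under stated assumptions rather than code from the GPA samples: `FakeApi`, `GetEnabledCounterIndices`, the sample indices, and the numeric status values are hypothetical stand-ins so that the block compiles without the GPA headers, and `api` is assumed to be an object (such as the function table) exposing members named after the two functions.

```cpp
#include <cassert>
#include <vector>

// Stand-ins for declarations that normally come from gpu_perf_api.h.
// The numeric values are placeholders chosen for this sketch.
enum GpaStatus
{
    kGpaStatusOk                   = 0,
    kGpaStatusErrorIndexOutOfRange = -1
};
using GpaUInt32    = unsigned int;
using GpaSessionId = void*;

// Hypothetical fake session in which counters 4 and 7 are enabled.
struct FakeApi
{
    std::vector<GpaUInt32> enabled{4, 7};

    GpaStatus GpaGetNumEnabledCounters(GpaSessionId, GpaUInt32* count)
    {
        *count = static_cast<GpaUInt32>(enabled.size());
        return kGpaStatusOk;
    }

    GpaStatus GpaGetEnabledIndex(GpaSessionId, GpaUInt32 enabled_number,
                                 GpaUInt32* enabled_counter_index)
    {
        if (enabled_number >= enabled.size()) return kGpaStatusErrorIndexOutOfRange;
        *enabled_counter_index = enabled[enabled_number];
        return kGpaStatusOk;
    }
};

// Fills `indices` with the counter index of every enabled counter.
// Returns false as soon as any GPA call reports an error.
template <typename Api>
bool GetEnabledCounterIndices(Api& api, GpaSessionId session,
                              std::vector<GpaUInt32>* indices)
{
    GpaUInt32 count = 0;
    if (api.GpaGetNumEnabledCounters(session, &count) != kGpaStatusOk)
    {
        return false;
    }
    indices->clear();
    for (GpaUInt32 n = 0; n < count; ++n)
    {
        GpaUInt32 index = 0;
        if (api.GpaGetEnabledIndex(session, n, &index) != kGpaStatusOk)
        {
            return false;
        }
        indices->push_back(index);
    }
    return true;
}
```

The returned counter indices can then be passed to the per-index queries, for example GpaGetCounterName, to label the results.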
GpaGetFuncTable¶
Syntax¶
GpaStatus GpaGetFuncTable(
void* gpa_func_table);
Description¶
Gets the GPA API function table. gpa_func_table is both an input and output parameter, whose type is GpaFunctionTable*. Prior to calling this function, the major_version and minor_version members should be set by the caller. When compiling in C++ mode, these will be set automatically. When compiling in C mode, they will need to be set manually to GPA_FUNCTION_TABLE_MAJOR_VERSION_NUMBER and GPA_FUNCTION_TABLE_MINOR_VERSION_NUMBER, respectively. After execution of this function, the major_version and minor_version members will be set to the major and minor version of the GPUPerfAPI library that is loaded. If the versions are determined to be incompatible, the function will return an error.
The minor version of the function table is calculated as the size of the function table structure. This allows for additional functions to be added to the end of the table in future versions, while still maintaining backwards-compatibility. If the minor version number specified by the caller is lower than the minor version number of the GPUPerfAPI library, GpaGetFuncTable will assign the function pointers of the input structure up to the size of the structure (as specified by the minor version). The caller can detect this situation by checking the value of minor_version after successful execution. If the value is larger than the original value, the caller can infer that the version of the GPUPerfAPI library loaded has some additional API functions available that were not present in older versions of the library. The caller can recompile using the newer GPUPerfAPI header files to gain access to the new functions.
If non-backwards compatible changes are ever made in a new version of GPUPerfAPI, the major version of the API table will be incremented. Examples of non-backwards compatible changes would be removal of a public function or changing the signature of a public function. In the case where the major version number specified by the caller differs from the major version number of the GPUPerfAPI library, this function will set the major_version member to the major version of the GPUPerfAPI library that was loaded and return an error.
Parameters¶
Name | Description |
---|---|
gpa_func_table |
Pointer to the GPA function table structure. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The function table was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied gpa_func_table parameter is NULL. |
kGpaStatusErrorLibLoadMajorVersionMismatch | The major version specified by the caller is different from the major version of the GPUPerfAPI library that was loaded. |
kGpaStatusErrorLibLoadMinorVersionMismatch | The minor version specified by the caller is greater than the minor version of the GPUPerfAPI library that was loaded. |
kGpaStatusErrorException | Exception occurred. |
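The size-as-minor-version scheme described for GpaGetFuncTable can be illustrated with a toy model. The sketch below is not GPA's implementation: `TableV1`, `TableV2`, `FakeGetFuncTable`, `kLibMajor`, and the numeric status values are hypothetical, and the real GpaFunctionTable layout comes from the GPA headers. It only shows how a library built with a larger table can fill a smaller, older table up to the caller's structure size and report its own versions back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Stand-ins for declarations that normally come from gpu_perf_api.h.
// The numeric values are placeholders chosen for this sketch.
enum GpaStatus
{
    kGpaStatusOk                               = 0,
    kGpaStatusErrorNullPointer                 = -1,
    kGpaStatusErrorLibLoadMajorVersionMismatch = -2,
    kGpaStatusErrorLibLoadMinorVersionMismatch = -3
};
using GpaUInt32 = unsigned int;
using Fn        = int (*)();

// Hypothetical table layouts: the caller was compiled against TableV1,
// while the "library" was built with the larger TableV2.
struct TableV1 { GpaUInt32 major_version; GpaUInt32 minor_version; Fn first; };
struct TableV2 { GpaUInt32 major_version; GpaUInt32 minor_version; Fn first; Fn second; };

inline int First() { return 1; }
inline int Second() { return 2; }

constexpr GpaUInt32 kLibMajor = 3;  // hypothetical library major version

// Toy model of the negotiation; minor version == size of the table structure.
inline GpaStatus FakeGetFuncTable(void* table)
{
    if (table == nullptr) return kGpaStatusErrorNullPointer;

    const TableV2     lib{kLibMajor, static_cast<GpaUInt32>(sizeof(TableV2)), &First, &Second};
    const std::size_t header = offsetof(TableV2, first);  // the two version fields

    // Read the versions requested by the caller.
    GpaUInt32 caller[2];
    std::memcpy(caller, table, sizeof(caller));

    // Report the library's versions back to the caller.
    std::memcpy(table, &lib, sizeof(caller));

    if (caller[0] != lib.major_version) return kGpaStatusErrorLibLoadMajorVersionMismatch;
    if (caller[1] > lib.minor_version) return kGpaStatusErrorLibLoadMinorVersionMismatch;

    // Fill only as many function pointers as the caller's structure can hold.
    if (caller[1] > header)
    {
        std::memcpy(static_cast<char*>(table) + header,
                    reinterpret_cast<const char*>(&lib) + header,
                    caller[1] - header);
    }
    return kGpaStatusOk;
}
```

In this model a caller holding the older `TableV1` receives a valid `first` pointer and sees minor_version grow to the size of `TableV2`, which is the signal described above that newer functions are available after recompiling against newer headers.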
GpaGetNumCounters¶
Syntax¶
GpaStatus GpaGetNumCounters(
GpaContextId context_id,
GpaUInt32* count);
Description¶
Gets the number of counters available.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
count |
The value which will hold the count upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The number of counters was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied count parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorFailed | The number of counters could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetNumEnabledCounters¶
Syntax¶
GpaStatus GpaGetNumEnabledCounters(
GpaSessionId session_id,
GpaUInt32* count);
Description¶
Gets the number of enabled counters.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
count |
The value which will hold the number of enabled counters contained within the session upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The number of enabled counters was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. The supplied count parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorContextNotOpen | The supplied session’s parent context is not currently open. |
kGpaStatusErrorFailed | The number of enabled counters could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetPassCount¶
Syntax¶
GpaStatus GpaGetPassCount(
GpaSessionId session_id,
GpaUInt32* num_passes);
Description¶
Gets the number of passes required for the currently enabled set of counters.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
num_passes |
The value which will hold the number of required passes upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The pass count was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. The supplied num_passes parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorContextNotOpen | The supplied session’s parent context is not currently open. |
kGpaStatusErrorFailed | The pass count could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetSampleCount¶
Syntax¶
GpaStatus GpaGetSampleCount(
GpaSessionId session_id,
GpaUInt32* sample_count);
Description¶
Gets the number of samples created for the specified session. This is useful if samples are conditionally created and a count is not kept. The session must have been ended by GpaEndSession before calling this function.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
sample_count |
The value which will hold the number of samples contained within the session upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The sample count was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. The supplied sample_count parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorSessionNotEnded | The session has not been ended. A session must have been ended with GpaEndSession prior to querying the number of samples. |
kGpaStatusErrorFailed | The sample count could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetSampleId¶
Syntax¶
GpaStatus GpaGetSampleId(
GpaSessionId session_id,
GpaUInt32 index,
GpaUInt32* sample_id);
Description¶
Gets the sample id of the sample with the specified index. This is useful if sample ids are either non-zero-based or non-consecutive.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
index |
Zero-based index of the sample whose sample id is needed. Must lie between 0 and (GpaGetSampleCount result - 1). |
sample_id |
The value that will hold the id of the sample upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The sample id was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. The supplied sample_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorSessionNotStarted | The session has not been started. |
kGpaStatusErrorSampleNotFound | The specified sample could not be found. |
kGpaStatusErrorException | Exception occurred. |
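Since sample ids may be non-zero-based or non-consecutive, a caller usually walks the indexes and records the ids. The following is a minimal sketch of that loop. The two functions carry the documented signatures but are stubs backed by invented data, and the GpaSessionId alias, enumerator values, and CollectSampleIds helper are assumptions made so the example compiles on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins for illustration; real code uses the GPUPerfAPI headers.
using GpaUInt32 = std::uint32_t;
using GpaSessionId = void*;  // hypothetical; the real type is an opaque handle
enum GpaStatus : int {
    kGpaStatusOk = 0,
    kGpaStatusErrorNullPointer = -1,    // placeholder value
    kGpaStatusErrorSampleNotFound = -2  // placeholder value
};

// Invented session contents: three samples with non-consecutive ids.
static const std::vector<GpaUInt32> kStubSampleIds = {10, 20, 35};

// Stub with the documented GpaGetSampleCount signature.
GpaStatus GpaGetSampleCount(GpaSessionId session_id, GpaUInt32* sample_count) {
    if (session_id == nullptr || sample_count == nullptr) return kGpaStatusErrorNullPointer;
    *sample_count = static_cast<GpaUInt32>(kStubSampleIds.size());
    return kGpaStatusOk;
}

// Stub with the documented GpaGetSampleId signature.
GpaStatus GpaGetSampleId(GpaSessionId session_id, GpaUInt32 index, GpaUInt32* sample_id) {
    if (session_id == nullptr || sample_id == nullptr) return kGpaStatusErrorNullPointer;
    if (index >= kStubSampleIds.size()) return kGpaStatusErrorSampleNotFound;
    *sample_id = kStubSampleIds[index];
    return kGpaStatusOk;
}

// Collects every sample id in the session; returns false on the first failure.
bool CollectSampleIds(GpaSessionId session_id, std::vector<GpaUInt32>* ids) {
    GpaUInt32 count = 0;
    if (GpaGetSampleCount(session_id, &count) != kGpaStatusOk) return false;
    ids->clear();
    for (GpaUInt32 index = 0; index < count; ++index) {
        GpaUInt32 id = 0;
        if (GpaGetSampleId(session_id, index, &id) != kGpaStatusOk) return false;
        ids->push_back(id);
    }
    return true;
}
```

The collected ids can then be passed one at a time to the result-retrieval functions.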
GpaGetSampleResult¶
Syntax¶
GpaStatus GpaGetSampleResult(
GpaSessionId session_id,
GpaUInt32 sample_id,
size_t sample_result_size_in_bytes,
void* counter_sample_results);
Description¶
Gets the result data for a given sample. This function will block until results are ready. Use GpaIsSessionComplete to check if results are ready. For discrete counter samples, the data will be a set of contiguous 64-bit values, one for each counter collected for the sample. After the results are returned, you can iterate through the buffer to read the individual counter values back. Execution of all command lists (DirectX 12) or command buffers (Vulkan) must be complete before results will be available. Results for samples created in secondary command lists will not be available unless GpaCopySecondarySamples has been called to copy the samples back to the primary command list.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
sample_id |
Unique identifier of a previously-created sample. |
sample_result_size_in_bytes |
The size of the specified sample’s results - this value should have been queried from GpaGetSampleResultSize. |
counter_sample_results |
Address to which the counter data for the sample will be copied. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The sample result was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. The supplied counter_sample_results parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorSampleNotFound | The specified sample was not found in the specified session. |
kGpaStatusErrorSessionNotEnded | The session has not been ended. A session must have been ended with GpaEndSession prior to retrieving results. |
kGpaStatusErrorReadingSampleRequest | The sample result could not be read. |
kGpaStatusErrorSampleInSecondaryCommandList | An attempt was made to read a result from a secondary command list. Samples from a secondary command list must be copied to the primary command list using GpaCopySecondarySamples. |
kGpaStatusErrorIndexOutOfRange | An internal operation to index a particular counter failed. |
kGpaStatusErrorException | Exception occurred. |
GpaGetSampleResultSize¶
Syntax¶
GpaStatus GpaGetSampleResultSize(
GpaSessionId session_id,
GpaUInt32 sample_id,
size_t* sample_result_size_in_bytes);
Description¶
Gets the result size (in bytes) for a given sample. For discrete counter samples, the size will be the same for all samples, so it would be valid to retrieve the result size for one sample and use that when retrieving results for all samples. The retrieved size should be passed to GpaGetSampleResult to retrieve the actual results. Execution of all command lists (DirectX 12) or command buffers (Vulkan) must be complete before results will be available.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
sample_id |
Unique identifier of a previously-created sample. |
sample_result_size_in_bytes |
The value that will be set to the result size upon successful execution - this value needs to be passed to GpaGetSampleResult. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The sample result size was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. The supplied sample_result_size_in_bytes parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorSessionNotEnded | The session has not been ended. A session must have been ended with GpaEndSession prior to retrieving results. |
kGpaStatusErrorSampleNotFound | The specified sample was not found in the specified session. |
kGpaStatusErrorException | Exception occurred. |
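GpaGetSampleResultSize and GpaGetSampleResult are meant to be used together: query the size, allocate a buffer, then read the values. A minimal sketch of that sequence for discrete counter samples follows. The two functions are stubs that follow the documented signatures and return invented data; the type aliases, enumerator values, and the ReadSample helper are assumptions for illustration. How each 64-bit value should be interpreted depends on the counter and is not covered by this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-ins for illustration; real code uses the GPUPerfAPI headers.
using GpaUInt32 = std::uint32_t;
using GpaUInt64 = std::uint64_t;
using GpaSessionId = void*;  // hypothetical; the real type is an opaque handle
enum GpaStatus : int {
    kGpaStatusOk = 0,
    kGpaStatusErrorNullPointer = -1,          // placeholder value
    kGpaStatusErrorReadingSampleRequest = -2  // placeholder value
};

// Invented stub data: two enabled counters per sample.
static const GpaUInt64 kStubValues[2] = {1234, 99};

// Stub with the documented GpaGetSampleResultSize signature.
GpaStatus GpaGetSampleResultSize(GpaSessionId session_id, GpaUInt32 /*sample_id*/,
                                 std::size_t* sample_result_size_in_bytes) {
    if (session_id == nullptr || sample_result_size_in_bytes == nullptr) {
        return kGpaStatusErrorNullPointer;
    }
    *sample_result_size_in_bytes = sizeof(kStubValues);
    return kGpaStatusOk;
}

// Stub with the documented GpaGetSampleResult signature.
GpaStatus GpaGetSampleResult(GpaSessionId session_id, GpaUInt32 /*sample_id*/,
                             std::size_t sample_result_size_in_bytes,
                             void* counter_sample_results) {
    if (session_id == nullptr || counter_sample_results == nullptr) {
        return kGpaStatusErrorNullPointer;
    }
    if (sample_result_size_in_bytes < sizeof(kStubValues)) {
        return kGpaStatusErrorReadingSampleRequest;
    }
    std::memcpy(counter_sample_results, kStubValues, sizeof(kStubValues));
    return kGpaStatusOk;
}

// Reads one discrete counter sample into a vector of 64-bit values.
bool ReadSample(GpaSessionId session_id, GpaUInt32 sample_id, std::vector<GpaUInt64>* values) {
    std::size_t size_in_bytes = 0;
    if (GpaGetSampleResultSize(session_id, sample_id, &size_in_bytes) != kGpaStatusOk) {
        return false;
    }
    assert(size_in_bytes % sizeof(GpaUInt64) == 0);  // contiguous 64-bit values
    values->assign(size_in_bytes / sizeof(GpaUInt64), 0);
    return GpaGetSampleResult(session_id, sample_id, size_in_bytes, values->data()) ==
           kGpaStatusOk;
}
```

For discrete counter samples the size is the same for every sample, so an application could query it once and reuse the buffer across samples.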
GpaGetStatusAsStr¶
Syntax¶
const char* GpaGetStatusAsStr(
GpaStatus status);
Description¶
Gets a string representation of the specified GPA status value. Provides a simple method to convert a GpaStatus value into a string which can be used to display log messages. When an error is returned from a GPA function, GPA will also output more information about the error to a logging function if one has been registered using GpaRegisterLoggingCallback.
Parameters¶
Name | Description |
---|---|
status |
The status whose string representation is needed. |
Return value¶
A string which briefly describes the specified status. If the specified status is unknown, this function will return either “Unknown Status” or “Unknown Error”.
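One common use of GpaGetStatusAsStr is a small helper that turns a failed call into a log line. The sketch below assumes a stub in place of the real function: the strings it returns, the enumerator values, and the FormatGpaError helper are all invented for illustration.

```cpp
#include <cassert>
#include <string>

// Stand-ins for illustration; real code uses the GPUPerfAPI headers.
enum GpaStatus : int {
    kGpaStatusOk = 0,
    kGpaStatusErrorNullPointer = -1  // placeholder value
};

// Stub with the documented signature; the returned strings are invented,
// apart from the "Unknown Error" fallback mentioned in the text above.
const char* GpaGetStatusAsStr(GpaStatus status) {
    switch (status) {
        case kGpaStatusOk:
            return "Ok";
        case kGpaStatusErrorNullPointer:
            return "Null pointer";
        default:
            return "Unknown Error";
    }
}

// Returns an empty string on success, otherwise "<call> failed: <status text>".
std::string FormatGpaError(const char* call_name, GpaStatus status) {
    if (status == kGpaStatusOk) return std::string();
    return std::string(call_name) + " failed: " + GpaGetStatusAsStr(status);
}
```

A caller could wrap each GPA call with such a helper and forward any non-empty result to its own logger.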
GpaGetSupportedSampleTypes¶
Syntax¶
GpaStatus GpaGetSupportedSampleTypes(
GpaContextId context_id,
GpaContextSampleTypeFlags* sample_types);
Description¶
Gets a mask of the sample types supported by the specified context. A call to GpaCreateSession will fail if the requested sample types are not compatible with the context’s sample types.
Parameters¶
Name | Description |
---|---|
context_id |
Unique identifier of a previously-opened context. |
sample_types |
The value that will be set to the mask of the supported sample types upon successful execution. This will be a combination of GpaSampleBits. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The supported sample types were successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied sample_types parameter is NULL. The supplied context_id parameter is NULL. |
kGpaStatusErrorContextNotOpen | The supplied context is not currently open. |
kGpaStatusErrorContextNotFound | The supplied context_id parameter was not recognized as a previously-opened context identifier. |
kGpaStatusErrorFailed | The supported sample types could not be retrieved. |
kGpaStatusErrorException | Exception occurred. |
GpaGetUsageTypeAsStr¶
Syntax¶
GpaStatus GpaGetUsageTypeAsStr(
GpaUsageType counter_usage_type,
const char** usage_type_str);
Description¶
Gets a string representation of the specified counter usage type. This could be
used to display counter units along with their name or value. For example,
passing kGpaUsageTypePercentage as counter_usage_type would set usage_type_str
to “percentage”.
Parameters¶
Name | Description |
---|---|
counter_usage_type |
The usage type whose string representation is needed. |
usage_type_str |
The address which will hold the string representation upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The string representation was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied usage_type_str parameter is NULL. |
kGpaStatusErrorInvalidParameter | The counter_usage_type parameter has an invalid value. |
kGpaStatusErrorException | Exception occurred. |
GpaGetVersion¶
Syntax¶
GpaStatus GpaGetVersion(
GpaUInt32* major_version,
GpaUInt32* minor_version,
GpaUInt32* build_version,
GpaUInt32* update_version);
Description¶
Gets the GPA version.
Parameters¶
Name | Description |
---|---|
major_version |
The value that will hold the major version of GPA upon successful execution. |
minor_version |
The value that will hold the minor version of GPA upon successful execution. |
build_version |
The value that will hold the build number of GPA upon successful execution. |
update_version |
The value that will hold the update version of GPA upon successful execution. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The GPA version was successfully retrieved. |
kGpaStatusErrorNullPointer | The supplied major_version parameter is NULL. The supplied minor_version parameter is NULL. The supplied build_version parameter is NULL. The supplied update_version parameter is NULL. |
kGpaStatusErrorException | Exception occurred. |
GpaInitialize¶
Syntax¶
GpaStatus GpaInitialize(
GpaInitializeFlags flags);
Description¶
Initializes the driver so that counters are exposed. This function must be called before the rendering context or device is created. In the case of DirectX 12 or Vulkan, this function must be called before a queue is created.
Parameters¶
Name | Description |
---|---|
flags |
Flags used to initialize GPA. This should be a combination of GpaInitializeBits. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | GPA was initialized. |
kGpaStatusErrorGpaAlreadyInitialized | GpaInitialize was already called. |
kGpaStatusErrorInvalidParameter | The flags parameter has an invalid value. |
kGpaStatusErrorException | Exception occurred. |
GpaIsCounterEnabled¶
Syntax¶
GpaStatus GpaIsCounterEnabled(
GpaSessionId session_id,
GpaUInt32 counter_index);
Description¶
Checks whether or not a counter is enabled.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
counter_index |
The index of the counter. Must lie between 0 and (GpaGetNumCounters result - 1). |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The specified counter is enabled. |
kGpaStatusErrorCounterNotFound | The specified counter is not enabled. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorCannotChangeCountersWhenSampling | The set of enabled counters cannot be changed after GpaBeginSession is called. |
kGpaStatusErrorContextNotOpen | The supplied session’s parent context is not currently open. |
kGpaStatusErrorIndexOutOfRange | The specified index is out of range. |
kGpaStatusErrorFailed | An error occurred while trying to retrieve the enabled status of the specified counter. |
kGpaStatusErrorException | Exception occurred. |
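Note that GpaIsCounterEnabled reports its answer through the status code: kGpaStatusOk means enabled and kGpaStatusErrorCounterNotFound means not enabled, while any other value is a genuine error. The sketch below shows one way to handle that distinction. The function is a stub with the documented signature and invented state; the aliases, enumerator values, and ListEnabledCounters helper are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins for illustration; real code uses the GPUPerfAPI headers.
using GpaUInt32 = std::uint32_t;
using GpaSessionId = void*;  // hypothetical; the real type is an opaque handle
enum GpaStatus : int {
    kGpaStatusOk = 0,
    kGpaStatusErrorNullPointer = -1,      // placeholder value
    kGpaStatusErrorCounterNotFound = -2,  // placeholder value
    kGpaStatusErrorIndexOutOfRange = -3   // placeholder value
};

// Invented stub state: four counters exposed, counters 0 and 2 enabled.
static const bool kStubEnabled[4] = {true, false, true, false};

// Stub with the documented GpaIsCounterEnabled signature.
GpaStatus GpaIsCounterEnabled(GpaSessionId session_id, GpaUInt32 counter_index) {
    if (session_id == nullptr) return kGpaStatusErrorNullPointer;
    if (counter_index >= 4) return kGpaStatusErrorIndexOutOfRange;
    return kStubEnabled[counter_index] ? kGpaStatusOk : kGpaStatusErrorCounterNotFound;
}

// Lists the indexes of enabled counters; returns false on a genuine error.
bool ListEnabledCounters(GpaSessionId session_id, GpaUInt32 num_counters,
                         std::vector<GpaUInt32>* enabled) {
    enabled->clear();
    for (GpaUInt32 index = 0; index < num_counters; ++index) {
        const GpaStatus status = GpaIsCounterEnabled(session_id, index);
        if (status == kGpaStatusOk) {
            enabled->push_back(index);
        } else if (status != kGpaStatusErrorCounterNotFound) {
            return false;  // anything other than "not enabled" is a real failure
        }
    }
    return true;
}
```

In a real application, num_counters would come from GpaGetNumCounters.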
GpaIsPassComplete¶
Syntax¶
GpaStatus GpaIsPassComplete(
GpaSessionId session_id,
GpaUInt32 pass_index);
Description¶
Checks whether or not a pass has finished. After sampling a workload, results may be available immediately or take a certain amount of time to become available. This function allows you to determine when the pass has finished and associated resources are no longer needed in the application. The function does not block, permitting periodic polling. The application must not free its resources until this function returns kGpaStatusOk.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
pass_index |
Zero-based index of the pass to check. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The pass is complete and results are ready. |
kGpaStatusErrorResultNotReady | The pass is not yet ready. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorSessionNotStarted | The session has not been started. |
kGpaStatusErrorException | Exception occurred. |
GpaIsSessionComplete¶
Syntax¶
GpaStatus GpaIsSessionComplete(
GpaSessionId session_id);
Description¶
Checks if results for all samples within a session are available. After a sampling session, results may be available immediately or take a certain amount of time to become available. This function allows you to determine when the results of a session can be read. The function does not block, permitting periodic polling. To block until a sample is ready, use GpaGetSampleResult instead. Execution of all command lists (DirectX 12) or command buffers (Vulkan) must be complete before results will be available.
Parameters¶
Name | Description |
---|---|
session_id |
Unique identifier of a previously-created session. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The session is complete and results are ready. |
kGpaStatusErrorResultNotReady | The session is not yet ready. |
kGpaStatusErrorNullPointer | The supplied session_id parameter is NULL. |
kGpaStatusErrorSessionNotFound | The supplied session_id parameter was not recognized as a previously-created session identifier. |
kGpaStatusErrorSessionNotStarted | The session has not been started. |
kGpaStatusErrorSessionNotEnded | The session has not been ended. A session must have been ended with GpaEndSession prior to retrieving results. |
kGpaStatusErrorException | Exception occurred. |
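Because GpaIsSessionComplete does not block, it can be polled with a bounded retry loop. The sketch below treats kGpaStatusErrorResultNotReady as "keep waiting" and any other status as final. The function is a stub with the documented signature that becomes ready after a few polls; the enumerator values, the g_polls_until_ready counter, and the WaitForSession helper with its poll limit are assumptions for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Stand-ins for illustration; real code uses the GPUPerfAPI headers.
using GpaSessionId = void*;  // hypothetical; the real type is an opaque handle
enum GpaStatus : int {
    kGpaStatusOk = 0,
    kGpaStatusErrorNullPointer = -1,    // placeholder value
    kGpaStatusErrorResultNotReady = -2  // placeholder value
};

// Invented stub state: the session reports "not ready" for the first few polls.
static int g_polls_until_ready = 3;

// Stub with the documented GpaIsSessionComplete signature.
GpaStatus GpaIsSessionComplete(GpaSessionId session_id) {
    if (session_id == nullptr) return kGpaStatusErrorNullPointer;
    if (g_polls_until_ready > 0) {
        --g_polls_until_ready;
        return kGpaStatusErrorResultNotReady;
    }
    return kGpaStatusOk;
}

// Polls until the session completes, an error occurs, or max_polls is reached.
// Returns the last status seen.
GpaStatus WaitForSession(GpaSessionId session_id, int max_polls,
                         std::chrono::milliseconds interval) {
    assert(max_polls > 0);
    GpaStatus status = kGpaStatusErrorResultNotReady;
    for (int poll = 0; poll < max_polls; ++poll) {
        status = GpaIsSessionComplete(session_id);
        if (status != kGpaStatusErrorResultNotReady) break;  // finished or failed
        std::this_thread::sleep_for(interval);
    }
    return status;
}
```

A return value of kGpaStatusErrorResultNotReady from WaitForSession means the poll limit was reached; the caller can retry later or fall back to the blocking GpaGetSampleResult.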
GpaOpenContext¶
Syntax¶
GpaStatus GpaOpenContext(
void* context,
GpaOpenContextFlags flags,
GpaContextId* context_id);
Description¶
Opens the specified context, which provides access to GPU performance counters. This function must be called after GpaInitialize and before any other GPUPerfAPI functions.
The type of the supplied context is different depending on which API is
being used. See the table below for the required type which should be passed to
GpaOpenContext:
API | GpaOpenContext context Parameter Type |
---|---|
Vulkan | GpaVkContextOpenInfo* (defined in gpu_perf_api_vk.h) |
DirectX 12 | ID3D12Device* |
DirectX 11 | ID3D11Device* |
OpenGL | Windows: HGLRC; Linux: GLXContext |
OpenCL | cl_command_queue* |
Parameters¶
Name | Description |
---|---|
context |
The context to open counters for. The specific type for this parameter depends on which API GPUPerfAPI is being used with. Refer to the table above for the specific type to be used. |
flags |
Flags used to initialize the context. This should be a combination of GpaOpenContextBits. |
context_id |
On successful execution of this function, this parameter will be set to a GPA-generated unique context identifier. This value can subsequently be passed to any GPA function taking a GpaContextId as an input parameter. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The context was successfully opened. |
kGpaStatusErrorNullPointer | The supplied context parameter is NULL. |
kGpaStatusErrorInvalidParameter | The flags parameter has an invalid value. |
kGpaStatusErrorHardwareNotSupported | The current GPU hardware is not supported. |
kGpaStatusErrorDriverNotSupported | The currently-installed GPU driver is not supported. |
kGpaStatusErrorContextAlreadyOpen | The supplied context has already been opened. |
kGpaStatusErrorFailed | The context could not be opened. |
kGpaStatusErrorException | Exception occurred. |
A Note about GPU Clock Modes¶
Due to the desire to operate within reasonable power envelopes, modern GPUs
employ techniques which alter the frequencies of their clocks dynamically.
This can make tuning for performance difficult, as there is no single clock
frequency which can be assumed. By default, GPA uses a clock mode known as
“Profiling Clocks”. Under this mode, the clocks will be fixed at frequencies
which may be lower than the normal operating frequencies. This mode should help
to ensure consistent results between different runs of the application.
However, the observed performance of the application (especially using the
GPUTime counter) may be lower than expected or lower than the application can
achieve during normal operation. Using the flags parameter when calling
GpaOpenContext, you can alter the GPU clock frequencies used while profiling.
The table below explains the stable clock modes that can be specified via the
flags parameter.
Clock mode | Description |
---|---|
kGpaOpenContextDefaultBit (or any combination of GpaOpenContextBits which doesn’t include any of the kGpaOpenContextClockMode* bits) |
Clocks are set to stable frequencies which are known to be power and thermal sustainable. The ratio between the engine and memory clock frequencies will be kept the same as much as possible. |
kGpaOpenContextClockModeNoneBit |
Clock frequencies are not altered and may vary widely during profiling based on GPU usage and other factors. |
kGpaOpenContextClockModePeakBit |
Clocks are set to peak frequencies. In most cases this is safe to do for short periods of time while profiling. However, the GPU clock frequencies could still be reduced from peak level under power and thermal constraints. |
kGpaOpenContextClockModeMinMemoryBit |
The memory clock frequency is set to the minimum level, while the engine clock is set to a power and thermal sustainable level. |
kGpaOpenContextClockModeMinEngineBit |
The engine clock frequency is set to the minimum level, while the memory clock is set to a power and thermal sustainable level. |
A Note about Raw Hardware Counters¶
By default, GPA exposes a set of derived counters that are computed from one or
more raw hardware counters. GPA can also be configured to expose the raw
hardware counters directly. In order to do this, the flags parameter
specified when calling GpaOpenContext should include the
kGpaOpenContextEnableHardwareCountersBit bit.
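The flags value passed to GpaOpenContext is a bit mask, so a clock mode bit and the raw hardware counter bit can be combined with a bitwise OR. The sketch below decodes such a mask according to the clock mode table. The bit names are taken from this page, but the numeric values, the GpaOpenContextFlags alias, and both helper functions are assumptions for illustration. How the library treats a value with several clock mode bits set is not stated here, so the sketch only reports that case.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-ins for illustration; real code uses the GPUPerfAPI headers.
using GpaOpenContextFlags = std::uint32_t;  // assumed to be a plain bit mask

// Bit names come from this page; the numeric values are placeholders.
enum GpaOpenContextBits : std::uint32_t {
    kGpaOpenContextDefaultBit = 0x0000,
    kGpaOpenContextClockModeNoneBit = 0x0008,
    kGpaOpenContextClockModePeakBit = 0x0010,
    kGpaOpenContextClockModeMinMemoryBit = 0x0020,
    kGpaOpenContextClockModeMinEngineBit = 0x0040,
    kGpaOpenContextEnableHardwareCountersBit = 0x0080
};

// Mask covering every clock mode bit listed in the table.
constexpr GpaOpenContextFlags kClockModeMask =
    kGpaOpenContextClockModeNoneBit | kGpaOpenContextClockModePeakBit |
    kGpaOpenContextClockModeMinMemoryBit | kGpaOpenContextClockModeMinEngineBit;

// Names the clock mode that a flags value selects, following the table.
std::string DescribeClockMode(GpaOpenContextFlags flags) {
    switch (flags & kClockModeMask) {
        case 0:
            return "profiling clocks (default)";
        case kGpaOpenContextClockModeNoneBit:
            return "unaltered clocks";
        case kGpaOpenContextClockModePeakBit:
            return "peak clocks";
        case kGpaOpenContextClockModeMinMemoryBit:
            return "minimum memory clock";
        case kGpaOpenContextClockModeMinEngineBit:
            return "minimum engine clock";
        default:
            return "more than one clock mode bit set";
    }
}

// True when the flags ask GPA to expose raw hardware counters.
bool RawCountersRequested(GpaOpenContextFlags flags) {
    return (flags & kGpaOpenContextEnableHardwareCountersBit) != 0;
}
```

For example, `kGpaOpenContextClockModePeakBit | kGpaOpenContextEnableHardwareCountersBit` requests peak clocks together with raw hardware counters.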
GpaRegisterLoggingCallback¶
Syntax¶
GpaStatus GpaRegisterLoggingCallback(
GpaLoggingType logging_type,
GpaLoggingCallbackPtrType callback_func_ptr);
Description¶
Registers a callback function to receive log messages. Only one callback
function can be registered, so the implementation should be able to handle the
different types of messages. A parameter to the callback function will indicate
the message type being received. Messages will not contain a newline character
at the end of the message. To unregister a callback function, specify
kGpaLoggingNone for the logging_type and NULL for the callback_func_ptr.
Parameters¶
Name | Description |
---|---|
logging_type |
Identifies the type of messages to receive callbacks for. |
callback_func_ptr |
Pointer to the callback function. |
Return value¶
Return value | Description |
---|---|
kGpaStatusOk | The logging callback function was successfully registered. |
kGpaStatusErrorNullPointer | The supplied callback_func_ptr parameter is NULL and the specified logging_type is not kGpaLoggingNone. |
kGpaStatusErrorException | Exception occurred. |
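Since only one callback can be registered and messages arrive without a trailing newline, the callback typically branches on the message type and appends the newline itself. The sketch below illustrates this with a stub registration function. The callback signature (message type plus message string), the GpaLoggingType enumerators other than kGpaLoggingNone, all numeric values, and the names OnGpaLog, g_callback, and g_log_lines are assumptions; check gpu_perf_api_types.h for the real declarations.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-ins for illustration; real code uses the GPUPerfAPI headers.
enum GpaStatus : int {
    kGpaStatusOk = 0,
    kGpaStatusErrorNullPointer = -1  // placeholder value
};
enum GpaLoggingType : int {
    kGpaLoggingNone = 0,
    kGpaLoggingError = 1,   // assumed enumerator and value
    kGpaLoggingMessage = 2  // assumed enumerator and value
};
// Assumed callback shape: the message type followed by the message text.
using GpaLoggingCallbackPtrType = void (*)(GpaLoggingType message_type, const char* message);

// Stub registration that keeps the single callback, as the text describes.
static GpaLoggingCallbackPtrType g_callback = nullptr;
GpaStatus GpaRegisterLoggingCallback(GpaLoggingType logging_type,
                                     GpaLoggingCallbackPtrType callback_func_ptr) {
    if (callback_func_ptr == nullptr && logging_type != kGpaLoggingNone) {
        return kGpaStatusErrorNullPointer;
    }
    g_callback = callback_func_ptr;  // NULL with kGpaLoggingNone unregisters
    return kGpaStatusOk;
}

// Application-side sink for the formatted log lines.
static std::vector<std::string> g_log_lines;

// One callback for every message type; appends the newline that GPA omits.
void OnGpaLog(GpaLoggingType message_type, const char* message) {
    const char* prefix = (message_type == kGpaLoggingError) ? "[GPA error] " : "[GPA] ";
    g_log_lines.push_back(std::string(prefix) + message + "\n");
}
```

Registering the callback early, before other GPA calls, means the extra error detail mentioned under GpaGetStatusAsStr is captured from the start.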
Types¶
For information on the various typedefs, enumerations, and macros used in the GPUPerfAPI interface, please refer to the declarations in the gpu_performance_api/gpu_perf_api_types.h header file.