Performance Counters

The performance counters exposed through GPU Performance API (GPUPerfAPI) are organized into groups to make the available data easier to navigate. Below is a combined list of counters from all supported hardware generations. Some counters may not be available, depending on the hardware being profiled. To see which GPUs belong to which hardware generation, the best reference is the gs_cardInfo array in the common-src-DeviceInfo repository on GitHub. The GDT_HW_GENERATION enum shows how the various cards map to hardware generations.

For Graphics workloads, it is recommended that you initially profile with counters from the Timing group to determine whether the profiled calls are worth optimizing (based on the GPUTime value) and which parts of the pipeline are doing the most work. Because the GPU is highly parallelized, various parts of the pipeline can be active at the same time; as a result, the “Busy” counters will likely sum to more than 100 percent. After identifying one or more stages to investigate further, enable the corresponding counter groups to get more information on those stages and on whether potential optimizations exist.
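The triage workflow described above can be sketched as follows. This is a minimal Python sketch, assuming counter values have already been collected per API call into a plain dictionary keyed by counter name. The `CallResult` type, the `triage` function, the GPUTime threshold, and the pairing of each Busy counter with a follow-up counter group are hypothetical illustrations, not part of GPUPerfAPI.

```python
from dataclasses import dataclass, field


@dataclass
class CallResult:
    """Hypothetical container: one profiled API call and its counter values."""
    name: str
    counters: dict = field(default_factory=dict)


# Illustrative (assumed) pairing of Timing-group Busy counters with the
# counter group worth enabling next for that stage.
BUSY_TO_GROUP = {
    "VsGsBusy": "VertexGeometry",
    "PreTessellationBusy": "PreTessellation",
    "PostTessellationBusy": "PostTessellation",
    "PrimitiveAssemblyBusy": "PrimitiveAssembly",
    "PSBusy": "PixelShader",
    "CSBusy": "ComputeShader",
    "TexUnitBusy": "TextureUnit",
    "DepthStencilTestBusy": "DepthAndStencil",
}


def triage(calls, min_gpu_time_ns, top_n=2):
    """Map each call worth optimizing to the counter groups to enable next."""
    plan = {}
    for call in calls:
        # Skip calls whose GPUTime is too small to be worth optimizing.
        if call.counters.get("GPUTime", 0) < min_gpu_time_ns:
            continue
        busy = [
            (call.counters[counter], group)
            for counter, group in BUSY_TO_GROUP.items()
            if counter in call.counters
        ]
        # Stages overlap in time, so Busy values are ranked against each other
        # rather than expected to add up to 100 percent.
        busy.sort(reverse=True)
        plan[call.name] = [group for _, group in busy[:top_n]]
    return plan


calls = [
    CallResult("draw_a", {"GPUTime": 250_000, "PSBusy": 85.0,
                          "TexUnitBusy": 70.0, "VsGsBusy": 20.0}),
    CallResult("draw_b", {"GPUTime": 1_000, "PSBusy": 90.0}),
]
print(triage(calls, min_gpu_time_ns=100_000))
```

The Busy values for `draw_a` sum to 175 percent, which is expected given the overlap between pipeline stages.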

Pipeline-Based Counter Groups

On Vega, RDNA, RDNA2, and RDNA3 hardware, certain use cases allow the driver to optimize by combining two shader stages. For example, in a Vertex + Geometry + Pixel Shader pipeline (VS-GS-PS), the Vertex and Geometry Shaders are combined, and GPUPerfAPI exposes them in the “VertexGeometry” group (counters with the “VsGs” prefix). In pipelines that use tessellation, the Vertex and Hull Shaders are combined and exposed as the “PreTessellation” group (“PreTess” prefix), and the Domain and Geometry Shaders (if a GS is used) are combined into the “PostTessellation” group (“PostTess” prefix). Pixel Shaders and Compute Shaders are always exposed as their respective types. The table below maps the API-level shaders (across the top) to the prefixes to look for in the GPUPerfAPI counters.

Pipeline	Vertex	Hull	Domain	Geometry	Pixel	Compute
VS-PS	VsGs				PS
VS-GS-PS	VsGs			VsGs	PS
VS-HS-DS-PS	PreTess	PreTess	PostTess		PS
VS-HS-DS-GS-PS	PreTess	PreTess	PostTess	PostTess	PS
CS						CS
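The combination rules above can be expressed as a small lookup. This Python sketch derives the counter prefix for an API-level shader stage from a pipeline string such as `VS-HS-DS-GS-PS`; the function and dictionary names are hypothetical, stages that a pipeline does not contain return `None`, and the rules only apply to the hardware generations named above that merge shader stages.

```python
# Counter group that corresponds to each GPUPerfAPI counter prefix.
PREFIX_TO_GROUP = {
    "VsGs": "VertexGeometry",
    "PreTess": "PreTessellation",
    "PostTess": "PostTessellation",
    "PS": "PixelShader",
    "CS": "ComputeShader",
}


def counter_prefix(pipeline, stage):
    """Return the counter prefix for `stage` ("VS", "HS", "DS", "GS", "PS",
    or "CS") in `pipeline` (e.g. "VS-GS-PS"), or None if the stage is absent."""
    stages = pipeline.split("-")
    if stage not in stages:
        return None
    if stage in ("PS", "CS"):
        return stage  # never merged with another stage
    if "HS" not in stages:
        return "VsGs"  # no tessellation: VS and GS are merged
    if stage in ("VS", "HS"):
        return "PreTess"  # tessellation: VS and HS are merged
    return "PostTess"  # tessellation: DS and GS are merged


for pipeline in ("VS-PS", "VS-GS-PS", "VS-HS-DS-PS", "VS-HS-DS-GS-PS"):
    print(pipeline, {s: counter_prefix(pipeline, s) for s in pipeline.split("-")})
```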

A Note About Third-Party Applications

Several third-party applications (such as RenderDoc and Microsoft PIX) integrate GPUPerfAPI as part of their profiling feature set. These applications may choose to expose only a subset of the counters supported by GPUPerfAPI, especially when some counters do not fit the design goals of the application. In particular, counters that report a percentage are known not to be exposed. This is due to the way these tools collect and report aggregate counter values for groups of draw calls. For instance, if a set of draw calls is grouped together by a User Marker, a tool may report counter values for the User Marker by simply summing the values of the individual draw calls. While this is valid for many counters, it does not work for percentage-based counters, and even a simple average of the percent values may not accurately reflect the actual performance, because each draw call's percentage is computed over a different base. For most percentage-based counters, GPUPerfAPI also exposes counters representing the components used to calculate the percentage. The cache hit counters are one example: they are exposed both as a cache hit percentage and as individual counters representing the number of cache requests, the number of hits, and the number of misses. Refer to the Usage column of the tables below to see which counters will not be exposed by these applications.
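The aggregation pitfall described above is easy to see with numbers. In this Python sketch, the per-draw L2 cache counter values are made up for illustration: one draw hits 90 of 100 requests, another hits 10 of 1,000. Summing or averaging the two percentages gives a misleading result, while summing the component counts first and then computing the percentage gives the hit rate for the whole group.

```python
def hit_percentage(hits, requests):
    """Hit rate in percent; 0 when there were no requests."""
    return 100.0 * hits / requests if requests else 0.0


# Hypothetical L2 cache counters for two draw calls under one User Marker.
draws = [
    {"L2CacheHitCount": 90, "L2CacheRequestCount": 100},
    {"L2CacheHitCount": 10, "L2CacheRequestCount": 1000},
]

per_draw = [hit_percentage(d["L2CacheHitCount"], d["L2CacheRequestCount"]) for d in draws]
summed_pct = sum(per_draw)                 # 91.0: not a meaningful hit rate
averaged_pct = summed_pct / len(per_draw)  # 45.5: ignores request volume

# Aggregate the components first, then derive the percentage.
total_hits = sum(d["L2CacheHitCount"] for d in draws)
total_requests = sum(d["L2CacheRequestCount"] for d in draws)
group_pct = hit_percentage(total_hits, total_requests)  # about 9.09

print(summed_pct, averaged_pct, round(group_pct, 2))
```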

Counters Exposed for Graphics Performance Analysis

The following tables show the set of counters exposed for analysis of GPU Graphics workloads, as well as the family of GPUs and APUs on which each counter is available:

RDNA3 Counters

Timing Group

Counter Name	Usage	Brief Description
GPUTime	Nanoseconds	Time this API command took to execute on the GPU in nanoseconds from the time the previous command reached the bottom of the pipeline (BOP) to the time this command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel.
ExecutionDuration	Nanoseconds	GPU command execution duration in nanoseconds, from the time the command enters the top of the pipeline (TOP) to the time the command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel.
ExecutionStart	Nanoseconds	GPU command execution start time in nanoseconds. This is the time the command enters the top of the pipeline (TOP).
ExecutionEnd	Nanoseconds	GPU command execution end time in nanoseconds. This is the time the command reaches the bottom of the pipeline (BOP).
GPUBusy	Percentage	The percentage of time the GPU command processor was busy.
GPUBusyCycles	Cycles	Number of GPU cycles that the GPU command processor was busy.
TessellatorBusy	Percentage	The percentage of time the tessellation engine is busy.
TessellatorBusyCycles	Cycles	Number of GPU cycles that the tessellation engine is busy.
VsGsBusy	Percentage	The percentage of time the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline.
VsGsBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline.
VsGsTime	Nanoseconds	Time VS or GS are busy in nanoseconds in a VS-[GS-]PS pipeline.
PreTessellationBusy	Percentage	The percentage of time the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation.
PreTessellationBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation.
PreTessellationTime	Nanoseconds	Time VS and HS are busy in nanoseconds in a pipeline that uses tessellation.
PostTessellationBusy	Percentage	The percentage of time the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation.
PostTessellationBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation.
PostTessellationTime	Nanoseconds	Time DS or GS are busy in nanoseconds in a pipeline that uses tessellation.
PSBusy	Percentage	The percentage of time the ShaderUnit has pixel shader work to do.
PSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has pixel shader work to do.
PSTime	Nanoseconds	Time pixel shaders are busy in nanoseconds.
CSBusy	Percentage	The percentage of time the ShaderUnit has compute shader work to do.
CSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has compute shader work to do.
CSTime	Nanoseconds	Time compute shaders are busy in nanoseconds.
PrimitiveAssemblyBusy	Percentage	The percentage of GPUTime that primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate a pixel shader or output buffer bottleneck.
PrimitiveAssemblyBusyCycles	Cycles	Number of GPU cycles primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate a pixel shader or output buffer bottleneck.
TexUnitBusy	Percentage	The percentage of GPUTime the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account.
TexUnitBusyCycles	Cycles	Number of GPU cycles the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account.
DepthStencilTestBusy	Percentage	Percentage of time GPU spent performing depth and stencil tests relative to GPUBusy.
DepthStencilTestBusyCycles	Cycles	Number of GPU cycles spent performing depth and stencil tests.
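Based on the descriptions above, ExecutionDuration should correspond to ExecutionEnd minus ExecutionStart, while GPUTime is measured between successive bottom-of-pipe timestamps. The Python sketch below uses made-up timestamps for three overlapping draws to show why summing ExecutionDuration over-counts overlapped work while GPUTime-style deltas do not. Treating the first command's GPUTime as its own duration is an assumption made here because it has no preceding command.

```python
# Hypothetical (ExecutionStart, ExecutionEnd) pairs in nanoseconds for three
# consecutive draws whose execution overlaps on the GPU.
timestamps = [(0, 400), (100, 600), (550, 900)]

# ExecutionDuration: top of pipe to bottom of pipe for each command.
durations = [end - start for start, end in timestamps]

# GPUTime-style deltas: previous command's bottom of pipe to this one's.
# Assumption: the first command, having no predecessor, uses its own start.
gpu_times = [timestamps[0][1] - timestamps[0][0]]
for (_, prev_end), (_, end) in zip(timestamps, timestamps[1:]):
    gpu_times.append(end - prev_end)

print(durations, sum(durations))  # [400, 500, 350] 1250
print(gpu_times, sum(gpu_times))  # [400, 200, 300] 900
```

The GPUTime-style deltas add up to the 900 ns wall-clock span of the three draws, while the durations add up to 1250 ns because the overlapping portions are counted more than once.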

VertexGeometry Group

Counter Name	Usage	Brief Description
VsGsVerticesIn	Items	The number of unique vertices processed by the VS and GS.
VsGsPrimsIn	Items	The number of primitives passed into the GS.

PreTessellation Group

Counter Name	Usage	Brief Description
PreTessVerticesIn	Items	The number of unique vertices processed by the VS and HS when using tessellation.

PostTessellation Group

Counter Name	Usage	Brief Description
PostTessPrimsOut	Items	The number of primitives output by the DS and GS when using tessellation.

PrimitiveAssembly Group

Counter Name	Usage	Brief Description
PrimitivesIn	Items	The number of primitives received by the hardware. This includes primitives generated by tessellation.
CulledPrims	Items	The number of culled primitives. Typical reasons include scissor, the primitive having zero area, and back or front face culling.
ClippedPrims	Items	The number of primitives that required one or more clipping operations due to intersecting the view volume or user clip planes.
PAStalledOnRasterizer	Percentage	Percentage of GPUTime that primitive assembly waits for rasterization to be ready to accept data. This roughly indicates for what percentage of time the pipeline is bottlenecked by pixel operations.
PAStalledOnRasterizerCycles	Cycles	Number of GPU cycles the primitive assembly waits for rasterization to be ready to accept data. Indicates the number of GPU cycles the pipeline is bottlenecked by pixel operations.
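The count-based counters above can be combined into simple ratios. This Python sketch, using made-up values, computes the share of incoming primitives that were culled or clipped. The function name is hypothetical, and reading a high culled share as a hint that geometry is being submitted but never rasterized is an interpretation, not something the counters state directly.

```python
def primitive_ratios(counters):
    """Percent of PrimitivesIn that were culled or clipped."""
    prims_in = counters["PrimitivesIn"]
    if prims_in == 0:
        return {"culled_pct": 0.0, "clipped_pct": 0.0}
    return {
        "culled_pct": 100.0 * counters["CulledPrims"] / prims_in,
        "clipped_pct": 100.0 * counters["ClippedPrims"] / prims_in,
    }


sample = {"PrimitivesIn": 20_000, "CulledPrims": 12_000, "ClippedPrims": 500}
print(primitive_ratios(sample))  # {'culled_pct': 60.0, 'clipped_pct': 2.5}
```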

PixelShader Group

Counter Name	Usage	Brief Description
PSPixelsOut	Items	Pixels exported from the shader to color buffers. Does not include killed or alpha-tested pixels. If there are multiple render targets, each render target receives one export, so a single pixel written to two render targets counts as 2.
PSExportStalls	Percentage	Pixel shader output stalls, as a percentage of GPUBusy. Should be zero when the workload is limited by the pixel shader or by an earlier pipeline stage; if not zero, indicates a bottleneck in late Z testing or in the color buffer.
PSExportStallsCycles	Cycles	Number of GPU cycles the pixel shader output stalls. Should be zero when the workload is limited by the pixel shader or by an earlier pipeline stage; if not zero, indicates a bottleneck in late Z testing or in the color buffer.

ComputeShader Group

Counter Name	Usage	Brief Description
CSThreadGroups	Items	Total number of thread groups.
CSWavefronts	Items	The total number of wavefronts used for the CS.
CSThreads	Items	The number of CS threads processed by the hardware.
CSThreadGroupSize	Items	The number of CS threads within each thread group.
CSMemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
CSMemUnitBusyCycles	Cycles	Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account.
CSMemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
CSMemUnitStalledCycles	Cycles	Number of GPU cycles the memory unit is stalled. Try reducing the number or size of fetches and writes if possible.
CSWriteUnitStalled	Percentage	The percentage of GPUTime the write unit is stalled.
CSWriteUnitStalledCycles	Cycles	Number of GPU cycles the write unit is stalled.
CSALUStalledByLDS	Percentage	The percentage of GPUTime ALU units are stalled because the LDS input queue is full or the output queue is not ready. If there are LDS bank conflicts, reduce them; otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad).
CSALUStalledByLDSCycles	Cycles	Number of GPU cycles the ALU units are stalled because the LDS input queue is full or the output queue is not ready. If there are LDS bank conflicts, reduce them; otherwise, try reducing the number of LDS accesses if possible.
CSLDSBankConflict	Percentage	The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).
CSLDSBankConflictCycles	Cycles	Number of GPU cycles the LDS is stalled by bank conflicts. Value range: 0 (optimal) to GPUBusyCycles (bad).
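For a dispatch, the thread counts above are related by simple arithmetic. This Python sketch predicts CSThreads and CSWavefronts from the number of thread groups and the thread-group size, along with the fraction of wavefront lanes that carry a thread. The wave size is an assumption passed in by the caller: RDNA hardware can run 32-wide or 64-wide wavefronts, and this section does not state which one a given dispatch used. Comparing the measured CSWavefronts against both predictions is one way to tell. The function name is hypothetical.

```python
import math


def expected_cs_counts(thread_groups, group_size, wave_size):
    """Predict CSThreads and CSWavefronts for a dispatch.

    group_size is the total threads per group (x * y * z), i.e. CSThreadGroupSize.
    wave_size (32 or 64) is an assumption supplied by the caller.
    """
    waves_per_group = math.ceil(group_size / wave_size)
    threads = thread_groups * group_size
    wavefronts = thread_groups * waves_per_group
    # Fraction of wavefront lanes that actually carry a thread.
    lane_utilization = threads / (wavefronts * wave_size)
    return threads, wavefronts, lane_utilization


print(expected_cs_counts(1024, 256, 32))  # (262144, 8192, 1.0)
print(expected_cs_counts(10, 100, 64))    # (1000, 20, 0.78125)
```

A thread-group size that is not a multiple of the wave size, as in the second call, leaves some lanes of the last wavefront in each group idle.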

TextureUnit Group

Counter Name	Usage	Brief Description
TexTriFilteringPct	Percentage	Percentage of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified).
TexTriFilteringCount	Items	Count of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified).
NoTexTriFilteringCount	Items	Count of pixels that did not receive trilinear filtering.
TexVolFilteringPct	Percentage	Percentage of pixels that received volume filtering.
TexVolFilteringCount	Items	Count of pixels that received volume filtering.
NoTexVolFilteringCount	Items	Count of pixels that did not receive volume filtering.
TexAveAnisotropy	Items	The average degree of anisotropy applied, a number between 1 and 16. The anisotropic filtering algorithm only applies samples where they are required (e.g. there will be no extra anisotropic samples if the view vector is perpendicular to the surface), so this can be much lower than the requested anisotropy.
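Each filtering percentage above appears to pair with a "received" count and a "did not receive" count (for example TexTriFilteringCount and NoTexTriFilteringCount). Assuming the percentage is the received count over the sum of both counts (an assumption; the descriptions do not give the formula), this Python sketch recomputes it from the counts, which is also how an aggregate value could be rebuilt for a group of draws. The function name and values are hypothetical.

```python
def filtering_pct(received, not_received):
    """Assumed formula: percent of pixels that received the filtering mode."""
    total = received + not_received
    return 100.0 * received / total if total else 0.0


# Hypothetical counts summed over several draws.
tri_pct = filtering_pct(received=300_000, not_received=700_000)
vol_pct = filtering_pct(received=0, not_received=1_000_000)
print(tri_pct, vol_pct)  # 30.0 0.0
```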

DepthAndStencil Group

Counter Name	Usage	Brief Description
HiZTilesAccepted	Percentage	Percentage of tiles that are accepted by HiZ and will be rendered to the depth or color buffers.
HiZTilesAcceptedCount	Items	Count of tiles that are accepted by HiZ and will be rendered to the depth or color buffers.
HiZTilesRejectedCount	Items	Count of tiles not accepted by HiZ.
PreZTilesDetailCulled	Percentage	Percentage of tiles rejected because the associated primitive had no contributing area.
PreZTilesDetailCulledCount	Items	Count of tiles rejected because the associated primitive had no contributing area.
PreZTilesDetailSurvivingCount	Items	Count of tiles surviving because the associated primitive had contributing area.
HiZQuadsCulled	Percentage	Percentage of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized.
HiZQuadsCulledCount	Items	Count of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized.
HiZQuadsAcceptedCount	Items	Count of quads that did continue on in the pipeline after HiZ.
PreZQuadsCulled	Percentage	Percentage of quads rejected based on the detailZ and earlyZ tests.
PreZQuadsCulledCount	Items	Count of quads rejected based on the detailZ and earlyZ tests.
PreZQuadsSurvivingCount	Items	Count of quads surviving detailZ and earlyZ tests.
PostZQuads	Percentage	Percentage of quads for which the pixel shader will run and may be postZ tested.
PostZQuadCount	Items	Count of quads for which the pixel shader will run and may be postZ tested.
PreZSamplesPassing	Items	Number of samples tested for Z before shading and passed.
PreZSamplesFailingS	Items	Number of samples tested for Z before shading and failed stencil test.
PreZSamplesFailingZ	Items	Number of samples tested for Z before shading and failed Z test.
PostZSamplesPassing	Items	Number of samples tested for Z after shading and passed.
PostZSamplesFailingS	Items	Number of samples tested for Z after shading and failed stencil test.
PostZSamplesFailingZ	Items	Number of samples tested for Z after shading and failed Z test.
ZUnitStalled	Percentage	The percentage of GPUTime the depth buffer spends waiting for the color buffer to be ready to accept data. High figures here indicate a bottleneck in color buffer operations.
ZUnitStalledCycles	Cycles	Number of GPU cycles the depth buffer spends waiting for the color buffer to be ready to accept data. Larger numbers indicate a bottleneck in color buffer operations.
DBMemRead	Bytes	Number of bytes read from the depth buffer.
DBMemWritten	Bytes	Number of bytes written to the depth buffer.
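The pre- and post-shading sample counters above can be combined to see where depth and stencil rejection happens. Samples that fail only after shading have already paid for pixel shader work, so a large post-Z rejection share may point at work that earlier Z testing could have avoided; treat that reading as a heuristic. This Python sketch uses made-up counter values, and the function name is hypothetical.

```python
def z_rejection(c):
    """Percent of samples rejected by depth or stencil tests, before and after shading."""
    pre_fail = c["PreZSamplesFailingS"] + c["PreZSamplesFailingZ"]
    post_fail = c["PostZSamplesFailingS"] + c["PostZSamplesFailingZ"]
    pre_total = pre_fail + c["PreZSamplesPassing"]
    post_total = post_fail + c["PostZSamplesPassing"]
    return {
        "pre_z_reject_pct": 100.0 * pre_fail / pre_total if pre_total else 0.0,
        "post_z_reject_pct": 100.0 * post_fail / post_total if post_total else 0.0,
    }


sample = {
    "PreZSamplesPassing": 6_000, "PreZSamplesFailingS": 0, "PreZSamplesFailingZ": 4_000,
    "PostZSamplesPassing": 900, "PostZSamplesFailingS": 20, "PostZSamplesFailingZ": 80,
}
print(z_rejection(sample))  # {'pre_z_reject_pct': 40.0, 'post_z_reject_pct': 10.0}
```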

ColorBuffer Group

Counter Name	Usage	Brief Description
CBMemRead	Bytes	Number of bytes read from the color buffer.
CBColorAndMaskRead	Bytes	Total number of bytes read from the color and mask buffers.
CBMemWritten	Bytes	Number of bytes written to the color buffer.
CBColorAndMaskWritten	Bytes	Total number of bytes written to the color and mask buffers.

MemoryCache Group

Counter Name	Usage	Brief Description
L0CacheHit	Percentage	The percentage of read requests that hit the data in the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal).
L0CacheRequestCount	Items	The number of read requests made to the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size.
L0CacheHitCount	Items	The number of read requests which result in a cache hit from the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size.
L0CacheMissCount	Items	The number of read requests which result in a cache miss from the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size.
ScalarCacheHit	Percentage	The percentage of read requests made from executing shader code that hit the data in the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. Value range: 0% (no hit) to 100% (optimal).
ScalarCacheRequestCount	Items	The number of read requests made from executing shader code to the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size.
ScalarCacheHitCount	Items	The number of read requests made from executing shader code which result in a cache hit from the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size.
ScalarCacheMissCount	Items	The number of read requests made from executing shader code which result in a cache miss from the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size.
InstCacheHit	Percentage	The percentage of read requests made that hit the data in the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. Value range: 0% (no hit) to 100% (optimal).
InstCacheRequestCount	Items	The number of read requests made to the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size.
InstCacheHitCount	Items	The number of read requests which result in a cache hit from the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size.
InstCacheMissCount	Items	The number of read requests which result in a cache miss from the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size.
L1CacheHit	Percentage	The percentage of read or write requests that hit the data in the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal).
L1CacheRequestCount	Items	The number of read or write requests made to the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size.
L1CacheHitCount	Items	The number of read or write requests which result in a cache hit from the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size.
L1CacheMissCount	Items	The number of read or write requests which result in a cache miss from the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size.
L2CacheHit	Percentage	The percentage of read or write requests that hit the data in the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal).
L2CacheMiss	Percentage	The percentage of read or write requests that miss the data in the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. Value range: 0% (optimal) to 100% (all miss).
L2CacheRequestCount	Items	The number of read or write requests made to the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size.
L2CacheHitCount	Items	The number of read or write requests which result in a cache hit from the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size.
L2CacheMissCount	Items	The number of read or write requests which result in a cache miss from the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size.
L0TagConflictReadStalledCycles	Items	The number of cycles read operations from the L0 cache are stalled due to tag conflicts.
L0TagConflictWriteStalledCycles	Items	The number of cycles write operations to the L0 cache are stalled due to tag conflicts.
L0TagConflictAtomicStalledCycles	Items	The number of cycles atomic operations on the L0 cache are stalled due to tag conflicts.
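Using the request sizes stated above (128 bytes for the L0, L1, and L2 caches; 64 bytes for the Scalar and Instruction caches), the count counters can be turned into a hit rate and an approximate volume of requested and missed data. This Python sketch uses made-up values; treating request count times request size as a data volume is an approximation, since a request may not use all of its bytes. The function name is hypothetical.

```python
# Request sizes in bytes, as given in the counter descriptions above.
REQUEST_BYTES = {
    "L0Cache": 128, "L1Cache": 128, "L2Cache": 128,
    "ScalarCache": 64, "InstCache": 64,
}


def cache_summary(prefix, counters):
    """Return (hit percent, approx. bytes requested, approx. bytes missed)."""
    requests = counters[f"{prefix}RequestCount"]
    hits = counters[f"{prefix}HitCount"]
    misses = counters[f"{prefix}MissCount"]
    size = REQUEST_BYTES[prefix]
    hit_pct = 100.0 * hits / requests if requests else 0.0
    return hit_pct, requests * size, misses * size


sample = {"L2CacheRequestCount": 5_000, "L2CacheHitCount": 4_000, "L2CacheMissCount": 1_000}
print(cache_summary("L2Cache", sample))  # (80.0, 640000, 128000)
```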

GlobalMemory Group

Counter Name	Usage	Brief Description
FetchSize	Bytes	The total bytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WriteSize	Bytes	The total bytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
MemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitBusyCycles	Cycles	Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalledCycles). This is measured with all extra fetches and writes and any cache or memory effects taken into account.
MemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
MemUnitStalledCycles	Cycles	Number of GPU cycles the memory unit is stalled.
WriteUnitStalled	Percentage	The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).
WriteUnitStalledCycles	Cycles	Number of GPU cycles the Write unit is stalled.
LocalVidMemBytes	Bytes	Number of bytes read from or written to local video memory.
PcieBytes	Bytes	Number of bytes sent and received over the PCIe bus.
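FetchSize and WriteSize can be divided by GPUTime to estimate the average video-memory bandwidth of a call. Because GPUTime is in nanoseconds, one byte per nanosecond equals one gigabyte per second (10^9 bytes per second). This is a rough Python sketch with made-up values and a hypothetical function name: it averages over the whole call and does not account for other work overlapping it.

```python
def average_bandwidth_gbps(fetch_bytes, write_bytes, gpu_time_ns):
    """Average read+write bandwidth in GB/s (1 byte/ns == 1 GB/s)."""
    if gpu_time_ns <= 0:
        return 0.0
    return (fetch_bytes + write_bytes) / gpu_time_ns


# Hypothetical values: 48 MB fetched and 16 MB written over 0.2 ms of GPUTime.
print(average_bandwidth_gbps(48_000_000, 16_000_000, 200_000))  # 320.0
```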

RayTracing Group

Counter Name	Usage	Brief Description
RayTriTests	Items	The number of ray triangle intersection tests.
RayBoxTests	Items	The number of ray box intersection tests.
TotalRayTests	Items	Total number of ray intersection tests, including both box and triangle intersection tests.
RayTestsPerWave	Items	The number of ray intersection tests per wave.
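Because TotalRayTests includes both box and triangle tests, it should roughly equal the sum of RayBoxTests and RayTriTests, and the split between the two shows how the intersection work divides between bounding-box tests and triangle tests. This Python sketch uses made-up values, and the function name is hypothetical.

```python
def ray_test_split(ray_tri_tests, ray_box_tests):
    """Return (total tests, percent of tests that were box tests)."""
    total = ray_tri_tests + ray_box_tests
    box_pct = 100.0 * ray_box_tests / total if total else 0.0
    return total, box_pct


print(ray_test_split(ray_tri_tests=2_000_000, ray_box_tests=6_000_000))  # (8000000, 75.0)
```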

RDNA2 Counters

Timing Group

Counter Name	Usage	Brief Description
GPUTime	Nanoseconds	Time this API command took to execute on the GPU in nanoseconds from the time the previous command reached the bottom of the pipeline (BOP) to the time this command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel.
ExecutionDuration	Nanoseconds	GPU command execution duration in nanoseconds, from the time the command enters the top of the pipeline (TOP) to the time the command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel.
ExecutionStart	Nanoseconds	GPU command execution start time in nanoseconds. This is the time the command enters the top of the pipeline (TOP).
ExecutionEnd	Nanoseconds	GPU command execution end time in nanoseconds. This is the time the command reaches the bottom of the pipeline (BOP).
GPUBusy	Percentage	The percentage of time the GPU command processor was busy.
GPUBusyCycles	Cycles	Number of GPU cycles that the GPU command processor was busy.
TessellatorBusy	Percentage	The percentage of time the tessellation engine is busy.
TessellatorBusyCycles	Cycles	Number of GPU cycles that the tessellation engine is busy.
VsGsBusy	Percentage	The percentage of time the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline.
VsGsBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline.
VsGsTime	Nanoseconds	Time VS or GS are busy in nanoseconds in a VS-[GS-]PS pipeline.
PreTessellationBusy	Percentage	The percentage of time the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation.
PreTessellationBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation.
PreTessellationTime	Nanoseconds	Time VS and HS are busy in nanoseconds in a pipeline that uses tessellation.
PostTessellationBusy	Percentage	The percentage of time the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation.
PostTessellationBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation.
PostTessellationTime	Nanoseconds	Time DS or GS are busy in nanoseconds in a pipeline that uses tessellation.
PSBusy	Percentage	The percentage of time the ShaderUnit has pixel shader work to do.
PSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has pixel shader work to do.
PSTime	Nanoseconds	Time pixel shaders are busy in nanoseconds.
CSBusy	Percentage	The percentage of time the ShaderUnit has compute shader work to do.
CSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has compute shader work to do.
CSTime	Nanoseconds	Time compute shaders are busy in nanoseconds.
PrimitiveAssemblyBusy	Percentage	The percentage of GPUTime that primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate a pixel shader or output buffer bottleneck.
PrimitiveAssemblyBusyCycles	Cycles	Number of GPU cycles primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate a pixel shader or output buffer bottleneck.
TexUnitBusy	Percentage	The percentage of GPUTime the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account.
TexUnitBusyCycles	Cycles	Number of GPU cycles the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account.
DepthStencilTestBusy	Percentage	Percentage of time GPU spent performing depth and stencil tests relative to GPUBusy.
DepthStencilTestBusyCycles	Cycles	Number of GPU cycles spent performing depth and stencil tests.

VertexGeometry Group

Counter Name	Usage	Brief Description
VsGsVerticesIn	Items	The number of unique vertices processed by the VS and GS.
VsGsPrimsIn	Items	The number of primitives passed into the VS and GS.
GSVerticesOut	Items	The number of vertices output by the GS.

PreTessellation Group

Counter Name	Usage	Brief Description
PreTessVerticesIn	Items	The number of vertices processed by the VS and HS when using tessellation.

PostTessellation Group

Counter Name	Usage	Brief Description
PostTessPrimsOut	Items	The number of primitives output by the DS and GS when using tessellation.

PrimitiveAssembly Group

Counter Name	Usage	Brief Description
PrimitivesIn	Items	The number of primitives received by the hardware. This includes primitives generated by tessellation.
CulledPrims	Items	The number of culled primitives. Typical reasons include scissor, the primitive having zero area, and back or front face culling.
ClippedPrims	Items	The number of primitives that required one or more clipping operations due to intersecting the view volume or user clip planes.
PAStalledOnRasterizer	Percentage	Percentage of GPUTime that primitive assembly waits for rasterization to be ready to accept data. This roughly indicates for what percentage of time the pipeline is bottlenecked by pixel operations.
PAStalledOnRasterizerCycles	Cycles	Number of GPU cycles the primitive assembly waits for rasterization to be ready to accept data. Indicates the number of GPU cycles the pipeline is bottlenecked by pixel operations.

PixelShader Group

Counter Name	Usage	Brief Description
PSPixelsOut	Items	Pixels exported from the shader to color buffers. Does not include killed or alpha-tested pixels. If there are multiple render targets, each render target receives one export, so a single pixel written to two render targets counts as 2.
PSExportStalls	Percentage	Pixel shader output stalls, as a percentage of GPUBusy. Should be zero when the workload is limited by the pixel shader or by an earlier pipeline stage; if not zero, indicates a bottleneck in late Z testing or in the color buffer.
PSExportStallsCycles	Cycles	Number of GPU cycles the pixel shader output stalls. Should be zero when the workload is limited by the pixel shader or by an earlier pipeline stage; if not zero, indicates a bottleneck in late Z testing or in the color buffer.

ComputeShader Group

Counter Name	Usage	Brief Description
CSThreadGroups	Items	Total number of thread groups.
CSWavefronts	Items	The total number of wavefronts used for the CS.
CSThreads	Items	The number of CS threads processed by the hardware.
CSThreadGroupSize	Items	The number of CS threads within each thread group.
CSMemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
CSMemUnitBusyCycles	Cycles	Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account.
CSMemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
CSMemUnitStalledCycles	Cycles	Number of GPU cycles the memory unit is stalled. Try reducing the number or size of fetches and writes if possible.
CSWriteUnitStalled	Percentage	The percentage of GPUTime the write unit is stalled.
CSWriteUnitStalledCycles	Cycles	Number of GPU cycles the write unit is stalled.
CSGDSInsts	Items	The average number of GDS read or GDS write instructions executed per work item (affected by flow control).
CSLDSInsts	Items	The average number of LDS read or LDS write instructions executed per work item (affected by flow control).
CSALUStalledByLDS	Percentage	The percentage of GPUTime ALU units are stalled because the LDS input queue is full or the output queue is not ready. If there are LDS bank conflicts, reduce them; otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad).
CSALUStalledByLDSCycles	Cycles	Number of GPU cycles the ALU units are stalled because the LDS input queue is full or the output queue is not ready. If there are LDS bank conflicts, reduce them; otherwise, try reducing the number of LDS accesses if possible.
CSLDSBankConflict	Percentage	The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).
CSLDSBankConflictCycles	Cycles	Number of GPU cycles the LDS is stalled by bank conflicts. Value range: 0 (optimal) to GPUBusyCycles (bad).
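To make the LDS advice above concrete, the sketch below uses a simplified bank model to show why strided access produces bank conflicts and how padding, a common mitigation, removes them. The bank count (32), the 4-byte word size, and the function name are assumptions for illustration, not a description of any specific GPU's LDS.

```python
from collections import Counter
from typing import Iterable


def bank_conflict_degree(word_indices: Iterable[int], num_banks: int = 32) -> int:
    """Worst-case number of distinct words that map to one bank (1 = conflict-free).

    Simplified model (an assumption): LDS is split into num_banks banks of
    4-byte words, word i lives in bank i % num_banks, and lanes reading the
    same word are served by a broadcast, so duplicate words are ignored.
    """
    per_bank = Counter(w % num_banks for w in set(word_indices))
    return max(per_bank.values(), default=0)


lanes = range(32)
# Reading one column of a 32x32 float tile: a 32-word stride hits a single bank.
print(bank_conflict_degree([lane * 32 for lane in lanes]))  # 32
# Padding each row to 33 words spreads the same column across all 32 banks.
print(bank_conflict_degree([lane * 33 for lane in lanes]))  # 1
```

In this model a degree of N means the access is serialized into roughly N passes, which is the kind of stall CSLDSBankConflict reports.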

TextureUnit Group

Counter Name	Usage	Brief Description
TexTriFilteringPct	Percentage	Percentage of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified).
TexTriFilteringCount	Items	Count of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified).
NoTexTriFilteringCount	Items	Count of pixels that did not receive trilinear filtering.
TexVolFilteringPct	Percentage	Percentage of pixels that received volume filtering.
TexVolFilteringCount	Items	Count of pixels that received volume filtering.
NoTexVolFilteringCount	Items	Count of pixels that did not receive volume filtering.
TexAveAnisotropy	Items	The average degree of anisotropy applied, a number between 1 and 16. The anisotropic filtering algorithm only applies samples where they are required (e.g. there will be no extra anisotropic samples if the view vector is perpendicular to the surface), so this can be much lower than the requested anisotropy.
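TexTriFilteringPct and TexVolFilteringPct appear to be the filtered count divided by the sum of the filtered and unfiltered counts. The sketch below shows that relationship under that assumption; the function name is hypothetical and GPUPerfAPI's own derivation may differ.

```python
def filtered_pct(filtered: int, not_filtered: int) -> float:
    """Share of pixels (0-100) that received a given filtering mode.

    Assumes Pct = filtered / (filtered + not_filtered) * 100; this is an
    illustration, not GPUPerfAPI's internal formula.
    """
    total = filtered + not_filtered
    if total == 0:
        return 0.0  # no textured pixels counted; avoid dividing by zero
    return 100.0 * filtered / total


# Hypothetical values for TexTriFilteringCount and NoTexTriFilteringCount.
tri_pct = filtered_pct(filtered=750, not_filtered=250)
print(f"TexTriFilteringPct ~ {tri_pct:.1f}%")  # prints: TexTriFilteringPct ~ 75.0%
```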

DepthAndStencil Group

Counter Name	Usage	Brief Description
HiZTilesAccepted	Percentage	Percentage of tiles that are accepted by HiZ and will be rendered to the depth or color buffers.
HiZTilesAcceptedCount	Items	Count of tiles that are accepted by HiZ and will be rendered to the depth or color buffers.
HiZTilesRejectedCount	Items	Count of tiles not accepted by HiZ.
PreZTilesDetailCulled	Percentage	Percentage of tiles rejected because the associated primitive had no contributing area.
PreZTilesDetailCulledCount	Items	Count of tiles rejected because the associated primitive had no contributing area.
PreZTilesDetailSurvivingCount	Items	Count of tiles surviving because the associated primitive had contributing area.
HiZQuadsCulled	Percentage	Percentage of quads that did not need to continue down the pipeline after HiZ. They may be written directly to the depth buffer or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized.
HiZQuadsCulledCount	Items	Count of quads that did not need to continue down the pipeline after HiZ. They may be written directly to the depth buffer or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized.
HiZQuadsAcceptedCount	Items	Count of quads that continued down the pipeline after HiZ.
PreZQuadsCulled	Percentage	Percentage of quads rejected based on the detailZ and earlyZ tests.
PreZQuadsCulledCount	Items	Count of quads rejected based on the detailZ and earlyZ tests.
PreZQuadsSurvivingCount	Items	Count of quads surviving detailZ and earlyZ tests.
PostZQuads	Percentage	Percentage of quads for which the pixel shader will run and may be postZ tested.
PostZQuadCount	Items	Count of quads for which the pixel shader will run and may be postZ tested.
PreZSamplesPassing	Items	Number of samples tested for Z before shading and passed.
PreZSamplesFailingS	Items	Number of samples tested for Z before shading and failed stencil test.
PreZSamplesFailingZ	Items	Number of samples tested for Z before shading and failed Z test.
PostZSamplesPassing	Items	Number of samples tested for Z after shading and passed.
PostZSamplesFailingS	Items	Number of samples tested for Z after shading and failed stencil test.
PostZSamplesFailingZ	Items	Number of samples tested for Z after shading and failed Z test.
ZUnitStalled	Percentage	The percentage of GPUTime the depth buffer spends waiting for the color buffer to be ready to accept data. High figures here indicate a bottleneck in color buffer operations.
ZUnitStalledCycles	Cycles	Number of GPU cycles the depth buffer spends waiting for the color buffer to be ready to accept data. Larger numbers indicate a bottleneck in color buffer operations.
DBMemRead	Bytes	Number of bytes read from the depth buffer.
DBMemWritten	Bytes	Number of bytes written to the depth buffer.
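One way to read the sample counters is to ask how much of the depth and stencil rejection happens before the pixel shader runs. The sketch below assumes the PreZ and PostZ sample counts are disjoint; the container class and function are hypothetical helpers, not part of GPUPerfAPI.

```python
from dataclasses import dataclass


@dataclass
class ZSampleCounts:
    """Raw sample counters from the DepthAndStencil group (hypothetical container)."""
    pre_passing: int     # PreZSamplesPassing
    pre_failing_s: int   # PreZSamplesFailingS
    pre_failing_z: int   # PreZSamplesFailingZ
    post_passing: int    # PostZSamplesPassing
    post_failing_s: int  # PostZSamplesFailingS
    post_failing_z: int  # PostZSamplesFailingZ


def early_reject_pct(c: ZSampleCounts) -> float:
    """Percentage of all failing samples that were rejected before shading.

    A high value means occluded work is discarded before the pixel shader runs.
    """
    pre_fail = c.pre_failing_s + c.pre_failing_z
    post_fail = c.post_failing_s + c.post_failing_z
    total_fail = pre_fail + post_fail
    return 0.0 if total_fail == 0 else 100.0 * pre_fail / total_fail


counts = ZSampleCounts(pre_passing=6000, pre_failing_s=0, pre_failing_z=4000,
                       post_passing=500, post_failing_s=0, post_failing_z=1000)
print(f"{early_reject_pct(counts):.1f}% of failing samples rejected before shading")
```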

ColorBuffer Group

Counter Name	Usage	Brief Description
CBMemRead	Bytes	Number of bytes read from the color buffer.
CBColorAndMaskRead	Bytes	Total number of bytes read from the color and mask buffers.
CBMemWritten	Bytes	Number of bytes written to the color buffer.
CBColorAndMaskWritten	Bytes	Total number of bytes written to the color and mask buffers.
CBSlowPixelPct	Percentage	Percentage of pixels written to the color buffer using a half-rate or quarter-rate format.
CBSlowPixelCount	Items	Number of pixels written to the color buffer using a half-rate or quarter-rate format.

MemoryCache Group

Counter Name	Usage	Brief Description
L0CacheHit	Percentage	The percentage of read requests that hit the data in the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal).
L0CacheRequestCount	Items	The number of read requests made to the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size.
L0CacheHitCount	Items	The number of read requests which result in a cache hit from the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size.
L0CacheMissCount	Items	The number of read requests which result in a cache miss from the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size.
ScalarCacheHit	Percentage	The percentage of read requests made from executing shader code that hit the data in the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. Value range: 0% (no hit) to 100% (optimal).
ScalarCacheRequestCount	Items	The number of read requests made from executing shader code to the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size.
ScalarCacheHitCount	Items	The number of read requests made from executing shader code which result in a cache hit from the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size.
ScalarCacheMissCount	Items	The number of read requests made from executing shader code which result in a cache miss from the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size.
InstCacheHit	Percentage	The percentage of read requests made that hit the data in the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. Value range: 0% (no hit) to 100% (optimal).
InstCacheRequestCount	Items	The number of read requests made to the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size.
InstCacheHitCount	Items	The number of read requests which result in a cache hit from the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size.
InstCacheMissCount	Items	The number of read requests which result in a cache miss from the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size.
L1CacheHit	Percentage	The percentage of read or write requests that hit the data in the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal).
L1CacheRequestCount	Items	The number of read or write requests made to the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size.
L1CacheHitCount	Items	The number of read or write requests which result in a cache hit from the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size.
L1CacheMissCount	Items	The number of read or write requests which result in a cache miss from the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size.
L2CacheHit	Percentage	The percentage of read or write requests that hit the data in the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal).
L2CacheMiss	Percentage	The percentage of read or write requests that miss the data in the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. Value range: 0% (optimal) to 100% (all miss).
L2CacheRequestCount	Items	The number of read or write requests made to the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size.
L2CacheHitCount	Items	The number of read or write requests which result in a cache hit from the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size.
L2CacheMissCount	Items	The number of read or write requests which result in a cache miss from the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size.
L0TagConflictReadStalledCycles	Items	The number of cycles read operations from the L0 cache are stalled due to tag conflicts.
L0TagConflictWriteStalledCycles	Items	The number of cycles write operations to the L0 cache are stalled due to tag conflicts.
L0TagConflictAtomicStalledCycles	Items	The number of cycles atomic operations on the L0 cache are stalled due to tag conflicts.
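The hit percentages follow from the matching hit and request counts, and the request sizes stated above give a rough traffic estimate. A minimal sketch, assuming every request transfers its full request size; the helper names are hypothetical.

```python
# Request sizes in bytes, as stated in the MemoryCache group descriptions.
REQUEST_BYTES = {"L0": 128, "Scalar": 64, "Inst": 64, "L1": 128, "L2": 128}


def cache_hit_pct(hit_count: int, request_count: int) -> float:
    """Hit rate (0-100) from a *HitCount and the matching *RequestCount."""
    return 0.0 if request_count == 0 else 100.0 * hit_count / request_count


def requested_bytes(cache: str, request_count: int) -> int:
    """Approximate bytes requested, assuming each request moves a full request."""
    return REQUEST_BYTES[cache] * request_count


# Hypothetical values for L1CacheHitCount and L1CacheRequestCount.
l1_requests, l1_hits = 20_000, 16_000
print(cache_hit_pct(l1_hits, l1_requests))  # 80.0
print(requested_bytes("L1", l1_requests))   # 2560000
```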

GlobalMemory Group

Counter Name	Usage	Brief Description
FetchSize	Bytes	The total bytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WriteSize	Bytes	The total bytes written to the video memory. This is measured with all extra writes and any cache or memory effects taken into account.
MemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitBusyCycles	Cycles	Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalledCycles). This is measured with all extra fetches and writes and any cache or memory effects taken into account.
MemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
MemUnitStalledCycles	Cycles	Number of GPU cycles the memory unit is stalled.
WriteUnitStalled	Percentage	The percentage of GPUTime the Write unit is stalled. Value range: 0% (optimal) to 100% (bad).
WriteUnitStalledCycles	Cycles	Number of GPU cycles the Write unit is stalled.
LocalVidMemBytes	Bytes	Number of bytes read from or written to local video memory.
PcieBytes	Bytes	Number of bytes sent and received over the PCIe bus.
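FetchSize and WriteSize can be combined with GPUTime from the Timing group to estimate the average memory throughput of a call. A sketch under the assumption that GPUTime is a suitable denominator; because GPUTime excludes time overlapped with other calls, treat the result as an approximation. The function name is hypothetical.

```python
def effective_bandwidth_gbps(fetch_bytes: int, write_bytes: int, gpu_time_ns: float) -> float:
    """Average video-memory throughput in GB/s (10**9 bytes per second).

    Bytes per nanosecond equals decimal gigabytes per second, so no extra
    scaling is needed. This is an average over GPUTime, not a peak rate.
    """
    if gpu_time_ns <= 0:
        raise ValueError("gpu_time_ns must be positive")
    return (fetch_bytes + write_bytes) / gpu_time_ns


# Hypothetical values: 48 MB fetched and 16 MB written over 0.25 ms of GPUTime.
print(effective_bandwidth_gbps(48_000_000, 16_000_000, 250_000))  # 256.0
```

Comparing this figure with the board's rated memory bandwidth gives a rough sense of whether a call is likely bandwidth-bound.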

RayTracing Group

Counter Name	Usage	Brief Description
RayTriTests	Items	The number of ray triangle intersection tests.
RayBoxTests	Items	The number of ray box intersection tests.
TotalRayTests	Items	Total number of ray intersection tests, including both box and triangle intersection tests.
RayTestsPerWave	Items	The number of ray intersection tests per wave.
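TotalRayTests is described as the sum of the box and triangle tests, and RayTestsPerWave normalizes that total by wave count. The sketch below shows both; the waves argument stands in for a wave count obtained elsewhere and is an assumption, as are the function names.

```python
def total_ray_tests(tri_tests: int, box_tests: int) -> int:
    """TotalRayTests as described: triangle tests plus box tests."""
    return tri_tests + box_tests


def ray_tests_per_wave(total_tests: int, waves: int) -> float:
    """Average intersection tests per wave; 'waves' is a hypothetical input."""
    return 0.0 if waves == 0 else total_tests / waves


total = total_ray_tests(tri_tests=40_000, box_tests=160_000)
print(total, ray_tests_per_wave(total, waves=1_000))  # 200000 200.0
```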

RDNA Counters

Timing Group

Counter Name	Usage	Brief Description
GPUTime	Nanoseconds	Time this API command took to execute on the GPU in nanoseconds from the time the previous command reached the bottom of the pipeline (BOP) to the time this command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel.
ExecutionDuration	Nanoseconds	GPU command execution duration in nanoseconds, from the time the command enters the top of the pipeline (TOP) to the time the command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel.
ExecutionStart	Nanoseconds	GPU command execution start time in nanoseconds. This is the time the command enters the top of the pipeline (TOP).
ExecutionEnd	Nanoseconds	GPU command execution end time in nanoseconds. This is the time the command reaches the bottom of the pipeline (BOP).
GPUBusy	Percentage	The percentage of time the GPU command processor was busy.
GPUBusyCycles	Cycles	Number of GPU cycles that the GPU command processor was busy.
TessellatorBusy	Percentage	The percentage of time the tessellation engine is busy.
TessellatorBusyCycles	Cycles	Number of GPU cycles that the tessellation engine is busy.
VsGsBusy	Percentage	The percentage of time the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline.
VsGsBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline.
VsGsTime	Nanoseconds	Time VS or GS are busy in nanoseconds in a VS-[GS-]PS pipeline.
PreTessellationBusy	Percentage	The percentage of time the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation.
PreTessellationBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation.
PreTessellationTime	Nanoseconds	Time VS and HS are busy in nanoseconds in a pipeline that uses tessellation.
PostTessellationBusy	Percentage	The percentage of time the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation.
PostTessellationBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation.
PostTessellationTime	Nanoseconds	Time DS or GS are busy in nanoseconds in a pipeline that uses tessellation.
PSBusy	Percentage	The percentage of time the ShaderUnit has pixel shader work to do.
PSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has pixel shader work to do.
PSTime	Nanoseconds	Time pixel shaders are busy in nanoseconds.
CSBusy	Percentage	The percentage of time the ShaderUnit has compute shader work to do.
CSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has compute shader work to do.
CSTime	Nanoseconds	Time compute shaders are busy in nanoseconds.
PrimitiveAssemblyBusy	Percentage	The percentage of GPUTime that primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate a pixel shader or output buffer bottleneck.
PrimitiveAssemblyBusyCycles	Cycles	Number of GPU cycles the primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate a pixel shader or output buffer bottleneck.
TexUnitBusy	Percentage	The percentage of GPUTime the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account.
TexUnitBusyCycles	Cycles	Number of GPU cycles the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account.
DepthStencilTestBusy	Percentage	Percentage of time the GPU spent performing depth and stencil tests, relative to GPUBusy.
DepthStencilTestBusyCycles	Cycles	Number of GPU cycles spent performing depth and stencil tests.
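The Timing descriptions define ExecutionDuration as the interval from TOP to BOP, which for a single command is roughly ExecutionEnd minus ExecutionStart. The per-stage Busy counters can overlap because stages run in parallel, so their sum may exceed 100 percent. A sketch of both points; the container class, helper names, and sample values are hypothetical, and the duration calculation ignores the caveat about draw calls processed in parallel.

```python
from dataclasses import dataclass


@dataclass
class TimingSample:
    """A few Timing-group values for one API command (hypothetical container)."""
    execution_start_ns: int  # ExecutionStart: command enters TOP
    execution_end_ns: int    # ExecutionEnd: command reaches BOP
    vsgs_busy_pct: float     # VsGsBusy
    ps_busy_pct: float       # PSBusy
    cs_busy_pct: float       # CSBusy


def execution_duration_ns(s: TimingSample) -> int:
    """Approximate ExecutionDuration as the TOP-to-BOP interval (end minus start)."""
    return s.execution_end_ns - s.execution_start_ns


def summed_shader_busy_pct(s: TimingSample) -> float:
    """Sum of per-stage busy percentages; stages overlap, so this may exceed 100."""
    return s.vsgs_busy_pct + s.ps_busy_pct + s.cs_busy_pct


sample = TimingSample(1_000_000, 1_450_000, 60.0, 85.0, 0.0)
print(execution_duration_ns(sample))   # 450000
print(summed_shader_busy_pct(sample))  # 145.0
```

A sum above 100 is expected and is not an error; compare the individual Busy values against each other to find the dominant stage.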

VertexGeometry Group

Counter Name	Usage	Brief Description
VsGsVerticesIn	Items	The number of unique vertices processed by the VS and GS.
VsGsPrimsIn	Items	The number of primitives passed into the VS and GS.
GSVerticesOut	Items	The number of vertices output by the GS.
VsGsVALUInstCount	Items	Average number of vector ALU instructions executed for the VS and GS in a VS-[GS-]PS pipeline. Affected by flow control.
VsGsSALUInstCount	Items	Average number of scalar ALU instructions executed for the VS and GS. Affected by flow control.
VsGsVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed for the VS and GS.
VsGsVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed for the VS and GS.
VsGsSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed for the VS and GS.
VsGsSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed for the VS and GS.

PreTessellation Group

Counter Name	Usage	Brief Description
PreTessVerticesIn	Items	The number of vertices processed by the VS and HS when using tessellation.
PreTessVALUInstCount	Items	Average number of vector ALU instructions executed for the VS and HS in a pipeline that uses tessellation. Affected by flow control.
PreTessSALUInstCount	Items	Average number of scalar ALU instructions executed for the VS and HS in a pipeline that uses tessellation. Affected by flow control.
PreTessVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation.
PreTessVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation.
PreTessSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation.
PreTessSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation.

PostTessellation Group

Counter Name	Usage	Brief Description
PostTessPrimsOut	Items	The number of primitives output by the DS and GS when using tessellation.
PostTessVALUInstCount	Items	Average number of vector ALU instructions executed for the DS and GS in a pipeline that uses tessellation. Affected by flow control.
PostTessSALUInstCount	Items	Average number of scalar ALU instructions executed for the DS and GS in a pipeline that uses tessellation. Affected by flow control.
PostTessVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation.
PostTessVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation.
PostTessSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation.
PostTessSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation.

PrimitiveAssembly Group

Counter Name	Usage	Brief Description
PrimitivesIn	Items	The number of primitives received by the hardware. This includes primitives generated by tessellation.
CulledPrims	Items	The number of culled primitives. Typical reasons include scissor, the primitive having zero area, and back or front face culling.
ClippedPrims	Items	The number of primitives that required one or more clipping operations due to intersecting the view volume or user clip planes.
PAStalledOnRasterizer	Percentage	Percentage of GPUTime that primitive assembly waits for rasterization to be ready to accept data. This roughly indicates for what percentage of time the pipeline is bottlenecked by pixel operations.
PAStalledOnRasterizerCycles	Cycles	Number of GPU cycles the primitive assembly waits for rasterization to be ready to accept data. Indicates the number of GPU cycles the pipeline is bottlenecked by pixel operations.
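CulledPrims and ClippedPrims are most informative relative to PrimitivesIn. A minimal sketch that expresses both as percentages; the function name and sample values are hypothetical.

```python
from typing import Tuple


def prim_fractions(prims_in: int, culled: int, clipped: int) -> Tuple[float, float]:
    """Return (culled %, clipped %) relative to PrimitivesIn."""
    if prims_in == 0:
        return (0.0, 0.0)
    return (100.0 * culled / prims_in, 100.0 * clipped / prims_in)


culled_pct, clipped_pct = prim_fractions(prims_in=100_000, culled=45_000, clipped=2_000)
print(f"culled {culled_pct:.1f}%, clipped {clipped_pct:.1f}%")  # culled 45.0%, clipped 2.0%
```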

PixelShader Group

Counter Name	Usage	Brief Description
PSPixelsOut	Items	Pixels exported from the shader to color buffers. Does not include killed or alpha-tested pixels. With multiple render targets, each render target receives one export, so a single pixel written to two RTs counts as 2.
PSExportStalls	Percentage	Pixel shader output stalls, as a percentage of GPUBusy. Should be zero when the workload is limited by the PS or an earlier stage; a nonzero value indicates a bottleneck in late Z testing or in the color buffer.
PSExportStallsCycles	Cycles	Number of GPU cycles the pixel shader output stalls. Should be zero when the workload is limited by the PS or an earlier stage; a nonzero value indicates a bottleneck in late Z testing or in the color buffer.
PSVALUInstCount	Items	Average number of vector ALU instructions executed in the PS. Affected by flow control.
PSSALUInstCount	Items	Average number of scalar ALU instructions executed in the PS. Affected by flow control.
PSVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed by the PS.
PSVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed by the PS.
PSSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed by the PS.
PSSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed by the PS.
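Two PixelShader values need a little arithmetic to interpret: PSPixelsOut counts one export per render target, and PSExportStalls is expressed as a percentage of GPUBusy. The sketch assumes the percentage equals PSExportStallsCycles divided by GPUBusyCycles, and that the render target count comes from your own pipeline state; both helpers are hypothetical.

```python
def pixels_written(ps_pixels_out: int, render_targets: int) -> float:
    """Undo the per-render-target counting in PSPixelsOut.

    PSPixelsOut counts one export per render target, so one pixel written
    to two RTs reports 2. 'render_targets' must come from your pipeline state.
    """
    if render_targets < 1:
        raise ValueError("render_targets must be >= 1")
    return ps_pixels_out / render_targets


def export_stall_pct(stall_cycles: int, gpu_busy_cycles: int) -> float:
    """Assumed relation: PSExportStalls = PSExportStallsCycles / GPUBusyCycles * 100."""
    return 0.0 if gpu_busy_cycles == 0 else 100.0 * stall_cycles / gpu_busy_cycles


print(pixels_written(4_147_200, render_targets=2))  # 2073600.0
print(export_stall_pct(12_500, 500_000))            # 2.5
```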

ComputeShader Group

Counter Name	Usage	Brief Description
CSThreadGroups	Items	Total number of thread groups.
CSWavefronts	Items	The total number of wavefronts used for the CS.
CSThreads	Items	The number of CS threads processed by the hardware.
CSThreadGroupSize	Items	The number of CS threads within each thread group.
CSVALUInsts	Items	The average number of vector ALU instructions executed per work-item (affected by flow control).
CSVALUUtilization	Percentage	The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of the wave size. Value range: 0% (bad) to 100% (ideal, no thread divergence).
CSSALUInsts	Items	The average number of scalar ALU instructions executed per work-item (affected by flow control).
CSVFetchInsts	Items	The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control).
CSSFetchInsts	Items	The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
CSVWriteInsts	Items	The average number of vector write instructions to the video memory executed per work-item (affected by flow control).
CSVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
CSVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are processed.
CSSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
CSSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are processed.
CSMemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (CSMemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
CSMemUnitBusyCycles	Cycles	Number of GPU cycles the memory unit is active. The result includes the stall time (CSMemUnitStalledCycles). This is measured with all extra fetches and writes and any cache or memory effects taken into account.
CSMemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
CSMemUnitStalledCycles	Cycles	Number of GPU cycles the memory unit is stalled. Try reducing the number or size of fetches and writes if possible.
CSWriteUnitStalled	Percentage	The percentage of GPUTime the write unit is stalled.
CSWriteUnitStalledCycles	Cycles	Number of GPU cycles the write unit is stalled.
CSGDSInsts	Items	The average number of GDS read or GDS write instructions executed per work-item (affected by flow control).
CSLDSInsts	Items	The average number of LDS read/write instructions executed per work-item (affected by flow control).
CSALUStalledByLDS	Percentage	The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue not being ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad).
CSALUStalledByLDSCycles	Cycles	Number of GPU cycles the ALU units are stalled by the LDS input queue being full or the output queue not being ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible.
CSLDSBankConflict	Percentage	The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).
CSLDSBankConflictCycles	Cycles	Number of GPU cycles the LDS is stalled by bank conflicts. Value range: 0 (optimal) to GPUBusyCycles (bad).
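The CSVALUUtilization description notes that a work-group size that is not a multiple of the wave size lowers utilization. The sketch below computes the resulting best-case ceiling, assuming no divergence and that each work-group is packed into whole waves. RDNA can execute in wave32 or wave64 mode, so the wave size is a parameter; the function name is hypothetical.

```python
import math


def max_valu_utilization_pct(group_size: int, wave_size: int) -> float:
    """Best-case CSVALUUtilization for a work-group size, ignoring divergence.

    Each work-group is split into ceil(group_size / wave_size) waves; lanes in
    the last wave beyond group_size stay idle, which caps utilization below
    100% when group_size is not a multiple of wave_size.
    """
    waves = math.ceil(group_size / wave_size)
    return 100.0 * group_size / (waves * wave_size)


for size in (64, 100, 256):
    print(size, max_valu_utilization_pct(size, wave_size=32))
# 64 100.0
# 100 78.125
# 256 100.0
```

If a measured CSVALUUtilization sits well below this ceiling, thread divergence within waves is the more likely cause.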

TextureUnit Group

Counter Name	Usage	Brief Description
TexTriFilteringPct	Percentage	Percentage of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified).
TexTriFilteringCount	Items	Count of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified).
NoTexTriFilteringCount	Items	Count of pixels that did not receive trilinear filtering.
TexVolFilteringPct	Percentage	Percentage of pixels that received volume filtering.
TexVolFilteringCount	Items	Count of pixels that received volume filtering.
NoTexVolFilteringCount	Items	Count of pixels that did not receive volume filtering.
TexAveAnisotropy	Items	The average degree of anisotropy applied, a number between 1 and 16. The anisotropic filtering algorithm only applies samples where they are required (e.g. there will be no extra anisotropic samples if the view vector is perpendicular to the surface), so this can be much lower than the requested anisotropy.

DepthAndStencil Group

Counter Name	Usage	Brief Description
HiZTilesAccepted	Percentage	Percentage of tiles that are accepted by HiZ and will be rendered to the depth or color buffers.
HiZTilesAcceptedCount	Items	Count of tiles that are accepted by HiZ and will be rendered to the depth or color buffers.
HiZTilesRejectedCount	Items	Count of tiles not accepted by HiZ.
PreZTilesDetailCulled	Percentage	Percentage of tiles rejected because the associated primitive had no contributing area.
PreZTilesDetailCulledCount	Items	Count of tiles rejected because the associated primitive had no contributing area.
PreZTilesDetailSurvivingCount	Items	Count of tiles surviving because the associated primitive had contributing area.
HiZQuadsCulled	Percentage	Percentage of quads that did not need to continue down the pipeline after HiZ. They may be written directly to the depth buffer or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized.
HiZQuadsCulledCount	Items	Count of quads that did not need to continue down the pipeline after HiZ. They may be written directly to the depth buffer or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized.
HiZQuadsAcceptedCount	Items	Count of quads that continued down the pipeline after HiZ.
PreZQuadsCulled	Percentage	Percentage of quads rejected based on the detailZ and earlyZ tests.
PreZQuadsCulledCount	Items	Count of quads rejected based on the detailZ and earlyZ tests.
PreZQuadsSurvivingCount	Items	Count of quads surviving detailZ and earlyZ tests.
PostZQuads	Percentage	Percentage of quads for which the pixel shader will run and may be postZ tested.
PostZQuadCount	Items	Count of quads for which the pixel shader will run and may be postZ tested.
PreZSamplesPassing	Items	Number of samples tested for Z before shading and passed.
PreZSamplesFailingS	Items	Number of samples tested for Z before shading and failed stencil test.
PreZSamplesFailingZ	Items	Number of samples tested for Z before shading and failed Z test.
PostZSamplesPassing	Items	Number of samples tested for Z after shading and passed.
PostZSamplesFailingS	Items	Number of samples tested for Z after shading and failed stencil test.
PostZSamplesFailingZ	Items	Number of samples tested for Z after shading and failed Z test.
ZUnitStalled	Percentage	The percentage of GPUTime the depth buffer spends waiting for the color buffer to be ready to accept data. High figures here indicate a bottleneck in color buffer operations.
ZUnitStalledCycles	Cycles	Number of GPU cycles the depth buffer spends waiting for the color buffer to be ready to accept data. Larger numbers indicate a bottleneck in color buffer operations.
DBMemRead	Bytes	Number of bytes read from the depth buffer.
DBMemWritten	Bytes	Number of bytes written to the depth buffer.

ColorBuffer Group

Counter Name	Usage	Brief Description
CBMemRead	Bytes	Number of bytes read from the color buffer.
CBColorAndMaskRead	Bytes	Total number of bytes read from the color and mask buffers.
CBMemWritten	Bytes	Number of bytes written to the color buffer.
CBColorAndMaskWritten	Bytes	Total number of bytes written to the color and mask buffers.
CBSlowPixelPct	Percentage	Percentage of pixels written to the color buffer using a half-rate or quarter-rate format.
CBSlowPixelCount	Items	Number of pixels written to the color buffer using a half-rate or quarter-rate format.

MemoryCache Group

Counter Name	Usage	Brief Description
L0CacheHit	Percentage	The percentage of read requests that hit the data in the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal).
L0CacheRequestCount	Items	The number of read requests made to the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size.
L0CacheHitCount	Items	The number of read requests which result in a cache hit from the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size.
L0CacheMissCount	Items	The number of read requests which result in a cache miss from the L0 cache. The L0 cache contains vector data, which is data that may vary in each thread across the wavefront. Each request is 128 bytes in size.
ScalarCacheHit	Percentage	The percentage of read requests made from executing shader code that hit the data in the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size. Value range: 0% (no hit) to 100% (optimal).
ScalarCacheRequestCount	Items	The number of read requests made from executing shader code to the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size.
ScalarCacheHitCount	Items	The number of read requests made from executing shader code which result in a cache hit from the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size.
ScalarCacheMissCount	Items	The number of read requests made from executing shader code which result in a cache miss from the Scalar cache. The Scalar cache contains data that does not vary in each thread across the wavefront. Each request is 64 bytes in size.
InstCacheHit	Percentage	The percentage of read requests made that hit the data in the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size. Value range: 0% (no hit) to 100% (optimal).
InstCacheRequestCount	Items	The number of read requests made to the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size.
InstCacheHitCount	Items	The number of read requests which result in a cache hit from the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size.
InstCacheMissCount	Items	The number of read requests which result in a cache miss from the Instruction cache. The Instruction cache supplies shader code to an executing shader. Each request is 64 bytes in size.
L1CacheHit	Percentage	The percentage of read or write requests that hit the data in the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal).
L1CacheRequestCount	Items	The number of read or write requests made to the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size.
L1CacheHitCount	Items	The number of read or write requests which result in a cache hit from the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size.
L1CacheMissCount	Items	The number of read or write requests which result in a cache miss from the L1 cache. The L1 cache is shared across all WGPs in a single shader engine. Each request is 128 bytes in size.
L2CacheHit	Percentage	The percentage of read or write requests that hit the data in the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. Value range: 0% (no hit) to 100% (optimal).
L2CacheMiss	Percentage	The percentage of read or write requests that miss the data in the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size. Value range: 0% (optimal) to 100% (all miss).
L2CacheRequestCount	Items	The number of read or write requests made to the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size.
L2CacheHitCount	Items	The number of read or write requests which result in a cache hit from the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size.
L2CacheMissCount	Items	The number of read or write requests which result in a cache miss from the L2 cache. The L2 cache is shared by many blocks across the GPU, including the Command Processor, Geometry Engine, all WGPs, all Render Backends, and others. Each request is 128 bytes in size.
L0TagConflictReadStalledCycles	Items	The number of cycles read operations from the L0 cache are stalled due to tag conflicts.
L0TagConflictWriteStalledCycles	Items	The number of cycles write operations to the L0 cache are stalled due to tag conflicts.
L0TagConflictAtomicStalledCycles	Items	The number of cycles atomic operations on the L0 cache are stalled due to tag conflicts.
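Where a group exposes both a percentage and the raw counts, you can cross-check one against the other. The sketch below assumes two things: that each *CacheHit percentage equals HitCount divided by RequestCount, times 100, and that the request sizes are the ones stated above (128 bytes for L0, L1 and L2; 64 bytes for the Scalar and Instruction caches). The helper names are hypothetical and are not part of GPUPerfAPI.

```python
# Per-request sizes in bytes, taken from the counter descriptions above.
REQUEST_SIZE_BYTES = {"L0": 128, "Scalar": 64, "Inst": 64, "L1": 128, "L2": 128}


def cache_hit_percentage(hit_count: int, request_count: int) -> float:
    """Hit rate in percent, assuming hits are a subset of requests (hypothetical helper)."""
    if request_count == 0:
        return 0.0  # this sketch reports 0% when there were no requests
    return 100.0 * hit_count / request_count


def requested_bytes(cache: str, request_count: int) -> int:
    """Approximate bytes covered by the requests: request count x request size."""
    return request_count * REQUEST_SIZE_BYTES[cache]


# Example: 750 of 1000 L0 requests hit.
print(cache_hit_percentage(750, 1000))  # 75.0
print(requested_bytes("L0", 1000))      # 128000
```

Because L1 and L2 requests can be reads or writes, `requested_bytes` gives an upper bound on traffic. It is not a measured byte count.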

GlobalMemory Group

Counter Name	Usage	Brief Description
FetchSize	Bytes	The total bytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WriteSize	Bytes	The total bytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
MemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitBusyCycles	Cycles	Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalledCycles). This is measured with all extra fetches and writes and any cache or memory effects taken into account.
MemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
MemUnitStalledCycles	Cycles	Number of GPU cycles the memory unit is stalled.
WriteUnitStalled	Percentage	The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).
WriteUnitStalledCycles	Cycles	Number of GPU cycles the Write unit is stalled.
LocalVidMemBytes	Bytes	Number of bytes read from or written to local video memory.
PcieBytes	Bytes	Number of bytes sent and received over the PCIe bus.
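You can combine FetchSize and WriteSize with GPUTime from the Timing group to estimate effective memory bandwidth. GPUTime is reported in nanoseconds, so bytes per nanosecond equals decimal gigabytes per second. The sketch also subtracts MemUnitStalled from MemUnitBusy, because the busy figure is documented as including stall time. Reading that difference as "active but not stalled" time is an interpretation; GPUPerfAPI does not expose it as a counter. The function names are hypothetical.

```python
def effective_bandwidth_gb_per_s(fetch_size_bytes: int, write_size_bytes: int,
                                 gpu_time_ns: float) -> float:
    """Bytes moved per nanosecond, which is numerically equal to GB/s (10^9 bytes/s)."""
    if gpu_time_ns <= 0:
        raise ValueError("gpu_time_ns must be positive")
    return (fetch_size_bytes + write_size_bytes) / gpu_time_ns


def mem_unit_unstalled_pct(mem_unit_busy_pct: float, mem_unit_stalled_pct: float) -> float:
    """MemUnitBusy includes MemUnitStalled, so the difference approximates non-stalled activity."""
    return max(0.0, mem_unit_busy_pct - mem_unit_stalled_pct)


# 2 MB fetched and 1 MB written in 1 ms (1,000,000 ns).
print(effective_bandwidth_gb_per_s(2_000_000, 1_000_000, 1_000_000))  # 3.0
print(mem_unit_unstalled_pct(80.0, 30.0))                             # 50.0
```

GPUTime excludes time during which draw calls run in parallel, so treat this figure as an approximation.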

Vega Counters

Timing Group

Counter Name	Usage	Brief Description
GPUTime	Nanoseconds	Time this API command took to execute on the GPU in nanoseconds from the time the previous command reached the bottom of the pipeline (BOP) to the time this command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel.
ExecutionDuration	Nanoseconds	GPU command execution duration in nanoseconds, from the time the command enters the top of the pipeline (TOP) to the time the command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel.
ExecutionStart	Nanoseconds	GPU command execution start time in nanoseconds. This is the time the command enters the top of the pipeline (TOP).
ExecutionEnd	Nanoseconds	GPU command execution end time in nanoseconds. This is the time the command reaches the bottom of the pipeline (BOP).
GPUBusy	Percentage	The percentage of time the GPU command processor was busy.
GPUBusyCycles	Cycles	Number of GPU cycles that the GPU command processor was busy.
TessellatorBusy	Percentage	The percentage of time the tessellation engine is busy.
TessellatorBusyCycles	Cycles	Number of GPU cycles that the tessellation engine is busy.
VsGsBusy	Percentage	The percentage of time the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline.
VsGsBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has VS or GS work to do in a VS-[GS-]PS pipeline.
VsGsTime	Nanoseconds	Time VS or GS are busy in nanoseconds in a VS-[GS-]PS pipeline.
PreTessellationBusy	Percentage	The percentage of time the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation.
PreTessellationBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has VS and HS work to do in a pipeline that uses tessellation.
PreTessellationTime	Nanoseconds	Time VS and HS are busy in nanoseconds in a pipeline that uses tessellation.
PostTessellationBusy	Percentage	The percentage of time the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation.
PostTessellationBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has DS or GS work to do in a pipeline that uses tessellation.
PostTessellationTime	Nanoseconds	Time DS or GS are busy in nanoseconds in a pipeline that uses tessellation.
PSBusy	Percentage	The percentage of time the ShaderUnit has pixel shader work to do.
PSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has pixel shader work to do.
PSTime	Nanoseconds	Time pixel shaders are busy in nanoseconds.
CSBusy	Percentage	The percentage of time the ShaderUnit has compute shader work to do.
CSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has compute shader work to do.
CSTime	Nanoseconds	Time compute shaders are busy in nanoseconds.
PrimitiveAssemblyBusy	Percentage	The percentage of GPUTime that primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate a pixel shader or output buffer bottleneck.
PrimitiveAssemblyBusyCycles	Cycles	Number of GPU cycles the primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate a pixel shader or output buffer bottleneck.
TexUnitBusy	Percentage	The percentage of GPUTime the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account.
TexUnitBusyCycles	Cycles	Number of GPU cycles the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account.
DepthStencilTestBusy	Percentage	Percentage of time GPU spent performing depth and stencil tests relative to GPUBusy.
DepthStencilTestBusyCycles	Cycles	Number of GPU cycles spent performing depth and stencil tests.
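Several timing counters can be derived from one another. The sketch below assumes that ExecutionDuration is ExecutionEnd minus ExecutionStart, since both are described as the command's TOP and BOP timestamps. It computes DepthStencilTestBusy from cycle counts relative to GPUBusyCycles, which follows that counter's description. Pipeline stages overlap, so per-stage Busy percentages are not expected to sum to 100. The helper names are hypothetical.

```python
def execution_duration_ns(execution_start_ns: int, execution_end_ns: int) -> int:
    """Duration from top of pipeline (ExecutionStart) to bottom of pipeline (ExecutionEnd)."""
    return execution_end_ns - execution_start_ns


def depth_stencil_test_busy_pct(depth_stencil_cycles: int, gpu_busy_cycles: int) -> float:
    """DepthStencilTestBusy is described as a percentage relative to GPUBusy."""
    if gpu_busy_cycles == 0:
        return 0.0
    return 100.0 * depth_stencil_cycles / gpu_busy_cycles


print(execution_duration_ns(1_000, 4_500))      # 3500
print(depth_stencil_test_busy_pct(250, 1_000))  # 25.0
```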

VertexGeometry Group

Counter Name	Usage	Brief Description
VsGsVerticesIn	Items	The number of unique vertices processed by the VS and GS.
VsGsPrimsIn	Items	The number of primitives passed into the VS and GS.
GSVerticesOut	Items	The number of vertices output by the GS.
VsGsVALUInstCount	Items	Average number of vector ALU instructions executed in the VS and GS in a VS-[GS-]PS pipeline. Affected by flow control.
VsGsSALUInstCount	Items	Average number of scalar ALU instructions executed in the VS and GS in a VS-[GS-]PS pipeline. Affected by flow control.
VsGsVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed by the VS and GS in a VS-[GS-]PS pipeline.
VsGsVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed by the VS and GS in a VS-[GS-]PS pipeline.
VsGsSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed by the VS and GS in a VS-[GS-]PS pipeline.
VsGsSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed by the VS and GS in a VS-[GS-]PS pipeline.
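When a geometry shader is bound, dividing GSVerticesOut by VsGsPrimsIn gives the average number of vertices the GS emits per input primitive. This is a rough measure of geometry amplification. It is a derived ratio, not a GPUPerfAPI counter, and the helper name is hypothetical.

```python
def gs_vertices_per_input_prim(gs_vertices_out: int, vs_gs_prims_in: int) -> float:
    """Average GS output vertices per primitive entering the combined VS/GS stage."""
    if vs_gs_prims_in == 0:
        return 0.0
    return gs_vertices_out / vs_gs_prims_in


print(gs_vertices_per_input_prim(12_000, 1_000))  # 12.0
```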

PreTessellation Group

Counter Name	Usage	Brief Description
PreTessVerticesIn	Items	The number of vertices processed by the VS and HS when using tessellation.
PreTessVALUInstCount	Items	Average number of vector ALU instructions executed for the VS and HS in a pipeline that uses tessellation. Affected by flow control.
PreTessSALUInstCount	Items	Average number of scalar ALU instructions executed for the VS and HS in a pipeline that uses tessellation. Affected by flow control.
PreTessVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation.
PreTessVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation.
PreTessSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation.
PreTessSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed for the VS and HS in a pipeline that uses tessellation.

PostTessellation Group

Counter Name	Usage	Brief Description
PostTessPrimsOut	Items	The number of primitives output by the DS and GS when using tessellation.
PostTessVALUInstCount	Items	Average number of vector ALU instructions executed for the DS and GS in a pipeline that uses tessellation. Affected by flow control.
PostTessSALUInstCount	Items	Average number of scalar ALU instructions executed for the DS and GS in a pipeline that uses tessellation. Affected by flow control.
PostTessVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation.
PostTessVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation.
PostTessSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation.
PostTessSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed for the DS and GS in a pipeline that uses tessellation.

PrimitiveAssembly Group

Counter Name	Usage	Brief Description
PrimitivesIn	Items	The number of primitives received by the hardware. This includes primitives generated by tessellation.
CulledPrims	Items	The number of culled primitives. Typical reasons include scissor, the primitive having zero area, and back or front face culling.
ClippedPrims	Items	The number of primitives that required one or more clipping operations due to intersecting the view volume or user clip planes.
PAStalledOnRasterizer	Percentage	Percentage of GPUTime that primitive assembly waits for rasterization to be ready to accept data. This roughly indicates for what percentage of time the pipeline is bottlenecked by pixel operations.
PAStalledOnRasterizerCycles	Cycles	Number of GPU cycles the primitive assembly waits for rasterization to be ready to accept data. Indicates the number of GPU cycles the pipeline is bottlenecked by pixel operations.
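To see what share of the incoming geometry is culled or clipped, express CulledPrims and ClippedPrims relative to PrimitivesIn. This sketch assumes both counts are subsets of PrimitivesIn. The helper name is hypothetical.

```python
def prim_fractions(primitives_in: int, culled_prims: int, clipped_prims: int) -> dict:
    """Percent of received primitives that were culled, and percent that needed clipping."""
    if primitives_in == 0:
        return {"culled_pct": 0.0, "clipped_pct": 0.0}
    return {
        "culled_pct": 100.0 * culled_prims / primitives_in,
        "clipped_pct": 100.0 * clipped_prims / primitives_in,
    }


print(prim_fractions(1_000, 400, 50))  # {'culled_pct': 40.0, 'clipped_pct': 5.0}
```

A high culled percentage can mean that a lot of work upstream of primitive assembly produces primitives that never reach the screen.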

PixelShader Group

Counter Name	Usage	Brief Description
PSPixelsOut	Items	Pixels exported from the shader to color buffers. Does not include killed or alpha-tested pixels. If there are multiple render targets, each render target receives one export, so one pixel written to two RTs counts as 2.
PSExportStalls	Percentage	Pixel shader output stalls. Percentage of GPUBusy. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer.
PSExportStallsCycles	Cycles	Number of GPU cycles the pixel shader output stalls. Should be zero for PS or further upstream limited cases; if not zero, indicates a bottleneck in late Z testing or in the color buffer.
PSVALUInstCount	Items	Average number of vector ALU instructions executed in the PS. Affected by flow control.
PSSALUInstCount	Items	Average number of scalar ALU instructions executed in the PS. Affected by flow control.
PSVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed by the PS.
PSVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed by the PS.
PSSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed by the PS.
PSSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed by the PS.
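PSPixelsOut counts one export per render target. To estimate shaded pixels, divide it by the number of bound render targets. The sketch assumes that every exported pixel writes to every bound render target. The helper name is hypothetical.

```python
def pixels_written(ps_pixels_out: int, render_target_count: int) -> float:
    """Estimate distinct pixels exported, given one export per render target."""
    if render_target_count < 1:
        raise ValueError("render_target_count must be at least 1")
    return ps_pixels_out / render_target_count


print(pixels_written(2, 2))  # 1.0, matching "2 for 1 pixel written to two RTs"
```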

ComputeShader Group

Counter Name	Usage	Brief Description
CSThreadGroups	Items	Total number of thread groups.
CSWavefronts	Items	The total number of wavefronts used for the CS.
CSThreads	Items	The number of CS threads processed by the hardware.
CSThreadGroupSize	Items	The number of CS threads within each thread group.
CSVALUInsts	Items	The average number of vector ALU instructions executed per work-item (affected by flow control).
CSVALUUtilization	Percentage	The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad) to 100% (ideal, no thread divergence).
CSSALUInsts	Items	The average number of scalar ALU instructions executed per work-item (affected by flow control).
CSVFetchInsts	Items	The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control).
CSSFetchInsts	Items	The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
CSVWriteInsts	Items	The average number of vector write instructions to the video memory executed per work-item (affected by flow control).
CSFlatVMemInsts	Items	The average number of FLAT instructions that read from or write to the video memory executed per work-item (affected by flow control). Includes FLAT instructions that read from or write to scratch.
CSVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
CSVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are processed.
CSSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
CSSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are processed.
CSMemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (CSMemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
CSMemUnitBusyCycles	Cycles	Number of GPU cycles the memory unit is active. The result includes the stall time (CSMemUnitStalledCycles). This is measured with all extra fetches and writes and any cache or memory effects taken into account.
CSMemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
CSMemUnitStalledCycles	Cycles	Number of GPU cycles the memory unit is stalled. Try reducing the number or size of fetches and writes if possible.
CSWriteUnitStalled	Percentage	The percentage of GPUTime the write unit is stalled.
CSWriteUnitStalledCycles	Cycles	Number of GPU cycles the write unit is stalled.
CSGDSInsts	Items	The average number of GDS read or GDS write instructions executed per work-item (affected by flow control).
CSLDSInsts	Items	The average number of LDS read/write instructions executed per work-item (affected by flow control).
CSFlatLDSInsts	Items	The average number of FLAT instructions that read from or write to LDS executed per work-item (affected by flow control).
CSALUStalledByLDS	Percentage	The percentage of GPUTime ALU units are stalled because the LDS input queue is full or the output queue is not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad).
CSALUStalledByLDSCycles	Cycles	Number of GPU cycles the ALU units are stalled because the LDS input queue is full or the output queue is not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible.
CSLDSBankConflict	Percentage	The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).
CSLDSBankConflictCycles	Cycles	Number of GPU cycles the LDS is stalled by bank conflicts. Value range: 0 (optimal) to GPUBusyCycles (bad).
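The CSVALUUtilization description notes that a work-group size that is not a multiple of 64 lowers utilization. Vega executes 64-thread wavefronts, so a partially filled last wave leaves lanes idle. The sketch below computes the upper bound that the group size alone places on utilization, before any divergence. It also estimates CSWavefronts, assuming each group is launched as the smallest whole number of waves. The helper names are hypothetical.

```python
import math

WAVE_SIZE = 64  # Vega (GCN-family) compute wavefronts are 64 threads wide


def waves_per_group(thread_group_size: int) -> int:
    """Smallest number of 64-thread waves that can hold one thread group."""
    return math.ceil(thread_group_size / WAVE_SIZE)


def max_valu_utilization_pct(thread_group_size: int) -> float:
    """Best-case CSVALUUtilization from group size alone (ignores thread divergence)."""
    lanes = waves_per_group(thread_group_size) * WAVE_SIZE
    return 100.0 * thread_group_size / lanes


def expected_wavefronts(thread_groups: int, thread_group_size: int) -> int:
    """Rough estimate of CSWavefronts from CSThreadGroups and CSThreadGroupSize."""
    return thread_groups * waves_per_group(thread_group_size)


print(waves_per_group(96))            # 2
print(max_valu_utilization_pct(96))   # 75.0
print(max_valu_utilization_pct(256))  # 100.0
print(expected_wavefronts(10, 96))    # 20
```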

TextureUnit Group

Counter Name	Usage	Brief Description
TexTriFilteringPct	Percentage	Percentage of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified).
TexTriFilteringCount	Items	Count of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified).
NoTexTriFilteringCount	Items	Count of pixels that did not receive trilinear filtering.
TexVolFilteringPct	Percentage	Percentage of pixels that received volume filtering.
TexVolFilteringCount	Items	Count of pixels that received volume filtering.
NoTexVolFilteringCount	Items	Count of pixels that did not receive volume filtering.
TexAveAnisotropy	Items	The average degree of anisotropy applied. A number between 1 and 16. The anisotropic filtering algorithm only applies samples where they are required (e.g. there will be no extra anisotropic samples if the view vector is perpendicular to the surface) so this can be much lower than the requested anisotropy.
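The filtering percentages pair naturally with their "No" counterparts. This sketch assumes that TexTriFilteringPct equals TexTriFilteringCount / (TexTriFilteringCount + NoTexTriFilteringCount) × 100, and likewise for volume filtering. The helper name is hypothetical.

```python
def filtering_pct(filtered_count: int, unfiltered_count: int) -> float:
    """Percent of pixels that received a given filtering mode (trilinear or volume)."""
    total = filtered_count + unfiltered_count
    if total == 0:
        return 0.0
    return 100.0 * filtered_count / total


print(filtering_pct(300, 700))  # 30.0
```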

DepthAndStencil Group

Counter Name	Usage	Brief Description
HiZTilesAccepted	Percentage	Percentage of tiles that are accepted by HiZ and will be rendered to the depth or color buffers.
HiZTilesAcceptedCount	Items	Count of tiles that are accepted by HiZ and will be rendered to the depth or color buffers.
HiZTilesRejectedCount	Items	Count of tiles not accepted by HiZ.
PreZTilesDetailCulled	Percentage	Percentage of tiles rejected because the associated primitive had no contributing area.
PreZTilesDetailCulledCount	Items	Count of tiles rejected because the associated primitive had no contributing area.
PreZTilesDetailSurvivingCount	Items	Count of tiles surviving because the associated primitive had contributing area.
HiZQuadsCulled	Percentage	Percentage of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized.
HiZQuadsCulledCount	Items	Count of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized.
HiZQuadsAcceptedCount	Items	Count of quads that did continue on in the pipeline after HiZ.
PreZQuadsCulled	Percentage	Percentage of quads rejected based on the detailZ and earlyZ tests.
PreZQuadsCulledCount	Items	Count of quads rejected based on the detailZ and earlyZ tests.
PreZQuadsSurvivingCount	Items	Count of quads surviving detailZ and earlyZ tests.
PostZQuads	Percentage	Percentage of quads for which the pixel shader will run and may be postZ tested.
PostZQuadCount	Items	Count of quads for which the pixel shader will run and may be postZ tested.
PreZSamplesPassing	Items	Number of samples tested for Z before shading and passed.
PreZSamplesFailingS	Items	Number of samples tested for Z before shading and failed stencil test.
PreZSamplesFailingZ	Items	Number of samples tested for Z before shading and failed Z test.
PostZSamplesPassing	Items	Number of samples tested for Z after shading and passed.
PostZSamplesFailingS	Items	Number of samples tested for Z after shading and failed stencil test.
PostZSamplesFailingZ	Items	Number of samples tested for Z after shading and failed Z test.
ZUnitStalled	Percentage	The percentage of GPUTime the depth buffer spends waiting for the color buffer to be ready to accept data. High figures here indicate a bottleneck in color buffer operations.
ZUnitStalledCycles	Cycles	Number of GPU cycles the depth buffer spends waiting for the color buffer to be ready to accept data. Larger numbers indicate a bottleneck in color buffer operations.
DBMemRead	Bytes	Number of bytes read from the depth buffer.
DBMemWritten	Bytes	Number of bytes written to the depth buffer.
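The accepted, rejected, passing and failing counts can be turned into rates. This sketch makes two assumptions. First, HiZTilesAccepted is taken to equal accepted / (accepted + rejected) × 100. Second, the three PreZSamples counters are taken to partition every sample tested before shading, so their shares sum to 100. The helper names are hypothetical.

```python
def hiz_tiles_accepted_pct(accepted_count: int, rejected_count: int) -> float:
    """Percent of tiles accepted by HiZ."""
    total = accepted_count + rejected_count
    return 0.0 if total == 0 else 100.0 * accepted_count / total


def prez_sample_breakdown(passing: int, failing_stencil: int, failing_z: int) -> dict:
    """Share of pre-shading samples that passed, failed stencil, or failed Z."""
    total = passing + failing_stencil + failing_z
    if total == 0:
        return {"passing_pct": 0.0, "failing_s_pct": 0.0, "failing_z_pct": 0.0}
    return {
        "passing_pct": 100.0 * passing / total,
        "failing_s_pct": 100.0 * failing_stencil / total,
        "failing_z_pct": 100.0 * failing_z / total,
    }


print(hiz_tiles_accepted_pct(900, 100))      # 90.0
print(prez_sample_breakdown(600, 300, 100))  # 60.0 / 30.0 / 10.0 shares
```

Use the same breakdown for the PostZSamples counters to compare early and late depth rejection.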

ColorBuffer Group

Counter Name	Usage	Brief Description
CBMemRead	Bytes	Number of bytes read from the color buffer.
CBColorAndMaskRead	Bytes	Total number of bytes read from the color and mask buffers.
CBMemWritten	Bytes	Number of bytes written to the color buffer.
CBColorAndMaskWritten	Bytes	Total number of bytes written to the color and mask buffers.
CBSlowPixelPct	Percentage	Percentage of pixels written to the color buffer using a half-rate or quarter-rate format.
CBSlowPixelCount	Items	Number of pixels written to the color buffer using a half-rate or quarter-rate format.

MemoryCache Group

Counter Name	Usage	Brief Description
L0TagConflictReadStalledCycles	Items	The number of cycles read operations from the L0 cache are stalled due to tag conflicts.
L0TagConflictWriteStalledCycles	Items	The number of cycles write operations to the L0 cache are stalled due to tag conflicts.
L0TagConflictAtomicStalledCycles	Items	The number of cycles atomic operations on the L0 cache are stalled due to tag conflicts.

GlobalMemory Group

Counter Name	Usage	Brief Description
FetchSize	Bytes	The total bytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WriteSize	Bytes	The total bytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
L1CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data in L1 cache. Value range: 0% (no hit) to 100% (optimal).
L1CacheHitCount	Items	Count of fetch, write, atomic, and other instructions that hit the data in L1 cache.
L1CacheMissCount	Items	Count of fetch, write, atomic, and other instructions that miss the data in L1 cache.
L2CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the L2 cache. Value range: 0% (no hit) to 100% (optimal).
L2CacheMiss	Percentage	The percentage of fetch, write, atomic, and other instructions that miss the L2 cache. Value range: 0% (optimal) to 100% (all miss).
L2CacheHitCount	Items	Count of fetch, write, atomic, and other instructions that hit the L2 cache.
L2CacheMissCount	Items	Count of fetch, write, atomic, and other instructions that miss the L2 cache.
MemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitBusyCycles	Cycles	Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalledCycles). This is measured with all extra fetches and writes and any cache or memory effects taken into account.
MemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
MemUnitStalledCycles	Cycles	Number of GPU cycles the memory unit is stalled.
WriteUnitStalled	Percentage	The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).
WriteUnitStalledCycles	Cycles	Number of GPU cycles the Write unit is stalled.
LocalVidMemBytes	Bytes	Number of bytes read from or written to local video memory.
PcieBytes	Bytes	Number of bytes sent and received over the PCIe bus.
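On Vega, the L1 and L2 groups expose hit and miss counts but no request count. This sketch assumes that the total is hits plus misses and that L2CacheMiss is the complement of L2CacheHit. The helper names are hypothetical.

```python
def hit_pct_from_counts(hit_count: int, miss_count: int) -> float:
    """Hit rate when only hit and miss counts are available."""
    total = hit_count + miss_count
    return 0.0 if total == 0 else 100.0 * hit_count / total


def miss_pct(hit_pct: float) -> float:
    """Miss rate as the complement of the hit rate."""
    return 100.0 - hit_pct


l2_hit = hit_pct_from_counts(800, 200)
print(l2_hit)            # 80.0
print(miss_pct(l2_hit))  # 20.0
```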

Graphics IP v8 Counters

Timing Group

Counter Name	Usage	Brief Description
GPUTime	Nanoseconds	Time this API command took to execute on the GPU in nanoseconds from the time the previous command reached the bottom of the pipeline (BOP) to the time this command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel.
ExecutionDuration	Nanoseconds	GPU command execution duration in nanoseconds, from the time the command enters the top of the pipeline (TOP) to the time the command reaches the bottom of the pipeline (BOP). Does not include time that draw calls are processed in parallel.
ExecutionStart	Nanoseconds	GPU command execution start time in nanoseconds. This is the time the command enters the top of the pipeline (TOP).
ExecutionEnd	Nanoseconds	GPU command execution end time in nanoseconds. This is the time the command reaches the bottom of the pipeline (BOP).
GPUBusy	Percentage	The percentage of time GPU was busy.
GPUBusyCycles	Cycles	Number of GPU cycles that the GPU was busy.
TessellatorBusy	Percentage	The percentage of time the tessellation engine is busy.
TessellatorBusyCycles	Cycles	Number of GPU cycles that the tessellation engine is busy.
VSBusy	Percentage	The percentage of time the ShaderUnit has vertex shader work to do.
VSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has vertex shader work to do.
VSTime	Nanoseconds	Time vertex shaders are busy in nanoseconds.
HSBusy	Percentage	The percentage of time the ShaderUnit has hull shader work to do.
HSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has hull shader work to do.
HSTime	Nanoseconds	Time hull shaders are busy in nanoseconds.
DSBusy	Percentage	The percentage of time the ShaderUnit has domain shader work to do.
DSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has domain shader work to do.
DSTime	Nanoseconds	Time domain shaders are busy in nanoseconds.
GSBusy	Percentage	The percentage of time the ShaderUnit has geometry shader work to do.
GSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has geometry shader work to do.
GSTime	Nanoseconds	Time geometry shaders are busy in nanoseconds.
PSBusy	Percentage	The percentage of time the ShaderUnit has pixel shader work to do.
PSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has pixel shader work to do.
PSTime	Nanoseconds	Time pixel shaders are busy in nanoseconds.
CSBusy	Percentage	The percentage of time the ShaderUnit has compute shader work to do.
CSBusyCycles	Cycles	Number of GPU cycles that the ShaderUnit has compute shader work to do.
CSTime	Nanoseconds	Time compute shaders are busy in nanoseconds.
PrimitiveAssemblyBusy	Percentage	The percentage of GPUTime that primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate a pixel shader or output buffer bottleneck.
PrimitiveAssemblyBusyCycles	Cycles	Number of GPU cycles the primitive assembly (clipping and culling) is busy. High values may be caused by having many small primitives; mid to low values may indicate a pixel shader or output buffer bottleneck.
TexUnitBusy	Percentage	The percentage of GPUTime the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account.
TexUnitBusyCycles	Cycles	Number of GPU cycles the texture unit is active. This is measured with all extra fetches and any cache or memory effects taken into account.
DepthStencilTestBusy	Percentage	Percentage of time GPU spent performing depth and stencil tests relative to GPUBusy.
DepthStencilTestBusyCycles	Cycles	Number of GPU cycles spent performing depth and stencil tests.
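Graphics IP v8 reports VS, HS, DS and GS separately. Vega merges them into the VsGs, PreTess and PostTess groups. A small lookup can choose the right counter prefix for an API-level shader stage. This is a sketch built from the group names in this document. The `combined_stages` flag is a hypothetical stand-in for "Vega-style hardware". The Timing-group counters spell the merged names out in full (for example, PreTessellationBusy).

```python
from typing import Optional


def counter_prefix(stage: str, uses_tessellation: bool, combined_stages: bool) -> Optional[str]:
    """Map an API shader stage to the counter-name prefix to look for.

    combined_stages=True models Vega-style merging; False models Graphics IP v8,
    where each stage has its own counter group. Returns None for invalid combinations.
    """
    stage = stage.upper()
    if stage in ("PS", "CS"):
        return stage  # pixel and compute shaders are always reported on their own
    if not combined_stages:
        return stage if stage in ("VS", "HS", "DS", "GS") else None
    if uses_tessellation:
        if stage in ("VS", "HS"):
            return "PreTess"
        if stage in ("DS", "GS"):
            return "PostTess"
        return None
    # Without tessellation, VS (and GS, if used) share the VsGs group.
    return "VsGs" if stage in ("VS", "GS") else None


print(counter_prefix("VS", False, True))   # VsGs
print(counter_prefix("GS", True, True))    # PostTess
print(counter_prefix("GS", False, False))  # GS
```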

VertexShader Group

Counter Name	Usage	Brief Description
VSVerticesIn	Items	The number of vertices processed by the VS.
VSVALUInstCount	Items	Average number of vector ALU instructions executed in the VS. Affected by flow control.
VSSALUInstCount	Items	Average number of scalar ALU instructions executed in the VS. Affected by flow control.
VSVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed by the VS.
VSVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed by the VS.
VSSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed by the VS.
VSSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed by the VS.

HullShader Group

Counter Name	Usage	Brief Description
HSPatches	Items	The number of patches processed by the HS.
HSVALUInstCount	Items	Average number of vector ALU instructions executed in the HS. Affected by flow control.
HSSALUInstCount	Items	Average number of scalar ALU instructions executed in the HS. Affected by flow control.
HSVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed by the HS.
HSVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed by the HS.
HSSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed by the HS.
HSSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed by the HS.

DomainShader Group

Counter Name	Usage	Brief Description
DSVerticesIn	Items	The number of vertices processed by the DS.
DSVALUInstCount	Items	Average number of vector ALU instructions executed in the DS. Affected by flow control.
DSSALUInstCount	Items	Average number of scalar ALU instructions executed in the DS. Affected by flow control.
DSVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed by the DS.
DSVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed by the DS.
DSSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed by the DS.
DSSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed by the DS.

GeometryShader Group

Counter Name	Usage	Brief Description
GSPrimsIn	Items	The number of primitives passed into the GS.
GSVerticesOut	Items	The number of vertices output by the GS.
GSVALUInstCount	Items	Average number of vector ALU instructions executed in the GS. Affected by flow control.
GSSALUInstCount	Items	Average number of scalar ALU instructions executed in the GS. Affected by flow control.
GSVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed by the GS.
GSVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed by the GS.
GSSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed by the GS.
GSSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed by the GS.
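Dividing GSVerticesOut by GSPrimsIn gives the average number of vertices the GS emits per input primitive. This offers a quick way to see how much a geometry shader amplifies its input. The Python sketch below is illustrative only: the function name is hypothetical, and the inputs are assumed to be raw counter values from one draw call:

```python
def gs_vertices_per_input_prim(gs_prims_in: int, gs_vertices_out: int) -> float:
    """Average number of vertices emitted by the GS for each input primitive.

    Hypothetical helper: gs_prims_in is GSPrimsIn, gs_vertices_out is GSVerticesOut.
    """
    if gs_prims_in == 0:
        return 0.0  # the GS did not run for this sample
    return gs_vertices_out / gs_prims_in


# A GS that emits each input triangle unchanged outputs 3 vertices per primitive.
print(gs_vertices_per_input_prim(1000, 3000))  # 3.0
```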

PrimitiveAssembly Group

Counter Name	Usage	Brief Description
PrimitivesIn	Items	The number of primitives received by the hardware. This includes primitives generated by tessellation.
CulledPrims	Items	The number of culled primitives. Typical reasons include scissor, the primitive having zero area, and back or front face culling.
ClippedPrims	Items	The number of primitives that required one or more clipping operations due to intersecting the view volume or user clip planes.
PAStalledOnRasterizer	Percentage	Percentage of GPUTime that primitive assembly waits for rasterization to be ready to accept data. This roughly indicates for what percentage of time the pipeline is bottlenecked by pixel operations.
PAStalledOnRasterizerCycles	Cycles	Number of GPU cycles the primitive assembly waits for rasterization to be ready to accept data. Indicates the number of GPU cycles the pipeline is bottlenecked by pixel operations.
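CulledPrims and ClippedPrims are raw counts, so normalizing them by PrimitivesIn makes draws of different sizes comparable. The sketch below assumes the three values come from the same draw call. The class and method names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class PrimitiveAssemblyStats:
    """Hypothetical container for raw PrimitiveAssembly counter values."""

    primitives_in: int  # PrimitivesIn
    culled_prims: int   # CulledPrims
    clipped_prims: int  # ClippedPrims

    def culled_fraction(self) -> float:
        # Share of received primitives that were culled (scissor, zero area, facing).
        return self.culled_prims / self.primitives_in if self.primitives_in else 0.0

    def clipped_fraction(self) -> float:
        # Share of received primitives that needed at least one clipping operation.
        return self.clipped_prims / self.primitives_in if self.primitives_in else 0.0


stats = PrimitiveAssemblyStats(primitives_in=1000, culled_prims=400, clipped_prims=50)
print(stats.culled_fraction(), stats.clipped_fraction())  # 0.4 0.05
```

A high culled fraction may indicate that the pipeline is processing geometry that is ultimately discarded.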

PixelShader Group

Counter Name	Usage	Brief Description
PSPixelsOut	Items	Pixels exported from the shader to color buffers. Does not include killed or alpha-tested pixels. With multiple render targets, each render target receives one export, so a single pixel written to two render targets counts as 2.
PSExportStalls	Percentage	Pixel shader output stalls, as a percentage of GPUBusy. Should be zero when the workload is limited by the PS or by an earlier pipeline stage; a nonzero value indicates a bottleneck in late Z testing or in the color buffer.
PSExportStallsCycles	Cycles	Number of GPU cycles the pixel shader output stalls. Should be zero when the workload is limited by the PS or by an earlier pipeline stage; a nonzero value indicates a bottleneck in late Z testing or in the color buffer.
PSVALUInstCount	Items	Average number of vector ALU instructions executed in the PS. Affected by flow control.
PSSALUInstCount	Items	Average number of scalar ALU instructions executed in the PS. Affected by flow control.
PSVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are being processed by the PS.
PSVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are being processed by the PS.
PSSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are being processed by the PS.
PSSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are being processed by the PS.
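PSPixelsOut counts one export per render target, so the number of distinct pixels written is lower than PSPixelsOut when multiple render targets are bound. The sketch below assumes every shaded pixel writes to all bound render targets. This does not hold for every shader. The helper name is hypothetical:

```python
def estimated_pixels_written(ps_pixels_out: int, render_targets: int) -> float:
    """Estimate distinct pixels written from PSPixelsOut.

    Assumes each exported pixel writes every one of `render_targets`
    bound render targets, so each pixel contributes `render_targets` exports.
    """
    if render_targets <= 0:
        raise ValueError("render_targets must be positive")
    return ps_pixels_out / render_targets


# The case from the table: 1 pixel written to two RTs reports PSPixelsOut = 2.
print(estimated_pixels_written(2, 2))  # 1.0
```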

ComputeShader Group

Counter Name	Usage	Brief Description
CSThreadGroups	Items	Total number of thread groups.
CSWavefronts	Items	The total number of wavefronts used for the CS.
CSThreads	Items	The number of CS threads processed by the hardware.
CSThreadGroupSize	Items	The number of CS threads within each thread group.
CSVALUInsts	Items	The average number of vector ALU instructions executed per work-item (affected by flow control).
CSVALUUtilization	Percentage	The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence).
CSSALUInsts	Items	The average number of scalar ALU instructions executed per work-item (affected by flow control).
CSVFetchInsts	Items	The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control).
CSSFetchInsts	Items	The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
CSVWriteInsts	Items	The average number of vector write instructions to the video memory executed per work-item (affected by flow control).
CSFlatVMemInsts	Items	The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch.
CSVALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
CSVALUBusyCycles	Cycles	Number of GPU cycles where vector ALU instructions are processed.
CSSALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
CSSALUBusyCycles	Cycles	Number of GPU cycles where scalar ALU instructions are processed.
CSMemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
CSMemUnitBusyCycles	Cycles	Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account.
CSMemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
CSMemUnitStalledCycles	Cycles	Number of GPU cycles the memory unit is stalled. Try reducing the number or size of fetches and writes if possible.
CSWriteUnitStalled	Percentage	The percentage of GPUTime the write unit is stalled.
CSWriteUnitStalledCycles	Cycles	Number of GPU cycles the write unit is stalled.
CSGDSInsts	Items	The average number of GDS read or GDS write instructions executed per work item (affected by flow control).
CSLDSInsts	Items	The average number of LDS read/write instructions executed per work-item (affected by flow control).
CSFlatLDSInsts	Items	The average number of FLAT instructions that read from or write to LDS executed per work item (affected by flow control).
CSALUStalledByLDS	Percentage	The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue not being ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad).
CSALUStalledByLDSCycles	Cycles	Number of GPU cycles the ALU units are stalled by the LDS input queue being full or the output queue not being ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible.
CSLDSBankConflict	Percentage	The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).
CSLDSBankConflictCycles	Cycles	Number of GPU cycles the LDS is stalled by bank conflicts. Value range: 0 (optimal) to GPUBusyCycles (bad).
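CSVALUUtilization drops when the thread-group size is not a multiple of the wavefront size, because the last wavefront of each group runs with some lanes idle. The Python sketch below computes that upper bound. It also computes the CSWavefronts value you would expect from CSThreadGroups and CSThreadGroupSize. It assumes that wavefronts never span thread groups and that there is no thread divergence. The wave size defaults to 64 to match the description above, but RDNA-family GPUs can also run 32-wide waves. All function names are hypothetical:

```python
def waves_per_group(group_size: int, wave_size: int = 64) -> int:
    """Number of wavefronts needed to hold one thread group (ceiling division)."""
    return (group_size + wave_size - 1) // wave_size


def expected_wavefronts(thread_groups: int, group_size: int, wave_size: int = 64) -> int:
    """Expected CSWavefronts, assuming no wavefront spans two thread groups."""
    return thread_groups * waves_per_group(group_size, wave_size)


def max_valu_utilization(group_size: int, wave_size: int = 64) -> float:
    """Upper bound on CSVALUUtilization (percent) from partially filled waves alone."""
    lanes = waves_per_group(group_size, wave_size) * wave_size
    return 100.0 * group_size / lanes


# A 100-thread group needs two 64-wide waves (128 lanes), so at most 78.125% of lanes are active.
print(waves_per_group(100), max_valu_utilization(100))  # 2 78.125
print(expected_wavefronts(10, 100))  # 20
```

If the measured CSVALUUtilization is well below this bound, divergence within waves is the more likely cause.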

TextureUnit Group

Counter Name	Usage	Brief Description
TexTriFilteringPct	Percentage	Percentage of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified).
TexTriFilteringCount	Items	Count of pixels that received trilinear filtering. Note that not all pixels for which trilinear filtering is enabled will receive it (e.g. if the texture is magnified).
NoTexTriFilteringCount	Items	Count of pixels that did not receive trilinear filtering.
TexVolFilteringPct	Percentage	Percentage of pixels that received volume filtering.
TexVolFilteringCount	Items	Count of pixels that received volume filtering.
NoTexVolFilteringCount	Items	Count of pixels that did not receive volume filtering.
TexAveAnisotropy	Items	The average degree of anisotropy applied. A number between 1 and 16. The anisotropic filtering algorithm only applies samples where they are required (e.g. there will be no extra anisotropic samples if the view vector is perpendicular to the surface) so this can be much lower than the requested anisotropy.
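Each percentage counter in this group corresponds to a pair of counts, for example TexTriFilteringCount and NoTexTriFilteringCount. The descriptions suggest, but do not state, that the percentage equals the received count divided by the sum of both counts. Assuming that relationship holds, the percentage can be recomputed as follows. The helper name is hypothetical:

```python
def filtering_pct(received_count: int, not_received_count: int) -> float:
    """Percentage of pixels that received a filtering mode, from the paired counts.

    Intended for TexTriFilteringCount/NoTexTriFilteringCount and for
    TexVolFilteringCount/NoTexVolFilteringCount.
    """
    total = received_count + not_received_count
    return 100.0 * received_count / total if total else 0.0


print(filtering_pct(300, 700))  # 30.0
```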

DepthAndStencil Group

Counter Name	Usage	Brief Description
HiZTilesAccepted	Percentage	Percentage of tiles that are accepted by HiZ and will be rendered to the depth or color buffers.
HiZTilesAcceptedCount	Items	Count of tiles that are accepted by HiZ and will be rendered to the depth or color buffers.
HiZTilesRejectedCount	Items	Count of tiles not accepted by HiZ.
PreZTilesDetailCulled	Percentage	Percentage of tiles rejected because the associated primitive had no contributing area.
PreZTilesDetailCulledCount	Items	Count of tiles rejected because the associated primitive had no contributing area.
PreZTilesDetailSurvivingCount	Items	Count of tiles surviving because the associated primitive had contributing area.
HiZQuadsCulled	Percentage	Percentage of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized.
HiZQuadsCulledCount	Items	Count of quads that did not have to continue on in the pipeline after HiZ. They may be written directly to the depth buffer, or culled completely. Consistently low values here may suggest that the Z-range is not being fully utilized.
HiZQuadsAcceptedCount	Items	Count of quads that did continue on in the pipeline after HiZ.
PreZQuadsCulled	Percentage	Percentage of quads rejected based on the detailZ and earlyZ tests.
PreZQuadsCulledCount	Items	Count of quads rejected based on the detailZ and earlyZ tests.
PreZQuadsSurvivingCount	Items	Count of quads surviving detailZ and earlyZ tests.
PostZQuads	Percentage	Percentage of quads for which the pixel shader will run and may be postZ tested.
PostZQuadCount	Items	Count of quads for which the pixel shader will run and may be postZ tested.
PreZSamplesPassing	Items	Number of samples that were Z tested before shading and passed.
PreZSamplesFailingS	Items	Number of samples that were Z tested before shading and failed the stencil test.
PreZSamplesFailingZ	Items	Number of samples that were Z tested before shading and failed the Z test.
PostZSamplesPassing	Items	Number of samples that were Z tested after shading and passed.
PostZSamplesFailingS	Items	Number of samples that were Z tested after shading and failed the stencil test.
PostZSamplesFailingZ	Items	Number of samples that were Z tested after shading and failed the Z test.
ZUnitStalled	Percentage	The percentage of GPUTime the depth buffer spends waiting for the color buffer to be ready to accept data. High figures here indicate a bottleneck in color buffer operations.
ZUnitStalledCycles	Cycles	Number of GPU cycles the depth buffer spends waiting for the color buffer to be ready to accept data. Larger numbers indicate a bottleneck in color buffer operations.
DBMemRead	Bytes	Number of bytes read from the depth buffer.
DBMemWritten	Bytes	Number of bytes written to the depth buffer.
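The counts in this group can be combined into a few rates that summarize where quads and samples are rejected. The sketch below assumes that every quad reaching a stage is either culled or passed on, so that culled plus accepted (or surviving) is the total. It also uses the split between the PreZ and PostZ sample counts to estimate how much Z testing happens before the pixel shader runs. It takes a plain dictionary keyed by counter name. The function names are hypothetical:

```python
def pct(part: int, whole: int) -> float:
    """part / whole as a percentage, returning 0.0 for an empty sample."""
    return 100.0 * part / whole if whole else 0.0


def depth_summary(c: dict) -> dict:
    """Derive rejection rates from raw DepthAndStencil counter values in `c`."""
    hiz_total = c["HiZQuadsCulledCount"] + c["HiZQuadsAcceptedCount"]
    prez_total = c["PreZQuadsCulledCount"] + c["PreZQuadsSurvivingCount"]
    pre_samples = c["PreZSamplesPassing"] + c["PreZSamplesFailingS"] + c["PreZSamplesFailingZ"]
    post_samples = c["PostZSamplesPassing"] + c["PostZSamplesFailingS"] + c["PostZSamplesFailingZ"]
    return {
        "hiz_quads_culled_pct": pct(c["HiZQuadsCulledCount"], hiz_total),
        "prez_quads_culled_pct": pct(c["PreZQuadsCulledCount"], prez_total),
        # Share of Z-tested samples that were tested before shading.
        "early_z_sample_pct": pct(pre_samples, pre_samples + post_samples),
    }


counters = {
    "HiZQuadsCulledCount": 250, "HiZQuadsAcceptedCount": 750,
    "PreZQuadsCulledCount": 100, "PreZQuadsSurvivingCount": 400,
    "PreZSamplesPassing": 600, "PreZSamplesFailingS": 100, "PreZSamplesFailingZ": 300,
    "PostZSamplesPassing": 200, "PostZSamplesFailingS": 0, "PostZSamplesFailingZ": 50,
}
print(depth_summary(counters))
# {'hiz_quads_culled_pct': 25.0, 'prez_quads_culled_pct': 20.0, 'early_z_sample_pct': 80.0}
```

A low early_z_sample_pct means most samples are Z tested only after shading. In that case, pixels that end up failing the depth test have still paid for a pixel shader invocation.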

ColorBuffer Group

Counter Name	Usage	Brief Description
CBMemRead	Bytes	Number of bytes read from the color buffer.
CBColorAndMaskRead	Bytes	Total number of bytes read from the color and mask buffers.
CBMemWritten	Bytes	Number of bytes written to the color buffer.
CBColorAndMaskWritten	Bytes	Total number of bytes written to the color and mask buffers.
CBSlowPixelPct	Percentage	Percentage of pixels written to the color buffer using a half-rate or quarter-rate format.
CBSlowPixelCount	Items	Number of pixels written to the color buffer using a half-rate or quarter-rate format.

MemoryCache Group

Counter Name	Usage	Brief Description
L0TagConflictReadStalledCycles	Items	The number of cycles read operations from the L0 cache are stalled due to tag conflicts.
L0TagConflictWriteStalledCycles	Items	The number of cycles write operations to the L0 cache are stalled due to tag conflicts.
L0TagConflictAtomicStalledCycles	Items	The number of cycles atomic operations on the L0 cache are stalled due to tag conflicts.

GlobalMemory Group

Counter Name	Usage	Brief Description
FetchSize	Bytes	The total bytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WriteSize	Bytes	The total bytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data cache. Value range: 0% (no hit) to 100% (optimal).
CacheMiss	Percentage	The percentage of fetch, write, atomic, and other instructions that miss the data cache. Value range: 0% (optimal) to 100% (all miss).
CacheHitCount	Items	Count of fetch, write, atomic, and other instructions that hit the data cache.
CacheMissCount	Items	Count of fetch, write, atomic, and other instructions that miss the data cache.
MemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitBusyCycles	Cycles	Number of GPU cycles the memory unit is active. The result includes the stall time (MemUnitStalledCycles). This is measured with all extra fetches and writes and any cache or memory effects taken into account.
MemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
MemUnitStalledCycles	Cycles	Number of GPU cycles the memory unit is stalled.
WriteUnitStalled	Percentage	The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).
WriteUnitStalledCycles	Cycles	Number of GPU cycles the Write unit is stalled.
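Two relationships in this table are useful when post-processing results. First, CacheHit presumably corresponds to CacheHitCount divided by the sum of CacheHitCount and CacheMissCount; this is an assumption based on the paired descriptions. Second, MemUnitBusy includes the time counted by MemUnitStalled, and both are fractions of the same GPUTime. Subtracting the two therefore approximates the time the memory unit was active without stalling. A minimal sketch follows; the helper names are hypothetical, and the corresponding Cycles counters can be combined the same way:

```python
def cache_hit_pct(hit_count: int, miss_count: int) -> float:
    """Recompute a CacheHit-style percentage from CacheHitCount and CacheMissCount."""
    total = hit_count + miss_count
    return 100.0 * hit_count / total if total else 0.0


def mem_unit_active_not_stalled_pct(mem_unit_busy: float, mem_unit_stalled: float) -> float:
    """Approximate percentage of GPUTime the memory unit was busy but not stalled.

    MemUnitBusy already includes the stall time reported by MemUnitStalled,
    so the difference isolates the non-stalled portion. The result is clamped
    at zero to guard against inconsistent inputs.
    """
    return max(0.0, mem_unit_busy - mem_unit_stalled)


print(cache_hit_pct(900, 100))                      # 90.0
print(mem_unit_active_not_stalled_pct(60.0, 15.0))  # 45.0
```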

Counters Exposed for Compute Performance Analysis

The following tables list the counters exposed for analyzing GPU compute workloads, as well as the family of GPUs and APUs on which each counter is available:

RDNA3 Counters

General Group

Counter Name	Usage	Brief Description
Wavefronts	Items	Total wavefronts.
VALUInsts	Items	The average number of vector ALU instructions executed per work-item (affected by flow control).
SALUInsts	Items	The average number of scalar ALU instructions executed per work-item (affected by flow control).
VFetchInsts	Items	The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory.
SFetchInsts	Items	The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
VWriteInsts	Items	The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory.
GDSInsts	Items	The average number of GDS read or GDS write instructions executed per work item (affected by flow control).
VALUUtilization	Percentage	The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of the wave size. Value range: 0% (bad), 100% (ideal - no thread divergence).
VALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
SALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).

LocalMemory Group

Counter Name	Usage	Brief Description
LDSInsts	Items	The average number of LDS read or LDS write instructions executed per work item (affected by flow control).
LDSBankConflict	Percentage	The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).

GlobalMemory Group

Counter Name	Usage	Brief Description
FetchSize	Kilobytes	The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WriteSize	Kilobytes	The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
L0CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data in L0 cache. Value range: 0% (no hit) to 100% (optimal).
L1CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data in L1 cache. Writes and atomics always miss this cache. Value range: 0% (no hit) to 100% (optimal).
L2CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).
MemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
WriteUnitStalled	Percentage	The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).
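In the compute tables, FetchSize and WriteSize are reported in kilobytes, whereas the graphics GlobalMemory group reports them in bytes. Dividing the total traffic by the measured GPU time gives an effective bandwidth figure. The sketch below assumes that a kilobyte is 1024 bytes and takes the time in seconds. Convert GPUTime from whatever unit the Timing group reports before calling it. The function name is hypothetical:

```python
def effective_bandwidth_gib_s(
    fetch_size_kb: float,
    write_size_kb: float,
    gpu_time_s: float,
    bytes_per_kb: int = 1024,  # assumption: binary kilobytes
) -> float:
    """Effective video-memory bandwidth in GiB/s from FetchSize + WriteSize."""
    if gpu_time_s <= 0:
        raise ValueError("gpu_time_s must be positive")
    total_bytes = (fetch_size_kb + write_size_kb) * bytes_per_kb
    return total_bytes / gpu_time_s / 2**30


# 1 GiB fetched (1,048,576 KB) and nothing written in 2 ms is about 500 GiB/s.
print(round(effective_bandwidth_gib_s(1_048_576, 0, 0.002), 3))  # 500.0
```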

RDNA2 Counters

General Group

Counter Name	Usage	Brief Description
Wavefronts	Items	Total wavefronts.
VALUInsts	Items	The average number of vector ALU instructions executed per work-item (affected by flow control).
SALUInsts	Items	The average number of scalar ALU instructions executed per work-item (affected by flow control).
VFetchInsts	Items	The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory.
SFetchInsts	Items	The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
VWriteInsts	Items	The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory.
GDSInsts	Items	The average number of GDS read or GDS write instructions executed per work item (affected by flow control).
VALUUtilization	Percentage	The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of the wave size. Value range: 0% (bad), 100% (ideal - no thread divergence).
VALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
SALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).

LocalMemory Group

Counter Name	Usage	Brief Description
LDSInsts	Items	The average number of LDS read or LDS write instructions executed per work item (affected by flow control).
LDSBankConflict	Percentage	The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).

GlobalMemory Group

Counter Name	Usage	Brief Description
FetchSize	Kilobytes	The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WriteSize	Kilobytes	The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
L0CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data in L0 cache. Value range: 0% (no hit) to 100% (optimal).
L1CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data in L1 cache. Writes and atomics always miss this cache. Value range: 0% (no hit) to 100% (optimal).
L2CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).
MemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
WriteUnitStalled	Percentage	The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).

RDNA Counters

General Group

Counter Name	Usage	Brief Description
Wavefronts	Items	Total wavefronts.
VALUInsts	Items	The average number of vector ALU instructions executed per work-item (affected by flow control).
SALUInsts	Items	The average number of scalar ALU instructions executed per work-item (affected by flow control).
VFetchInsts	Items	The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory.
SFetchInsts	Items	The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
VWriteInsts	Items	The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory.
GDSInsts	Items	The average number of GDS read or GDS write instructions executed per work item (affected by flow control).
VALUUtilization	Percentage	The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of the wave size. Value range: 0% (bad), 100% (ideal - no thread divergence).
VALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
SALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).

LocalMemory Group

Counter Name	Usage	Brief Description
LDSInsts	Items	The average number of LDS read or LDS write instructions executed per work item (affected by flow control).
LDSBankConflict	Percentage	The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).

GlobalMemory Group

Counter Name	Usage	Brief Description
FetchSize	Kilobytes	The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WriteSize	Kilobytes	The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
L0CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data in L0 cache. Value range: 0% (no hit) to 100% (optimal).
L1CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data in L1 cache. Writes and atomics always miss this cache. Value range: 0% (no hit) to 100% (optimal).
L2CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).
MemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
WriteUnitStalled	Percentage	The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).

Vega Counters

General Group

Counter Name	Usage	Brief Description
Wavefronts	Items	Total wavefronts.
VALUInsts	Items	The average number of vector ALU instructions executed per work-item (affected by flow control).
SALUInsts	Items	The average number of scalar ALU instructions executed per work-item (affected by flow control).
VFetchInsts	Items	The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory.
SFetchInsts	Items	The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
VWriteInsts	Items	The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory.
FlatVMemInsts	Items	The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch.
GDSInsts	Items	The average number of GDS read or GDS write instructions executed per work item (affected by flow control).
VALUUtilization	Percentage	The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence).
VALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
SALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).

LocalMemory Group

Counter Name	Usage	Brief Description
LDSInsts	Items	The average number of LDS read or LDS write instructions executed per work item (affected by flow control). Excludes FLAT instructions that read from or write to LDS.
FlatLDSInsts	Items	The average number of FLAT instructions that read from or write to LDS executed per work item (affected by flow control).
LDSBankConflict	Percentage	The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).
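On Vega, LDSInsts excludes FLAT instructions that access LDS, and FlatLDSInsts counts exactly those. Likewise, VFetchInsts and VWriteInsts exclude FLAT accesses to video memory, which FlatVMemInsts reports separately, scratch accesses included. All of these are per-work-item averages over the same dispatch, so adding them gives an approximate total per work-item. A minimal sketch with hypothetical helper names:

```python
def total_lds_insts(lds_insts: float, flat_lds_insts: float) -> float:
    """LDS instructions per work-item, including those issued as FLAT instructions."""
    return lds_insts + flat_lds_insts


def total_vmem_insts(vfetch_insts: float, vwrite_insts: float, flat_vmem_insts: float) -> float:
    """Approximate vector memory instructions per work-item, FLAT included.

    Note that FlatVMemInsts also counts FLAT accesses to scratch memory.
    """
    return vfetch_insts + vwrite_insts + flat_vmem_insts


print(total_lds_insts(12.0, 4.0))        # 16.0
print(total_vmem_insts(10.0, 2.0, 3.0))  # 15.0
```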

GlobalMemory Group

Counter Name	Usage	Brief Description
FetchSize	Kilobytes	The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WriteSize	Kilobytes	The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
L1CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data in L1 cache. Value range: 0% (no hit) to 100% (optimal).
L2CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).
MemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
WriteUnitStalled	Percentage	The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).

Graphics IP v8 Counters

General Group

Counter Name	Usage	Brief Description
Wavefronts	Items	Total wavefronts.
VALUInsts	Items	The average number of vector ALU instructions executed per work-item (affected by flow control).
SALUInsts	Items	The average number of scalar ALU instructions executed per work-item (affected by flow control).
VFetchInsts	Items	The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory.
SFetchInsts	Items	The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
VWriteInsts	Items	The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory.
FlatVMemInsts	Items	The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch.
GDSInsts	Items	The average number of GDS read or GDS write instructions executed per work item (affected by flow control).
VALUUtilization	Percentage	The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence).
VALUBusy	Percentage	The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
SALUBusy	Percentage	The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).

LocalMemory Group

Counter Name	Usage	Brief Description
LDSInsts	Items	The average number of LDS read or LDS write instructions executed per work item (affected by flow control). Excludes FLAT instructions that read from or write to LDS.
FlatLDSInsts	Items	The average number of FLAT instructions that read from or write to LDS executed per work item (affected by flow control).
LDSBankConflict	Percentage	The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).

GlobalMemory Group

Counter Name	Usage	Brief Description
FetchSize	Kilobytes	The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WriteSize	Kilobytes	The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
CacheHit	Percentage	The percentage of fetch, write, atomic, and other instructions that hit the data cache. Value range: 0% (no hit) to 100% (optimal).
MemUnitBusy	Percentage	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitStalled	Percentage	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
WriteUnitStalled	Percentage	The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).
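Each hardware family exposes a slightly different set of compute counters, so tooling that post-processes results may need to know which names to expect. The mapping below was transcribed by hand from the compute tables in this section and covers only the cache hit counters and the FLAT instruction counters. It is an illustrative sketch, not a GPUPerfAPI interface. Production code should query the counters that the profiling session actually exposes. The names are hypothetical:

```python
# Cache hit counters listed in each family's compute GlobalMemory table.
CACHE_HIT_COUNTERS = {
    "RDNA3": ("L0CacheHit", "L1CacheHit", "L2CacheHit"),
    "RDNA2": ("L0CacheHit", "L1CacheHit", "L2CacheHit"),
    "RDNA": ("L0CacheHit", "L1CacheHit", "L2CacheHit"),
    "Vega": ("L1CacheHit", "L2CacheHit"),
    "Graphics IP v8": ("CacheHit",),
}

# Families whose compute tables list FlatVMemInsts and FlatLDSInsts.
FLAT_COUNTER_FAMILIES = frozenset({"Vega", "Graphics IP v8"})


def cache_hit_counters(family: str) -> tuple:
    """Return the cache hit counter names listed for a hardware family."""
    try:
        return CACHE_HIT_COUNTERS[family]
    except KeyError:
        raise ValueError(f"unknown hardware family: {family!r}") from None


def has_flat_counters(family: str) -> bool:
    """True if the family's tables list separate FLAT instruction counters."""
    return family in FLAT_COUNTER_FAMILIES


print(cache_hit_counters("Vega"))  # ('L1CacheHit', 'L2CacheHit')
print(has_flat_counters("RDNA3"))  # False
```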