2 changes: 2 additions & 0 deletions projects/rocprofiler-compute/CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,8 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
* Synced latest metric descriptions to public facing documentation
* Updated metric units to be more human readable in public facing documentation

* Added missing metric descriptions for gfx942 architecture

### Changed

* Default output format for the underlying ROCprofiler-SDK tool has been changed from ``csv`` to ``rocpd``.
Original file line number Diff line number Diff line change
Expand Up @@ -213,6 +213,8 @@ Panel Config:
normalization unit.
sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
line, per normalization unit.
sL1D Lat: The time-averaged number of cycles scalar L1D cache requests spent in
flight before data was returned to a CU.
sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
unit.
sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
Original file line number Diff line number Diff line change
Expand Up @@ -261,6 +261,11 @@ Panel Config:
also presented as a percent of the peak theoretical IOPs achievable on the
specific accelerator. Note: this does not include any integer operations from
MFMA instructions.
MFMA FLOPs (F8): >-
The total number of 8-bit floating point MFMA operations executed per
second. Note: this does not include any 8-bit floating point operations from
VALU instructions. This is also presented as a percent of the peak theoretical
F8 MFMA operations achievable on the specific accelerator.
MFMA FLOPs (BF16): >-
The total number of 16-bit brain floating point MFMA operations executed
per second. Note: this does not include any 16-bit brain floating point operations
Expand Down Expand Up @@ -325,6 +330,8 @@ Panel Config:
the VALU or MFMA units, per normalization unit.
IOPs (Total): The total number of integer operations executed on either the VALU
or MFMA units, per normalization unit.
F8 OPs: The total number of 8-bit floating-point MFMA operations executed, per
normalization unit.
F16 OPs: The total number of 16-bit floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
BF16 OPs: The total number of 16-bit brain floating-point operations executed
Original file line number Diff line number Diff line change
Expand Up @@ -194,6 +194,15 @@ Panel Config:
sending write/atomic data further into the vL1D pipeline.
"Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
processor was stalled waiting to send command data to the data processor.
"Sequencer \u2192 TA Address Stall": The number of cycles the sequencer was stalled
waiting to send address requests to the address processor due to a full address
FIFO, per normalization unit.
"Sequencer \u2192 TA Command Stall": The number of cycles the sequencer was stalled
waiting to send commands to the address processor due to a full command FIFO,
per normalization unit.
"Sequencer \u2192 TA Data Stall": The number of cycles the sequencer was stalled
waiting to send write data to the address processor due to a full data FIFO,
per normalization unit.
Total Instructions: The total number of memory instructions executed by the address
processor over all compute units on the accelerator, per normalization unit.
Global/Generic Instructions: The total number of global & generic memory instructions
Original file line number Diff line number Diff line change
Expand Up @@ -491,6 +491,8 @@ Panel Config:
data from any memory location, per normalization unit.
Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
data from any memory location, per normalization unit.
Read (128B): The total number of L2 requests to Infinity Fabric to read 128B
of data from any memory location, per normalization unit.
Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
data from any memory location, per normalization unit. 64B requests for uncached
data are counted as two 32B uncached data requests.
8 changes: 4 additions & 4 deletions projects/rocprofiler-compute/src/utils/.config_hashes.json
Original file line number Diff line number Diff line change
Expand Up @@ -98,19 +98,19 @@
"0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
"0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
"0200_system_speed_of_light.yaml": "1bceb9c4727b953a474f92a2f9cfe35d",
"0300_memory_chart.yaml": "0a57cdf55be606799ee8d7b42a993027",
"0300_memory_chart.yaml": "d34910b7300bd5920ae8ecedc9d52198",
"0400_roofline.yaml": "318c3e774d41a639628a7f72c2462375",
"0500_command_processor_cpc_cpf.yaml": "a049849fd5031e509b216614225e3a99",
"0600_workgroup_manager_spi.yaml": "b12975cfb14c5f06a495c74163f8b8f3",
"0700_wavefront.yaml": "ba89cee91714d3ca8005ed0bc9d1a70a",
"1000_compute_units_instruction_mix.yaml": "1c9b9237908dc461991e8bb3b092519d",
"1100_compute_units_compute_pipeline.yaml": "4fa8e3dd97b6f305294b224a993a7865",
"1100_compute_units_compute_pipeline.yaml": "b034dcb67b272de2271407905aafd1f8",
"1200_local_data_share_lds.yaml": "4d34d6c4618833e394fb8fdd0ac4e7cf",
"1300_instruction_cache.yaml": "e616b2e4ec05c2d91df43cdaabfc9fea",
"1400_scalar_l1_data_cache.yaml": "393c4aea974c05e45590f3053d66e12e",
"1500_address_processing_unit_and_data_return_path_ta_td.yaml": "0a95f88d901d89e72fc353a2db39aacb",
"1500_address_processing_unit_and_data_return_path_ta_td.yaml": "2a0325e72f5240e33c2a2cc124113cdd",
"1600_vector_l1_data_cache.yaml": "2a539ff492d3a83b62f50f4b5b93d8c8",
"1700_l2_cache.yaml": "ca170444952edf6d05ce69e47e894e9f",
"1700_l2_cache.yaml": "c7b84d54dc60a3ebe71220fc18e5a51f",
"1800_l2_cache_per_channel.yaml": "c4c6b0990499b445608c46d1a051b9f6",
"2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
}
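The four updated entries above pair each panel config file name with a 32-character hex digest. A minimal sketch of how such a table could be checked against the files on disk follows. It assumes the digests are MD5 hashes of the raw file bytes, which the diff does not confirm, and the helper names `md5_of_file` and `find_stale_hashes` are hypothetical, not part of rocprofiler-compute.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path) -> str:
    """Return the hex MD5 digest of a file's raw bytes (assumed hash scheme)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def find_stale_hashes(config_dir: Path, recorded: dict[str, str]) -> list[str]:
    """List file names whose current digest differs from the recorded one.

    A file that is missing from config_dir is also reported as stale.
    """
    stale = []
    for name, expected in sorted(recorded.items()):
        file_path = config_dir / name
        if not file_path.is_file() or md5_of_file(file_path) != expected:
            stale.append(name)
    return stale
```

Under that assumption, editing a description in any panel YAML file would cause its name to appear in the returned list until the recorded digest is regenerated, which matches the pattern of this change set.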
Original file line number Diff line number Diff line change
Expand Up @@ -348,6 +348,9 @@ Compute Speed-of-Light:
VALU IOPs:
rst: 'The total integer operations executed per second on the :ref:`VALU <desc-valu>`. This is also presented as a percent of the peak theoretical IOPs achievable on the specific accelerator. Note: this does not include any integer operations from :ref:`MFMA <desc-mfma>` instructions.'
unit: GIOPs
MFMA FLOPs (F8):
rst: 'The total number of 8-bit floating point :ref:`MFMA <desc-mfma>` operations executed per second. Note: this does not include any 8-bit floating point operations from :ref:`VALU <desc-valu>` instructions. This is also presented as a percent of the peak theoretical F8 MFMA operations achievable on the specific accelerator.'
unit: GFLOPs
MFMA FLOPs (BF16):
rst: 'The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations executed per second. Note: this does not include any 16-bit brain floating point operations from :ref:`VALU <desc-valu>` instructions. This is also presented as a percent of the peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
unit: GFLOPs
Expand Down Expand Up @@ -467,6 +470,9 @@ L1I Speed-of-Light:
Bandwidth Utilization:
rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I cycles <total-l1i-cycles>`.
unit: Percent
Cache Hit Rate:
rst: The percent of L1I requests that hit on a previously loaded line in the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests.
unit: Percent
L1I-L2 Bandwidth Utilization:
rst: The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth achieved. Calculated as the ratio of the total number of requests from the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
unit: Percent
Expand Down Expand Up @@ -497,6 +503,9 @@ Scalar L1D Speed-of-Light:
Bandwidth Utilization:
rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D cycles <total-sl1d-cycles>`.
unit: Percent
Cache Hit Rate:
rst: The percent of sL1D requests that hit on a previously loaded line in the cache. Calculated as the ratio of the number of sL1D requests that hit over the number of all sL1D requests.
unit: Percent
sL1D-L2 BW Utilization:
rst: The percentage of the peak theoretical sL1D - L2 interface bandwidth achieved. Calculated as total number of bytes read from, written to, or atomically updated across the sL1D - L2 interface.
unit: Percent
Expand Down Expand Up @@ -534,6 +543,9 @@ Scalar L1D cache accesses:
Read Req (16 DWord):
rst: The total number of sL1D read requests made for sixteen dwords of data (64B), per :ref:`normalization unit <normalization-units>`.
unit: Requests per Normalization Unit
Atomic Req:
rst: The total number of atomic requests to the sL1D, per :ref:`normalization unit <normalization-units>`. Typically unused on current CDNA accelerators.
unit: Requests per Normalization Unit
Scalar L1D Cache - L2 Interface:
sL1D-L2 BW:
rst: The total number of bytes read from, written to, or atomically updated across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration. Note that sL1D writes and atomics are typically unused on current CDNA accelerators, so in the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth.
Expand Down Expand Up @@ -828,6 +840,9 @@ L2 - Fabric interface detailed metrics:
Read (64B):
rst: The total number of L2 requests to Infinity Fabric to read 64B of data from any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
unit: Requests per Normalization Unit
Read (128B):
rst: The total number of L2 requests to Infinity Fabric to read 128B of data from any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
unit: Requests per Normalization Unit
Read (Uncached):
rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached data <memory-type>` from any memory location, per :ref:`normalization unit <normalization-units>`. 64B requests for uncached data are counted as two 32B uncached data requests. See :ref:`l2-request-flow` for more detail.
unit: Requests per Normalization Unit