[rocprofiler-compute] [Documentation] Add metric descriptions for missing gfx942 metrics #3027

Open

vedithal-amd wants to merge 1 commit into develop from users/vedithal/rocprofiler-compute-mi300-metric-descriptions

+41 −4

projects/rocprofiler-compute/CHANGELOG.md

     * Synced latest metric descriptions to public facing documentation
         * Updated metric units to be more human readable in public facing documentation
+    * Added missing metric descriptions for gfx942 architecture
     ### Changed
     * Default output format for the underlying ROCprofiler-SDK tool has been changed from ``csv`` to ``rocpd``.

...ocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/0300_memory_chart.yaml

@@ -213,6 +213,8 @@ Panel Config:
           normalization unit.
         sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
           line, per normalization unit.
+        sL1D Lat: The time-averaged number of cycles scalar L1D cache requests spent in
+          flight before data was returned to a CU.
         sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
           unit.
         sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization

.../src/rocprof_compute_soc/analysis_configs/gfx942/1100_compute_units_compute_pipeline.yaml

@@ -261,6 +261,11 @@ Panel Config:
           also presented as a percent of the peak theoretical IOPs achievable on the
           specific accelerator. Note: this does not include any integer operations from
           MFMA instructions.
+        MFMA FLOPs (F8): >-
+          The total number of 8-bit floating point MFMA operations executed per
+          second. Note: this does not include any 8-bit floating point operations from
+          VALU instructions. This is also presented as a percent of the peak theoretical
+          F8 MFMA operations achievable on the specific accelerator.
         MFMA FLOPs (BF16): >-
           The total number of 16-bit brain floating point MFMA operations executed
           per second. Note: this does not include any 16-bit brain floating point operations
@@ -325,6 +330,8 @@ Panel Config:
           the VALU or MFMA units, per normalization unit.
         IOPs (Total): The total number of integer operations executed on either the VALU
           or MFMA units, per normalization unit.
+        F8 OPs: The total number of 8-bit floating-point MFMA operations executed, per
+          normalization unit.
         F16 OPs: The total number of 16-bit floating-point operations executed on either
           the VALU or MFMA units, per normalization unit.
         BF16 OPs: The total number of 16-bit brain floating-point operations executed

..._soc/analysis_configs/gfx942/1500_address_processing_unit_and_data_return_path_ta_td.yaml

@@ -194,6 +194,15 @@ Panel Config:
           sending write/atomic data further into the vL1D pipeline.
         "Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
           processor was stalled waiting to send command data to the data processor.
+        "Sequencer \u2192 TA Address Stall": The number of cycles the sequencer was stalled
+          waiting to send address requests to the address processor due to a full address
+          FIFO, per normalization unit.
+        "Sequencer \u2192 TA Command Stall": The number of cycles the sequencer was stalled
+          waiting to send commands to the address processor due to a full command FIFO,
+          per normalization unit.
+        "Sequencer \u2192 TA Data Stall": The number of cycles the sequencer was stalled
+          waiting to send write data to the address processor due to a full data FIFO,
+          per normalization unit.
         Total Instructions: The total number of memory instructions executed by the address
           processor over all compute units on the accelerator, per normalization unit.
         Global/Generic Instructions: The total number of global & generic memory instructions

...ts/rocprofiler-compute/src/rocprof_compute_soc/analysis_configs/gfx942/1700_l2_cache.yaml

@@ -491,6 +491,8 @@ Panel Config:
           data from any memory location, per normalization unit.
         Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
           data from any memory location, per normalization unit.
+        Read (128B): The total number of L2 requests to Infinity Fabric to read 128B
+          of data from any memory location, per normalization unit.
         Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
           data from any memory location, per normalization unit. 64B requests for uncached
           data are counted as two 32B uncached data requests.
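
The counting rule stated above for uncached reads (a 64B uncached request is counted as two 32B requests) can be sketched as follows. The function and parameter names are hypothetical; this only illustrates the arithmetic in the description, not how the hardware counter or the tool is implemented.

```python
def uncached_read_count(requests_32b: int, requests_64b: int) -> int:
    """Counted uncached reads, assuming each 64B request counts as two 32B requests."""
    return requests_32b + 2 * requests_64b
```

For example, three 32B requests plus two 64B requests would be reported as seven uncached reads under this rule.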

projects/rocprofiler-compute/src/utils/.config_hashes.json

@@ -98,19 +98,19 @@
         "0000_top_stats.yaml": "2819d96f5b1c3704f2ac50868a246a7f",
         "0100_system_info.yaml": "cefae2b10db8cf4b0d3a971cff5e82c8",
         "0200_system_speed_of_light.yaml": "1bceb9c4727b953a474f92a2f9cfe35d",
-        "0300_memory_chart.yaml": "0a57cdf55be606799ee8d7b42a993027",
+        "0300_memory_chart.yaml": "d34910b7300bd5920ae8ecedc9d52198",
         "0400_roofline.yaml": "318c3e774d41a639628a7f72c2462375",
         "0500_command_processor_cpc_cpf.yaml": "a049849fd5031e509b216614225e3a99",
         "0600_workgroup_manager_spi.yaml": "b12975cfb14c5f06a495c74163f8b8f3",
         "0700_wavefront.yaml": "ba89cee91714d3ca8005ed0bc9d1a70a",
         "1000_compute_units_instruction_mix.yaml": "1c9b9237908dc461991e8bb3b092519d",
-        "1100_compute_units_compute_pipeline.yaml": "4fa8e3dd97b6f305294b224a993a7865",
+        "1100_compute_units_compute_pipeline.yaml": "b034dcb67b272de2271407905aafd1f8",
         "1200_local_data_share_lds.yaml": "4d34d6c4618833e394fb8fdd0ac4e7cf",
         "1300_instruction_cache.yaml": "e616b2e4ec05c2d91df43cdaabfc9fea",
         "1400_scalar_l1_data_cache.yaml": "393c4aea974c05e45590f3053d66e12e",
-        "1500_address_processing_unit_and_data_return_path_ta_td.yaml": "0a95f88d901d89e72fc353a2db39aacb",
+        "1500_address_processing_unit_and_data_return_path_ta_td.yaml": "2a0325e72f5240e33c2a2cc124113cdd",
         "1600_vector_l1_data_cache.yaml": "2a539ff492d3a83b62f50f4b5b93d8c8",
-        "1700_l2_cache.yaml": "ca170444952edf6d05ce69e47e894e9f",
+        "1700_l2_cache.yaml": "c7b84d54dc60a3ebe71220fc18e5a51f",
         "1800_l2_cache_per_channel.yaml": "c4c6b0990499b445608c46d1a051b9f6",
         "2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
     }
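
The four changed entries above correspond to the four edited analysis configs. The values are 32 hex digits, which is consistent with MD5 digests of each file's contents, but the diff does not show how `.config_hashes.json` is generated, so treat that as an assumption. Under it, a minimal sketch for spotting stale entries might look like this (the helper names `config_hash` and `changed_configs` are hypothetical, not part of the project):

```python
import hashlib
from pathlib import Path


def config_hash(path: Path) -> str:
    """Return the MD5 hex digest of a file's raw bytes (assumed hashing scheme)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def changed_configs(config_dir: Path, recorded: dict[str, str]) -> list[str]:
    """List recorded file names that exist in config_dir but whose digest differs."""
    return sorted(
        name
        for name, digest in recorded.items()
        if (config_dir / name).is_file()
        and config_hash(config_dir / name) != digest
    )
```

Running `changed_configs` against the `gfx942` config directory with the recorded mapping would, under the stated assumption, list any YAML file that was edited without its hash being refreshed.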

...cts/rocprofiler-compute/tools/per_arch_metric_definitions/gfx942_metrics_description.yaml

@@ -348,6 +348,9 @@ Compute Speed-of-Light:
       VALU IOPs:
         rst: 'The total integer operations executed per second on the :ref:`VALU <desc-valu>`. This is also presented as a percent of the peak theoretical IOPs achievable on the specific accelerator. Note: this does not include any integer operations from :ref:`MFMA <desc-mfma>` instructions.'
         unit: GIOPs
+      MFMA FLOPs (F8):
+        rst: 'The total number of 8-bit floating point :ref:`MFMA <desc-mfma>` operations executed per second. Note: this does not include any 8-bit floating point operations from :ref:`VALU <desc-valu>` instructions. This is also presented as a percent of the peak theoretical F8 MFMA operations achievable on the specific accelerator.'
+        unit: GFLOPs
       MFMA FLOPs (BF16):
         rst: 'The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>` operations executed per second. Note: this does not include any 16-bit brain floating point operations from :ref:`VALU <desc-valu>` instructions. This is also presented as a percent of the peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
         unit: GFLOPs
@@ -467,6 +470,9 @@ L1I Speed-of-Light:
       Bandwidth Utilization:
         rst: The number of bytes looked up in the L1I cache, as a percent of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over the :ref:`total L1I cycles <total-l1i-cycles>`.
         unit: Percent
+      Cache Hit Rate:
+        rst: The percent of L1I requests that hit on a previously loaded line in the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests.
+        unit: Percent
       L1I-L2 Bandwidth Utilization:
         rst: The percent of the peak theoretical L1I \u2192 L2 cache request bandwidth achieved. Calculated as the ratio of the total number of requests from the L1I to the L2 cache over the :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
         unit: Percent
@@ -497,6 +503,9 @@ Scalar L1D Speed-of-Light:
       Bandwidth Utilization:
         rst: The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the :ref:`total sL1D cycles <total-sl1d-cycles>`.
         unit: Percent
+      Cache Hit Rate:
+        rst: The percent of sL1D requests that hit on a previously loaded line in the cache. Calculated as the ratio of the number of sL1D requests that hit over the number of all sL1D requests.
+        unit: Percent
       sL1D-L2 BW Utilization:
         rst: The percentage of the peak theoretical sL1D - L2 interface bandwidth achieved. Calculated as total number of bytes read from, written to, or atomically updated across the sL1D - L2 interface.
         unit: Percent
@@ -534,6 +543,9 @@ Scalar L1D cache accesses:
       Read Req (16 DWord):
         rst: The total number of sL1D read requests made for sixteen dwords of data (64B), per :ref:`normalization unit <normalization-units>`.
         unit: Requests per Normalization Unit
+      Atomic Req:
+        rst: The total number of atomic requests to the sL1D, per :ref:`normalization unit <normalization-units>`. Typically unused on current CDNA accelerators.
+        unit: Requests per Normalization Unit
     Scalar L1D Cache - L2 Interface:
       sL1D-L2 BW:
         rst: The total number of bytes read from, written to, or atomically updated across the sL1D\u2194:doc:`L2 <l2-cache>` interface, divided by total duration. Note that sL1D writes and atomics are typically unused on current CDNA accelerators, so in the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth.
@@ -828,6 +840,9 @@ L2 - Fabric interface detailed metrics:
       Read (64B):
         rst: The total number of L2 requests to Infinity Fabric to read 64B of data from any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
         unit: Requests per Normalization Unit
+      Read (128B):
+        rst: The total number of L2 requests to Infinity Fabric to read 128B of data from any memory location, per :ref:`normalization unit <normalization-units>`. See :ref:`l2-request-flow` for more detail.
+        unit: Requests per Normalization Unit
       Read (Uncached):
         rst: The total number of L2 requests to Infinity Fabric to read :ref:`uncached data <memory-type>` from any memory location, per :ref:`normalization unit <normalization-units>`. 64B requests for uncached data are counted as two 32B uncached data requests. See :ref:`l2-request-flow` for more detail.
         unit: Requests per Normalization Unit
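
The two `Cache Hit Rate` entries added in this file (L1I and sL1D) describe the same ratio: requests that hit over all requests, expressed as a percent. A minimal sketch of that arithmetic follows. The function name and the zero-request handling are assumptions for illustration, not the tool's actual implementation.

```python
def cache_hit_rate(hits: int, requests: int) -> float:
    """Percent of cache requests that hit: 100 * hits / requests."""
    if requests == 0:
        # Assumption: report 0.0 when no requests were issued,
        # rather than dividing by zero.
        return 0.0
    return 100.0 * hits / requests
```

For instance, 75 hits out of 100 requests gives a hit rate of 75 percent.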