# DCGM FIELD ,Prometheus metric type ,help message # Clocks DCGM_FI_DEV_SM_CLOCK ,gauge ,SM clock frequency (in MHz). DCGM_FI_DEV_MEM_CLOCK ,gauge ,Memory clock frequency (in MHz). # Temperature DCGM_FI_DEV_MEMORY_TEMP ,gauge ,Memory temperature (in C). DCGM_FI_DEV_GPU_TEMP ,gauge ,GPU temperature (in C). # Power DCGM_FI_DEV_POWER_USAGE ,gauge ,Power draw (in W). DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION ,counter ,Total energy consumption since boot (in mJ). # PCIE DCGM_FI_DEV_PCIE_REPLAY_COUNTER ,counter ,Total number of PCIe retries. # Utilization (the sample period varies depending on the product) DCGM_FI_DEV_GPU_UTIL ,gauge ,GPU utilization (in %). DCGM_FI_DEV_MEM_COPY_UTIL ,gauge ,Memory utilization (in %). DCGM_FI_DEV_ENC_UTIL ,gauge ,Encoder utilization (in %). DCGM_FI_DEV_DEC_UTIL ,gauge ,Decoder utilization (in %). # Errors and violations DCGM_FI_DEV_XID_ERRORS ,gauge ,Value of the last XID error encountered. # Memory usage DCGM_FI_DEV_FB_FREE ,gauge ,Framebuffer memory free (in MiB). DCGM_FI_DEV_FB_USED ,gauge ,Framebuffer memory used (in MiB). # NVLink DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL ,counter ,Total number of NVLink bandwidth counters for all lanes. # VGPU License status DCGM_FI_DEV_VGPU_LICENSE_STATUS ,gauge ,vGPU License status # Remapped rows DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS ,counter ,Number of remapped rows for uncorrectable errors DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS ,counter ,Number of remapped rows for correctable errors DCGM_FI_DEV_ROW_REMAP_FAILURE ,gauge ,Whether remapping of rows has failed # DCP metrics DCGM_FI_PROF_PCIE_TX_BYTES ,counter ,The number of bytes of active pcie tx data including both header and payload. DCGM_FI_PROF_PCIE_RX_BYTES ,counter ,The number of bytes of active pcie rx data including both header and payload. DCGM_FI_PROF_GR_ENGINE_ACTIVE ,gauge ,Ratio of time the graphics engine is active (in %). DCGM_FI_PROF_SM_ACTIVE ,gauge ,The ratio of cycles an SM has at least 1 warp assigned (in %). DCGM_FI_PROF_SM_OCCUPANCY ,gauge ,The ratio of number of warps resident on an SM (in %). DCGM_FI_PROF_PIPE_TENSOR_ACTIVE ,gauge ,Ratio of cycles the tensor (HMMA) pipe is active (in %). DCGM_FI_PROF_DRAM_ACTIVE ,gauge ,Ratio of cycles the device memory interface is active sending or receiving data (in %). DCGM_FI_PROF_PIPE_FP64_ACTIVE ,gauge ,Ratio of cycles the fp64 pipes are active (in %). DCGM_FI_PROF_PIPE_FP32_ACTIVE ,gauge ,Ratio of cycles the fp32 pipes are active (in %). DCGM_FI_PROF_PIPE_FP16_ACTIVE ,gauge ,Ratio of cycles the fp16 pipes are active (in %). # Datadog additional recommended fields DCGM_FI_DEV_COUNT ,counter ,Number of Devices on the node. DCGM_FI_DEV_FAN_SPEED ,gauge ,Fan speed for the device in percent 0-100. DCGM_FI_DEV_SLOWDOWN_TEMP ,gauge ,Slowdown temperature for the device. DCGM_FI_DEV_POWER_MGMT_LIMIT ,gauge ,Current power limit for the device. DCGM_FI_DEV_PSTATE ,gauge ,Performance state (P-State) 0-15. 0=highest DCGM_FI_DEV_FB_TOTAL ,gauge , DCGM_FI_DEV_FB_RESERVED ,gauge , DCGM_FI_DEV_FB_USED_PERCENT ,gauge , DCGM_FI_DEV_CLOCK_THROTTLE_REASONS ,gauge ,Current clock throttle reasons (bitmask of DCGM_CLOCKS_THROTTLE_REASON_*) DCGM_FI_PROCESS_NAME ,label ,The Process Name. DCGM_FI_CUDA_DRIVER_VERSION ,label , DCGM_FI_DEV_NAME ,label , DCGM_FI_DEV_MINOR_NUMBER ,label , DCGM_FI_DRIVER_VERSION ,label , DCGM_FI_DEV_BRAND ,label , DCGM_FI_DEV_SERIAL ,label ,