Commit 4d0e724

Merge pull request #119 from amd/development
dev -> main
2 parents 72903f9 + babc685 commit 4d0e724

42 files changed: +2637 -117 lines changed

.github/workflows/code_quality_checks.yml

Lines changed: 8 additions & 1 deletion
@@ -10,11 +10,18 @@ on:
 
 jobs:
   pre-commit:
-    runs-on: ubuntu-latest
+    runs-on: [ self-hosted ]
     container: python:3.9
 
     steps:
       - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0 # Fetch all history for pre-commit to work
+      - name: Configure git for container
+        run: |
+          git config --global --add safe.directory /__w/node-scraper/node-scraper
+          git config --global user.email "[email protected]"
+          git config --global user.name "CI Bot"
       - name: setup environment and run pre-commit hooks
         shell: bash
         run: |

.github/workflows/functional-test.yml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ permissions:
 
 jobs:
   run_tests:
-    runs-on: ubuntu-latest
+    runs-on: [ self-hosted ]
     container: python:3.9
 
     steps:

.github/workflows/unit-test.yml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ permissions:
 
 jobs:
   run_tests:
-    runs-on: ubuntu-latest
+    runs-on: [ self-hosted ]
     container: python:3.9
 
     steps:

README.md

Lines changed: 54 additions & 2 deletions
@@ -232,19 +232,71 @@ This would produce the following config:
         "analysis_range_start": null,
         "analysis_range_end": null,
         "check_unknown_dmesg_errors": true,
-        "exclude_category": null
+        "exclude_category": null,
+        "interval_to_collapse_event": 60,
+        "num_timestamps": 3
       }
     }
   },
   "result_collators": {}
 }
 ```
 
+**Running DmesgPlugin with a dmesg log file:**
+
+Instead of collecting dmesg from the system, you can analyze a pre-existing dmesg log file using the `--data` argument:
+
+```sh
+node-scraper --run-plugins DmesgPlugin --data /path/to/dmesg.log --collection False
+```
+
+This will skip the collection phase and directly analyze the provided dmesg.log file.
+
+**Custom Error Regex Example:**
+
+You can extend the built-in error detection with custom regex patterns. Create a config file with custom error patterns:
+
+```json
+{
+  "global_args": {},
+  "plugins": {
+    "DmesgPlugin": {
+      "analysis_args": {
+        "check_unknown_dmesg_errors": false,
+        "interval_to_collapse_event": 60,
+        "num_timestamps": 3,
+        "error_regex": [
+          {
+            "regex": "MY_CUSTOM_ERROR.*",
+            "message": "My Custom Error Detected",
+            "event_category": "SW_DRIVER",
+            "event_priority": 3
+          },
+          {
+            "regex": "APPLICATION_CRASH: .*",
+            "message": "Application Crash",
+            "event_category": "SW_DRIVER",
+            "event_priority": 4
+          }
+        ]
+      }
+    }
+  },
+  "result_collators": {}
+}
+```
+
+Save this to `dmesg_custom_config.json` and run:
+
+```sh
+node-scraper --plugin-configs dmesg_custom_config.json run-plugins DmesgPlugin
+```
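To get a feel for how the `error_regex` entries in the config above behave, here is a minimal Python sketch that loads a trimmed copy of that config and scans a few log lines. It is not node-scraper's analyzer: the `find_custom_errors` helper, the sample log lines, and the choice to apply `re.search` line by line are assumptions made purely for illustration.

```python
import json
import re

# Trimmed copy of the config from the README change above.
CONFIG_TEXT = """
{
  "plugins": {
    "DmesgPlugin": {
      "analysis_args": {
        "error_regex": [
          {"regex": "MY_CUSTOM_ERROR.*", "message": "My Custom Error Detected",
           "event_category": "SW_DRIVER", "event_priority": 3},
          {"regex": "APPLICATION_CRASH: .*", "message": "Application Crash",
           "event_category": "SW_DRIVER", "event_priority": 4}
        ]
      }
    }
  }
}
"""

# Hypothetical log lines, invented for this example.
SAMPLE_LOG = "\n".join([
    "kern  :info  : usb 1-1: new high-speed USB device",
    "kern  :err   : MY_CUSTOM_ERROR code=7",
    "user  :err   : APPLICATION_CRASH: worker exited",
])


def find_custom_errors(config_text, log_text):
    """Return (message, matched_text) pairs for every custom pattern hit."""
    config = json.loads(config_text)
    entries = config["plugins"]["DmesgPlugin"]["analysis_args"]["error_regex"]
    hits = []
    for entry in entries:
        pattern = re.compile(entry["regex"])
        for line in log_text.splitlines():
            match = pattern.search(line)
            if match:
                hits.append((entry["message"], match.group(0)))
    return hits
```

Calling `find_custom_errors(CONFIG_TEXT, SAMPLE_LOG)` is a quick way to check that a pattern matches the lines you expect before handing the config to node-scraper.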
+
 #### **'summary' sub command**
 The 'summary' subcommand can be used to combine results from multiple runs of node-scraper to a
 single summary.csv file. Sample run:
 ```sh
-node-scraper summary --summary_path /<path_to_node-scraper_logs>
+node-scraper summary --search-path /<path_to_node-scraper_logs>
 ```
 This will generate a new file '/<path_to_node-scraper_logs>/summary.csv' file. This file will
 contain the results from all 'nodescraper.csv' files from '/<path_to_node-scarper_logs>'.
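The combining step described for the 'summary' subcommand can be pictured with a short Python sketch. This is an illustration only, not the tool's code: the `build_summary` helper is hypothetical, and it assumes the search is recursive and that every `nodescraper.csv` shares the same header row.

```python
import csv
from pathlib import Path


def build_summary(search_path):
    """Merge every nodescraper.csv under search_path into one summary.csv."""
    root = Path(search_path)
    header = None
    rows = []
    # Assumption: runs live in subdirectories, so search recursively.
    for csv_file in sorted(root.rglob("nodescraper.csv")):
        with csv_file.open(newline="") as handle:
            reader = csv.reader(handle)
            file_header = next(reader, None)
            if file_header is None:
                continue  # skip empty files
            if header is None:
                header = file_header  # keep the first header we see
            rows.extend(row for row in reader if row)
    summary_file = root / "summary.csv"
    with summary_file.open("w", newline="") as handle:
        writer = csv.writer(handle)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
    return summary_file
```

The function returns the path of the written file, which mirrors the `summary.csv` location mentioned in the README text.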

dev-setup.sh

Lines changed: 4 additions & 1 deletion
@@ -10,4 +10,7 @@ source venv/bin/activate
 
 python3 -m pip install --editable .[dev] --upgrade
 
-pre-commit install
+# Only install pre-commit hooks if not in CI environment
+if [ -z "$CI" ]; then
+    pre-commit install
+fi
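For readers less familiar with the shell test `[ -z "$CI" ]`, the same decision can be sketched in Python. The `should_install_hooks` helper is hypothetical and exists only to show the logic: the hooks are installed when `CI` is unset or empty, and skipped otherwise.

```python
import os


def should_install_hooks(environ=None):
    """Mirror `[ -z "$CI" ]`: true when CI is unset or an empty string."""
    env = os.environ if environ is None else environ
    return env.get("CI", "") == ""
```

Passing a plain dictionary instead of the real environment keeps the check easy to exercise without touching `os.environ`.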

docs/PLUGIN_DOC.md

Lines changed: 34 additions & 4 deletions
@@ -6,13 +6,13 @@
 | --- | --- | --- | --- | --- | --- |
 | AmdSmiPlugin | firmware --json<br>list --json<br>partition --json<br>process --json<br>ras --cper --folder={folder}<br>static -g all --json<br>static -g {gpu_id} --json<br>version --json | **Analyzer Args:**<br>- `check_static_data`: bool<br>- `expected_gpu_processes`: Optional[int]<br>- `expected_max_power`: Optional[int]<br>- `expected_driver_version`: Optional[str]<br>- `expected_memory_partition_mode`: Optional[str]<br>- `expected_compute_partition_mode`: Optional[str]<br>- `expected_pldm_version`: Optional[str]<br>- `l0_to_recovery_count_error_threshold`: Optional[int]<br>- `l0_to_recovery_count_warning_threshold`: Optional[int]<br>- `vendorid_ep`: Optional[str]<br>- `vendorid_ep_vf`: Optional[str]<br>- `devid_ep`: Optional[str]<br>- `devid_ep_vf`: Optional[str]<br>- `sku_name`: Optional[str]<br>- `expected_xgmi_speed`: Optional[list[float]]<br>- `analysis_range_start`: Optional[datetime.datetime]<br>- `analysis_range_end`: Optional[datetime.datetime] | [AmdSmiDataModel](#AmdSmiDataModel-Model) | [AmdSmiCollector](#Collector-Class-AmdSmiCollector) | [AmdSmiAnalyzer](#Data-Analyzer-Class-AmdSmiAnalyzer) |
 | BiosPlugin | sh -c 'cat /sys/devices/virtual/dmi/id/bios_version'<br>wmic bios get SMBIOSBIOSVersion /Value | **Analyzer Args:**<br>- `exp_bios_version`: list[str]<br>- `regex_match`: bool | [BiosDataModel](#BiosDataModel-Model) | [BiosCollector](#Collector-Class-BiosCollector) | [BiosAnalyzer](#Data-Analyzer-Class-BiosAnalyzer) |
-| CmdlinePlugin | cat /proc/cmdline | **Analyzer Args:**<br>- `required_cmdline`: Union[str, list]<br>- `banned_cmdline`: Union[str, list] | [CmdlineDataModel](#CmdlineDataModel-Model) | [CmdlineCollector](#Collector-Class-CmdlineCollector) | [CmdlineAnalyzer](#Data-Analyzer-Class-CmdlineAnalyzer) |
+| CmdlinePlugin | cat /proc/cmdline | **Analyzer Args:**<br>- `required_cmdline`: Union[str, List]<br>- `banned_cmdline`: Union[str, List]<br>- `os_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig]<br>- `platform_overrides`: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] | [CmdlineDataModel](#CmdlineDataModel-Model) | [CmdlineCollector](#Collector-Class-CmdlineCollector) | [CmdlineAnalyzer](#Data-Analyzer-Class-CmdlineAnalyzer) |
 | DeviceEnumerationPlugin | powershell -Command "(Get-WmiObject -Class Win32_Processor \| Measure-Object).Count"<br>lspci -d {vendorid_ep}: \| grep -i 'VGA\\|Display\\|3D' \| wc -l<br>powershell -Command "(wmic path win32_VideoController get name \| findstr AMD \| Measure-Object).Count"<br>lscpu<br>lshw<br>lspci -d {vendorid_ep}: \| grep -i 'Virtual Function' \| wc -l<br>powershell -Command "(Get-VMHostPartitionableGpu \| Measure-Object).Count" | **Analyzer Args:**<br>- `cpu_count`: Optional[list[int]]<br>- `gpu_count`: Optional[list[int]]<br>- `vf_count`: Optional[list[int]] | [DeviceEnumerationDataModel](#DeviceEnumerationDataModel-Model) | [DeviceEnumerationCollector](#Collector-Class-DeviceEnumerationCollector) | [DeviceEnumerationAnalyzer](#Data-Analyzer-Class-DeviceEnumerationAnalyzer) |
 | DimmPlugin | sh -c 'dmidecode -t 17 \| tr -s " " \| grep -v "Volatile\\|None\\|Module" \| grep Size' 2>/dev/null<br>dmidecode<br>wmic memorychip get Capacity | - | [DimmDataModel](#DimmDataModel-Model) | [DimmCollector](#Collector-Class-DimmCollector) | - |
 | DkmsPlugin | dkms status<br>dkms --version | **Analyzer Args:**<br>- `dkms_status`: Union[str, list]<br>- `dkms_version`: Union[str, list]<br>- `regex_match`: bool | [DkmsDataModel](#DkmsDataModel-Model) | [DkmsCollector](#Collector-Class-DkmsCollector) | [DkmsAnalyzer](#Data-Analyzer-Class-DkmsAnalyzer) |
 | DmesgPlugin | dmesg --time-format iso -x<br>ls -1 /var/log/dmesg* 2>/dev/null \| grep -E '^/var/log/dmesg(\.[0-9]+(\.gz)?)?$' \|\| true | **Built-in Regexes:**<br>- Out of memory error: `(?:oom_kill_process.*)\|(?:Out of memory.*)`<br>- I/O Page Fault: `IO_PAGE_FAULT`<br>- Kernel Panic: `\bkernel panic\b.*`<br>- SQ Interrupt: `sq_intr`<br>- SRAM ECC: `sram_ecc.*`<br>- Failed to load driver. IP hardware init error.: `\[amdgpu\]\] \*ERROR\* hw_init of IP block.*`<br>- Failed to load driver. IP software init error.: `\[amdgpu\]\] \*ERROR\* sw_init of IP block.*`<br>- Real Time throttling activated: `sched: RT throttling activated.*`<br>- RCU preempt detected stalls: `rcu_preempt detected stalls.*`<br>- RCU preempt self-detected stall: `rcu_preempt self-detected stall.*`<br>- QCM fence timeout: `qcm fence wait loop timeout.*`<br>- General protection fault: `(?:[\w-]+(?:\[[0-9.]+\])?\s+)?general protectio...`<br>- Segmentation fault: `(?:segfault.*in .*\[)\|(?:[Ss]egmentation [Ff]au...`<br>- Failed to disallow cf state: `amdgpu: Failed to disallow cf state.*`<br>- Failed to terminate tmr: `\*ERROR\* Failed to terminate tmr.*`<br>- Suspend of IP block failed: `\*ERROR\* suspend of IP block <\w+> failed.*`<br>- amdgpu Page Fault: `(amdgpu \w{4}:\w{2}:\w{2}\.\w:\s+amdgpu:\s+\[\S...`<br>- Page Fault: `page fault for address.*`<br>- Fatal error during GPU init: `(?:amdgpu)(.*Fatal error during GPU init)\|(Fata...`<br>- PCIe AER Error: `(?:pcieport )(.*AER: aer_status.*)\|(aer_status.*)`<br>- Failed to read journal file: `Failed to read journal file.*`<br>- Journal file corrupted or uncleanly shut down: `journal corrupted or uncleanly shut down.*`<br>- ACPI BIOS Error: `ACPI BIOS Error`<br>- ACPI Error: `ACPI Error`<br>- Filesystem corrupted!: `EXT4-fs error \(device .*\):`<br>- Error in buffered IO, check filesystem integrity: `(Buffer I\/O error on dev)(?:ice)? (\w+)`<br>- PCIe card no longer present: `pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(Slot\(...`<br>- PCIe Link Down: `pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(Slot\(...`<br>- Mismatched clock configuration between PCIe device and host: `pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(curren...`<br>- RAS Correctable Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`<br>- RAS Uncorrectable Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`<br>- RAS Deferred Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`<br>- RAS Corrected PCIe Error: `((?:\[Hardware Error\]:\s+)?event severity: cor...`<br>- GPU Reset: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`<br>- GPU reset failed: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`<br>- ACA Error: `(Accelerator Check Architecture[^\n]*)(?:\n[^\n...`<br>- ACA Error: `(Accelerator Check Architecture[^\n]*)(?:\n[^\n...`<br>- MCE Error: `\[Hardware Error\]:.+MC\d+_STATUS.*(?:\n.*){0,5}`<br>- Mode 2 Reset Failed: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)? (...`<br>- RAS Corrected Error: `(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....`<br>- SGX Error: `x86/cpu: SGX disabled by BIOS`<br>- GPU Throttled: `amdgpu \w{4}:\w{2}:\w{2}.\w: amdgpu: WARN: GPU ...`<br>- LNet: ko2iblnd has no matching interfaces: `(?:\[[^\]]+\]\s*)?LNetError:.*ko2iblnd:\s*No ma...`<br>- LNet: Error starting up LNI: `(?:\[[^\]]+\]\s*)?LNetError:\s*.*Error\s*-?\d+\...`<br>- Lustre: network initialisation failed: `LustreError:.*ptlrpc_init_portals\(\).*network ...` | [DmesgData](#DmesgData-Model) | [DmesgCollector](#Collector-Class-DmesgCollector) | [DmesgAnalyzer](#Data-Analyzer-Class-DmesgAnalyzer) |
 | FabricsPlugin | ibstat<br>ibv_devinfo<br>ls -l /sys/class/infiniband/*/device/net<br>mst start<br>mst status -v<br>ofed_info -s<br>rdma dev<br>rdma link | - | [FabricsDataModel](#FabricsDataModel-Model) | [FabricsCollector](#Collector-Class-FabricsCollector) | - |
-| JournalPlugin | journalctl --no-pager --system --output=short-iso | - | [JournalData](#JournalData-Model) | [JournalCollector](#Collector-Class-JournalCollector) | - |
+| JournalPlugin | journalctl --no-pager --system --output=short-iso<br>journalctl --no-pager --system --output=json | **Analyzer Args:**<br>- `check_priority`: Optional[int]<br>- `group`: bool | [JournalData](#JournalData-Model) | [JournalCollector](#Collector-Class-JournalCollector) | [JournalAnalyzer](#Data-Analyzer-Class-JournalAnalyzer) |
 | KernelPlugin | sh -c 'uname -a'<br>wmic os get Version /Value | **Analyzer Args:**<br>- `exp_kernel`: Union[str, list]<br>- `regex_match`: bool | [KernelDataModel](#KernelDataModel-Model) | [KernelCollector](#Collector-Class-KernelCollector) | [KernelAnalyzer](#Data-Analyzer-Class-KernelAnalyzer) |
 | KernelModulePlugin | cat /proc/modules<br>modinfo amdgpu<br>wmic os get Version /Value | **Analyzer Args:**<br>- `kernel_modules`: dict[str, dict]<br>- `regex_filter`: list[str] | [KernelModuleDataModel](#KernelModuleDataModel-Model) | [KernelModuleCollector](#Collector-Class-KernelModuleCollector) | [KernelModuleAnalyzer](#Data-Analyzer-Class-KernelModuleAnalyzer) |
 | MemoryPlugin | free -b<br>lsmem<br>numactl -H<br>wmic OS get FreePhysicalMemory /Value; wmic ComputerSystem get TotalPhysicalMemory /Value | **Analyzer Args:**<br>- `ratio`: float<br>- `memory_threshold`: str | [MemoryDataModel](#MemoryDataModel-Model) | [MemoryCollector](#Collector-Class-MemoryCollector) | [MemoryAnalyzer](#Data-Analyzer-Class-MemoryAnalyzer) |
@@ -275,6 +275,7 @@ Read journal log via journalctl.
 
 - **SUPPORTED_OS_FAMILY**: `{<OSFamily.LINUX: 3>}`
 - **CMD**: `journalctl --no-pager --system --output=short-iso`
+- **CMD_JSON**: `journalctl --no-pager --system --output=json`
 
 ### Provides Data
 
@@ -283,6 +284,7 @@ JournalData
 ### Commands
 
 - journalctl --no-pager --system --output=short-iso
+- journalctl --no-pager --system --output=json
 
 ## Collector Class KernelCollector
 
@@ -866,6 +868,7 @@ Data model for journal logs
 ### Model annotations and fields
 
 - **journal_log**: `str`
+- **journal_content_json**: `list[nodescraper.plugins.inband.journal.journaldata.JournalJsonEntry]`
 
 ## KernelDataModel Model
 
@@ -1248,6 +1251,16 @@ Check dmesg for errors
 - - LNet: Error starting up LNI: `(?:\[[^\]]+\]\s*)?LNetError:\s*.*Error\s*-?\d+\...`
 - - Lustre: network initialisation failed: `LustreError:.*ptlrpc_init_portals\(\).*network ...`
 
+## Data Analyzer Class JournalAnalyzer
+
+### Description
+
+Check journalctl for errors
+
+**Bases**: ['DataAnalyzer']
+
+**Link to code**: [journal_analyzer.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/journal/journal_analyzer.py)
+
 ## Data Analyzer Class KernelAnalyzer
 
 ### Description
@@ -1413,8 +1426,10 @@ Check sysctl matches expected sysctl details
 
 ### Annotations / fields
 
-- **required_cmdline**: `Union[str, list]`
-- **banned_cmdline**: `Union[str, list]`
+- **required_cmdline**: `Union[str, List]`
+- **banned_cmdline**: `Union[str, List]`
+- **os_overrides**: `Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig]`
+- **platform_overrides**: `Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig]`
 
 ## Analyzer Args Class DeviceEnumerationAnalyzerArgs
 
@@ -1440,6 +1455,21 @@ Check sysctl matches expected sysctl details
 - **dkms_version**: `Union[str, list]`
 - **regex_match**: `bool`
 
+## Analyzer Args Class JournalAnalyzerArgs
+
+### Description
+
+Arguments for journal analyzer
+
+**Bases**: ['TimeRangeAnalysisArgs']
+
+**Link to code**: [analyzer_args.py](https://github.com/amd/node-scraper/blob/HEAD/nodescraper/plugins/inband/journal/analyzer_args.py)
+
+### Annotations / fields
+
+- **check_priority**: `Optional[int]`
+- **group**: `bool`
+
 ## Analyzer Args Class KernelAnalyzerArgs
 
 **Bases**: ['AnalyzerArgs']
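The new `JournalAnalyzerArgs` fields (`check_priority`, `group`) and the added `journalctl --output=json` command suggest priority-based filtering of JSON journal entries. The sketch below shows one way such filtering could work; it is not the `JournalAnalyzer` implementation. It assumes `check_priority` is an upper bound on the syslog priority (lower numbers are more severe), that grouping means counting identical messages, and that entries without a `PRIORITY` field are treated as priority 6. The helper names `select_messages` and `group_counts` are hypothetical.

```python
import json
from collections import Counter


def select_messages(json_lines, check_priority=None):
    """Return MESSAGE values whose PRIORITY is at or below check_priority."""
    messages = []
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)  # journalctl emits one JSON object per line
        # PRIORITY arrives as a string; default to 6 (info) when absent (assumption).
        priority = int(entry.get("PRIORITY", 6))
        if check_priority is not None and priority > check_priority:
            continue
        messages.append(str(entry.get("MESSAGE", "")))
    return messages


def group_counts(messages):
    """Collapse repeated messages into a message -> occurrence count mapping."""
    return dict(Counter(messages))
```

Keeping selection and grouping as separate steps makes it easy to inspect the filtered entries before they are collapsed.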
