Phase E hygiene proper: E1 progress + E2 parity probe + E3 cookbook #13

Merged
Sponge merged 1 commit from feature/dune-extract-phase-e-hygiene into develop 2026-05-28 08:46:08 +00:00
Owner

Summary

Goal: hygine E. Closes the three filed E-tasks. Distinct from the prior quirk-fixes commit (PR #12) which lifted the sharpest-edged items; this is the formal Phase E delivery.

E1 — Verbose progress reporting (always-on heartbeat)

enrichment.enrich_with_row_data + localization.build_localization_index both gain a heartbeat= callback (independent of --verbose) that fires once per scanned pak with entries / DTs / rows / elapsed stats. __main__.py wires it to a _stderr_log so non-verbose users see per-pak progress during the ~10-min row-decode pass. Final Done in Xm Ys (Z.Zs total). summary closes every run.

Sample non-verbose output:

Row data scan (data-area walker + B3 row decode) ...
  rows: DuneSandbox.pak           16406 entries    0 DTs      0 rows   401.5s
  rows: GUI.pak                   13233 entries    1 DTs   2183 rows     2.0s
  rows: Input.pak                  1560 entries    2 DTs     12 rows     0.1s
  rows: Placables.pak                72 entries    2 DTs    122 rows     0.0s
  rows: Systems.pak               19274 entries   58 DTs    337 rows     3.6s
Row data: 63 DataTables parsed, 2654 rows recovered in 435.2s.
Done in 7m 17s (438.0s total).

Paks that completed silently (0 DTs AND < 5 s) are suppressed to keep the log readable.

E2 — JSON + CSV cross-format parity probe

New validate/probe_outputs.py builds the catalog once (FDI-only, fast — no row decode needed for parity checking) and runs it through all three writer formats, comparing per-category stem sets + DataTable inventory + global totals against the in-memory truth.

Result: 8/8 categories OK; DataTable inventory 697/697 OK; JSON summary matches truth.

Category            truth      md    json     csv  result
Weapons              4304    4304    4304    4304  OK
Vehicles             8591    8591    8591    8591  OK
Garments             5205    5205    5205    5205  OK
Augmentations         194     194     194     194  OK
Customization         499     499     499     499  OK
Construction         3763    3763    3763    3763  OK
Misc                  883     883     883     883  OK
Utility              2134    2134    2134    2134  OK

E2 cross-format parity: PASS

Markdown re-parse uses a fenced-block regex (lossier than JSON / CSV) with a ≥95% threshold to tolerate edge-case formatting; JSON and CSV match the in-memory catalog exactly. JSON Schema is a Phase F follow-on.

E3 — End-user troubleshooting cookbook

New tools/dune-extract/TROUBLESHOOTING.md covering:

Section Recipes
Install + auto-detect 4
Oodle backend 3
Pak parsing edge cases 2
Output formats 3
Performance 2
Diff subcommand 2
When all else fails (generic checklist)

Each section maps a verbatim error message to its cause + smallest-possible fix. Targets adopters cold-starting on Linux native / Linux Proton / WSL / native Windows.

No regression

Probe Result
B1 — probe_pak_entries.py 574/574 (100%) dual-header adjacency
B2 — probe_read_uasset.py 4/4 with Oodle backend
B3 — probe_datatable.py Systems.pak 5 DataTables / 60 rows
E2 — probe_outputs.py PASS (8/8 categories, 697/697 DataTables)

The heartbeat callback is opt-in (None default), so any library consumer of enrich_with_row_data / build_localization_index is unaffected.

Test plan

cd tools/dune-extract
python3 validate/probe_pak_entries.py
python3 validate/probe_read_uasset.py
python3 validate/probe_datatable.py Systems.pak
python3 validate/probe_outputs.py                # expect E2 PASS
DUNE_EXTRACT_SKIP_LOCALIZATION=1 python3 -m dune_extract \
    --categories Misc --output-dir /tmp/test-e1  # ~7 min; expect per-pak
                                                 # heartbeat lines + Done in Xm Ys
## Summary Goal: *hygine E*. Closes the three filed E-tasks. Distinct from the prior quirk-fixes commit (PR #12) which lifted the sharpest-edged items; this is the formal Phase E delivery. ## E1 — Verbose progress reporting (always-on heartbeat) `enrichment.enrich_with_row_data` + `localization.build_localization_index` both gain a `heartbeat=` callback (independent of `--verbose`) that fires once per scanned pak with `entries / DTs / rows / elapsed` stats. `__main__.py` wires it to a `_stderr_log` so non-verbose users see per-pak progress during the ~10-min row-decode pass. Final `Done in Xm Ys (Z.Zs total).` summary closes every run. Sample non-verbose output: ``` Row data scan (data-area walker + B3 row decode) ... rows: DuneSandbox.pak 16406 entries 0 DTs 0 rows 401.5s rows: GUI.pak 13233 entries 1 DTs 2183 rows 2.0s rows: Input.pak 1560 entries 2 DTs 12 rows 0.1s rows: Placables.pak 72 entries 2 DTs 122 rows 0.0s rows: Systems.pak 19274 entries 58 DTs 337 rows 3.6s Row data: 63 DataTables parsed, 2654 rows recovered in 435.2s. Done in 7m 17s (438.0s total). ``` Paks that completed silently (0 DTs AND < 5 s) are suppressed to keep the log readable. ## E2 — JSON + CSV cross-format parity probe New `validate/probe_outputs.py` builds the catalog once (FDI-only, fast — no row decode needed for parity checking) and runs it through all three writer formats, comparing per-category stem sets + DataTable inventory + global totals against the in-memory truth. **Result: 8/8 categories OK; DataTable inventory 697/697 OK; JSON summary matches truth.** ``` Category truth md json csv result Weapons 4304 4304 4304 4304 OK Vehicles 8591 8591 8591 8591 OK Garments 5205 5205 5205 5205 OK Augmentations 194 194 194 194 OK Customization 499 499 499 499 OK Construction 3763 3763 3763 3763 OK Misc 883 883 883 883 OK Utility 2134 2134 2134 2134 OK E2 cross-format parity: PASS ``` Markdown re-parse uses a fenced-block regex (lossier than JSON / CSV) with a ≥95% threshold to tolerate edge-case formatting; JSON and CSV match the in-memory catalog exactly. JSON Schema is a Phase F follow-on. ## E3 — End-user troubleshooting cookbook New `tools/dune-extract/TROUBLESHOOTING.md` covering: | Section | Recipes | |---|---:| | Install + auto-detect | 4 | | Oodle backend | 3 | | Pak parsing edge cases | 2 | | Output formats | 3 | | Performance | 2 | | Diff subcommand | 2 | | When all else fails | (generic checklist) | Each section maps a verbatim error message to its cause + smallest-possible fix. Targets adopters cold-starting on Linux native / Linux Proton / WSL / native Windows. ## No regression | Probe | Result | |---|---| | B1 — `probe_pak_entries.py` | **574/574 (100%)** dual-header adjacency | | B2 — `probe_read_uasset.py` | **4/4** with Oodle backend | | B3 — `probe_datatable.py Systems.pak` | **5 DataTables / 60 rows** | | E2 — `probe_outputs.py` | **PASS** (8/8 categories, 697/697 DataTables) | The heartbeat callback is opt-in (`None` default), so any library consumer of `enrich_with_row_data` / `build_localization_index` is unaffected. ## Test plan ```bash cd tools/dune-extract python3 validate/probe_pak_entries.py python3 validate/probe_read_uasset.py python3 validate/probe_datatable.py Systems.pak python3 validate/probe_outputs.py # expect E2 PASS DUNE_EXTRACT_SKIP_LOCALIZATION=1 python3 -m dune_extract \ --categories Misc --output-dir /tmp/test-e1 # ~7 min; expect per-pak # heartbeat lines + Done in Xm Ys ```
User: "hygine E" — closes the three filed E-tasks. Distinct from the
prior quirk-fixes commit (PR #12) which lifted the sharpest-edged
items; this is the formal Phase E delivery.

E1 — Verbose progress reporting (always-on heartbeat)
  enrichment.enrich_with_row_data + localization.build_localization_index
  both gain a heartbeat= callback (independent of --verbose) that fires
  once per scanned pak with `entries / DTs / rows / elapsed` stats.
  __main__.py wires it to a _stderr_log so non-verbose users see
  per-pak progress during the ~10-min row-decode pass. Final
  "Done in Xm Ys (Z.Zs total)." summary closes every run.
  Paks that completed silently (0 DTs AND <5s elapsed) are suppressed
  to keep the log readable.

  Sample non-verbose output (single-category test run, ~7 min):
    Row data scan (data-area walker + B3 row decode) ...
      rows: DuneSandbox.pak           16406 entries    0 DTs      0 rows   401.5s
      rows: GUI.pak                   13233 entries    1 DTs   2183 rows     2.0s
      rows: Input.pak                  1560 entries    2 DTs     12 rows     0.1s
      rows: Placables.pak                72 entries    2 DTs    122 rows     0.0s
      rows: Systems.pak               19274 entries   58 DTs    337 rows     3.6s
    Row data: 63 DataTables parsed, 2654 rows recovered in 435.2s.
    ...
    Done in 7m 17s (438.0s total).

E2 — JSON + CSV cross-format parity probe
  New validate/probe_outputs.py builds the catalog once (FDI-only,
  fast — no row decode needed for parity checking) and runs it
  through all three writer formats, comparing per-category stem sets +
  DataTable inventory + global totals.

  Result: 8/8 categories OK across Markdown / JSON / CSV.
    Weapons      4304 / 4304 / 4304 / 4304   OK
    Vehicles     8591 / 8591 / 8591 / 8591   OK
    ... (every category matches truth)
    DataTable inventory truth vs JSON: 697 / 697 OK
    JSON summary matches in-memory catalog
  E2 cross-format parity: PASS

  Markdown re-parse uses a fenced-block regex (lossier than JSON / CSV)
  with a >=95% threshold to tolerate edge-case formatting; JSON and CSV
  match the in-memory catalog exactly. JSON Schema definition is a
  Phase F follow-on (the _CatalogJSONEncoder makes one derivable from
  the enrichment.py dataclasses).

E3 — End-user troubleshooting cookbook
  New tools/dune-extract/TROUBLESHOOTING.md — recipes-by-error-message
  cookbook covering:
    * Install + auto-detect (4 recipes)
    * Oodle backend (3 recipes incl. Windows DLL options)
    * Pak parsing edge cases (2)
    * Output formats (3)
    * Performance (2)
    * Diff subcommand (2)
    * Generic "when all else fails" checklist
  Targets adopters cold-starting on Linux native / Linux Proton /
  WSL / native Windows. Each section maps a verbatim error message to
  its cause + smallest-possible fix.

No regression — B1/B2/B3 probes all pass identically. The heartbeat
callback is opt-in (None default), so any library consumer of
enrich_with_row_data / build_localization_index is unaffected.
Sponge merged commit 5bc4269d03 into develop 2026-05-28 08:46:08 +00:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
Sponge/Dune-Awakening-Server-Tools!13
No description provided.