Phase E hygiene: iron out the quirks (8 fixes + sample regen) #12

Merged
Sponge merged 1 commit from feature/dune-extract-quirk-fixes into develop 2026-05-28 07:41:26 +00:00
Owner

Summary

User goal: iron out the quirks. Closes the 8 ironable items from the post-Phase-D test+validate sweep. Doesn't fully complete E1/E2/E3 (those need broader scope) but lifts the sharpest-edged issues.

Fixes (8 + 2 bonuses)

# Fix Module
1 FieldFilter recurses into nested struct values — --include-X flags now actually strip nested-struct keys field_filter.py
2 Native struct decoders for FGameplayTagContainer/FGameplayTag/FPrimaryAssetId/FSoftClassPath/FSoftObjectPath (eliminates _raw_ for the common cases) uasset_parser.py
3 CompressionBlockSize decoded as stored_value >> 8 (Funcom's actual << 8 encoding) pak_extract.py
4 JSON output uses _CatalogJSONEncoder — tuples→arrays, sets→sorted, bytes→hex, dataclasses→dicts catalog_writer.py
5 items-rows.csv gains value column (raw JSON-encoded value) alongside value_display catalog_writer.py
6 Catalog timestamps now UTC ISO-8601 (2026-05-28T07:39:06Z) catalog_writer.py
7 --client-path fail-fast validation — checks for paks before proceeding __main__.py
8 Data-area walker rejects comp_method > 6 as garbage (fixes spurious entries from ST_Localization_Dialogue resync) pak_extract.py
+ ST_Localization_Dialogue.uexp investigated — non-standard chunk header layout; other 10 ST_* tables still parse (27,816 key→English pairs) docs
+ D3 heading marker [ ][x] (shipped in PR #10 but the checkbox wasn't toggled) DECOMPOSED.md

Sample catalog regenerated

DataExtract/ITEMS/ re-emitted with all fixes baked in:

  • ITEMS-Loot-Tables.md shows RequiredTags={ tags=[] }, ForbiddenTags={ tags=[] } (was =_raw_)
  • ITEMS-INDEX.md header: Generated by dune-extract v0.1.0 on 2026-05-28T07:39:06Z (UTC)
  • 11 markdown files, ~6 MB total committed

Validated

py_compile across all modules                              clean
B1 / B2 / B3 probes                                        identical to prior pass
Localization.pak walker (after #8)                         21 entries (was 22 w/ bogus #7)
                                                           0 entries with comp_method > 6
FieldFilter recursion test                                 nested Icon/Mesh stripped at default
                                                           kept under --include-asset-paths
JSON encoder smoke (tuple/bytes/Path/dataclass/set)        all clean
--client-path /tmp (no paks)                               fail-fast in ~10 ms

Deferred (still documented as known limitations)

  • ~11-min wall-time — DuneSandbox.pak + MiB-aligned gaps; parallelism is Phase E proper.
  • DataTable categorization heuristic — directory-driven; broader design work in phase F.
  • FGameplayTagQuery (TagDictionary/QueryTokenStream) — more involved than the simple containers, deferred.
  • ST_Localization_Dialogue — non-standard chunk header, deferred.

Test plan

cd tools/dune-extract
python3 validate/probe_pak_entries.py
python3 validate/probe_read_uasset.py
python3 validate/probe_datatable.py Systems.pak
python3 -m dune_extract --client-path /tmp 2>&1 | head -3        # fail-fast
python3 -m dune_extract --dry-run                                # auto-detect
grep -c 'tags=\[' DataExtract/ITEMS/ITEMS-Loot-Tables.md          # GameplayTagContainer hits
grep '^>' DataExtract/ITEMS/ITEMS-INDEX.md | head -1               # UTC timestamp
## Summary User goal: *iron out the quirks*. Closes the 8 ironable items from the post-Phase-D test+validate sweep. Doesn't fully complete E1/E2/E3 (those need broader scope) but lifts the sharpest-edged issues. ## Fixes (8 + 2 bonuses) | # | Fix | Module | |---|---|---| | 1 | `FieldFilter` recurses into nested struct values — `--include-X` flags now actually strip nested-struct keys | `field_filter.py` | | 2 | Native struct decoders for `FGameplayTagContainer`/`FGameplayTag`/`FPrimaryAssetId`/`FSoftClassPath`/`FSoftObjectPath` (eliminates `_raw_` for the common cases) | `uasset_parser.py` | | 3 | `CompressionBlockSize` decoded as `stored_value >> 8` (Funcom's `actual << 8` encoding) | `pak_extract.py` | | 4 | JSON output uses `_CatalogJSONEncoder` — tuples→arrays, sets→sorted, bytes→hex, dataclasses→dicts | `catalog_writer.py` | | 5 | `items-rows.csv` gains `value` column (raw JSON-encoded value) alongside `value_display` | `catalog_writer.py` | | 6 | Catalog timestamps now UTC ISO-8601 (`2026-05-28T07:39:06Z`) | `catalog_writer.py` | | 7 | `--client-path` fail-fast validation — checks for paks before proceeding | `__main__.py` | | 8 | Data-area walker rejects `comp_method > 6` as garbage (fixes spurious entries from `ST_Localization_Dialogue` resync) | `pak_extract.py` | | + | `ST_Localization_Dialogue.uexp` investigated — non-standard chunk header layout; other 10 ST_* tables still parse (27,816 key→English pairs) | docs | | + | D3 heading marker `[ ]` → `[x]` (shipped in PR #10 but the checkbox wasn't toggled) | `DECOMPOSED.md` | ## Sample catalog regenerated `DataExtract/ITEMS/` re-emitted with all fixes baked in: - `ITEMS-Loot-Tables.md` shows `RequiredTags={ tags=[] }, ForbiddenTags={ tags=[] }` (was `=_raw_`) - `ITEMS-INDEX.md` header: `Generated by dune-extract v0.1.0 on 2026-05-28T07:39:06Z` (UTC) - 11 markdown files, ~6 MB total committed ## Validated ``` py_compile across all modules clean B1 / B2 / B3 probes identical to prior pass Localization.pak walker (after #8) 21 entries (was 22 w/ bogus #7) 0 entries with comp_method > 6 FieldFilter recursion test nested Icon/Mesh stripped at default kept under --include-asset-paths JSON encoder smoke (tuple/bytes/Path/dataclass/set) all clean --client-path /tmp (no paks) fail-fast in ~10 ms ``` ## Deferred (still documented as known limitations) - **~11-min wall-time** — DuneSandbox.pak + MiB-aligned gaps; parallelism is Phase E proper. - **DataTable categorization heuristic** — directory-driven; broader design work in phase F. - **`FGameplayTagQuery`** (TagDictionary/QueryTokenStream) — more involved than the simple containers, deferred. - **`ST_Localization_Dialogue`** — non-standard chunk header, deferred. ## Test plan ```bash cd tools/dune-extract python3 validate/probe_pak_entries.py python3 validate/probe_read_uasset.py python3 validate/probe_datatable.py Systems.pak python3 -m dune_extract --client-path /tmp 2>&1 | head -3 # fail-fast python3 -m dune_extract --dry-run # auto-detect grep -c 'tags=\[' DataExtract/ITEMS/ITEMS-Loot-Tables.md # GameplayTagContainer hits grep '^>' DataExtract/ITEMS/ITEMS-INDEX.md | head -1 # UTC timestamp ```
User: "iron out the quirks" — closes the 8 ironable items from the
post-Phase-D test+validate sweep. Doesn't fully complete E1/E2/E3
(those need broader work) but lifts the sharpest-edged issues.

Fixes:

  1. FieldFilter recurses into nested struct values
     --include-asset-paths / --include-lore / --include-dev-commentary
     now actually strip those buckets from inside StaticData = { ... }
     and similar nested tagged-structs. Pseudo-dict VALUES (FText
     {_string_table, _key}, SoftObject, Object, raw fallbacks) are
     recognised by their `_`-marker keys and never recursed into.

  2. Native struct decoders for common types previously emitting _raw_
     - FGameplayTagContainer -> {tags:[...]}
     - FGameplayTag -> {tag:...}
     - FPrimaryAssetId -> {type, name}
     - FSoftClassPath / FSoftObjectPath -> {_soft_object, _sub}
     Top _raw_ culprits before: TagDictionary (3312), QueryTokenStream
     (3312), RequiredTags (1656), ForbiddenTags (1656), ItemTemplateId
     (714). RequiredTags / ForbiddenTags / GameplayTag / SoftObjectPath
     now decode. FGameplayTagQuery internals (TagDictionary /
     QueryTokenStream) remain _raw_ — Phase E follow-on.

  3. CompressionBlockSize decoded properly as stored_value >> 8
     (Funcom encodes per-block uncompressed size as actual << 8 in the
     in-pak header). Previously hardcoded UE-default 64 KiB —
     empirically the same value but now derived from the on-disk field.
     Future-proofs against builds where Funcom changes block size.

  4. JSON output uses _CatalogJSONEncoder instead of default=str.
     - tuples -> arrays (was Python-repr strings)
     - sets -> sorted arrays
     - bytes -> hex strings
     - dataclasses -> dicts (via asdict)
     - Path -> str

  5. items-rows.csv gains a `value` column carrying the raw
     JSON-encoded value (programmatic consumers) alongside the existing
     display-formatted column (renamed to value_display — ✓/✗ bools,
     truncated strings, etc., for spreadsheet humans).

  6. Catalog timestamps now UTC ISO-8601 (2026-05-28T07:39:06Z)
     instead of local-time strftime. Two extracts from different
     timezones produce stable headers that diff cleanly.

  7. --client-path fail-fast validation — checks the target actually
     contains Dune Awakening paks via _has_paks() before proceeding.
     Catches typos in ~10 ms instead of failing 3 seconds later.

  8. Data-area walker rejects entries with comp_method > 6 as garbage.
     Previously the resync after a non-standard chunk
     (ST_Localization_Dialogue.uexp's unknown chunk header) yielded a
     spurious entry with comp_method=12,632,304. Now those are
     correctly skipped.

  + ST_Localization_Dialogue.uexp investigated: uses a non-standard
    chunk header (leading 32-byte table of offsets we don't decode).
    The other 10 ST_* tables parse cleanly (27,816 key->English pairs);
    Dialogue keys remain unresolved. Documented; cracking the format
    is a Phase E follow-on if dialogue text becomes important.

  + D3 heading [ ] -> [x] correction (shipped in PR #10 but the
    checkbox marker wasn't toggled).

Sample catalog regen: DataExtract/ITEMS/ re-emitted with all fixes
baked in. ITEMS-Loot-Tables.md shows RequiredTags/ForbiddenTags
decoding to {tags=[]} (was =_raw_). ITEMS-INDEX.md header now reads
"2026-05-28T07:39:06Z" UTC.

Validated:
  - py_compile across all modules: clean
  - B1 / B2 / B3 probes: identical to prior pass
  - Localization.pak walker: 21 entries (was 22 with bogus #7); 0
    entries with comp_method > 6
  - FieldFilter recursion test: nested Icon/Mesh stripped when
    --include-asset-paths is off, kept when on
  - JSON encoder smoke: tuple/bytes/Path/dataclass/set all clean
  - --client-path /tmp (no paks): fail-fast error in ~10 ms
Sponge merged commit 4f12a975b0 into develop 2026-05-28 07:41:26 +00:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
Sponge/Dune-Awakening-Server-Tools!12
No description provided.