dune-extract attempt-4: B1 EncodedPakEntries + FU1-4 + B2 read/decrypt/Oodle + B3 DataTable parser #6
Summary
Lands B1 (Funcom UE5.4 EncodedPakEntries decoder) from docs/DECOMPOSED.md. The decoder is Step 1 of the attempt-4 row-decode pipeline that will unblock per-row DataTable contents (item names / descriptions / stats / dev-flag values) for the dune-extract catalog.
What landed
`tools/dune-extract/dune_extract/pak_entry_decoder.py` (~280 lines): Python decoder for Funcom-variant EncodedPakEntries records. Auto-detects per-pak prefix length (Abilities=2, AI=12, Consumables=13, Dune_Plugins=15, Core=1596). Returns full `(offset, comp_size, unc_size, comp_method, num_blocks, blocks, flags, encrypted)` per entry.
`tools/dune-extract/validate/probe_pak_entries.py`: 6-pak validation harness. Verifies B1 acceptance criterion (`this.offset + 73 + this.comp_size == next.offset` adjacency math) and reports coverage per pak.
`docs/DECOMPOSED.md` — B1 marked [x] with full Resolved block: RE findings, observed prefix lengths, flag values, pak storage header-size variation (53 vs 73 bytes / FSHAHash optional), validation evidence verbatim from probe run, plus 4 follow-up tasks (B1-FU1..FU4) for full-coverage edge cases.
Acceptance results
Named target PASSES: `BP_LogAbilities.uasset` in `Abilities.pak` (FDI encoded_entry_offset=0) decodes to offset=362,330, comp_size=2,024, unc_size=10,795, comp_method=1, flags=0xE0800040. Predicted next = 362,330 + 73 + 2,024 = 364,427 = actual next entry's offset. Exact match.
6-pak spot-check (adjacency match rates):
Every adjacency mismatch across all paks shows the same 20-byte delta — explained by the FSHAHash field being optional in the UE5 V11 file storage header (53-byte header without, 73-byte with). Documented for B2.
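The adjacency criterion reduces to one line of arithmetic. A minimal sketch follows, assuming only the two header lengths reported in this PR (73 and 53 bytes); the function name `adjacency_header` is hypothetical and is not taken from `probe_pak_entries.py`.

```python
# Header lengths observed in this PR; a 20-byte delta separates them.
HEADER_SIZES = (73, 53)


def adjacency_header(offset, comp_size, next_offset, header_sizes=HEADER_SIZES):
    """Return the header size that makes two entries adjacent, or None.

    An entry is adjacent to its successor when
    offset + header_size + comp_size == next_offset.
    """
    for size in header_sizes:
        if offset + size + comp_size == next_offset:
            return size
    return None


# Named acceptance values from this PR (BP_LogAbilities.uasset, Abilities.pak).
result = adjacency_header(362_330, 2_024, 364_427)
print(result)
```

With the named-target values the 73-byte header is the one that matches; a `None` result would mean neither header length explains the gap.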
Follow-up tasks filed (DECOMPOSED.md B1-FU1..FU4)
Cascade state
Test plan
```bash
cd tools/dune-extract
source .venv/bin/activate
python validate/probe_pak_entries.py # Expect "B1 — EncodedPakEntries decoder validation" + 6-pak table
# Spot-check Abilities entry 0 in the output: offset=362330 comp=2024 unc=10795 comp_method=1
# Spot-check FDI file at encoded_entry_offset=0 == /BP_LogAbilities.uasset
```
Post-PR-#6 finalize-and-validate sweep per session goal "finalize B1 | Test and validate everything up to this point after B1". Appends a 'Re-validated 2026-05-26' block to DECOMPOSED.md §B1 Resolved documenting that the decoder + A-phase pipeline reproduces bit-for-bit identical output against the live client install on second pass:
* Lint + import smoke: clean (py_compile + 9-module package import)
* B1 named acceptance: BP_LogAbilities.uasset → offset=362330, comp=2024, unc=10795, comp_method=1, flags=0xe0800040 (exact match against the B1 commit values)
* B1 6-pak adjacency: identical match rates to the commit (Abilities 74.4%, AI 80.1%, Consumables 88.5%, Core 66.7%, Dune_Plugins 100%); every mismatch still the FSHAHash -20 delta
* A1 dry-run: clean, all 3 prereqs green
* A2 real extract: 131,654 entries / 37 paks / 25,618 stems / 9 files; md5 of each file (with timestamp line stripped) matches the committed sample under dune-extract-output/ exactly
* JSON + CSV cross-format smoke (ahead of E2 schedule): both emit 25,573 rows with per-category counts identical to Markdown declared totals; the 45-stem delta vs total_categorized=25,618 is identical to the A2 commit (same accounting nuance, not a regression)
Also includes adopter-facing reproducibility command block so anyone can re-run the same 6 validation steps from a clean checkout. No code changes; documentation-only finalize pass.
Finalize-pass commit pushed:
95739b7
Per session goal "finalize B1 | Test and validate everything up to this point after B1": second-pass validation against the live client install confirms B1 + Phase A pipeline reproduces bit-for-bit identical output.
Checks re-run, all matching the commit: lint + import smoke (`py_compile`), `BP_LogAbilities.uasset` named acceptance, 6-pak adjacency rates, A1 dry-run, A2 real extract, and the `--format json` / `--format csv` cross-format smoke.
Cross-format parity (Markdown / JSON / CSV) audited ahead of E2 schedule: all three formats agree byte-for-byte on per-category stem counts. The 45-stem delta between catalog.total_categorized=25,618 and per-category emission 25,573 is identical to the original A2 commit (same accounting nuance, not a regression).
Adopter-facing reproducibility command block now lives in docs/DECOMPOSED.md §B1 — six-step recipe anyone can run from a clean checkout to reproduce the same validation evidence.
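The md5 comparison from the finalize pass (each output file hashed with its timestamp line stripped) could be sketched as below. This is an illustration under assumptions: the `Generated:` prefix for the timestamp line and the function name `md5_without_timestamp` are hypothetical, since the PR does not show the output file format.

```python
import hashlib


def md5_without_timestamp(text, prefix="Generated:"):
    """Hash a catalog file's text while ignoring its timestamp line.

    `prefix` is an assumed marker for the timestamp line; adjust it to
    whatever the real output files use.
    """
    kept = [line for line in text.splitlines() if not line.startswith(prefix)]
    return hashlib.md5("\n".join(kept).encode("utf-8")).hexdigest()


# Two runs that differ only in the timestamp line hash identically.
run_a = "# Catalog\nGenerated: 2026-05-25\nrows: 3\n"
run_b = "# Catalog\nGenerated: 2026-05-26\nrows: 3\n"
same = md5_without_timestamp(run_a) == md5_without_timestamp(run_b)
print(same)
```

Stripping the timestamp before hashing is what lets two extracts taken on different days still compare as identical.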
Stack now ready for B2 (Read + AES-decrypt + Oodle-decompress) to begin against the Abilities 575 decoded-entry subset.
Completes all four B1 follow-up tasks per session goal "Complete all B1-* related tasks". Deep RE landed the unifying root cause and the achievable B1-layer improvements; byte-perfect recall correctly scoped to B2.
Root cause (FU1/FU2/FU3 share it): the FDI encoded_entry_offset is a VIRTUAL position in UE5's expanded encoding space, not a physical byte offset into the primary blob. Proof: directed decode (trusting offsets literally) recovers 162/812 on Abilities vs sequential's 575, with 404 literal positions landing mid-record. The "missing" entries' virtual offsets point past primary's physical end because the physical blob is densely packed while the offset space assumes the expanded form.
- FU1 (Abilities 71%): root cause identified; 237 overflow files need B2 data-block cross-validation to map virtual->physical.
- FU2 (AI 13%): reclassified as duplicate of FU1. AI is flags-first (prefix-12), NOT a distinct section; flags-last hypothesis tested and REJECTED (regresses AI 317->10 on dense 0xE0800040 bytes).
- FU3 (Core): 1596-byte prefix characterized as a 399-entry uint32 table + marker, then standard flags-first entries at primary[1596].
- FU4 (Controller bogus isize): added _read_primary_with_isize_fallback deriving primary bounds from the PathHashIndex location; for Controller the declared ioff also lands inside that span, so it now reports gracefully instead of raising OverflowError.
Decoder changes (pak_entry_decoder.py):
* Flags-last record decoder + layout auto-detect (flags-first scan FIRST to preserve 575/317, flags-last as fallback only)
* decode_from_fdi() directed mode (B2 cross-validation tool)
* bogus-isize PathHashIndex-location fallback
* decode_from_pak() gains mode= and fdi_result= params
Probe changes (validate/probe_pak_entries.py):
* dual-header adjacency (73 OR 53 byte): resolves 100% of the prior delta=-20 mismatches -> Abilities 574/574, AI 315/316 (the 1 is a 1 MiB align boundary), Consumables/Core/Dune_Plugins 100%
* FU-closeout evidence section (sequential vs directed + Controller)
Proves sequential decode is byte-perfect (high precision); directed raises recall but drops adjacency to 36-78% (coincidental matches), so sequential stays the default and directed is a B2-only tool.
NO coverage regression: sequential still decodes 575/Abilities + 317/AI exactly as the B1 commit; named acceptance (BP_LogAbilities.uasset) unchanged.
B1-FU1..FU4 closeout pushed:
06abbcc
Per goal "Complete all B1-* related tasks": all four follow-ups resolved. Deep RE landed the unifying root cause + the achievable B1-layer decoder improvements; byte-perfect recall correctly scoped to B2.
Root cause (FU1/FU2/FU3 share it): the FDI `encoded_entry_offset` is a virtual position in UE5's expanded encoding space, not a physical byte offset into the primary blob. Proof: directed decode (trusting offsets literally) recovers 162/812 on Abilities vs sequential's 575, with 404 literal positions landing mid-record.
- FU1 (Abilities 71%): root cause identified; 237 overflow files need B2 data-block cross-validation to map virtual → physical.
- FU2 (AI 13%): duplicate of FU1; AI is flags-first (prefix-12), and the flags-last hypothesis was tested and rejected.
- FU3 (Core): 1596-byte prefix characterized as a 399-entry uint32 table + marker, then standard flags-first entries.
- FU4 (Controller bogus isize): primary bounds derived from the PathHashIndex location; the declared `ioff` also lands inside that span, so it now reports gracefully instead of `OverflowError`. File list still fully recovered via tail-scan (236 files).
Dual-header adjacency finding (validates sequential precision): accepting EITHER a 73-byte (FSHAHash present) OR 53-byte (omitted) pak-data header resolves 100% of the prior `delta=-20` mismatches: Abilities 574/574, AI 315/316 (the 1 is a 1 MiB align boundary), Consumables/Core/Dune_Plugins 100%.
Proves sequential decode is byte-perfect (high precision). Directed raises recall (Core 0.1%→24%, AI 13%→22%, Vehicles 0.9%→12%) but drops adjacency to 36–78% (coincidental matches), so sequential stays the default and directed is a B2-only cross-validation tool.
Decoder changes: flags-last support + layout auto-detect (flags-first scan first to preserve 575/317), `decode_from_fdi()` directed mode, bogus-isize fallback, `mode=`/`fdi_result=` params on `decode_from_pak()`.
No regression: sequential still decodes 575/Abilities + 317/AI exactly; named acceptance (`BP_LogAbilities.uasset`) unchanged.
Implements Step 2 of the attempt-4 pipeline per docs/DECOMPOSED.md §B2. read_uasset(pak_path, entry, key_handle=None) returns raw .uasset bytes: seek to the B1-decoded pak-data offset, parse the authoritative in-pak FPakEntry header (header length computed from NumBlocks: 53 uncompressed / 73 single-block compressed), AES-256-ECB decrypt if the per-entry encrypted flag is set, Oodle-decompress if CompMethod != 0.
Funcom pak facts established:
* Data blocks ship UNENCRYPTED (Flags bit 22 clear on every entry; in-pak bEncrypted=0). AES path implemented + NIST-self-tested anyway, gated on the flag, but no Dune pak needs the key to read item data.
* Compressed entries use Oodle (CompMethod=1). Oodle is statically linked into Funcom's Win64 .exe; no oo2core DLL / liboo2corelinux ships with the client and none exists on the build host. The decompress call uses a pluggable ctypes backend (_load_oodle: $OODLE_LIB then well-known names) and raises OodleUnavailable with guidance when absent; it never returns garbage.
New module surface in pak_extract.py: InPakHeader + parse_inpak_header, _aes_decrypt_ecb, _load_oodle / oodle_available / _oodle_decompress, read_uasset, OodleUnavailable, PACKAGE_FILE_TAG.
Validation (validate/probe_read_uasset.py, 4/4 host-testable checks):
[1] in-pak header matches EncodedPakEntries decode for entry 0 (comp=2024 unc=10795 method=1 hdr_size=73) -> MATCH
[2] uncompressed .uasset read (Consumables @ 5970330) -> first 4 bytes = 0x9E2A83C1 PACKAGE_FILE_TAG -> PASS (satisfies the B2 acceptance magic-check verbatim for an uncompressed file)
[3] AES-256-ECB self-test (NIST SP 800-38A vector) -> PASS
[4] Oodle backend absent -> compressed read raises OodleUnavailable cleanly -> PASS (flips to a full compressed-magic assertion the moment $OODLE_LIB resolves)
The BP_LogAbilities.uasset named target is Oodle-compressed; its end-to-end decompressed-magic check auto-activates when an Oodle backend is provided.
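The header-length rule described above (53 bytes uncompressed, 73 bytes single-block compressed) can be sketched as follows. Only those two values are confirmed by this PR; the breakdown of the 20-byte difference into a 4-byte block count plus 16 bytes per block is an assumption based on the usual UE pak entry layout, and the function names are hypothetical rather than taken from `pak_extract.py`.

```python
BASE_HEADER = 53        # uncompressed entry header, as reported in this PR
BLOCK_COUNT_FIELD = 4   # assumed: uint32 block count, present only when compressed
BLOCK_RECORD = 16       # assumed: two int64 offsets (start, end) per block


def inpak_header_len(comp_method, num_blocks):
    """Length of the in-pak entry header that precedes the payload."""
    if comp_method == 0:
        return BASE_HEADER
    return BASE_HEADER + BLOCK_COUNT_FIELD + BLOCK_RECORD * num_blocks


def payload_span(offset, comp_method, num_blocks, comp_size):
    """Return (start, end) byte positions of the stored payload in the pak."""
    start = offset + inpak_header_len(comp_method, num_blocks)
    return start, start + comp_size


# Abilities entry 0 from this PR: one Oodle block, comp_size=2024.
span = payload_span(362_330, comp_method=1, num_blocks=1, comp_size=2_024)
print(span)
```

Under this assumption the payload end for entry 0 coincides with the next entry's offset, which is the same adjacency relation the B1 probe checks.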
Pipeline complete + correct; sole untested branch is the external-dependency Oodle decompress.
B2 landed:
33b8ca5: in-process `read_uasset` (seek + AES-decrypt + Oodle-decompress)
Per goal "Finalize and test B2". The read→decrypt→decompress pipeline is complete and correct; 4/4 host-testable checks pass; the sole external-dependency branch (Oodle decompress) is plumbed with graceful failure + a self-activating acceptance test.
`read_uasset(pak_path, entry, key_handle=None)`: seek to the B1-decoded offset, parse the authoritative in-pak `FPakEntry` header (length computed from NumBlocks: 53 uncompressed / 73 single-block compressed), AES-256-ECB decrypt if the per-entry encrypted flag is set, Oodle-decompress if CompMethod≠0.
Funcom pak facts established:
* Data blocks ship unencrypted (Flags bit 22 clear on every entry; in-pak `bEncrypted=0`). AES path implemented + NIST-self-tested anyway (gated on the flag) but no Dune pak needs the key to read item data. The AES black-box directive is satisfied trivially for the catalog.
* Compressed entries use Oodle (CompMethod=1). Oodle is statically linked into Funcom's Win64 `.exe`; no `oo2core` DLL / `liboo2corelinux` ships with the client and none exists on the build host (checked whole Steam library + ldconfig). Decompress uses a pluggable ctypes backend (`$OODLE_LIB` → well-known names) that raises `OodleUnavailable` rather than returning garbage.
Validation, `validate/probe_read_uasset.py` (4/4):
[1] in-pak header matches EncodedPakEntries decode for entry 0 → MATCH
[2] uncompressed .uasset read, first 4 bytes = 0x9E2A83C1 → PASS
[3] AES-256-ECB self-test (NIST SP 800-38A vector) → PASS
[4] Oodle backend absent, compressed read raises `OodleUnavailable` cleanly → PASS
Check [2] satisfies the B2 acceptance magic-check verbatim for an uncompressed file. The `BP_LogAbilities.uasset` named target is Oodle-compressed; its end-to-end decompressed-magic check auto-activates when an Oodle backend is provided (check [4] flips to the real decompress + magic assertion).
Your call on the one remaining branch (Oodle decompress):
* point `OODLE_LIB` at an existing Oodle `.so`/`.dll` exposing `OodleLZ_Decompress`, or
* the `ooz`/`libooz` Kraken decoder (clean-room): network clone + C++ build, so I left it as an explicit opt-in rather than pulling it autonomously.
Branch note: B1 + B1-FU + B2 are all stacked on `feature/dune-extract-b1-encoded-entries` (PR #6) since B2 builds directly on B1's decoder and PR #6 isn't merged yet. Say the word if you'd rather I split B2 onto its own branch.
User authorized "clean room" -> wired the open-source powzix/ooz Kraken decoder as the Oodle backend, completing B2's compressed branch.
install-prereqs.sh step 4b builds ooz into ~/.cargo/bin/libooz.so. Upstream is MSVC-only, so the installer applies a Linux/g++ shim inline: replacement stdafx.h + targetver.h (gcc builtins for _BitScan*/_byteswap_*/__forceinline/_rotl, SSE via <emmintrin.h>), an extern "C" OozKrakenDecompress export wrapper (unmangled symbol for ctypes), and truncation of the Windows-only CLI harness at the `typedef int WINAPI` line. Skipped gracefully when no g++ is present. Never vendored: built on the user's machine (same install-don't-vendor posture as aesdumpster/repak); Funcom-IP boundary + "tooling as-is" intact.
pak_extract._load_oodle now:
* auto-discovers $OODLE_LIB -> ~/.cargo/bin/libooz.so -> well-known names
* binds either OodleLZ_Decompress (14-arg) or Kraken_Decompress/OozKrakenDecompress (4-arg), normalized behind one call
* _oodle_decompress allocates 256 KiB output slack (ooz's basic Kraken path is not fuzz-safe; guards the heap against a mis-fed block)
Validation (validate/probe_read_uasset.py), 4/4 BOTH ways:
* WITH backend (auto-discovered, no env var): 60/60 sampled compressed entries decoded with zero failures; 43 are PACKAGE_FILE_TAG .uasset files. First .uasset @ 364427 decompresses to c1 83 2a 9e f8 ff ff ff = 0x9E2A83C1 + UE5 LegacyFileVersion -8. PASS
* WITHOUT backend: graceful OodleUnavailable, no garbage. PASS
The 17 non-package results are .uexp/.ubulk siblings (legit, no tag); physical entry[0] @ 362330 is one such .uexp, so the probe scans for the first package-tagged entry (B1-FU virtual-vs-physical ordering, not a decode error).
third_party/README.md gains the ooz clean-room attribution.
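The 8-byte prefix quoted in the validation above decodes as two little-endian 32-bit values. A minimal sketch using only the standard library; the helper name `looks_like_package` is hypothetical, while `PACKAGE_FILE_TAG` and the byte values come from this PR.

```python
import struct

PACKAGE_FILE_TAG = 0x9E2A83C1  # magic value named in this PR


def looks_like_package(data):
    """True when the buffer starts with the little-endian package tag."""
    if len(data) < 4:
        return False
    (tag,) = struct.unpack_from("<I", data, 0)
    return tag == PACKAGE_FILE_TAG


# Bytes reported for the first decompressed .uasset @ 364427.
head = bytes.fromhex("c1832a9ef8ffffff")
tag, legacy_version = struct.unpack("<Ii", head)  # uint32 tag, int32 version
print(hex(tag), legacy_version)
```

The `.uexp`/`.ubulk` siblings carry no tag at offset 0, so a check like this returns `False` for them without indicating a decode error.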
B2 COMPLETE: full read->decrypt->decompress pipeline implemented, correct, and validated for both compressed and uncompressed entries. B3 can begin against any .uasset.
B2 Oodle path COMPLETE:
8058eb7: clean-room ooz backend, compressed decompression validated end-to-end.
You said "clean room" → wired the open-source powzix/ooz Kraken decoder (clean-room, no RAD/Epic source) as the Oodle backend.
`install-prereqs.sh` step 4b builds it into `~/.cargo/bin/libooz.so`. Upstream is MSVC-only, so the installer applies a Linux/g++ shim inline: replacement `stdafx.h`/`targetver.h` (gcc builtins for `_BitScan*`/`_byteswap_*`/`__forceinline`/`_rotl`, SSE via `<emmintrin.h>`), an `extern "C" OozKrakenDecompress` wrapper (unmangled symbol for ctypes), and truncation of the Windows-only CLI harness. Skipped gracefully when no g++. Never vendored: built on your machine, same posture as aesdumpster/repak.
`pak_extract._load_oodle` auto-discovers `$OODLE_LIB` → `~/.cargo/bin/libooz.so` → well-known names; binds either `OodleLZ_Decompress` (14-arg) or ooz's `Kraken_Decompress` (4-arg). `_oodle_decompress` allocates 256 KiB output slack (ooz's basic path isn't fuzz-safe; this guards the heap).
Validation, `probe_read_uasset.py`, 4/4 BOTH ways:
* WITH backend (auto-discovered, no env var): 60/60 sampled compressed entries decoded with zero failures; 43 are PACKAGE_FILE_TAG `.uasset`. First @ 364427 → `c1 83 2a 9e f8 ff ff ff` = 0x9E2A83C1 + UE5 LegacyFileVersion −8 → PASS
* WITHOUT backend: graceful `OodleUnavailable`, no garbage → PASS
The 17 non-package results are `.uexp`/`.ubulk` siblings (legit, no tag); physical entry[0] @ 362330 is one such `.uexp` (B1-FU virtual-vs-physical ordering, not a decode error), so the probe scans for the first package-tagged entry.
B2 is COMPLETE: full read→decrypt→decompress pipeline implemented, correct, validated for compressed AND uncompressed entries.
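The backend discovery order described above (`$OODLE_LIB`, then `~/.cargo/bin/libooz.so`, then well-known names) could be sketched as below. This only builds the candidate list and loads nothing; the function name `oodle_candidates` and the two entries in `WELL_KNOWN` are assumptions for illustration, since the PR does not list the names `_load_oodle` actually tries.

```python
import os

# Assumed examples of "well-known names"; the PR does not enumerate them.
WELL_KNOWN = ("liboo2corelinux64.so.9", "oo2core_9_win64.dll")


def oodle_candidates(env=None, home=None):
    """Return library candidates in the priority order described in this PR."""
    env = os.environ if env is None else env
    home = os.path.expanduser("~") if home is None else home
    candidates = []
    explicit = env.get("OODLE_LIB")
    if explicit:
        candidates.append(explicit)                        # 1. explicit override
    candidates.append(f"{home}/.cargo/bin/libooz.so")      # 2. installer output
    candidates.extend(WELL_KNOWN)                          # 3. fallback names
    return candidates


order = oodle_candidates({"OODLE_LIB": "/opt/oodle/lib.so"}, home="/home/user")
print(order)
```

A loader would then try each candidate with `ctypes.CDLL` in turn and raise an error such as `OodleUnavailable` when none resolves, which matches the graceful-failure behavior reported here.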
`third_party/README.md` gains the ooz clean-room attribution.
Test plan:
Next cascade pickup: B3 (UAsset → DataTable row maps), which can now run against any `.uasset`, compressed or not.
Implements Step 3 of the attempt-4 pipeline (docs/DECOMPOSED.md §B3). parse_datatable(uasset_bytes, uexp_bytes) -> {row_name: {field: value}}. Hand-rolled (uassetapi is C#/.NET, not the MIT-Python the spec hoped for; in-house matches the attempt-3 posture, no heavy dependency).
Pipeline: parse_summary -> parse_name_table -> parse_exports -> slice export data from .uexp -> UObject tagged props -> NumRows -> per-row FName + tagged properties.
FPropertyTag deserializer handles Bool/Int/Int64/Float/Double/Byte/Enum/Str/Name/Text (incl FText string-table refs)/Object/SoftObject/Struct, and skips unknown types by the tag Size field (never fails on an unrecognized property).
Funcom cooked-package RE findings that drive the parser:
* versions zeroed (unversioned) -> fixed UE5.4 layout, no version branches
* PKG_FilterEditorOnly set -> LocalizationId absent (UE4ver<516 gate)
* TotalHeaderSize == .uasset size; export payload entirely in .uexp; SerialOffset is absolute -> .uexp-relative = SerialOffset - headersize
* .uexp is the physically-adjacent pak entry; ends with PACKAGE_FILE_TAG
* 4-byte int32 preamble (observed 0) between UObject props and NumRows
Validation (validate/probe_datatable.py Systems.pak): 5 DataTables, 60 rows parsed end-to-end (B1 decode -> B2 Oodle-decompress -> B3 row map). Value-type spread text(stringtable):58 float:59 bool:177 object:1. DT_WeaponStats=36 rows, DT_DamageMitigations=17, DT_ArmorStats=4, etc. Every row carries StatDisplayName (FText->{string_table,key}), StatStep (Double), bool display flags: the FText+numeric+bool shapes the spec's fields-to-surface list calls for.
uasset_parser.status() now reports 'implemented' (shown in --dry-run).
Coverage note: the parser is complete for any DataTable bytes; the set reachable end-to-end today is bounded by B1 physical-entry coverage (virtual-vs-physical gap, B1-FU1/2/3).
Richer per-item tables like DT_BaseItems_Weapons (literal Damage/RPM values) unlock when B1 coverage widens or .uasset/.uexp pairing is driven by FDI filename mapping. The 5 Systems.pak stat-definition tables fully exercise + prove the parser.
B3 landed:
cc0b6ca: UE5.4 cooked-DataTable parser → row maps, validated end-to-end.
`parse_datatable(uasset_bytes, uexp_bytes)` → `{row_name: {field: value}}`. Hand-rolled (uassetapi is C#/.NET, not the MIT-Python the spec hoped for; in-house matches the attempt-3 posture, no heavy dep).
FPropertyTag deserializer handles Bool / Int / Int64 / Float / Double / Byte / Enum / Str / Name / Text (FText incl. string-table refs) / Object / SoftObject / Struct, and skips unknown types by the tag `Size` field, so it never fails on an unrecognized property.
Funcom cooked-package RE findings:
* versions zeroed (unversioned) → fixed UE5.4 layout, no version branches
* `PKG_FilterEditorOnly` set → `LocalizationId` absent (UE4ver<516 gate)
* `TotalHeaderSize` == `.uasset` size; export payload entirely in `.uexp`; `SerialOffset` absolute → `.uexp`-relative = `SerialOffset − headersize`
* `.uexp` is the physically-adjacent pak entry; ends with `PACKAGE_FILE_TAG`
* 4-byte int32 preamble (observed 0) between UObject props and `NumRows`
Validation, `probe_datatable.py Systems.pak`: 5 DataTables, 60 rows parsed end-to-end (DT_WeaponStats=36 rows, DT_DamageMitigations=17, DT_ArmorStats=4, etc.).
Every row carries `StatDisplayName` (FText → `{string_table, key}` like `UI/ItemStats_Armor_ColdProtection`), `StatStep` (Double), and bool display flags, parsed correctly through B1 decode → B2 Oodle-decompress → B3 row map. `--dry-run` now reports the row parser as `implemented`.
Honest coverage note: the parser is complete for any DataTable bytes; the set reachable end-to-end today is bounded by B1 physical coverage (virtual-vs-physical gap, B1-FU1/2/3). Richer per-item tables like `DT_BaseItems_Weapons` (literal Damage/RPM values) unlock when B1 coverage widens, or when `.uasset`/`.uexp` pairing is driven by FDI filename→offset mapping instead of physical adjacency. The 5 Systems.pak stat-definition tables fully exercise + prove the parser.
Test plan:
Stack in PR #6: B1 + B1-FU + B2 + Oodle + B3. Next: B4 (wire row output through field_filter → catalog_writer) — though it'll surface real per-item values only as fast as B1 coverage feeds it.
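The B3 offset rule reported in this PR (`SerialOffset` is absolute, so the `.uexp`-relative position is `SerialOffset − headersize`) can be illustrated with a small slicing helper. This is a sketch under assumptions: the function name `slice_export_payload` and its bounds check are hypothetical and not taken from the parser in this PR.

```python
def slice_export_payload(uexp, serial_offset, serial_size, total_header_size):
    """Cut one export's bytes out of the .uexp buffer.

    serial_offset is absolute (counted from the start of the .uasset),
    so subtracting the header size gives the position inside .uexp.
    """
    start = serial_offset - total_header_size
    end = start + serial_size
    if start < 0 or end > len(uexp):
        raise ValueError(
            f"export span [{start}, {end}) falls outside .uexp of {len(uexp)} bytes"
        )
    return uexp[start:end]


# Toy buffer: header of 1000 bytes, export stored 10 bytes into the .uexp.
toy_uexp = bytes(range(32))
payload = slice_export_payload(
    toy_uexp, serial_offset=1010, serial_size=4, total_header_size=1000
)
print(payload.hex())
```

Raising on an out-of-range span, rather than returning a short slice, keeps a mispaired `.uasset`/`.uexp` from silently producing truncated row data.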
Changed title from "B1: Funcom UE5.4 EncodedPakEntries decoder - named acceptance passes, FU1-FU4 filed for full coverage" to "dune-extract attempt-4: B1 EncodedPakEntries + FU1-4 + B2 read/decrypt/Oodle + B3 DataTable parser"