Paste: gemini

Author: erg
Mode: factor
Date: Fri, 25 Jul 2025 18:11:06
Excellent. The GDB backtrace provides the exact context of the crash and confirms the area of suspicion. This is much more precise than general analysis.

Let's break down the backtrace. This is a classic "wrong type" error during a critical VM initialization phase.

### Debugging the Backtrace

1.  **Frame #4: The Crash Site**
    *   `factor::untag_fixnum (tagged=1) at vm/layouts.hpp:94`
    *   This is the exact point of failure. The `untag_fixnum` function was called with the value `1`.
    *   In the Factor VM, `1` is the tagged representation of the boolean `f`. The tag is `1` (`F_TYPE`).
    *   `untag_fixnum` asserts that the tag of its input is `0` (`FIXNUM_TYPE`).
    *   The assertion `TAG(1) == FIXNUM_TYPE` evaluates to `1 == 0`, which is false, causing the `abort()`.

2.  **Frame #5: The Caller**
    *   `factor::factor_vm::lookup_external_address` is the function that called `untag_fixnum`.
    *   Looking at its source code in `vm/code_blocks.cpp`, we can see the exact line:
        ```cpp
        // vm/code_blocks.cpp
        cell factor_vm::lookup_external_address(...) {
          switch (rel_type) {
            // ...
            case RT_VM:
              return (cell)this + untag_fixnum(array_nth(parameters, index)); // This is the call site
            // ...
          }
        }
        ```
    *   The backtrace tells us `rel_type` was `RT_VM`. This case is for relocations that refer to the `factor_vm` object itself. It expects a **fixnum offset** from the `parameters` array.
    *   Instead of a fixnum offset, `array_nth(parameters, index)` returned `1` (the value for `f`).

3.  **Frames #8-18: The Context**
    *   The crash happens inside `initial_code_block_visitor`. Reading the call chain from the outside in: `prepare_boot_image` calls `update_word_references`, which calls `initialize_code_block`, which runs the visitor.
    *   This confirms the crash occurs during the "Stage 2 early init" phase. The VM is iterating through all the pre-compiled code blocks from the boot image and "linking" them by filling in relocation placeholders with actual memory addresses.
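
To make the tag arithmetic behind Frame #4 concrete, here is a minimal, simplified model of the tagging scheme. It is a sketch, not the code in `vm/layouts.hpp`: the 4-bit tag width, the `tag_sketch` namespace, and the helper `try_untag_fixnum` are assumptions made for illustration. Only the values `FIXNUM_TYPE == 0` and `F_TYPE == 1` come from the analysis above.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

namespace tag_sketch {

using cell = std::uintptr_t;
using fixnum = std::intptr_t;

// Assumed layout: the low 4 bits of every cell hold the type tag.
constexpr cell TAG_BITS = 4;
constexpr cell TAG_MASK = (cell{1} << TAG_BITS) - 1;
constexpr cell FIXNUM_TYPE = 0;
constexpr cell F_TYPE = 1;
constexpr cell false_object = F_TYPE;  // `f` is the bare tag value 1

// Extract the type tag from a tagged cell.
constexpr cell tag_of(cell tagged) { return tagged & TAG_MASK; }

// Tag an integer as a fixnum: shift the payload above the tag bits.
constexpr cell tag_fixnum(fixnum n) {
  return (static_cast<cell>(n) << TAG_BITS) | FIXNUM_TYPE;
}

// Hypothetical non-aborting variant of untag_fixnum: returns std::nullopt
// in the case where the real function's assertion would fail.
constexpr std::optional<fixnum> try_untag_fixnum(cell tagged) {
  if (tag_of(tagged) != FIXNUM_TYPE) return std::nullopt;
  return static_cast<fixnum>(tagged) >> TAG_BITS;
}

}  // namespace tag_sketch
```

Under these assumptions, `try_untag_fixnum(tag_fixnum(42))` yields `42`, while `try_untag_fixnum(false_object)` yields no value. That second case is the one where the real `untag_fixnum` hits its assertion and aborts.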

### The Root Cause: A Bug in the ARM64 Assembler Relocation Logic

The backtrace tells us that a code block with an `RT_VM` relocation was created with `f` as its parameter instead of a fixnum offset. We need to find where this relocation is generated.

Searching for `rel-vm` in `basis/bootstrap/assembler/arm.64.factor`, we find it in the `unwind-native-frames` sub-primitive:

*   **File:** `basis/bootstrap/assembler/arm.64.factor`
*   **Word:** `{ unwind-native-frames ... }`

```factor
{ unwind-native-frames [
    SP arg2 MOV
    VM LDR= rel-vm
    CTX VM vm-context-offset [+] LDR
    ...
] }
```

The instruction is `VM LDR= rel-vm`. Let's look at the definition of `LDR=` in the same file:

*   **File:** `basis/bootstrap/assembler/arm.64.factor`

```factor
: LDR= ( Rt -- word class ) f swap (LDR=) ;
```

**This is the bug.**

The word `LDR=` **unconditionally pushes `f` onto the stack** before calling the helper `(LDR=)`. This `f` is intended to be the literal value that will be loaded into the register.

However, for a relocation like `rel-vm`, there is no literal value to be loaded. The value is resolved at runtime by the VM's relocation logic. The `rel-vm` word is supposed to put a *relocation parameter* (a fixnum offset into the `factor_vm` struct) into the `parameter-table`.

Because `LDR=` pushes `f`, this `f` is what ends up in the `parameter-table` for the `RT_VM` relocation generated by `rel-vm`. When the VM later tries to initialize this code block, it reads the `f`, tries to untag it as a fixnum, and aborts.
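
While tracking down which code block carries the bad parameter, it may help to have the VM report the problem instead of aborting. The sketch below shows one way to validate an `RT_VM` parameter before untagging it. Everything here is hypothetical: `check_rt_vm_parameter` and the `reloc_sketch` namespace are not part of the Factor VM, the parameter table is modeled as a plain `std::vector`, and the 4-bit tag mask is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

namespace reloc_sketch {

using cell = std::uintptr_t;

// Assumed tag layout: low 4 bits hold the type; fixnums use tag 0, `f` is 1.
constexpr cell TAG_MASK = 15;
constexpr cell FIXNUM_TYPE = 0;

// Hypothetical diagnostic: describe a bad RT_VM parameter instead of aborting.
// Returns an empty string when the parameter is a correctly tagged fixnum.
std::string check_rt_vm_parameter(const std::vector<cell>& parameters,
                                  std::size_t index) {
  if (index >= parameters.size())
    return "RT_VM parameter index " + std::to_string(index) + " out of range";
  const cell tag = parameters[index] & TAG_MASK;
  if (tag != FIXNUM_TYPE)
    return "RT_VM parameter " + std::to_string(index) + " has tag " +
           std::to_string(tag) + ", expected fixnum";
  return "";
}

}  // namespace reloc_sketch
```

Called on a table whose slot holds `1` (the value for `f`), this returns a message naming the slot and its tag. Logging that alongside the code block's address would point directly at the word that emitted the relocation.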

The same bug exists for `rel-dlsym` used in `JIT-PRIMITIVE`:

```factor
[
    jit-save-context
    arg1 VM MOV
    f LDR=BLR rel-dlsym  ! Uses LDR=BLR which likely has a similar issue
    jit-restore-context
] JIT-PRIMITIVE jit-define
```

The x86 backend, for comparison, handles this differently. It doesn't have this generic `LDR=` word that pushes `f`. Its relocations are handled more directly.

**The Fix:**
You need to separate the logic for loading a literal from the logic for creating a relocation that will be resolved later.

1.  Create a new word, let's call it `LDR-REL`, that does not push `f`. It should only set up the machine code for the `LDR` and allow the following `rel-*` word to add its relocation information.
2.  Update the call sites that pair `rel-vm` or `rel-dlsym` with `LDR=` so that they use `LDR-REL` instead.

**Conceptual Fix:**

```factor
! In basis/bootstrap/assembler/arm.64.factor

! A new word for relocations that don't have an immediate literal value.
: LDR-REL ( Rt -- word class )
    ! This should do what (LDR=) does but without expecting 'f' on the stack
    ... ;

! Update the primitive to use the new word
{ unwind-native-frames [
    SP arg2 MOV
    VM LDR-REL rel-vm
    ...
] }
```

**Summary and Action Plan:**

1.  **Immediate Fix:** Refactor the ARM64 assembler logic to distinguish between loading a literal (`LDR=`) and setting up a relocation to be resolved by the VM (`LDR-REL`). Then update every call site that uses `rel-vm`, `rel-dlsym`, or a similar `rel-*` word so that it goes through the new relocation-specific loading word.
