Nautilus FFI Memory Contract

Low-medium relevance to Cortana. The FFI page documents the C-ABI contract between the Rust core and the Cython/PyO3 layer that wraps it - rules for CVec, PyCapsule, Box-backed *_API wrappers, and the abort_on_panic envelope around every extern "C" symbol. End-user Cortana strategy code never crosses this boundary directly: every Strategy.on_quote_tick callback, every cache.position(...) lookup, every submit_order(...) call is a Python method that the framework has already wrapped in PyO3 plumbing. The FFI contract matters to us only as cost-of-call awareness - each Python ↔ Rust hop is a real transition, types like Price/Quantity/Money are precision-aware Rust objects with PyO3-bound accessors, and naive code (e.g., calling .as_double() on the same Price 78 times inside a tight scoring loop) burns measurable cycles per crossing. The page is also where the framework’s “panics never unwind across extern "C"” rule is codified, which explains why the v1 release wheels run panic = abort and why a Rust-side invariant violation kills the process rather than raising a Python exception. Read this page if Cortana ever authors a Rust crate (UW adapter, custom indicator). Otherwise the framework hides the FFI; the relevant takeaway is “minimize hot-loop crossings, cache .as_double() results once.”

Why this page exists

nautilus-rust.md answers “do we have to write Rust?” (no). nautilus-architecture.md answers “how is Python wired to Rust?” (PyO3 + static link via Cython). This page covers what the official FFI Memory Contract doc actually says - the rules a contributor must follow when adding FFI symbols and, more importantly for Cortana, the cost model implied by those rules so we can reason about hot-path performance without authoring Rust ourselves.

The five FFI rules (verbatim mechanics)

The official page documents exactly five mechanics. Each one is a hard rule for any contributor; the cost implications fall out of them.

1. Fail-fast panics at the FFI boundary

“Rust panics must never unwind across extern "C" functions. Unwinding into C or Python is undefined behaviour and can corrupt the foreign stack or leave partially-dropped resources behind. To enforce the fail-fast architecture we wrap every exported symbol in crate::ffi::abort_on_panic, which executes the body and calls process::abort() if a panic occurs.”

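Mechanics: every exported symbol has this shape (args and ReturnType are placeholders for the real signature):

use crate::ffi::abort_on_panic;

#[unsafe(no_mangle)]
pub extern "C" fn some_ffi_fn(args) -> ReturnType {
    // A panic inside the closure is logged, then process::abort() is called;
    // nothing unwinds into the C or Python caller.
    abort_on_panic(|| {
        // body
    })
}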

The panic message is logged before the abort, so debug output is preserved. This is the framework-level reason panic = abort is the recommended release config: a panic anywhere in the Rust core takes the process down cleanly so the orchestrator (launchd / systemd / docker) can restart from the last persisted state.
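A minimal std-only sketch of the envelope, assuming only the behavior the quote describes (run the body, log the panic message, call process::abort()). This is not the framework's crate::ffi::abort_on_panic source; the function below is a stand-in with that name, and checked_add_u64 is a hypothetical exported symbol. A real export would also carry #[unsafe(no_mangle)], omitted here so the snippet stands alone.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::process;

/// Hypothetical stand-in for crate::ffi::abort_on_panic:
/// run `body`; if it panics, log the payload and abort instead of unwinding.
pub fn abort_on_panic<F, R>(body: F) -> R
where
    F: FnOnce() -> R,
{
    match panic::catch_unwind(AssertUnwindSafe(body)) {
        Ok(value) => value,
        Err(payload) => {
            // Recover the panic message so it reaches the log before the process dies.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "non-string panic payload".to_string());
            eprintln!("panic at FFI boundary, aborting: {msg}");
            process::abort()
        }
    }
}

/// Hypothetical exported symbol: an overflow panics inside the envelope,
/// which aborts the process rather than unwinding into C or Python.
pub extern "C" fn checked_add_u64(a: u64, b: u64) -> u64 {
    abort_on_panic(|| a.checked_add(b).expect("u64 overflow"))
}
```

Only the non-panicking path can be exercised in-process: checked_add_u64(2, 3) returns 5, while checked_add_u64(u64::MAX, 1) would print the message and terminate the whole process, which is the fail-fast contract rather than a bug.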

2. CVec lifecycle (the Vec<T> exchange protocol)

CVec is the canonical container for variable-length data crossing the boundary - a repr(C) thin wrapper around Vec<T> passed by value.

Step | Owner | Action
1 | Rust | Build Vec<T>, convert with .into() - leaks the vec, transfers raw allocation to foreign code.
2 | Foreign (Cython / PyO3 / C) | Read while the CVec is in scope. Do not modify ptr, len, cap.
3 | Foreign | Exactly once, call the type-specific drop helper (e.g. vec_drop_book_levels, vec_drop_book_orders, vec_time_event_handlers_drop).

The drop helper reconstructs the original Vec<T> with Vec::from_raw_parts and lets it drop normally.

“If step 3 is forgotten the allocation is leaked for the remainder of the process; if it is performed twice the program will double-free and likely crash.”

The framework explicitly removed the old generic cvec_drop because it always treated the buffer as Vec<u8> - calling it on any other element type produces a size-mismatch and corrupts the allocator’s bookkeeping. Use the type-specific helper. If none exists, add one in crates/core/src/ffi/cvec.rs.
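The three steps can be sketched in plain Rust. This assumes a CVec layout of ptr / len / cap (the fields step 2 says not to modify). Level, vec_levels_new and vec_drop_levels are hypothetical names, not helpers that exist in crates/core/src/ffi/cvec.rs, and the drop counter exists only to show that step 3 frees every element exactly once.

```rust
use std::ffi::c_void;
use std::mem::ManuallyDrop;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sketch of a CVec-like container: the raw parts of a leaked Vec<T>.
#[repr(C)]
pub struct CVec {
    pub ptr: *mut c_void,
    pub len: usize,
    pub cap: usize,
}

impl<T> From<Vec<T>> for CVec {
    /// Step 1 (Rust): leak the Vec and hand its raw allocation to foreign code.
    fn from(data: Vec<T>) -> Self {
        let mut data = ManuallyDrop::new(data);
        CVec {
            ptr: data.as_mut_ptr() as *mut c_void,
            len: data.len(),
            cap: data.capacity(),
        }
    }
}

/// Counts element drops so the lifecycle can be checked.
pub static LEVELS_DROPPED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical element type.
#[repr(C)]
pub struct Level {
    pub price_raw: i64,
    pub size_raw: u64,
}

impl Drop for Level {
    fn drop(&mut self) {
        LEVELS_DROPPED.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical producer: builds a Vec<Level> and leaks it as a CVec.
pub extern "C" fn vec_levels_new(n: usize) -> CVec {
    let levels: Vec<Level> = (0..n)
        .map(|i| Level { price_raw: 100 + i as i64, size_raw: 10 })
        .collect();
    levels.into()
}

/// Step 3 (foreign, exactly once): type-specific drop helper.
/// It rebuilds a Vec<Level>, the same element type that was leaked.
/// Safety contract: `v` came from vec_levels_new and is dropped only once.
pub extern "C" fn vec_drop_levels(v: CVec) {
    let rebuilt: Vec<Level> =
        unsafe { Vec::from_raw_parts(v.ptr as *mut Level, v.len, v.cap) };
    drop(rebuilt);
}
```

The key detail is that vec_drop_levels rebuilds a Vec<Level>, not a Vec<u8>: from_raw_parts must see the same element type, length and capacity that into() leaked, which is exactly why the generic cvec_drop was removed.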

3. Capsules created on the Python side

Some Cython helpers allocate buffers with PyMem_Malloc, wrap them into a CVec, and return the address inside a PyCapsule. Every such capsule is created with a destructor (capsule_destructor or capsule_destructor_deltas) that frees both the buffer and the CVec. The Python caller therefore must NOT free the memory manually - the destructor handles it on collection. Manual free → double-free crash.

4. Capsules created on the Rust side (PyO3 bindings)

When Rust pushes a heap-allocated value into Python, it MUST use PyCapsule::new_with_destructor:

use pyo3::prelude::*;
use pyo3::types::PyCapsule;
 
Python::attach(|py| {
    let my_data = Box::new(MyStruct::new());
    // The capsule takes ownership of the Box itself; a raw *mut MyStruct
    // would not satisfy the Send bound PyCapsule places on the stored value.
    let capsule = PyCapsule::new_with_destructor(
        py,
        my_data,
        None,
        |boxed: Box<MyStruct>, _ctx| {
            // Runs when Python collects the capsule: dropping the Box frees the alloc
            drop(boxed);
        },
    ).expect("capsule creation failed");
    // ... pass `capsule` back to Python ...
});

“Do not use PyCapsule::new(…, None); that variant registers no destructor and will leak memory unless the recipient manually extracts and frees the pointer (something we never rely on).”

The codebase has been audited so every Rust→Python capsule has a destructor. New FFI modules must follow the same pattern.
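Why the destructor matters can be shown with a toy model. ToyCapsule is not PyO3's PyCapsule; it is a hypothetical stand-in that mimics only the ownership rule: when the capsule is collected (here, when it is dropped), its destructor runs once, and a capsule built without one silently leaks its payload. Payload, drop_payload and boxed_payload are likewise invented for illustration.

```rust
use std::ffi::c_void;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Toy model of a capsule: an opaque pointer plus an optional destructor
/// that the "collector" (here: Drop) runs at most once.
pub struct ToyCapsule {
    ptr: *mut c_void,
    destructor: Option<unsafe fn(*mut c_void)>,
}

impl ToyCapsule {
    /// Mirrors the required pattern: ownership of `ptr` passes to the capsule.
    pub fn new_with_destructor(ptr: *mut c_void, destructor: unsafe fn(*mut c_void)) -> Self {
        ToyCapsule { ptr, destructor: Some(destructor) }
    }

    /// Mirrors the forbidden "no destructor" variant: nothing ever frees `ptr`.
    pub fn new_without_destructor(ptr: *mut c_void) -> Self {
        ToyCapsule { ptr, destructor: None }
    }
}

impl Drop for ToyCapsule {
    fn drop(&mut self) {
        // take() guarantees the destructor cannot run twice.
        if let Some(d) = self.destructor.take() {
            unsafe { d(self.ptr) };
        }
    }
}

pub static PAYLOADS_FREED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical heap payload that records when it is freed.
pub struct Payload(pub u64);

impl Drop for Payload {
    fn drop(&mut self) {
        PAYLOADS_FREED.fetch_add(1, Ordering::SeqCst);
    }
}

/// Allocate a Payload and hand out the raw pointer (like Box::into_raw in rule 4).
pub fn boxed_payload(v: u64) -> *mut c_void {
    Box::into_raw(Box::new(Payload(v))) as *mut c_void
}

/// Destructor for pointers produced by boxed_payload: rebuild the Box, let it drop.
pub unsafe fn drop_payload(ptr: *mut c_void) {
    unsafe { drop(Box::from_raw(ptr as *mut Payload)) };
}
```

The first capsule frees its payload when it goes out of scope; the second one leaves the allocation behind for the rest of the process, which is the leak the quoted rule forbids.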

5. Box-backed *_API wrappers (owned Rust objects)

For complex objects (OrderBook, SyntheticInstrument, TimeEventAccumulator), Rust allocates the value with Box::new and returns a small repr(C) wrapper whose only field is the Box:

#[repr(C)]
pub struct OrderBook_API(Box<OrderBook>);
 
#[unsafe(no_mangle)]
pub extern "C" fn orderbook_new(id: InstrumentId, book_type: BookType)
    -> OrderBook_API
{
    OrderBook_API(Box::new(OrderBook::new(id, book_type)))
}
 
#[unsafe(no_mangle)]
pub extern "C" fn orderbook_drop(book: OrderBook_API) {
    drop(book); // frees the heap allocation
}

Mandatory rules:

  • Every *_new constructor must have a matching *_drop.
  • Validate parameters before heap allocation (fail fast on bad input).
  • The Python/Cython binding must guarantee *_drop runs exactly once.

Two acceptable patterns for guaranteeing single-drop:

  • Preferred (new code): wrap the pointer in a PyCapsule with destructor (rule 4).
  • Legacy (v1 Cython only): call the helper explicitly in __del__ / __dealloc__:
cdef class OrderBook:
    cdef OrderBook_API _mem
    def __cinit__(self, ...):
        self._mem = orderbook_new(...)
    def __del__(self):
        if self._mem._0 != NULL:
            orderbook_drop(self._mem)

Forgetting drop → leak. Calling twice → crash.
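A self-contained sketch of the *_API pattern with the three mandatory rules applied. Ladder, Ladder_API, ladder_new, ladder_drop and LadderHandle are hypothetical names; LadderHandle models in Rust what the Cython or PyO3 binding has to guarantee, using an Option so the drop helper can run at most once. A real export would also wrap the body in abort_on_panic, so a failed range check aborts the process with the message logged.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical owned Rust object standing in for OrderBook.
pub struct Ladder {
    pub depth: usize,
    pub levels: Vec<i64>,
}

pub static LADDERS_DROPPED: AtomicUsize = AtomicUsize::new(0);

impl Drop for Ladder {
    fn drop(&mut self) {
        LADDERS_DROPPED.fetch_add(1, Ordering::SeqCst);
    }
}

/// repr(C) handle whose only field is the Box, mirroring OrderBook_API.
#[allow(non_camel_case_types)]
#[repr(C)]
pub struct Ladder_API(Box<Ladder>);

/// Constructor: validate *before* allocating, so bad input fails fast
/// without a half-built heap object.
pub extern "C" fn ladder_new(depth: usize) -> Ladder_API {
    assert!(depth > 0 && depth <= 1_000, "depth out of range: {}", depth);
    Ladder_API(Box::new(Ladder { depth, levels: vec![0; depth] }))
}

/// Matching destructor: consuming the handle drops the Box and frees the heap allocation.
pub extern "C" fn ladder_drop(ladder: Ladder_API) {
    drop(ladder);
}

/// Binding-side owner: guarantees ladder_drop runs at most once.
pub struct LadderHandle {
    mem: Option<Ladder_API>,
}

impl LadderHandle {
    pub fn new(depth: usize) -> Self {
        LadderHandle { mem: Some(ladder_new(depth)) }
    }

    /// Returns 0 once the handle has been closed.
    pub fn depth(&self) -> usize {
        self.mem.as_ref().map_or(0, |api| api.0.depth)
    }

    /// Explicit release; safe to call more than once.
    pub fn close(&mut self) {
        if let Some(api) = self.mem.take() {
            ladder_drop(api);
        }
    }
}

impl Drop for LadderHandle {
    fn drop(&mut self) {
        self.close();
    }
}
```

Calling close() twice, or close() followed by the handle going out of scope, still frees the Ladder exactly once; the Option::take guard plays the role of the NULL check in the legacy Cython __del__.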

What types cross the boundary cleanly (and why)

The page itself doesn’t enumerate this, but the conventions documented across nautilus-value-types.md, nautilus-architecture.md, and the FFI page imply a clear hierarchy of “cheap” vs “expensive” crossings:

Cheap (small, repr(C), value-typed):

  • Primitives: u64, i64, f64, bool - pass-by-value, no allocation.
  • UnixNanos, fixed-precision integer scalars - same.
  • repr(C) enums - OrderSide, OrderType, BookType, etc.
  • Identifier strings backed by interned IDs (InstrumentId, StrategyId) - equality is a fixed-size hash compare.
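The “fixed-size compare” point for interned identifiers can be illustrated with a std-only interner. Interner, Symbol and InstrumentKey are hypothetical names, and this is not how Nautilus itself stores InstrumentId; it only shows why interning turns string equality into an integer comparison: each distinct string is stored once, and identifiers carry small Copy handles.

```rust
use std::collections::HashMap;

/// Interned-string handle: comparing two Symbols compares one u32.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct Symbol(u32);

/// Minimal interner: each distinct string is stored exactly once.
#[derive(Default)]
pub struct Interner {
    ids: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    /// Return the existing handle for `s`, or allocate one on first sight.
    pub fn intern(&mut self, s: &str) -> Symbol {
        if let Some(&id) = self.ids.get(s) {
            return Symbol(id);
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_owned());
        self.ids.insert(s.to_owned(), id);
        Symbol(id)
    }

    /// Map a handle back to its text (only needed for display / logging).
    pub fn resolve(&self, sym: Symbol) -> &str {
        &self.strings[sym.0 as usize]
    }
}

/// Hypothetical two-part identifier: Copy, 8 bytes, equality is two integer compares.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct InstrumentKey {
    pub symbol: Symbol,
    pub venue: Symbol,
}
```

The string work (hashing, allocation) happens once per distinct name at intern time; every later equality check or map lookup on InstrumentKey touches only the two integers.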

Cheap-ish (one wrapper object per value, single crossing per access):

  • Price, Quantity, Money - fixed-precision scalars (a raw integer plus a precision tag) exposed to Python as wrapper objects; .as_double() is the conversion to plain f64.
  • Order, Position, Account - when handed by reference.

Expensive (heap allocation per crossing, possibly per-call):

  • String ↔ Python str - UTF-8 validation + new alloc both directions.
  • Arbitrary Python dict / list / tuple constructed Rust-side - every element crosses the GIL-locked allocator.
  • Custom Python classes accessed from Rust by attribute lookup - each getattr is a full Python interpreter round-trip.
  • CVec of complex elements - fine for one bulk hand-off, painful if done every tick.

Expensive and easy to do accidentally:

  • Calling .as_double() on the same Price repeatedly inside a hot loop - every call is an FFI crossing.
  • Constructing a Python dict or list per tick to pass to a custom data type instead of using @customdataclass.
  • Reading attributes off Python user objects from Rust callbacks via getattr instead of through a typed PyO3 binding.
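A toy cost model makes the first item concrete. The Price below is hypothetical: a raw i64 scaled by 10^9 with a precision tag (that scale is an assumption for this sketch), and a Cell counter stands in for “one FFI crossing per accessor call”. It is written in Rust only to keep all examples in one language; in Cortana's Python handlers the same discipline applies: read each accessor once into a local, then compute on locals.

```rust
use std::cell::Cell;

/// Assumed fixed-point scale for this sketch (10^9).
const SCALE: f64 = 1_000_000_000.0;

/// Toy fixed-precision Price; `crossings` counts accessor calls
/// as if each one were a Python <-> Rust hop.
pub struct Price {
    raw: i64,
    pub precision: u8,
    crossings: Cell<u32>,
}

impl Price {
    pub fn from_raw(raw: i64, precision: u8) -> Self {
        Price { raw, precision, crossings: Cell::new(0) }
    }

    /// Lossy conversion to f64; every call is counted as one crossing.
    pub fn as_double(&self) -> f64 {
        self.crossings.set(self.crossings.get() + 1);
        self.raw as f64 / SCALE
    }

    pub fn crossings(&self) -> u32 {
        self.crossings.get()
    }
}

/// Anti-pattern: re-reads both accessors on every iteration.
pub fn score_naive(bid: &Price, ask: &Price, weights: &[f64]) -> f64 {
    weights
        .iter()
        .map(|w| w * (bid.as_double() + ask.as_double()) / 2.0)
        .sum()
}

/// Hoisted: one crossing per accessor, then pure local arithmetic.
pub fn score_hoisted(bid: &Price, ask: &Price, weights: &[f64]) -> f64 {
    let mid = (bid.as_double() + ask.as_double()) / 2.0;
    weights.iter().map(|w| w * mid).sum()
}
```

Both functions return the same score; the naive one pays one crossing per accessor per weight, so with a 78-entry weight vector it would make 156 calls where the hoisted version makes 2.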

GIL implications

The PyO3 model: the GIL is held whenever Python code runs. The Nautilus pattern - single-threaded kernel, all callbacks dispatched on that thread - means strategy callbacks already run with the GIL held. The implications:

  • Inside on_quote_tick, on_bar, on_data, the strategy holds the GIL. Heavy NumPy / pandas / ML inference work blocks the kernel. If a callback takes 50 ms, that’s 50 ms of no other dispatch.
  • Background services (Tokio I/O, DataFusion queries) do NOT hold the GIL. They run on separate threads and re-enter the kernel via MPSC channels; the kernel callback then re-acquires the GIL when the Python handler runs.
  • Python::attach(|py| { ... }) is the PyO3 incantation for any Rust code that wants to touch Python. It’s effectively with gil: from the Rust side. Any Cortana Rust extension would need this to publish events into the Python message bus.

The takeaway for Cortana: don’t put long synchronous compute inside a strategy callback. Either (a) keep the callback fast and decision-focused, or (b) push the heavy lift to a background actor that emits a result event the strategy subscribes to.

Async-callback-from-Rust patterns

The framework’s pattern, paraphrased from nautilus-architecture.md + the FFI doc:

  1. Rust adapter receives WebSocket frame on a Tokio task (no GIL).
  2. Tokio task parses the frame and constructs a domain event (QuoteTick, custom Data).
  3. Tokio task sends DataEvent::Data(...) through an MPSC channel to the kernel thread.
  4. Kernel thread receives event, writes to Cache, publishes to MessageBus topic.
  5. MessageBus dispatch invokes Python handler - at this point the thread acquires the GIL via Python::attach, calls the bound Python method, releases the GIL on return.

For Cortana, this means UW WebSocket events ingested by a Rust adapter would arrive at on_data(UWFlowAlert) already parsed and validated - no GIL contention with the WebSocket reader, no risk of the strategy callback blocking the network read.
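The five steps can be sketched with std threads and an mpsc channel. This is a hypothetical model: a plain std::thread stands in for the Tokio task, a closure stands in for the bound Python handler, and there is no real GIL; QuoteEvent, spawn_adapter and run_kernel are invented names, and the "SYM,bid,ask" frame format is made up. What it does show is the ownership split: parsing happens off the kernel thread, and only the kernel thread touches the cache and calls handlers.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical domain event produced by the adapter thread.
#[derive(Clone, Debug, PartialEq)]
pub struct QuoteEvent {
    pub instrument: String,
    pub bid_raw: i64,
    pub ask_raw: i64,
}

/// Steps 1-3: the adapter thread parses raw frames and sends typed events.
/// It never touches kernel state; malformed frames are dropped at the edge.
pub fn spawn_adapter(frames: Vec<String>, tx: mpsc::Sender<QuoteEvent>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for frame in frames {
            let parts: Vec<&str> = frame.split(',').collect();
            if parts.len() != 3 {
                continue;
            }
            match (parts[1].parse::<i64>(), parts[2].parse::<i64>()) {
                (Ok(bid_raw), Ok(ask_raw)) => {
                    let event = QuoteEvent { instrument: parts[0].to_string(), bid_raw, ask_raw };
                    if tx.send(event).is_err() {
                        break; // kernel side has shut down
                    }
                }
                _ => continue,
            }
        }
        // `tx` is dropped here, which ends the kernel's receive loop.
    })
}

/// Steps 4-5: the single kernel thread writes the cache, then dispatches
/// one event at a time; a slow handler delays everything queued behind it.
pub fn run_kernel<F>(rx: mpsc::Receiver<QuoteEvent>, mut handler: F) -> HashMap<String, QuoteEvent>
where
    F: FnMut(&QuoteEvent),
{
    let mut cache = HashMap::new();
    for event in rx {
        cache.insert(event.instrument.clone(), event.clone()); // step 4: cache write
        handler(&event); // step 5: dispatch (where PyO3 would attach to Python)
    }
    cache
}
```

With frames "SPY,100,101", "garbage", "SPY,102,103" and "QQQ,50,51", the handler runs three times, the malformed frame never reaches the kernel, and the cache holds the latest SPY quote plus one QQQ quote.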

Cortana MK3 implications

The FFI cost model rarely bites Cortana directly because Cortana writes Python, the framework already pays the FFI cost on our behalf, and IBKR + UW + scoring run at human timescales (signal cadence: seconds, not nanoseconds).

That said, three concrete behaviors should be baked into the M1 strategy implementation to avoid accidentally pessimizing the hot path:

  1. Cache .as_double() results inside callbacks. A scoring feature that references mid-price 30 times shouldn’t call quote.bid_price.as_double() 30 times; pull it once into a local f64 at the top of the handler. Same for quote.ask_price, position sizes, premium values. (nautilus-value-types.md makes the lossy-but-cheap .as_double() API explicit; the FFI page explains why it’s not free - every call is an FFI crossing.)

  2. Use @customdataclass for ScoreUpdate / MetaProb / Regime, not raw Python dicts. The @customdataclass decorator (per nautilus-custom-data.md) generates the PyO3 boilerplate so the custom event passes through the message bus the same way native QuoteTick does - single allocation, typed fields, no per-key getattr costs. A naive “publish a dict” approach would cross the FFI boundary once per field per dispatch.

  3. Never iterate the entire 78-feature scoring vector inside a Rust callback. If we ever author a Rust scoring actor (post-spike, maybe), the right shape is “Rust receives raw quote, computes the 8 meta-selected features in Rust, publishes a typed ScoreUpdate with those 8 floats.” The Python strategy then reads ScoreUpdate fields cheaply because the type is repr(C)-friendly. The wrong shape is “Rust populates a Python dict with 78 entries every tick” - that’s 78 FFI crossings + 78 dict insertions per dispatch.
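The “right shape” from item 3 can be sketched as a fixed-layout Rust value. ScoreUpdate and RawQuote here are hypothetical, and the four computed features (mid, spread, size imbalance, top-of-book depth) are placeholders chosen for illustration, not Cortana's actual meta-selected features; the remaining four slots stay zero in this sketch.

```rust
/// Hypothetical typed event: fixed repr(C) layout, eight feature slots.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ScoreUpdate {
    pub ts_event: u64,
    pub features: [f64; 8],
}

/// Hypothetical raw quote input, prices already converted to f64.
pub struct RawQuote {
    pub ts_event: u64,
    pub bid: f64,
    pub ask: f64,
    pub bid_size: f64,
    pub ask_size: f64,
}

/// Compute the features once, Rust-side, into a single typed value.
pub fn score(q: &RawQuote) -> ScoreUpdate {
    let mid = (q.bid + q.ask) / 2.0;
    let spread = q.ask - q.bid;
    let depth = q.bid_size + q.ask_size;
    // Guard against an empty book so the imbalance never divides by zero.
    let imbalance = if depth > 0.0 { (q.bid_size - q.ask_size) / depth } else { 0.0 };
    ScoreUpdate {
        ts_event: q.ts_event,
        features: [mid, spread, imbalance, depth, 0.0, 0.0, 0.0, 0.0],
    }
}
```

The whole event is 72 bytes with a fixed layout, so publishing it is one typed value per dispatch rather than 78 separate key/value insertions into a Python dict.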

The single most expensive accidental pattern Cortana could hit: calling .as_double() (or any other PyO3 accessor) repeatedly on the same value object inside a tight loop or per-tick handler. Each call crosses the FFI boundary. Mitigation: at the top of every strategy handler, pull every value into a local float/int/bool once, then operate on locals. This pattern is cheap to enforce in code review and removes nearly all accidental FFI cost.

When (if ever) Cortana would author Rust to skip FFI cost:

  • UW WebSocket parsing under flow bursts. If UW emits 1000+ events/sec during a real flow burst and Python parsing falls behind, a Rust adapter under crates/adapters/unusual_whales/ is the documented pattern. The Rust side owns parsing, rate-limit, retry, and emits typed UWFlowAlert events to the bus. Python strategy subscribes; FFI cost is one crossing per event, not one per JSON field.
  • A custom microstructure indicator that needs to inspect every NBBO tick on a deep chain. If profiling shows the per-tick Python callback eating real time, the indicator moves to Rust under crates/indicators/. Defer until measured.

Neither case applies to the M1 scoring strategy. Stick to Python.

Spike DX premise verdict

nautilus-rust.md confirmed “Python-first; Rust core fades into the background.” This page reinforces the same conclusion from the FFI angle: the contract is documented, the framework already follows it internally, and Cortana doesn’t have to read or implement any of it. The discipline that does matter is the cheap one - don’t churn typed-value accessors in tight loops, prefer @customdataclass over ad-hoc dicts. Both are caught by code review, neither requires Rust fluency.


Timeline

  • 2026-05-07 | Cody - Filed during pre-spike concept mastery sweep batch 7 (developer guide).