MemoryWAM: Persistent Memory for Robot Manipulation

A team led by researchers from several labs reports a world action model that mixes recent frames, event-boundary anchors, and compact gist tokens to handle memory-dependent manipulation without the cost of storing full histories.

A preprint posted to arXiv on 18 June 2026 describes MemoryWAM, a world action model (WAM) built to carry memory across long-horizon robotic manipulation tasks without paying the full storage and compute cost of retaining an entire observation history. The paper, listed under the robotics category cs.RO and authored by Sizhe Yang, Juncheng Mu, Tianming Wei, Chenhao Lu, Xiaofan Li, Linning Xu, Zhengrong Xue, Zhecheng Yuan, Dahua Lin, Jiangmiao Pang, and Huazhe Xu, frames the work as a response to a specific structural trade-off in how WAMs handle observations over time.

World action models, as the authors describe them, jointly model visual foresight and actions conditioned on both current and historical observations. That joint formulation is what distinguishes a WAM from a controller that reacts only to the present frame: the model predicts what the scene will look like and what action to take, drawing on what it has already seen. The paper situates this as a promising paradigm for manipulation, then identifies where it strains. Methods tuned for efficient inference condition on only a bounded window of recent observations, which the authors say leaves them struggling in non-Markovian environments where the right action depends on events that have already scrolled out of view. Methods that preserve long histories avoid that blind spot but, per the paper, incur time and space costs that grow substantially with sequence length.

"MemoryWAM uses a hybrid memory design that combines recent frames, event-boundary anchor frames, and compact gist tokens that summarize long-range history." (arXiv:2606.20562)

The hybrid memory is the core of what the paper proposes. Rather than choosing between a short window and a full log, MemoryWAM keeps three kinds of representation at once. Recent frames preserve fine detail about the immediate situation. Event-boundary anchor frames mark points the model treats as significant transitions, retaining specific moments from further back rather than a uniform sample. Compact gist tokens, the third component, summarize long-range history in a condensed form so that older context survives without storing every frame that produced it. The design is an attempt to keep both the resolution of recent observation and the reach of long history while bounding what has to be carried forward.
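As a rough illustration of the three-tier idea, the sketch below keeps a short window of recent frames, a capped list of anchor frames, and a single running summary of everything evicted from the window. It is not the authors' implementation: the class name `HybridMemory`, the parameters `recent_size` and `max_anchors`, the externally supplied `is_event_boundary` flag, and the running mean used in place of learned gist tokens are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Sequence

Vector = List[float]


@dataclass
class HybridMemory:
    """Hypothetical three-tier memory: recent frames, anchors, and a gist."""

    recent_size: int = 4   # assumed window length, not from the paper
    max_anchors: int = 8   # assumed cap on stored anchor frames
    recent: Deque[Vector] = field(default_factory=deque)
    anchors: List[Vector] = field(default_factory=list)
    gist: Vector = field(default_factory=list)
    gist_count: int = 0

    def add(self, frame: Sequence[float], is_event_boundary: bool = False) -> None:
        """Store a new frame; the boundary flag is supplied by the caller."""
        frame = list(frame)
        if is_event_boundary:
            # A boundary frame is kept as an anchor in addition to the window.
            self.anchors.append(frame)
            if len(self.anchors) > self.max_anchors:
                self.anchors.pop(0)  # drop the oldest anchor
        self.recent.append(frame)
        if len(self.recent) > self.recent_size:
            evicted = self.recent.popleft()
            self._fold_into_gist(evicted)

    def _fold_into_gist(self, frame: Vector) -> None:
        """Running mean as a stand-in for a learned summarizer."""
        if not self.gist:
            self.gist = list(frame)
            self.gist_count = 1
            return
        self.gist_count += 1
        n = self.gist_count
        self.gist = [g + (x - g) / n for g, x in zip(self.gist, frame)]

    def size(self) -> int:
        """Number of stored items, which stays bounded as the episode grows."""
        return len(self.recent) + len(self.anchors) + (1 if self.gist else 0)
```

With these defaults the store never holds more than 13 items (4 recent, 8 anchors, 1 gist) however long the episode runs, which is the bounding property the paragraph above describes.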

How the memory is retrieved

A memory store is only useful if the model can pull the right part of it at the right moment. The paper describes a tailored attention mechanism that enables retrieval of both detailed short-term context and compressed long-term context. In the authors' framing, this supports memory-dependent decision-making while reducing inference latency and GPU memory usage. The claim being made is not that the model remembers more in absolute terms than a full-history method, but that it remembers what matters at a cost that does not balloon with the length of the task. The two halves of the retrieval, detailed short-term and compressed long-term, map onto the two failure modes the paper opened with: the bounded-window method that forgets too much and the long-history method that costs too much.
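The abstract does not specify how the tailored attention works, so the sketch below substitutes plain scaled dot-product attention with a single query reading over the concatenation of all three memory tiers. The function names `attend` and `read_memory`, and the per-tier weight report, are hypothetical and serve only to show how one read can draw on detailed and compressed context together.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def attend(query: Vector, keys: Sequence[Vector],
           values: Sequence[Vector]) -> Tuple[Vector, List[float]]:
    """Standard scaled dot-product attention for one query (needs >= 1 key)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return out, weights


def read_memory(query: Vector, recent: Sequence[Vector],
                anchors: Sequence[Vector],
                gist: Sequence[Vector]) -> Tuple[Vector, Dict[str, float]]:
    """Attend over all tiers at once and report attention mass per tier."""
    tokens = list(recent) + list(anchors) + list(gist)
    out, weights = attend(query, tokens, tokens)  # tokens act as keys and values
    n_r, n_a = len(recent), len(anchors)
    tier_mass = {
        "recent": sum(weights[:n_r]),
        "anchors": sum(weights[n_r:n_r + n_a]),
        "gist": sum(weights[n_r + n_a:]),
    }
    return out, tier_mass
```

A query that resembles an anchor frame puts most of its weight on the anchor tier, while a query about the immediate scene draws mainly on recent frames. A mechanism tailored to the three representations would presumably treat them differently rather than as one flat token list.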

Non-Markovian manipulation is the setting where this distinction is meant to pay off. A Markovian task is one in which the current observation contains everything needed to choose the next action; many manipulation problems are not like that. If a robot has to remember that it already placed an object somewhere out of frame, or that an earlier step changed the state of the workspace, a model conditioned only on recent frames has no access to the information that determines the correct move. The authors position MemoryWAM's event-boundary anchors and gist tokens as the mechanism for carrying exactly that kind of cross-time dependency.
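A toy example makes the failure concrete. In the sketch below, a cue ("left" or "right") appears once at the start of an episode and is followed by a run of uninformative observations, after which the correct action depends on the cue. The task, the function names, and the rule that the first non-blank observation counts as the anchor are all invented for illustration and are not drawn from the paper's task suite.

```python
from typing import List, Optional

CUES = ("left", "right")


def make_episode(cue: str, delay: int) -> List[str]:
    """One cue observation followed by `delay` blank observations."""
    return [cue] + ["blank"] * delay


def windowed_policy(obs: List[str], window: int) -> Optional[str]:
    """Act only on the last `window` observations; None means 'cannot tell'."""
    for o in obs[-window:]:
        if o in CUES:
            return o
    return None


def anchored_policy(obs: List[str], window: int) -> Optional[str]:
    """Fall back to a stored anchor when the window holds no cue."""
    # Assumption: the first non-blank observation is kept as the anchor.
    anchor = next((o for o in obs if o != "blank"), None)
    recent = windowed_policy(obs, window)
    return recent if recent is not None else anchor
```

With a delay of 10 and a window of 4, `windowed_policy` returns `None` because the cue has scrolled out of view, while `anchored_policy` still returns the cue. Once the delay is short enough for the cue to sit inside the window, the two policies agree.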

What the paper reports

The evaluation, as described in the abstract, spans long-horizon, memory-dependent manipulation tasks in both simulation and the real world. The authors state that MemoryWAM outperforms strong vision-language-action (VLA) and WAM baselines on those tasks while maintaining favorable computational efficiency. The comparison set is notable for including both VLA models, which couple language and vision to action, and other world action models, the same family MemoryWAM belongs to. The reported result is therefore a claim of improvement against both a broad multimodal approach and the closest methodological neighbors.

The abstract does not report the specific task suite, the numerical margins, the hardware, or the baseline model names. Those details would sit in the full paper rather than the summary, and this brief reports only what the posted record states. What the record establishes is the shape of the contribution: a memory architecture with three tiers, a retrieval mechanism designed to read all three, and a reported outcome on memory-dependent manipulation that the authors characterize as favorable on both task performance and compute.

The framing the authors use places MemoryWAM within an active line of work on world models for embodied control, where the question of how to handle history efficiently recurs across methods. By naming the trade-off explicitly—efficient short windows versus expensive long histories—and proposing a hybrid as the resolution, the paper stakes out a position on that question rather than only reporting a benchmark number.

Two design choices in the abstract carry most of that position. The first is the decision to retain event-boundary anchor frames rather than a uniform or recency-based sample of older observations. Anchoring memory to transitions implies a model that treats some moments as more worth keeping than others, which is a different stance from methods that decay or truncate history by age alone. The second is the use of gist tokens to summarize long-range history in compact form. Compression of older context is what lets the method claim favorable GPU memory usage against full-history baselines: the long tail of the task is represented, but not stored frame by frame. The tailored attention mechanism then has to read across these heterogeneous representations—detailed recent frames, sparse anchors, and compressed gist—within a single retrieval step, which the paper presents as the enabling piece for memory-dependent decisions at bounded cost.
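The cost argument can be made concrete with back-of-the-envelope token counts. The sketch below assumes dense self-attention, whose pairwise work grows with the square of the token count, and uses invented budgets (64 tokens per frame, 4 recent frames, 8 anchors, 16 gist tokens). None of these figures come from the paper, and the function names are hypothetical.

```python
def full_history_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Tokens carried when every past frame is retained."""
    return num_frames * tokens_per_frame


def hybrid_tokens(num_frames: int, tokens_per_frame: int, recent: int = 4,
                  max_anchors: int = 8, gist_tokens: int = 16) -> int:
    """Tokens carried under an assumed recent + anchor + gist budget."""
    # Simplification: at most `recent + max_anchors` full frames are kept.
    kept_frames = min(num_frames, recent + max_anchors)
    return kept_frames * tokens_per_frame + gist_tokens


def attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense self-attention over n_tokens."""
    return n_tokens * n_tokens
```

Under these assumed budgets, a 1,000-frame episode carries 64,000 tokens with full history and 784 with the hybrid store, and the hybrid figure stays at 784 for a 10,000-frame episode. That flat curve is what bounded cost means in this context. The actual savings depend on choices the abstract does not state.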

Readers tracking the area can follow the canonical record at the arXiv abstract page, where the authors, category, and full summary are listed, and where the full PDF is linked for the methodological detail—task suite, baselines, and numerical margins—that the abstract leaves out.
