# Caching Layers
Updated: 2025-09-16
Type: Explanation
This document outlines a multi-layer caching strategy for ARW, blending research-informed ideas with practical, incremental implementations. It aims for high ROI first (latency, throughput, stability) while keeping privacy and determinism front and center.
## Layers
- LLM inference-level (high ROI)
    - Prefix/KV reuse for repeated system/prompt prefixes. With llama.cpp, enable `cache_prompt: true` (client) and a persistent `--prompt-cache` (server). With vLLM, rely on PagedAttention and automatic prefix caching (see the request sketch after this list).
    - Memory policy: schedule batches to share prefixes and keep KV blocks defragmented; prefer fixed-size paging to reduce fragmentation.
- Tool/action cache (Bazel-style)
    - Treat each tool invocation as a deterministic action keyed by a content hash of: tool id/version, canonical input (RFC 8785 JSON), and an environment signature (see the key-derivation sketch after this list).
    - Store outputs in a content-addressed store (CAS) and map `action_key → digest`. Replay on hit; execute on miss. Emit `tool.cache` events and publish lightweight counters.
- Semantic response cache (planned)
    - Cache Q→A pairs per project/user, keyed by embeddings, with a verifier gate. Reuse only when a thresholded match passes a quick check; otherwise seed the model with the cached answer for speculative decoding.
    - Context-aware keys include turn-level features and referenced tool outputs.
- Retrieval caches (planned)
    - Cache frequent ANN results and maintain a negative cache for “no useful doc” to save vector queries. Use LSH/SimHash buckets as a prefilter before expensive comparisons (see the SimHash sketch after this list).
- Read-models over SSE
    - Maintain small, incremental read-models in process (e.g., route stats, models metrics) and publish RFC 6902 JSON Patch deltas over SSE (see the diff sketch after this list).
    - Coalesce bursts (250 ms default) and publish an idle refresh (2 s default). Resume with `Last-Event-ID` via the standard SSE API.
- In-memory + on-disk persistence
    - In-memory: modern, robust eviction (W-TinyLFU/S3-FIFO). In ARW, we use Moka (W-TinyLFU) for the Action Cache (see the Moka sketch after this list).
    - On-disk: content-addressed store under `{state_dir}`. Consider RocksDB with uncompressed+compressed block caches and a secondary flash tier for large, hot blobs.
- Edge & HTTP caching
    - Emit strong validators and immutable `Cache-Control` for digest-addressed blobs. ARW serves `ETag: "<sha256>"`, `Last-Modified`, and `public, max-age=31536000, immutable` for `/admin/models/by-hash/:sha256`, and honors `If-None-Match` (see the conditional-request sketch after this list).
    - See also: HTTP Caching Semantics
- Stampede protection: coalesce identical misses with a singleflight mechanism (see the singleflight sketch after this list).
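The sketches below make several of the bullets above concrete. First, prefix/KV reuse: a minimal request body for llama.cpp's HTTP server `/completion` endpoint, assuming the `serde_json` crate; the prompt and `n_predict` values are placeholders.

```rust
use serde_json::json;

fn main() {
    // Body for llama.cpp's /completion endpoint. `cache_prompt: true`
    // asks the server to keep the evaluated prompt in the KV cache so
    // a repeated system/prompt prefix is not re-evaluated next time.
    let body = json!({
        "prompt": "SYSTEM: You are a careful assistant.\nUSER: summarize the notes",
        "n_predict": 256,
        "cache_prompt": true
    });
    println!("{}", serde_json::to_string_pretty(&body).unwrap());
}
```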
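The tool/action cache turns on a stable key. Below is a minimal derivation sketch, assuming the `sha2` and `hex` crates; the field set matches the bullet above, but the length-prefix framing and the example values are illustrative, and real canonicalization would follow RFC 8785.

```rust
use sha2::{Digest, Sha256};

/// Derive a Bazel-style action key from tool id/version, canonical
/// input JSON, and an environment signature.
fn action_key(tool_id: &str, tool_version: &str, canonical_input: &str, env_sig: &str) -> String {
    let mut h = Sha256::new();
    // Length-prefix each field so concatenations cannot collide.
    for field in [tool_id, tool_version, canonical_input, env_sig] {
        h.update((field.len() as u64).to_be_bytes());
        h.update(field.as_bytes());
    }
    hex::encode(h.finalize())
}

fn main() {
    let key = action_key(
        "http.fetch",                       // tool id (illustrative)
        "1.2.0",                            // tool version
        r#"{"url":"https://example.com"}"#, // canonical (RFC 8785) input
        "env:sha256:abc123",                // environment signature
    );
    // Look this key up in the `action_key → digest` map; replay on hit.
    println!("action_key = {key}");
}
```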
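For the retrieval-cache prefilter, a 64-bit SimHash plus a Hamming-distance check can reject most non-matches before any embedding comparison. This is a std-only sketch; the whitespace tokenizer and the 3-bit threshold are arbitrary choices, not ARW's.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 64-bit SimHash: hash each token, then take the sign of the
/// per-bit vote across all token hashes.
fn simhash(text: &str) -> u64 {
    let mut votes = [0i32; 64];
    for token in text.split_whitespace() {
        let mut h = DefaultHasher::new();
        token.hash(&mut h);
        let t = h.finish();
        for (i, vote) in votes.iter_mut().enumerate() {
            *vote += if (t >> i) & 1 == 1 { 1 } else { -1 };
        }
    }
    votes.iter().enumerate().fold(0u64, |acc, (i, &v)| {
        if v > 0 { acc | (1u64 << i) } else { acc }
    })
}

/// Prefilter: only candidates within a small Hamming distance
/// proceed to the expensive vector comparison.
fn near_duplicate(a: u64, b: u64, max_dist: u32) -> bool {
    (a ^ b).count_ones() <= max_dist
}

fn main() {
    let a = simhash("how do I enable the tool action cache");
    let b = simhash("how do I enable the tool action cache ?");
    println!("prefilter pass: {}", near_duplicate(a, b, 3));
}
```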
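For read-models, the delta published over SSE is just the diff between two snapshots. This sketch uses the `json-patch` crate's `diff` helper; the read-model shape shown is made up, and coalescing and SSE framing are omitted.

```rust
use serde_json::json;

fn main() {
    // Two snapshots of a small read-model (illustrative shape).
    let prev = json!({ "routes": { "/state/route_stats": { "hits": 41, "p95_ms": 12 } } });
    let next = json!({ "routes": { "/state/route_stats": { "hits": 42, "p95_ms": 11 } } });

    // json_patch::diff produces an RFC 6902 patch; only the changed
    // leaves are emitted, which keeps SSE deltas small.
    let patch = json_patch::diff(&prev, &next);
    println!("{}", serde_json::to_string(&patch).unwrap());
    // => [{"op":"replace","path":"/routes/~1state~1route_stats/hits","value":42}, ...]
}
```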
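The in-memory front of the Action Cache uses Moka's W-TinyLFU eviction. A minimal sketch with the `moka` crate's sync cache; the capacity and TTL literals are illustrative stand-ins for `ARW_TOOLS_CACHE_CAP` and `ARW_TOOLS_CACHE_TTL_SECS`.

```rust
use std::time::Duration;
use moka::sync::Cache;

fn main() {
    // W-TinyLFU-backed map from action key to CAS digest.
    let cache: Cache<String, String> = Cache::builder()
        .max_capacity(10_000)                   // cf. ARW_TOOLS_CACHE_CAP
        .time_to_live(Duration::from_secs(600)) // cf. ARW_TOOLS_CACHE_TTL_SECS
        .build();

    cache.insert("action:deadbeef".into(), "sha256:cafef00d".into());
    assert_eq!(cache.get("action:deadbeef"), Some("sha256:cafef00d".into()));
    println!("hit");
}
```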
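Because blobs are digest-addressed, the HTTP validator logic collapses to a string comparison: the ETag is the content hash, and a match means the client's copy can never be stale. A framework-free sketch of that decision; weak validators and header edge cases are glossed over.

```rust
/// Decide between 304 Not Modified and 200 OK for a digest-addressed
/// blob: `sha256` is the content hash, `if_none_match` the raw
/// `If-None-Match` header, if any.
fn respond(sha256: &str, if_none_match: Option<&str>) -> u16 {
    let etag = format!("\"{sha256}\"");
    match if_none_match {
        // `*` or a matching entity tag: the client already has it.
        Some(h) if h == "*" || h.split(',').any(|t| t.trim() == etag) => 304,
        // Otherwise serve the body with `ETag`, `Last-Modified`, and
        // `Cache-Control: public, max-age=31536000, immutable`; the
        // content at a digest-addressed URL can never change.
        _ => 200,
    }
}

fn main() {
    assert_eq!(respond("deadbeef", Some("\"deadbeef\"")), 304);
    assert_eq!(respond("deadbeef", Some("\"other\"")), 200);
    assert_eq!(respond("deadbeef", None), 200);
    println!("validator checks pass");
}
```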
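Finally, stampede protection. A std-only singleflight sketch: concurrent callers for the same key share one computation via `OnceLock::get_or_init`, which blocks waiters until the first caller finishes. A production version (ARW's sits in front of the Action Cache) would also store the result and handle errors.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

struct SingleFlight {
    inflight: Mutex<HashMap<String, Arc<OnceLock<String>>>>,
}

impl SingleFlight {
    fn new() -> Self {
        Self { inflight: Mutex::new(HashMap::new()) }
    }

    /// First caller for `key` runs `compute`; concurrent callers block
    /// on the same cell and reuse its value instead of recomputing.
    fn get_or_compute(&self, key: &str, compute: impl FnOnce() -> String) -> String {
        let cell = {
            let mut map = self.inflight.lock().unwrap();
            map.entry(key.to_string())
                .or_insert_with(|| Arc::new(OnceLock::new()))
                .clone()
        };
        let value = cell.get_or_init(compute).clone();
        // Drop the entry so a later miss (after the result has been
        // stored in the cache) can recompute with fresh inputs.
        self.inflight.lock().unwrap().remove(key);
        value
    }
}

fn main() {
    let sf = Arc::new(SingleFlight::new());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let sf = sf.clone();
            std::thread::spawn(move || {
                sf.get_or_compute("tool:deadbeef", || {
                    println!("expensive tool call (coalesced)");
                    "result".to_string()
                })
            })
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), "result");
    }
}
```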
## What’s implemented in ARW today
- llama.cpp client requests include `cache_prompt: true`, enabling KV reuse.
- Tool Action Cache with Moka front + disk CAS back; RFC‑8785‑like canonicalization, singleflight, counters, Prometheus metrics, and admin stats.
- CAS blob serving with validators and 304 handling.
- Read‑models and deltas: models metrics and route stats publish RFC‑6902 patches with coalescing; UI panels consume them live.
## Metrics & measurement
- Report hit ratio (by layer), P95/P99 latency saved, bytes saved (post‑compression), stampede suppression rate, semantic false‑hit rate, and recompute budget.
- In ARW:
    - Tool Action Cache: `/admin/tools/cache_stats`, `tool.cache` events, and `arw_tools_cache_*` in `/metrics`.
    - Models metrics: `/state/models_metrics` and `arw_models_download_*` in `/metrics`.
    - Route stats: `/state/route_stats` and overlays in `/debug` (p95/EWMA/hits/errors).
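For hit ratio specifically, the derivation from monotonic counters (the shape behind `arw_tools_cache_*`) is a one-liner; the struct and field names in this sketch are illustrative, not ARW's.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-layer counters; monotonic, like Prometheus counters.
struct CacheStats {
    hits: AtomicU64,
    misses: AtomicU64,
}

impl CacheStats {
    fn hit_ratio(&self) -> f64 {
        let hits = self.hits.load(Ordering::Relaxed) as f64;
        let misses = self.misses.load(Ordering::Relaxed) as f64;
        if hits + misses == 0.0 { 0.0 } else { hits / (hits + misses) }
    }
}

fn main() {
    let stats = CacheStats { hits: AtomicU64::new(980), misses: AtomicU64::new(20) };
    // 980 / (980 + 20) = 0.98
    println!("hit ratio = {:.2}", stats.hit_ratio());
}
```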
## Configuration knobs
- Action Cache: `ARW_TOOLS_CACHE_TTL_SECS`, `ARW_TOOLS_CACHE_CAP`.
- Route stats: `ARW_ROUTE_STATS_COALESCE_MS`, `ARW_ROUTE_STATS_PUBLISH_MS`.
- Models metrics: `ARW_MODELS_METRICS_COALESCE_MS`, `ARW_MODELS_METRICS_PUBLISH_MS`.
- CAS blob serving: no special knobs; digest semantics apply automatically.
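A sketch of how knobs like these are typically consumed at startup; the fallback values are illustrative (only the 250 ms coalesce default is documented above).

```rust
use std::env;
use std::time::Duration;

/// Read a numeric knob from the environment, falling back to a default.
fn env_u64(name: &str, default: u64) -> u64 {
    env::var(name).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    // Illustrative defaults; consult the ARW docs for the real ones.
    let ttl = Duration::from_secs(env_u64("ARW_TOOLS_CACHE_TTL_SECS", 600));
    let cap = env_u64("ARW_TOOLS_CACHE_CAP", 10_000);
    let coalesce = Duration::from_millis(env_u64("ARW_ROUTE_STATS_COALESCE_MS", 250));
    println!("ttl={ttl:?} cap={cap} coalesce={coalesce:?}");
}
```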
## Next steps (tracked in Backlog)
- Verified semantic cache (per‑project, per‑user; privacy‑preserving learning of thresholds).
- Context‑aware keys and SimHash prefilter for semantic caches.
- RocksDB tier for persistent hot sets (tools/semantic/embeddings) with Zstd dictionaries for small JSON types.
- Peer/edge CAS for artifacts (opt-in; IPLD/libp2p-style gossip for multi-host dev).