Subconscious cache lets the inference system detect context engineering (pruning of earlier tokens) in agent reasoning runs by matching both the prefix and the suffix of a cached token sequence against each new input. The goal is to preserve the memory of the pruned tokens implicitly in the latent states of the suffix tokens while improving the cache hit rate.

How It Works

To hit the subconscious cache, the cached tokens and the new input must satisfy two criteria:
  1. The cached chain can be split exactly into three consecutive sections: A, B, C.
  2. The new input chain can be split exactly into three consecutive sections: A, C, D, where A and C match the prefix A and the suffix C of the cached chain, and len(C) > threshold. We usually set threshold = 8 tokens to avoid matching the suffix of chat templates.
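
The two criteria can be illustrated with a small matcher over token ID lists. This is only a sketch of the rule as stated above, not the inference system's actual implementation: the function name `find_subconscious_match` is hypothetical, the search is deliberately naive, and it assumes the pruned section B must be non-empty (an empty B is treated as an ordinary prefix cache hit).

```python
from typing import Optional, Sequence, Tuple


def find_subconscious_match(
    cached: Sequence[int],
    new: Sequence[int],
    threshold: int = 8,
) -> Optional[Tuple[int, int]]:
    """Return (len(A), len(C)) if cached = A+B+C and new = A+C+D, else None.

    Hypothetical helper: assumes B is non-empty and len(C) > threshold.
    """
    # A must be a shared prefix, so it cannot be longer than the
    # longest common prefix of the two chains.
    max_a = 0
    for x, y in zip(cached, new):
        if x != y:
            break
        max_a += 1

    # Prefer the longest prefix A, then the longest suffix C.
    for a in range(max_a, -1, -1):
        # Leave at least one cached token for the pruned section B.
        max_c = min(len(cached) - a - 1, len(new) - a)
        for c in range(max_c, threshold, -1):
            if list(cached[len(cached) - c:]) == list(new[a:a + c]):
                return a, c
    return None


# Toy token IDs: 10 prefix tokens, 5 pruned tokens, 12 suffix tokens.
A = list(range(100, 110))
B = list(range(200, 205))
C = list(range(300, 312))
D = [400, 401]

print(find_subconscious_match(A + B + C, A + C + D))  # (10, 12)
```

With a suffix C of only 5 tokens the same call returns None, because len(C) does not exceed the threshold of 8.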

Manually Triggering Subconscious Cache

The Subconscious LLM API enables auto compaction by default. In auto-compaction mode, developers can send any message list to the LLM API, and the inference system detects prunable messages. Message pruning performed in auto mode hits the subconscious cache automatically. If you want to hit the subconscious cache manually by controlling the context yourself, disable auto compaction in the chat template kwargs:
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.subconscious.dev/v1",
)

response = client.chat.completions.create(
    model="subconscious/tim-qwen3.6-27b",
    messages=[{"role": "user", "content": "What is 127 * 849 + 3621?"}],
    extra_body={
        "chat_template_kwargs": {"enable_auto_compaction": False},
    },
)

print(response.choices[0].message.content)

When to Turn Off Auto Compaction

If you turn off auto compaction, you must manually construct inputs that can hit the subconscious cache. Make sure you prune only one contiguous token sequence from the message list. If nothing is pruned, the new input simply hits the prefix cache. If more than one chunk is pruned, no suffix satisfies the subconscious cache rules.

Use auto compaction for:
  • Programming tasks, where assistant-tool-user messages keep growing in the message list
  • Multi-turn conversations, where a rigid context-pruning rule cannot handle arbitrary user inputs
Skip auto compaction for:
  • ReAct multi-modal reasoning: subconscious cache works perfectly when you keep only the latest turns / images in the message list
  • Other applications where you need to control context engineering carefully
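
When you manage context yourself, the single-chunk rule can be checked before sending a request. The sketch below is a minimal illustration at the message level. The helper names `prune_one_span` and `count_pruned_chunks` are hypothetical and not part of any Subconscious SDK. It assumes that removing a contiguous run of messages also removes a contiguous run of tokens after the chat template is applied, and that messages in the list are distinct.

```python
from typing import Dict, List

Message = Dict[str, str]


def prune_one_span(messages: List[Message], start: int, end: int) -> List[Message]:
    """Return a copy of messages with the contiguous slice [start, end) removed."""
    if not 0 <= start < end <= len(messages):
        raise ValueError("expected 0 <= start < end <= len(messages)")
    return messages[:start] + messages[end:]


def count_pruned_chunks(original: List[Message], kept: List[Message]) -> int:
    """Count contiguous runs of messages in original that are missing from kept.

    kept must be an in-order subset of original. Assumes distinct messages.
    """
    chunks = 0
    in_gap = False
    k = 0
    for msg in original:
        if k < len(kept) and msg == kept[k]:
            k += 1
            in_gap = False
        else:
            if not in_gap:
                chunks += 1  # a new pruned run starts here
            in_gap = True
    if k != len(kept):
        raise ValueError("kept is not an in-order subset of original")
    return chunks


history = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "Find the bug."},
    {"role": "assistant", "content": "Reading file a.py"},
    {"role": "user", "content": "<contents of a.py>"},
    {"role": "assistant", "content": "Reading file b.py"},
    {"role": "user", "content": "<contents of b.py>"},
]

# Drop the a.py exchange (messages 2 and 3) as one contiguous span.
pruned = prune_one_span(history, 2, 4)
print(count_pruned_chunks(history, pruned))  # 1
```

The pruned list, plus any new messages appended at the end, would then be sent with enable_auto_compaction set to False. The messages kept after the pruned span still need to contribute more than the threshold number of tokens for the suffix to match.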