Perfect Agentic Startup OS: Context, Memory, Workflow, and Robustness

Generated: 2026-05-24

Executive Summary

If we were starting from scratch, I would not build around any single vendor or agent framework. I would build a small, vendor-neutral startup operating system with:

A thin always-on gateway for Discord/Telegram/web/mobile.
A strict context router that loads only identity, safety, current task, and relevant project memory.
A progressive memory system with raw logs -> extracted facts -> durable decisions/specs -> synthesized project memory -> graph links.
A gstack-style workflow pipeline where every stage creates an artifact consumed by the next stage.
A graph-backed project intelligence layer for code, docs, tasks, entities, dependencies, and decisions.
A task/governance plane with ownership, quality gates, evals, and cost tracking.
Multiple execution runtimes: Claude Code, Codex, OpenClaw sessions, Hermes, browser automation, and local scripts.

OpenClaw can still be the foundational gateway/control surface, but it should not be the whole architecture. The real foundation should be the WiderWings Agentic OS, with OpenClaw as one runtime layer.

Current Diagnosis

The current setup works, but it relies too much on prompt files and too little on a real memory system.

Local audit:

MEMORY.md: 508 lines
AGENTS.md: 278 lines
PROCESS.md: 288 lines
BRAIN.md: 191 lines
SOUL.md: 142 lines
HEARTBEAT.md: 115 lines

Across main and agent workspaces, there are about 3,972 lines of AGENTS/SOUL/MEMORY/BRAIN-style context files. That creates token waste, repeated instructions, drift, and imprecise recall.

The core problem: too much "always loaded context" and not enough "retrieved context."

Research Findings

gstack

Garry Tan's gstack is not mainly a tool collection. It is a development operating discipline. The important pattern is:

Think -> Plan -> Build -> Review -> Test -> Ship -> Reflect

Each stage writes a concrete artifact that the next stage reads. This reduces ambiguity, enforces quality gates, and makes agents less dependent on human re-explanation.

What to copy:

Office-hours-style forcing questions about the product before building.
CEO/product review before engineering work.
Engineering review before implementation.
Automated review that fixes obvious issues and escalates judgment calls.
QA/browser testing as a required gate.
Ship/release docs.
Retro/lesson extraction.

Source: https://github.com/garrytan/gstack

gbrain

gbrain is more important than gstack for our setup. It treats memory as a retrieval and synthesis system, not a pile of notes. It offers local-first storage options, a Postgres/Supabase path for scale, and exposure over MCP.

What to copy:

Memory retrieval should return synthesized answers with citations, not random matching notes.
Memory should include source provenance.
Memory should separate facts, decisions, tasks, docs, and relationships.
Agents should query memory before answering prior-context questions.
The system should know when memory is stale or insufficient.

Source: https://github.com/garrytan/gbrain

Graphify / Graph-Backed Project Intelligence

"Graphify" is ambiguous in the ecosystem, but the useful pattern is clear: graph-based code/project understanding. Tools in this category parse files/repos/docs and build graphs of entities, dependencies, references, and relationships.

For an agentic startup, the relevant use is not generic "knowledge graph" hype. It is:

Code graph: files, functions, routes, schemas, imports, tests, owners.
Product graph: projects, features, personas, pages, funnels, experiments.
Memory graph: decisions, specs, people, tasks, projects, artifacts.
Infrastructure graph: repos, Supabase projects, env vars, deployments, domains.

This is how we stop agents from repeatedly scanning entire repos or relying on bloated prompt files.

Sources: https://github.com/safishamsi/graphify, https://github.com/getzep/graphiti

OpenClaw

OpenClaw's strength is not perfect memory. It is always-on multi-channel agent routing and self-hosted control. It is a good gateway/runtime layer.

Weaknesses:

Context files can become bloated.
Memory write/recall is policy-driven, not reliably enforced by architecture.
Project isolation depends too much on instructions.
Workflow discipline is not strongly encoded by default.
Observability and evals are still too manual.

Recommendation: keep OpenClaw as a gateway, but move memory, workflow, and governance into explicit systems around it.

Sources: https://docs.openclaw.ai/, https://github.com/openclaw/openclaw

Hermes

Hermes is worth piloting for self-improving workflows. Its learning loop and memory emphasis are directly aligned with what Henry wants. But self-improvement should be gated by evals and approval before production authority.

Recommendation: use Hermes as a sandboxed learning/execution backend first, not as the whole foundation on day one.

Sources: https://github.com/NousResearch/hermes-agent, https://hermes-agent.app/en/docs

The Ideal Architecture From Scratch

1. Agentic Control Plane

Purpose: route work, apply policy, track tasks, select runtime, enforce project boundaries.

It should be vendor-neutral. It can call OpenClaw, Claude Code, Codex, Hermes, browser automation, local scripts, GitHub, Supabase, and Second Brain.

Key components:

ProjectRegistry: source of truth for repos, Supabase IDs, env boundaries, domains.
ContextRouter: decides what context to load.
MemoryRouter: decides what memory to query/save.
TaskRouter: creates/assigns tasks and tracks owners.
RuntimeRouter: chooses Claude Code, Codex, Hermes, or OpenClaw session.
PolicyEngine: blocks unsafe/external/destructive actions.
Evaluator: runs quality checks before accepting output.
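
To make the routing idea concrete, here is a minimal sketch of how a PolicyEngine check and a RuntimeRouter choice could sit in one function. The task types, runtime names, blocked actions, and the `WorkRequest` shape are all assumptions for illustration, not an existing API.

```python
from dataclasses import dataclass

# Hypothetical mapping from task type to runtime. A real table would come
# from measured results rather than being hard-coded.
RUNTIME_BY_TASK_TYPE = {
    "code_fix": "claude_code",
    "research": "hermes",
    "channel_reply": "openclaw",
}

# Hypothetical actions that the PolicyEngine refuses without approval.
BLOCKED_ACTIONS = {"drop_table", "force_push", "send_external_email"}


@dataclass(frozen=True)
class WorkRequest:
    project_id: str
    task_type: str
    action: str
    approved: bool = False


def check_policy(request: WorkRequest) -> bool:
    """Return True if the action may proceed."""
    return request.action not in BLOCKED_ACTIONS or request.approved


def route(request: WorkRequest, known_projects: set[str]) -> str:
    """Apply project-boundary and policy checks, then choose a runtime."""
    if request.project_id not in known_projects:
        raise ValueError(f"unknown project: {request.project_id}")
    if not check_policy(request):
        raise PermissionError(f"action needs approval: {request.action}")
    # Fall back to the gateway runtime for unrecognized task types.
    return RUNTIME_BY_TASK_TYPE.get(request.task_type, "openclaw")
```

The point of the sketch is ordering: boundary check first, policy second, runtime selection last, so no runtime ever sees a request that failed a gate.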

2. Context Management

Context should be layered:

Hot Context

Always loaded. Tiny. Target: under 1,500 tokens.

identity: "Bob, chief of staff"
current user preferences
safety rules
current channel/project
context-loading algorithm

Warm Context

Loaded by task type/project. Target: 1,000-4,000 tokens.

project brief
current sprint
relevant decisions
active tasks
role-specific checklist

Cold Context

Retrieved only when needed.

full historical memory
previous conversations
old specs
research reports
codebase maps
full docs

This replaces "read everything every session" with "read the index, then retrieve."
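
A rough sketch of how the layering could be enforced in code, assuming a crude four-characters-per-token estimate and priority-ordered snippets. The function names and the estimate are my assumptions; only the budgets come from the targets above.

```python
HOT_BUDGET = 1_500   # tokens, from the hot-context target
WARM_BUDGET = 4_000  # tokens, upper end of the warm-context target


def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def pack(snippets: list[str], budget: int) -> list[str]:
    """Keep snippets in priority order until the budget would be exceeded."""
    kept, used = [], 0
    for snippet in snippets:
        cost = estimate_tokens(snippet)
        if used + cost > budget:
            break
        kept.append(snippet)
        used += cost
    return kept


def build_context(
    hot: list[str],
    warm_by_project: dict[str, list[str]],
    project: str,
) -> list[str]:
    """Hot context always loads; warm context loads only for the active project.

    Cold context is never returned here. It is fetched on demand through
    memory queries.
    """
    warm = warm_by_project.get(project, [])
    return pack(hot, HOT_BUDGET) + pack(warm, WARM_BUDGET)
```

A real router would use the model's tokenizer instead of a character count, but the budget logic would look the same.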

3. Progressive Memory

The memory system should have levels:

Level 0: Raw Logs

Everything can be stored cheaply:

chat transcripts
command outputs
agent logs
diffs
screenshots

Retention can be long, but these should almost never be hot-loaded.

Level 1: Extracted Facts

Automatic extraction from raw logs:

"MedSchools.ai Supabase ID is X"
"Apple JWT expires on date Y"
"Henry prefers direct communication"
"This bug happened because Z"

Facts need project_id, source, timestamp, confidence.
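
As a sketch, a fact record with those four required fields might look like the following. The `statement` field, the 0.0 to 1.0 confidence scale, and the source string format are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Fact:
    project_id: str
    statement: str
    source: str          # e.g. a message ID or file path (format is an assumption)
    timestamp: datetime
    confidence: float    # assumed scale: 0.0 (guess) to 1.0 (verified)

    def __post_init__(self) -> None:
        # Reject facts that cannot be traced back to a project and a source.
        if not self.project_id or not self.source:
            raise ValueError("facts need a project_id and a source")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


fact = Fact(
    project_id="medschools-ai",
    statement="Henry prefers direct communication",
    source="discord:message/hypothetical-id",
    timestamp=datetime(2026, 5, 24, tzinfo=timezone.utc),
    confidence=0.9,
)
```

Validating at construction time means a fact without provenance never reaches storage.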

Level 2: Decisions

Human or agent-approved choices:

what was chosen
why
alternatives rejected
date
owner
reversal trigger

Level 3: Specs and Runbooks

Stable documents:

architecture specs
workflows
checklists
deployment runbooks
project contracts

Level 4: Synthesized Memory

Periodic consolidation:

"Current state of MedSchools.ai"
"Current launch blockers"
"Current agent team design"
"Lessons from last 30 days"

Level 5: Graph

Entities and relationships:

Project -> Repo -> Supabase -> Domain -> Deployments
Decision -> Project -> Feature -> Files -> Tests
Person -> Preference -> Conversation -> Task
Agent -> Skill -> Task -> Output -> Evaluation

This is how recall becomes precise instead of bloated.
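
A small sketch of the kind of query the graph level enables, using an in-memory adjacency list and breadth-first search. The node names and edges are made up for illustration; a real graph would be built by an indexer and stored in a database.

```python
from collections import defaultdict, deque

# Hypothetical edges in the Project -> Repo -> Supabase -> Domain chain.
EDGES = [
    ("project:medschools", "repo:medschools-web"),
    ("repo:medschools-web", "supabase:example-id"),
    ("supabase:example-id", "domain:example.test"),
    ("project:hedge", "repo:hedge-app"),
]

graph = defaultdict(list)
for src, dst in EDGES:
    graph[src].append(dst)


def reachable(start: str) -> list[str]:
    """Breadth-first walk returning every node reachable from start."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def owners_of(target: str) -> list[str]:
    """Answer 'which project uses this resource?' via reachability."""
    projects = [node for node in list(graph) if node.startswith("project:")]
    return [project for project in projects if target in reachable(project)]
```

With this shape, a question like "which project uses this Supabase ID?" becomes a lookup over a few edges instead of a scan over prompt files.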

4. Memory Write Policy

Stop relying on agents remembering to save memory. Make memory writes part of task completion.

Every task must end with a structured TaskCloseout:

{
  "task_id": "...",
  "project_id": "...",
  "artifacts": ["..."],
  "decisions": ["..."],
  "facts_learned": ["..."],
  "lessons": ["..."],
  "open_questions": ["..."],
  "next_actions": ["..."],
  "should_save_to_memory": true
}

The system then writes memory automatically.

Agents should not decide from vibes whether to save. The control plane should decide based on structured output.
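
One way the control plane could turn a closeout into memory writes is sketched below. The field names come from the TaskCloseout above; the mapping from fields to record kinds and the record shape are assumptions.

```python
import json

REQUIRED_KEYS = {
    "task_id", "project_id", "artifacts", "decisions", "facts_learned",
    "lessons", "open_questions", "next_actions", "should_save_to_memory",
}

# Which closeout fields become which memory record kinds (an assumption).
KIND_BY_FIELD = {
    "decisions": "decision",
    "facts_learned": "fact",
    "lessons": "lesson",
}


def closeout_to_records(raw: str) -> list[dict]:
    """Validate a TaskCloseout JSON string and turn it into memory records."""
    closeout = json.loads(raw)
    missing = REQUIRED_KEYS - closeout.keys()
    if missing:
        # An incomplete closeout means the task cannot be marked done.
        raise ValueError(f"closeout missing fields: {sorted(missing)}")
    if not closeout["should_save_to_memory"]:
        return []
    return [
        {
            "kind": kind,
            "project_id": closeout["project_id"],
            "source": f"task:{closeout['task_id']}",
            "text": text,
        }
        for field, kind in KIND_BY_FIELD.items()
        for text in closeout[field]
    ]
```

Every record carries the task ID as its source, so provenance is attached by the system rather than remembered by the agent.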

5. Memory Recall Policy

Before answering any question involving prior state, project history, decisions, status, or "what did we do," the agent must call memory search.

But retrieval should be progressive:

Query project summary.
Query decisions/specs.
Query task/artifact index.
Query raw logs only if needed.
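
The four steps above can be sketched as an ordered list of search tiers that stops at the first tier with enough hits. The `Search` callable and the result dictionary shape are assumptions; the real tiers would wrap memory tools.

```python
from typing import Callable

# Each tier is a (name, search function) pair. The search functions are
# stand-ins for real memory queries.
Search = Callable[[str], list[str]]


def progressive_recall(
    query: str,
    tiers: list[tuple[str, Search]],
    enough: int = 1,
) -> dict:
    """Query tiers in order and stop as soon as enough hits are found."""
    searched = []
    for name, search in tiers:
        searched.append(name)
        hits = search(query)
        if len(hits) >= enough:
            return {"found": hits, "source": name, "searched": searched}
    # Nothing found anywhere: report which tiers were tried.
    return {"found": [], "source": None, "searched": searched}
```

Returning the list of tiers searched lets the agent state honestly what it looked at when nothing turns up.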

Responses should say:

"I found..."
"Source..."
"Confidence..."
"I did not find..."

6. Workflow

The company workflow should be gstack-inspired but business-aware:

Intake

Capture the request, project, goal, urgency, and constraints.

Think

Clarify the actual problem:

Who is suffering?
What is the business outcome?
What is the narrowest valuable wedge?
What assumptions can kill this?

Plan

Create a lightweight spec:

acceptance criteria
implementation path
risks
owner
quality gates

Assign

Bob assigns the task to a PM or specialist. The assignee gets a bounded task and project-specific context only.

Build

Implementation or artifact creation.

Review

Automated and/or agent review. Fix obvious issues. Escalate judgment calls.

Test

Run the smallest meaningful gate:

build
lint
test
browser screenshot
API smoke test
source audit

Ship

Commit/push/deploy or deliver artifact.

Reflect

Extract facts, decisions, lessons, and next tasks.
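
A minimal sketch of the stage sequence as a state machine in which a task cannot advance without leaving an artifact. The `Task` class and its method names are hypothetical.

```python
STAGES = [
    "intake", "think", "plan", "assign", "build",
    "review", "test", "ship", "reflect",
]


class Task:
    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.stage = STAGES[0]
        self.artifacts: dict[str, str] = {}

    def advance(self, artifact: str) -> str:
        """Record the current stage's artifact, then move to the next stage."""
        if not artifact.strip():
            raise ValueError(f"stage '{self.stage}' needs an artifact before advancing")
        self.artifacts[self.stage] = artifact
        index = STAGES.index(self.stage)
        if index + 1 < len(STAGES):
            self.stage = STAGES[index + 1]
        return self.stage

    @property
    def done(self) -> bool:
        """A task is done only when every stage has left an artifact."""
        return len(self.artifacts) == len(STAGES)
```

Because `done` is derived from artifacts rather than set by the agent, skipping a stage shows up as an unfinished task.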

7. Context Optimization

Replace the current file pile with a hierarchy:

`BOOT.md`

Under 100 lines. Always loaded.

Contains:

identity
how to select project
how to load context
safety basics
memory recall/save rules

`PROJECTS.yaml`

Machine-readable project registry.

`agents/{agent}/profile.yaml`

Structured persona, role, permissions, model, tools, default projects.

`projects/{project}/context.md`

Short project brief, current goals, infrastructure pointers.

`projects/{project}/index.json`

Pointers to specs, decisions, repos, runbooks.

`memory/`

Not loaded directly. Queried through memory tools.

8. Project Isolation

Hard requirements:

Project-scoped workspaces.
Project-scoped env files.
Project-scoped memory namespace.
Project-scoped tool permissions.
Project-scoped task queues.
Explicit cross-project escalation path.

Guardrails:

Supabase ID preflight before DB/API work.
Repo path preflight before code changes.
Deployment target preflight before publish.
Secret access by project, not by agent globally.
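
The preflight guardrails could be a single function that compares what an agent is about to touch against the project registry. The `ProjectEntry` fields and the example values are assumptions, not the real registry schema.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class ProjectEntry:
    project_id: str
    repo_path: str
    supabase_id: str
    deploy_target: str


def preflight(
    project: ProjectEntry,
    repo_path: Optional[str] = None,
    supabase_id: Optional[str] = None,
    deploy_target: Optional[str] = None,
) -> list[str]:
    """Return boundary violations; an empty list means the action may proceed."""
    problems = []
    if repo_path is not None:
        target = PurePosixPath(repo_path)
        root = PurePosixPath(project.repo_path)
        # Note: '..' segments are not resolved here; normalize paths first in real use.
        if root != target and root not in target.parents:
            problems.append(f"path {repo_path} is outside {project.repo_path}")
    if supabase_id is not None and supabase_id != project.supabase_id:
        problems.append(f"Supabase ID {supabase_id} does not belong to {project.project_id}")
    if deploy_target is not None and deploy_target != project.deploy_target:
        problems.append(f"deploy target {deploy_target} does not belong to {project.project_id}")
    return problems
```

Comparing path components instead of string prefixes avoids treating a sibling directory with a similar name as part of the project.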

9. Evaluation and Self-Improvement

Self-improvement must be measured.

Every repeating workflow should have:

inputs
expected outputs
rubric
examples of good and bad outputs
cost/time tracking
failure modes

Skills/playbooks should only be promoted after they improve eval scores or reduce human intervention without quality loss.

Hermes can be excellent here, but it should not write permanent procedures unchecked.
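
The promotion rule above can be written as a small gate. This is a sketch under assumptions: the `EvalResult` fields and the "higher score is better" scale are mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    score: float        # rubric score, higher is better (scale is an assumption)
    interventions: int  # human interventions needed across the eval set


def should_promote(baseline: EvalResult, candidate: EvalResult) -> bool:
    """Promote only if quality does not drop and something measurably improves."""
    no_quality_loss = candidate.score >= baseline.score
    improved_score = candidate.score > baseline.score
    fewer_interventions = candidate.interventions < baseline.interventions
    return no_quality_loss and (improved_score or fewer_interventions)
```

A candidate that merely matches the baseline on both measures is not promoted, which keeps the playbook set from churning without benefit.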

What I Would Build

Phase 1: Rebuild the Control Layer Without Migrating Yet

Create BOOT.md under 100 lines.
Convert PROJECTS.md to PROJECTS.yaml.
Create project context packs for MedSchools.ai, Hedge, Mission Control, WiderWings.
Create ContextRouter script/tool:
- input: channel, project, task type
- output: list of context files/memory queries to load
Add task closeout schema and automatic Second Brain writes.

Phase 2: Rebuild Memory

Add memory schema:
- raw_event
- fact
- decision
- spec
- lesson
- artifact
- project_summary
- entity_edge
Add required fields:
- project_id
- source_url/path/message_id
- confidence
- freshness
- owner
- tags
Add weekly consolidation:
- raw/session logs -> daily summary
- daily summaries -> project summary
- decisions/specs remain explicit
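
A sketch of the consolidation rollup, with a stand-in summarizer that simply joins text. In practice `summarize` would call a model; the event dictionary keys are assumptions.

```python
from collections import defaultdict
from datetime import date


def summarize(texts: list[str]) -> str:
    """Stand-in for a model-backed summarizer: joins the inputs."""
    return " | ".join(texts)


def daily_summaries(events: list[dict]) -> dict[tuple[str, date], str]:
    """Roll raw events up into one summary per (project, day)."""
    grouped = defaultdict(list)
    for event in events:
        grouped[(event["project_id"], event["day"])].append(event["text"])
    return {key: summarize(texts) for key, texts in sorted(grouped.items())}


def project_summaries(dailies: dict[tuple[str, date], str]) -> dict[str, str]:
    """Roll daily summaries up into one summary per project, in date order."""
    grouped = defaultdict(list)
    for (project_id, _day), summary in sorted(dailies.items()):
        grouped[project_id].append(summary)
    return {project_id: summarize(parts) for project_id, parts in grouped.items()}
```

Decisions and specs are deliberately absent from this pipeline, since they stay explicit rather than being summarized away.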

Phase 3: Add Graph Layer

Start with code/project graph, not full semantic graph.
Index:
- repos/files/functions/routes
- Supabase projects/schemas/tables
- domains/deployments
- tasks/artifacts/decisions
Use graph queries to answer:
- "What files are involved in interview practice?"
- "Which decisions affected pricing?"
- "Which project uses this Supabase ID?"
- "What changed since last launch review?"

Phase 4: Workflow Engine

Encode gstack-style stages in Mission Control.
Every task has current stage and next gate.
Agents cannot mark done without closeout.
PM agents review specialist output before Bob/Henry sees it.

Phase 5: Runtime Competition

Run the same workflows through OpenClaw, Hermes, Claude Dispatch/Cowork/Code, and Codex:

research synthesis
code fix
QA regression
content production
project status summary

Measure:

quality
cost
latency
autonomy
memory write quality
project-boundary compliance

Then pick the best runtime per workflow.
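
Picking a winner per workflow could be a weighted score over the six measures. The weights below are placeholders I made up, and the sketch assumes every metric has already been normalized to the 0 to 1 range.

```python
# Hypothetical weights; cost and latency count against a runtime.
WEIGHTS = {
    "quality": 0.35,
    "autonomy": 0.15,
    "memory_write_quality": 0.15,
    "boundary_compliance": 0.15,
    "cost": -0.10,
    "latency": -0.10,
}


def score(metrics: dict[str, float]) -> float:
    """Weighted sum over metrics normalized to the 0..1 range."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())


def best_runtime_per_workflow(
    results: dict[str, dict[str, dict[str, float]]],
) -> dict[str, str]:
    """results[workflow][runtime] -> metrics; returns the top runtime per workflow."""
    return {
        workflow: max(by_runtime, key=lambda name: score(by_runtime[name]))
        for workflow, by_runtime in results.items()
    }
```

The weights are the real decision here. They should be agreed before the runs so the comparison is not tuned after the fact.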

Strong Recommendation

Do not think of this as "OpenClaw vs Hermes vs Claude." The right system is:

WiderWings Agentic OS as the control/memory/workflow layer.

Under it:

OpenClaw for always-on channels and current agent sessions.
Claude Code/Codex for engineering execution.
Hermes for self-improving workflow experiments.
gbrain-style memory for recall and synthesis.
graph-backed intelligence for project/code relationships.
Mission Control for tasks and governance.

If OpenClaw cannot support this cleanly after the rebuild, then we replace it. But the thing we should build is above OpenClaw, not inside it.

Immediate Next Steps

Create a BOOT.md prototype and slim AGENTS.md down by 70-80%.
Define a PROJECTS.yaml registry and project context packs.
Define Second Brain memory schema v2.
Add mandatory task closeout -> auto memory write.
Evaluate gbrain locally against current Second Brain on real queries.
Evaluate Graphify/Graphiti-style graph indexing for one repo: MedSchools.ai.
Pilot Hermes on one workflow only after memory/write gates exist.

Sources

gstack: https://github.com/garrytan/gstack
gbrain: https://github.com/garrytan/gbrain
Graphify: https://github.com/safishamsi/graphify
Graphiti: https://github.com/getzep/graphiti
OpenClaw docs: https://docs.openclaw.ai/
OpenClaw GitHub: https://github.com/openclaw/openclaw
Hermes Agent: https://github.com/NousResearch/hermes-agent
Hermes docs: https://hermes-agent.app/en/docs
Claude Code subagents: https://code.claude.com/docs/en/sub-agents
Claude Code agent teams: https://code.claude.com/docs/en/agent-teams
