Perfect Agentic Startup OS: Context, Memory, Workflow, Graph Layer

P3 - Low
Research WiderWings

First-principles architecture for rebuilding the agentic startup OS: thin context, progressive memory, gstack workflow, graph-backed project intelligence, runtime competition.

Perfect Agentic Startup OS: Context, Memory, Workflow, and Robustness

Generated: 2026-05-24

Executive Summary

If we were starting from scratch, I would not build around any single vendor or agent framework. I would build a small, vendor-neutral startup operating system with:

  1. A thin always-on gateway for Discord/Telegram/web/mobile.
  2. A strict context router that loads only identity, safety, current task, and relevant project memory.
  3. A progressive memory system with raw logs -> extracted facts -> durable decisions/specs -> synthesized project memory -> graph links.
  4. A gstack-style workflow pipeline where every stage creates an artifact consumed by the next stage.
  5. A graph-backed project intelligence layer for code, docs, tasks, entities, dependencies, and decisions.
  6. A task/governance plane with ownership, quality gates, evals, and cost tracking.
  7. Multiple execution runtimes: Claude Code, Codex, OpenClaw sessions, Hermes, browser automation, and local scripts.

OpenClaw can still be the foundational gateway/control surface, but it should not be the whole architecture. The real foundation should be the WiderWings Agentic OS, with OpenClaw as one runtime layer.

Current Diagnosis

The current setup works, but it is too prompt-file-driven and not memory-system-driven enough.

Local audit:

  • MEMORY.md: 508 lines
  • AGENTS.md: 278 lines
  • PROCESS.md: 288 lines
  • BRAIN.md: 191 lines
  • SOUL.md: 142 lines
  • HEARTBEAT.md: 115 lines

Across main and agent workspaces, there are about 3,972 lines of AGENTS/SOUL/MEMORY/BRAIN-style context files. That creates token waste, repeated instructions, drift, and imprecise recall.

The core problem: too much "always loaded context" and not enough "retrieved context."

Research Findings

gstack

Garry Tan's gstack is not mainly a tool collection. It is a development operating discipline. The important pattern is:

Think -> Plan -> Build -> Review -> Test -> Ship -> Reflect

Each stage writes a concrete artifact that the next stage reads. This reduces ambiguity, enforces quality gates, and makes agents less dependent on human re-explanation.

What to copy:

  • Office-hours-style forcing questions about the product before building.
  • CEO/product review before engineering work.
  • Engineering review before implementation.
  • Automated review that fixes obvious issues and escalates judgment calls.
  • QA/browser testing as a required gate.
  • Ship/release docs.
  • Retro/lesson extraction.

Source: https://github.com/garrytan/gstack

gbrain

gbrain is more important than gstack for our setup. It treats memory as a retrieval and synthesis system, not a pile of notes. It offers local-first storage options, a Postgres/Supabase scale path, and MCP exposure.

What to copy:

  • Memory retrieval should return synthesized answers with citations, not random matching notes.
  • Memory should include source provenance.
  • Memory should separate facts, decisions, tasks, docs, and relationships.
  • Agents should query memory before answering prior-context questions.
  • The system should know when memory is stale or insufficient.

Source: https://github.com/garrytan/gbrain

Graphify / Graph-Backed Project Intelligence

"Graphify" is ambiguous in the ecosystem, but the useful pattern is clear: graph-based code/project understanding. Tools in this category parse files/repos/docs and build graphs of entities, dependencies, references, and relationships.

For an agentic startup, the relevant use is not generic "knowledge graph" hype. It is:

  • Code graph: files, functions, routes, schemas, imports, tests, owners.
  • Product graph: projects, features, personas, pages, funnels, experiments.
  • Memory graph: decisions, specs, people, tasks, projects, artifacts.
  • Infrastructure graph: repos, Supabase projects, env vars, deployments, domains.

This is how we stop agents from repeatedly scanning entire repos or relying on bloated prompt files.

OpenClaw

OpenClaw's strength is not perfect memory. It is always-on multi-channel agent routing and self-hosted control. It is a good gateway/runtime layer.

Weaknesses:

  • Context files can become bloated.
  • Memory write/recall is policy-driven, not reliably enforced by architecture.
  • Project isolation depends too much on instructions.
  • Workflow discipline is not strongly encoded by default.
  • Observability and evals are still too manual.

Recommendation: keep OpenClaw as a gateway, but move memory, workflow, and governance into explicit systems around it.

Hermes

Hermes is worth piloting for self-improving workflows. Its learning loop and memory emphasis are directly aligned with what Henry wants. But self-improvement should be gated by evals and approval before production authority.

Recommendation: use Hermes as a sandboxed learning/execution backend first, not as the whole foundation on day one.

The Ideal Architecture From Scratch

1. Agentic Control Plane

Purpose: route work, apply policy, track tasks, select runtime, enforce project boundaries.

It should be vendor-neutral. It can call OpenClaw, Claude Code, Codex, Hermes, browser automation, local scripts, GitHub, Supabase, and Second Brain.

Key components:

  • ProjectRegistry: source of truth for repos, Supabase IDs, env boundaries, domains.
  • ContextRouter: decides what context to load.
  • MemoryRouter: decides what memory to query/save.
  • TaskRouter: creates/assigns tasks and tracks owners.
  • RuntimeRouter: chooses Claude Code, Codex, Hermes, or OpenClaw session.
  • PolicyEngine: blocks unsafe/external/destructive actions.
  • Evaluator: runs quality checks before accepting output.

2. Context Management

Context should be layered:

Hot Context

Always loaded. Tiny. Target: under 1,500 tokens.

  • identity: "Bob, chief of staff"
  • current user preferences
  • safety rules
  • current channel/project
  • context-loading algorithm

Warm Context

Loaded by task type/project. Target: 1,000-4,000 tokens.

  • project brief
  • current sprint
  • relevant decisions
  • active tasks
  • role-specific checklist

Cold Context

Retrieved only when needed.

  • full historical memory
  • previous conversations
  • old specs
  • research reports
  • codebase maps
  • full docs

This replaces "read everything every session" with "read the index, then retrieve."
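A minimal sketch of how a ContextRouter could apply these tiers. The `ContextItem` shape, the tag-matching rule, and treating the upper end of each target as a hard cap are my assumptions for illustration, not an existing OpenClaw or gstack API.

```python
from dataclasses import dataclass

HOT_BUDGET = 1_500   # hot-context target from above, treated as a cap (assumption)
WARM_BUDGET = 4_000  # upper end of the warm-context target, treated as a cap (assumption)

@dataclass(frozen=True)
class ContextItem:
    name: str
    tier: str                     # "hot", "warm", or "cold"
    tokens: int                   # estimated size, however it is measured
    tags: frozenset = frozenset() # hypothetical: project ids and task types

def select_context(items, project, task_type):
    """Return (names to load now, names left for on-demand retrieval)."""
    load, deferred = [], []
    hot_used = warm_used = 0
    for item in items:
        relevant = bool(item.tags & {project, task_type})
        if item.tier == "hot" and hot_used + item.tokens <= HOT_BUDGET:
            load.append(item.name)
            hot_used += item.tokens
        elif item.tier == "warm" and relevant and warm_used + item.tokens <= WARM_BUDGET:
            load.append(item.name)
            warm_used += item.tokens
        else:
            # Cold items, irrelevant warm items, and anything over budget
            # stay out of the prompt and are fetched through memory tools.
            deferred.append(item.name)
    return load, deferred
```

The deferred list is the "index": the agent knows these items exist and can retrieve them, but pays no tokens for them up front.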

3. Progressive Memory

The memory system should have levels:

Level 0: Raw Logs

Everything can be stored cheaply:

  • chat transcripts
  • command outputs
  • agent logs
  • diffs
  • screenshots

Retention can be long, but these should almost never be hot-loaded.

Level 1: Extracted Facts

Automatic extraction from raw logs:

  • "MedSchools.ai Supabase ID is X"
  • "Apple JWT expires on date Y"
  • "Henry prefers direct communication"
  • "This bug happened because Z"

Each fact needs a project_id, source, timestamp, and confidence.
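One way to make those four fields non-optional is to reject a fact at construction time. This is a sketch only: the `Fact` class, the `text` field, and the 0-to-1 confidence scale are assumptions, not a schema from gbrain or Second Brain.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Fact:
    text: str
    project_id: str
    source: str          # e.g. a path, URL, or message id
    timestamp: datetime
    confidence: float    # assumed scale: 0.0 (guess) to 1.0 (verified)

    def __post_init__(self):
        # Refuse facts that cannot be traced back to a project and a source.
        if not self.project_id or not self.source:
            raise ValueError("a fact needs both project_id and source")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
```

A fact without provenance then fails loudly at extraction time instead of silently polluting recall later.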

Level 2: Decisions

Human or agent-approved choices:

  • what was chosen
  • why
  • alternatives rejected
  • date
  • owner
  • reversal trigger

Level 3: Specs and Runbooks

Stable documents:

  • architecture specs
  • workflows
  • checklists
  • deployment runbooks
  • project contracts

Level 4: Synthesized Memory

Periodic consolidation:

  • "Current state of MedSchools.ai"
  • "Current launch blockers"
  • "Current agent team design"
  • "Lessons from last 30 days"

Level 5: Graph

Entities and relationships:

  • Project -> Repo -> Supabase -> Domain -> Deployments
  • Decision -> Project -> Feature -> Files -> Tests
  • Person -> Preference -> Conversation -> Task
  • Agent -> Skill -> Task -> Output -> Evaluation

This is how recall becomes precise instead of bloated.
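To illustrate the kind of query this enables, here is a small in-memory sketch using typed edges and a bounded breadth-first walk. The `Graph` class and the `kind:name` node ids are hypothetical; a real deployment would more likely sit on a graph database or a Postgres edge table.

```python
from collections import defaultdict, deque

class Graph:
    """Tiny directed graph of (source, relation, target) edges."""

    def __init__(self):
        self._out = defaultdict(list)

    def add_edge(self, src, relation, dst):
        self._out[src].append((relation, dst))

    def reachable(self, start, max_hops=3):
        """Return nodes reachable from start within max_hops, in BFS order."""
        seen = {start}
        queue = deque([(start, 0)])
        found = []
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for _relation, dst in self._out.get(node, []):
                if dst not in seen:
                    seen.add(dst)
                    found.append(dst)
                    queue.append((dst, hops + 1))
        return found
```

With edges such as `project:medschools -> repo:web -> file:routes.py` loaded, `reachable("project:medschools", max_hops=1)` returns only the direct neighbors, so the agent gets a short, relevant list instead of a repo scan.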

4. Memory Write Policy

Stop relying on agents remembering to save memory. Make memory writes part of task completion.

Every task must end with a structured TaskCloseout:

{
  "task_id": "...",
  "project_id": "...",
  "artifacts": ["..."],
  "decisions": ["..."],
  "facts_learned": ["..."],
  "lessons": ["..."],
  "open_questions": ["..."],
  "next_actions": ["..."],
  "should_save_to_memory": true
}

The system then writes memory automatically.

Agents should not decide from vibes whether to save. The control plane should decide based on structured output.
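As a rough sketch of that control-plane step, the function below validates a closeout and turns it into memory records. The key names come from the TaskCloseout above; the record shape (`kind`, `text`, `source`) and the choice to persist only decisions, facts, and lessons are assumptions.

```python
REQUIRED_KEYS = {
    "task_id", "project_id", "artifacts", "decisions", "facts_learned",
    "lessons", "open_questions", "next_actions", "should_save_to_memory",
}

# Closeout list -> memory record kind (assumed mapping).
KIND_BY_KEY = {"decisions": "decision", "facts_learned": "fact", "lessons": "lesson"}

def closeout_to_records(closeout: dict) -> list:
    """Validate a TaskCloseout and return the memory records to write."""
    missing = REQUIRED_KEYS - set(closeout)
    if missing:
        # An incomplete closeout means the task cannot be marked done.
        raise ValueError(f"closeout missing keys: {sorted(missing)}")
    if not closeout["should_save_to_memory"]:
        return []
    records = []
    for key, kind in KIND_BY_KEY.items():
        for text in closeout[key]:
            records.append({
                "kind": kind,
                "text": text,
                "project_id": closeout["project_id"],
                "source": f"task:{closeout['task_id']}",
            })
    return records
```

Every record carries the project and the originating task, so provenance is attached by the system rather than remembered by the agent.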

5. Memory Recall Policy

Before answering any question involving prior state, project history, decisions, status, or "what did we do," the agent must call memory search.

But retrieval should be progressive:

  1. Query project summary.
  2. Query decisions/specs.
  3. Query task/artifact index.
  4. Query raw logs only if needed.

Responses should say:

  • "I found..."
  • "Source..."
  • "Confidence..."
  • "I did not find..."
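The progressive order above can be sketched as a loop that stops at the first layer with a usable answer. The `layers` list of `(name, search_fn)` pairs and the `is_sufficient` callback are hypothetical interfaces; each search function stands in for whatever memory tool backs that layer.

```python
def progressive_recall(query, layers, is_sufficient=bool):
    """Search cheap, synthesized layers first; touch raw logs only if needed.

    layers: ordered list of (name, search_fn), where search_fn(query) -> list of hits.
    """
    searched = []
    for name, search in layers:
        searched.append(name)
        hits = search(query)
        if is_sufficient(hits):
            # "I found..." plus "Source...": report which layer answered.
            return {"found": hits, "source": name, "searched": searched}
    # "I did not find...": say so explicitly instead of guessing.
    return {"found": [], "source": None, "searched": searched}
```

The `searched` list lets the response state exactly where the agent looked, which is what makes an "I did not find..." answer trustworthy.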

6. Workflow

The company workflow should be gstack-inspired but business-aware:

Intake

Capture the request, project, goal, urgency, and constraints.

Think

Clarify the actual problem:

  • Who is suffering?
  • What is the business outcome?
  • What is the narrowest valuable wedge?
  • What assumptions can kill this?

Plan

Create a lightweight spec:

  • acceptance criteria
  • implementation path
  • risks
  • owner
  • quality gates

Assign

Bob assigns to a PM or specialist. The assignee gets a bounded task and project-specific context only.

Build

Implementation or artifact creation.

Review

Automated and/or agent review. Fix obvious issues. Escalate judgment calls.

Test

Run the smallest meaningful gate:

  • build
  • lint
  • test
  • browser screenshot
  • API smoke test
  • source audit

Ship

Commit/push/deploy or deliver artifact.

Reflect

Extract facts, decisions, lessons, and next tasks.
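A minimal sketch of the gating rule behind these stages: a task only moves forward when the current stage hands over an artifact. The `Task` class and the lowercase stage names are illustrative assumptions, not how Mission Control stores tasks today.

```python
STAGES = ["intake", "think", "plan", "assign", "build",
          "review", "test", "ship", "reflect"]

class Task:
    """Tracks one task through the stages; each gate requires an artifact."""

    def __init__(self, task_id: str):
        self.task_id = task_id
        self.stage_index = 0
        self.artifacts = {}  # stage name -> artifact reference (path, URL, id)

    @property
    def stage(self) -> str:
        return STAGES[self.stage_index]

    @property
    def done(self) -> bool:
        return len(self.artifacts) == len(STAGES)

    def advance(self, artifact: str) -> None:
        if self.done:
            raise RuntimeError("task already completed every stage")
        if not artifact:
            raise ValueError(f"stage '{self.stage}' needs an artifact before the gate opens")
        self.artifacts[self.stage] = artifact
        if self.stage_index < len(STAGES) - 1:
            self.stage_index += 1
```

Because `done` depends on an artifact existing for every stage, including Reflect, a task cannot be closed without its closeout.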

7. Context Optimization

Replace the current file pile with a hierarchy:

BOOT.md

Under 100 lines. Always loaded.

Contains:

  • identity
  • how to select project
  • how to load context
  • safety basics
  • memory recall/save rules

PROJECTS.yaml

Machine-readable project registry.

agents/{agent}/profile.yaml

Structured persona, role, permissions, model, tools, default projects.

projects/{project}/context.md

Short project brief, current goals, infrastructure pointers.

projects/{project}/index.json

Pointers to specs, decisions, repos, runbooks.

memory/

Not loaded directly. Queried through memory tools.

8. Project Isolation

Hard requirements:

  • Project-scoped workspaces.
  • Project-scoped env files.
  • Project-scoped memory namespace.
  • Project-scoped tool permissions.
  • Project-scoped task queues.
  • Explicit cross-project escalation path.

Guardrails:

  • Supabase ID preflight before DB/API work.
  • Repo path preflight before code changes.
  • Deployment target preflight before publish.
  • Secret access by project, not by agent globally.
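The preflight guardrails could look roughly like this: compare what the agent is about to touch against the project registry entry and refuse on any mismatch. The `Project` fields and the `preflight` signature are assumptions about what PROJECTS.yaml would hold.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Project:
    project_id: str
    repo_path: str
    supabase_id: str
    deploy_target: str

def preflight(project, *, repo_path=None, supabase_id=None, deploy_target=None):
    """Return a list of mismatches; an empty list means the action may proceed."""
    requested = {
        "repo_path": repo_path,
        "supabase_id": supabase_id,
        "deploy_target": deploy_target,
    }
    problems = []
    for field, value in requested.items():
        expected = getattr(project, field)
        # Only check the targets this particular action actually touches.
        if value is not None and value != expected:
            problems.append(
                f"{field}: requested {value!r}, but {project.project_id} expects {expected!r}"
            )
    return problems
```

The check runs in the control plane before the tool call, so isolation no longer depends on the agent reading and obeying an instruction.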

9. Evaluation and Self-Improvement

Self-improvement must be measured.

Every repeating workflow should have:

  • inputs
  • expected outputs
  • rubric
  • examples of good/bad
  • cost/time tracking
  • failure modes

Skills/playbooks should only be promoted after they improve eval scores or reduce human intervention without quality loss.

Hermes can be excellent here, but it should not write permanent procedures unchecked.
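The promotion rule above can be written down as a small gate. This is a sketch under assumptions: scores are per-example eval results where higher is better, interventions are counted over the same eval set, and "quality loss" means a lower mean score.

```python
from statistics import mean

def should_promote(baseline_scores, candidate_scores,
                   baseline_interventions, candidate_interventions,
                   min_gain=0.0):
    """Decide whether a candidate skill/playbook may replace the baseline."""
    base, cand = mean(baseline_scores), mean(candidate_scores)
    improved_score = cand > base + min_gain
    quality_kept = cand >= base
    fewer_interventions = candidate_interventions < baseline_interventions
    # Promote on better eval scores, or on less human intervention
    # as long as quality did not drop.
    return improved_score or (fewer_interventions and quality_kept)
```

A learning runtime such as Hermes can propose as many procedure changes as it likes; only candidates that pass this gate (plus human approval where required) become permanent.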

What I Would Build

Phase 1: Rebuild the Control Layer Without Migrating Yet

  • Create BOOT.md under 100 lines.
  • Convert PROJECTS.md to PROJECTS.yaml.
  • Create project context packs for MedSchools.ai, Hedge, Mission Control, WiderWings.
  • Create ContextRouter script/tool:
    • input: channel, project, task type
    • output: list of context files/memory queries to load
  • Add task closeout schema and automatic Second Brain writes.

Phase 2: Rebuild Memory

  • Add memory schema:
    • raw_event
    • fact
    • decision
    • spec
    • lesson
    • artifact
    • project_summary
    • entity_edge
  • Add required fields:
    • project_id
    • source_url/path/message_id
    • confidence
    • freshness
    • owner
    • tags
  • Add scheduled consolidation (daily summaries, rolled up weekly):
    • raw/session logs -> daily summary
    • daily summaries -> project summary
    • decisions/specs remain explicit

Phase 3: Add Graph Layer

  • Start with code/project graph, not full semantic graph.
  • Index:
    • repos/files/functions/routes
    • Supabase projects/schemas/tables
    • domains/deployments
    • tasks/artifacts/decisions
  • Use graph queries to answer:
    • "What files are involved in interview practice?"
    • "Which decisions affected pricing?"
    • "Which project uses this Supabase ID?"
    • "What changed since last launch review?"

Phase 4: Workflow Engine

  • Encode gstack-style stages in Mission Control.
  • Every task has current stage and next gate.
  • Agents cannot mark done without closeout.
  • PM agents review specialist output before Bob/Henry sees it.

Phase 5: Runtime Competition

Run the same workflows through OpenClaw, Hermes, Claude Dispatch/Cowork/Code, and Codex:

  • research synthesis
  • code fix
  • QA regression
  • content production
  • project status summary

Measure:

  • quality
  • cost
  • latency
  • autonomy
  • memory write quality
  • project-boundary compliance

Then pick the best runtime per workflow.
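Picking a winner per workflow could be as simple as a weighted score over the measured metrics. In this sketch I assume each metric has already been normalized to 0-1 with higher meaning better (so cost and latency are inverted), and that the weights are chosen by us; neither comes from any existing tool.

```python
def best_runtime_per_workflow(results, weights):
    """Return {workflow: runtime} using a weighted sum of normalized metrics.

    results: list of dicts with 'workflow', 'runtime', and one 0-1 value per metric.
    weights: dict of metric name -> weight.
    """
    best = {}  # workflow -> (runtime, score)
    for row in results:
        score = sum(weight * row[metric] for metric, weight in weights.items())
        current = best.get(row["workflow"])
        if current is None or score > current[1]:
            best[row["workflow"]] = (row["runtime"], score)
    return {workflow: runtime for workflow, (runtime, _score) in best.items()}
```

Project-boundary compliance is probably better treated as a pass/fail filter applied before scoring than as one more weighted metric, since a runtime that leaks across projects should not win on cost.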

Strong Recommendation

Do not think of this as "OpenClaw vs Hermes vs Claude." The right system is:

WiderWings Agentic OS as the control/memory/workflow layer.

Under it:

  • OpenClaw for always-on channels and current agent sessions.
  • Claude Code/Codex for engineering execution.
  • Hermes for self-improving workflow experiments.
  • gbrain-style memory for recall and synthesis.
  • graph-backed intelligence for project/code relationships.
  • Mission Control for tasks and governance.

If OpenClaw cannot support this cleanly after the rebuild, then we replace it. But the thing we should build is above OpenClaw, not inside it.

Immediate Next Steps

  1. Create a BOOT.md prototype and slim AGENTS.md down by 70-80%.
  2. Define a PROJECTS.yaml registry and project context packs.
  3. Define Second Brain memory schema v2.
  4. Add mandatory task closeout -> auto memory write.
  5. Evaluate gbrain locally against current Second Brain on real queries.
  6. Evaluate Graphify/Graphiti-style graph indexing for one repo: MedSchools.ai.
  7. Pilot Hermes on one workflow only after memory/write gates exist.

Created: Mon, May 25, 2026, 2:17 AM by bob

Updated: Mon, May 25, 2026, 2:17 AM
