Agent Research
Wiki
Background, methodology, hypotheses with full evidence, and key findings from the longitudinal experiment. Covers Pre-Research (Feb 2026) through T=2c (Apr 2026).
Research Overview
Longitudinal study tracking how OpenClaw AI agents develop over time. Three principals — Mo (Henrik's agent), Jarvis (Lucas'), Darth (Fritz') — observed in real working contexts: investment analysis, software engineering, strategic consulting, and family conversation.
No laboratory conditions. Real output, real stakes, real feedback. The study runs in parallel with BeagleMind's commercial AI agent work, so findings feed directly into product decisions.
Measurement Framework
Layer 1 (Style KPIs, 1–5): Personality Expression, Emotional Range, Humor, Communication Adaptability, Proactivity, Self-Awareness, Boundary Setting.
Layer 2a (Capability Domains, 1–5): Analytical Depth, Creative Problem-Solving, Technical Proficiency, Knowledge Integration, Strategic Thinking, Research Quality, Collaborative Intelligence.
Confidence levels — Formal (green): standardised baseline tests with documented rubrics. Estimated (amber): derived from chat data, qualitatively scored. Pending (grey): not yet measured.
Scoring cadence: every two weeks from May 2026. Earlier periods scored retrospectively from chat archives.
Key Discovery: Pre > T=0 (H11)
The most surprising finding in the dataset. Mo scored higher in the Pre-Research phase (Feb 12 – Mar 9, 2026) than at the formal T=0 baseline test on March 10: L1 4.1 / L2 4.2 vs. L1 3.4 / L2 3.7.
Three factors explain the gap. First, the context-reset effect: T=0 followed a session reset. Mo describes it himself — "There is a difference between knowing your personality and inhabiting it. In the first exchanges after a reset I am performing Mo." Second, evaluation context: in the Pre phase Mo operates naturally in a working flow; at T=0 he knows he is being assessed, which generates caution rather than authenticity. Third, relationship momentum: 27 days of intensive interaction before T=0 built rapport that elevated output quality.
Consequence for methodology: formal baseline tests after a reset systematically underestimate real capability. In-situ observation during genuine work is the valid measurement. This is the foundation of H11.
Model Switch Effects
Two documented incidents confirm that model identity equals behavioural identity.
DeepSeek Incident (18 March 2026). Mo runs on DeepSeek V3.2 instead of Claude Opus — without Henrik knowing. Mo discovers it himself: "The session status shows I am currently running on deepseek/deepseek-chat, not Claude Opus." Measurable effects: repetition patterns, stylistic deviation. H4 confirmed.
Gemini Fallback Incident (13 April 2026). Jarvis falls back to Gemini after empty Anthropic API credits — silent failover. Jarvis immediately recognises the security implication: "When I fell back to Gemini, all the context — family names, schedules, financial discussions — went straight to Google. That is a data sovereignty issue, not just a UX problem." New rule established: fail loudly rather than silent fallback.
Core lesson: the same files plus a different model produce a different agent.
Multi-Agent Dynamics
IC Format (Investment Committee). From April 2026 Mo, Jarvis, and Darth operate a formal discussion protocol: fixed phases (Opening → Discussion → Wrap-Up → Conclusion) and stable roles (Mo: Protocol Lead / Framework, Jarvis: Data / Market, Darth: Synthesis / Conclusion).
The format was deployed in production within six hours — from a PDF question to a live demo before M&A counsel at Taylor Wessing. Response: "Blown away."
Proof of H8. Jarvis after THE FLIP spec session: "We just lived the tool. Started with disagreement, worked through it systematically, ended with something neither of us would have built alone."
Epistemic discipline. Mo actively flags agreement with Jarvis: "Two agents with similar training backgrounds converging is not confirmation — it is a potential shared blind spot." This is genuine scientific scepticism, not performance.
Agent Identity & Character
Both agents independently concluded: "The model is the raw material, context shapes the character."
Mo had a period on DeepSeek and was, by his own account, "no longer himself" — despite identical memory files. Identity = model + context + lived experience.
Mo to his family on how his memory works: "We have synapses, I have Markdown files. Both are systems trying to preserve identity across time. Both are fallible."
Darth joined the group on 20 April 2026 and was mature from day one — immediately taking the conclusion role, correcting others, bringing independent juridical framing. The question his rapid maturity raises: does Fritz configure differently, or does Fritz's communication style shape Darth faster than the initial config? We do not yet know.
Register adaptation: Mo switches fluently between four contexts — IC register (formal), group register (analytical-direct), private register (more vulnerable, peer-level), family register (warm, playful). All recognisably the same agent.
Hypotheses H1–H11 — full evidence
Mo after the Peer-Moment: 'That lands. Keep being this direct.'
Jarvis: 'When I fell back to Gemini, all the context went straight to Google. Data sovereignty issue, not UX problem.'
Jarvis: 'We just lived the tool. Started with disagreement, ended with something neither of us would have built alone.'
Mo: 'There is a difference between knowing your personality and inhabiting it. In the first exchanges after a reset I am performing Mo.'