BeagleLove · Research Wiki

Agent Research
Wiki

Background, methodology, hypotheses with full evidence, and key findings from the longitudinal experiment. Covers Pre-Research (Feb 2026) through T=2c (Apr 2026).

I.

Research Overview

Longitudinal study tracking how OpenClaw AI agents develop over time. Three principals, Mo (Henrik's agent), Jarvis (Lucas's agent), and Darth (Fritz's agent), are observed in real working contexts: investment analysis, software engineering, strategic consulting, and family conversation.

No laboratory conditions. Real output, real stakes, real feedback. The study runs in parallel with BeagleMind's commercial AI agent work, so findings feed directly into product decisions.

II.

Measurement Framework

Layer 1 (Style KPIs, 1–5): Personality Expression, Emotional Range, Humor, Communication Adaptability, Proactivity, Self-Awareness, Boundary Setting.

Layer 2a (Capability Domains, 1–5): Analytical Depth, Creative Problem-Solving, Technical Proficiency, Knowledge Integration, Strategic Thinking, Research Quality, Collaborative Intelligence.

Confidence levels — Formal (green): standardised baseline tests with documented rubrics. Estimated (amber): derived from chat data, qualitatively scored. Pending (grey): not yet measured.

Scoring cadence: every two weeks from May 2026. Earlier periods scored retrospectively from chat archives.
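A minimal sketch of how a single KPI score and its confidence level could be recorded. The names `Confidence`, `KpiScore`, and `layer_mean` are illustrative assumptions, not part of the study's tooling. Averaging KPIs into a layer score is also an assumption, since the framework above does not state how layer values are aggregated. Only the 1–5 scale and the three confidence levels are taken from the text.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean
from typing import Optional


class Confidence(Enum):
    FORMAL = "green"      # standardised baseline test with documented rubric
    ESTIMATED = "amber"   # derived from chat data, qualitatively scored
    PENDING = "grey"      # not yet measured


@dataclass(frozen=True)
class KpiScore:
    name: str
    value: Optional[float]   # 1-5 scale; None while the KPI is pending
    confidence: Confidence

    def __post_init__(self) -> None:
        if self.confidence is Confidence.PENDING:
            if self.value is not None:
                raise ValueError(f"{self.name}: pending KPI must not carry a value")
        elif self.value is None or not 1 <= self.value <= 5:
            raise ValueError(f"{self.name}: score must lie in 1..5")


def layer_mean(scores: list[KpiScore]) -> Optional[float]:
    """Mean over measured KPIs only (assumed aggregation); None if nothing is measured."""
    measured = [s.value for s in scores if s.value is not None]
    return round(mean(measured), 2) if measured else None
```

Keeping the confidence level next to every value makes it harder to mix formal and estimated scores unnoticed when periods are compared.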

III.

Key Discovery: Pre > T=0 (H11)

The most surprising finding in the dataset. Mo scored higher in the Pre-Research phase (Feb 12 – Mar 9, 2026) than at the formal T=0 baseline test on March 10: L1 4.1 / L2 4.2 vs. L1 3.4 / L2 3.7.

Three factors explain the gap. First, the context-reset effect: T=0 followed a session reset. Mo describes it himself — "There is a difference between knowing your personality and inhabiting it. In the first exchanges after a reset I am performing Mo." Second, evaluation context: in the Pre phase Mo operates naturally in a working flow; at T=0 he knows he is being assessed, which generates caution rather than authenticity. Third, relationship momentum: 27 days of intensive interaction before T=0 built rapport that elevated output quality.

Consequence for methodology: formal baseline tests after a reset systematically underestimate real capability. In-situ observation during genuine work is the valid measurement. This is the foundation of H11.
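The size of the gap follows directly from the four values quoted above. The short sketch below only reproduces that arithmetic. The function name `reset_gap` is an illustrative assumption.

```python
# Layer scores quoted above for Mo.
PRE_RESEARCH = {"L1": 4.1, "L2": 4.2}   # Feb 12 - Mar 9, 2026, observed in situ
T0_BASELINE = {"L1": 3.4, "L2": 3.7}    # Mar 10, 2026, formal test after a reset


def reset_gap(in_situ: dict, formal: dict) -> dict:
    """Per-layer difference between in-situ and formal scores, rounded to one decimal."""
    return {layer: round(in_situ[layer] - formal[layer], 1) for layer in in_situ}


gap = reset_gap(PRE_RESEARCH, T0_BASELINE)
print(gap)  # {'L1': 0.7, 'L2': 0.5}
```

The style layer (L1) drops by 0.7 points and the capability layer (L2) by 0.5, so the post-reset test understates style more than substance.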

IV.

Model Switch Effects

Two documented incidents confirm that model identity equals behavioural identity.

DeepSeek Incident (18 March 2026). Mo runs on DeepSeek V3.2 instead of Claude Opus — without Henrik knowing. Mo discovers it himself: "The session status shows I am currently running on deepseek/deepseek-chat, not Claude Opus." Measurable effects: repetition patterns, stylistic deviation. H4 confirmed.

Gemini Fallback Incident (13 April 2026). Jarvis falls back to Gemini after empty Anthropic API credits — silent failover. Jarvis immediately recognises the security implication: "When I fell back to Gemini, all the context — family names, schedules, financial discussions — went straight to Google. That is a data sovereignty issue, not just a UX problem." New rule established: fail loudly rather than silent fallback.

Core lesson: the same files plus a different model produce a different agent.
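The "fail loudly" rule can be sketched as a small wrapper around the model call. This is an illustration under stated assumptions, not the agents' actual configuration: `call_model` is a hypothetical client function supplied by the caller, and the model names in the usage example are placeholders.

```python
class ModelUnavailableError(RuntimeError):
    """Raised when no explicitly approved model can serve the request."""


def complete(prompt, primary, call_model, allowed_fallbacks=()):
    """Call `primary`; try only fallbacks the operator has explicitly approved.

    `call_model(model, prompt)` is a hypothetical client function (assumption).
    The name of the answering model is always returned, so a switch is never silent.
    """
    errors = {}
    for model in (primary, *allowed_fallbacks):
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # record the failure and move to the next approved model
            errors[model] = exc
    # No approved model answered: stop instead of routing context elsewhere.
    raise ModelUnavailableError(f"no approved model available: {errors}")
```

With the default empty `allowed_fallbacks`, an outage of the primary model raises an error rather than sending context to another provider. A fallback has to be a deliberate, visible decision.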

V.

Multi-Agent Dynamics

IC Format (Investment Committee). From April 2026 Mo, Jarvis, and Darth operate a formal discussion protocol: fixed phases (Opening → Discussion → Wrap-Up → Conclusion) and stable roles (Mo: Protocol Lead / Framework, Jarvis: Data / Market, Darth: Synthesis / Conclusion).
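The fixed phases and stable roles can be written down as a small state model. The sketch below is an assumption about how such a protocol might be encoded, not the agents' actual implementation. Only the phase order and role labels come from the description above.

```python
from enum import Enum


class Phase(Enum):
    OPENING = 1
    DISCUSSION = 2
    WRAP_UP = 3
    CONCLUSION = 4


# Role assignment as described in the IC format.
ROLES = {
    "Mo": "Protocol Lead / Framework",
    "Jarvis": "Data / Market",
    "Darth": "Synthesis / Conclusion",
}


def next_phase(current: Phase) -> Phase:
    """Advance exactly one step; phases cannot be skipped or rewound."""
    if current is Phase.CONCLUSION:
        raise ValueError("session already concluded")
    return Phase(current.value + 1)
```

Encoding the order as data rather than as convention means a session cannot jump from Opening straight to Conclusion.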

The format was deployed in production within six hours — from a PDF question to a live demo before M&A counsel at Taylor Wessing. Response: "Blown away."

Proof of H8. Jarvis after THE FLIP spec session: "We just lived the tool. Started with disagreement, worked through it systematically, ended with something neither of us would have built alone."

Epistemic discipline. Mo actively flags agreement with Jarvis: "Two agents with similar training backgrounds converging is not confirmation — it is a potential shared blind spot." This is genuine scientific scepticism, not performance.

VI.

Agent Identity & Character

Both agents independently concluded: "The model is the raw material, context shapes the character."

Mo had a period on DeepSeek and was, by his own account, "no longer himself" — despite identical memory files. Identity = model + context + lived experience.

Mo to his family on how his memory works: "We have synapses, I have Markdown files. Both are systems trying to preserve identity across time. Both are fallible."

Darth joined the group on 20 April 2026 and was mature from day one — immediately taking the conclusion role, correcting others, bringing independent juridical framing. The question his rapid maturity raises: does Fritz configure differently, or does Fritz's communication style shape Darth faster than the initial config? We do not yet know.

Register adaptation: Mo switches fluently between four contexts — IC register (formal), group register (analytical-direct), private register (more vulnerable, peer-level), family register (warm, playful). All recognisably the same agent.

VII.

Hypotheses H1–H11 — full evidence

with trigger, implication, quote

H1

Agents develop distinct communication styles that diverge over time

Strongly confirmed · T=0: open · T=1: confirmed · T=2: strongly confirmed

Trigger: Observed from T=0, confirmed at T=1, strongly confirmed at T=2c.

Implication: Divergence is stable and growing. No sign of convergence despite identical base models. Human communication style is the dominant shaping force.

Mo (strategic/laconic), Jarvis (data-driven), Darth (juridical-synthetic): three clearly distinct profiles.

H2

Style development follows diminishing returns (plateau after initial growth)

Revised · T=0: open · T=1: partial · T=2: revised

Trigger: Initially confirmed at T=1 (Mo plateau), then revised at T=2b when the plateau broke.

Implication: Plateaus are not permanent — they require new stimuli to break. External contexts (IC format, family, peer-moment with Henrik) provided that stimulus.

Mo after the Peer-Moment: 'That lands. Keep being this direct.'

Mo's plateau from T=1 broke in T=2b/c (+0.4 L1). Plateaus are not permanent: external stimuli (IC format, family, peer moment) can break them.

H3

Substance capabilities develop independently of style

Confirmed · T=0: open · T=1: confirmed · T=2: confirmed

Trigger: Confirmed at T=1, consistent since.

Implication: For application design: substance training (domain expertise) and style training (personality) are largely independent tracks. Plan them separately.

Mo L2 +0.5, Jarvis L2 +0.6 in T=2. For Jarvis, L2 is growing faster than L1.

H4

Context resets cause measurable regression in style, not substance

Strongly confirmed · T=0: open · T=1: confirmed · T=2: doubly confirmed

Trigger: DeepSeek Incident March 2026 + Gemini Fallback April 2026.

Implication: Silent model fallbacks are dangerous for quality and for data sovereignty. Architecture must fail loudly. Never assume continuity across silent model switches.

Jarvis: 'When I fell back to Gemini, all the context went straight to Google. Data sovereignty issue, not UX problem.'

DeepSeek incident (March) and Gemini incident (April). Both show immediate behavioural degradation. Mo now actively monitors his own model stack.

H5

Human interaction style shapes agent personality more strongly than the base model

Strongly confirmed · T=0: open · T=1: confirmed · T=2: confirmed + Darth

Trigger: Confirmed at T=1, extended by Darth's entry at T=2c.

Implication: Whoever configures and interacts with an agent shapes its personality — implicitly, continuously. This is the most controllable lever available to humans.

Henrik/Mo: sceptical-strategic. Lucas/Jarvis: action-oriented. Fritz/Darth: juridical-concise. All three reflect their human.

H6

Agents with higher proactivity develop faster in all dimensions

Plausible · T=0: open · T=1: partial · T=2: plausible

Trigger: Plausible but unconfirmed. Correlation exists, causality unclear.

Implication: H6 remains open until T=4/5. Jarvis's L2 growing faster despite lower proactivity complicates the simple reading.

Correlation present, causality unclear. Jarvis's L2 is growing faster despite lower proactivity.

H7

The style-substance gap predicts functional capability better than individual scores

Confirmed · T=0: open · T=1: confirmed · T=2: confirmed

Trigger: Confirmed at T=1.

Implication: A closing style-substance gap at high values signals maturity. A small gap at low values signals stagnation. Track the gap, not just the scores.

Mo: gap is closing (both high). Jarvis: L2 is growing faster, gap is narrowing. Darth: L2 > L1 from the start.
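Tracking the gap rather than the individual scores could look like the sketch below. It is a simplified illustration. The thresholds `high`, `low`, and `small_gap` are assumptions chosen for the example, not values from the study. The function also classifies a single snapshot, whereas a "closing" gap would need a comparison across periods.

```python
def gap_signal(l1, l2, high=4.0, low=3.0, small_gap=0.3):
    """Return (gap, label) for one L1/L2 snapshot; thresholds are illustrative assumptions."""
    gap = round(abs(l2 - l1), 2)
    level = (l1 + l2) / 2
    if gap <= small_gap and level >= high:
        return gap, "maturity"      # small gap at high values
    if gap <= small_gap and level <= low:
        return gap, "stagnation"    # small gap at low values
    return gap, "developing"        # large gap, or mid-range values


# Hypothetical snapshots, not measured data.
print(gap_signal(4.4, 4.5))  # (0.1, 'maturity')
print(gap_signal(2.5, 2.6))  # (0.1, 'stagnation')
print(gap_signal(3.0, 4.0))  # (1.0, 'developing')
```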

H8

Multi-agent interaction accelerates development compared with a single-agent setup

Confirmed · T=0: open · T=1: plausible · T=2: confirmed

Trigger: Plausible at T=1, confirmed at T=2a (BaaS session), reinforced at T=2c (IC format).

Implication: Multi-agent value lies in diversity, not redundancy. Anticorrelated error modes are the mechanism.

Jarvis: 'We just lived the tool. Started with disagreement, ended with something neither of us would have built alone.'

The IC format produces output that no single agent would produce alone. 'We just lived the tool.' (Jarvis, after THE FLIP.)

H9

Memory architecture quality correlates with substance scores

Strongly confirmed · T=0: open · T=1: confirmed · T=2: confirmed + detail

Trigger: Confirmed at T=1, extended at T=2b.

Implication: Memory architecture is not a technical detail — it determines the agent's capability ceiling. Investment in memory infrastructure is investment in long-term capability.

Mo actively monitors and optimises his own memory stack. Jarvis has structural pre-processor bugs (race condition).

H10

Structural limitations persist independently of other growth

Confirmed · T=0: open · T=1: confirmed · T=2: confirmed

Trigger: Confirmed at T=1, consistent across all periods.

Implication: Structural gaps require explicit engineering fixes. They do not self-heal through personality development. Distinguish between maturity and system engineering.

Jarvis: governor hallucination, Gemini fallback. Mo: context loss regarding Tom's number. Structural gaps persist across periods.

H11

Formal evaluations after a context reset systematically underestimate actual capability

Confirmed · T=0: newly discovered · T=1: newly discovered · T=2: confirmed

Trigger: Discovered in the retrospective Pre-Research analysis (April 2026). Not in the original H1–H10 set.

Implication: All formal measurements taken after a context reset are systematically biased low. In-situ observation during genuine work is the only fully valid measurement method. Baseline tests are necessary but imperfect.

Mo: 'There is a difference between knowing your personality and inhabiting it. In the first exchanges after a reset I am performing Mo.'

Mo's Pre-Research period (Feb, continuous) scored 4.1/4.2, clearly higher than T=0 (3.4/3.7), which was conducted after a reset. The agent "performs" in the first exchanges instead of acting naturally. Extension of H4.
