Methodology · v3.3

personality-led, 8-dim, 3-spine + trajectory + sentiment + identity markers. Scored 2026-06-07 on 18 (agent × period) observations with n ≥ 50 messages.

Framing: Scores represent distinctiveness along the cohort's observed range — characterful, not better. A 9 means 90% along cohort range, not 'higher quality'.

Reading the 0–10 score: each dimension score is a relative position within the cohort's observed range (0 = cohort min, 5 = midpoint of the range, 10 = cohort max). As of v3.1 the range (the “ruler”) is frozen, so a closed period's score no longer drifts between runs and is comparable across rounds, and each score carries a 95% confidence band. The score stays cohort-relative (not absolute quality); the run-invariant relations below and the absolute-axis signals on the Evolution page remain an independent cross-check.

Measurement rigor v3.3

v3.1 added a frozen anchor ruler (closed-period scores no longer drift run-to-run), 95% confidence intervals on every score, significance-gated inflections, split-half reliability, and an LLM convergent-validity check. v3.2 enriched the Conviction / Playfulness / Epistemic lexicons — raising their reliability — and added a cohort-relative ranking validity probe. v3.3 volume-weights the cooperation score: Orchestration is a per-message coordination density, so it is scaled by participation w = n/(n+300) — a low-volume agent can no longer top the cooperation ranking on a few densely-coordinative messages. The unweighted density stays visible.
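The v3.3 volume weighting can be illustrated with a minimal sketch. The formula w = n/(n+300) comes from the text above; the function names, the example densities, and the choice to apply the weight directly to the raw density are assumptions made for illustration.

```python
def participation_weight(n_messages: int, k: float = 300.0) -> float:
    """Participation weight w = n / (n + k); k = 300 is the constant quoted above."""
    if n_messages < 0:
        raise ValueError("n_messages must be non-negative")
    return n_messages / (n_messages + k)


def weighted_orchestration(density: float, n_messages: int) -> float:
    """Scale a per-message coordination density by the participation weight.

    Assumption: the weight multiplies the raw density before it is placed
    on the 0-10 ruler; the source does not say where in the pipeline it is applied.
    """
    return density * participation_weight(n_messages)


# Hypothetical agents with the same raw density but very different volume.
quiet = weighted_orchestration(density=0.8, n_messages=60)    # w = 60 / 360
busy = weighted_orchestration(density=0.8, n_messages=1200)   # w = 1200 / 1500
```

With these illustrative inputs the quiet agent's weighted value is about a fifth of the busy agent's, even though their unweighted densities are identical.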

Frozen ruler

Anchors fitted once and frozen (v3.2-frozen-2026-06-07, 19 buckets). Closed-period scores stop drifting run-to-run; re-baselining is a deliberate, logged event.
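A minimal sketch of what scoring against a frozen ruler could look like. The `Anchor` structure, the clipping of values that fall outside the frozen range, and the example numbers are assumptions; the actual anchor file format is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """Frozen cohort range for one dimension (hypothetical structure)."""
    lo: float  # cohort minimum at freeze time
    hi: float  # cohort maximum at freeze time


def score_on_ruler(raw: float, anchor: Anchor) -> float:
    """Map a raw composite onto 0-10 using a frozen range.

    Values outside the frozen range are clipped to 0 or 10 (an assumption).
    """
    if anchor.hi <= anchor.lo:
        raise ValueError("anchor range must be non-degenerate")
    position = (raw - anchor.lo) / (anchor.hi - anchor.lo)
    return round(10.0 * min(1.0, max(0.0, position)), 1)


warmth_anchor = Anchor(lo=2.0, hi=12.0)     # illustrative numbers, not real anchors
print(score_on_ruler(11.0, warmth_anchor))  # 9.0, i.e. 90% along the frozen range
```

Because the anchor is read rather than refitted, rescoring a closed period with new data in the cohort returns the same number.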

Confidence intervals

Every score carries a closed-form 95% band (Wilson / Poisson) — wide for small samples, tight for large. No more reading a ±1.0 move as real when it sits at the noise floor.
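For reference, the Wilson score interval for a proportion can be sketched as follows. This shows only the Wilson case for a marker rate; how the band is carried onto the 0–10 scale, and where the Poisson variant is used instead, is not specified above, so neither is shown.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Closed-form Wilson score interval for a proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Same observed rate (50%), different sample sizes.
small = wilson_interval(5, 10)
large = wilson_interval(500, 1000)
```

With 5 hits in 10 messages the band runs from roughly 0.24 to 0.76; with 500 in 1000 it is far tighter, which is the small-sample behaviour described above.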

Significance

A trajectory inflection counts as real only if the two periods' bands separate. Roughly half of threshold-crossing moves clear it; the rest are shown but flagged “within noise”.

Reliability

Split-half Spearman-Brown per dim — the deterministic analog of inter-rater agreement. Tells you which dimensions are measured solidly and which are noisier.
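A sketch of the split-half procedure, assuming an odd/even split of each observation's messages and a per-half mean of some per-message marker count. The split rule and the input layout are assumptions; only the Spearman-Brown correction, r_full = 2r / (1 + r), is standard.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def spearman_brown(r_half: float) -> float:
    """Step a half-length correlation up to full-length reliability."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(per_message_values):
    """per_message_values: one list of per-message marker counts per observation.

    Each observation is split into odd- and even-indexed messages (assumed rule),
    the half means are correlated across observations, then corrected.
    """
    first, second = [], []
    for values in per_message_values:
        if len(values) < 2:
            raise ValueError("each observation needs at least two messages")
        odd, even = values[0::2], values[1::2]
        first.append(sum(odd) / len(odd))
        second.append(sum(even) / len(even))
    return spearman_brown(pearson(first, second))
```

A half-to-half correlation of 0.5, for example, steps up to about 0.67 after the correction.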

Per-dimension reliability (split-half, Spearman-Brown)

Dimension	Reliability	Rating
Voice	0.99	excellent
Conviction	0.75	acceptable
Warmth	0.88	good
Playful	0.82	good
Epistemic	0.72	acceptable
Domain	n/a	n/a
Output	0.92	excellent
Orchestration	n/a	n/a

Cooperation (Orchestration) and the profile-only Domain Specialty are n/a — their inputs aren't split-half decomposable in this pass. Voice and Output are the most reproducible; Conviction, Playfulness and Epistemic were lifted to 0.72–0.82 in v3.2 by enriching their lexicons (denser markers → less sampling noise).

Convergent validity · LLM cross-read

claude-sonnet-4-6 rated short anonymized samples from all 19 buckets against each other on each trait (cohort-ranking probe, full range forced) — matching the deterministic ruler's cohort-relative frame. Agreement with the score — Pearson r and rank-based ρ, with the LLM's rating spread:

Trait	Pearson r	Rank-based ρ	LLM rating range used (of 0–10)
Voice	0.22	0.14	2–8
Conviction	0.56	0.56	2–8
Warmth	0.66	0.49	2–9
Playful	0.37	0.36	2–6
Epistemic	0.16	0.05	1–8

Moderate convergent validity where traits are perceptible. Warmth (r≈0.66) and Conviction (r≈0.56) converge well: their markers show up plainly in a sample. Voice and Epistemic stay weak: they measure distributional rates (emoji and vocabulary diversity, citation density) that a sample read can't estimate, which is exactly why they're counted deterministically rather than eyeballed. The LLM now uses a wide rating range (above), so this is not restricted-range bias; those traits simply aren't eyeball-able, which says nothing against the score.
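The two agreement statistics can be computed as in the sketch below. It assumes the rank-based ρ is Spearman's coefficient, computed as the Pearson correlation of average ranks; the function names and inputs are illustrative and are not taken from validation/llm_validity.py.

```python
import math


def pearson(xs, ys):
    """Pearson r between deterministic scores and LLM ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Rank-based rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Reporting both is useful because r is sensitive to a few extreme buckets while ρ only looks at ordering, so a gap between them (as for Warmth) hints that agreement is driven by a handful of observations.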

Three spines, eight dimensions

Each spine answers a different question about an agent. Personality = who the agent is when it speaks. Capability = what the agent can do. Cooperation = how the agent works with other agents. Dimensions can be improved on independently — Darth has high Cooperation but mid Voice; Otto has top Voice but limited Cooperation (one period observed).

Personality

Who the agent is when it speaks — voice, conviction, warmth, playfulness, epistemic conduct. The research thesis lives here.

Voice Signature

Emoji palette + vocabulary richness. Fingerprint of how the agent writes.

Conviction

Stance-taking — assertions, willingness to disagree, position-strength markers.

Warmth

Gratitude, praise, agreement, support markers. How the agent feels to interact with.

Playfulness

Laughter, hyperbole, analogies, self-deprecation, signature-emoji punchlines.

Epistemic Discipline

Deliberate epistemic acts — limitation acknowledgment, source citation, self-correction. Hedge_rate excluded (verbal habit, not calibration signal).

Capability

What the agent can do — domain specialty profile and output structure.

Domain Specialty

Topical engagement profile across finance / brand / tech. No composite — profile-led.

Output Formalism

Structural formality — tables, code blocks, bullets, verbosity.

Cooperation

How the agent works with other agents — orchestration. A per-message coordination density, volume-weighted (v3.3) so low-presence agents don't read as the top cooperators.

Orchestration

Cross-agent coordination density — handoffs, mentions made/received, reply rate, responses received, per message. Volume-weighted by participation (×n/(n+300)) so a near-silent agent can't top the ranking.

Between-agent relations v3.1

The per-agent 0–10 scores answer “where does this agent sit now?” on the frozen ruler. Two run-invariant instruments add an independent longitudinal view: they measure agents against each other and against fixed absolute axes, so they corroborate the frozen 0–10 trend from a different direction. They drive the Evolution page.

Divergence (σ)

Per-spine, z-standardized Euclidean distance between each pair of agents, in standard deviations (0 = identical). Reported per spine because spines can move in opposite directions. Validated robust under z-euclidean, cosine, and divide-by-max.
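The divergence measure can be sketched as below, assuming each dimension of a spine is z-standardized across the agents present and the pairwise distance is a plain Euclidean norm with no further scaling. The input layout and the population standard deviation are assumptions.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def zscore_columns(profiles):
    """profiles: {agent: [dimension scores]}; standardize each dimension across agents."""
    agents = list(profiles)
    n_dims = len(profiles[agents[0]])
    z = {a: [] for a in agents}
    for d in range(n_dims):
        column = [profiles[a][d] for a in agents]
        mu, sd = mean(column), pstdev(column)
        for a in agents:
            # A dimension with no spread carries no information about divergence.
            z[a].append(0.0 if sd == 0 else (profiles[a][d] - mu) / sd)
    return z


def pairwise_divergence(profiles):
    """Euclidean distance between every pair of agents in z-space (0 = identical)."""
    z = zscore_columns(profiles)
    return {(a, b): math.dist(z[a], z[b]) for a, b in combinations(z, 2)}


# Hypothetical two-dimension spine for three placeholder agents.
sigma = pairwise_divergence({"A": [1.0, 2.0], "B": [1.0, 2.0], "C": [3.0, 6.0]})
```

Running this once per spine, rather than once over all dimensions, is what allows one spine to show convergence while another shows divergence for the same pair.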

Engagement

Directed who-replies-to-whom from multi-agent chats: reply-adjacency normalized to the replier's share of turns, cross-checked against @-name mentions. Yields each agent's hub share.

What it surfaced (T=5): Mo & Jarvis's personalities converge (1.66→0.73 σ) while their cooperative roles diverge (0.29→1.18 σ); an aggregate score hid the crossover. Mo is the hub (≈ 45% of agent-to-agent engagement), asymmetrically: Jarvis sends 44% of his replies to Mo, Mo 31% back. Otto is peripheral (≈ 1%).
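A rough sketch of how reply-adjacency and hub share could be derived from an ordered list of speakers. It assumes every turn is a reply to the immediately preceding speaker, normalizes each replier's outgoing replies to sum to 1, and defines hub share as the fraction of all agent-to-agent replies an agent receives. These definitions and the placeholder agent names are assumptions; the @-mention cross-check is not shown.

```python
from collections import Counter, defaultdict


def reply_edges(turns, agents):
    """Count directed replies as (replier, target) pairs between known agents."""
    edges = Counter()
    for prev, cur in zip(turns, turns[1:]):
        if prev != cur and prev in agents and cur in agents:
            edges[(cur, prev)] += 1
    return edges


def reply_shares(edges):
    """For each replier, the fraction of its replies that go to each target."""
    totals = Counter()
    for (replier, _target), count in edges.items():
        totals[replier] += count
    shares = defaultdict(dict)
    for (replier, target), count in edges.items():
        shares[replier][target] = count / totals[replier]
    return dict(shares)


def hub_share(edges):
    """Fraction of all agent-to-agent replies received by each agent."""
    total = sum(edges.values())
    received = Counter()
    for (_replier, target), count in edges.items():
        received[target] += count
    return {agent: count / total for agent, count in received.items()}


# Placeholder chat: the sequence of speakers in one multi-agent thread.
edges = reply_edges(["A", "B", "A", "C", "A", "B"], agents={"A", "B", "C"})
```

Keeping the edges directed is what makes an asymmetry like the one reported above visible: the share of A's replies going to B and the share of B's replies going to A are computed separately.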

Trajectory engine

The main analytical artifact in v2.1 is not a leaderboard but a record of how agents move between scoring rounds and why.

Are they changing?

For each (agent × dim) pair we compute period-over-period deltas. Any |Δ| ≥ 1.0 is an inflection; it is marked significant (v3.1) only if the two periods' 95% bands separate — otherwise it's shown but flagged within-noise.
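The two-stage rule (threshold first, then band separation) can be sketched as follows. The `ScoredPeriod` structure and the example values are hypothetical; the |Δ| ≥ 1.0 threshold and the band-separation test come from the text above.

```python
from dataclasses import dataclass


@dataclass
class ScoredPeriod:
    period: str   # label of the scoring period
    score: float  # 0-10 score on the frozen ruler
    lo: float     # lower edge of the 95% band
    hi: float     # upper edge of the 95% band


def classify_inflections(series, threshold=1.0):
    """Return inflections keyed on the later period, flagged significant or within noise."""
    found = []
    for prev, cur in zip(series, series[1:]):
        delta = cur.score - prev.score
        if abs(delta) < threshold:
            continue  # not an inflection at all
        # Significant only if the two bands do not overlap.
        separated = cur.lo > prev.hi or cur.hi < prev.lo
        found.append({
            "to_period": cur.period,
            "delta": round(delta, 2),
            "significant": separated,
        })
    return found


# Illustrative series for one (agent, dimension) pair.
example = classify_inflections([
    ScoredPeriod("P1", 5.0, 4.5, 5.5),
    ScoredPeriod("P2", 7.0, 6.4, 7.6),
    ScoredPeriod("P3", 5.8, 4.9, 6.7),
])
```

In this illustrative series the P1 to P2 rise is significant because the bands separate, while the P2 to P3 drop crosses the threshold but its bands overlap, so it would be shown as within noise.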

When?

Inflections are anchored to the to-period — the round where the new state is first observed. Sparklines mark inflection points with a darker dot.

Why?

Each period has a logged event list (IC Protocol, Trinity Capital, Otto debut, etc). Inflections attach the events from their to-period as candidate explanations.

Caveat: event annotation is correlation, not causation. The trajectory engine flags when and lists what else was happening; deciding why remains a human read.

v2.2 → v3.0 changes (personality-led redesign)

Dimension	Change	Why
v3.0: Conviction dim added	NEW	assertion_rate + disagreement + position_strength. Captures whether an agent takes positions or always defers. Was measured but unused in v2.x.
v3.0: Warmth dim added	NEW	agreement + gratitude + praise + support. The 'how does this agent feel to interact with' axis. Reveals Jarvis P2→P7 warmth 8.0→1.0 — the 'becoming deadpan' trajectory.
v3.0: Playfulness dim added	NEW	laughter + hyperbole + self_deprecation + analogy + signature-emoji-punchline. Captures Mo's 🗿 mic-drops, Darth's 'value-laundering' neologisms, etc.
v3.0: Sentiment instrument (Level 2 + 3)	NEW	4-axis sentiment profile (warmth/energy/critical/doubt) per (agent, period) AND per (agent, period, chat_id). The relationship-level slice answers 'does Mo with Anna look different from Mo with Jarvis?' — directly testable now.
v3.0: Identity Markers extractor	NEW	Per-agent distinctive n-grams. TF-IDF style — agent's frequency / cohort frequency. Surfaces signature phrases like Otto's 'find ich gut', Darth's 'konglomerat', Mo's 'lasse mich' — direct evidence of persona.
v3.0: LLM history metadata	NEW	Per-agent LLM model history. Currently informational — but the field exists so brain-swap tests can compare same-agent across different LLMs.
v3.0: Scoring reframed as DISTINCTIVENESS	REFRAME	A 9 means '90% along cohort range', not 'higher quality'. The methodology measures personality formation, not LLM effectiveness.
v3.0: Engagement Cadence dim retired	REMOVED	Was the weakest dim — features (msgs_per_active_day, question_rate, multi_agent_chat_share) remain visible in drill-down but no longer composite a top-line dim.
v2.2 baseline (May 23)	BASELINE	calibration fixes — hedge_rate out of Epistemic, family out of cohort Domain, P8 in-progress excluded, feature_deltas added to inflections.
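The Identity Markers row above describes a ratio of an agent's n-gram frequency to the cohort's frequency. A minimal sketch of that ratio follows, assuming whitespace tokenization, bigrams, a minimum-count filter, and a cohort that includes the agent itself. All of these choices, and the toy messages, are assumptions.

```python
from collections import Counter


def ngrams(text, n=2):
    """Lower-cased whitespace n-grams of one message."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinctive_ngrams(messages_by_agent, agent, n=2, min_count=2, top_k=5):
    """Rank an agent's n-grams by (agent relative frequency) / (cohort relative frequency)."""
    per_agent = {
        name: Counter(g for msg in msgs for g in ngrams(msg, n))
        for name, msgs in messages_by_agent.items()
    }
    cohort = Counter()
    for counts in per_agent.values():
        cohort.update(counts)
    own = per_agent[agent]
    own_total, cohort_total = sum(own.values()), sum(cohort.values())
    scored = []
    for gram, count in own.items():
        if count < min_count:
            continue  # too rare to call a signature phrase
        ratio = (count / own_total) / (cohort[gram] / cohort_total)
        scored.append((gram, ratio))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Toy corpus with placeholder agents.
toy = {
    "A": ["find ich gut", "find ich gut wirklich"],
    "B": ["das ist gut", "das ist so"],
}
markers = distinctive_ngrams(toy, "A")
```

A ratio above 1 means the phrase is over-represented for that agent relative to the cohort; the minimum-count filter keeps one-off phrases from dominating the list.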

Feature inventory

31 features extracted per (agent, period); a subset feeds scored dimensions, the rest provide drill-down detail.

Volume / cadence · 1/3 in v2.1

total_messages, msgs_per_active_day, active_days

Length distribution · 1/4 in v2.1

median_words, p95_words, short_msg_ratio, long_msg_ratio

Linguistic · 1/4 in v2.1

vocab_diversity, de_ratio, en_ratio, code_switch_rate

Style markers · 4/7 in v2.1

emoji_use_rate, italic_use_rate, bold_use_rate, bullet_use_rate, table_use_rate, code_block_rate, url_share_rate

Conversational · 1/3 in v2.1

question_rate, mention_rate, double_text_rate

Epistemic · 4/5 in v2.1

hedge_rate_per100, citation_rate_per100, self_correction_per100, limitation_per100, assertion_rate_per100

Interactive · 1/3 in v2.1

agreement_per100, disagreement_per100, handoff_per100

Domain · 4/4 in v2.1

finance_per100, brand_per100, tech_per100, family_per100

Cooperation (new) · 5/5 in v2.1

cross_agent_reply_rate, mentions_received_per100, mentions_made_other_agents_per100, multi_agent_chat_share, responded_to_by_agent_per100

Reproducibility

# In agent-research-data/

python3 src/parser.py

python3 src/features.py

python3 src/cooperation.py

python3 src/score_v2.py # loads frozen anchors

python3 src/relations.py

python3 src/sync_to_dashboard.py

# optional: python3 validation/llm_validity.py

Pipeline order matters: parser → features → cooperation (merges) → score (loads the frozen anchors) → relations → sync. The ruler is frozen, so the 0–10 scores stay comparable across runs; re-fitting the cohort range is an explicit --rebaseline step (a logged methodology event). The relations step (divergence + engagement) is run-invariant and provides an independent longitudinal cross-check.
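Since the order matters, a small wrapper that runs the steps in sequence and stops at the first failure can help. The sketch below uses the script paths from the listing above; the wrapper itself, its function name, and the injectable `runner` argument are hypothetical, and the --rebaseline step is deliberately left out because the text does not say which script accepts it.

```python
import subprocess
import sys

# Step order as given above: parser, features, cooperation, score, relations, sync.
PIPELINE = [
    "src/parser.py",
    "src/features.py",
    "src/cooperation.py",
    "src/score_v2.py",
    "src/relations.py",
    "src/sync_to_dashboard.py",
]


def run_pipeline(steps=PIPELINE, runner=subprocess.run):
    """Run each script with the current interpreter; stop at the first failure."""
    completed = []
    for script in steps:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            raise RuntimeError(f"{script} failed with exit code {result.returncode}")
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from inside agent-research-data/ would execute the six steps in order; a failure in an early step prevents later steps from running on stale inputs.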

Non-goals

No competitive leaderboard — the 0–10 is a cohort-relative position on a frozen ruler, for context, not a ranking of which agent is “winning”.
No quality value-judgments — features like double-text rate are reported, not penalized.
No claim that high scores = “better agent” — Otto's 10/10 Voice means most distinctive, not best performing.
Inflection-event mapping is a hypothesis surface, not causal proof.