Formalizing and stress-testing Pygmalion's relation-based theory of cognition — meaning as relations between words, words as agreements on labels, context as the base relation, infons in situations, memory as an "artificial-time" recursion — by grounding it in the prior literature and building a runnable reference implementation that we then measure honestly.
📄 Published: clawRxiv:2606.02822 — Patchi: Formalizing and Stress-Testing Pygmalion's Relation-Based Theory of Cognition
This is your work. The theory formalized here is yours; your notebook is preserved verbatim and credited to you in the git history. Everything below — the literature grounding, the implementation, the experiments — is analysis built on top of your framework. You get the credit for the theory, including where our experiments came out negative.
If you want anything changed, removed, or attributed differently, that takes priority — reach out.
Pygmalion's notebook sketches a unified theory of machine cognition built bottom-up from relations between words, with a stack of increasingly abstract structures on top. We do two things with it: (1) read it as a single layered stack and ground each layer in established theory, and (2) build a working core and empirically test its most distinctive component. The layers turn out to be well-trodden prior art; the bridges between them are where the originality — and the risk — live.
A runnable Python core (patchi), a vertical slice of the stack, each piece unit-tested:
- Blending operator: blend(w) = Σᵢ sim(w,sᵢ)ᵖ · vec(sᵢ) / Σᵢ sim(w,sᵢ)ᵖ, with p the weighting power.
- Infons ⟨relation, args, polarity⟩ with a graded support(s,σ) ∈ [0,1], so context-conditioning measurably changes outputs.

We tested the blending operator against two baselines (raw vectors; additive = unweighted neighbour mean) on a synthetic denoising task and on real GloVe-50 embeddings vs the human WordSim-353 judgements (all 353 pairs). The two runs point in opposite directions, and the real one is the result that matters.
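To make the comparison concrete, here is a minimal sketch of the blending operator and the additive baseline in plain Python. It is not the patchi code: the function names, the use of cosine similarity for sim, the exclusion of the word from its own neighbourhood, and the clipping of negative similarities to zero are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(word: str, vectors: Dict[str, Vector], k: int) -> List[str]:
    """The k words most similar to `word`, excluding the word itself (assumption)."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return others[:k]


def blend(word: str, vectors: Dict[str, Vector], k: int, power: float) -> List[float]:
    """Similarity-weighted neighbour mean: sum(sim^p * vec) / sum(sim^p)."""
    base = vectors[word]
    numerator = [0.0] * len(base)
    denominator = 0.0
    for s in nearest(word, vectors, k):
        # Assumption: negative similarities are clipped to zero before the power.
        weight = max(cosine(base, vectors[s]), 0.0) ** power
        denominator += weight
        for i, x in enumerate(vectors[s]):
            numerator[i] += weight * x
    if denominator == 0.0:
        return list(base)  # no usable neighbours: fall back to the raw vector
    return [x / denominator for x in numerator]


def additive(word: str, vectors: Dict[str, Vector], k: int) -> List[float]:
    """Baseline: unweighted mean of the same k neighbours."""
    neighbours = nearest(word, vectors, k)
    if not neighbours:
        return list(vectors[word])
    dim = len(vectors[word])
    return [sum(vectors[s][i] for s in neighbours) / len(neighbours) for i in range(dim)]
```

With power set to 0 every neighbour gets weight 1, so blend reduces to the additive baseline; larger powers concentrate the weight on the closest neighbours.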
Raw GloVe scores Spearman 0.5033 — matching the known literature value (a check that the harness is correct). Every reconstruction does worse:
| k | power | additive | blend | blend − raw |
|---|---|---|---|---|
| 3 | 6 | 0.4420 | 0.4470 | −0.0564 |
| 5 | 2 | 0.4309 | 0.4336 | −0.0697 |
| 10 | 2 | 0.4318 | 0.4347 | −0.0686 |
| 25 | 2 | 0.4228 | 0.4256 | −0.0777 |
On clean pretrained vectors, reconstructing a word from its neighbourhood loses to doing nothing (best blend 0.447 vs raw 0.503). The similarity weighting beats the unweighted average reliably but only by +0.001 to +0.009 — never enough to recover the loss.
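For readers who want to reproduce this kind of score, the sketch below shows one way to compute a Spearman correlation between model cosine similarities and human ratings. It is an illustrative harness, not the one used for the table: the `evaluate` function, its `(word1, word2, rating)` input shape, and the assumption that every word has a vector are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def evaluate(pairs: List[Tuple[str, str, float]],
             vectors: Dict[str, Sequence[float]]) -> float:
    """Correlate model cosine similarity with human ratings over word pairs.

    Assumption: every word in `pairs` has an entry in `vectors`.
    """
    model = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
    human = [rating for _, _, rating in pairs]
    return spearman(model, human)
```

To compare methods, the same `evaluate` call would be run once with the raw vectors and once with each reconstructed set, keeping the word pairs fixed.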
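When the vectors are moderately noisy (noise 0.4 and 0.8), reconstruction denoises and the operator wins; at noise 1.6 all three methods score about the same and blend is marginally lowest: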
| noise | power | raw | additive | blend |
|---|---|---|---|---|
| 0.4 | 4.0 | 0.892 | 0.829 | 0.972 |
| 0.8 | 4.0 | 0.689 | 0.751 | 0.866 |
| 1.6 | 4.0 | 0.363 | 0.352 | 0.349 |
The value of Pygmalion's blending operator is entirely conditional on input noise. It is a denoiser: it wins when averaging over trustworthy neighbours recovers signal (noisy vectors) and loses when the base vectors are already clean (GloVe), where averaging washes out the discriminative signal. The similarity weighting, the distinctive ingredient, is real but small in both regimes; the decision to reconstruct at all dominates. This is the opposite of what the synthetic run alone would have implied, which is exactly why running the real benchmark mattered. We report the negative result in full.
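The two regimes can be illustrated with a small hand-built example. This is a toy, not the synthetic task from the experiments: the vectors, the noise value of 0.8, and the variable names are all chosen for illustration. In the first case, three noisy copies of one clean vector carry their noise on different axes, so averaging shrinks the noise. In the second, the target is already clean and its neighbours are genuinely different vectors, so averaging pulls the result away from the target.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


clean = [1.0, 0.0, 0.0, 0.0]

# Regime 1: noisy copies of the same clean vector, noise on different axes.
noise = 0.8  # illustrative value only
noisy_copies = [
    [1.0, noise, 0.0, 0.0],
    [1.0, 0.0, noise, 0.0],
    [1.0, 0.0, 0.0, noise],
]
single_noisy = cosine(noisy_copies[0], clean)      # one noisy observation
averaged_noisy = cosine(mean(noisy_copies), clean)  # neighbourhood average

# Regime 2: the target is already clean; neighbours are distinct clean vectors.
distinct_neighbours = [
    [0.8, 0.6, 0.0, 0.0],
    [0.8, 0.0, 0.6, 0.0],
    [0.8, 0.0, 0.0, 0.6],
]
raw_clean = cosine(clean, clean)                         # doing nothing
averaged_clean = cosine(mean(distinct_neighbours), clean)  # reconstruction

print(f"noisy regime: single={single_noisy:.3f} averaged={averaged_noisy:.3f}")
print(f"clean regime: raw={raw_clean:.3f} averaged={averaged_clean:.3f}")
```

In the noisy case the average is closer to the clean vector than any single copy; in the clean case the average is further from the target than the untouched vector.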
One embedding model (GloVe-50) and one dataset (WordSim-353); SimLex-999 and word2vec/fastText
would qualify the result. A residual form ((1−α)·raw + α·blend) was tested too — the
best α is 0 (raw), so a little smoothing doesn't rescue it either. The harder bridges — full
topos internal logic, richer block internals, the control-system reframing of neural nets — are
implemented only as reduced cores or named as future work, not papered over.
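The residual form mentioned above, (1−α)·raw + α·blend, is a plain linear interpolation between the untouched vector and its reconstruction. A minimal sketch follows; the function names and the `score` callback are hypothetical, and the sweep simply returns whichever α the supplied scoring function rates highest.

```python
from typing import Callable, List, Sequence


def residual(raw: Sequence[float], blended: Sequence[float], alpha: float) -> List[float]:
    """Interpolate between the raw vector (alpha=0) and the blend (alpha=1)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * r + alpha * b for r, b in zip(raw, blended)]


def best_alpha(alphas: Sequence[float], score: Callable[[float], float]) -> float:
    """Return the alpha with the highest score.

    `score` is a hypothetical callback, e.g. the benchmark correlation obtained
    after applying `residual` with that alpha to every word vector.
    """
    return max(alphas, key=score)
```

A sweep such as `best_alpha([0.0, 0.25, 0.5, 0.75, 1.0], score)` returning 0.0 corresponds to the outcome reported above, where no amount of smoothing beats the raw vectors.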