TL;DR
This project treats AI quality as a consequence of environment design, evaluation rigor, and training signal quality. The simulator provides deterministic ground truth. The corpus pipeline turns heterogeneous expert knowledge into semantically atomic JSONL. The long-term policy loop learns against real outcomes, not weak proxies.
Example training records appear later in this article.
The strongest present-tense artifact is the domain training pipeline: curated rules, expert gameplay insights, strategy articles, and card-specific research transformed into a canonical schema designed to support both human-curated knowledge and future simulator-generated preference data.
Current Dataset Snapshot
The corpus is intentionally engineered rather than scraped. Strategy articles,
expert gameplay analysis, rules references, and card-level research are curated,
reviewed, and normalized into a shared schema so the resulting dataset supports
retrieval, supervised learning, and future preference generation without
changing the data model.
- Strategy Articles: 271 curated articles from The EPIC Storm website
- Expert Gameplay Review: 30+ hours of Bryant Cook gameplay analyzed with Gemini extraction + manual verification
- Card Research Corpora: 24 curated card-focused deep research corpora
- Rules Corpus: curated subset of the comprehensive Magic rules reduced to deck-relevant mechanics
- Normalization Target: canonical shared schema across heterogeneous knowledge sources
- Training Output: compact, semantically atomic JSONL records with one clear concept per entry
A decision space where each choice can collapse entire strategies
In high-level play, small resource differences completely change which lines are valid. A single land drop, an extra artifact, or the availability of a discard-compatible payoff can transform a position from dead to winning.
In The EPIC Storm, the system must reason about mana colors, storm count, graveyard accessibility, tutor chains, conditional sequencing, and interactions like:
- Echo of Eons reshaping the future decision surface via a seven-card redraw
- Lion’s Eye Diamond creating mana while invalidating hand-based lines
- Chrome Mox forcing color-specific imprint decisions with opportunity cost
- Fetch lands + dual lands expanding color reach through conditional search space
- Tendrils of Agony making sequencing quality inseparable from final outcome
That makes the domain useful for AI engineering for the same reason many toy benchmarks are not: success depends on structured reasoning under hard constraints, not just language fluency.
The result is a concrete environment where the central question is:
given state S, does decision sequence A convert more reliably than decision sequence B?
This is the key framing shift: the project is not “an LLM for a card game.”
It is an evaluation-driven decision system built inside a brutally constrained,
high-branching environment that makes good measurement non-optional.
Branching explosion in real decision spaces
The central difficulty of this domain is not rule complexity alone — it is
the combinatorial explosion of legal decision sequences. Even a modest mid-game
position can produce dozens of legal actions, each of which changes the
available future action space.
Initial state S₀
│
│ Battlefield: Lotus Petal
│ Hand: Brainstorm, Gamble, Beseech the Mirror, Dark Ritual, Chrome Mox, Echo of Eons
│
├─ crack Lotus Petal → generate BLUE
│ │
│ └─ cast Brainstorm
│ │
│ ├─ draw 3: Lion's Eye Diamond, Scalding Tarn, & Mox Opal
│
│ Hand after draw:
│ A: Gamble
│ B: Beseech the Mirror
│ C: Dark Ritual
│ D: Chrome Mox
│ E: Echo of Eons
│ F: Lion's Eye Diamond
│ G: Scalding Tarn
│ H: Mox Opal
│
│        (choose any 2 cards {X,Y} from {A–H} to put back on top; C(8,2) = 28 branches)
│
│ ├─ put back A & {B C D E F G H}
│ ├─ put back B & {C D E F G H}
│ ├─ put back C & {D E F G H}
│ ├─ put back D & {E F G H}
│ ├─ put back E & {F G H}
│ ├─ put back F & {G H}
│ └─ put back G & {H}
│
│ (all branches continue with updated hand states)
│
│ │
│ └─ if Scalding Tarn (G) ∉ {X,Y}
│ │
│ ├─ play Scalding Tarn
│ │ ├─ do not crack
│ │ │ └─ continue with known top cards (X,Y)
│ │ │
│ │ └─ crack Scalding Tarn
│ │ ├─ fetch Badlands
│ │ ├─ fetch Underground Sea
│ │ └─ fetch Taiga
│ │
│ │ shuffle library (destroy Brainstorm information)
│ │
│ │ effect:
│ │ ├─ removes known top cards (X,Y)
│ │ └─ randomizes next draws
│ │
│ └─ proceed with updated state
│
├─ crack Lotus Petal → generate RED
│ │
│ └─ cast Gamble
│ │
│ └─ search for Lion's Eye Diamond
│ │
│ ├─ discard Brainstorm
│ ├─ discard Beseech the Mirror
│ ├─ discard Dark Ritual
│ ├─ discard Chrome Mox
│ ├─ discard Echo of Eons
│ └─ discard Lion's Eye Diamond → DEAD END
│
│ (remaining branches continue with LED in hand)
│
│ │
│ ├─ cast Lion's Eye Diamond
│ │ └─ crack Lion's Eye Diamond → generate 3 BLUE
│ │ └─ cast Echo of Eons → shuffle graveyard & hand into library and draw 7 cards
│ │
│ └─ pass the turn
│
├─ crack Lotus Petal → generate BLACK
│ │
│ └─ cast Dark Ritual
│ │
│ └─ cast Chrome Mox
│ │
│ ├─ imprint Brainstorm
│ ├─ imprint Gamble
│ ├─ imprint Echo of Eons
│ └─ imprint Beseech the Mirror → DEAD END
│
│ │
│ └─ remaining branches may cast
│ │
│ └─ Beseech the Mirror
│ │
│ ├─ search & cast Gaea's Will (Recursive Engine)
│ ├─ search & cast Song of Creation (Draw Engine)
│ └─ search & cast Tendrils of Agony (Win Condition)
│
└─ pass the turn
Each branch produces a new state with a different mana pool, card availability,
and future line feasibility. Within only a few turns the total number of
reachable states becomes enormous.
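The 28-way put-back fan-out in the tree above is just C(8,2). A few lines of Python (hand contents taken from the example state; the per-decision multiplier at the end is illustrative, not measured) make the early explosion concrete:

```python
import math
from itertools import combinations

# Hand after the Brainstorm draw in the example tree (cards A–H).
hand = ["Gamble", "Beseech the Mirror", "Dark Ritual", "Chrome Mox",
        "Echo of Eons", "Lion's Eye Diamond", "Scalding Tarn", "Mox Opal"]

# Brainstorm returns 2 cards to the top; treated order-insensitively here.
put_backs = list(combinations(hand, 2))
assert len(put_backs) == math.comb(8, 2) == 28

# Illustrative compounding: if each of the next 5 decisions averaged
# 20 legal options, the naive frontier would already exceed 3 million states.
assert 20 ** 5 == 3_200_000
```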
The role of the neural policy is not to replace the simulator.
It is to guide exploration toward promising regions of this decision space
so the deterministic engine can evaluate them precisely.
A naïve exhaustive search is infeasible. The simulator must aggressively prune
impossible or dominated branches while still preserving the sequences that
represent legitimate winning lines.
- Decision branching: Typical mid-combo states can produce dozens of legal actions, each spawning further sequencing branches.
- State sensitivity: Small resource differences — a single mana source or card — can completely change which lines remain viable.
- Outcome brittleness: Seemingly minor sequencing differences often determine whether a combo line succeeds or collapses.
- Engineering implication: Efficient simulation, branch pruning, and candidate ranking become mandatory for tractable evaluation.
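The pruning-plus-ranking pattern can be sketched as a toy beam search. This is an illustration, not the project's actual search code: `legal_actions`, `apply_action`, and `score` are placeholders for the simulator's action enumeration, its deterministic transition, and a learned policy ranking.

```python
import heapq

def beam_search(initial_state, legal_actions, apply_action, score,
                depth, beam_width):
    """Keep only the top `beam_width` states per ply, ranked by `score`.

    Hypothetical sketch: the callables stand in for the simulator's
    action enumeration, deterministic transition, and policy scoring.
    """
    frontier = [initial_state]
    for _ in range(depth):
        # Expand every surviving state by every legal action.
        candidates = [
            apply_action(state, action)
            for state in frontier
            for action in legal_actions(state)
        ]
        if not candidates:
            break
        # Prune: only the highest-scoring states survive to the next ply.
        frontier = heapq.nlargest(beam_width, candidates, key=score)
    return frontier
```

With a beam of width k, the frontier stays at k states per ply instead of growing multiplicatively, while the deterministic engine still evaluates every surviving line exactly.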
System architecture
1) Deterministic symbolic layer
A high-throughput .NET simulator represents full game state, enumerates legal actions under constraints,
performs deterministic transitions, and computes reproducible outcome quality.
- Deterministic state transitions
- Strict legality and invariant enforcement
- Branch-sensitive evaluation
- Reproducible output for identical inputs
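That last invariant is directly testable: identical state plus identical action must yield byte-identical successor states. A minimal sketch of the check, with a stub transition function standing in for the .NET simulator:

```python
import json

def apply_action(state, action):
    """Stub deterministic transition: a pure function of (state, action)."""
    new_state = dict(state)
    new_state["storm_count"] = state["storm_count"] + 1
    new_state["mana_pool"] = state["mana_pool"] + action["mana_produced"]
    return new_state

def canonical(state):
    # Serialize with sorted keys so equality is byte-level reproducible.
    return json.dumps(state, sort_keys=True)

state = {"storm_count": 0, "mana_pool": 0}
action = {"mana_produced": 1}
# Reproducibility invariant: same inputs, byte-identical output.
assert canonical(apply_action(state, action)) == canonical(apply_action(state, action))
```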
2) Neural policy layer
Python-based model training learns to rank candidate decisions and generalize across states the symbolic
layer can evaluate but cannot cheaply search exhaustively.
- Supervised fine-tuning
- Preference optimization
- Future closed-loop policy improvement
- ONNX export for in-process inference
DOMAIN KNOWLEDGE PIPELINE
┌────────────────────────────────────────────────────────────────┐
│ Rules · Strategy Articles · Expert Gameplay · Research │
└────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────┐
│ Canonical normalization │
│ + schema enforcement │
└──────────────────────────┘
│
▼
┌──────────────────────────┐
│ Semantically atomic JSONL│
│ training corpus │
└──────────────────────────┘
│
▼
NEURAL TRAINING LAYER
┌──────────────────────────┐
│ Python / PyTorch pipeline│
│ SFT · Preference · Eval │
└──────────────────────────┘
│
▼
ONNX MODEL EXPORT
│
▼
INFERENCE + EVALUATION
┌─────────────────────────────────┐
│ Deterministic C# simulation │
│ - state transitions │
│ - legality enforcement │
│ - candidate line evaluation │
└─────────────────────────────────┘
│
▼
Comparative outcomes / preferences
│
└──────────────┐
▼
Policy retraining
Engineering the domain knowledge pipeline
Source curation
Comprehensive rules, expert gameplay video, long-form strategy articles, and per-card deep research are collected because each source contributes different kinds of signal: formal rules, tactical nuance, sequencing heuristics, and card-specific edge cases.
Model-assisted extraction with human verification
Gemini and ChatGPT Deep Research are used as extraction tools, not authorities. Their outputs are manually reviewed, corrected, trimmed, and rewritten where needed before entering the pipeline.
Canonical normalization
Raw artifacts are cleaned into a shared schema designed to support both prose-heavy knowledge sources and
future simulation-derived records without changing the downstream interface.
Semantically atomic JSONL generation
Each record encodes one clear concept, constraint, or strategic principle, making the output useful for
retrieval, fine-tuning, preference construction, and auditability.
Raw sources
-> extracted notes / draft markdown
-> manual verification + deletion of weak content
-> cleaned markdown artifact
-> canonical structured representation
-> semantically atomic JSONL
-> training / retrieval / future preference generation
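The final two stages of that flow can be sketched in Python. Everything here is hypothetical helper code, not the project's pipeline: the real steps (model-assisted extraction, manual review) happen upstream, and this only shows the atomic-JSONL shape of the output.

```python
import hashlib
import uuid

def normalize(artifact_text, artifact_id):
    """Sketch: split a cleaned markdown artifact into one-concept records."""
    records = []
    for i, paragraph in enumerate(p.strip() for p in artifact_text.split("\n\n")):
        if not paragraph:
            continue
        records.append({
            "id": str(uuid.uuid4()),            # stable per-record identity
            "record_type": "insight",
            "text": paragraph,                   # one concept per record
            "artifact_id": artifact_id,          # provenance: source artifact
            "segment_id": f"{artifact_id.lower()}_{i}",  # provenance: segment
        })
    return records

def artifact_id_for(raw_bytes):
    # Content-addressed artifact id (illustrative choice of hash).
    return hashlib.md5(raw_bytes).hexdigest().upper()
```

Keeping provenance fields (`artifact_id`, `segment_id`) on every record is what makes the corpus auditable: any training example can be traced back to the human-reviewed artifact it came from.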
Closed-loop learning design
The long-term learning loop is intentionally evaluation-first. The policy does not learn from vibes, static
correctness labels, or the kind of proxy metrics that create "accuracy theater". It learns from deterministic
comparative outcomes.
1. Policy proposes candidate lines
A neural policy ranks decision sequences for a given state.
2. Simulator evaluates outcomes deterministically
Each line is executed against the same symbolic environment with identical rules and constraints.
3. Comparative supervision is generated
Better/worse line pairs, conversion metrics, and structured failure reasons are derived from actual
outcomes.
4. Policy is retrained and re-benchmarked
New SFT and DPO data is folded back into training, then measured longitudinally against prior policy
versions.
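Step 3 above can be sketched concretely. This is an assumed shape, not the project's code: `evaluated_lines` maps a candidate line to its simulated conversion rate, and the near-tie `margin` is an illustrative threshold.

```python
from itertools import combinations

def preference_pairs(state_prompt, evaluated_lines, margin=0.05):
    """Turn deterministic outcome metrics into chosen/rejected DPO pairs."""
    pairs = []
    for (line_a, score_a), (line_b, score_b) in combinations(
            evaluated_lines.items(), 2):
        if abs(score_a - score_b) < margin:
            continue  # skip near-ties: weak supervision signal
        chosen, rejected = (
            (line_a, line_b) if score_a > score_b else (line_b, line_a)
        )
        pairs.append({
            "prompt": state_prompt,
            "chosen": chosen,
            "rejected": rejected,
        })
    return pairs
```

Because the scores come from deterministic simulation rather than human labels, every pair carries an outcome-grounded reason the chosen line is preferred.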
CURRENT GAME STATE
│
▼
Neural policy proposes lines
┌──────────────┼──────────────┐
▼ ▼ ▼
Line A Line B Line C
│ │ │
▼ ▼ ▼
Deterministic simulator executes each line
│ │ │
▼ ▼ ▼
Outcome metrics: success rate · resource use · stability
│
▼
Comparative supervision generated
(preference pairs / regret / failure reasons)
│
▼
Policy retraining
│
▼
New policy benchmarked vs prior versions
Model Architecture & Training Configuration
The policy model uses a three-stage training curriculum: domain knowledge
fine-tuning, behavioral tuning on simulator outcomes, and preference
optimization on counterfactual line comparisons.
Base Model
Qwen2.5-14B (INT4 / QLoRA)
Training Hardware
Single NVIDIA RTX 4090 (24GB)
Precision
BF16 + gradient checkpointing
Training Curriculum
Domain SFT → Outcome SFT → Counterfactual DPO
Stage 1: Domain Knowledge
High-capacity adapter trained on curated domain corpus.
- Method: SFT (QLoRA)
- LoRA rank: 48 (α=96)
- Target modules: q, k, v, o, gate, up, down
- Effective batch: 32
- Learning rate: 1e-4
- Epochs: 3
Stage 2: Simulator Outcomes
Behavioral tuning on preferred action sequences.
- Method: SFT (QLoRA)
- LoRA rank: 16 (α=32)
- Target modules: attention projections (q, v)
- Effective batch: 32
- Learning rate: 2e-4
- Epochs: 2
Stage 3: Counterfactual DPO
Preference optimization on simulator-derived line comparisons.
- Method: Direct Preference Optimization
- β: 0.05
- Effective batch: 32
- Learning rate: 3e-6
- Epochs: 1
Curriculum rationale:
Stage 1 injects domain knowledge using a high-capacity adapter across all
projection layers. Stage 2 narrows the adapter to attention layers and
tunes behavior toward successful simulator outcomes. Stage 3 refines
decision quality through counterfactual preference optimization, teaching
the model why certain lines outperform alternatives.
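Assuming the QLoRA stages are built on Hugging Face peft, the two adapter shapes can be written as configuration objects. Argument names follow peft's `LoraConfig`; treat this as an illustrative sketch, not the project's actual training code.

```python
from peft import LoraConfig

# Stage 1 adapter: high capacity, all projection layers (rank 48, alpha 96).
stage1 = LoraConfig(
    r=48,
    lora_alpha=96,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Stage 2 adapter: narrowed to attention q/v projections (rank 16, alpha 32).
stage2 = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```

The rank/alpha ratio of 1:2 is kept constant across stages; only the adapter's capacity and reach shrink as training moves from broad knowledge injection to targeted behavioral tuning.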
Example training records
Each JSONL record encodes one semantically atomic concept derived from the domain knowledge pipeline.
JSONL sample · rules record
{
"id": "f884d4a7-07ff-4bbc-938b-b415546d5287",
"record_type": "insight",
"cards": [],
"frame_kind": "game_mechanic",
"event_kind": "rules_definition",
"insight_type": "rules",
"text": "Flashback allows an instant or sorcery card to be cast from the graveyard by paying its flashback cost instead of its mana cost.",
"artifact_id": "681249302ECD202DEF83BDF2205FB9F2",
"segment_id": "comprehensive_rules_702_34a_1"
}
JSONL sample · technique record
{
"id": "4b8efbce-5d34-477e-828c-8652a7373ada",
"record_type": "insight",
"cards": ["thoughtseize","echo_of_eons"],
"frame_kind": "graveyard_setup",
"event_kind": "engine_activation",
"insight_type": "technique",
"text": "Targeting yourself with Thoughtseize can place Echo of Eons into the graveyard to enable its flashback ability.",
"artifact_id": "9E0286190ABAE83F76D5CBB44151DC11",
"segment_id": "tes_video_match1_2"
}
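A lightweight audit pass can enforce that contract before records enter training. The helper below is hypothetical; the length threshold is an illustrative proxy for semantic atomicity, not a rule from the pipeline.

```python
import json

REQUIRED_KEYS = {"id", "record_type", "cards", "frame_kind",
                 "event_kind", "insight_type", "text",
                 "artifact_id", "segment_id"}

def validate_record(jsonl_line):
    """Return a list of problems for one JSONL line (empty list = valid)."""
    problems = []
    record = json.loads(jsonl_line)
    missing = REQUIRED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # Crude atomicity heuristic: one concept should fit in a short passage.
    if len(record.get("text", "")) > 500:
        problems.append("text too long to be semantically atomic")
    if record.get("record_type") != "insight":
        problems.append("unexpected record_type")
    return problems
```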
Design principles shaping the system
Several architectural constraints shape the system. These are not domain-specific choices;
they are design principles intended to make policy improvement measurable and reproducible.
Evaluation before modeling
The system begins with a deterministic environment capable of evaluating candidate
strategies reproducibly. Model quality is therefore measured against stable outcomes
rather than proxy metrics or human intuition.
- Deterministic state transitions
- Reproducible scenario replay
- Comparative outcome evaluation
Symbolic correctness + neural generalization
The symbolic layer enforces legality, constraints, and deterministic execution.
The neural layer learns to rank candidate strategies and generalize across
previously unseen states.
- Hard invariants enforced by simulator
- Policy model ranks candidate lines
- Clear separation of reasoning roles
Training–deployment symmetry
Training occurs in Python with PyTorch, but deployment is designed for ONNX
inference embedded directly inside the simulator. This keeps the runtime
decision loop deterministic and avoids Python production dependencies.
- PyTorch training pipeline
- ONNX model export
- In-process runtime inference