Spoke Plus Architecture

1. System Overview

Spoke Plus runs as a web platform with separated web and API services:

Admin/Web Client: Next.js (App Router)
API Service: Node.js + Express
Data Platform: Supabase (Auth, Postgres, Storage)

The API is the privileged write boundary for admin workflows.

2. Runtime Components

2.1 API service

Express route groups include /status, /admin, and internal docs/system endpoints.
Standard response contract:
success: { ok: true, data }
failure: { ok: false, error: { code, message } }
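The response contract above can be sketched as two small builders. This is a minimal illustration: the helper names (success, failure, toResponse) and the error codes are hypothetical; only the two envelope shapes come from the contract.

```javascript
// Hypothetical helpers; only the two envelope shapes are taken from the contract.
function success(data) {
  return { ok: true, data };
}

function failure(code, message) {
  return { ok: false, error: { code, message } };
}

// Wraps a synchronous handler so that both outcomes use the same envelope.
// "INTERNAL_ERROR" is an assumed default code, not a documented one.
function toResponse(handler) {
  try {
    return success(handler());
  } catch (err) {
    return failure(err.code || "INTERNAL_ERROR", err.message);
  }
}

const okBody = toResponse(() => ({ status: "up" }));
const errBody = toResponse(() => {
  const err = new Error("Lemma not found");
  err.code = "NOT_FOUND";
  throw err;
});
```

Keeping the envelope in one place lets clients branch on ok alone before touching data or error.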

2.2 Web service

Browser client authenticates with Supabase.
Admin actions call Express API with bearer token forwarding.
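Bearer token forwarding could look like the sketch below. It assumes the browser already holds the access token of the Supabase session; buildAdminRequest, API_BASE_URL, and the payload fields are hypothetical names used only for illustration.

```javascript
// Builds fetch() options that forward the signed-in user's token to the API.
// accessToken is assumed to come from the Supabase session held by the browser.
function buildAdminRequest(accessToken, payload) {
  if (!accessToken) {
    throw new Error("Missing access token");
  }
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${accessToken}`,
    },
    body: JSON.stringify(payload),
  };
}

// Usage (API_BASE_URL is a placeholder for the Express service origin):
//   await fetch(`${API_BASE_URL}/admin/content-bank/exercises/generate`, request);
const request = buildAdminRequest("example-token", { course_id: 1 });
```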

2.3 Data platform

Supabase Auth for identity.
Postgres as authoritative system-of-record.
Storage for media assets.

3. Security Boundary

Frontend uses anon/public credentials only.
Backend uses service-role credentials only.
Admin routes require admin authorization middleware.
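An admin authorization middleware in the Express (req, res, next) style might be shaped as follows. This is a sketch under assumptions: verifyToken is an injected, hypothetical function that resolves a bearer token to a user (for example by asking Supabase Auth), and the isAdmin flag and error codes are illustrative rather than the project's actual names.

```javascript
// Factory so the token verifier can be swapped in tests.
// verifyToken(token) is assumed to resolve to a user object or null.
function requireAdmin(verifyToken) {
  return async function adminOnly(req, res, next) {
    const header = (req.headers && req.headers.authorization) || "";
    const match = /^Bearer\s+(.+)$/i.exec(header);
    if (!match) {
      return res.status(401).json({
        ok: false,
        error: { code: "UNAUTHENTICATED", message: "Missing bearer token" },
      });
    }
    const user = await verifyToken(match[1]);
    // isAdmin is a hypothetical flag; the real check may read a role or claim.
    if (!user || user.isAdmin !== true) {
      return res.status(403).json({
        ok: false,
        error: { code: "FORBIDDEN", message: "Admin access required" },
      });
    }
    req.user = user;
    return next();
  };
}
```

Because service-role credentials bypass row-level checks, rejecting the request before any handler runs keeps the privileged write path behind a single gate.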

4. Core Data Architecture

4.1 Language-scoped lexical graph

The Content Bank is modeled as a language-scoped lexical graph centered on lemmas.

Canonical entities in the current model:

- vocabulary
- lemma_forms
- senses
- sense_translations
- sentences
- sentence_tokens
- taxonomy_categories
- taxonomy_values
- content_item_taxonomies
- vocabulary_components
- chunk_components
- vocabulary_themes
- lemma_assets
- lemma_grammar
- verb_conjugations

Relationship concept:

vocabulary → content_item_taxonomies → taxonomy_values → taxonomy_categories

vocabulary is the lemma root, and all derived lexical information attaches to it. Legacy structures (for example, semantic_relations and tts_assets) remain supported as compatibility layers and do not change the canonical taxonomy flow.

4.2 Taxonomy-driven classification

Linguistic classification is taxonomy-based and stored through:

- taxonomy_categories
- taxonomy_values
- content_item_taxonomies

Common categories:

- parts_of_speech
- cefr_levels
- frequency_bands
- registers
- semantic_domains
- lemma_types
- grammar_topics
- languages

taxonomy_values stores concrete values, and content_item_taxonomies is the universal mapping table from content rows to taxonomy values.

This enables multi-value tagging and prevents repeated schema changes for new classification dimensions.
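To illustrate the mapping, the sketch below resolves the tags of one content row over in-memory stand-ins for the three tables. The column names (key, category_id, label, content_type, content_id, taxonomy_value_id) and the sample rows are assumptions made for the example, not the actual schema.

```javascript
// In-memory stand-ins for the three taxonomy tables (assumed columns).
const taxonomyCategories = [
  { id: 1, key: "parts_of_speech" },
  { id: 2, key: "semantic_domains" },
];
const taxonomyValues = [
  { id: 10, category_id: 1, label: "noun" },
  { id: 20, category_id: 2, label: "food" },
  { id: 21, category_id: 2, label: "nature" },
];
const contentItemTaxonomies = [
  { content_type: "vocabulary", content_id: 100, taxonomy_value_id: 10 },
  { content_type: "vocabulary", content_id: 100, taxonomy_value_id: 20 },
  { content_type: "vocabulary", content_id: 100, taxonomy_value_id: 21 },
];

// Groups the taxonomy values of one content row by category key.
function tagsFor(contentType, contentId) {
  const result = {};
  for (const link of contentItemTaxonomies) {
    if (link.content_type !== contentType || link.content_id !== contentId) continue;
    const value = taxonomyValues.find((v) => v.id === link.taxonomy_value_id);
    if (!value) continue; // dangling link: skip rather than fail
    const category = taxonomyCategories.find((c) => c.id === value.category_id);
    if (!category) continue;
    if (!result[category.key]) result[category.key] = [];
    result[category.key].push(value.label);
  }
  return result;
}

const appleTags = tagsFor("vocabulary", 100);
```

A new classification dimension then only needs a new category row plus its values; the mapping table and this lookup stay unchanged.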

4.3 Chunk builder architecture

Chunks are multiword expressions represented as lexical entries (vocabulary.type='chunk').

Components are stored in vocabulary_components in positional order.
Each component must already exist as a vocabulary.type='lemma' entry.
Chunks support example sentences, media assets, and taxonomy assignments.
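The component rules above can be checked with a small validator. This is a sketch, assuming component rows carry component_id and position fields (hypothetical column names) and using an in-memory map in place of the vocabulary table.

```javascript
// In-memory stand-in for vocabulary rows (sample data only).
const vocabularyRows = new Map([
  [1, { id: 1, lemma: "give", type: "lemma" }],
  [2, { id: 2, lemma: "up", type: "lemma" }],
  [3, { id: 3, lemma: "give up", type: "chunk" }],
]);

// Sorts components by position and verifies that each one is an existing lemma.
function validateChunkComponents(components) {
  const ordered = [...components].sort((a, b) => a.position - b.position);
  const errors = [];
  for (const component of ordered) {
    const entry = vocabularyRows.get(component.component_id);
    if (!entry) {
      errors.push(`component ${component.component_id} does not exist`);
    } else if (entry.type !== "lemma") {
      errors.push(`component ${component.component_id} is not a lemma`);
    }
  }
  return { ordered, errors };
}

const valid = validateChunkComponents([
  { component_id: 2, position: 2 },
  { component_id: 1, position: 1 },
]);
const invalid = validateChunkComponents([
  { component_id: 3, position: 1 },
  { component_id: 9, position: 2 },
]);
```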

4.4 Number lemma representation

Vocabulary supports number lemmas with:

- numeric_value
- spelled_form

Example: lemma=1, numeric_value=1, spelled_form=one.

4.5 Lemma Detail + Save All

The lexical editor (Lemma Detail) is tab-driven and persists through a single Save All workflow that updates lemma entities and taxonomy assignments coherently.

4.6 Legacy compatibility fields

Some compatibility mirror fields remain in vocabulary for non-breaking support during migration. These fields are legacy and must not be treated as canonical classification sources.

4.7 Hierarchical taxonomy + curriculum progression

Taxonomy now supports hierarchical values through taxonomy_values.parent_id.

semantic_domains may be nested (Food → Fruit → Apple).
curriculum_units is a taxonomy category (Unit 1…Unit 300) linked to unit_types using taxonomy parent links.
grammar_topics is taxonomy-driven, language-scoped by taxonomy_values.language_id, and assignable to lemmas.
Vocabulary keeps backward-compatible mirror progression fields introduced_unit_id + introduced_step while canonical assignments remain in content_item_taxonomies.
Lexical unlock graph gating must evaluate both unit and step before using lemmas/chunks/sentences in practice or reinforcement flows.
vocabulary.difficulty_score (0.0–1.0) provides a normalized adaptive signal and complements CEFR/frequency taxonomy tags without replacing them.
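Hierarchical values can be resolved by walking parent_id up to the root. The sketch below uses the Food → Fruit → Apple example; the row shape (id, parent_id, label) is assumed apart from parent_id itself, and the cycle guard is a defensive addition rather than documented behavior.

```javascript
// Sample taxonomy_values rows for the semantic_domains example.
const domainValues = [
  { id: 1, parent_id: null, label: "Food" },
  { id: 2, parent_id: 1, label: "Fruit" },
  { id: 3, parent_id: 2, label: "Apple" },
];

// Returns the labels from the root down to the given value.
function pathTo(valueId, rows) {
  const byId = new Map(rows.map((row) => [row.id, row]));
  const seen = new Set();
  const path = [];
  let current = byId.get(valueId);
  while (current) {
    if (seen.has(current.id)) {
      throw new Error("Cycle detected in taxonomy_values.parent_id");
    }
    seen.add(current.id);
    path.unshift(current.label);
    current = current.parent_id == null ? undefined : byId.get(current.parent_id);
  }
  return path;
}

const applePath = pathTo(3, domainValues);
```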

Grammar-topic assignments must remain language-compatible with lemma ownership (grammar_topic.language_id == vocabulary.language_id). In course context, grammar-topic UI options are filtered by courses.target_language_id.
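The language-compatibility rule reduces to an equality check on language_id, and the course-context filter to a match on target_language_id. A minimal sketch follows; the function names and sample rows are hypothetical.

```javascript
// A grammar topic may be assigned to a lemma only when both share a language.
function isTopicCompatible(grammarTopic, lemma) {
  return grammarTopic.language_id === lemma.language_id;
}

// In course context, only topics in the course's target language are offered.
function topicOptionsForCourse(grammarTopics, course) {
  return grammarTopics.filter(
    (topic) => topic.language_id === course.target_language_id
  );
}

// Sample data (illustrative ids and labels).
const grammarTopics = [
  { id: 1, language_id: "en", label: "Present simple" },
  { id: 2, language_id: "es", label: "Preterito indefinido" },
];
const englishLemma = { id: 100, language_id: "en", lemma: "run" };
const englishCourse = { id: 7, target_language_id: "en" };

const options = topicOptionsForCourse(grammarTopics, englishCourse);
```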

4.8 AI classification + difficulty scoring

Lemma creation can invoke an assistive LLM classification flow that suggests:

- CEFR level
- frequency band
- register
- semantic domains
- lemma type
- parts of speech
- grammar topics
- difficulty score

Suggestions are mapped to existing taxonomy values and remain editor-confirmed (no auto-persist).

difficulty_score is normalized in [0.0, 1.0], where 0.0 is easiest and 1.0 is most difficult.
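One way to map suggestions onto existing taxonomy values while keeping them unpersisted is sketched below. The suggestion shape, the case-insensitive label matching, and clamping an out-of-range score into [0.0, 1.0] are all assumptions for illustration; the source only states that suggestions map to existing values and stay editor-confirmed.

```javascript
// Clamps a numeric score into [0, 1]; returns null for non-numeric input.
// Clamping is an assumed policy for out-of-range model output.
function normalizeDifficulty(score) {
  if (typeof score !== "number" || Number.isNaN(score)) return null;
  return Math.min(1, Math.max(0, score));
}

// Keeps only suggested labels that match an existing taxonomy value.
// Nothing is written anywhere: the result is a draft for the editor to confirm.
function mapSuggestion(suggestion, existingValues) {
  const mapped = {};
  const unmatched = [];
  for (const [category, labels] of Object.entries(suggestion.taxonomy)) {
    const known = new Map(
      (existingValues[category] || []).map((v) => [v.toLowerCase(), v])
    );
    for (const label of labels) {
      const canonical = known.get(label.toLowerCase());
      if (canonical === undefined) {
        unmatched.push(`${category}:${label}`);
      } else {
        if (!mapped[category]) mapped[category] = [];
        mapped[category].push(canonical);
      }
    }
  }
  return {
    mapped,
    unmatched,
    difficulty_score: normalizeDifficulty(suggestion.difficulty_score),
  };
}

const draft = mapSuggestion(
  { taxonomy: { cefr_levels: ["a2"], registers: ["slangy"] }, difficulty_score: 1.4 },
  { cefr_levels: ["A1", "A2"], registers: ["formal", "informal"] }
);
```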

4.9 Verb conjugations, grammar links, and assets

verb_conjugations stores each conjugated form with its tense_key, person_key, form, and is_irregular flag; rows may come from generator-assisted pipelines (for example, Cambridge conjugation data).
lemma_grammar links lemmas to grammar concepts for exercise and validation features.
lemma_assets stores audio/image/other media from providers like ElevenLabs, image generation, and manual uploads.

5. Course Structure Layer

Progression hierarchy remains:

- courses → units → skills → lessons → lesson_content_map

6. Operational Layer

System monitoring includes health, logs, and queue observability endpoints in admin system routes.

7. AI Grammar Engine Extension

Spoke Plus now extends the lexical architecture with explicit grammar-evaluation graph layers:

sentence_lemmas: ordered lemma references per sentence (sentence_id, lemma_id, position) used as the primary lexical source for sentence gating.
chunk_components: ordered lemma references per chunk (chunk_id, lemma_id, position) used for strong chunk safety.
sentence_patterns: reusable language-scoped grammar templates for pattern-driven exercise generation.

Strong gating rules:

- A sentence is eligible only when every lemma in sentence_lemmas is unlocked; fallback to sentence_tokens remains for backward compatibility.
- A chunk is eligible only when the chunk lemma itself is unlocked and all entries in chunk_components are unlocked; fallback to vocabulary_components remains for legacy rows.
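The two gating rules can be sketched as predicates over a set of unlocked lemma ids. The object shapes below are assumptions: each sentence or chunk is given with its related rows already loaded, and legacy rows (sentence_tokens, vocabulary_components) are assumed to expose a lemma_id as well. Treating a sentence with no lemma references as ineligible is a conservative choice made for this example.

```javascript
// Sentence rule: every referenced lemma must be unlocked.
// Uses sentence_lemmas when present, otherwise the legacy sentence_tokens.
function isSentenceEligible(sentence, unlockedLemmaIds) {
  const refs =
    sentence.sentence_lemmas && sentence.sentence_lemmas.length > 0
      ? sentence.sentence_lemmas
      : sentence.sentence_tokens || [];
  if (refs.length === 0) return false; // assumed: nothing to verify means not eligible
  return refs.every((ref) => unlockedLemmaIds.has(ref.lemma_id));
}

// Chunk rule: the chunk lemma itself and all of its components must be unlocked.
// Uses chunk_components when present, otherwise the legacy vocabulary_components.
function isChunkEligible(chunk, unlockedLemmaIds) {
  if (!unlockedLemmaIds.has(chunk.id)) return false;
  const parts =
    chunk.chunk_components && chunk.chunk_components.length > 0
      ? chunk.chunk_components
      : chunk.vocabulary_components || [];
  return parts.every((part) => unlockedLemmaIds.has(part.lemma_id));
}

const unlocked = new Set([1, 2, 3]);
const sentence = { id: 50, sentence_lemmas: [{ lemma_id: 1, position: 1 }, { lemma_id: 4, position: 2 }] };
const legacySentence = { id: 51, sentence_lemmas: [], sentence_tokens: [{ lemma_id: 1 }, { lemma_id: 2 }] };
const chunk = { id: 3, chunk_components: [{ lemma_id: 1, position: 1 }, { lemma_id: 2, position: 2 }] };
```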

Service layer additions:

- services/aiGrammarEngineService.js validates student sentences with structured lexical + grammar checks and uses AI-style feedback generation as explanatory output (not as sole validator).
- services/exerciseGenerationService.js generates unlocked, pattern-based exercises (create sentence, transform question, answer negatively, replace noun/adjective, reorder words).
- Admin endpoints:
    - POST /admin/content-bank/grammar-engine/validate-sentence
    - POST /admin/content-bank/grammar-engine/generate-followups
    - POST /admin/content-bank/exercises/generate

8. Language Learning Engine

Spoke Plus language learning runs as a set of connected engines over the shared Content Bank and progression model.

8.1 Components

Vocabulary Graph: canonical lemma network rooted in vocabulary with links to senses, forms, translations, assets, and grammar metadata.
Chunk Graph: multiword expression layer using chunk lemmas (vocabulary.type='chunk') plus vocabulary_components/chunk_components.
Sentence Graph: sentence-to-lemma usage layer via sentence_tokens (with optional explicit sentence-lemma overlays in compatible deployments).
Taxonomy Classification Engine: canonical classification resolver over taxonomy_categories, taxonomy_values, and content_item_taxonomies.
Lexical Unlock Graph: progression gating engine using introduced_unit_id and introduced_step.
Grammar Engine: grammar-aware validation and feedback layer, including grammar-topic alignment and usage checks.
Exercise Generation Engine: generates activities from unlocked vocabulary/chunks/sentences plus grammar constraints.
AI Classification Engine: LLM-assisted classification and difficulty suggestions mapped to taxonomy values.
Conversation Engine: interactive practice layer reusing lexical unlock, taxonomy, and grammar constraints.

8.2 Interaction flow

Vocabulary Graph, Chunk Graph, and Sentence Graph expose candidate content.
Taxonomy Classification Engine resolves canonical linguistic metadata for selection and filtering.
Lexical Unlock Graph filters out out-of-scope lemmas/chunks/sentences by learner progression.
Grammar Engine validates linguistic correctness and grammar-topic alignment.
Exercise Generation Engine builds practice items using unlocked and validated content.
AI Classification Engine assists editorial workflows by proposing CEFR/frequency/register/domain/grammar-topic/difficulty assignments.
Conversation Engine runs live interactions using the same unlock and grammar contracts to preserve consistency across modalities.
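The unlock and validation steps of this flow can be sketched as a two-stage filter. Several things here are assumptions: unitOrder is a hypothetical lookup from a unit id to its ordinal position, content from an earlier unit is treated as unlocked regardless of step, and isGrammarValid stands in for the Grammar Engine check.

```javascript
// Unit + step gating: earlier units are unlocked, later units are locked,
// and within the learner's current unit the step decides.
function isUnlocked(item, progress, unitOrder) {
  const itemUnit = unitOrder.get(item.introduced_unit_id);
  const learnerUnit = unitOrder.get(progress.unit_id);
  if (itemUnit === undefined || learnerUnit === undefined) return false;
  if (itemUnit !== learnerUnit) return itemUnit < learnerUnit;
  return item.introduced_step <= progress.step;
}

// Candidate content -> unlock filter -> grammar validation -> practice pool.
function selectPracticeItems(candidates, progress, unitOrder, isGrammarValid) {
  return candidates
    .filter((item) => isUnlocked(item, progress, unitOrder))
    .filter((item) => isGrammarValid(item));
}

// Sample data (illustrative ids).
const unitOrder = new Map([["unit-1", 1], ["unit-2", 2]]);
const candidates = [
  { id: "a", introduced_unit_id: "unit-1", introduced_step: 5 },
  { id: "b", introduced_unit_id: "unit-2", introduced_step: 1 },
  { id: "c", introduced_unit_id: "unit-2", introduced_step: 3 },
];
const selected = selectPracticeItems(
  candidates,
  { unit_id: "unit-2", step: 2 },
  unitOrder,
  () => true // placeholder for the Grammar Engine check
);
```

Sharing one isUnlocked predicate between exercise generation and conversation is one way to keep the unlock contract identical across modalities.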

8.3 Non-breaking architecture rule

All language-engine layers are additive and backward compatible with existing Content Bank contracts.

8.4 Additional additive modules (still supported)

Spoke Plus continues to support additive language-engine modules documented in prior architecture revisions, including:

- Sentence pattern infrastructure (sentence_patterns, pattern_difficulty)
- Morphology infrastructure (lemma_forms, morphology_features, lemma_morphology_features, inflection_paradigms, lemma_paradigm_assignments, irregular_forms)
- Semantic and constraint layers (lemma_roles, lemma_semantic_classes, verb_object_constraints, modifier_constraints)
- Student and usage analytics (student_lemma_progress, lemma_frequency, collocation_strength)
- Conversation storage overlays (conversation_turns, conversation_lemmas)

These remain additive and non-breaking extensions on top of the canonical Content Bank and Language Learning Engine.

8.5 Module audit mapping (2026-03-08)

Repository audit confirms the following module-to-service mapping (no duplicate implementations):

LexicalRoleEngine → services/lexicalRoleEngineService.js (new canonical adapter) + existing services/languageSignalService.js reads lemma_roles.
SemanticConstraintsEngine → services/semanticConstraintsEngineService.js (new canonical adapter) + existing services/languageSignalService.js reads constraint tables.
CollocationEngine → services/collocationEngineService.js (new canonical adapter) + existing language-signal/candidate ranking flow.
SentenceConstructionEngine → services/sentenceConstructionEngineService.js (new canonical adapter over sentence_patterns).
DifficultyEngine → services/difficultyEngineService.js (new canonical adapter over vocabularyDifficultyService + cefrDifficultyService).
StudentLemmaProgressEngine → services/studentLemmaProgressEngineService.js (new canonical adapter over student_lemma_progress).
PlacementEngine → services/placementEngineService.js (new additive placement-band layer).
RubricEngine → services/rubricEngineService.js (new additive scoring layer).
PronunciationEngine → existing services/pronunciationService.js (implemented and mapped; no duplicate created).
ConversationSimulationEngine → existing services/conversationEngineService.js + route POST /admin/content-bank/conversation/process-turn (implemented and mapped; no duplicate created).

Planned-only state is no longer required for the target module list above because each module now has an implementation path (native or mapped).

9. System Integrity Engine & Feature Registry

The System Integrity Engine is extended with a Feature Integrity Registry to guarantee future feature coverage.

9.1 Registry model

services/systemFeatureRegistry.js provides a central registration contract:

id
type (page | endpoint | engine | workflow)
routes
endpoints
taxonomy_dependencies
schema_dependencies
critical_actions
tests
playwright_scenario (optional)
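A registry entry and a shape check might look like the sketch below. The field names come from the contract above; treating the list-like fields as arrays of strings, the validateFeature helper, and every value in the sample entry (routes, file paths, action names) are assumptions for illustration.

```javascript
const FEATURE_TYPES = ["page", "endpoint", "engine", "workflow"];
const LIST_FIELDS = [
  "routes",
  "endpoints",
  "taxonomy_dependencies",
  "schema_dependencies",
  "critical_actions",
  "tests",
];

// Returns a list of problems; an empty list means the entry is well-formed.
function validateFeature(entry) {
  const errors = [];
  if (typeof entry.id !== "string" || entry.id === "") {
    errors.push("id must be a non-empty string");
  }
  if (!FEATURE_TYPES.includes(entry.type)) {
    errors.push(`type must be one of ${FEATURE_TYPES.join(" | ")}`);
  }
  for (const field of LIST_FIELDS) {
    if (!Array.isArray(entry[field])) errors.push(`${field} must be an array`);
  }
  if (
    entry.playwright_scenario !== undefined &&
    typeof entry.playwright_scenario !== "string"
  ) {
    errors.push("playwright_scenario must be a string when present");
  }
  return errors;
}

// Hypothetical entry; all values are illustrative.
const lemmaDetailFeature = {
  id: "lemma-detail",
  type: "page",
  routes: ["/admin/content-bank/lemmas/[id]"],
  endpoints: ["POST /admin/content-bank/exercises/generate"],
  taxonomy_dependencies: ["parts_of_speech", "cefr_levels"],
  schema_dependencies: ["vocabulary", "content_item_taxonomies"],
  critical_actions: ["save_all"],
  tests: ["tests/lemmaDetail.test.js"],
};

const registryErrors = validateFeature(lemmaDetailFeature);
```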

9.2 Enforcement model

services/systemIntegrityService.js now runs an additive scan section named UNREGISTERED FEATURES that detects:

UI routes without registry entries.
UI-consumed endpoints that are not declared.
Declared endpoints with no explicit API contract check coverage.
Backend schema tables referenced in code but missing from schema_dependencies.
Taxonomy categories used by backend/UI but missing from taxonomy_dependencies.
Critical flows declared without integrity coverage artifacts (tests or Playwright scenario).
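Three of these detections can be sketched as set differences between what the registry declares and what a scan observes. How the real service discovers UI routes and consumed endpoints is not described here, so the observed input below is assumed to be precomputed, and the function and field names are hypothetical.

```javascript
// Compares registry declarations against observed usage.
function scanUnregisteredFeatures(registry, observed) {
  const registeredRoutes = new Set(registry.flatMap((f) => f.routes));
  const declaredEndpoints = new Set(registry.flatMap((f) => f.endpoints));
  return {
    // UI routes without registry entries.
    unregisteredRoutes: observed.uiRoutes.filter((r) => !registeredRoutes.has(r)),
    // UI-consumed endpoints that are not declared.
    undeclaredEndpoints: observed.uiEndpoints.filter((e) => !declaredEndpoints.has(e)),
    // Critical flows with neither tests nor a Playwright scenario.
    uncoveredCriticalFeatures: registry
      .filter(
        (f) =>
          f.critical_actions.length > 0 &&
          f.tests.length === 0 &&
          !f.playwright_scenario
      )
      .map((f) => f.id),
  };
}

// Sample registry and scan input (illustrative values).
const registry = [
  { id: "lemma-detail", routes: ["/lemmas/[id]"], endpoints: ["POST /admin/a"], critical_actions: ["save_all"], tests: [] },
  { id: "status", routes: ["/status"], endpoints: ["GET /status"], critical_actions: [], tests: [] },
];
const report = scanUnregisteredFeatures(registry, {
  uiRoutes: ["/lemmas/[id]", "/reports"],
  uiEndpoints: ["POST /admin/a", "POST /admin/b"],
});
```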

9.3 Severity policy

Warning: unregistered UI route.
Warning: endpoint without contract check.
Warning: taxonomy usage without declared dependency.
Critical: feature critical actions without integrity coverage metadata.
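The policy above can be expressed as a lookup table plus a summary step. The finding keys are hypothetical identifiers chosen for this sketch; only the four severity assignments come from the policy.

```javascript
// Severity per finding kind; the keys are illustrative names.
const SEVERITY_BY_FINDING = {
  unregistered_ui_route: "warning",
  endpoint_without_contract_check: "warning",
  taxonomy_usage_without_dependency: "warning",
  critical_action_without_coverage: "critical",
};

// Counts findings per severity and reports whether any critical one exists.
function summarizeFindings(findings) {
  const counts = { warning: 0, critical: 0 };
  for (const finding of findings) {
    const severity = SEVERITY_BY_FINDING[finding.kind];
    if (severity === undefined) {
      throw new Error(`Unknown finding kind: ${finding.kind}`);
    }
    counts[severity] += 1;
  }
  return { counts, hasCritical: counts.critical > 0 };
}

const summary = summarizeFindings([
  { kind: "unregistered_ui_route", subject: "/reports" },
  { kind: "endpoint_without_contract_check", subject: "POST /admin/b" },
  { kind: "critical_action_without_coverage", subject: "lemma-detail" },
]);
```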

This extension is additive and keeps backward compatibility with the existing integrity engine sections (api_contract, ui_health, schema_taxonomy).