Spoke Plus Course Engine Specification¶
1. Purpose¶
This document defines current Course Engine behavior and the implemented lexical architecture that powers enrichment, generation, and token-level analysis.
2. Architectural Principle¶
Single Linguistic Source of Truth¶
The Content Bank is a lemma-first lexical system:
- `vocabulary` is the canonical lemma registry.
- `lemma_forms`, `senses`, `sense_translations`, `sentences`/`sentence_tokens`, `semantic_relations`, and `tts_assets` are all linked to lemma entries.
- Classification is taxonomy-driven (`taxonomy_categories`, `taxonomy_values`, `content_item_taxonomies`).
- The Content Bank behaves as a language-scoped lexical graph through course-content linking.
Conceptual path:
Course → target_language_id → Vocabulary (lemma) → Word Forms → Word Senses → Translations (course.source_language_id) → Sentences → Semantic Relations → Audio/Media → Taxonomies
3. Linguistic Engine (Implemented Core)¶
3.1 Content Bank Core Entities¶
Official lexical entities:
- `vocabulary` (lemma root)
- `lemma_forms`
- `senses`
- `sense_translations`
- `sentences`
- `sentence_tokens`
- `semantic_relations`
- `tts_assets`
Classification and graph helpers:
- `taxonomy_categories`
- `taxonomy_values`
- `content_item_taxonomies`
- `vocabulary_components` (chunk composition)
3.2 Taxonomy-driven classification¶
Legacy direct linguistic columns were replaced by taxonomy assignments as the canonical model.
Canonical classification source:
- content_item_taxonomies joins a content entity (for example vocabulary) to taxonomy values.
Common taxonomy categories include:
- parts_of_speech
- cefr_levels
- frequency_bands
- registers
- semantic_domains
- themes
Benefits:
- multi-value classification per lemma;
- no schema fragmentation from repeatedly adding fixed columns;
- consistent filtering/search semantics across lexical and non-lexical entities.
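The join model above can be sketched in miniature. Table and column names follow the spec; the in-memory shapes and the `classificationsFor` helper are illustrative assumptions, not the actual query layer:

```javascript
// Sketch: multi-value classification via content_item_taxonomies.
// One lemma carries several taxonomy values across categories.
const contentItemTaxonomies = [
  { content_type: "vocabulary", content_id: 1, taxonomy_value_id: 100 },
  { content_type: "vocabulary", content_id: 1, taxonomy_value_id: 200 },
];
const taxonomyValuesById = new Map([
  [100, { category: "parts_of_speech", value: "noun" }],
  [200, { category: "cefr_levels", value: "A1" }],
]);

// Resolve every taxonomy value assigned to one content entity.
function classificationsFor(contentType, contentId) {
  return contentItemTaxonomies
    .filter((r) => r.content_type === contentType && r.content_id === contentId)
    .map((r) => taxonomyValuesById.get(r.taxonomy_value_id));
}
```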
3.3 Chunk builder model¶
Chunks are multiword expressions represented as lemmas with type='chunk'.
- Component lemmas must already exist as `type='lemma'`.
- Composition is stored in `vocabulary_components` (`chunk_id`, `lemma_id`, `position`).
- Chunks support examples (`sentences`), media (`tts_assets`/images), and taxonomy assignments.
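A minimal sketch of composing a chunk's surface form from its ordered components. The `vocabulary_components` fields (`chunk_id`, `lemma_id`, `position`) follow the spec; the in-memory shapes and the `composeChunk` helper are illustrative assumptions:

```javascript
// Component lemmas exist as type='lemma'; the chunk itself is type='chunk'.
const vocabulary = new Map([
  [1, { id: 1, lemma: "take", type: "lemma" }],
  [2, { id: 2, lemma: "a", type: "lemma" }],
  [3, { id: 3, lemma: "break", type: "lemma" }],
  [10, { id: 10, lemma: "take a break", type: "chunk" }],
]);

// Composition rows may arrive unordered; position defines the surface order.
const vocabularyComponents = [
  { chunk_id: 10, lemma_id: 3, position: 3 },
  { chunk_id: 10, lemma_id: 1, position: 1 },
  { chunk_id: 10, lemma_id: 2, position: 2 },
];

function composeChunk(chunkId) {
  return vocabularyComponents
    .filter((c) => c.chunk_id === chunkId)
    .sort((a, b) => a.position - b.position)
    .map((c) => vocabulary.get(c.lemma_id).lemma)
    .join(" ");
}
```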
3.4 Number lemma model¶
Numbers are represented as lexical lemmas with dual fields:
- numeric_value
- spelled_form
Example:
- lemma 1
- numeric value 1
- spelled form one
This supports number recognition, number-to-word conversion, and writing exercises.
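The dual-field model can be sketched as a pair of lookups. The record fields (`numeric_value`, `spelled_form`) follow the spec; the helper functions are illustrative assumptions:

```javascript
// Number lemmas carry both a numeric value and a spelled form.
const numberLemmas = [
  { lemma: "1", numeric_value: 1, spelled_form: "one" },
  { lemma: "2", numeric_value: 2, spelled_form: "two" },
];

// Number-to-word conversion, e.g. for writing exercises.
function toSpelledForm(n) {
  const entry = numberLemmas.find((l) => l.numeric_value === n);
  return entry ? entry.spelled_form : null;
}

// Number recognition: accept either surface form.
function recognizeNumber(input) {
  return (
    numberLemmas.find((l) => l.lemma === input || l.spelled_form === input) ??
    null
  );
}
```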
3.5 Lemma Detail and Save All workflow¶
The admin lexical editor uses tabbed Lemma Detail and a single Save All action.
Save All is the official persistence pattern for lexical entities and taxonomy classifications in one editorial action.
4. Course Structure¶
The engine uses a normalized progression hierarchy:
1. courses
2. units
3. skills
4. lessons
5. lesson_content_map
5. Planned Adaptive Direction¶
Adaptive logic is being built on top of the current analytics and token-level linguistic infrastructure.
Roadmap direction:
- weak-skill ranking;
- due-item recommendations;
- adaptive session assembly;
- morphology/syntax-informed remediation.
6. Official specs (normative references)¶
- docs/content-bank/WORD_DETAIL_PANEL_SPEC.md
- docs/content-bank/VOCABULARY_CONTENT_BANK_ARCHITECTURE.md
- docs/schema/vocabulary-content-bank-model.md
- docs/engine/MORPHOLOGICAL_EVALUATION_SPEC.md
- docs/engine/MORPHOLOGY_GENERATION_SPEC.md
3.6 Curriculum-gated adaptive selection¶
Implemented progression rule for vocabulary selection in scoped content queries:
- `student_current_unit` + `student_current_step` gate available lemmas/chunks.
- Effective unlock rule: `introduced_unit < student_current_unit OR (introduced_unit = student_current_unit AND introduced_step <= student_current_step)`.
- Backward-compatible mirrors remain supported via `vocabulary.introduced_chapter` + `vocabulary.introduced_step`, with canonical curriculum assignment in taxonomy category `curriculum_units`.
- Each lemma must have exactly one introduced curriculum unit assignment.
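The effective unlock rule translates directly into a predicate. Field names follow the spec; the `isUnlocked` function itself is an illustrative sketch, not the actual service code:

```javascript
// Unlock rule from §3.6: earlier unit, or same unit and step already reached.
function isUnlocked(item, student) {
  return (
    item.introduced_unit < student.student_current_unit ||
    (item.introduced_unit === student.student_current_unit &&
      item.introduced_step <= student.student_current_step)
  );
}
```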
3.8 Difficulty score model¶
To support adaptive ranking/filtering, vocabulary includes a normalized numeric difficulty_score (0.0–1.0).
- The score complements taxonomy classification and does not replace CEFR/frequency/semantic dimensions.
- Default derivation uses CEFR weight + frequency modifier + grammar complexity + semantic abstraction, clamped to [0,1].
- Existing rows are backfilled with an initial CEFR+frequency heuristic for backward-compatible rollout.
- Admin workflows can manually edit the score or recompute via taxonomy/AI suggestion.
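The default derivation can be sketched as a weighted sum with a clamp. The spec fixes only the inputs (CEFR weight, frequency modifier, grammar complexity, semantic abstraction) and the clamp to [0, 1]; the component weights below are illustrative assumptions:

```javascript
// Assumed CEFR weights and frequency modifiers -- not the production values.
const CEFR_WEIGHT = { A1: 0.1, A2: 0.25, B1: 0.4, B2: 0.55, C1: 0.7, C2: 0.85 };
const FREQUENCY_MODIFIER = { high: -0.1, medium: 0.0, low: 0.1 };

function deriveDifficultyScore({
  cefr,
  frequencyBand,
  grammarComplexity = 0,
  semanticAbstraction = 0,
}) {
  const raw =
    (CEFR_WEIGHT[cefr] ?? 0.5) +
    (FREQUENCY_MODIFIER[frequencyBand] ?? 0) +
    grammarComplexity +
    semanticAbstraction;
  return Math.min(1, Math.max(0, raw)); // clamp to [0, 1]
}
```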
3.9 Adaptive selection gating with difficulty¶
Course-scoped vocabulary list endpoints may filter by:
- student_current_unit (existing rule: introduced_chapter <= student_current_unit)
- student_difficulty_level (new rule: difficulty_score <= student_difficulty_level)
When difficulty filter is present, results are ordered ascending by difficulty_score to prioritize easier eligible lemmas.
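The two gates and the ordering rule can be sketched as a filter/sort pipeline. Field names follow the spec; the `selectEligible` helper is an illustrative assumption, not the actual endpoint query:

```javascript
// §3.9 gating: unit gate, then difficulty gate, then ascending difficulty order.
function selectEligible(lemmas, { student_current_unit, student_difficulty_level }) {
  return lemmas
    .filter((l) => l.introduced_chapter <= student_current_unit)
    .filter((l) => l.difficulty_score <= student_difficulty_level)
    .sort((a, b) => a.difficulty_score - b.difficulty_score);
}
```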
3.7 Extended taxonomy dimensions¶
The lexical engine now uses additional taxonomy categories:
- `curriculum_units` (Unit 1..Unit 300)
- `unit_types` (e.g., `skill`, `review`, `assessment`)
- `grammar_topics`
grammar_topics is language-scoped: each topic is stored in taxonomy_values with language_id referencing a value from taxonomy category languages.
Course-context selectors must resolve courses.target_language_id and only expose grammar topics for that language.
Vocabulary-to-grammar classification must enforce grammar_topic.language_id == vocabulary.language_id.
semantic_domains supports hierarchy through taxonomy_values.parent_id.
3.11 AI-Assisted Lemma Classification¶
The Content Bank supports AI-assisted taxonomy suggestions during lemma creation:
- Endpoint: `POST /admin/content-bank/lemmas/ai-classify`.
- Input: `lemma`, `language_id`.
- Output: taxonomy suggestion payload for CEFR, frequency band, register, semantic domains, lemma type, parts of speech, and grammar topics.
- Mapping is constrained to existing taxonomy values (`taxonomy_values`); the flow must not auto-create taxonomy entries.
- Unknown/out-of-taxonomy suggestions are mapped to `Unknown` values when present; the service never creates new taxonomy values automatically.
- Suggestions are editorial aids only; they do not persist data until the user confirms Create Lemma.
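The constraint rule (map to existing values only, fall back to `Unknown` when present, never create) can be sketched as follows. The `mapSuggestion` helper and the sample value sets are illustrative assumptions:

```javascript
// Assumed snapshot of existing taxonomy_values, keyed by category.
const taxonomyValues = {
  cefr_levels: ["A1", "A2", "B1", "B2", "C1", "C2", "Unknown"],
  registers: ["formal", "neutral", "informal"],
};

function mapSuggestion(category, suggested) {
  const values = taxonomyValues[category] ?? [];
  if (values.includes(suggested)) return suggested;
  // Out-of-taxonomy suggestion: fall back to Unknown when present,
  // otherwise drop it. Never create a new taxonomy value.
  return values.includes("Unknown") ? "Unknown" : null;
}
```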
7. AI Grammar Engine (Phase 1)¶
7.1 Lexical graph extensions¶
- `sentence_lemmas` stores explicit ordered lemma usage per sentence and is the preferred source for sentence unlock validation.
- `chunk_components` stores explicit ordered lemma composition for chunk items and hardens chunk unlock checks.
- `sentence_tokens` and previous chunk logic are preserved for backward compatibility.
7.2 Strong unlock gating¶
Sentence eligibility:
- allow only if each referenced lemma is unlocked via:
- introduced_unit_id < student_current_unit
- OR (introduced_unit_id = student_current_unit AND introduced_step <= student_current_step)
Chunk eligibility:
- chunk lemma must be unlocked,
- every component lemma in chunk_components must be unlocked.
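The two eligibility rules can be sketched over the explicit lexical graphs. Table and field names follow the spec; the data shapes and helper functions are illustrative assumptions:

```javascript
// Per-lemma unlock check (§7.2 rule, using introduced_unit_id metadata).
function lemmaUnlocked(meta, student) {
  return (
    meta.introduced_unit_id < student.student_current_unit ||
    (meta.introduced_unit_id === student.student_current_unit &&
      meta.introduced_step <= student.student_current_step)
  );
}

// A sentence is eligible only if every sentence_lemmas reference is unlocked.
function sentenceEligible(sentenceLemmas, lemmaMeta, student) {
  return sentenceLemmas.every((sl) =>
    lemmaUnlocked(lemmaMeta.get(sl.lemma_id), student)
  );
}

// A chunk needs its own lemma unlocked plus every chunk_components lemma.
function chunkEligible(chunkMeta, chunkComponents, lemmaMeta, student) {
  return (
    lemmaUnlocked(chunkMeta, student) &&
    chunkComponents.every((c) => lemmaUnlocked(lemmaMeta.get(c.lemma_id), student))
  );
}
```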
7.3 Grammar validation responsibilities¶
services/aiGrammarEngineService.js performs deterministic checks before generating explanation text:
- known/unlocked vocabulary only;
- adjective position;
- supported word-order pattern checks;
- question and negative transformation scaffolding;
- optional grammar-topic compatibility hooks through pattern metadata.
Returned payload is structured (status, corrected_sentence, issues, detected_pattern, expected_pattern, follow_up_exercises).
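One of the deterministic checks, with the structured result shape named above, can be sketched as follows. The check logic and example data are illustrative, not the actual `aiGrammarEngineService` implementation:

```javascript
// Known/unlocked-vocabulary check returning the spec's structured payload
// (status, corrected_sentence, issues, detected_pattern, expected_pattern,
// follow_up_exercises). Pattern fields stay null for this lexical-only check.
function checkKnownVocabulary(tokens, unlockedLemmas) {
  const locked = tokens.filter((t) => !unlockedLemmas.has(t));
  return {
    status: locked.length === 0 ? "ok" : "error",
    corrected_sentence: null,
    issues: locked.map((t) => ({ type: "locked_vocabulary", token: t })),
    detected_pattern: null,
    expected_pattern: null,
    follow_up_exercises: [],
  };
}
```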
7.4 Pattern-based exercise generation¶
services/exerciseGenerationService.js composes exercises from:
- unlocked vocabulary;
- unlocked chunks;
sentence_patternstemplates;- lexical unlock graph constraints.
Supported exercise types (phase 1):
- create sentence,
- transform to question,
- answer negatively,
- replace noun,
- replace adjective,
- reorder words.
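As one concrete case, the "reorder words" type can be sketched as a deterministic shuffle over an eligible sentence. The seeded shuffle and the payload shape are illustrative assumptions, not the actual `exerciseGenerationService` output:

```javascript
// Build a reorder-words exercise: shuffled tokens as the prompt, the
// original sentence as the expected answer.
function buildReorderExercise(sentence, seed = 1) {
  const tokens = sentence.split(" ");
  // Seeded Fisher-Yates (Lehmer-style generator) so exercises are reproducible.
  let state = seed;
  const rand = () => (state = (state * 48271) % 2147483647) / 2147483647;
  const shuffled = [...tokens];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return { type: "reorder words", prompt: shuffled, expected_answer: sentence };
}
```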
8. Engine Module Specifications¶
8.1 Lexical Unlock Graph¶
Purpose
- Enforce progression-safe lexical usage across exercises, sentence validation, and conversational interactions.
Inputs
- Learner progression context (student_current_unit, student_current_step, optional student_current_unit_id).
- Lemma/chunk progression metadata (introduced_unit_id, introduced_step; legacy mirrors where required).
- Optional explicit lexical graphs (sentence_lemmas, chunk_components).
Outputs
- Eligibility decisions for lemmas/chunks/sentences.
- Locked/out-of-scope token/lemma sets for downstream feedback.
Interactions
- Consumed by AI Grammar Engine, Exercise Generation Service, and Conversation Engine.
- Uses vocabulary/chunk graphs as lexical source-of-truth.
8.2 AI Grammar Engine¶
Purpose
- Validate student responses with deterministic lexical + grammar checks and produce structured feedback.
Inputs
- Student sentence/response text.
- language_id and progression context.
- Lexical data (vocabulary, sentence_lemmas, chunk_components) and pattern data (sentence_patterns).
Outputs
- status, corrected_sentence, issues, detected_pattern, expected_pattern, and follow-up recommendations.
Interactions
- Calls Lexical Unlock Graph for in-scope validation.
- Feeds Exercise Generation Service (follow-up exercises).
- Feeds Conversation Engine for turn-level grammar feedback persistence.
8.3 Sentence Pattern Engine¶
Purpose
- Represent reusable language-scoped construction templates for validation and generation.
Inputs
- sentence_patterns records (language_id, pattern_structure, grammar_topic_id, progression metadata).
- Request context (exercise type, language scope, learner progression).
Outputs
- Selected pattern metadata and expected structure constraints.
Interactions
- Used by AI Grammar Engine to compare detected vs expected structure.
- Used by Exercise Generation Service to build prompt templates.
- Aligned with grammar topics taxonomy for pedagogical scope.
8.4 Exercise Generation Service¶
Purpose
- Generate adaptive exercises using unlocked vocabulary/chunks and pattern templates.
Inputs
- Student progression and language context.
- Lexical unlock eligibility results.
- Candidate vocabulary/chunks/sentences.
- Optional grammar topic/pattern constraints.
Outputs
- Exercise payloads (prompt, expected answer/pattern, metadata for correction and feedback).
Interactions
- Depends on Lexical Unlock Graph and Sentence Pattern Engine.
- Consumed by admin APIs and grammar follow-up generation flow.
- Shares validation path with AI Grammar Engine.
8.5 Conversation Engine¶
Purpose
- Persist conversational practice turns and connect free-form dialogue to the existing learning safety model.
Inputs
- Turn payload (student_id, session_id, language_id, unit_id, step, prompt_text, student_response).
- AI Grammar Engine validation output.
- Lexical unlock eligibility output.
Outputs
- Stored conversation turn (conversation_turns) including correction, detected pattern, and grammar feedback.
- Stored lemma usage links (conversation_lemmas) for analytics and consistency checks.
- Out-of-scope vocabulary signals (unknown_tokens, locked_tokens).
Interactions
- Reuses AI Grammar Engine and Lexical Unlock Graph.
- Complements (does not replace) exercise/sentence tables.
- Supports future dialogue continuation and difficulty adaptation loops.
3.10 Final language engine expansion (additive)¶
New optional layers (backward compatible):
- Lexical Role Engine via lemma_roles (lemma_id, role, confidence).
- Semantic Constraints Engine via lemma_semantic_classes, verb_object_constraints, modifier_constraints.
- Student Lemma Progress via student_lemma_progress.
- Difficulty/Frequency extensions via pattern_difficulty and lemma_frequency (complements taxonomy frequency_bands).
- Collocation Strength/Naturalness via collocation_strength.
Generation preference order when data exists: unlocked lemmas → POS compatibility → lexical role expectations → semantic constraints → collocation strength → pattern difficulty.
Fallback behavior:
- All new layers are optional and non-blocking.
- Existing endpoints and legacy lexical flows remain unchanged when these tables are empty/missing.
Conversation and exercise integration:
- Exercise generation can rank candidates by semantic compatibility/collocation strength.
- Grammar validation can emit semantic and naturalness hints.
- Conversation turn processing can update per-lemma student progress.
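The preference order above can be sketched as a candidate pipeline where each stage is skipped when its data is missing, mirroring the non-blocking fallback rule. Signal names and scoring weights are illustrative assumptions:

```javascript
// Rank generation candidates by the §3.10 preference order:
// unlocked → POS → lexical roles → semantic fit → collocation → pattern difficulty.
function rankCandidates(candidates, signals) {
  let pool = candidates.filter((c) => c.unlocked); // 1. unlocked lemmas only
  if (signals.requiredPos) {
    pool = pool.filter((c) => c.pos === signals.requiredPos); // 2. POS compatibility
  }
  const score = (c) =>
    (signals.roles?.[c.lemma] ?? 0) + // 3. lexical role expectations
    (signals.semanticFit?.[c.lemma] ?? 0) + // 4. semantic constraints
    (signals.collocationStrength?.[c.lemma] ?? 0) - // 5. collocation strength
    (signals.patternDifficulty ?? 0) * (c.difficulty_score ?? 0); // 6. difficulty penalty
  return [...pool].sort((a, b) => score(b) - score(a));
}
```

Because every signal defaults to zero, an empty `signals` object degrades to plain unlock filtering, which matches the "layers are optional and non-blocking" rule.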
9. 2026-03-08 audit synchronization¶
9.1 Database-table audit status¶
- `lemma_roles`: used by `languageSignalService` and now exposed through `lexicalRoleEngineService`.
- `student_lemma_progress`: used by `conversationEngineService` and now exposed through `studentLemmaProgressEngineService`.
- `lemma_frequency`: used by `languageSignalService` and CEFR/difficulty flows.
- `sentence_patterns`: used by `exerciseGenerationService`, `aiGrammarEngineService`, and `sentenceConstructionEngineService`.
- `conversation_turns` + `conversation_lemmas`: used by `conversationEngineService` and exposed via `/admin/content-bank/conversation/process-turn`.
9.2 Service-flow audit status¶
Verified system-connected services:
- lexicalUnlockGraphService: consumed by aiGrammarEngineService, exerciseGenerationService, languageSignalService, and admin content-engine controllers.
- lexicalGraphGateService: consumed by grammar/exercise services and content-engine controller.
- aiGrammarEngineService: consumed by admin grammar controller and conversation engine.
- exerciseGenerationService: consumed by admin grammar controller and AI grammar follow-up generation.
- conversationEngineService: consumed by admin grammar controller; exposed by POST /admin/content-bank/conversation/process-turn.
- contentBankSentencesService: consumed by content-bank ops controller; exposed by POST /admin/content-bank/sentences/create.
9.3 Canonical engine service modules¶
New canonical adapters added for discoverability and non-duplication:
- lexicalRoleEngineService
- semanticConstraintsEngineService
- collocationEngineService
- sentenceConstructionEngineService
- difficultyEngineService
- studentLemmaProgressEngineService
- placementEngineService
- rubricEngineService
Existing mapped modules (no duplicate created):
- Pronunciation Engine → pronunciationService
- Conversation Simulation Engine → conversationEngineService