Spoke Plus Course Engine Specification¶
1. Purpose¶
This document defines current Course Engine behavior and the implemented lexical architecture that powers enrichment, generation, and token-level analysis.
2. Architectural Principle¶
Single Linguistic Source of Truth¶
The Content Bank is a lemma-first lexical system:
- `vocabulary` is the canonical lemma registry.
- `lemma_forms`, `senses`, `sense_translations`, `sentences`/`sentence_tokens`, `semantic_relations`, and `tts_assets` are all linked to lemma entries.
- Classification is taxonomy-driven (`taxonomy_categories`, `taxonomy_values`, `content_item_taxonomies`).
- The Content Bank behaves as a language-scoped lexical graph through course-content linking.
Conceptual path:
Course → target_language_id → Vocabulary (lemma) → Word Forms → Word Senses → Translations (course.source_language_id) → Sentences → Semantic Relations → Audio/Media → Taxonomies
3. Linguistic Engine (Implemented Core)¶
3.1 Content Bank Core Entities¶
Official lexical entities:
- `vocabulary` (lemma root)
- `lemma_forms`
- `senses`
- `sense_translations`
- `sentences`
- `sentence_tokens`
- `semantic_relations`
- `tts_assets`
Classification and graph helpers:
- `taxonomy_categories`
- `taxonomy_values`
- `content_item_taxonomies`
- `vocabulary_components` (chunk composition)
3.2 Taxonomy-driven classification¶
Legacy direct linguistic columns were replaced by taxonomy assignments as the canonical model.
Canonical classification source:
- content_item_taxonomies joins a content entity (for example vocabulary) to taxonomy values.
Common taxonomy categories include:
- parts_of_speech
- cefr_levels
- frequency_bands
- registers
- semantic_domains
- themes
Benefits:
- multi-value classification per lemma;
- no schema fragmentation from repeatedly adding fixed columns;
- consistent filtering/search semantics across lexical and non-lexical entities.
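The join model above can be sketched in miniature. Table and column names follow the spec; the in-memory shapes and the `classificationsFor` helper are illustrative assumptions, not the actual query layer:

```javascript
// Sketch: multi-value classification via content_item_taxonomies.
// One lemma carries several taxonomy values across categories.
const contentItemTaxonomies = [
  { content_type: "vocabulary", content_id: 1, taxonomy_value_id: 100 },
  { content_type: "vocabulary", content_id: 1, taxonomy_value_id: 200 },
];
const taxonomyValuesById = new Map([
  [100, { category: "parts_of_speech", value: "noun" }],
  [200, { category: "cefr_levels", value: "A1" }],
]);

// Resolve every taxonomy value assigned to one content entity.
function classificationsFor(contentType, contentId) {
  return contentItemTaxonomies
    .filter((r) => r.content_type === contentType && r.content_id === contentId)
    .map((r) => taxonomyValuesById.get(r.taxonomy_value_id));
}
```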
3.3 Chunk builder model¶
Chunks are multiword expressions represented as lemmas with type='chunk'.
- Component lemmas must already exist as `type='lemma'`.
- Composition is stored in `vocabulary_components` (`chunk_id`, `lemma_id`, `position`).
- Chunks support examples (`sentences`), media (`tts_assets`/images), and taxonomy assignments.
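A minimal sketch of composing a chunk's surface form from its ordered components. The `vocabulary_components` fields (`chunk_id`, `lemma_id`, `position`) follow the spec; the in-memory shapes and the `composeChunk` helper are illustrative assumptions:

```javascript
// Component lemmas exist as type='lemma'; the chunk itself is type='chunk'.
const vocabulary = new Map([
  [1, { id: 1, lemma: "take", type: "lemma" }],
  [2, { id: 2, lemma: "a", type: "lemma" }],
  [3, { id: 3, lemma: "break", type: "lemma" }],
  [10, { id: 10, lemma: "take a break", type: "chunk" }],
]);

// Composition rows may arrive unordered; position defines the surface order.
const vocabularyComponents = [
  { chunk_id: 10, lemma_id: 3, position: 3 },
  { chunk_id: 10, lemma_id: 1, position: 1 },
  { chunk_id: 10, lemma_id: 2, position: 2 },
];

function composeChunk(chunkId) {
  return vocabularyComponents
    .filter((c) => c.chunk_id === chunkId)
    .sort((a, b) => a.position - b.position)
    .map((c) => vocabulary.get(c.lemma_id).lemma)
    .join(" ");
}
```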
3.4 Number lemma model¶
Numbers are represented as lexical lemmas with dual fields:
- numeric_value
- spelled_form
Example:
- lemma 1
- numeric value 1
- spelled form one
This supports number recognition, number-to-word conversion, and writing exercises.
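The dual-field model can be sketched as a pair of lookups. The record fields (`numeric_value`, `spelled_form`) follow the spec; the helper functions are illustrative assumptions:

```javascript
// Number lemmas carry both a numeric value and a spelled form.
const numberLemmas = [
  { lemma: "1", numeric_value: 1, spelled_form: "one" },
  { lemma: "2", numeric_value: 2, spelled_form: "two" },
];

// Number-to-word conversion, e.g. for writing exercises.
function toSpelledForm(n) {
  const entry = numberLemmas.find((l) => l.numeric_value === n);
  return entry ? entry.spelled_form : null;
}

// Number recognition: accept either surface form.
function recognizeNumber(input) {
  return (
    numberLemmas.find((l) => l.lemma === input || l.spelled_form === input) ??
    null
  );
}
```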
3.5 Lemma Detail and Save All workflow¶
The admin lexical editor uses tabbed Lemma Detail and a single Save All action.
Save All is the official persistence pattern for lexical entities and taxonomy classifications in one editorial action.
4. Course Structure¶
The engine uses a normalized progression hierarchy:
1. courses
2. units
3. skills
4. lessons
5. lesson_content_map
5. Planned Adaptive Direction¶
Adaptive logic is being built on top of the current analytics and token-level linguistic infrastructure.
Roadmap direction:
- weak-skill ranking;
- due-item recommendations;
- adaptive session assembly;
- morphology/syntax-informed remediation.
6. Official specs (normative references)¶
- docs/content-bank/WORD_DETAIL_PANEL_SPEC.md
- docs/content-bank/VOCABULARY_CONTENT_BANK_ARCHITECTURE.md
- docs/schema/vocabulary-content-bank-model.md
- docs/engine/MORPHOLOGICAL_EVALUATION_SPEC.md
- docs/engine/MORPHOLOGY_GENERATION_SPEC.md
3.6 Curriculum-gated adaptive selection¶
Implemented progression rule for vocabulary selection in scoped content queries:
- `student_current_unit` + `student_current_step` gate available lemmas/chunks.
- Effective unlock rule: `introduced_unit < student_current_unit OR (introduced_unit = student_current_unit AND introduced_step <= student_current_step)`.
- Backward-compatible mirrors remain supported via `vocabulary.introduced_chapter` + `vocabulary.introduced_step`, with canonical curriculum assignment in taxonomy category `curriculum_units`.
- Each lemma must have exactly one introduced curriculum unit assignment.
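The effective unlock rule translates directly into a predicate. Field names follow the spec; the `isUnlocked` function itself is an illustrative sketch, not the actual service code:

```javascript
// Unlock rule from §3.6: earlier unit, or same unit and step already reached.
function isUnlocked(item, student) {
  return (
    item.introduced_unit < student.student_current_unit ||
    (item.introduced_unit === student.student_current_unit &&
      item.introduced_step <= student.student_current_step)
  );
}
```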
3.8 Difficulty score model¶
To support adaptive ranking/filtering, vocabulary includes a normalized numeric difficulty_score (0.0–1.0).
- The score complements taxonomy classification and does not replace CEFR/frequency/semantic dimensions.
- Default derivation uses CEFR weight + frequency modifier + grammar complexity + semantic abstraction, clamped to [0,1].
- Existing rows are backfilled with an initial CEFR+frequency heuristic for backward-compatible rollout.
- Admin workflows can manually edit the score or recompute via taxonomy/AI suggestion.
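The default derivation can be sketched as a weighted sum with a clamp. The spec fixes only the inputs (CEFR weight, frequency modifier, grammar complexity, semantic abstraction) and the clamp to [0, 1]; the component weights below are illustrative assumptions:

```javascript
// Assumed CEFR weights and frequency modifiers -- not the production values.
const CEFR_WEIGHT = { A1: 0.1, A2: 0.25, B1: 0.4, B2: 0.55, C1: 0.7, C2: 0.85 };
const FREQUENCY_MODIFIER = { high: -0.1, medium: 0.0, low: 0.1 };

function deriveDifficultyScore({
  cefr,
  frequencyBand,
  grammarComplexity = 0,
  semanticAbstraction = 0,
}) {
  const raw =
    (CEFR_WEIGHT[cefr] ?? 0.5) +
    (FREQUENCY_MODIFIER[frequencyBand] ?? 0) +
    grammarComplexity +
    semanticAbstraction;
  return Math.min(1, Math.max(0, raw)); // clamp to [0, 1]
}
```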
3.9 Adaptive selection gating with difficulty¶
Course-scoped vocabulary list endpoints may filter by:
- student_current_unit (existing rule: introduced_chapter <= student_current_unit)
- student_difficulty_level (new rule: difficulty_score <= student_difficulty_level)
When difficulty filter is present, results are ordered ascending by difficulty_score to prioritize easier eligible lemmas.
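The two gates and the ordering rule can be sketched as a filter/sort pipeline. Field names follow the spec; the `selectEligible` helper is an illustrative assumption, not the actual endpoint query:

```javascript
// §3.9 gating: unit gate, then difficulty gate, then ascending difficulty order.
function selectEligible(lemmas, { student_current_unit, student_difficulty_level }) {
  return lemmas
    .filter((l) => l.introduced_chapter <= student_current_unit)
    .filter((l) => l.difficulty_score <= student_difficulty_level)
    .sort((a, b) => a.difficulty_score - b.difficulty_score);
}
```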
3.7 Extended taxonomy dimensions¶
The lexical engine now uses additional taxonomy categories:
- `curriculum_units` (Unit 1..Unit 300)
- `unit_types` (e.g., `skill`, `review`, `assessment`)
- `grammar_topics`
grammar_topics is language-scoped: each topic is stored in taxonomy_values with language_id referencing a value from taxonomy category languages.
Course-context selectors must resolve courses.target_language_id and only expose grammar topics for that language.
Vocabulary-to-grammar classification must enforce grammar_topic.language_id == vocabulary.language_id.
semantic_domains supports hierarchy through taxonomy_values.parent_id.
3.11 AI-Assisted Lemma Classification¶
The Content Bank supports AI-assisted taxonomy suggestions during lemma creation:
- Endpoint: `POST /admin/content-bank/lemmas/ai-classify`.
- Input: `lemma`, `language_id`.
- Output: taxonomy suggestion payload for CEFR, frequency band, register, semantic domains, lemma type, parts of speech, and grammar topics.
- Mapping is constrained to existing taxonomy values (`taxonomy_values`); the flow must not auto-create taxonomy entries.
- Unknown/out-of-taxonomy suggestions are mapped to `Unknown` values when present; the service never creates new taxonomy values automatically.
- Suggestions are editorial aids only; they do not persist data until the user confirms Create Lemma.
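The constraint rule (map to existing values only, fall back to `Unknown` when present, never create) can be sketched as follows. The `mapSuggestion` helper and the sample value sets are illustrative assumptions:

```javascript
// Assumed snapshot of existing taxonomy_values, keyed by category.
const taxonomyValues = {
  cefr_levels: ["A1", "A2", "B1", "B2", "C1", "C2", "Unknown"],
  registers: ["formal", "neutral", "informal"],
};

function mapSuggestion(category, suggested) {
  const values = taxonomyValues[category] ?? [];
  if (values.includes(suggested)) return suggested;
  // Out-of-taxonomy suggestion: fall back to Unknown when present,
  // otherwise drop it. Never create a new taxonomy value.
  return values.includes("Unknown") ? "Unknown" : null;
}
```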
7. AI Grammar Engine (Phase 1)¶
7.1 Lexical graph extensions¶
- `sentence_lemmas` stores explicit ordered lemma usage per sentence and is the preferred source for sentence unlock validation.
- `chunk_components` stores explicit ordered lemma composition for chunk items and hardens chunk unlock checks.
- `sentence_tokens` and previous chunk logic are preserved for backward compatibility.
7.2 Strong unlock gating¶
Sentence eligibility:
- allow only if each referenced lemma is unlocked via:
- introduced_unit_id < student_current_unit
- OR (introduced_unit_id = student_current_unit AND introduced_step <= student_current_step)
Chunk eligibility:
- chunk lemma must be unlocked,
- every component lemma in chunk_components must be unlocked.
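The two eligibility rules can be sketched over the explicit lexical graphs. Table and field names follow the spec; the data shapes and helper functions are illustrative assumptions:

```javascript
// Per-lemma unlock check (§7.2 rule, using introduced_unit_id metadata).
function lemmaUnlocked(meta, student) {
  return (
    meta.introduced_unit_id < student.student_current_unit ||
    (meta.introduced_unit_id === student.student_current_unit &&
      meta.introduced_step <= student.student_current_step)
  );
}

// A sentence is eligible only if every sentence_lemmas reference is unlocked.
function sentenceEligible(sentenceLemmas, lemmaMeta, student) {
  return sentenceLemmas.every((sl) =>
    lemmaUnlocked(lemmaMeta.get(sl.lemma_id), student)
  );
}

// A chunk needs its own lemma unlocked plus every chunk_components lemma.
function chunkEligible(chunkMeta, chunkComponents, lemmaMeta, student) {
  return (
    lemmaUnlocked(chunkMeta, student) &&
    chunkComponents.every((c) => lemmaUnlocked(lemmaMeta.get(c.lemma_id), student))
  );
}
```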
7.3 Grammar validation responsibilities¶
services/aiGrammarEngineService.js performs deterministic checks before generating explanation text:
- known/unlocked vocabulary only;
- adjective position;
- supported word-order pattern checks;
- question and negative transformation scaffolding;
- optional grammar-topic compatibility hooks through pattern metadata.
Returned payload is structured (status, corrected_sentence, issues, detected_pattern, expected_pattern, follow_up_exercises).
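One of the deterministic checks, with the structured result shape named above, can be sketched as follows. The check logic and example data are illustrative, not the actual `aiGrammarEngineService` implementation:

```javascript
// Known/unlocked-vocabulary check returning the spec's structured payload
// (status, corrected_sentence, issues, detected_pattern, expected_pattern,
// follow_up_exercises). Pattern fields stay null for this lexical-only check.
function checkKnownVocabulary(tokens, unlockedLemmas) {
  const locked = tokens.filter((t) => !unlockedLemmas.has(t));
  return {
    status: locked.length === 0 ? "ok" : "error",
    corrected_sentence: null,
    issues: locked.map((t) => ({ type: "locked_vocabulary", token: t })),
    detected_pattern: null,
    expected_pattern: null,
    follow_up_exercises: [],
  };
}
```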
7.4 Pattern-based exercise generation¶
services/exerciseGenerationService.js composes exercises from:
- unlocked vocabulary;
- unlocked chunks;
sentence_patternstemplates;- lexical unlock graph constraints.
Supported exercise types (phase 1):
- create sentence,
- transform to question,
- answer negatively,
- replace noun,
- replace adjective,
- reorder words.
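As one concrete case, the "reorder words" type can be sketched as a deterministic shuffle over an eligible sentence. The seeded shuffle and the payload shape are illustrative assumptions, not the actual `exerciseGenerationService` output:

```javascript
// Build a reorder-words exercise: shuffled tokens as the prompt, the
// original sentence as the expected answer.
function buildReorderExercise(sentence, seed = 1) {
  const tokens = sentence.split(" ");
  // Seeded Fisher-Yates (Lehmer-style generator) so exercises are reproducible.
  let state = seed;
  const rand = () => (state = (state * 48271) % 2147483647) / 2147483647;
  const shuffled = [...tokens];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return { type: "reorder words", prompt: shuffled, expected_answer: sentence };
}
```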
8. Engine Module Specifications¶
8.1 Lexical Unlock Graph¶
Purpose
- Enforce progression-safe lexical usage across exercises, sentence validation, and conversational interactions.
Inputs
- Learner progression context (student_current_unit, student_current_step, optional student_current_unit_id).
- Lemma/chunk progression metadata (introduced_unit_id, introduced_step; legacy mirrors where required).
- Optional explicit lexical graphs (sentence_lemmas, chunk_components).
Outputs
- Eligibility decisions for lemmas/chunks/sentences.
- Locked/out-of-scope token/lemma sets for downstream feedback.
Interactions
- Consumed by AI Grammar Engine, Exercise Generation Service, and Conversation Engine.
- Uses vocabulary/chunk graphs as lexical source-of-truth.
8.2 AI Grammar Engine¶
Purpose
- Validate student responses with deterministic lexical + grammar checks and produce structured feedback.
Inputs
- Student sentence/response text.
- language_id and progression context.
- Lexical data (vocabulary, sentence_lemmas, chunk_components) and pattern data (sentence_patterns).
Outputs
- status, corrected_sentence, issues, detected_pattern, expected_pattern, and follow-up recommendations.
Interactions
- Calls Lexical Unlock Graph for in-scope validation.
- Feeds Exercise Generation Service (follow-up exercises).
- Feeds Conversation Engine for turn-level grammar feedback persistence.
8.3 Sentence Pattern Engine¶
Purpose
- Represent reusable language-scoped construction templates for validation and generation.
Inputs
- sentence_patterns records (language_id, pattern_structure, grammar_topic_id, progression metadata).
- Request context (exercise type, language scope, learner progression).
Outputs
- Selected pattern metadata and expected structure constraints.
Interactions
- Used by AI Grammar Engine to compare detected vs expected structure.
- Used by Exercise Generation Service to build prompt templates.
- Aligned with grammar topics taxonomy for pedagogical scope.
8.4 Exercise Generation Service¶
Purpose
- Generate adaptive exercises using unlocked vocabulary/chunks and pattern templates.
Inputs
- Student progression and language context.
- Lexical unlock eligibility results.
- Candidate vocabulary/chunks/sentences.
- Optional grammar topic/pattern constraints.
Outputs
- Exercise payloads (prompt, expected answer/pattern, metadata for correction and feedback).
Interactions
- Depends on Lexical Unlock Graph and Sentence Pattern Engine.
- Consumed by admin APIs and grammar follow-up generation flow.
- Shares validation path with AI Grammar Engine.
8.5 Conversation Engine¶
Purpose
- Persist conversational practice turns and connect free-form dialogue to the existing learning safety model.
Inputs
- Turn payload (student_id, session_id, language_id, unit_id, step, prompt_text, student_response).
- AI Grammar Engine validation output.
- Lexical unlock eligibility output.
Outputs
- Stored conversation turn (conversation_turns) including correction, detected pattern, and grammar feedback.
- Stored lemma usage links (conversation_lemmas) for analytics and consistency checks.
- Out-of-scope vocabulary signals (unknown_tokens, locked_tokens).
Interactions
- Reuses AI Grammar Engine and Lexical Unlock Graph.
- Complements (does not replace) exercise/sentence tables.
- Supports future dialogue continuation and difficulty adaptation loops.
3.10 Final language engine expansion (additive)¶
New optional layers (backward compatible):
- Lexical Role Engine via lemma_roles (lemma_id, role, confidence).
- Semantic Constraints Engine via lemma_semantic_classes, verb_object_constraints, modifier_constraints.
- Student Lemma Progress via student_lemma_progress.
- Difficulty/Frequency extensions via pattern_difficulty and lemma_frequency (complements taxonomy frequency_bands).
- Collocation Strength/Naturalness via collocation_strength.
Generation preference order when data exists: unlocked lemmas → POS compatibility → lexical role expectations → semantic constraints → collocation strength → pattern difficulty.
Fallback behavior:
- All new layers are optional and non-blocking.
- Existing endpoints and legacy lexical flows remain unchanged when these tables are empty/missing.
Conversation and exercise integration:
- Exercise generation can rank candidates by semantic compatibility/collocation strength.
- Grammar validation can emit semantic and naturalness hints.
- Conversation turn processing can update per-lemma student progress.
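The preference order above can be sketched as a candidate pipeline where each stage is skipped when its data is missing, mirroring the non-blocking fallback rule. Signal names and scoring weights are illustrative assumptions:

```javascript
// Rank generation candidates by the §3.10 preference order:
// unlocked → POS → lexical roles → semantic fit → collocation → pattern difficulty.
function rankCandidates(candidates, signals) {
  let pool = candidates.filter((c) => c.unlocked); // 1. unlocked lemmas only
  if (signals.requiredPos) {
    pool = pool.filter((c) => c.pos === signals.requiredPos); // 2. POS compatibility
  }
  const score = (c) =>
    (signals.roles?.[c.lemma] ?? 0) + // 3. lexical role expectations
    (signals.semanticFit?.[c.lemma] ?? 0) + // 4. semantic constraints
    (signals.collocationStrength?.[c.lemma] ?? 0) - // 5. collocation strength
    (signals.patternDifficulty ?? 0) * (c.difficulty_score ?? 0); // 6. difficulty penalty
  return [...pool].sort((a, b) => score(b) - score(a));
}
```

Because every signal defaults to zero, an empty `signals` object degrades to plain unlock filtering, which matches the "layers are optional and non-blocking" rule.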
9. 2026-03-08 audit synchronization¶
9.1 Database-table audit status¶
- `lemma_roles`: used by `languageSignalService` and now exposed through `lexicalRoleEngineService`.
- `student_lemma_progress`: used by `conversationEngineService` and now exposed through `studentLemmaProgressEngineService`.
- `lemma_frequency`: used by `languageSignalService` and CEFR/difficulty flows.
- `sentence_patterns`: used by `exerciseGenerationService`, `aiGrammarEngineService`, and `sentenceConstructionEngineService`.
- `conversation_turns` + `conversation_lemmas`: used by `conversationEngineService` and exposed via `/admin/content-bank/conversation/process-turn`.
9.2 Service-flow audit status¶
Verified system-connected services:
- lexicalUnlockGraphService: consumed by aiGrammarEngineService, exerciseGenerationService, languageSignalService, and admin content-engine controllers.
- lexicalGraphGateService: consumed by grammar/exercise services and content-engine controller.
- aiGrammarEngineService: consumed by admin grammar controller and conversation engine.
- exerciseGenerationService: consumed by admin grammar controller and AI grammar follow-up generation.
- conversationEngineService: consumed by admin grammar controller; exposed by POST /admin/content-bank/conversation/process-turn.
- contentBankSentencesService: consumed by content-bank ops controller; exposed by POST /admin/content-bank/sentences/create.
9.3 Canonical engine service modules¶
New canonical adapters added for discoverability and non-duplication:
- lexicalRoleEngineService
- semanticConstraintsEngineService
- collocationEngineService
- sentenceConstructionEngineService
- difficultyEngineService
- studentLemmaProgressEngineService
- placementEngineService
- rubricEngineService
Existing mapped modules (no duplicate created):
- Pronunciation Engine → pronunciationService
- Conversation Simulation Engine → conversationEngineService