
Spoke Plus Course Engine Specification

1. Purpose

This document defines current Course Engine behavior and the implemented lexical architecture that powers enrichment, generation, and token-level analysis.


2. Architectural Principle

Single Linguistic Source of Truth

The Content Bank is a lemma-first lexical system:

  • vocabulary is the canonical lemma registry.
  • lemma_forms, senses, sense_translations, sentences/sentence_tokens, semantic_relations, and tts_assets are all linked to lemma entries.
  • Classification is taxonomy-driven (taxonomy_categories, taxonomy_values, content_item_taxonomies).
  • The Content Bank behaves as a language-scoped lexical graph through course-content linking.

Conceptual path:

Course → target_language_id → Vocabulary (lemma) → Word Forms → Word Senses → Translations (course.source_language_id) → Sentences → Semantic Relations → Audio/Media → Taxonomies


3. Linguistic Engine (Implemented Core)

3.1 Content Bank Core Entities

Official lexical entities:

  • vocabulary (lemma root)
  • lemma_forms
  • senses
  • sense_translations
  • sentences
  • sentence_tokens
  • semantic_relations
  • tts_assets

Classification and graph helpers:

  • taxonomy_categories
  • taxonomy_values
  • content_item_taxonomies
  • vocabulary_components (chunk composition)

3.2 Taxonomy-driven classification

Legacy direct linguistic columns were replaced by taxonomy assignments as the canonical model.

Canonical classification source: content_item_taxonomies, which joins a content entity (for example, vocabulary) to taxonomy values.

Common taxonomy categories include:

  • parts_of_speech
  • cefr_levels
  • frequency_bands
  • registers
  • semantic_domains
  • themes

Benefits:

  • multi-value classification per lemma;
  • no schema fragmentation from repeatedly adding fixed columns;
  • consistent filtering/search semantics across lexical and non-lexical entities.
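The join-based classification model can be sketched with in-memory rows. The table shapes mirror the spec's entity names; the helper function is illustrative, not an existing API:

```javascript
// In-memory rows mirroring taxonomy_values and content_item_taxonomies.
const taxonomyValues = [
  { id: 10, category: "cefr_levels", value: "A1" },
  { id: 11, category: "cefr_levels", value: "B2" },
  { id: 20, category: "parts_of_speech", value: "noun" },
];

const contentItemTaxonomies = [
  { content_type: "vocabulary", content_id: 1, taxonomy_value_id: 10 },
  { content_type: "vocabulary", content_id: 1, taxonomy_value_id: 20 },
  { content_type: "vocabulary", content_id: 2, taxonomy_value_id: 11 },
];

// Multi-value classification: one lemma can carry several taxonomy values.
function classificationsFor(contentType, contentId) {
  return contentItemTaxonomies
    .filter((r) => r.content_type === contentType && r.content_id === contentId)
    .map((r) => taxonomyValues.find((v) => v.id === r.taxonomy_value_id))
    .map((v) => `${v.category}:${v.value}`);
}

console.log(classificationsFor("vocabulary", 1));
// → ["cefr_levels:A1", "parts_of_speech:noun"]
```

Because classification lives in the join table, adding a new dimension is a data change, not a schema change.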

3.3 Chunk builder model

Chunks are multiword expressions represented as lemmas with type='chunk'.

  • Component lemmas must already exist as type='lemma'.
  • Composition is stored in vocabulary_components (chunk_id, lemma_id, position).
  • Chunks support examples (sentences), media (tts_assets/images), and taxonomy assignments.
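A minimal sketch of the composition rule above, assuming illustrative row shapes for vocabulary and vocabulary_components:

```javascript
// Illustrative rows: a chunk lemma and its ordered components.
const vocabulary = [
  { id: 1, lemma: "take", type: "lemma" },
  { id: 2, lemma: "off", type: "lemma" },
  { id: 3, lemma: "take off", type: "chunk" },
];

const vocabularyComponents = [
  { chunk_id: 3, lemma_id: 1, position: 1 },
  { chunk_id: 3, lemma_id: 2, position: 2 },
];

// A chunk is well formed when it has components and every component
// already exists in vocabulary as type='lemma'.
function chunkIsWellFormed(chunkId) {
  const components = vocabularyComponents
    .filter((c) => c.chunk_id === chunkId)
    .sort((a, b) => a.position - b.position);
  if (components.length === 0) return false;
  return components.every((c) =>
    vocabulary.some((v) => v.id === c.lemma_id && v.type === "lemma")
  );
}
```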

3.4 Number lemma model

Numbers are represented as lexical lemmas with dual fields:

  • numeric_value
  • spelled_form

Example:

  • lemma: 1
  • numeric value: 1
  • spelled form: one

This supports number recognition, number-to-word conversion, and writing exercises.
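The dual-field model can be sketched as follows (hypothetical rows; the spell helper is illustrative):

```javascript
// Hypothetical dual-field number lemmas per the numeric_value/spelled_form model.
const numberLemmas = [
  { lemma: "1", numeric_value: 1, spelled_form: "one" },
  { lemma: "2", numeric_value: 2, spelled_form: "two" },
];

// Number-to-word conversion backed by the lexicon rather than an algorithm.
function spell(n) {
  const entry = numberLemmas.find((l) => l.numeric_value === n);
  return entry ? entry.spelled_form : null;
}
```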

3.5 Lemma Detail and Save All workflow

The admin lexical editor uses a tabbed Lemma Detail view and a single Save All action.

Save All is the official persistence pattern for lexical entities and taxonomy classifications in one editorial action.


4. Course Structure

The engine uses a normalized progression hierarchy:

  1. courses
  2. units
  3. skills
  4. lessons
  5. lesson_content_map


5. Planned Adaptive Direction

Adaptive logic builds on the current analytics and token-level linguistic infrastructure.

Roadmap direction:

  • weak-skill ranking
  • due-item recommendations
  • adaptive session assembly
  • morphology/syntax-informed remediation


6. Official specs (normative references)

  • docs/content-bank/WORD_DETAIL_PANEL_SPEC.md
  • docs/content-bank/VOCABULARY_CONTENT_BANK_ARCHITECTURE.md
  • docs/schema/vocabulary-content-bank-model.md
  • docs/engine/MORPHOLOGICAL_EVALUATION_SPEC.md
  • docs/engine/MORPHOLOGY_GENERATION_SPEC.md

3.6 Curriculum-gated adaptive selection

Implemented progression rule for vocabulary selection in scoped content queries:

  • student_current_unit + student_current_step gate available lemmas/chunks.
  • Effective unlock rule: introduced_unit < student_current_unit OR (introduced_unit = student_current_unit AND introduced_step <= student_current_step).
  • Backward-compatible mirrors remain supported via vocabulary.introduced_chapter + vocabulary.introduced_step, with canonical curriculum assignment in taxonomy category curriculum_units.
  • Each lemma must have exactly one introduced curriculum unit assignment.
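The effective unlock rule above translates directly into a predicate (a sketch; the field names follow the spec's columns):

```javascript
// Effective unlock rule from section 3.6, expressed as a predicate.
function lemmaIsUnlocked(lemma, student) {
  return (
    lemma.introduced_unit < student.student_current_unit ||
    (lemma.introduced_unit === student.student_current_unit &&
      lemma.introduced_step <= student.student_current_step)
  );
}
```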

3.8 Difficulty score model

To support adaptive ranking/filtering, vocabulary includes a normalized numeric difficulty_score (0.0–1.0).

  • The score complements taxonomy classification and does not replace CEFR/frequency/semantic dimensions.
  • Default derivation uses CEFR weight + frequency modifier + grammar complexity + semantic abstraction, clamped to [0,1].
  • Existing rows are backfilled with an initial CEFR+frequency heuristic for backward-compatible rollout.
  • Admin workflows can manually edit the score or recompute via taxonomy/AI suggestion.
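A minimal sketch of the derivation and clamp. The four inputs are assumed to be pre-normalized numbers, and combining them by plain addition is illustrative; only the input dimensions and the [0, 1] clamp come from the spec:

```javascript
// Sketch of the default difficulty derivation. Plain addition of the four
// components is an assumption; the spec fixes the inputs and the clamp only.
function deriveDifficultyScore({ cefrWeight, frequencyModifier, grammarComplexity, semanticAbstraction }) {
  const raw = cefrWeight + frequencyModifier + grammarComplexity + semanticAbstraction;
  return Math.min(1, Math.max(0, raw)); // clamp to [0, 1]
}
```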

3.9 Adaptive selection gating with difficulty

Course-scoped vocabulary list endpoints may filter by:

  • student_current_unit (existing rule: introduced_chapter <= student_current_unit)
  • student_difficulty_level (new rule: difficulty_score <= student_difficulty_level)

When the difficulty filter is present, results are ordered ascending by difficulty_score to prioritize easier eligible lemmas.
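The gating and ordering can be sketched as follows (illustrative row shapes; eligibleVocabulary is a hypothetical helper, not an existing endpoint):

```javascript
// Illustrative gating + ordering for a course-scoped vocabulary query.
function eligibleVocabulary(rows, { studentCurrentUnit, studentDifficultyLevel }) {
  let result = rows.filter((r) => r.introduced_chapter <= studentCurrentUnit);
  if (studentDifficultyLevel != null) {
    result = result
      .filter((r) => r.difficulty_score <= studentDifficultyLevel)
      .sort((a, b) => a.difficulty_score - b.difficulty_score); // easier first
  }
  return result;
}
```

When no difficulty filter is supplied, the function falls back to unit gating alone, matching the backward-compatible behavior described above.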

3.7 Extended taxonomy dimensions

The lexical engine now uses additional taxonomy categories:

  • curriculum_units (Unit 1..Unit 300)
  • unit_types (e.g., skill, review, assessment)
  • grammar_topics

grammar_topics is language-scoped: each topic is stored in taxonomy_values with language_id referencing a value from taxonomy category languages. Course-context selectors must resolve courses.target_language_id and only expose grammar topics for that language. Vocabulary-to-grammar classification must enforce grammar_topic.language_id == vocabulary.language_id.
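The two language-scoping invariants can be expressed as small guards (row shapes are assumptions; the invariants themselves come from the spec text above):

```javascript
// Course-context selectors expose only topics for the course's target language.
function grammarTopicsForCourse(course, grammarTopicValues) {
  return grammarTopicValues.filter((t) => t.language_id === course.target_language_id);
}

// Vocabulary-to-grammar classification requires matching languages.
function canAssignGrammarTopic(vocabularyRow, grammarTopicValue) {
  return grammarTopicValue.language_id === vocabularyRow.language_id;
}
```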

semantic_domains supports hierarchy through taxonomy_values.parent_id.

3.11 AI-Assisted Lemma Classification

The Content Bank supports AI-assisted taxonomy suggestions during lemma creation:

  • Endpoint: POST /admin/content-bank/lemmas/ai-classify.
  • Input: lemma, language_id.
  • Output: taxonomy suggestion payload for CEFR, frequency band, register, semantic domains, lemma type, parts of speech, and grammar topics.
  • Mapping is constrained to existing taxonomy values (taxonomy_values); out-of-taxonomy suggestions map to an Unknown value when one exists, and the service never auto-creates taxonomy entries.
  • Suggestions are editorial aids only; they do not persist data until the user confirms Create Lemma.
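The constrained mapping with Unknown fallback might look like the following sketch (mapSuggestion is a hypothetical helper):

```javascript
// Map an AI suggestion onto existing taxonomy values only. Out-of-taxonomy
// suggestions fall back to an "Unknown" value when one exists; the function
// never creates new taxonomy rows.
function mapSuggestion(category, suggestedValue, taxonomyValues) {
  const inCategory = taxonomyValues.filter((v) => v.category === category);
  const exact = inCategory.find((v) => v.value === suggestedValue);
  if (exact) return exact;
  return inCategory.find((v) => v.value === "Unknown") ?? null;
}
```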

7. AI Grammar Engine (Phase 1)

7.1 Lexical graph extensions

  • sentence_lemmas stores explicit ordered lemma usage per sentence and is the preferred source for sentence unlock validation.
  • chunk_components stores explicit ordered lemma composition for chunk items and hardens chunk unlock checks.
  • sentence_tokens and previous chunk logic are preserved for backward compatibility.

7.2 Strong unlock gating

Sentence eligibility: allow a sentence only if each referenced lemma is unlocked via:

  • introduced_unit_id < student_current_unit, OR
  • introduced_unit_id = student_current_unit AND introduced_step <= student_current_step.

Chunk eligibility:

  • the chunk lemma must be unlocked;
  • every component lemma in chunk_components must be unlocked.
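The strong gating rules can be sketched as predicates (generic introduced_unit/introduced_step fields stand in for the unit-id-based columns named above):

```javascript
// Same unlock comparison used elsewhere in this spec.
function lemmaUnlocked(lemma, student) {
  return (
    lemma.introduced_unit < student.student_current_unit ||
    (lemma.introduced_unit === student.student_current_unit &&
      lemma.introduced_step <= student.student_current_step)
  );
}

// A sentence is eligible only when every referenced lemma is unlocked.
function sentenceEligible(sentenceLemmas, student) {
  return sentenceLemmas.every((l) => lemmaUnlocked(l, student));
}

// A chunk is eligible only when the chunk lemma and all components are unlocked.
function chunkEligible(chunkLemma, componentLemmas, student) {
  return (
    lemmaUnlocked(chunkLemma, student) &&
    componentLemmas.every((l) => lemmaUnlocked(l, student))
  );
}
```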

7.3 Grammar validation responsibilities

services/aiGrammarEngineService.js performs deterministic checks before generating explanation text:

  • known/unlocked vocabulary only;
  • adjective position;
  • supported word-order pattern checks;
  • question and negative transformation scaffolding;
  • optional grammar-topic compatibility hooks through pattern metadata.

The returned payload is structured: status, corrected_sentence, issues, detected_pattern, expected_pattern, and follow_up_exercises.
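For illustration, a payload with those fields might look like this (all values are invented; only the field names come from the spec):

```javascript
// Example structured validation payload; values are illustrative only.
const validationResult = {
  status: "needs_correction",
  corrected_sentence: "Je mange une pomme rouge.",
  issues: [{ type: "adjective_position", token: "rouge" }],
  detected_pattern: "SVO_ADJ_PRE",
  expected_pattern: "SVO_ADJ_POST",
  follow_up_exercises: ["replace adjective", "reorder words"],
};
```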

7.4 Pattern-based exercise generation

services/exerciseGenerationService.js composes exercises from:

  • unlocked vocabulary;
  • unlocked chunks;
  • sentence_patterns templates;
  • lexical unlock graph constraints.

Supported exercise types (phase 1):

  • create sentence
  • transform to question
  • answer negatively
  • replace noun
  • replace adjective
  • reorder words
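One of the listed types, reorder words, can be sketched as follows. A deterministic left rotation stands in for real shuffling so the example is reproducible; buildReorderExercise is a hypothetical helper:

```javascript
// Sketch of a "reorder words" exercise. A left rotation replaces random
// shuffling to keep the example deterministic.
function buildReorderExercise(sentence) {
  const words = sentence.split(" ");
  const shuffled = [...words.slice(1), words[0]]; // rotate left by one
  return { type: "reorder_words", prompt: shuffled, expected_answer: sentence };
}

const exercise = buildReorderExercise("the cat sleeps");
// exercise.prompt → ["cat", "sleeps", "the"]
```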


8. Engine Module Specifications

8.1 Lexical Unlock Graph

Purpose: Enforce progression-safe lexical usage across exercises, sentence validation, and conversational interactions.

Inputs:

  • Learner progression context (student_current_unit, student_current_step, optional student_current_unit_id).
  • Lemma/chunk progression metadata (introduced_unit_id, introduced_step; legacy mirrors where required).
  • Optional explicit lexical graphs (sentence_lemmas, chunk_components).

Outputs:

  • Eligibility decisions for lemmas/chunks/sentences.
  • Locked/out-of-scope token/lemma sets for downstream feedback.

Interactions:

  • Consumed by the AI Grammar Engine, Exercise Generation Service, and Conversation Engine.
  • Uses vocabulary/chunk graphs as the lexical source of truth.

8.2 AI Grammar Engine

Purpose: Validate student responses with deterministic lexical and grammar checks and produce structured feedback.

Inputs:

  • Student sentence/response text.
  • language_id and progression context.
  • Lexical data (vocabulary, sentence_lemmas, chunk_components) and pattern data (sentence_patterns).

Outputs:

  • status, corrected_sentence, issues, detected_pattern, expected_pattern, and follow-up recommendations.

Interactions:

  • Calls the Lexical Unlock Graph for in-scope validation.
  • Feeds the Exercise Generation Service (follow-up exercises).
  • Feeds the Conversation Engine for turn-level grammar feedback persistence.

8.3 Sentence Pattern Engine

Purpose: Represent reusable, language-scoped construction templates for validation and generation.

Inputs:

  • sentence_patterns records (language_id, pattern_structure, grammar_topic_id, progression metadata).
  • Request context (exercise type, language scope, learner progression).

Outputs:

  • Selected pattern metadata and expected structure constraints.

Interactions:

  • Used by the AI Grammar Engine to compare detected vs. expected structure.
  • Used by the Exercise Generation Service to build prompt templates.
  • Aligned with the grammar topics taxonomy for pedagogical scope.

8.4 Exercise Generation Service

Purpose: Generate adaptive exercises using unlocked vocabulary/chunks and pattern templates.

Inputs:

  • Student progression and language context.
  • Lexical unlock eligibility results.
  • Candidate vocabulary/chunks/sentences.
  • Optional grammar topic/pattern constraints.

Outputs:

  • Exercise payloads (prompt, expected answer/pattern, metadata for correction and feedback).

Interactions:

  • Depends on the Lexical Unlock Graph and Sentence Pattern Engine.
  • Consumed by admin APIs and the grammar follow-up generation flow.
  • Shares a validation path with the AI Grammar Engine.

8.5 Conversation Engine

Purpose: Persist conversational practice turns and connect free-form dialogue to the existing learning safety model.

Inputs:

  • Turn payload (student_id, session_id, language_id, unit_id, step, prompt_text, student_response).
  • AI Grammar Engine validation output.
  • Lexical unlock eligibility output.

Outputs:

  • Stored conversation turn (conversation_turns) including correction, detected pattern, and grammar feedback.
  • Stored lemma usage links (conversation_lemmas) for analytics and consistency checks.
  • Out-of-scope vocabulary signals (unknown_tokens, locked_tokens).

Interactions:

  • Reuses the AI Grammar Engine and Lexical Unlock Graph.
  • Complements (does not replace) the exercise/sentence tables.
  • Supports future dialogue continuation and difficulty adaptation loops.

3.10 Final language engine expansion (additive)

New optional layers (backward compatible):

  • Lexical Role Engine via lemma_roles (lemma_id, role, confidence).
  • Semantic Constraints Engine via lemma_semantic_classes, verb_object_constraints, modifier_constraints.
  • Student Lemma Progress via student_lemma_progress.
  • Difficulty/Frequency extensions via pattern_difficulty and lemma_frequency (complements taxonomy frequency_bands).
  • Collocation Strength/Naturalness via collocation_strength.

Generation preference order when data exists: unlocked lemmas → POS compatibility → lexical role expectations → semantic constraints → collocation strength → pattern difficulty.
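The preference order can be sketched as an optional, non-blocking ranking pipeline. Only three of the six stages are shown (lexical roles, semantic constraints, and pattern difficulty would follow the same pattern), and all names are illustrative:

```javascript
// Optional, non-blocking candidate ranking following the stated preference
// order. Each stage applies only when its signal data exists.
function rankCandidates(candidates, signals) {
  const stages = [
    // 1. unlocked lemmas only (skip when no unlock data is present)
    (c) => (signals.unlocked ? c.filter((x) => signals.unlocked.has(x.lemma)) : c),
    // 2. POS compatibility
    (c) => (signals.requiredPos ? c.filter((x) => x.pos === signals.requiredPos) : c),
    // 5. collocation strength, strongest first
    (c) =>
      signals.collocationStrength
        ? [...c].sort(
            (a, b) =>
              (signals.collocationStrength[b.lemma] ?? 0) -
              (signals.collocationStrength[a.lemma] ?? 0)
          )
        : c,
  ];
  return stages.reduce((acc, stage) => stage(acc), candidates);
}
```

With empty signals the pipeline returns the candidates unchanged, matching the stated fallback behavior.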

Fallback behavior:

  • All new layers are optional and non-blocking.
  • Existing endpoints and legacy lexical flows remain unchanged when these tables are empty or missing.

Conversation and exercise integration:

  • Exercise generation can rank candidates by semantic compatibility/collocation strength.
  • Grammar validation can emit semantic and naturalness hints.
  • Conversation turn processing can update per-lemma student progress.

9. 2026-03-08 audit synchronization

9.1 Database-table audit status

  • lemma_roles: used by languageSignalService and now exposed through lexicalRoleEngineService.
  • student_lemma_progress: used by conversationEngineService and now exposed through studentLemmaProgressEngineService.
  • lemma_frequency: used by languageSignalService and CEFR/difficulty flows.
  • sentence_patterns: used by exerciseGenerationService, aiGrammarEngineService, and sentenceConstructionEngineService.
  • conversation_turns + conversation_lemmas: used by conversationEngineService and exposed via /admin/content-bank/conversation/process-turn.

9.2 Service-flow audit status

Verified system-connected services:

  • lexicalUnlockGraphService: consumed by aiGrammarEngineService, exerciseGenerationService, languageSignalService, and admin content-engine controllers.
  • lexicalGraphGateService: consumed by grammar/exercise services and the content-engine controller.
  • aiGrammarEngineService: consumed by the admin grammar controller and conversation engine.
  • exerciseGenerationService: consumed by the admin grammar controller and AI grammar follow-up generation.
  • conversationEngineService: consumed by the admin grammar controller; exposed by POST /admin/content-bank/conversation/process-turn.
  • contentBankSentencesService: consumed by the content-bank ops controller; exposed by POST /admin/content-bank/sentences/create.

9.3 Canonical engine service modules

New canonical adapters added for discoverability and non-duplication:

  • lexicalRoleEngineService
  • semanticConstraintsEngineService
  • collocationEngineService
  • sentenceConstructionEngineService
  • difficultyEngineService
  • studentLemmaProgressEngineService
  • placementEngineService
  • rubricEngineService

Existing mapped modules (no duplicates created):

  • Pronunciation Engine → pronunciationService
  • Conversation Simulation Engine → conversationEngineService