Lemma-First Content Bank Data Model

Core relationship graph (current canonical model)

  • lemmas (lemma root)
  • 1:N lemma_forms (lemma_id)
  • 1:N senses (lemma_id)
  • 1:N sense_translations (lemma_id)
  • 1:N lemma_assets (lemma_id)
  • M:N sentences via sentence_lemmas
  • taxonomy mapping via content_item_taxonomies
  • chunk composition via vocabulary_components (chunk_id -> lemma_id, ordered)
  • chunk composition via chunk_components (chunk_id -> lemma_id, ordered)
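
The parent/child relationships above can be sketched as a small in-memory SQLite schema. This is only an illustration: the table names and the lemma_id foreign keys come from the list above, while the remaining columns (form, gloss, text), the sample rows, and the type value 'word' are assumptions made for the example.

```python
import sqlite3

# Hypothetical minimal DDL. Only the table names and the lemma_id links are
# taken from the relationship graph; every other column is an assumption.
SCHEMA = """
CREATE TABLE lemmas (id INTEGER PRIMARY KEY, lemma TEXT NOT NULL, type TEXT);
CREATE TABLE lemma_forms (
    id INTEGER PRIMARY KEY,
    lemma_id INTEGER NOT NULL REFERENCES lemmas(id),
    form TEXT NOT NULL
);
CREATE TABLE senses (
    id INTEGER PRIMARY KEY,
    lemma_id INTEGER NOT NULL REFERENCES lemmas(id),
    gloss TEXT
);
CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
CREATE TABLE sentence_lemmas (
    sentence_id INTEGER NOT NULL REFERENCES sentences(id),
    lemma_id INTEGER NOT NULL REFERENCES lemmas(id),
    PRIMARY KEY (sentence_id, lemma_id)
);
"""


def build_demo_db():
    """Create the sketch schema and load a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the lemma_id references
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO lemmas (id, lemma, type) VALUES (1, 'hablar', 'word')")
    conn.executemany(
        "INSERT INTO lemma_forms (lemma_id, form) VALUES (?, ?)",
        [(1, "hablo"), (1, "hablas")],
    )
    conn.execute("INSERT INTO sentences (id, text) VALUES (10, 'Hablo mucho.')")
    conn.execute("INSERT INTO sentence_lemmas (sentence_id, lemma_id) VALUES (10, 1)")
    return conn


def forms_for(conn, lemma_id):
    """Return all forms of one lemma (the 1:N lemma_forms relationship)."""
    rows = conn.execute(
        "SELECT form FROM lemma_forms WHERE lemma_id = ? ORDER BY id", (lemma_id,)
    )
    return [row[0] for row in rows]


conn = build_demo_db()
print(forms_for(conn, 1))  # ['hablo', 'hablas']
```

With foreign keys enabled, inserting a lemma_forms row whose lemma_id has no matching lemmas row raises sqlite3.IntegrityError, which is one way to keep child tables anchored to the lemma root.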

Taxonomy model (canonical)

  • taxonomy_categories
  • taxonomy_values
  • content_item_taxonomies

Taxonomy categories define systems such as:

  • parts_of_speech
  • cefr_levels
  • frequency_bands
  • semantic_domains
  • registers
  • lemma_types
  • grammar_topics
  • languages

Universal mapping pattern:

lemmas -> content_item_taxonomies -> taxonomy_values -> taxonomy_categories
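
The mapping chain above can be read as a three-way join. The following is a minimal sketch, not the actual schema: the columns content_item_type, content_item_id, taxonomy_value_id, category_id, name, and value are assumed for illustration, as are the sample rows.

```python
import sqlite3


def build_taxonomy_db():
    """Create a hypothetical taxonomy schema with one classified lemma."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lemmas (id INTEGER PRIMARY KEY, lemma TEXT NOT NULL);
        CREATE TABLE taxonomy_categories (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE taxonomy_values (
            id INTEGER PRIMARY KEY,
            category_id INTEGER NOT NULL REFERENCES taxonomy_categories(id),
            value TEXT NOT NULL
        );
        -- Assumed polymorphic mapping: (content_item_type, content_item_id).
        CREATE TABLE content_item_taxonomies (
            content_item_type TEXT NOT NULL,
            content_item_id INTEGER NOT NULL,
            taxonomy_value_id INTEGER NOT NULL REFERENCES taxonomy_values(id)
        );
        INSERT INTO lemmas VALUES (1, 'casa');
        INSERT INTO taxonomy_categories VALUES (1, 'parts_of_speech'), (2, 'cefr_levels');
        INSERT INTO taxonomy_values VALUES (1, 1, 'noun'), (2, 2, 'A1');
        INSERT INTO content_item_taxonomies VALUES ('lemma', 1, 1), ('lemma', 1, 2);
        """
    )
    return conn


def taxonomy_for_lemma(conn, lemma_id):
    """Walk lemmas -> content_item_taxonomies -> taxonomy_values -> taxonomy_categories."""
    rows = conn.execute(
        """
        SELECT c.name, v.value
        FROM lemmas AS l
        JOIN content_item_taxonomies AS m
          ON m.content_item_type = 'lemma' AND m.content_item_id = l.id
        JOIN taxonomy_values AS v ON v.id = m.taxonomy_value_id
        JOIN taxonomy_categories AS c ON c.id = v.category_id
        WHERE l.id = ?
        ORDER BY c.name, v.value
        """,
        (lemma_id,),
    )
    return rows.fetchall()


print(taxonomy_for_lemma(build_taxonomy_db(), 1))
# [('cefr_levels', 'A1'), ('parts_of_speech', 'noun')]
```

Returning (category, value) pairs rather than a dictionary keeps the sketch valid for categories that may carry several values per lemma, such as semantic_domains.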

Lemma schema fields documented for engine integration

  • id
  • lemma
  • lemma_normalized
  • type
  • language_id
  • base_lang
  • language_code
  • pos
  • pos_id
  • lemma_type_id
  • cefr_level
  • cefr_level_id
  • frequency_rank
  • introduced_chapter
  • introduced_unit_id
  • introduced_step
  • difficulty_score
  • editorial_status
  • created_at
  • updated_at

Progression note: introduced_unit_id and introduced_step are consumed by the Lexical Unlock Graph.

Difficulty note: difficulty_score is normalized to the range [0.0, 1.0].
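
A consumer of this field might guard the stated range before use. The helper below is a hypothetical sketch (the function name and the choice to reject rather than clamp out-of-range values are assumptions, not documented engine behavior).

```python
import math


def validate_difficulty_score(score):
    """Return score as a float if it lies in [0.0, 1.0]; raise ValueError otherwise."""
    value = float(score)
    # NaN fails every comparison, so it is rejected explicitly for clarity.
    if math.isnan(value) or not 0.0 <= value <= 1.0:
        raise ValueError(f"difficulty_score must be in [0.0, 1.0], got {score!r}")
    return value


print(validate_difficulty_score(0.35))  # 0.35
```

Rejecting instead of clamping surfaces upstream normalization bugs early; a pipeline that prefers tolerance could clamp with min(1.0, max(0.0, value)) instead.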

Chunk graph model

Chunks are represented in lemmas with type='chunk'.

Chunk edges are defined via:

  • vocabulary_components
  • chunk_components

All chunk components must reference valid lemma rows.
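
The validity rule above can be checked with a query that looks for dangling edges. This sketch assumes that chunk_id points at a lemmas row with type='chunk', that lemma_id points at any lemmas row, and that ordering is stored in a column named position; the column name and the sample data are illustrative only.

```python
import sqlite3

# Only these two edge tables are named in the model.
EDGE_TABLES = {"chunk_components", "vocabulary_components"}


def find_dangling_chunk_components(conn, table="chunk_components"):
    """Return (chunk_id, lemma_id, position) rows whose chunk or component is invalid."""
    if table not in EDGE_TABLES:
        raise ValueError(f"unknown edge table: {table!r}")
    query = f"""
        SELECT cc.chunk_id, cc.lemma_id, cc.position
        FROM {table} AS cc
        LEFT JOIN lemmas AS chunk
          ON chunk.id = cc.chunk_id AND chunk.type = 'chunk'
        LEFT JOIN lemmas AS part ON part.id = cc.lemma_id
        WHERE chunk.id IS NULL OR part.id IS NULL
        ORDER BY cc.chunk_id, cc.position
    """
    return conn.execute(query).fetchall()


def build_chunk_db():
    """Create demo data: one chunk with two valid components and one dangling one."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lemmas (id INTEGER PRIMARY KEY, lemma TEXT NOT NULL, type TEXT);
        CREATE TABLE chunk_components (
            chunk_id INTEGER NOT NULL,
            lemma_id INTEGER NOT NULL,
            position INTEGER NOT NULL
        );
        INSERT INTO lemmas VALUES
            (1, 'por', 'word'), (2, 'favor', 'word'), (3, 'por favor', 'chunk');
        -- lemma_id 99 does not exist, so the third edge is dangling.
        INSERT INTO chunk_components VALUES (3, 1, 1), (3, 2, 2), (3, 99, 3);
        """
    )
    return conn


print(find_dangling_chunk_components(build_chunk_db()))  # [(3, 99, 3)]
```

The table name is checked against a fixed allow-list before being interpolated, since SQL parameters cannot stand in for identifiers.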

Additional lexical extensions

  • lemma_roles
  • lemma_frequency
  • collocation data / semantic constraints as additive layers

verb_conjugations fields include tense_key, person_key, form, and is_irregular.
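
As an illustration of how these fields might be consumed, the sketch below models one verb_conjugations row and indexes rows for lookup. The lemma_id link and the key values 'present', '1sg', and '2sg' are assumptions; only tense_key, person_key, form, and is_irregular are documented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbConjugation:
    lemma_id: int        # assumed link back to the lemma root
    tense_key: str       # documented field; example values are hypothetical
    person_key: str      # documented field; example values are hypothetical
    form: str            # documented field
    is_irregular: bool   # documented field


def index_conjugations(rows):
    """Index rows by (lemma_id, tense_key, person_key) for constant-time lookup."""
    return {(r.lemma_id, r.tense_key, r.person_key): r for r in rows}


rows = [
    VerbConjugation(1, "present", "1sg", "soy", True),
    VerbConjugation(1, "present", "2sg", "eres", True),
]
table = index_conjugations(rows)
print(table[(1, "present", "1sg")].form)  # soy
```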

Legacy compatibility

Legacy entities and mirror fields may still exist for compatibility (for example semantic_relations, tts_assets, and mirror classification columns in vocabulary). They remain non-breaking support layers and are not canonical taxonomy sources.