Lemma-First Content Bank Data Model

Core relationship graph (current canonical model)

lemmas (lemma root)
1:N lemma_forms (lemma_id)
1:N senses (lemma_id)
1:N sense_translations (lemma_id)
1:N lemma_assets (lemma_id)
M:N sentences via sentence_lemmas
taxonomy mapping via content_item_taxonomies
chunk composition via vocabulary_components (chunk_id -> lemma_id, ordered)
chunk composition via chunk_components (chunk_id -> lemma_id, ordered)
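
As a rough illustration of the 1:N and M:N edges listed above, the following sketch builds a tiny in-memory SQLite database. Only the table names and the lemma_id foreign key come from the list; every other column (form, text, sentence_id), the type value 'word', and the sample rows are assumptions made for the example. It covers lemma_forms and sentence_lemmas only; the other 1:N tables would presumably follow the same shape.

```python
import sqlite3

# Table names and the lemma_id foreign key follow the relationship list;
# all remaining columns are illustrative assumptions.
SCHEMA = """
CREATE TABLE lemmas (
    id    INTEGER PRIMARY KEY,
    lemma TEXT NOT NULL,
    type  TEXT NOT NULL
);
CREATE TABLE lemma_forms (
    id       INTEGER PRIMARY KEY,
    lemma_id INTEGER NOT NULL REFERENCES lemmas(id),
    form     TEXT NOT NULL
);
CREATE TABLE sentences (
    id   INTEGER PRIMARY KEY,
    text TEXT NOT NULL
);
CREATE TABLE sentence_lemmas (
    sentence_id INTEGER NOT NULL REFERENCES sentences(id),
    lemma_id    INTEGER NOT NULL REFERENCES lemmas(id),
    PRIMARY KEY (sentence_id, lemma_id)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the sketch schema and load a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO lemmas (id, lemma, type) VALUES (?, ?, ?)",
        [(1, "hablar", "word"), (2, "español", "word")],
    )
    conn.executemany(
        "INSERT INTO lemma_forms (lemma_id, form) VALUES (?, ?)",
        [(1, "hablo"), (1, "hablas")],
    )
    conn.execute("INSERT INTO sentences (id, text) VALUES (1, 'Hablo español.')")
    conn.executemany(
        "INSERT INTO sentence_lemmas (sentence_id, lemma_id) VALUES (?, ?)",
        [(1, 1), (1, 2)],
    )
    conn.commit()
    return conn


def forms_for(conn: sqlite3.Connection, lemma_id: int) -> list:
    """1:N edge: all surface forms attached to one lemma."""
    rows = conn.execute(
        "SELECT form FROM lemma_forms WHERE lemma_id = ? ORDER BY form",
        (lemma_id,),
    )
    return [row[0] for row in rows]


def lemmas_in_sentence(conn: sqlite3.Connection, sentence_id: int) -> list:
    """M:N edge: lemmas linked to a sentence through sentence_lemmas."""
    rows = conn.execute(
        """
        SELECT l.lemma
        FROM sentence_lemmas AS sl
        JOIN lemmas AS l ON l.id = sl.lemma_id
        WHERE sl.sentence_id = ?
        ORDER BY l.id
        """,
        (sentence_id,),
    )
    return [row[0] for row in rows]
```

With foreign keys enabled, inserting a lemma_forms row whose lemma_id has no matching lemma raises sqlite3.IntegrityError, which is one way to keep the lemma as the root of every child record.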

Taxonomy model (canonical)

taxonomy_categories
taxonomy_values
content_item_taxonomies

Taxonomy categories define systems such as:

- parts_of_speech
- cefr_levels
- frequency_bands
- semantic_domains
- registers
- lemma_types
- grammar_topics
- languages

Universal mapping pattern:

lemmas -> content_item_taxonomies -> taxonomy_values -> taxonomy_categories
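
The mapping chain above can be read as a three-way join. The sketch below is one possible shape, not the actual schema: the four table names come from this page, while the column names (name, value, category_id, content_type, content_id, taxonomy_value_id) and the sample rows are assumptions.

```python
import sqlite3
from collections import defaultdict

# Column names are hypothetical; only the table names come from the model.
SCHEMA = """
CREATE TABLE lemmas (id INTEGER PRIMARY KEY, lemma TEXT NOT NULL);
CREATE TABLE taxonomy_categories (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE taxonomy_values (
    id          INTEGER PRIMARY KEY,
    category_id INTEGER NOT NULL REFERENCES taxonomy_categories(id),
    value       TEXT NOT NULL
);
CREATE TABLE content_item_taxonomies (
    content_type      TEXT NOT NULL,
    content_id        INTEGER NOT NULL,
    taxonomy_value_id INTEGER NOT NULL REFERENCES taxonomy_values(id),
    PRIMARY KEY (content_type, content_id, taxonomy_value_id)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the sketch schema with two lemmas, one of them classified."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO lemmas VALUES (?, ?)", [(1, "hablar"), (2, "casa")])
    conn.executemany(
        "INSERT INTO taxonomy_categories VALUES (?, ?)",
        [(1, "parts_of_speech"), (2, "cefr_levels")],
    )
    conn.executemany(
        "INSERT INTO taxonomy_values VALUES (?, ?, ?)",
        [(10, 1, "verb"), (20, 2, "A1")],
    )
    conn.executemany(
        "INSERT INTO content_item_taxonomies VALUES (?, ?, ?)",
        [("lemma", 1, 10), ("lemma", 1, 20)],
    )
    conn.commit()
    return conn


def taxonomy_for_lemma(conn: sqlite3.Connection, lemma_id: int) -> dict:
    """Walk lemmas -> content_item_taxonomies -> taxonomy_values -> taxonomy_categories."""
    rows = conn.execute(
        """
        SELECT tc.name, tv.value
        FROM lemmas AS l
        JOIN content_item_taxonomies AS cit
          ON cit.content_type = 'lemma' AND cit.content_id = l.id
        JOIN taxonomy_values AS tv ON tv.id = cit.taxonomy_value_id
        JOIN taxonomy_categories AS tc ON tc.id = tv.category_id
        WHERE l.id = ?
        ORDER BY tc.name, tv.value
        """,
        (lemma_id,),
    )
    grouped = defaultdict(list)
    for category, value in rows:
        grouped[category].append(value)
    return dict(grouped)
```

Grouping by category keeps the caller independent of how many taxonomy systems exist; a new category would appear as a new dictionary key without a schema change.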

Lemma schema fields documented for engine integration

id
lemma
lemma_normalized
type
language_id
base_lang
language_code
pos
pos_id
lemma_type_id
cefr_level
cefr_level_id
frequency_rank
introduced_chapter
introduced_unit_id
introduced_step
difficulty_score
editorial_status
created_at
updated_at

Progression note: introduced_unit_id and introduced_step are consumed by the Lexical Unlock Graph.

Difficulty note: difficulty_score is normalized to the range [0.0, 1.0].
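
A consumer of these fields might want to reject rows that violate the difficulty range before using them. The following is a minimal sketch under stated assumptions: the class name LemmaRecord is hypothetical, only a subset of the documented fields is modeled, and the Python types (including nullable progression fields) are guesses, since this page does not state column types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LemmaRecord:
    """Hypothetical in-memory view of a subset of the lemma fields."""

    id: int
    lemma: str
    lemma_normalized: str
    type: str
    difficulty_score: float
    # Progression fields; treating them as optional integers is an assumption.
    introduced_unit_id: Optional[int] = None
    introduced_step: Optional[int] = None

    def __post_init__(self) -> None:
        # The chained comparison is False for out-of-range values and for NaN.
        if not 0.0 <= self.difficulty_score <= 1.0:
            raise ValueError(
                f"difficulty_score must be in [0.0, 1.0], got {self.difficulty_score!r}"
            )
```

Constructing `LemmaRecord(1, "Hablar", "hablar", "word", 0.35)` succeeds, while a score such as 1.5 raises ValueError at construction time.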

Chunk graph model

Chunks are represented in lemmas with type='chunk'.

Chunk edges are defined via:

- vocabulary_components
- chunk_components

All chunk components must reference valid lemma rows.
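
The validity rule above can be checked with an anti-join. This sketch assumes a position column for the ordering (the model only says components are ordered) and uses hypothetical sample rows, including one deliberately dangling reference. Foreign keys are left unenforced here so that the bad row can be inserted and then detected.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Load a chunk 'por favor' with two valid components and one dangling one."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lemmas (
            id    INTEGER PRIMARY KEY,
            lemma TEXT NOT NULL,
            type  TEXT NOT NULL
        );
        -- 'position' is an assumed name for the ordering column.
        CREATE TABLE chunk_components (
            chunk_id INTEGER NOT NULL,
            lemma_id INTEGER NOT NULL,
            position INTEGER NOT NULL,
            PRIMARY KEY (chunk_id, position)
        );
        INSERT INTO lemmas VALUES
            (1, 'por', 'word'),
            (2, 'favor', 'word'),
            (3, 'por favor', 'chunk');
        INSERT INTO chunk_components VALUES
            (3, 1, 1),
            (3, 2, 2),
            (3, 99, 3);  -- lemma 99 does not exist
        """
    )
    return conn


def ordered_components(conn: sqlite3.Connection, chunk_id: int) -> list:
    """Component lemmas of a chunk, in order; dangling references are skipped."""
    rows = conn.execute(
        """
        SELECT l.lemma
        FROM chunk_components AS cc
        JOIN lemmas AS c ON c.id = cc.chunk_id AND c.type = 'chunk'
        JOIN lemmas AS l ON l.id = cc.lemma_id
        WHERE cc.chunk_id = ?
        ORDER BY cc.position
        """,
        (chunk_id,),
    )
    return [row[0] for row in rows]


def dangling_components(conn: sqlite3.Connection) -> list:
    """(chunk_id, lemma_id) pairs whose lemma_id has no row in lemmas."""
    return conn.execute(
        """
        SELECT cc.chunk_id, cc.lemma_id
        FROM chunk_components AS cc
        LEFT JOIN lemmas AS l ON l.id = cc.lemma_id
        WHERE l.id IS NULL
        ORDER BY cc.chunk_id, cc.position
        """
    ).fetchall()
```

An empty result from `dangling_components` would indicate that every component references a valid lemma row; the same query shape would apply to vocabulary_components if it shares these columns.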

Additional lexical extensions

lemma_roles
lemma_frequency
collocation data / semantic constraints as additive layers

verb_conjugations fields include tense_key, person_key, form, and is_irregular.
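
To make the verb_conjugations fields concrete, here is a small lookup sketch. The four field names come from the sentence above; the lemma_id column, the composite key, and the key values 'present' and '1sg' are assumptions chosen for illustration.

```python
import sqlite3
from typing import Optional, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create a hypothetical verb_conjugations table with two sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE verb_conjugations (
            lemma_id     INTEGER NOT NULL,  -- assumed link back to lemmas
            tense_key    TEXT NOT NULL,
            person_key   TEXT NOT NULL,
            form         TEXT NOT NULL,
            is_irregular INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (lemma_id, tense_key, person_key)
        );
        INSERT INTO verb_conjugations VALUES
            (1, 'present', '1sg', 'soy', 1),
            (2, 'present', '1sg', 'hablo', 0);
        """
    )
    return conn


def conjugate(
    conn: sqlite3.Connection, lemma_id: int, tense_key: str, person_key: str
) -> Optional[Tuple[str, bool]]:
    """Return (form, is_irregular) for one cell of the paradigm, or None."""
    row = conn.execute(
        """
        SELECT form, is_irregular
        FROM verb_conjugations
        WHERE lemma_id = ? AND tense_key = ? AND person_key = ?
        """,
        (lemma_id, tense_key, person_key),
    ).fetchone()
    if row is None:
        return None
    form, is_irregular = row
    return form, bool(is_irregular)  # SQLite stores booleans as 0/1
```

Returning None for a missing cell lets callers distinguish "no stored form" from a regular form, which matters if only irregular or curated forms are stored.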

Legacy compatibility

Legacy entities and mirror fields may still exist for compatibility (for example, semantic_relations, tts_assets, and mirror classification columns in vocabulary). They remain as non-breaking support layers and are not canonical taxonomy sources.