Lemma-First Content Bank Data Model
Core relationship graph (current canonical model)
lemmas (lemma root)
- 1:N lemma_forms (lemma_id)
- 1:N senses (lemma_id)
- 1:N sense_translations (lemma_id)
- 1:N lemma_assets (lemma_id)
- M:N sentences via sentence_lemmas
- taxonomy mapping via content_item_taxonomies
- chunk composition via vocabulary_components (chunk_id -> lemma_id, ordered)
- chunk composition via chunk_components (chunk_id -> lemma_id, ordered)
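The relationships above can be sketched as a small SQLite schema. This is a minimal illustration, not the canonical DDL. Only the table names, lemma_id, and chunk_id come from the model description; the sentences table layout, the columns form, text, and position, the 'word' default for type, and the choice of SQLite are all assumptions.

```python
import sqlite3

# Hypothetical DDL: table names, lemma_id and chunk_id follow the model;
# every other column (form, text, position) is an illustrative assumption.
SCHEMA = """
CREATE TABLE lemmas (
    id    INTEGER PRIMARY KEY,
    lemma TEXT NOT NULL,
    type  TEXT NOT NULL DEFAULT 'word'   -- 'word' is an assumed default
);
CREATE TABLE lemma_forms (               -- 1:N from lemmas
    id       INTEGER PRIMARY KEY,
    lemma_id INTEGER NOT NULL REFERENCES lemmas(id),
    form     TEXT NOT NULL
);
CREATE TABLE sentences (
    id   INTEGER PRIMARY KEY,
    text TEXT NOT NULL
);
CREATE TABLE sentence_lemmas (           -- M:N join: sentences <-> lemmas
    sentence_id INTEGER NOT NULL REFERENCES sentences(id),
    lemma_id    INTEGER NOT NULL REFERENCES lemmas(id),
    PRIMARY KEY (sentence_id, lemma_id)
);
CREATE TABLE chunk_components (          -- ordered chunk -> lemma edges
    chunk_id INTEGER NOT NULL REFERENCES lemmas(id),
    lemma_id INTEGER NOT NULL REFERENCES lemmas(id),
    position INTEGER NOT NULL,
    PRIMARY KEY (chunk_id, position)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def forms_of(conn: sqlite3.Connection, lemma_id: int) -> list:
    """Return all inflected forms attached to one lemma (the 1:N side)."""
    rows = conn.execute(
        "SELECT form FROM lemma_forms WHERE lemma_id = ? ORDER BY id",
        (lemma_id,),
    )
    return [row[0] for row in rows]
```

With foreign keys switched on, inserting a lemma_forms row whose lemma_id has no matching lemmas row raises sqlite3.IntegrityError, which is one way to keep every child table anchored to the lemma root.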
Taxonomy model (canonical)
- taxonomy_categories
- taxonomy_values
- content_item_taxonomies
Taxonomy categories define systems such as:
- parts_of_speech
- cefr_levels
- frequency_bands
- semantic_domains
- registers
- lemma_types
- grammar_topics
- languages
Universal mapping pattern:
lemmas -> content_item_taxonomies -> taxonomy_values -> taxonomy_categories
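The mapping pattern above amounts to a three-step join. The following sketch resolves every taxonomy value attached to one lemma. The table names follow the pattern; the column names (content_type, content_id, taxonomy_value_id, category_id, name, value) and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Assumed column layout; only the four table names come from the model.
SCHEMA = """
CREATE TABLE taxonomy_categories (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE taxonomy_values (
    id          INTEGER PRIMARY KEY,
    category_id INTEGER NOT NULL REFERENCES taxonomy_categories(id),
    value       TEXT NOT NULL
);
CREATE TABLE lemmas (
    id    INTEGER PRIMARY KEY,
    lemma TEXT NOT NULL
);
CREATE TABLE content_item_taxonomies (
    content_type      TEXT NOT NULL,
    content_id        INTEGER NOT NULL,
    taxonomy_value_id INTEGER NOT NULL REFERENCES taxonomy_values(id)
);
"""

# lemmas -> content_item_taxonomies -> taxonomy_values -> taxonomy_categories
QUERY = """
SELECT c.name, v.value
FROM lemmas AS l
JOIN content_item_taxonomies AS cit
  ON cit.content_type = 'lemma' AND cit.content_id = l.id
JOIN taxonomy_values AS v ON v.id = cit.taxonomy_value_id
JOIN taxonomy_categories AS c ON c.id = v.category_id
WHERE l.id = ?
ORDER BY c.name, v.value
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executescript("""
        INSERT INTO taxonomy_categories (id, name)
        VALUES (1, 'parts_of_speech'), (2, 'cefr_levels');
        INSERT INTO taxonomy_values (id, category_id, value)
        VALUES (10, 1, 'verb'), (20, 2, 'A1');
        INSERT INTO lemmas (id, lemma) VALUES (1, 'go'), (2, 'house');
        INSERT INTO content_item_taxonomies
            (content_type, content_id, taxonomy_value_id)
        VALUES ('lemma', 1, 10), ('lemma', 1, 20);
    """)
    return conn


def taxonomy_for_lemma(conn: sqlite3.Connection, lemma_id: int) -> list:
    """Return (category name, value) pairs for one lemma, sorted by category."""
    return conn.execute(QUERY, (lemma_id,)).fetchall()
```

Because the join goes through content_item_taxonomies rather than through columns on lemmas, adding a new taxonomy category in this sketch needs new rows only, not a schema change.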
Lemma schema fields documented for engine integration
- id
- lemma
- lemma_normalized
- type
- language_id
- base_lang
- language_code
- pos
- pos_id
- lemma_type_id
- cefr_level
- cefr_level_id
- frequency_rank
- introduced_chapter
- introduced_unit_id
- introduced_step
- difficulty_score
- editorial_status
- created_at
- updated_at
Progression note:
- introduced_unit_id + introduced_step are consumed by the Lexical Unlock Graph.
Difficulty note:
- difficulty_score is normalized in [0.0, 1.0].
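A consumer of these fields may want to enforce the [0.0, 1.0] range before handing rows to downstream engines. The sketch below is one possible guard, not the engine's actual code. The class name LemmaProgression, the integer types for introduced_unit_id and introduced_step, and the min-max scaling in normalize_difficulty are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LemmaProgression:
    """Progression and difficulty fields of one lemma row (types assumed)."""

    lemma: str
    introduced_unit_id: Optional[int]
    introduced_step: Optional[int]
    difficulty_score: float

    def __post_init__(self) -> None:
        # Reject scores outside the documented normalized range.
        if not 0.0 <= self.difficulty_score <= 1.0:
            raise ValueError(
                f"difficulty_score {self.difficulty_score} is outside [0.0, 1.0]"
            )


def normalize_difficulty(raw: float, lo: float, hi: float) -> float:
    """Min-max scale a raw score into [0.0, 1.0], clamping out-of-range input.

    Min-max scaling is an assumed normalization; the source only states the
    target range, not how the score is produced.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (raw - lo) / (hi - lo)))
```

For instance, LemmaProgression("go", 1, 2, normalize_difficulty(5, 0, 10)) yields a row with difficulty_score 0.5, while passing 1.5 directly raises ValueError.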
Chunk graph model
Chunks are represented in lemmas with type='chunk'.
Chunk edges are defined via:
- vocabulary_components
- chunk_components
All chunk components must reference valid lemma rows.
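The validity rule can be checked with a pair of left joins. The following is a minimal sketch assuming a chunk_components table with chunk_id, lemma_id, and a hypothetical position column for ordering. It flags any edge whose chunk_id is not a lemmas row with type='chunk' or whose lemma_id has no lemmas row at all.

```python
import sqlite3

# Assumed minimal layout; the position column is a hypothetical name for
# the ordering that the model calls "ordered".
SCHEMA = """
CREATE TABLE lemmas (
    id    INTEGER PRIMARY KEY,
    lemma TEXT NOT NULL,
    type  TEXT NOT NULL
);
CREATE TABLE chunk_components (
    chunk_id INTEGER NOT NULL,
    lemma_id INTEGER NOT NULL,
    position INTEGER NOT NULL
);
"""


def invalid_chunk_components(conn: sqlite3.Connection) -> list:
    """Return (chunk_id, lemma_id, position) edges that break the rule."""
    sql = """
    SELECT cc.chunk_id, cc.lemma_id, cc.position
    FROM chunk_components AS cc
    LEFT JOIN lemmas AS chunk
      ON chunk.id = cc.chunk_id AND chunk.type = 'chunk'
    LEFT JOIN lemmas AS part ON part.id = cc.lemma_id
    WHERE chunk.id IS NULL OR part.id IS NULL
    ORDER BY cc.chunk_id, cc.position
    """
    return conn.execute(sql).fetchall()


def components_of(conn: sqlite3.Connection, chunk_id: int) -> list:
    """Return the lemma strings of a chunk's valid components, in order."""
    sql = """
    SELECT part.lemma
    FROM chunk_components AS cc
    JOIN lemmas AS part ON part.id = cc.lemma_id
    WHERE cc.chunk_id = ?
    ORDER BY cc.position
    """
    return [row[0] for row in conn.execute(sql, (chunk_id,))]
```

The same query shape would apply to vocabulary_components, assuming it shares the chunk_id -> lemma_id layout.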
Additional lexical extensions
- lemma_roles
- lemma_frequency
- collocation data / semantic constraints as additive layers
verb_conjugations fields include tense_key, person_key, form, and is_irregular.
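As a rough illustration, a verb_conjugations row could be modeled as below. The four documented fields are kept as named; the lemma_id link, the Python types, and the sample key values such as "present" and "3sg" are assumptions, since the source does not list the allowed keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbConjugation:
    """One conjugated form; lemma_id is an assumed link back to lemmas."""

    lemma_id: int
    tense_key: str
    person_key: str
    form: str
    is_irregular: bool


def index_conjugations(rows: list) -> dict:
    """Map (lemma_id, tense_key, person_key) to its conjugation row."""
    return {(r.lemma_id, r.tense_key, r.person_key): r for r in rows}


def irregular_forms(rows: list) -> list:
    """Return the irregular surface forms, sorted alphabetically."""
    return sorted(r.form for r in rows if r.is_irregular)


# Sample rows with made-up key values.
SAMPLE = [
    VerbConjugation(1, "present", "3sg", "goes", False),
    VerbConjugation(1, "past", "3sg", "went", True),
]
```

Indexing by (lemma_id, tense_key, person_key) gives constant-time lookup of a single form, for example index_conjugations(SAMPLE)[(1, "past", "3sg")].form.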
Legacy compatibility
Legacy entities and mirror fields may still exist for compatibility (for example semantic_relations, tts_assets, and mirror classification columns in vocabulary). They remain non-breaking support layers and are not canonical taxonomy sources.