The Problem: Why Auto-Generating Taxonomies Matters
Every enterprise struggles with the same hidden cost: fragmented data schemas spread across dozens of databases, each using its own vocabulary. A "customer" in your CRM is a "client" in billing, a "user" in authentication, and a "patron" in analytics. This semantic fragmentation creates real pain:
- Manual Taxonomy Reconciliation — Data architects spend weeks mapping schemas to standardized vocabularies, only to repeat the process when schemas evolve
- Inconsistent Data Governance — Without unified taxonomies, compliance audits become archaeology expeditions through conflicting metadata
- Integration Bottlenecks — Every new data source requires human experts to define how it fits into the enterprise knowledge graph
- Vocabulary Drift — As teams independently extend schemas, the gap between local data models and enterprise ontologies widens
The cost is not just time. It is strategic agility. Organizations cannot build intelligent data products when their semantic foundation requires constant manual intervention. TaxonomyLLM automates this reconciliation by learning the deep patterns that connect database schemas to RDF taxonomy structures.
The Solution: LLM-Based RDF Generation
TaxonomyLLM treats taxonomy generation as a sequence-to-sequence translation problem: given a logical schema (the source language), produce valid RDF taxonomy triples (the target language). But unlike typical text translation, this requires understanding both relational structure and ontological semantics.
- Schema Parsing — Extract tables, columns, and datatypes from SQL CREATE statements
- Schema Encoding — Convert structural features into disentangled embeddings
- TopoAttention — Separate schema-structure reasoning from taxonomy-position reasoning
- Taxonomy Decoding — Generate RDF triples with proper ontological relationships
- RDF Materialization — Assemble valid Turtle/RDF-XML with constraint enforcement
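The first phase above can be sketched in a few lines. This is a deliberately simplified regex-based stand-in (the full system uses a real SQL parser and also captures constraints and foreign keys); `parse_create_statements` is an illustrative name, not the project's API:

```python
import re

def parse_create_statements(sql: str) -> dict:
    """Extract table names and (column, datatype) pairs from CREATE TABLE DDL.

    A minimal stand-in for the Schema Parsing phase, assuming simple
    single-schema DDL with one column definition per comma-separated entry.
    """
    tables = {}
    for name, body in re.findall(r"CREATE TABLE (\w+)\s*\((.*?)\);", sql, re.S):
        columns = []
        for entry in body.split(","):
            parts = entry.split()
            if len(parts) >= 2:
                # First token is the column name, second its datatype
                columns.append((parts[0], parts[1]))
        tables[name] = columns
    return tables

ddl = """
CREATE TABLE Member (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);
"""
print(parse_create_statements(ddl))
# {'Member': [('id', 'INT'), ('name', 'VARCHAR(100)')]}
```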
The key insight is disentanglement. By separating what the schema contains (structural features) from where elements belong in the taxonomy (positional features), the model can reason about each concern independently before combining them for generation.
Architecture: Schema Encoding Meets Taxonomy Decoding
Why T5? Model Selection Rationale
We evaluated GPT-3, PaLM, BLOOM, and T5 against three critical requirements: schema assimilation, relational reasoning, and RDF constraint adherence. T5's encoder-decoder architecture proved superior for this structured generation task because it maintains full bidirectional context during encoding while generating output autoregressively.
The Disentangled Embedding Strategy
Standard transformers conflate all features into unified representations. For taxonomy generation, this creates a problem: the model must simultaneously understand that a "customer_id" column is a primary key (structural) and should map to PersonalIdentifier in the taxonomy (positional). Our approach separates these concerns:
Input Schema → Parse(CREATE TABLE statements)
                      ↓
      ┌───────────────┴───────────────┐
      ↓                               ↓
Es (Structural)                Ep (Positional)
"column types,                 "taxonomy hierarchy,
 constraints,                   parent-child
 foreign keys"                  relationships"
      ↓                               ↓
Hs = TopoAttention(Es)         Hp = TopoAttention(Ep)
      ↓                               ↓
      └───────────────┬───────────────┘
                      ↓
      Generation(Hs, Hp) → RDF Triples
TopoAttention: The Core Innovation
The TopoAttention mechanism computes separate self-attention matrices for structural versus positional reasoning. This disentanglement allows the model to learn that while "email" and "phone" have similar structural properties (both are VARCHAR columns), they occupy different positions in a privacy-aware taxonomy.
// Separate attention for structure and position
HsAttention = softmax(Qs × Ks^T / √d) × Vs // Schema structure focus
HpAttention = softmax(Qp × Kp^T / √d) × Vp // Taxonomy position focus
// Combined hidden states for generation
H_combined = LayerNorm(Hs + Hp)
Output = TaxonomyDecoder(H_combined)
This architecture enables precise schema-element-to-taxonomy-component translation. The model learns that a "timestamp" column should map to TemporalMarker, not because of string similarity, but because it has learned the topological relationship between temporal SQL types and time-related taxonomy nodes.
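The attention equations above can be made concrete with a toy NumPy sketch. This is a single-head illustration with random weights, not the production implementation; the per-row standardization standing in for LayerNorm and the helper names are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topo_attention(E, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over one embedding stream."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
n, d = 4, 8  # 4 schema elements, embedding width 8
Es = rng.normal(size=(n, d))  # structural stream
Ep = rng.normal(size=(n, d))  # positional stream
W = lambda: rng.normal(size=(d, d)) / np.sqrt(d)  # fresh random projection

Hs = topo_attention(Es, W(), W(), W())  # schema-structure focus
Hp = topo_attention(Ep, W(), W(), W())  # taxonomy-position focus

# Combine the two streams for generation (LayerNorm approximated by
# per-row standardization of the summed hidden states)
H = Hs + Hp
H_combined = (H - H.mean(-1, keepdims=True)) / H.std(-1, keepdims=True)
print(H_combined.shape)  # (4, 8)
```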
The Complete Algorithm
The end-to-end process follows four distinct phases:
def generate_taxonomy(schema: str) -> Graph:
    # Phase 1: Parse and encode schema structure
    parsed = parse_sql_schema(schema)
    Es = structural_embedding(parsed.tables, parsed.columns, parsed.constraints)
    Ep = positional_embedding(parsed.relationships, parsed.hierarchy_hints)

    # Phase 2: Disentangled attention reasoning
    Hs = topo_attention(Es)  # Structure-focused hidden states
    Hp = topo_attention(Ep)  # Position-focused hidden states

    # Phase 3: Taxonomy generation
    rdf_triples = decoder.generate(Hs, Hp)

    # Phase 4: RDF materialization with constraint enforcement
    graph = Graph()
    graph.parse(data=rdf_triples, format="turtle")
    validate_rdfs_constraints(graph)
    return graph
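The `validate_rdfs_constraints` helper is referenced but not defined above. One minimal interpretation of what it must check, sketched here over plain (subject, predicate, object) tuples rather than an RDFLib graph, is that hierarchy edges form no cycles:

```python
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"

def validate_rdfs_constraints(triples):
    """Raise ValueError if the subClassOf/subPropertyOf hierarchy is cyclic.

    A hedged sketch of the constraint check, not the project's actual
    validator; real enterprise validation would check far more.
    """
    edges = {}
    for s, p, o in triples:
        if p in (SUBCLASS, SUBPROP):
            edges.setdefault(s, set()).add(o)

    def walk(start, node, seen):
        # Depth-first search: revisiting the start node means a cycle
        for nxt in edges.get(node, ()):
            if nxt == start:
                raise ValueError(f"cycle in hierarchy through {start}")
            if nxt not in seen:
                seen.add(nxt)
                walk(start, nxt, seen)

    for node in edges:
        walk(node, node, set())
    return True

triples = [
    ("ex:Member", "rdfs:subClassOf", "ex:PersonalInformation"),
    ("ex:MemberEmail", "rdfs:subPropertyOf", "ex:ContactInformation"),
]
print(validate_rdfs_constraints(triples))  # True
```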
Example: From Schema to Taxonomy
Consider a simple membership database:
CREATE TABLE Member (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(255)
);

CREATE TABLE Activity (
    id INT PRIMARY KEY,
    member_id INT REFERENCES Member(id),
    type VARCHAR(50),
    timestamp DATETIME
);
TaxonomyLLM learns the topological compatibility between schema elements and taxonomy concepts through attention scoring:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <http://example.org/taxonomy#> .

ex:Member rdfs:subClassOf ex:PersonalInformation ;
    rdfs:label "Member Entity" .

ex:MemberName rdfs:subPropertyOf ex:PersonalIdentifier ;
    rdfs:domain ex:Member .

ex:MemberEmail rdfs:subPropertyOf ex:ContactInformation ;
    rdfs:domain ex:Member .

ex:Activity rdfs:subClassOf ex:ActivityEvent ;
    rdfs:label "Activity Record" .

ex:ActivityTimestamp rdfs:subPropertyOf ex:TemporalMarker ;
    rdfs:domain ex:Activity .
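Once materialized, the taxonomy supports transitive classification queries, e.g. finding every category an element falls under. A pure-Python sketch over hierarchy edges like those above (the `ex:PersonalIdentifier → ex:PersonalInformation` edge is an assumed extra link for illustration; on a real RDFLib graph, `Graph.transitive_objects` performs the same walk):

```python
def ancestors(node, edges):
    """All taxonomy nodes reachable upward from node via hierarchy edges."""
    out, stack = set(), [node]
    while stack:
        for parent in edges.get(stack.pop(), ()):
            if parent not in out:
                out.add(parent)
                stack.append(parent)
    return out

# Adjacency form of subClassOf/subPropertyOf edges (illustrative)
edges = {
    "ex:Member": ["ex:PersonalInformation"],
    "ex:MemberName": ["ex:PersonalIdentifier"],
    "ex:PersonalIdentifier": ["ex:PersonalInformation"],
}
print(sorted(ancestors("ex:MemberName", edges)))
# ['ex:PersonalIdentifier', 'ex:PersonalInformation']
```

This kind of walk is what makes the generated taxonomy useful for governance: a single query answers "which columns count as personal information?" across every mapped schema.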
Training: Two-Phase Knowledge Acquisition
| Phase | Data Source | Learning Objective |
|---|---|---|
| Pre-training | SchemaStore (5,000+ schemas) | Schema assimilation, RDF action encoding, topological alignment |
| Instruction Tuning | 1,000+ enterprise taxonomy graphs | rdfs:subClassOf relationships, property scoping, constraint validation |
Pre-training on Schema Diversity
The model trains on SchemaStore's diverse collection spanning SQL, NoSQL, and graph database formats. This teaches three foundational capabilities:
- Schema Assimilation — Recognizing patterns across different DDL syntaxes and schema conventions
- RDF Action Encoding — Learning the vocabulary of semantic web predicates and their usage contexts
- Topological Alignment — Understanding how relational structures map to hierarchical taxonomies
Instruction Tuning for Enterprise Constraints
The second phase uses curated enterprise taxonomy graphs with explicit valid/invalid examples. The model learns formal constraints like proper rdfs:subClassOf hierarchies and property scoping rules through contrastive feedback.
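To make the contrastive setup concrete, here is a hypothetical shape for one tuning record, pairing an accepted taxonomy fragment with a rejected one. All field names are illustrative assumptions, not the actual dataset format:

```python
# One hypothetical instruction-tuning record with contrastive feedback:
# the model is rewarded for the accepted triple and penalized for the
# rejected one, which violates a formal RDFS constraint.
example = {
    "instruction": "Map this schema to the enterprise taxonomy.",
    "input_schema": "CREATE TABLE Member (id INT PRIMARY KEY, email VARCHAR(255));",
    "accepted": "ex:MemberEmail rdfs:subPropertyOf ex:ContactInformation .",
    "rejected": "ex:MemberEmail rdfs:subClassOf ex:ContactInformation .",
    "violation": "a property must use rdfs:subPropertyOf, not rdfs:subClassOf",
}
print(example["violation"])
```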
Implementation Details
| Component | Technology | Purpose |
|---|---|---|
| Base Model | T5 (Transformers 4.10.0) | Encoder-decoder architecture for seq2seq generation |
| Framework | TensorFlow 2.8.0 | Model training and inference |
| Schema Parser | SQLParse 0.4.2 | Extract structure from CREATE statements |
| RDF Engine | RDFLib 6.1.1 | Graph construction and serialization |
Custom Components
class TaxonomyLLM(transformers.TFT5ForConditionalGeneration):
    """
    Specialized T5 variant for schema-to-taxonomy generation.
    Extends base T5 with disentangled encoding.
    """

    def __init__(self, config):
        super().__init__(config)
        self.schema_encoder = SchemaEncoder(config.d_model)
        self.taxonomy_decoder = TaxonomyDecoder(config.d_model)
        self.topo_attention = TopoAttentionLayer(config.num_heads)

    def call(self, schema_input):  # Keras models implement call(), not forward()
        # Structural and positional embeddings
        Es, Ep = self.schema_encoder(schema_input)

        # Disentangled attention
        Hs = self.topo_attention(Es, mode='structural')
        Hp = self.topo_attention(Ep, mode='positional')

        # Generate RDF output
        return self.taxonomy_decoder(Hs, Hp)
from taxonomy_llm import TaxonomyLLM
from rdflib import Graph
# Initialize model
model = TaxonomyLLM.from_pretrained('taxonomy-llm-base')
# Input schema
schema = """
CREATE TABLE Customer (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(255) UNIQUE
);
"""
# Generate taxonomy
parsed_schema = model.parse(schema)
input_vectors = model.encode(parsed_schema)
output_triples = model.generate(input_vectors)
# Materialize RDF graph
graph = Graph()
graph.parse(data=output_triples, format="turtle")
print(graph.serialize(format="turtle"))
Results: Validation on Enterprise Schemas
We evaluated TaxonomyLLM on 50 enterprise schemas with human-annotated gold-standard taxonomies.
These results represent a significant reduction in manual taxonomy-authoring effort. The 86% RDF validity score means the vast majority of generated taxonomies are syntactically correct and semantically coherent, requiring only targeted human review rather than ground-up creation.
Application Domains
- Data Governance — Auto-generate metadata taxonomies for compliance frameworks like GDPR, enabling consistent data classification across diverse systems
- Knowledge Graph Construction — Bootstrap enterprise knowledge graphs by automatically mapping legacy database schemas to ontological structures
- Data Catalog Enrichment — Enhance data catalog entries with semantic relationships, improving discoverability and lineage tracking
- Schema Evolution — Automatically update taxonomies when database schemas change, maintaining semantic consistency across versions
Key Takeaways
- Disentanglement matters — Separating structural and positional reasoning enables more precise taxonomy generation than unified embeddings
- T5 excels at structured generation — The encoder-decoder architecture naturally fits the schema-to-RDF translation task
- Two-phase training is essential — Pre-training on schema diversity followed by instruction tuning on enterprise taxonomies produces production-ready results
- Automation enables agility — Reducing taxonomy authoring from weeks to minutes unlocks faster data product development
Explore the Code
The complete implementation including model architecture, training scripts, and example datasets is available on GitHub.