The Problem: Why Auto-Generating Taxonomies Matters

Every enterprise struggles with the same hidden cost: fragmented data schemas spread across dozens of databases, each using its own vocabulary. A "customer" in your CRM is a "client" in billing, a "user" in authentication, and a "patron" in analytics. This semantic fragmentation creates real pain:

  • Manual Taxonomy Reconciliation — Data architects spend weeks mapping schemas to standardized vocabularies, only to repeat the process when schemas evolve
  • Inconsistent Data Governance — Without unified taxonomies, compliance audits become archaeology expeditions through conflicting metadata
  • Integration Bottlenecks — Every new data source requires human experts to define how it fits into the enterprise knowledge graph
  • Vocabulary Drift — As teams independently extend schemas, the gap between local data models and enterprise ontologies widens

The cost is not just time. It is strategic agility. Organizations cannot build intelligent data products when their semantic foundation requires constant manual intervention. TaxonomyLLM automates this reconciliation by learning the deep patterns that connect database schemas to RDF taxonomy structures.

The Solution: LLM-Based RDF Generation

TaxonomyLLM treats taxonomy generation as a sequence-to-sequence translation problem: given a logical schema (the source language), produce valid RDF taxonomy triples (the target language). But unlike typical text translation, this requires understanding both relational structure and ontological semantics.

TaxonomyLLM Processing Pipeline

1. Schema Parsing: extract tables, columns, and datatypes from SQL CREATE statements
2. Schema Encoding: convert structural features into disentangled embeddings
3. TopoAttention: separate schema-structure reasoning from taxonomy-position reasoning
4. Taxonomy Decoding: generate RDF triples with proper ontological relationships
5. RDF Materialization: assemble valid Turtle/RDF-XML with constraint enforcement

The key insight is disentanglement. By separating what the schema contains (structural features) from where elements belong in the taxonomy (positional features), the model can reason about each concern independently before combining them for generation.
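To make the five stages concrete, here is a toy end-to-end sketch of the pipeline's data flow. The regex parser and the hand-written mapping table are illustrative stand-ins for the learned model components, not the actual implementation:

```python
import re

def parse_schema(sql: str) -> dict:
    """Stage 1 (toy): pull table and column names out of CREATE TABLE statements."""
    tables = {}
    for name, body in re.findall(r"CREATE TABLE (\w+)\s*\((.*?)\);", sql, re.S):
        cols = [part.strip().split()[0] for part in body.split(",") if part.strip()]
        tables[name] = cols
    return tables

# Stand-in for stages 2-4: in the real system this mapping is learned,
# not hard-coded.
MAPPING = {
    "name": "PersonalIdentifier",
    "email": "ContactInformation",
    "timestamp": "TemporalMarker",
}

def materialize(tables: dict) -> str:
    """Stage 5 (toy): emit Turtle-style triples for every mapped column."""
    lines = ["@prefix ex: <http://example.org/taxonomy#> ."]
    for table, cols in tables.items():
        for col in cols:
            if col in MAPPING:
                lines.append(f"ex:{table}{col.capitalize()} "
                             f"rdfs:subPropertyOf ex:{MAPPING[col]} .")
    return "\n".join(lines)

sql = "CREATE TABLE Member (id INT PRIMARY KEY, name VARCHAR(100), email VARCHAR(255));"
print(materialize(parse_schema(sql)))
```

The point of the sketch is the shape of the hand-off between stages: parsed structure in, triples out, with the mapping decision isolated in the middle.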

Architecture: Schema Encoding Meets Taxonomy Decoding

Why T5? Model Selection Rationale

We evaluated GPT-3, PaLM, BLOOM, and T5 against three critical requirements: schema assimilation, relational reasoning, and RDF constraint adherence. T5's encoder-decoder architecture proved superior for this structured generation task because it maintains full bidirectional context during encoding while generating output autoregressively.

The Disentangled Embedding Strategy

Standard transformers conflate all features into unified representations. For taxonomy generation, this creates a problem: the model must simultaneously understand that a "customer_id" column is a primary key (structural) and should map to PersonalIdentifier in the taxonomy (positional). Our approach separates these concerns:

Dual Embedding Architecture
Input Schema → Parse(CREATE TABLE statements)
                    ↓
    ┌───────────────┴───────────────┐
    ↓                               ↓
Es (Structural)              Ep (Positional)
"column types,               "taxonomy hierarchy,
 constraints,                 parent-child
 foreign keys"                relationships"
    ↓                               ↓
Hs = TopoAttention(Es)      Hp = TopoAttention(Ep)
    ↓                               ↓
    └───────────────┬───────────────┘
                    ↓
         Generation(Hs, Hp) → RDF Triples

TopoAttention: The Core Innovation

The TopoAttention mechanism computes separate self-attention matrices for structural versus positional reasoning. This disentanglement allows the model to learn that while "email" and "phone" have similar structural properties (both are VARCHAR columns), they occupy different positions in a privacy-aware taxonomy.

TopoAttention Mechanism
// Separate attention for structure and position
HsAttention = softmax(Qs × Ks^T / √d) × Vs    // Schema structure focus
HpAttention = softmax(Qp × Kp^T / √d) × Vp    // Taxonomy position focus

// Combined hidden states for generation
H_combined = LayerNorm(Hs + Hp)
Output = TaxonomyDecoder(H_combined)

This architecture enables precise schema-element-to-taxonomy-component translation. The model learns that a "timestamp" column should map to TemporalMarker, not because of string similarity, but because it has learned the topological relationship between temporal SQL types and time-related taxonomy nodes.
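The mechanism above can be sketched in NumPy with toy dimensions; the random matrices stand in for learned projection weights, and each stream gets its own Q/K/V projections so the two attention computations stay disentangled:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8   # toy sizes: 4 schema tokens, hidden width 8

def attention(E, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over one feature stream."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

Es = rng.normal(size=(n, d))   # structural embeddings (types, keys, constraints)
Ep = rng.normal(size=(n, d))   # positional embeddings (taxonomy hierarchy hints)

# Separate projections per stream: structure and position are attended
# over independently before being fused.
Hs = attention(Es, *[rng.normal(size=(d, d)) for _ in range(3)])
Hp = attention(Ep, *[rng.normal(size=(d, d)) for _ in range(3)])

H_combined = layer_norm(Hs + Hp)   # fused states fed to the taxonomy decoder
print(H_combined.shape)
```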

The Complete Algorithm

The end-to-end process follows four distinct phases:

TaxonomyLLM Core Algorithm
from rdflib import Graph

def generate_taxonomy(schema: str) -> Graph:
    # Phase 1: Parse and encode schema structure
    parsed = parse_sql_schema(schema)
    Es = structural_embedding(parsed.tables, parsed.columns, parsed.constraints)
    Ep = positional_embedding(parsed.relationships, parsed.hierarchy_hints)

    # Phase 2: Disentangled attention reasoning
    Hs = topo_attention(Es)  # Structure-focused hidden states
    Hp = topo_attention(Ep)  # Position-focused hidden states

    # Phase 3: Autoregressive taxonomy generation
    rdf_triples = decoder.generate(Hs, Hp)

    # Phase 4: RDF materialization with constraint enforcement
    graph = Graph()
    graph.parse(data=rdf_triples, format="turtle")
    validate_rdfs_constraints(graph)

    return graph

Example: From Schema to Taxonomy

Consider a simple membership database:

Input: SQL Schema
CREATE TABLE Member (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(255)
);

CREATE TABLE Activity (
    id INT PRIMARY KEY,
    member_id INT REFERENCES Member(id),
    type VARCHAR(50),
    timestamp DATETIME
);

TaxonomyLLM learns the topological compatibility between schema elements and taxonomy concepts through attention scoring:

Output: RDF Taxonomy (Turtle)
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <http://example.org/taxonomy#> .

ex:Member rdfs:subClassOf ex:PersonalInformation ;
    rdfs:label "Member Entity" .

ex:MemberName rdfs:subPropertyOf ex:PersonalIdentifier ;
    rdfs:domain ex:Member .

ex:MemberEmail rdfs:subPropertyOf ex:ContactInformation ;
    rdfs:domain ex:Member .

ex:Activity rdfs:subClassOf ex:ActivityEvent ;
    rdfs:label "Activity Record" .

ex:ActivityTimestamp rdfs:subPropertyOf ex:TemporalMarker ;
    rdfs:domain ex:Activity .

Training: Two-Phase Knowledge Acquisition

  • Pre-training — Data: SchemaStore (5,000+ schemas). Objectives: schema assimilation, RDF action encoding, topological alignment
  • Instruction Tuning — Data: 1,000+ enterprise taxonomy graphs. Objectives: rdfs:subClassOf relationships, property scoping, constraint validation

Pre-training on Schema Diversity

The model trains on SchemaStore's diverse collection spanning SQL, NoSQL, and graph database formats. This teaches three foundational capabilities:

  • Schema Assimilation — Recognizing patterns across different DDL syntaxes and schema conventions
  • RDF Action Encoding — Learning the vocabulary of semantic web predicates and their usage contexts
  • Topological Alignment — Understanding how relational structures map to hierarchical taxonomies
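One way these objectives translate into seq2seq training data: each example pairs a linearized schema (source sequence) with its RDF triples (target sequence). The serialization helpers and special tokens below are hypothetical illustrations, not the paper's actual format:

```python
def linearize_schema(table: str, columns: dict) -> str:
    """Hypothetical source-side serialization: flatten one table
    definition into a token sequence the encoder can consume."""
    cols = " ".join(f"<col> {name} <type> {dtype}" for name, dtype in columns.items())
    return f"<table> {table} {cols}"

def linearize_triples(triples: list) -> str:
    """Hypothetical target-side serialization of RDF triples."""
    return " ".join(f"<s> {s} <p> {p} <o> {o}" for s, p, o in triples)

source = linearize_schema("Member", {"id": "INT", "email": "VARCHAR(255)"})
target = linearize_triples([("ex:Member", "rdfs:subClassOf", "ex:PersonalInformation")])
print(source)
print(target)
```

Training on thousands of such pairs is what lets the model internalize how DDL structures correspond to taxonomy fragments, rather than memorizing string matches.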

Instruction Tuning for Enterprise Constraints

The second phase uses curated enterprise taxonomy graphs with explicit valid/invalid examples. The model learns formal constraints like proper rdfs:subClassOf hierarchies and property scoping rules through contrastive feedback.

Implementation Details

  • Base Model — T5 (Hugging Face Transformers 4.10.0): encoder-decoder architecture for seq2seq generation
  • Framework — TensorFlow 2.8.0: model training and inference
  • Schema Parser — sqlparse 0.4.2: extracts structure from CREATE statements
  • RDF Engine — RDFLib 6.1.1: graph construction and serialization

Custom Components

TaxonomyLLM Class Structure
class TaxonomyLLM(transformers.TFT5ForConditionalGeneration):
    """
    Specialized T5 variant for schema-to-taxonomy generation.
    Extends base T5 with disentangled encoding.
    """
    def __init__(self, config):
        super().__init__(config)
        self.schema_encoder = SchemaEncoder(config.d_model)
        self.taxonomy_decoder = TaxonomyDecoder(config.d_model)
        self.topo_attention = TopoAttentionLayer(config.num_heads)

    def forward(self, schema_input):
        # Structural and positional embeddings
        Es, Ep = self.schema_encoder(schema_input)

        # Disentangled attention
        Hs = self.topo_attention(Es, mode='structural')
        Hp = self.topo_attention(Ep, mode='positional')

        # Generate RDF output
        return self.taxonomy_decoder(Hs, Hp)
Complete Pipeline Usage
from taxonomy_llm import TaxonomyLLM
from rdflib import Graph

# Initialize model
model = TaxonomyLLM.from_pretrained('taxonomy-llm-base')

# Input schema
schema = """
CREATE TABLE Customer (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(255) UNIQUE
);
"""

# Generate taxonomy
parsed_schema = model.parse(schema)
input_vectors = model.encode(parsed_schema)
output_triples = model.generate(input_vectors)

# Materialize RDF graph
graph = Graph()
graph.parse(data=output_triples, format="turtle")
print(graph.serialize(format="turtle"))

Results: Validation on Enterprise Schemas

We evaluated TaxonomyLLM on 50 enterprise schemas with human-annotated gold-standard taxonomies:

  • RDF Validity — 86%
  • Mapping Precision — 81%
  • Vocabulary Alignment — 79%
  • Topology Comparability — 74%

These metrics represent a significant reduction in manual taxonomy authoring effort. The 86% RDF validity score means the vast majority of generated taxonomies are syntactically correct and semantically coherent, requiring only targeted human review rather than ground-up creation.

Application Domains

Data Governance

Auto-generate metadata taxonomies for compliance frameworks like GDPR, enabling consistent data classification across diverse systems

Knowledge Graph Construction

Bootstrap enterprise knowledge graphs by automatically mapping legacy database schemas to ontological structures

Data Catalog Enrichment

Enhance data catalog entries with semantic relationships, improving discoverability and lineage tracking

Schema Evolution

Automatically update taxonomies when database schemas change, maintaining semantic consistency across versions

Key Takeaways

  • Disentanglement matters — Separating structural and positional reasoning enables more precise taxonomy generation than unified embeddings
  • T5 excels at structured generation — The encoder-decoder architecture naturally fits the schema-to-RDF translation task
  • Two-phase training is essential — Pre-training on schema diversity followed by instruction tuning on enterprise taxonomies produces production-ready results
  • Automation enables agility — Reducing taxonomy authoring from weeks to minutes unlocks faster data product development

Explore the Code

The complete implementation including model architecture, training scripts, and example datasets is available on GitHub.
