The Problem: Why ML Feature Management is Hard

Machine learning teams face a coordination nightmare. As models proliferate across an organization, the artifacts that feed them multiply even faster. The challenges compound:

  • Feature Duplication — Different teams create the same features independently, leading to inconsistent definitions and wasted compute
  • Version Chaos — Which version of feature X was used to train model Y? Nobody knows, and reproducing results becomes impossible
  • Lineage Blindness — When a data source changes, which models are affected? Teams discover issues only when production breaks
  • Hyperparameter Amnesia — That high-performing model from three months ago? The exact configuration is lost in someone's notebook
  • Platform Lock-in — Each ML platform (SageMaker, Kubeflow, Vertex AI) has its own way of tracking artifacts, fragmenting institutional knowledge

The root cause is simple: ML artifacts exist in a web of relationships, but most systems treat them as isolated files. You need a system that understands connections, not just storage.

The Solution: A Social Network for ML Artifacts

Chakravyuh approaches ML engineering from a different angle. Rather than storing feature values or model binaries, it tracks the relationships between them. Think of it as LinkedIn for your ML artifacts—a platform where features, datasets, models, and hyperparameters maintain their professional network.

What Chakravyuh Actually Tracks
  1. Features — Definitions, versions, compositions via group/set theory
  2. Datasets — Pointers, versions, and transformation lineage
  3. Models — Training records, discovery, ranking, registration
  4. Hyperparameters — Configuration tracking across training runs
  5. Execution Runs — Complete provenance of what ran when

What Chakravyuh Is NOT

This distinction matters. Chakravyuh does not replace your existing infrastructure:

  • Not a feature value store — Use Redis, Feast, or your platform's native solution for that
  • Not a dataset warehouse — S3, GCS, or your data lake handles actual storage
  • Not a hyperparameter tuning framework — Optuna, Ray Tune, or SageMaker Hyperparameter Tuning does the optimization
  • Not a model registry — MLflow, SageMaker Model Registry, or Vertex AI stores the actual model binaries

Instead, it sits above all these systems, maintaining the metadata graph that connects everything together. Platform-agnostic by design.

Why Graph? The Natural Shape of ML Lineage

Consider a typical ML lineage question: "Which production models will be affected if we change the customer_lifetime_value feature?" In a traditional relational database, answering this requires multiple joins across features, datasets, training runs, and deployed models, and every additional hop of lineage depth means another join in the query.

In a graph database like Neo4j, this becomes a simple traversal:

Lineage Query in Cypher
// Find all production models affected by a feature change
MATCH (f:Feature {name: 'customer_lifetime_value'})
      -[:USED_IN]->(d:Dataset)
      <-[:TRAINED_ON]-(m:Model)
      -[:DEPLOYED_TO]->(env:Environment {type: 'production'})
RETURN m.name, m.version, env.name
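Outside the database, the same reachability question is just a breadth-first walk over an edge list. A toy sketch in plain Java (the `ImpactWalk` class and node names are invented for illustration, not part of Chakravyuh's API):

```java
import java.util.*;

// Toy impact analysis: follow USED_IN / TRAINED_ON / DEPLOYED_TO edges
// downstream from a feature to everything that depends on it.
public class ImpactWalk {

    // adjacency list: node -> directly downstream nodes
    static final Map<String, List<String>> edges = new HashMap<>();

    static void edge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Breadth-first traversal collecting every reachable downstream node.
    static Set<String> downstream(String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            for (String next : edges.getOrDefault(queue.poll(), List.of())) {
                if (seen.add(next)) queue.add(next);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        edge("feature:customer_lifetime_value", "dataset:churn_training");
        edge("dataset:churn_training", "model:churn_predictor");
        edge("model:churn_predictor", "env:production");
        edge("feature:unrelated", "dataset:other");
        // the dataset, model, and environment are reachable; the unrelated branch is not
        System.out.println(downstream("feature:customer_lifetime_value"));
    }
}
```

A graph database gives you this traversal declaratively, with indexes and ACID guarantees; the sketch only shows why the question is cheap when relationships are first-class.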

Feature Composition with Set Theory

Features rarely exist in isolation. A "premium_customer" feature might combine purchase_frequency, average_order_value, and customer_tenure. Chakravyuh models these compositions using group and set theory:

Feature Group Definition
// Define a composite feature group
MATCH (f1:Feature {name: 'purchase_frequency'})
MATCH (f2:Feature {name: 'average_order_value'})
MATCH (f3:Feature {name: 'customer_tenure'})
CREATE (fg:FeatureGroup {name: 'premium_customer_signals'})
CREATE (fg)-[:CONTAINS]->(f1)
CREATE (fg)-[:CONTAINS]->(f2)
CREATE (fg)-[:CONTAINS]->(f3)
CREATE (fg)-[:DERIVED_BY]->(transform:Transformation {
    logic: 'weighted_combination',
    weights: [0.3, 0.5, 0.2]
})

When any constituent feature changes, the graph immediately reveals which composite features and downstream models need attention.
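The weighted_combination transformation stored on the Transformation node is ultimately a dot product over member feature values. A minimal sketch of evaluating it (the `WeightedCombination` class is illustrative, not Chakravyuh code):

```java
import java.util.List;

// Evaluate a composite feature as a weighted combination of its members,
// mirroring the weights [0.3, 0.5, 0.2] stored on the Transformation node.
public class WeightedCombination {

    static double combine(List<Double> values, List<Double> weights) {
        if (values.size() != weights.size())
            throw new IllegalArgumentException("one weight per member feature");
        double score = 0.0;
        for (int i = 0; i < values.size(); i++)
            score += values.get(i) * weights.get(i);
        return score;
    }

    public static void main(String[] args) {
        // purchase_frequency, average_order_value, customer_tenure (scaled to [0,1])
        double premium = combine(List.of(0.8, 0.6, 0.4), List.of(0.3, 0.5, 0.2));
        System.out.println(premium); // 0.8*0.3 + 0.6*0.5 + 0.4*0.2
    }
}
```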

Domain Model: The Core Entities

The domain model captures the essential relationships in ML engineering workflows. Each entity type has specific attributes and relationships:

  • Feature — key attributes: name, version, dataType, description, owner; relationships: BELONGS_TO FeatureGroup, USED_IN Dataset
  • FeatureGroup — key attributes: name, composition_logic, created_at; relationships: CONTAINS Features, DERIVED_BY Transformation
  • Dataset — key attributes: name, version, location_uri, schema; relationships: USES Features, PRODUCED_BY Pipeline
  • Model — key attributes: name, version, algorithm, metrics; relationships: TRAINED_ON Dataset, CONFIGURED_WITH Hyperparameters
  • Hyperparameters — key attributes: config_map, search_space, tuning_method; relationships: APPLIED_TO TrainingRun, OPTIMIZED_BY Experiment
  • TrainingRun — key attributes: run_id, start_time, duration, status; relationships: PRODUCED Model, USED Dataset, APPLIED Hyperparameters
Java Domain Entity Example
@Node
public class Feature {
    @Id @GeneratedValue
    private Long id;

    private String name;
    private String version;
    private String dataType;
    private String description;
    private LocalDateTime createdAt;

    @Relationship(type = "BELONGS_TO", direction = OUTGOING)
    private Set<FeatureGroup> groups;

    @Relationship(type = "USED_IN", direction = OUTGOING)
    private Set<Dataset> datasets;

    @Relationship(type = "PREVIOUS_VERSION", direction = OUTGOING)
    private Feature previousVersion;
}
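The self-referencing PREVIOUS_VERSION relationship is what makes version history a one-line traversal. Stripped of Spring annotations, the idea reduces to a linked list (the `FeatureVersion` class below is a hypothetical sketch, not the actual entity):

```java
import java.util.*;

// The PREVIOUS_VERSION self-reference, reduced to a plain linked list:
// each version points at the one it replaced.
public class FeatureVersion {
    final String name;
    final String version;
    final FeatureVersion previous;   // null for the first version

    FeatureVersion(String name, String version, FeatureVersion previous) {
        this.name = name;
        this.version = version;
        this.previous = previous;
    }

    // Newest-to-oldest history, obtained by following the chain.
    List<String> history() {
        List<String> out = new ArrayList<>();
        for (FeatureVersion f = this; f != null; f = f.previous) out.add(f.version);
        return out;
    }

    public static void main(String[] args) {
        FeatureVersion v1 = new FeatureVersion("churn_score", "1", null);
        FeatureVersion v2 = new FeatureVersion("churn_score", "2", v1);
        FeatureVersion v3 = new FeatureVersion("churn_score", "3", v2);
        System.out.println(v3.history()); // [3, 2, 1]
    }
}
```

In the graph, the same chain is queried with a variable-length Cypher path rather than a Java loop, so history depth never changes the query.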

API Architecture: Three Pillars of ML Engineering

Chakravyuh organizes its RESTful APIs into three functional categories, each addressing a distinct phase of the ML lifecycle:

ML Engineering APIs

Feature CRUD, dataset registration, version management, lineage queries. The daily operations of building ML systems.

Collaboration APIs

Model discovery, feature search, team ownership, access control. Enabling cross-team reuse and knowledge sharing.

Deployment APIs

Model registration, serving configuration, OAS 3.0 spec generation. Bridging training and production.

Feature Management Endpoints

Feature API Examples
// Register a new feature
POST /api/v1/features
{
    "name": "customer_churn_score",
    "dataType": "FLOAT",
    "description": "Probability of customer churning in next 30 days",
    "sourceUri": "s3://features/churn/v1",
    "owner": "risk-team"
}

// Get feature lineage
GET /api/v1/features/{featureId}/lineage?depth=3

// Find features by pattern
GET /api/v1/features/search?query=customer*&owner=risk-team

// Create new version
POST /api/v1/features/{featureId}/versions
{
    "changes": "Added recency weighting",
    "sourceUri": "s3://features/churn/v2"
}

Model Discovery and Ranking

Model Discovery API
// Search models by capability
GET /api/v1/models/discover?task=classification&domain=fraud

// Response includes ranking by metrics
{
    "models": [
        {
            "name": "fraud_detector_v3",
            "metrics": {"auc": 0.94, "precision": 0.87},
            "rank": 1,
            "features_used": ["transaction_velocity", "merchant_risk_score"],
            "last_trained": "2024-01-15T10:30:00Z"
        }
    ]
}
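The ranking in the response boils down to a sort on the requested metric. A sketch of how candidates might be ordered (the `ModelRanker` class and its field names are assumptions, not the service's implementation):

```java
import java.util.*;
import java.util.stream.Collectors;

// Order discovered models by a chosen metric, highest first, the way the
// /models/discover response ranks its results.
public class ModelRanker {

    static class Model {
        final String name;
        final Map<String, Double> metrics;
        Model(String name, Map<String, Double> metrics) {
            this.name = name;
            this.metrics = metrics;
        }
    }

    // Models missing the metric sort last rather than throwing.
    static List<String> rankBy(List<Model> models, String metric) {
        return models.stream()
            .sorted(Comparator.comparingDouble(
                (Model m) -> m.metrics.getOrDefault(metric, Double.NEGATIVE_INFINITY))
                .reversed())
            .map(m -> m.name)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Model> candidates = List.of(
            new Model("fraud_detector_v2", Map.of("auc", 0.91)),
            new Model("fraud_detector_v3", Map.of("auc", 0.94)));
        System.out.println(rankBy(candidates, "auc")); // highest AUC first
    }
}
```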

Automatic OAS 3.0 Generation

When a model is ready for serving, Chakravyuh can generate an OpenAPI 3.0 specification based on its input features and output schema:

Generate Serving Spec
// Generate OAS 3.0 spec for model serving
POST /api/v1/models/{modelId}/generate-oas

// Returns complete OpenAPI specification
{
    "openapi": "3.0.0",
    "info": {
        "title": "Fraud Detector API",
        "version": "3.0.0"
    },
    "paths": {
        "/predict": {
            "post": {
                "requestBody": {
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/FraudPredictionRequest"
                            }
                        }
                    }
                }
            }
        }
    }
}
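Conceptually, spec generation maps the model's input features onto a JSON Schema inside an OAS skeleton. A simplified sketch using plain maps (the `OasSketch` class is illustrative; a real generator would also emit response schemas and a components section):

```java
import java.util.*;

// Sketch of OAS 3.0 generation: derive a minimal spec document from a
// model's input feature names, each becoming a numeric schema property.
public class OasSketch {

    static Map<String, Object> generate(String title, List<String> inputFeatures) {
        Map<String, Object> properties = new LinkedHashMap<>();
        for (String f : inputFeatures) properties.put(f, Map.of("type", "number"));

        Map<String, Object> spec = new LinkedHashMap<>();
        spec.put("openapi", "3.0.0");
        spec.put("info", Map.of("title", title, "version", "1.0.0"));
        spec.put("paths", Map.of("/predict", Map.of("post", Map.of(
            "requestBody", Map.of("content", Map.of("application/json",
                Map.of("schema", Map.of(
                    "type", "object",
                    "properties", properties))))))));
        return spec;
    }

    public static void main(String[] args) {
        System.out.println(generate("Fraud Detector API",
            List.of("transaction_velocity", "merchant_risk_score")));
    }
}
```

Because the graph already knows every feature a model consumes, the request schema falls out of a single lineage query rather than hand-written documentation.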

Hyperparameter Tracking: Beyond Configuration Files

Every ML practitioner has lost track of the exact hyperparameters that produced a winning model. Chakravyuh treats hyperparameter configurations as first-class citizens in the graph:

Hyperparameter Tracking
// Record hyperparameters for a training run
POST /api/v1/training-runs
{
    "modelId": "fraud_detector",
    "datasetId": "transactions_202401",
    "hyperparameters": {
        "learning_rate": 0.001,
        "batch_size": 256,
        "epochs": 50,
        "dropout": 0.3,
        "optimizer": "adam",
        "early_stopping_patience": 5
    },
    "search_space": {
        "learning_rate": {"type": "log_uniform", "min": 0.0001, "max": 0.1},
        "dropout": {"type": "uniform", "min": 0.1, "max": 0.5}
    },
    "tuning_method": "bayesian_optimization"
}

The graph structure enables powerful queries:

Hyperparameter Analysis Queries
// Find best hyperparameters for a model across all runs
GET /api/v1/models/{modelId}/best-hyperparameters?metric=auc

// Compare hyperparameters between model versions
GET /api/v1/models/{modelId}/versions/diff?v1=2&v2=3

// Find runs with similar configurations
GET /api/v1/training-runs/similar?config={"learning_rate": 0.001}
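The versions/diff query reduces to comparing two configuration maps key by key. A minimal sketch (the `HyperparamDiff` class is an illustration, not the service's implementation):

```java
import java.util.*;

// Compare two hyperparameter configurations key by key and report what
// changed, in the spirit of the versions/diff endpoint above.
public class HyperparamDiff {

    static Map<String, String> diff(Map<String, ?> older, Map<String, ?> newer) {
        Map<String, String> changes = new LinkedHashMap<>();
        Set<String> keys = new TreeSet<>(older.keySet());
        keys.addAll(newer.keySet());
        for (String k : keys) {
            Object before = older.get(k), after = newer.get(k);
            if (!Objects.equals(before, after)) {
                changes.put(k, before + " -> " + after);  // null marks an added/removed key
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        Map<String, Object> v2 = Map.of("learning_rate", 0.001, "batch_size", 256);
        Map<String, Object> v3 = Map.of("learning_rate", 0.0005, "batch_size", 256);
        System.out.println(diff(v2, v3)); // only learning_rate changed
    }
}
```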

System Architecture

Layer Technologies Purpose
Runtime Java 14 Type safety, performance, enterprise compatibility
Framework Spring Boot 2.3.3 Dependency injection, configuration, REST support
Database Neo4j Native graph storage, Cypher queries, ACID compliance
Data Access Spring Data Neo4j Repository pattern, object-graph mapping
API Layer Spring Web, OpenAPI RESTful endpoints, documentation generation

Integration Patterns

Chakravyuh integrates with existing ML infrastructure through lightweight adapters:

Platform Integration Example
// SageMaker integration - sync training job metadata
@Service
public class SageMakerAdapter {

    @Autowired
    private TrainingRunRepository trainingRunRepository;

    @Autowired
    private AmazonSageMaker sageMaker;  // AWS SDK client

    public void syncTrainingJob(String trainingJobName) {
        // Fetch job details from SageMaker
        DescribeTrainingJobResult job = sageMaker.describeTrainingJob(
            new DescribeTrainingJobRequest().withTrainingJobName(trainingJobName)
        );

        // Create graph node with relationships
        TrainingRun run = TrainingRun.builder()
            .externalId(job.getTrainingJobArn())
            .platform("SAGEMAKER")
            .hyperparameters(job.getHyperParameters())
            .metrics(job.getFinalMetricDataList())
            .build();

        trainingRunRepository.save(run);
    }
}

Real-World Impact

Organizations using graph-based ML metadata management report significant improvements:

60% Reduction in Feature Duplication
5x Faster Impact Analysis
100% Reproducibility of Training Runs

Where This Matters Most

Financial Services

Model governance, regulatory compliance, audit trails for credit decisions and fraud detection systems.

E-commerce

Recommendation engine features, A/B test tracking, personalization model lineage across customer touchpoints.

Healthcare

Clinical ML models require rigorous provenance tracking for FDA compliance and patient safety.

Multi-Cloud Enterprises

Unified metadata layer when ML workloads span AWS, GCP, and Azure with different native tooling.

Getting Started

Chakravyuh runs as a standalone service alongside your existing ML infrastructure:

Quick Start
# Clone the repository
git clone https://github.com/mgorav/Chakravyuh.git
cd Chakravyuh

# Start Neo4j (Docker)
docker run -d --name neo4j \
    -p 7474:7474 -p 7687:7687 \
    -e NEO4J_AUTH=neo4j/password \
    neo4j:latest

# Configure and run
./mvnw spring-boot:run -Dspring-boot.run.profiles=dev

# API available at http://localhost:8080/api/v1

Explore the Code

The complete implementation is available on GitHub with documentation, domain models, and API specifications.

View on GitHub