The Problem: Why ML Feature Management is Hard
Machine learning teams face a coordination nightmare. As models proliferate across an organization, the artifacts that feed them grow exponentially. The challenges compound:
- Feature Duplication — Different teams create the same features independently, leading to inconsistent definitions and wasted compute
- Version Chaos — Which version of feature X was used to train model Y? Nobody knows, and reproducing results becomes impossible
- Lineage Blindness — When a data source changes, which models are affected? Teams discover issues only when production breaks
- Hyperparameter Amnesia — That high-performing model from three months ago? The exact configuration is lost in someone's notebook
- Platform Lock-in — Each ML platform (SageMaker, Kubeflow, Vertex AI) has its own way of tracking artifacts, fragmenting institutional knowledge
The root cause is simple: ML artifacts exist in a web of relationships, but most systems treat them as isolated files. You need a system that understands connections, not just storage.
The Solution: A Social Network for ML Artifacts
Chakravyuh approaches ML engineering from a different angle. Rather than storing feature values or model binaries, it tracks the relationships between them. Think of it as LinkedIn for your ML artifacts—a platform where features, datasets, models, and hyperparameters maintain their professional network.
- Features: definitions, versions, and compositions via group/set theory
- Datasets: pointers, versions, and transformation lineage
- Models: training records, discovery, ranking, and registration
- Hyperparameters: configuration tracking across training runs
- Execution Runs: complete provenance of what ran when
What Chakravyuh Is NOT
This distinction matters. Chakravyuh does not replace your existing infrastructure:
- Not a feature value store — Use Redis, Feast, or your platform's native solution for that
- Not a dataset warehouse — S3, GCS, or your data lake handles actual storage
- Not a hyperparameter tuning framework — Optuna, Ray Tune, or SageMaker Hyperparameter Tuning does the optimization
- Not a model registry — MLflow, SageMaker Model Registry, or Vertex AI stores the actual model binaries
Instead, it sits above all these systems, maintaining the metadata graph that connects everything together. Platform-agnostic by design.
Why Graph? The Natural Shape of ML Lineage
Consider a typical ML lineage question: "Which production models will be affected if we change the customer_lifetime_value feature?" In a traditional relational database, answering this requires multiple joins across features, datasets, training runs, and deployed models, and the queries grow more complex with every additional hop of relationship depth.
In a graph database like Neo4j, this becomes a simple traversal:
// Find all models affected by a feature change
MATCH (f:Feature {name: 'customer_lifetime_value'})
      -[:USED_IN]->(d:Dataset)
      <-[:TRAINED_ON]-(m:Model)
      -[:DEPLOYED_TO]->(env:Environment {type: 'production'})
RETURN m.name, m.version, env.name
Feature Composition with Set Theory
Features rarely exist in isolation. A "premium_customer" feature might combine purchase_frequency, average_order_value, and customer_tenure. Chakravyuh models these compositions using group and set theory:
// Define a composite feature group
MATCH (f1:Feature {name: 'purchase_frequency'})
MATCH (f2:Feature {name: 'average_order_value'})
MATCH (f3:Feature {name: 'customer_tenure'})
CREATE (fg:FeatureGroup {name: 'premium_customer_signals'})
CREATE (fg)-[:CONTAINS]->(f1)
CREATE (fg)-[:CONTAINS]->(f2)
CREATE (fg)-[:CONTAINS]->(f3)
CREATE (fg)-[:DERIVED_BY]->(transform:Transformation {
logic: 'weighted_combination',
weights: [0.3, 0.5, 0.2]
})
When any constituent feature changes, the graph immediately reveals which composite features and downstream models need attention.
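To make the composition concrete, here is a minimal, framework-free sketch of the weighted_combination logic attached to the Transformation node above. The method name and normalization are assumptions for illustration; Chakravyuh stores only the metadata, not this computation.

```java
// Illustrative sketch of the 'weighted_combination' Transformation logic;
// the feature order and weights [0.3, 0.5, 0.2] mirror the Cypher example.
import java.util.List;

public class WeightedCombination {
    // Combine constituent feature values using the weights stored on
    // the Transformation node. Values are assumed already normalized.
    static double combine(List<Double> values, List<Double> weights) {
        if (values.size() != weights.size()) {
            throw new IllegalArgumentException("values and weights must align");
        }
        double score = 0.0;
        for (int i = 0; i < values.size(); i++) {
            score += values.get(i) * weights.get(i);
        }
        return score;
    }

    public static void main(String[] args) {
        // purchase_frequency, average_order_value, customer_tenure
        double score = combine(List.of(0.8, 0.6, 0.9), List.of(0.3, 0.5, 0.2));
        System.out.println("premium_customer score = " + score);
    }
}
```

Note that the weights array implicitly depends on feature order, which is one reason the graph keeps the CONTAINS relationships explicit.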
Domain Model: The Core Entities
The domain model captures the essential relationships in ML engineering workflows. Each entity type has specific attributes and relationships:
| Entity | Key Attributes | Primary Relationships |
|---|---|---|
| Feature | name, version, dataType, description, owner | BELONGS_TO FeatureGroup, USED_IN Dataset |
| FeatureGroup | name, composition_logic, created_at | CONTAINS Features, DERIVED_BY Transformation |
| Dataset | name, version, location_uri, schema | USES Features, PRODUCED_BY Pipeline |
| Model | name, version, algorithm, metrics | TRAINED_ON Dataset, CONFIGURED_WITH Hyperparameters |
| Hyperparameters | config_map, search_space, tuning_method | APPLIED_TO TrainingRun, OPTIMIZED_BY Experiment |
| TrainingRun | run_id, start_time, duration, status | PRODUCED Model, USED Dataset, APPLIED Hyperparameters |
// Spring Data Neo4j entity mapping a Feature node and its relationships
@Node
public class Feature {
    @Id @GeneratedValue
    private Long id;
    private String name;
    private String version;
    private String dataType;
    private String description;
    private LocalDateTime createdAt;
    @Relationship(type = "BELONGS_TO", direction = OUTGOING)
    private Set<FeatureGroup> groups;
    @Relationship(type = "USED_IN", direction = OUTGOING)
    private Set<Dataset> datasets;
    @Relationship(type = "PREVIOUS_VERSION", direction = OUTGOING)
    private Feature previousVersion;
}
API Architecture: Three Pillars of ML Engineering
Chakravyuh organizes its RESTful APIs into three functional categories, each addressing a distinct phase of the ML lifecycle:
ML Engineering APIs
Feature CRUD, dataset registration, version management, lineage queries. The daily operations of building ML systems.
Collaboration APIs
Model discovery, feature search, team ownership, access control. Enabling cross-team reuse and knowledge sharing.
Deployment APIs
Model registration, serving configuration, OAS 3.0 spec generation. Bridging training and production.
Feature Management Endpoints
// Register a new feature
POST /api/v1/features
{
"name": "customer_churn_score",
"dataType": "FLOAT",
"description": "Probability of customer churning in next 30 days",
"sourceUri": "s3://features/churn/v1",
"owner": "risk-team"
}
// Get feature lineage
GET /api/v1/features/{featureId}/lineage?depth=3
// Find features by pattern
GET /api/v1/features/search?query=customer*&owner=risk-team
// Create new version
POST /api/v1/features/{featureId}/versions
{
"changes": "Added recency weighting",
"sourceUri": "s3://features/churn/v2"
}
Model Discovery and Ranking
// Search models by capability
GET /api/v1/models/discover?task=classification&domain=fraud
// Response includes ranking by metrics
{
"models": [
{
"name": "fraud_detector_v3",
"metrics": {"auc": 0.94, "precision": 0.87},
"rank": 1,
"features_used": ["transaction_velocity", "merchant_risk_score"],
"last_trained": "2024-01-15T10:30:00Z"
}
]
}
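The ranking behind the discover endpoint can be sketched as a straightforward sort over candidate models by the requested metric. This is an assumed implementation for illustration, not the service's actual code:

```java
// Hypothetical sketch of ranking discovered models by a metric such as "auc".
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class ModelRanking {
    static class Model {
        final String name;
        final Map<String, Double> metrics;

        Model(String name, Map<String, Double> metrics) {
            this.name = name;
            this.metrics = metrics;
        }
    }

    // Rank by the requested metric, highest first; models missing
    // the metric sort last.
    static List<String> rankBy(List<Model> models, String metric) {
        List<Model> sorted = new ArrayList<>(models);
        sorted.sort(Comparator.comparingDouble(
                (Model m) -> m.metrics.getOrDefault(metric, Double.NEGATIVE_INFINITY))
                .reversed());
        List<String> names = new ArrayList<>();
        for (Model m : sorted) {
            names.add(m.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Model> candidates = List.of(
                new Model("fraud_detector_v3", Map.of("auc", 0.94, "precision", 0.87)),
                new Model("fraud_detector_v2", Map.of("auc", 0.91)));
        System.out.println(rankBy(candidates, "auc"));
    }
}
```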
Automatic OAS 3.0 Generation
When a model is ready for serving, Chakravyuh can generate an OpenAPI 3.0 specification based on its input features and output schema:
// Generate OAS 3.0 spec for model serving
POST /api/v1/models/{modelId}/generate-oas
// Returns complete OpenAPI specification
{
"openapi": "3.0.0",
"info": {
"title": "Fraud Detector API",
"version": "3.0.0"
},
"paths": {
"/predict": {
"post": {
"requestBody": {
"content": {
"application/json": {
"schema": {
"$ref": "#/components/schemas/FraudPredictionRequest"
}
}
}
}
}
}
}
}
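One plausible way to assemble such a spec is to derive the request schema directly from the model's registered input features. The sketch below is an assumption about how that generation could work; the property types and helper names are illustrative, not Chakravyuh's actual generator:

```java
// Hedged sketch: build a minimal OAS 3.0 document from a model's input features.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OasSketch {
    static Map<String, Object> buildSpec(String title, String version, List<String> features) {
        // One numeric property per input feature (assumed; real types would
        // come from each Feature node's dataType attribute)
        Map<String, Object> properties = new LinkedHashMap<>();
        for (String f : features) {
            properties.put(f, Map.of("type", "number"));
        }
        Map<String, Object> spec = new LinkedHashMap<>();
        spec.put("openapi", "3.0.0");
        spec.put("info", Map.of("title", title, "version", version));
        spec.put("paths", Map.of("/predict", Map.of("post", Map.of(
                "requestBody", Map.of("content", Map.of("application/json",
                        Map.of("schema", Map.of("type", "object",
                                "properties", properties))))))));
        return spec;
    }

    public static void main(String[] args) {
        Map<String, Object> spec = buildSpec("Fraud Detector API", "3.0.0",
                List.of("transaction_velocity", "merchant_risk_score"));
        System.out.println(spec.get("openapi")); // prints 3.0.0
    }
}
```

Because the graph already links the model to its features, the generator never needs a separate schema definition to stay in sync with training inputs.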
Hyperparameter Tracking: Beyond Configuration Files
Every ML practitioner has lost track of the exact hyperparameters that produced a winning model. Chakravyuh treats hyperparameter configurations as first-class citizens in the graph:
// Record hyperparameters for a training run
POST /api/v1/training-runs
{
"modelId": "fraud_detector",
"datasetId": "transactions_202401",
"hyperparameters": {
"learning_rate": 0.001,
"batch_size": 256,
"epochs": 50,
"dropout": 0.3,
"optimizer": "adam",
"early_stopping_patience": 5
},
"search_space": {
"learning_rate": {"type": "log_uniform", "min": 0.0001, "max": 0.1},
"dropout": {"type": "uniform", "min": 0.1, "max": 0.5}
},
"tuning_method": "bayesian_optimization"
}
The graph structure enables powerful queries:
// Find best hyperparameters for a model across all runs
GET /api/v1/models/{modelId}/best-hyperparameters?metric=auc
// Compare hyperparameters between model versions
GET /api/v1/models/{modelId}/versions/diff?v1=2&v2=3
// Find runs with similar configurations
GET /api/v1/training-runs/similar?config={"learning_rate": 0.001}
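The versions/diff endpoint boils down to a key-wise comparison of two configuration maps. A minimal sketch of that diff, with the output format assumed for illustration:

```java
// Hypothetical sketch of diffing hyperparameter configs between two versions.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class HyperparamDiff {
    // Report only the keys whose values differ, as "old -> new" strings.
    static Map<String, String> diff(Map<String, Object> a, Map<String, Object> b) {
        Map<String, String> changes = new HashMap<>();
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        for (String k : keys) {
            if (!Objects.equals(a.get(k), b.get(k))) {
                changes.put(k, a.get(k) + " -> " + b.get(k));
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        Map<String, Object> v2 = Map.of("learning_rate", 0.001, "batch_size", 256);
        Map<String, Object> v3 = Map.of("learning_rate", 0.0005, "batch_size", 256);
        System.out.println(diff(v2, v3)); // only learning_rate changed
    }
}
```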
System Architecture
| Layer | Technologies | Purpose |
|---|---|---|
| Runtime | Java 14 | Type safety, performance, enterprise compatibility |
| Framework | Spring Boot 2.3.3 | Dependency injection, configuration, REST support |
| Database | Neo4j | Native graph storage, Cypher queries, ACID compliance |
| Data Access | Spring Data Neo4j | Repository pattern, object-graph mapping |
| API Layer | Spring Web, OpenAPI | RESTful endpoints, documentation generation |
Integration Patterns
Chakravyuh integrates with existing ML infrastructure through lightweight adapters:
// SageMaker integration - sync training job metadata
@Service
public class SageMakerAdapter {
    @Autowired
    private TrainingRunRepository trainingRunRepository;
    @Autowired
    private AmazonSageMaker sageMaker;

    public void syncTrainingJob(String sageMakerJobArn) {
        // DescribeTrainingJob keys on the job name, which is the
        // last segment of the training job ARN
        String jobName = sageMakerJobArn.substring(sageMakerJobArn.lastIndexOf('/') + 1);
        DescribeTrainingJobResult job = sageMaker.describeTrainingJob(
            new DescribeTrainingJobRequest().withTrainingJobName(jobName)
        );
        // Create graph node with relationships
        TrainingRun run = TrainingRun.builder()
            .externalId(sageMakerJobArn)
            .platform("SAGEMAKER")
            .hyperparameters(job.getHyperParameters())
            .metrics(job.getFinalMetricDataList())
            .build();
        trainingRunRepository.save(run);
    }
}
Real-World Impact
Organizations using graph-based ML metadata management report significant improvements in reproducibility, impact analysis, and cross-team feature reuse.
Where This Matters Most
Financial Services
Model governance, regulatory compliance, audit trails for credit decisions and fraud detection systems.
E-commerce
Recommendation engine features, A/B test tracking, personalization model lineage across customer touchpoints.
Healthcare
Clinical ML models require rigorous provenance tracking for FDA compliance and patient safety.
Multi-Cloud Enterprises
Unified metadata layer when ML workloads span AWS, GCP, and Azure with different native tooling.
Getting Started
Chakravyuh runs as a standalone service alongside your existing ML infrastructure:
# Clone the repository
git clone https://github.com/mgorav/Chakravyuh.git
cd Chakravyuh
# Start Neo4j (Docker)
docker run -d --name neo4j \
-p 7474:7474 -p 7687:7687 \
-e NEO4J_AUTH=neo4j/password \
neo4j:latest
# Configure and run
./mvnw spring-boot:run -Dspring-boot.run.profiles=dev
# API available at http://localhost:8080/api/v1
Explore the Code
The complete implementation is available on GitHub with documentation, domain models, and API specifications.