Multi-Cloud LLM Training Architecture: GCP, AWS & RunPod GPU Infrastructure

Complete reference architecture for LLM fine-tuning with data on GCP/AWS and training on RunPod GPUs. Covers networking options, CIDR planning, Docker registry integration, and secure data transfer patterns.

Gonnect Team
January 18, 2025 · 18 min read
Tags: RunPod, GCP, AWS, WireGuard, Cloud Sync, Artifact Registry, ECR

Introduction

Training large language models (LLMs) requires substantial GPU compute resources that traditional cloud providers may not always offer at competitive prices. RunPod has emerged as a popular alternative, providing access to high-end GPUs like A100 and H100 at significantly lower costs than AWS or GCP.

However, many enterprises already have their training data stored in cloud platforms like AWS S3 or Google Cloud Storage. This creates a common architectural challenge: how do you efficiently and securely transfer data from your existing cloud infrastructure to external GPU compute providers like RunPod?

This article provides a complete reference architecture for multi-cloud LLM training, covering:

  1. RunPod Networking Options - Cloud Sync, Global Networking, and optional WireGuard
  2. Cloud Data Storage - GCP and AWS data lake patterns
  3. Docker Image Management - Where to store and pull training container images
  4. Network Architecture - CIDR planning and secure connectivity
  5. LLM Fine-Tuning Data Flow - End-to-end training pipeline

Key Finding from Research: RunPod does NOT support native VPC peering with external clouds. Data transfer happens via Cloud Sync (TLS/HTTPS) or optional self-hosted WireGuard tunnels.

Multi-Cloud Training Architecture Overview

[Diagram: Multi-Cloud LLM Training Architecture]

Architecture Components

| Component | Purpose | Location |
| --- | --- | --- |
| GCS / S3 | Training data storage | GCP / AWS |
| Artifact Registry / ECR | Docker image storage | GCP / AWS |
| Cloud Sync | Data transfer to RunPod | RunPod built-in |
| Global Networking | Pod-to-Pod communication | RunPod internal |
| GPU Pods | LLM fine-tuning compute | RunPod |

RunPod Networking: What You Need to Know

RunPod offers several networking features, but it's crucial to understand their capabilities and limitations.

Global Networking (Internal)

[Diagram: RunPod Global Networking (Internal)]

RunPod's Global Networking creates a secure, private network connecting all your Pods within your account:

| Feature | Specification |
| --- | --- |
| Speed | 100 Mbps between Pods |
| DNS | `<podid>.runpod.internal` |
| Isolation | Complete isolation from external networks |
| Availability | NVIDIA GPU Pods only |
| Regions | 17+ data centers worldwide |
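At 100 Mbps, moving large artifacts between Pods takes real time. A rough back-of-envelope helper (assuming the link sustains its nominal rate, with decimal GB-to-megabit conversion):

```python
def transfer_seconds(size_gb: float, link_mbps: float = 100.0) -> float:
    """Seconds to move size_gb gigabytes over a link_mbps link (idealized)."""
    size_megabits = size_gb * 1000 * 8  # GB -> megabits, decimal units
    return size_megabits / link_mbps

# A 50 GB checkpoint over the 100 Mbps Pod-to-Pod link:
print(f"{transfer_seconds(50) / 3600:.1f} hours")  # ~1.1 hours
```

Plan checkpoint sizes and sync frequency around this ceiling; real throughput will be lower.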

Connectivity Options Comparison

[Diagram: RunPod Connectivity Options]

| Option | Pros | Cons | Use Case |
| --- | --- | --- | --- |
| Cloud Sync (default) | Easy setup, built-in, TLS encrypted | Uses public internet | Standard training workflows |
| WireGuard (advanced) | Private connectivity, no public ports | Self-hosted, more complex | High-security requirements |

GCP Network Architecture

When your training data lives in GCP, implement a hub-spoke VPC architecture for optimal security and scalability.

[Diagram: GCP Hub-Spoke Network (CIDR Planning)]

GCP CIDR Planning

| VPC | CIDR Block | Purpose |
| --- | --- | --- |
| Hub VPC | 10.0.0.0/16 | Centralized egress, VPN gateway |
| Data Spoke | 10.1.0.0/16 | GCS, BigQuery data storage |
| ML Spoke | 10.2.0.0/16 | Artifact Registry, ML services |
| Shared Services | 10.3.0.0/16 | Vault, secrets management |

GCP Best Practices

  1. Private Service Connect - Access GCS without public IPs
  2. VPC Service Controls - Prevent data exfiltration
  3. Cloud NAT - Controlled egress for RunPod API calls
  4. Cloud VPN - Optional WireGuard gateway

AWS Network Architecture

For AWS-based training data, use a multi-subnet VPC with VPC endpoints for private access.

[Diagram: AWS VPC Network (CIDR Planning)]

AWS CIDR Planning

| Subnet | CIDR Block | Purpose |
| --- | --- | --- |
| VPC | 172.16.0.0/16 | Main VPC |
| Public Subnet | 172.16.0.0/24 | NAT Gateway, bastion |
| Data Subnet | 172.16.1.0/24 | S3/ECR endpoints |
| ML Subnet | 172.16.2.0/24 | VPN endpoint |

AWS Best Practices

  1. S3 Gateway Endpoint - Free private S3 access
  2. ECR Interface Endpoint - Private Docker pulls
  3. Security Groups - Least privilege access
  4. VPN Endpoint - Optional WireGuard connectivity

Docker Image Storage

[Diagram: Docker Image Storage & Distribution]

Registry Comparison

| Registry | Cloud | Features | RunPod Access |
| --- | --- | --- | --- |
| Artifact Registry | GCP | Vulnerability scanning, IAM | Public or VPC endpoint |
| ECR | AWS | Lifecycle policies, cross-account | Public or PrivateLink |
| Docker Hub | Neutral | Easy public access | Direct pull |
| RunPod Registry | RunPod | S3-compatible API | Native integration |

Docker Image Strategy

```yaml
# CI/CD Pipeline for Multi-Registry Push
name: Build and Push
on:
  push:
    branches: [main]
    paths: ['docker/**']

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build Image
        run: docker build -t llm-trainer:${{ github.sha }} .

      # Push to GCP Artifact Registry
      - name: Push to Artifact Registry
        run: |
          gcloud auth configure-docker us-central1-docker.pkg.dev
          docker tag llm-trainer:${{ github.sha }} \
            us-central1-docker.pkg.dev/project/repo/llm-trainer:${{ github.sha }}
          docker push us-central1-docker.pkg.dev/project/repo/llm-trainer:${{ github.sha }}

      # Push to AWS ECR
      - name: Push to ECR
        run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin \
            ${{ secrets.AWS_ACCOUNT }}.dkr.ecr.us-east-1.amazonaws.com
          docker tag llm-trainer:${{ github.sha }} \
            ${{ secrets.AWS_ACCOUNT }}.dkr.ecr.us-east-1.amazonaws.com/llm-trainer:${{ github.sha }}
          docker push ${{ secrets.AWS_ACCOUNT }}.dkr.ecr.us-east-1.amazonaws.com/llm-trainer:${{ github.sha }}
```
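The pipeline above tags one build for two registries. A small hypothetical helper captures the naming; the project, repo, and account values are placeholders:

```python
def image_uris(sha: str, gcp_project: str, aws_account: str) -> dict:
    """Build per-registry tags for one training image build (illustrative naming)."""
    name = f"llm-trainer:{sha}"
    return {
        "artifact_registry": f"us-central1-docker.pkg.dev/{gcp_project}/repo/{name}",
        "ecr": f"{aws_account}.dkr.ecr.us-east-1.amazonaws.com/{name}",
    }

uris = image_uris("abc123", "my-project", "123456789012")
print(uris["ecr"])  # 123456789012.dkr.ecr.us-east-1.amazonaws.com/llm-trainer:abc123
```

Tagging by commit SHA rather than `latest` keeps every registry pointing at an identical, reproducible image.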

LLM Fine-Tuning Data Flow

[Diagram: LLM Fine-Tuning Data Flow]

Training Pipeline Stages

| Stage | Location | Tool |
| --- | --- | --- |
| Raw Data | GCP/AWS | GCS/S3 bucket |
| Tokenization | Cloud | Dataflow/Glue |
| Data Sharding | Cloud | Custom script |
| Secure Transfer | Network | Cloud Sync (TLS) |
| Data Loading | RunPod | PyTorch DataLoader |
| Fine-Tuning | RunPod | DeepSpeed/FSDP |
| Checkpoint Save | RunPod → Cloud | Cloud Sync |
| Model Export | Cloud | GCS/S3 |
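The "custom script" sharding stage can be as simple as a round-robin split across DataLoader workers. A minimal sketch:

```python
def shard_records(records: list, num_shards: int) -> list:
    """Round-robin records into num_shards shards for parallel data loading."""
    shards = [[] for _ in range(num_shards)]
    for i, rec in enumerate(records):
        shards[i % num_shards].append(rec)
    return shards

shards = shard_records(list(range(10)), 4)
print([len(s) for s in shards])  # [3, 3, 2, 2]
```

Round-robin keeps shard sizes within one record of each other, so no worker sits idle waiting for a much larger shard.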

Cloud Sync Configuration

```python
# RunPod Cloud Sync for LLM Training
# Note: illustrative SDK usage — check the runpod package docs for the current
# Cloud Sync interface. Credentials come from the environment, never hard-coded.
import os

from runpod.cloud_sync import CloudSync

run_id = os.environ["RUN_ID"]  # unique identifier for this training run

# Initialize sync with GCS
gcs_sync = CloudSync(
    provider="gcs",
    credentials_json=os.environ["GCS_SERVICE_ACCOUNT"],
    bucket="training-data-bucket",
    local_path="/workspace/data",
)

# Sync training data
gcs_sync.download(
    source_path="datasets/llm-finetune/",
    include_patterns=["*.parquet", "*.jsonl"],
)

# After training, sync checkpoints back
s3_sync = CloudSync(
    provider="aws",
    access_key=os.environ["AWS_ACCESS_KEY"],
    secret_key=os.environ["AWS_SECRET_KEY"],
    region="us-east-1",
    bucket="model-checkpoints-bucket",
    local_path="/workspace/checkpoints",
)

s3_sync.upload(
    destination_path=f"run-{run_id}/",
    include_patterns=["*.pt", "*.safetensors"],
)
```
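The `include_patterns` arguments above suggest glob-style matching. A hedged sketch of that filtering with Python's `fnmatch` (the actual Cloud Sync matching rules may differ):

```python
import fnmatch

def matches(filename: str, patterns: list) -> bool:
    """True if filename matches any glob pattern (sketch of include_patterns)."""
    return any(fnmatch.fnmatch(filename, p) for p in patterns)

files = ["model.safetensors", "optimizer.pt", "train.log"]
synced = [f for f in files if matches(f, ["*.pt", "*.safetensors"])]
print(synced)  # ['model.safetensors', 'optimizer.pt']
```

Narrow patterns keep logs and temporary files out of the checkpoint bucket, which also cuts egress volume.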

CIDR Planning for Multi-Cloud

[Diagram: Multi-Cloud CIDR Block Planning]

Complete CIDR Allocation

| Environment | CIDR | Notes |
| --- | --- | --- |
| GCP Hub | 10.0.0.0/16 | 65,536 IPs |
| GCP Data | 10.1.0.0/16 | Storage VPC |
| GCP ML | 10.2.0.0/16 | Registry, ML services |
| AWS VPC | 172.16.0.0/16 | Non-overlapping with GCP |
| AWS Public | 172.16.0.0/24 | NAT, bastion |
| AWS Private | 172.16.1.0/24 | Data subnets |
| WireGuard Tunnel | 10.100.0.0/24 | Optional overlay |
| RunPod Internal | Managed | `*.runpod.internal` |
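The allocation above can be verified mechanically with Python's `ipaddress` module, asserting that the top-level blocks never overlap and that the AWS subnets sit inside the AWS VPC:

```python
import ipaddress

# Top-level blocks from the allocation table; these must not overlap each other.
blocks = {
    "gcp-hub": "10.0.0.0/16",
    "gcp-data": "10.1.0.0/16",
    "gcp-ml": "10.2.0.0/16",
    "aws-vpc": "172.16.0.0/16",
    "wireguard": "10.100.0.0/24",
}
nets = {name: ipaddress.ip_network(cidr) for name, cidr in blocks.items()}
names = list(nets)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        assert not nets[a].overlaps(nets[b]), f"{a} overlaps {b}"

# AWS subnets must sit inside the AWS VPC.
for subnet in ("172.16.0.0/24", "172.16.1.0/24", "172.16.2.0/24"):
    assert ipaddress.ip_network(subnet).subnet_of(nets["aws-vpc"])

print("CIDR plan is consistent")
```

Running a check like this in CI catches accidental overlaps before a new VPC or tunnel is provisioned.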

Secure Network Path

[Diagram: Secure Network Path (No Internet Hop)]

Security Layers

| Layer | GCP | AWS | RunPod |
| --- | --- | --- | --- |
| Data at Rest | CMEK encryption | KMS encryption | AES-256 |
| Data in Transit | TLS 1.3 | TLS 1.3 | TLS 1.3 |
| Access Control | IAM + VPC-SC | IAM + SG | API keys |
| Network Isolation | Private endpoints | PrivateLink | Global Networking |

When to Use WireGuard vs Cloud Sync

Based on our research, here's the decision framework:

Use Cloud Sync (Default) When:

  • Standard LLM training workflows
  • Data is not extremely sensitive
  • You want simple, managed connectivity
  • You're okay with 100 Mbps throughput

Use WireGuard When:

  • Regulatory requirements prohibit public endpoints
  • You need private IP connectivity
  • You want to avoid exposing any ports publicly
  • You're running sensitive workloads
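The framework above boils down to a single predicate. A trivial sketch (the flag names are made up):

```python
def use_wireguard(*, regulated: bool, needs_private_ips: bool,
                  no_public_ports: bool) -> bool:
    """True if any high-security requirement rules out Cloud Sync's public endpoints."""
    return regulated or needs_private_ips or no_public_ports

# A standard training workflow with no special constraints stays on Cloud Sync:
print(use_wireguard(regulated=False, needs_private_ips=False,
                    no_public_ports=False))  # False
```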

WireGuard is FREE and Open Source

WireGuard is licensed under:

  • Kernel components: GPLv2
  • User-space tools: GPL-2.0, MIT, BSD, Apache 2.0

You can deploy WireGuard at no licensing cost—you only pay for the VM/bandwidth.

Best Practices Summary

1. Data Storage

  • Use private endpoints for GCS/S3 access
  • Implement VPC Service Controls (GCP) or VPC endpoints (AWS)
  • Enable encryption at rest with customer-managed keys

2. Networking

  • Single cloud? Use native VPC networking
  • Multi-cloud? Consider Google Cloud Cross-Cloud Interconnect to AWS
  • External GPU? Use Cloud Sync or WireGuard

3. Docker Images

  • Push to both Artifact Registry and ECR
  • Use VPC endpoints for private pulls
  • Implement image signing and vulnerability scanning

4. Security

  • Never hard-code API keys in scripts
  • Use environment variables or secrets managers
  • Implement least privilege IAM policies
  • Enable audit logging for all data access

Conclusion

Building a multi-cloud LLM training architecture requires understanding the networking capabilities and limitations of each platform. Key takeaways:

  1. RunPod uses Cloud Sync for external data transfer—there's no native VPC peering
  2. Global Networking provides 100 Mbps Pod-to-Pod connectivity within RunPod
  3. WireGuard is optional for cases requiring private connectivity without public ports
  4. Docker images should be stored in cloud registries (Artifact Registry, ECR) with VPC endpoints
  5. CIDR planning must avoid overlaps between GCP, AWS, and tunnel networks

Further Reading