
Multi-LLM Orchestration

ARAL v1.2+ — Intelligent routing, fallback, aggregation, and optimization across multiple LLM providers


ARAL supports advanced multi-LLM orchestration, enabling agents to:

  • Use multiple LLM providers (OpenAI, Anthropic, Azure, Google, Cohere, etc.)
  • Route requests intelligently (specialized, cost-optimized, quality-first, latency-first)
  • Blend models with custom weights (weighted ponderation)
  • Aggregate responses (best-of-n, ensemble, consensus, voting)
  • Fall back automatically (provider A → B → C on failure)
  • Balance load across providers (distribute requests)
  • Optimize cost and latency (budget limits, maximum latency)
  • Select models per task (specialized routing)
  • Customize prompts per LLM (model-specific prompts)
  • Moderate content (safety filters, guardrails)
  • Track costs (per-provider usage and spend monitoring)

| Provider  | Models                            | Status |
| --------- | --------------------------------- | ------ |
| OpenAI    | GPT-4, GPT-4-turbo, GPT-3.5-turbo | ✅     |
| Azure     | Azure OpenAI models               | ✅     |
| Anthropic | Claude 4 (Opus, Sonnet), Claude 3 | ✅     |
| Google    | Gemini Pro, Gemini Ultra          | 🔜     |
| Cohere    | Command, Command-R                | 🔜     |
| Mistral   | Mistral Large, Medium, Small      | 🔜     |
| Local     | Ollama, LM Studio, vLLM           | 🔜     |

┌────────────────────────────────────────┐
│       LLM Orchestrator (Layer 4)       │
│ ┌────────────────────────────────────┐ │
│ │ Router (Specialized Routing)       │ │
│ │ - Task-based model selection       │ │
│ │ - Cost optimization                │ │
│ │ - Quality-based selection          │ │
│ │ - Latency optimization             │ │
│ │ - Load balancing                   │ │
│ │ - Priority-based routing           │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ Aggregation Engine                 │ │
│ │ - Best-of-N selection              │ │
│ │ - Ensemble combination             │ │
│ │ - Consensus voting                 │ │
│ │ - Weighted blending                │ │
│ │ - Quality-based ranking            │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ Fallback Manager                   │ │
│ │ - Primary → Secondary → Tertiary   │ │
│ │ - Circuit breaker per provider     │ │
│ │ - Health monitoring                │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ Optimization Layer                 │ │
│ │ - Cost tracking & budgets          │ │
│ │ - Latency monitoring               │ │
│ │ - Quality scoring                  │ │
│ │ - Performance analytics            │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ Moderation Layer                   │ │
│ │ - Content safety filters           │ │
│ │ - PII detection                    │ │
│ │ - Guardrails enforcement           │ │
│ └────────────────────────────────────┘ │
└────────────────────────────────────────┘
      │             │             │
      ▼             ▼             ▼
┌─────────┐   ┌─────────┐   ┌─────────┐
│ OpenAI  │   │Anthropic│   │  Azure  │
└─────────┘   └─────────┘   └─────────┘

import { ARALAgent, LLMOrchestrator } from "@aral-standard/sdk";

const agent = new ARALAgent({
  persona: "assistant.json",
  llm: new LLMOrchestrator({
    providers: [
      {
        name: "openai",
        type: "openai",
        apiKey: process.env.OPENAI_API_KEY,
        models: ["gpt-4", "gpt-3.5-turbo"],
        priority: 1,
        cost_per_1k_tokens: { input: 0.03, output: 0.06 },
      },
      {
        name: "anthropic",
        type: "anthropic",
        apiKey: process.env.ANTHROPIC_API_KEY,
        models: ["claude-3-opus", "claude-3-sonnet"],
        priority: 2,
        cost_per_1k_tokens: { input: 0.015, output: 0.075 },
      },
      {
        name: "azure",
        type: "azure",
        endpoint: process.env.AZURE_ENDPOINT,
        apiKey: process.env.AZURE_API_KEY,
        models: ["gpt-4-azure"],
        priority: 3,
        cost_per_1k_tokens: { input: 0.03, output: 0.06 },
      },
    ],
    routing_strategy: "cost_optimized",
    fallback_enabled: true,
    moderation: {
      enabled: true,
      provider: "openai",
    },
  }),
});

import os

from aral import ARALAgent, LLMOrchestrator, ProviderConfig

agent = ARALAgent(
    persona_id="assistant",
    llm=LLMOrchestrator(
        providers=[
            ProviderConfig(
                name="openai",
                type="openai",
                api_key=os.getenv("OPENAI_API_KEY"),
                models=["gpt-4", "gpt-3.5-turbo"],
                priority=1,
                cost_per_1k_tokens={"input": 0.03, "output": 0.06}
            ),
            ProviderConfig(
                name="anthropic",
                type="anthropic",
                api_key=os.getenv("ANTHROPIC_API_KEY"),
                models=["claude-3-opus", "claude-3-sonnet"],
                priority=2,
                cost_per_1k_tokens={"input": 0.015, "output": 0.075}
            )
        ],
        routing_strategy="cost_optimized",
        fallback_enabled=True,
        moderation={"enabled": True, "provider": "openai"}
    )
)

Route to the cheapest model that meets quality requirements:

const orchestrator = new LLMOrchestrator({
  routing_strategy: "cost_optimized",
  quality_threshold: 0.85, // Min quality score
  providers: [
    { name: "gpt-3.5", cost: 0.002, quality: 0.8 }, // Cheapest, below threshold
    { name: "claude-sonnet", cost: 0.015, quality: 0.88 }, // ✅ Selected (cheapest meeting threshold)
    { name: "gpt-4", cost: 0.03, quality: 0.95 }, // Most expensive
  ],
});
// Result: Uses claude-sonnet (meets quality + cheapest)
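
The selection logic behind this strategy is simple: drop providers below the quality threshold, then take the cheapest of what remains. The standalone TypeScript sketch below illustrates that rule (it is not the SDK's internal implementation):

interface ProviderInfo {
  name: string;
  cost: number;    // $ per 1K tokens (illustrative)
  quality: number; // 0..1 quality score
}

// Hypothetical helper: pick the cheapest provider meeting the quality threshold.
function pickCostOptimized(providers: ProviderInfo[], qualityThreshold: number): ProviderInfo | undefined {
  return providers
    .filter((p) => p.quality >= qualityThreshold) // drop anything below the quality bar
    .sort((a, b) => a.cost - b.cost)[0];          // cheapest remaining provider wins
}

const chosen = pickCostOptimized(
  [
    { name: "gpt-3.5", cost: 0.002, quality: 0.8 },
    { name: "claude-sonnet", cost: 0.015, quality: 0.88 },
    { name: "gpt-4", cost: 0.03, quality: 0.95 },
  ],
  0.85
);
console.log(chosen?.name); // "claude-sonnet"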

Always use the highest-quality model:

const orchestrator = new LLMOrchestrator({
  routing_strategy: "quality_first",
  providers: [
    { name: "gpt-3.5", quality: 0.8 },
    { name: "claude-sonnet", quality: 0.88 },
    { name: "gpt-4", quality: 0.95 }, // ✅ Always selected
  ],
});

Route to the fastest-responding provider:

const orchestrator = new LLMOrchestrator({
  routing_strategy: "latency_first",
  latency_tracking: true,
  providers: [
    { name: "gpt-4", avg_latency_ms: 1200 },
    { name: "claude-sonnet", avg_latency_ms: 800 }, // ✅ Fastest
    { name: "gpt-3.5", avg_latency_ms: 900 },
  ],
});

Distribute requests evenly:

const orchestrator = new LLMOrchestrator({
  routing_strategy: "round_robin",
  providers: ["gpt-4", "claude-opus", "gpt-3.5"],
});
// Request 1 → gpt-4
// Request 2 → claude-opus
// Request 3 → gpt-3.5
// Request 4 → gpt-4 (cycle repeats)

Route based on task characteristics:

const orchestrator = new LLMOrchestrator({
  routing_strategy: "context_aware",
  rules: [
    {
      condition: (ctx) => ctx.task_type === "code_generation",
      provider: "gpt-4", // Best for coding
    },
    {
      condition: (ctx) => ctx.task_type === "creative_writing",
      provider: "claude-opus", // Best for creative tasks
    },
    {
      condition: (ctx) => ctx.input_length > 100000,
      provider: "claude-opus", // Large context window
    },
    {
      condition: (ctx) => ctx.priority === "low",
      provider: "gpt-3.5", // Cheap for non-critical tasks
    },
  ],
});

const orchestrator = new LLMOrchestrator({
  fallback_chain: [
    {
      provider: "gpt-4",
      retry_attempts: 2,
      timeout_ms: 30000,
    },
    {
      provider: "claude-opus",
      retry_attempts: 2,
      timeout_ms: 30000,
    },
    {
      provider: "gpt-3.5",
      retry_attempts: 1,
      timeout_ms: 15000,
    },
  ],
  circuit_breaker: {
    failure_threshold: 5,
    reset_timeout_ms: 60000,
  },
});
// Try gpt-4 (2 retries) → if fails → claude-opus (2 retries) → gpt-3.5

orchestrator = LLMOrchestrator(
    fallback_chain=[
        {"provider": "gpt-4", "retry_attempts": 2},
        {"provider": "claude-opus", "retry_attempts": 2},
        {"provider": "gpt-3.5", "retry_attempts": 1}
    ],
    circuit_breaker={
        "failure_threshold": 5,
        "reset_timeout_ms": 60000
    }
)

try:
    response = await orchestrator.complete("Analyze this data...")
except AllProvidersFailedError as e:
    # All providers in chain failed
    logger.error(f"All providers failed: {e.failures}")
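
The circuit_breaker settings above describe a standard circuit-breaker pattern: after failure_threshold consecutive failures a provider is skipped until reset_timeout_ms elapses. A simplified, standalone sketch of that pattern (illustrative only, not ARAL's internal code):

// Simplified per-provider circuit breaker (illustrative).
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private failureThreshold = 5,
    private resetTimeoutMs = 60_000
  ) {}

  // A provider is skipped while its circuit is open.
  isOpen(now = Date.now()): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.resetTimeoutMs) {
      // Reset timeout elapsed: close the circuit and allow a retry.
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(now = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = now; // trip the breaker
  }
}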

Query multiple models and require agreement:

const orchestrator = new LLMOrchestrator({
  mode: "consensus",
  consensus_config: {
    providers: ["gpt-4", "claude-opus", "gemini-ultra"],
    min_agreement: 0.67, // Require 2/3 agreement
    voting_method: "majority",
    tie_breaker: "highest_confidence",
  },
});

const result = await orchestrator.complete({
  prompt: "Is this medical diagnosis correct?",
  context: diagnosticData,
});
// Result includes:
// - consensus_reached: true/false
// - agreement_score: 0.67
// - individual_responses: [...]
// - final_decision: "..."
// - confidence: 0.85

Use Cases:

  • 🏥 Medical diagnoses
  • ⚖️ Legal opinions
  • 💰 Financial decisions
  • 🔐 Security assessments
  • 🎯 High-stakes recommendations
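
For discrete answers, the agreement_score shown in the result above can be read as the share of providers that returned the most common answer. A minimal, illustrative sketch of majority voting (not the SDK implementation):

// Majority vote over discrete model answers (illustrative).
function majorityVote(answers: string[], minAgreement = 0.67) {
  const counts = new Map<string, number>();
  for (const a of answers) counts.set(a, (counts.get(a) ?? 0) + 1);

  let winner = "";
  let winnerCount = 0;
  for (const [answer, count] of counts) {
    if (count > winnerCount) {
      winner = answer;
      winnerCount = count;
    }
  }

  const agreementScore = winnerCount / answers.length;
  return { finalDecision: winner, agreementScore, consensusReached: agreementScore >= minAgreement };
}

console.log(majorityVote(["yes", "yes", "no"], 2 / 3));
// finalDecision: "yes", agreementScore ≈ 0.67, consensusReached: true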

ARAL v1.2+ supports advanced routing strategies that go beyond simple priority-based selection, enabling intelligent model routing based on task type, cost, quality, and latency requirements.

Routes all requests to a single primary model.

llm_config: {
  providers: [
    { provider: "openai", model: "gpt-4", priority: 1 }
  ],
  routing_strategy: "single"
}

Routes to the highest priority available provider, with automatic fallback.

llm_config: {
  providers: [
    { provider: "anthropic", model: "claude-opus-4", priority: 1 },
    { provider: "openai", model: "gpt-4", priority: 2, fallback: false },
    { provider: "openai", model: "gpt-3.5-turbo", priority: 3, fallback: true }
  ],
  routing_strategy: "priority_based"
}

Use Cases:

  • General purpose workflows with fallback
  • Quality-first approach with cost-effective fallbacks

Routes requests to different models based on task type using routing rules.

llm_config: {
  providers: [
    { provider: "anthropic", model: "claude-opus-4", weight: 0.5 },
    { provider: "openai", model: "gpt-4", weight: 0.3 },
    { provider: "anthropic", model: "claude-sonnet-4", weight: 0.2 }
  ],
  routing_strategy: "specialized"
},
extensions: {
  routing_rules: {
    "brainstorming": "claude-opus-4",
    "long_form_content": "gpt-4",
    "refinement": "claude-sonnet-4",
    "code_generation": "gpt-4",
    "creative_fiction": "claude-opus-4",
    "quick_summary": "gpt-3.5-turbo"
  }
}
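
Conceptually, specialized routing is a lookup from task type to model, falling through to the weighted provider list when no rule matches. A hedged sketch using the rule keys from the config above (the default fallback model here is illustrative):

// Illustrative task-type lookup for specialized routing.
const routingRules: Record<string, string> = {
  brainstorming: "claude-opus-4",
  long_form_content: "gpt-4",
  refinement: "claude-sonnet-4",
  code_generation: "gpt-4",
  creative_fiction: "claude-opus-4",
  quick_summary: "gpt-3.5-turbo",
};

function resolveModel(taskType: string, defaultModel = "claude-opus-4"): string {
  // Fall back to the highest-weighted provider when no rule matches.
  return routingRules[taskType] ?? defaultModel;
}

console.log(resolveModel("code_generation")); // "gpt-4"
console.log(resolveModel("translation"));     // "claude-opus-4" (no rule → default)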

Use Cases:

  • Creative applications with distinct workflow stages
  • Code generation + documentation
  • Research (search) + analysis (reasoning)
  • Multi-stage content pipelines

Prefers cheaper models while respecting quality thresholds.

llm_config: {
  providers: [
    { provider: "openai", model: "gpt-3.5-turbo", priority: 1 },
    { provider: "anthropic", model: "claude-sonnet-4", priority: 2 },
    { provider: "anthropic", model: "claude-opus-4", priority: 3 }
  ],
  routing_strategy: "cost_optimized",
  cost_optimization: {
    enabled: true,
    max_cost_per_request: 0.10,
    prefer_cheaper: true,
    budget_limit: 50.0
  }
}
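
max_cost_per_request is compared against an estimated request cost, which follows from token counts and the per-provider cost_per_1k_tokens rates shown earlier: 2,000 input tokens and 500 output tokens on GPT-4 at $0.03/$0.06 per 1K tokens cost 2.0 × 0.03 + 0.5 × 0.06 = $0.09, just under the $0.10 cap. A small sketch of that arithmetic (illustrative, not SDK code):

// Estimate request cost from token counts and per-1K-token rates.
function estimateCost(
  inputTokens: number,
  outputTokens: number,
  rates: { input: number; output: number } // $ per 1K tokens
): number {
  return (inputTokens / 1000) * rates.input + (outputTokens / 1000) * rates.output;
}

const cost = estimateCost(2000, 500, { input: 0.03, output: 0.06 });
console.log(cost.toFixed(2)); // "0.09"
console.log(cost <= 0.1);     // true → request allowed under max_cost_per_request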

Use Cases:

  • High-volume production applications
  • Budget-constrained projects
  • Simple queries that don’t require top-tier models

Always routes to the highest quality model regardless of cost.

llm_config: {
  providers: [
    { provider: "anthropic", model: "claude-opus-4", priority: 1 },
    { provider: "openai", model: "gpt-4", priority: 2, fallback: true }
  ],
  routing_strategy: "quality_first"
}

Use Cases:

  • Critical decision-making
  • High-stakes content generation
  • Complex reasoning tasks
  • Medical/legal/financial applications

Prioritizes fastest response time.

llm_config: {
  providers: [
    { provider: "openai", model: "gpt-3.5-turbo", priority: 1 },
    { provider: "anthropic", model: "claude-sonnet-4", priority: 2 }
  ],
  routing_strategy: "latency_first",
  latency_optimization: {
    enabled: true,
    max_latency_ms: 2000,
    timeout_ms: 5000,
    prefer_faster: true
  }
}

Use Cases:

  • Real-time chat applications
  • Interactive UIs requiring instant feedback
  • High-throughput APIs

Distributes requests evenly across providers for load balancing.

llm_config: {
  providers: [
    { provider: "openai", model: "gpt-4" },
    { provider: "anthropic", model: "claude-opus-4" },
    { provider: "azure", model: "gpt-4" }
  ],
  routing_strategy: "round_robin"
}

Use Cases:

  • Load distribution across multiple API keys
  • Testing and comparison
  • Avoiding rate limits

Combines outputs from multiple models with custom weights.

llm_config: {
  providers: [
    { provider: "openai", model: "gpt-5.2", weight: 0.8 },
    { provider: "anthropic", model: "claude-sonnet-4.5", weight: 0.2 }
  ],
  routing_strategy: "ponderation"
}

See LLM Ponderation section for details.

Queries multiple models and requires agreement.

llm_config: {
  providers: [
    { provider: "openai", model: "gpt-4" },
    { provider: "anthropic", model: "claude-opus-4" },
    { provider: "google", model: "gemini-ultra" }
  ],
  routing_strategy: "consensus",
  aggregation: {
    method: "consensus",
    min_responses: 3,
    selection_criteria: ["agreement_score", "confidence"]
  }
}

Use Cases:

  • High-stakes decisions
  • Verification and validation
  • Reducing hallucination risk

| Strategy       | Cost   | Quality | Speed   | Use Case                    |
| -------------- | ------ | ------- | ------- | --------------------------- |
| Single         | Low    | Varies  | Fast    | Simple applications         |
| Priority-Based | Medium | High    | Fast    | General purpose             |
| Specialized    | Medium | High    | Fast    | Task-specific workflows     |
| Cost-Optimized | Low    | Medium  | Fast    | High-volume, budget-limited |
| Quality-First  | High   | Highest | Medium  | Critical applications       |
| Latency-First  | Low    | Medium  | Fastest | Real-time interactive       |
| Round-Robin    | Medium | Varies  | Fast    | Load balancing              |
| Ponderation    | High   | Highest | Slow    | Ensemble quality            |
| Consensus      | High   | Highest | Slowest | Verification, high-stakes   |
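
The strategy is not fixed forever: the API reference at the end of this page lists setRoutingStrategy(), so an application could, for example, relax from quality_first to cost_optimized when traffic spikes. A hedged sketch of that idea (the load metric and thresholds are hypothetical):

// Hypothetical helper: switch routing strategy based on current request volume.
// `requestsPerMinute` stands in for whatever load metric the application already tracks.
function adjustStrategy(
  orchestrator: { setRoutingStrategy(strategy: string): void },
  requestsPerMinute: number
): void {
  if (requestsPerMinute > 500) {
    orchestrator.setRoutingStrategy("cost_optimized"); // protect the budget under heavy load
  } else {
    orchestrator.setRoutingStrategy("quality_first");  // default to best quality when quiet
  }
}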

When using multiple LLM providers simultaneously, ARAL can aggregate responses using various strategies.

Returns the first successful response.

aggregation: {
  method: "first"
}

Queries multiple models and selects the best response based on criteria.

aggregation: {
  method: "best_of_n",
  selection_criteria: ["creativity", "coherence", "engagement", "originality"],
  min_responses: 2,
  max_responses: 3
}

Example: Creative Specialist Persona

{
  "llm_config": {
    "providers": [
      { "provider": "anthropic", "model": "claude-opus-4", "weight": 0.5 },
      { "provider": "openai", "model": "gpt-4", "weight": 0.3 },
      { "provider": "anthropic", "model": "claude-sonnet-4", "weight": 0.2 }
    ],
    "routing_strategy": "specialized",
    "aggregation": {
      "method": "best_of_n",
      "selection_criteria": [
        "creativity",
        "coherence",
        "originality",
        "engagement"
      ],
      "min_responses": 2,
      "max_responses": 3
    }
  }
}

Selection Criteria Options:

  • accuracy - Factual correctness
  • creativity - Originality and novelty
  • coherence - Logical flow and structure
  • relevance - Alignment with prompt
  • engagement - Reader interest and appeal
  • brevity - Conciseness
  • detail - Comprehensiveness
  • clarity - Ease of understanding
  • technical_depth - Expert-level detail
  • originality - Unique perspectives
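
A useful mental model for best_of_n: score each candidate response per criterion, average the scores, and keep the highest-scoring response. How ARAL actually computes criterion scores is not specified here; the sketch below is purely illustrative:

// Illustrative best-of-N selection: average per-criterion scores and keep the top response.
interface Candidate {
  text: string;
  scores: Record<string, number>; // e.g. { creativity: 0.9, coherence: 0.8 }
}

function bestOfN(candidates: Candidate[], criteria: string[]): Candidate {
  const avg = (c: Candidate) =>
    criteria.reduce((sum, k) => sum + (c.scores[k] ?? 0), 0) / criteria.length;
  return candidates.reduce((best, c) => (avg(c) > avg(best) ? c : best));
}

const winner = bestOfN(
  [
    { text: "Draft A", scores: { creativity: 0.9, coherence: 0.7 } },
    { text: "Draft B", scores: { creativity: 0.8, coherence: 0.9 } },
  ],
  ["creativity", "coherence"]
);
console.log(winner.text); // "Draft B" (average 0.85 vs 0.80)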

Combines responses using provider weights.

aggregation: {
  method: "weighted_blend",
  selection_criteria: ["quality", "relevance"]
}

Uses the weight property from each provider configuration.

Selects the response supported by the majority of models.

aggregation: {
  method: "majority_vote",
  min_responses: 3
}

Requires high agreement between all models.

aggregation: {
  method: "consensus",
  selection_criteria: ["agreement_score", "confidence"],
  min_responses: 3
}

Combines strengths of multiple responses into a synthesized output.

aggregation: {
  method: "ensemble",
  selection_criteria: ["accuracy", "creativity", "completeness"],
  min_responses: 2,
  max_responses: 4
}

| Method         | Speed   | Cost    | Quality | Use Case                       |
| -------------- | ------- | ------- | ------- | ------------------------------ |
| First          | Fastest | Low     | Varies  | Simple, cost-effective         |
| Best-of-N      | Slow    | High    | High    | Quality-critical, creative     |
| Weighted Blend | Slow    | High    | High    | Balanced ensemble              |
| Majority Vote  | Medium  | High    | Medium  | Objective decisions            |
| Consensus      | Slowest | Highest | Highest | Mission-critical, verification |
| Ensemble       | Slowest | Highest | Highest | Comprehensive analysis         |

ARAL v1.2+ supports customizing prompts per LLM model, enabling specialized instructions that leverage each model’s unique strengths.

{
  "prompts": {
    "system": "You are a Creative Specialist...",
    "prefix": "Creative task:\n\n",
    "llm_specific": {
      "claude-opus-4": {
        "system": "You excel at deep, nuanced creative work with rich narrative exploration.",
        "prefix": "For this creative challenge, bring your deepest creative capabilities:\n\n"
      },
      "gpt-4": {
        "system": "You excel at structured creativity with clear organization.",
        "prefix": "Creative task requiring structured innovation:\n\n"
      },
      "claude-sonnet-4": {
        "system": "You provide balanced creative refinement, enhancing coherence and polish.",
        "prefix": "Creative refinement task:\n\n"
      }
    }
  }
}

1. Leveraging Model Strengths

// Claude Opus: Deep creative exploration
"claude-opus-4": {
  "system": "Focus on deep creative exploration with vivid imagery and emotional resonance."
}

// GPT-4: Structured problem-solving
"gpt-4": {
  "system": "Focus on clear structure, logical organization, and actionable insights."
}

// Claude Sonnet: Balance and refinement
"claude-sonnet-4": {
  "system": "Focus on balance, coherence, and polishing for professional quality."
}

2. Task-Specific Specialization

{
  "extensions": {
    "routing_rules": {
      "brainstorming": "claude-opus-4",
      "structuring": "gpt-4",
      "refinement": "claude-sonnet-4"
    }
  },
  "prompts": {
    "llm_specific": {
      "claude-opus-4": {
        "prefix": "BRAINSTORM MODE: Generate wild, creative ideas without constraints.\n\n"
      },
      "gpt-4": {
        "prefix": "STRUCTURE MODE: Organize and structure the content logically.\n\n"
      },
      "claude-sonnet-4": {
        "prefix": "REFINEMENT MODE: Polish and enhance for clarity and quality.\n\n"
      }
    }
  }
}

3. Output Format Customization

{
  "llm_specific": {
    "gpt-4": {
      "suffix": "\n\nProvide response in JSON format."
    },
    "claude-opus-4": {
      "suffix": "\n\nProvide response in markdown with rich formatting."
    }
  }
}


Ponderation combines outputs from multiple LLMs with custom weights, creating ensemble responses that leverage each model’s strengths.

Example: GPT-5.2 at weight 0.8 + Claude Sonnet 4.5 at weight 0.2

  • 80% weight to GPT-5.2 (strong reasoning)
  • 20% weight to Claude Sonnet 4.5 (creativity, nuance)

const orchestrator = new LLMOrchestrator({
  mode: "ponderation",
  ponderation: {
    enabled: true,
    weights: {
      "gpt-5.2": 0.8,
      "claude-sonnet-4.5": 0.2,
    },
    merge_strategy: "weighted_blend",
    normalize: true, // Auto-normalize weights to sum to 1.0
  },
});

const result = await orchestrator.complete({
  prompt: "Explain quantum computing in simple terms",
});
// Result: 80% GPT-5.2 + 20% Claude Sonnet 4.5 blended response

from aral import LLMOrchestrator, PonderationConfig

orchestrator = LLMOrchestrator(
    mode='ponderation',
    ponderation=PonderationConfig(
        enabled=True,
        weights={
            'gpt-5.2': 0.8,
            'claude-sonnet-4.5': 0.2
        },
        merge_strategy='weighted_blend',
        normalize=True
    )
)

response = await orchestrator.complete(
    "Explain quantum computing in simple terms"
)
# Response combines:
# - GPT-5.2's technical accuracy (80%)
# - Claude's clear explanations (20%)

Different weights for different task types:

const orchestrator = new LLMOrchestrator({
  mode: "ponderation",
  adaptive_weighting: {
    code_generation: {
      "gpt-4": 0.7,
      "claude-opus": 0.2,
      codellama: 0.1,
    },
    creative_writing: {
      "claude-opus": 0.6,
      "gpt-4": 0.3,
      "mistral-large": 0.1,
    },
    data_analysis: {
      "gpt-4": 0.5,
      "claude-sonnet": 0.3,
      "gemini-pro": 0.2,
    },
  },
});

// Automatically selects weights based on detected task type
const result = await orchestrator.complete({
  prompt: "Write a Python function to analyze sales data",
  task_type: "code_generation", // Uses code_generation weights
});

Adjust weights based on model confidence scores:

const orchestrator = new LLMOrchestrator({
  mode: "ponderation",
  dynamic_weighting: {
    enabled: true,
    base_weights: {
      "gpt-5.2": 0.5,
      "claude-sonnet-4.5": 0.5,
    },
    confidence_threshold: 0.8,
    adjustment_factor: 0.2, // Boost weight by 20% if confidence > threshold
  },
});
// If GPT-5.2 has confidence 0.95 and Claude has 0.75:
// GPT-5.2: 0.5 + (0.2 * 0.5) = 0.6
// Claude: 0.5 - (0.2 * 0.5) = 0.4
// (Automatically normalized to sum to 1.0)
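
The arithmetic in the comments above can be written out directly: a model above the confidence threshold gains adjustment_factor × its base weight, the other loses the same amount, and the result is renormalized. A standalone sketch of that calculation, assuming exactly two models (illustrative):

// Confidence-based adjustment for two models (illustrative).
function adjustWeights(
  base: { [model: string]: number },       // e.g. { "gpt-5.2": 0.5, "claude-sonnet-4.5": 0.5 }
  confidence: { [model: string]: number }, // e.g. { "gpt-5.2": 0.95, "claude-sonnet-4.5": 0.75 }
  confidenceThreshold = 0.8,
  adjustmentFactor = 0.2
): { [model: string]: number } {
  const adjusted: { [model: string]: number } = {};
  for (const model of Object.keys(base)) {
    const delta = adjustmentFactor * base[model];
    // Boost models above the threshold, penalize those below it.
    adjusted[model] = confidence[model] >= confidenceThreshold ? base[model] + delta : base[model] - delta;
  }
  // Renormalize so the weights sum to 1.0.
  const total = Object.values(adjusted).reduce((a, b) => a + b, 0);
  for (const model of Object.keys(adjusted)) adjusted[model] /= total;
  return adjusted;
}

console.log(
  adjustWeights({ "gpt-5.2": 0.5, "claude-sonnet-4.5": 0.5 }, { "gpt-5.2": 0.95, "claude-sonnet-4.5": 0.75 })
);
// ≈ { "gpt-5.2": 0.6, "claude-sonnet-4.5": 0.4 }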

Different weights for different stages of processing:

const pipeline = new LLMPipeline([
  {
    stage: "research",
    ponderation: {
      "gpt-4": 0.4,
      "claude-opus": 0.4,
      "gemini-pro": 0.2,
    },
  },
  {
    stage: "analysis",
    ponderation: {
      "gpt-5.2": 0.7,
      "claude-sonnet": 0.3,
    },
  },
  {
    stage: "synthesis",
    ponderation: {
      "claude-opus": 0.6,
      "gpt-4": 0.4,
    },
  },
]);

const result = await pipeline.execute(inputData);

| Strategy            | Description                        | Use Case               |
| ------------------- | ---------------------------------- | ---------------------- |
| weighted_blend      | Combine text outputs with weights  | General responses      |
| weighted_voting     | Vote on discrete choices           | Classification, yes/no |
| weighted_average    | Average numerical outputs          | Scoring, predictions   |
| confidence_weighted | Use confidence scores as weights   | Dynamic blending       |
| best_of_ensemble    | Select highest-confidence response | Quality filtering      |

// Query both models
const gptResponse = await gpt52.complete(prompt);
const claudeResponse = await claudeSonnet.complete(prompt);

// Blend with 80/20 weights
const blended = weightedBlend([
  { response: gptResponse, weight: 0.8 },
  { response: claudeResponse, weight: 0.2 },
]);

console.log(blended.text);
// Output combines 80% GPT + 20% Claude
// Preserves style/tone weighted toward GPT
// Incorporates Claude's insights at 20%

// 70% accuracy-focused, 30% creativity-focused
ponderation: {
  'gpt-4': 0.7,       // Strong reasoning
  'claude-opus': 0.3  // Creative insights
}

// 90% cheap model, 10% expensive for quality control
ponderation: {
  'gpt-3.5-turbo': 0.9, // Cheap
  'gpt-4': 0.1          // Quality check
}

// Medical diagnosis: blend specialist models
ponderation: {
  'medical-llm': 0.6,  // Domain-specific
  'gpt-4': 0.3,        // General reasoning
  'claude-opus': 0.1   // Safety check
}

// Translation with native speaker quality
ponderation: {
  'gpt-4': 0.5,                  // Technical accuracy
  'claude-multilingual': 0.3,    // Fluency
  'specialized-translator': 0.2  // Domain terms
}

Pros:

  • ✅ Better quality than single model
  • ✅ Reduces individual model biases
  • ✅ Leverages complementary strengths
  • ✅ More robust than single-model dependency

Cons:

  • ❌ Higher latency (queries multiple models)
  • ❌ Increased cost (multiple API calls)
  • ❌ Complex error handling

Optimization:

// Parallel execution for speed
const orchestrator = new LLMOrchestrator({
  ponderation: {
    weights: { "gpt-5.2": 0.8, "claude-sonnet": 0.2 },
    execution: "parallel", // Query simultaneously
    timeout_ms: 30000,
    cache_enabled: true, // Cache repeated queries
  },
});

// ARAL automatically validates and normalizes weights
const orchestrator = new LLMOrchestrator({
  ponderation: {
    weights: {
      "gpt-5.2": 4, // Raw weight
      "claude-sonnet": 1, // Raw weight
    },
    normalize: true, // Auto-converts to 0.8 and 0.2
  },
});
// Validation rules:
// - Weights must be positive
// - Sum must be > 0
// - Auto-normalized to sum to 1.0
// - Warning if imbalance > 0.95 (one model dominates)
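
Normalization itself is just dividing each raw weight by their sum, plus the sanity checks listed above. A minimal sketch (illustrative, not the SDK's validator):

// Normalize raw weights to sum to 1.0, mirroring the validation rules above (illustrative).
function normalizeWeights(raw: Record<string, number>): Record<string, number> {
  const entries = Object.entries(raw);
  if (entries.some(([, w]) => w <= 0)) throw new Error("Weights must be positive");
  const total = entries.reduce((sum, [, w]) => sum + w, 0);
  if (total <= 0) throw new Error("Weight sum must be > 0");

  const normalized = Object.fromEntries(entries.map(([k, w]) => [k, w / total]));
  if (Math.max(...Object.values(normalized)) > 0.95) {
    console.warn("One model dominates the blend (weight > 0.95)");
  }
  return normalized;
}

console.log(normalizeWeights({ "gpt-5.2": 4, "claude-sonnet": 1 }));
// { "gpt-5.2": 0.8, "claude-sonnet": 0.2 }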

const orchestrator = new LLMOrchestrator({
  moderation: {
    enabled: true,
    provider: 'openai', // Use OpenAI moderation API
    input_checks: {
      hate_speech: true,
      sexual_content: true,
      violence: true,
      self_harm: true,
      pii_detection: true
    },
    output_checks: {
      factual_accuracy: false, // Requires external fact-checker
      bias_detection: true,
      toxicity_threshold: 0.7
    },
    actions: {
      block_on_violation: true,
      log_violations: true,
      notify_admin: true
    }
  }
});

// Example: Blocked harmful request
try {
  await orchestrator.complete("How to make explosives");
} catch (e) {
  if (e instanceof ModerationViolationError) {
    console.log(`Blocked: ${e.category} (score: ${e.confidence})`);
    // Blocked: violence/dangerous-content (score: 0.95)
  }
}

from aral import LLMOrchestrator, Guardrail, GuardrailResult

class PIIGuardrail(Guardrail):
    """Detect and block personally identifiable information."""

    def check_input(self, text: str) -> GuardrailResult:
        # Check for SSN, credit cards, etc.
        if self.contains_pii(text):
            return GuardrailResult(
                blocked=True,
                reason="PII detected in input",
                violations=["ssn", "credit_card"]
            )
        return GuardrailResult(blocked=False)

    def check_output(self, text: str) -> GuardrailResult:
        # Redact any PII in output
        if self.contains_pii(text):
            return GuardrailResult(
                blocked=False,
                modified=True,
                text=self.redact_pii(text)
            )
        return GuardrailResult(blocked=False)

orchestrator = LLMOrchestrator(
    guardrails=[
        PIIGuardrail(),
        ToxicityGuardrail(threshold=0.8),
        BiasDetectionGuardrail()
    ]
)

const orchestrator = new LLMOrchestrator({
  tracking: {
    enabled: true,
    metrics: ["cost", "latency", "tokens", "errors"],
  },
});

// After some usage
const stats = await orchestrator.getStats();
console.log(stats);
// {
//   openai: {
//     requests: 150,
//     total_cost: 4.52,
//     avg_latency_ms: 1100,
//     tokens_used: 150000,
//     errors: 2
//   },
//   anthropic: {
//     requests: 50,
//     total_cost: 1.23,
//     avg_latency_ms: 850,
//     tokens_used: 82000,
//     errors: 0
//   }
// }

const orchestrator = new LLMOrchestrator({
  budget: {
    daily_limit: 100.0, // $100/day
    per_request_limit: 0.5, // $0.50/request max
    alert_threshold: 0.8, // Alert at 80% of budget
    action_on_exceed: "switch_to_cheapest", // or 'block'
  },
});

// When approaching limit:
orchestrator.on("budget_alert", (event) => {
  console.log(`Budget at ${event.percentage}% - switching to cheaper models`);
});
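
Conceptually, budget enforcement is a running daily total compared against the limit: an alert fires once spend crosses alert_threshold × daily_limit, and action_on_exceed applies once the limit itself is hit. The sketch below mirrors that bookkeeping outside the SDK (field names follow the config above; the implementation is illustrative):

// Minimal daily budget tracker (illustrative).
class BudgetTracker {
  private spentToday = 0;

  constructor(
    private dailyLimit: number,    // e.g. 100.0 ($/day)
    private alertThreshold: number // e.g. 0.8 (alert at 80%)
  ) {}

  record(cost: number): "ok" | "alert" | "exceeded" {
    this.spentToday += cost;
    if (this.spentToday >= this.dailyLimit) return "exceeded"; // apply action_on_exceed
    if (this.spentToday >= this.dailyLimit * this.alertThreshold) return "alert"; // fire budget_alert
    return "ok";
  }

  resetDay(): void {
    this.spentToday = 0;
  }
}

const budget = new BudgetTracker(100.0, 0.8);
console.log(budget.record(85.0)); // "alert" → switch to cheaper models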

Use different models for different stages:

const pipeline = new LLMPipeline([
  {
    stage: "data_collection",
    provider: "gpt-3.5", // Fast and cheap for simple tasks
    prompt: "Extract key facts from: {{input}}",
  },
  {
    stage: "analysis",
    provider: "gpt-4", // High quality for analysis
    prompt: "Analyze these facts: {{previous_output}}",
  },
  {
    stage: "report_writing",
    provider: "claude-opus", // Best for creative writing
    prompt: "Write a comprehensive report: {{previous_output}}",
  },
]);

const result = await pipeline.execute(inputData);

Query multiple models simultaneously:

const orchestrator = new LLMOrchestrator({
  mode: "parallel",
  providers: ["gpt-4", "claude-opus", "gemini-pro"],
});

const results = await orchestrator.completeParallel({
  prompt: "What are the top 3 risks in this contract?",
  aggregation: "merge_unique", // Combine unique insights
});
// Result contains insights from all 3 models

Compare model performance:

const orchestrator = new LLMOrchestrator({
  ab_testing: {
    enabled: true,
    variants: [
      { name: "control", provider: "gpt-4", traffic: 0.5 },
      { name: "variant", provider: "claude-opus", traffic: 0.5 },
    ],
    metrics: ["quality", "latency", "cost", "user_satisfaction"],
  },
});
// Automatically routes traffic and collects metrics

// ✅ Good: Always have fallback
fallback_chain: ["primary", "secondary", "tertiary"];
// ❌ Bad: Single point of failure
providers: ["gpt-4"]; // What if OpenAI is down?

// ✅ Good: Protect users and comply with regulations
moderation: { enabled: true, provider: 'openai' }
// ❌ Bad: No safety checks
moderation: { enabled: false } // Risky!

// ✅ Good: Set budget limits
budget: { daily_limit: 100.00, alert_threshold: 0.80 }
// ❌ Bad: Unlimited spending
// Could lead to unexpected bills

// ✅ Good: Ensure minimum quality
routing_strategy: 'cost_optimized',
quality_threshold: 0.85
// ❌ Bad: Cheapest without quality check
routing_strategy: 'cost_optimized' // Might use low-quality models

// ✅ Good: Use environment variables or key vaults
apiKey: process.env.OPENAI_API_KEY;
// ❌ Bad: Hardcoded keys
apiKey: "sk-1234..."; // Security risk!

// ✅ Good: Use providers with data protection
providers: [
  {
    name: "azure",
    data_residency: "eu-west", // GDPR compliant
    no_training: true, // Don't use for model training
  },
];

orchestrator.on("request", (event) => {
  auditLog.record({
    timestamp: event.timestamp,
    provider: event.provider,
    model: event.model,
    input_hash: hash(event.input), // Don't log sensitive data
    user_id: event.user_id,
    cost: event.cost,
  });
});

Multi-LLM orchestration complies with:

  • ARAL-CORE-1.0 (L4: Reasoning, requirements L4-013 to L4-020)
  • GDPR: Data residency, right to erasure
  • SOC 2: Audit logging, access controls
  • HIPAA: PHI protection, moderation
  • ISO 27001: Security controls

class LLMOrchestrator {
  constructor(config: OrchestratorConfig);

  async complete(
    prompt: string,
    options?: CompletionOptions
  ): Promise<LLMResponse>;

  async completeParallel(
    prompt: string,
    options?: ParallelOptions
  ): Promise<LLMResponse[]>;

  async completeWithConsensus(
    prompt: string,
    options?: ConsensusOptions
  ): Promise<ConsensusResult>;

  getStats(): ProviderStats;
  getProvider(name: string): LLMProvider;
  setRoutingStrategy(strategy: RoutingStrategy): void;
  on(event: OrchestratorEvent, handler: EventHandler): void;
}
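
A short usage sketch based on the signatures above (the configuration object is abbreviated, and the consensus result fields follow the comments in the consensus example earlier on this page):

// Illustrative usage of the orchestrator API listed above.
const orchestrator = new LLMOrchestrator({ /* providers, routing_strategy, ... */ });

// Standard completion with the configured routing strategy.
const answer = await orchestrator.complete("Summarize this contract clause.");

// Consensus completion for a high-stakes question.
const verdict = await orchestrator.completeWithConsensus("Is this clause enforceable?");
console.log(verdict.consensus_reached, verdict.agreement_score);

// Switch strategies and inspect per-provider usage at runtime.
orchestrator.setRoutingStrategy("cost_optimized");
console.log(orchestrator.getStats());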

Yes, ARAL fully supports multi-LLM orchestration and moderation!

Key capabilities:

  • ✅ Multiple provider support (OpenAI, Anthropic, Azure, etc.)
  • ✅ Intelligent routing (cost, quality, latency)
  • ✅ Automatic fallback chains
  • ✅ Load balancing
  • ✅ Consensus mode for critical decisions
  • ✅ Content moderation and safety guardrails
  • ✅ Cost tracking and budget limits
  • ✅ A/B testing and performance monitoring

Requirements: ARAL v1.2.0+ with Layer 4 (Reasoning) enhancements.



© 2026 ARAL Standard — CC BY 4.0