
Multi-LLM Orchestration

ARAL v1.2+ — Intelligent routing, fallback, aggregation, and optimization across multiple LLM providers


ARAL supports advanced multi-LLM orchestration, enabling agents to:

  • Use multiple LLM providers (OpenAI, Anthropic, Azure, Google, Cohere, etc.)
  • Route requests intelligently (specialized, cost-optimized, quality-first, latency-first)
  • Blend models with custom weights (weighted ponderation)
  • Aggregate responses (best-of-n, ensemble, consensus, voting)
  • Fall back automatically (provider A → B → C on failure)
  • Balance load across providers (distribute requests)
  • Optimize cost and latency (budget limits, maximum latency)
  • Select models per task (specialized routing)
  • Customize prompts per LLM (model-specific prompts)
  • Moderate content (safety filters, guardrails)
  • Track costs (per-provider usage and spend monitoring)

| Provider  | Models                            | Status |
| --------- | --------------------------------- | ------ |
| OpenAI    | GPT-4, GPT-4-turbo, GPT-3.5-turbo | ✅     |
| Azure     | Azure OpenAI models               | ✅     |
| Anthropic | Claude 4 (Opus, Sonnet), Claude 3 | ✅     |
| Google    | Gemini Pro, Gemini Ultra          | 🔜     |
| Cohere    | Command, Command-R                | 🔜     |
| Mistral   | Mistral Large, Medium, Small      | 🔜     |
| Local     | Ollama, LM Studio, vLLM           | 🔜     |

┌────────────────────────────────────────┐
│       LLM Orchestrator (Layer 4)       │
│ ┌────────────────────────────────────┐ │
│ │ Router (Specialized Routing)       │ │
│ │ - Task-based model selection       │ │
│ │ - Cost optimization                │ │
│ │ - Quality-based selection          │ │
│ │ - Latency optimization             │ │
│ │ - Load balancing                   │ │
│ │ - Priority-based routing           │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ Aggregation Engine                 │ │
│ │ - Best-of-N selection              │ │
│ │ - Ensemble combination             │ │
│ │ - Consensus voting                 │ │
│ │ - Weighted blending                │ │
│ │ - Quality-based ranking            │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ Fallback Manager                   │ │
│ │ - Primary → Secondary → Tertiary   │ │
│ │ - Circuit breaker per provider     │ │
│ │ - Health monitoring                │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ Optimization Layer                 │ │
│ │ - Cost tracking & budgets          │ │
│ │ - Latency monitoring               │ │
│ │ - Quality scoring                  │ │
│ │ - Performance analytics            │ │
│ └────────────────────────────────────┘ │
│ ┌────────────────────────────────────┐ │
│ │ Moderation Layer                   │ │
│ │ - Content safety filters           │ │
│ │ - PII detection                    │ │
│ │ - Guardrails enforcement           │ │
│ └────────────────────────────────────┘ │
└────────────────────────────────────────┘
      │             │             │
      ▼             ▼             ▼
┌─────────┐   ┌─────────┐   ┌─────────┐
│ OpenAI  │   │Anthropic│   │  Azure  │
└─────────┘   └─────────┘   └─────────┘

import { ARALAgent, LLMOrchestrator } from "@aral-standard/sdk";

const agent = new ARALAgent({
  persona: "assistant.json",
  llm: new LLMOrchestrator({
    providers: [
      {
        name: "openai",
        type: "openai",
        apiKey: process.env.OPENAI_API_KEY,
        models: ["gpt-4", "gpt-3.5-turbo"],
        priority: 1,
        cost_per_1k_tokens: { input: 0.03, output: 0.06 },
      },
      {
        name: "anthropic",
        type: "anthropic",
        apiKey: process.env.ANTHROPIC_API_KEY,
        models: ["claude-3-opus", "claude-3-sonnet"],
        priority: 2,
        cost_per_1k_tokens: { input: 0.015, output: 0.075 },
      },
      {
        name: "azure",
        type: "azure",
        endpoint: process.env.AZURE_ENDPOINT,
        apiKey: process.env.AZURE_API_KEY,
        models: ["gpt-4-azure"],
        priority: 3,
        cost_per_1k_tokens: { input: 0.03, output: 0.06 },
      },
    ],
    routing_strategy: "cost_optimized",
    fallback_enabled: true,
    moderation: {
      enabled: true,
      provider: "openai",
    },
  }),
});

import os

from aral import ARALAgent, LLMOrchestrator, ProviderConfig

agent = ARALAgent(
    persona_id="assistant",
    llm=LLMOrchestrator(
        providers=[
            ProviderConfig(
                name="openai",
                type="openai",
                api_key=os.getenv("OPENAI_API_KEY"),
                models=["gpt-4", "gpt-3.5-turbo"],
                priority=1,
                cost_per_1k_tokens={"input": 0.03, "output": 0.06}
            ),
            ProviderConfig(
                name="anthropic",
                type="anthropic",
                api_key=os.getenv("ANTHROPIC_API_KEY"),
                models=["claude-3-opus", "claude-3-sonnet"],
                priority=2,
                cost_per_1k_tokens={"input": 0.015, "output": 0.075}
            )
        ],
        routing_strategy="cost_optimized",
        fallback_enabled=True,
        moderation={"enabled": True, "provider": "openai"}
    )
)

Route to the cheapest model that meets quality requirements:

const orchestrator = new LLMOrchestrator({
  routing_strategy: "cost_optimized",
  quality_threshold: 0.85, // Min quality score
  providers: [
    { name: "gpt-3.5", cost: 0.002, quality: 0.8 }, // Cheapest, below threshold
    { name: "claude-sonnet", cost: 0.015, quality: 0.88 }, // ✅ Selected (cheapest meeting threshold)
    { name: "gpt-4", cost: 0.03, quality: 0.95 }, // Most expensive
  ],
});
// Result: Uses claude-sonnet (meets quality + cheapest)
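
The selection logic behind this strategy is simple: drop providers below the quality threshold, then take the cheapest of what remains. The standalone TypeScript sketch below illustrates that rule (it is not the SDK's internal implementation):

interface ProviderInfo {
  name: string;
  cost: number;    // $ per 1K tokens (illustrative)
  quality: number; // 0..1 quality score
}

// Hypothetical helper: pick the cheapest provider meeting the quality threshold.
function pickCostOptimized(providers: ProviderInfo[], qualityThreshold: number): ProviderInfo | undefined {
  return providers
    .filter((p) => p.quality >= qualityThreshold) // drop anything below the quality bar
    .sort((a, b) => a.cost - b.cost)[0];          // cheapest remaining provider wins
}

const chosen = pickCostOptimized(
  [
    { name: "gpt-3.5", cost: 0.002, quality: 0.8 },
    { name: "claude-sonnet", cost: 0.015, quality: 0.88 },
    { name: "gpt-4", cost: 0.03, quality: 0.95 },
  ],
  0.85
);
console.log(chosen?.name); // "claude-sonnet"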

Always use the highest-quality model:

const orchestrator = new LLMOrchestrator({
  routing_strategy: "quality_first",
  providers: [
    { name: "gpt-3.5", quality: 0.8 },
    { name: "claude-sonnet", quality: 0.88 },
    { name: "gpt-4", quality: 0.95 }, // ✅ Always selected
  ],
});

Route to the fastest-responding provider:

const orchestrator = new LLMOrchestrator({
  routing_strategy: "latency_first",
  latency_tracking: true,
  providers: [
    { name: "gpt-4", avg_latency_ms: 1200 },
    { name: "claude-sonnet", avg_latency_ms: 800 }, // ✅ Fastest
    { name: "gpt-3.5", avg_latency_ms: 900 },
  ],
});

Distribute requests evenly:

const orchestrator = new LLMOrchestrator({
  routing_strategy: "round_robin",
  providers: ["gpt-4", "claude-opus", "gpt-3.5"],
});
// Request 1 → gpt-4
// Request 2 → claude-opus
// Request 3 → gpt-3.5
// Request 4 → gpt-4 (cycle repeats)

Route based on task characteristics:

const orchestrator = new LLMOrchestrator({
  routing_strategy: "context_aware",
  rules: [
    {
      condition: (ctx) => ctx.task_type === "code_generation",
      provider: "gpt-4", // Best for coding
    },
    {
      condition: (ctx) => ctx.task_type === "creative_writing",
      provider: "claude-opus", // Best for creative tasks
    },
    {
      condition: (ctx) => ctx.input_length > 100000,
      provider: "claude-opus", // Large context window
    },
    {
      condition: (ctx) => ctx.priority === "low",
      provider: "gpt-3.5", // Cheap for non-critical tasks
    },
  ],
});

const orchestrator = new LLMOrchestrator({
  fallback_chain: [
    {
      provider: "gpt-4",
      retry_attempts: 2,
      timeout_ms: 30000,
    },
    {
      provider: "claude-opus",
      retry_attempts: 2,
      timeout_ms: 30000,
    },
    {
      provider: "gpt-3.5",
      retry_attempts: 1,
      timeout_ms: 15000,
    },
  ],
  circuit_breaker: {
    failure_threshold: 5,
    reset_timeout_ms: 60000,
  },
});
// Try gpt-4 (2 retries) → if fails → claude-opus (2 retries) → gpt-3.5

orchestrator = LLMOrchestrator(
    fallback_chain=[
        {"provider": "gpt-4", "retry_attempts": 2},
        {"provider": "claude-opus", "retry_attempts": 2},
        {"provider": "gpt-3.5", "retry_attempts": 1}
    ],
    circuit_breaker={
        "failure_threshold": 5,
        "reset_timeout_ms": 60000
    }
)

try:
    response = await orchestrator.complete("Analyze this data...")
except AllProvidersFailedError as e:
    # All providers in chain failed
    logger.error(f"All providers failed: {e.failures}")
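
The circuit_breaker settings above describe a standard circuit-breaker pattern: after failure_threshold consecutive failures a provider is skipped until reset_timeout_ms elapses. A simplified, standalone sketch of that pattern (illustrative only, not ARAL's internal code):

// Simplified per-provider circuit breaker (illustrative).
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private failureThreshold = 5,
    private resetTimeoutMs = 60_000
  ) {}

  // A provider is skipped while its circuit is open.
  isOpen(now = Date.now()): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.resetTimeoutMs) {
      // Reset timeout elapsed: close the circuit and allow a retry.
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(now = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = now; // trip the breaker
  }
}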

Query multiple models and require agreement:

const orchestrator = new LLMOrchestrator({
  mode: "consensus",
  consensus_config: {
    providers: ["gpt-4", "claude-opus", "gemini-ultra"],
    min_agreement: 0.67, // Require 2/3 agreement
    voting_method: "majority",
    tie_breaker: "highest_confidence",
  },
});

const result = await orchestrator.complete({
  prompt: "Is this medical diagnosis correct?",
  context: diagnosticData,
});
// Result includes:
// - consensus_reached: true/false
// - agreement_score: 0.67
// - individual_responses: [...]
// - final_decision: "..."
// - confidence: 0.85

Use Cases:

  • 🏥 Medical diagnoses
  • ⚖️ Legal opinions
  • 💰 Financial decisions
  • 🔐 Security assessments
  • 🎯 High-stakes recommendations
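
For discrete answers, the agreement_score shown in the result above can be read as the share of providers that returned the most common answer. A minimal, illustrative sketch of majority voting (not the SDK implementation):

// Majority vote over discrete model answers (illustrative).
function majorityVote(answers: string[], minAgreement = 0.67) {
  const counts = new Map<string, number>();
  for (const a of answers) counts.set(a, (counts.get(a) ?? 0) + 1);

  let winner = "";
  let winnerCount = 0;
  for (const [answer, count] of counts) {
    if (count > winnerCount) {
      winner = answer;
      winnerCount = count;
    }
  }

  const agreementScore = winnerCount / answers.length;
  return { finalDecision: winner, agreementScore, consensusReached: agreementScore >= minAgreement };
}

console.log(majorityVote(["yes", "yes", "no"], 2 / 3));
// finalDecision: "yes", agreementScore ≈ 0.67, consensusReached: true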

ARAL v1.2+ supports advanced routing strategies that go beyond simple priority-based selection, enabling intelligent model routing based on task type, cost, quality, and latency requirements.

Routes all requests to a single primary model.

llm_config: {
  providers: [
    { provider: "openai", model: "gpt-4", priority: 1 }
  ],
  routing_strategy: "single"
}

Routes to the highest priority available provider, with automatic fallback.

llm_config: {
  providers: [
    { provider: "anthropic", model: "claude-opus-4", priority: 1 },
    { provider: "openai", model: "gpt-4", priority: 2, fallback: false },
    { provider: "openai", model: "gpt-3.5-turbo", priority: 3, fallback: true }
  ],
  routing_strategy: "priority_based"
}

Use Cases:

  • General purpose workflows with fallback
  • Quality-first approach with cost-effective fallbacks

Routes requests to different models based on task type using routing rules.

llm_config: {
  providers: [
    { provider: "anthropic", model: "claude-opus-4", weight: 0.5 },
    { provider: "openai", model: "gpt-4", weight: 0.3 },
    { provider: "anthropic", model: "claude-sonnet-4", weight: 0.2 }
  ],
  routing_strategy: "specialized"
},
extensions: {
  routing_rules: {
    "brainstorming": "claude-opus-4",
    "long_form_content": "gpt-4",
    "refinement": "claude-sonnet-4",
    "code_generation": "gpt-4",
    "creative_fiction": "claude-opus-4",
    "quick_summary": "gpt-3.5-turbo"
  }
}
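
Conceptually, specialized routing is a lookup from task type to model, falling through to the weighted provider list when no rule matches. A hedged sketch using the rule keys from the config above (the default fallback model here is illustrative):

// Illustrative task-type lookup for specialized routing.
const routingRules: Record<string, string> = {
  brainstorming: "claude-opus-4",
  long_form_content: "gpt-4",
  refinement: "claude-sonnet-4",
  code_generation: "gpt-4",
  creative_fiction: "claude-opus-4",
  quick_summary: "gpt-3.5-turbo",
};

function resolveModel(taskType: string, defaultModel = "claude-opus-4"): string {
  // Fall back to the highest-weighted provider when no rule matches.
  return routingRules[taskType] ?? defaultModel;
}

console.log(resolveModel("code_generation")); // "gpt-4"
console.log(resolveModel("translation"));     // "claude-opus-4" (no rule → default)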

Use Cases:

  • Creative applications with distinct workflow stages
  • Code generation + documentation
  • Research (search) + analysis (reasoning)
  • Multi-stage content pipelines

Prefers cheaper models while respecting quality thresholds.

llm_config: {
  providers: [
    { provider: "openai", model: "gpt-3.5-turbo", priority: 1 },
    { provider: "anthropic", model: "claude-sonnet-4", priority: 2 },
    { provider: "anthropic", model: "claude-opus-4", priority: 3 }
  ],
  routing_strategy: "cost_optimized",
  cost_optimization: {
    enabled: true,
    max_cost_per_request: 0.10,
    prefer_cheaper: true,
    budget_limit: 50.0
  }
}
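
max_cost_per_request is compared against an estimated request cost, which follows from token counts and the per-provider cost_per_1k_tokens rates shown earlier: 2,000 input tokens and 500 output tokens on GPT-4 at $0.03/$0.06 per 1K tokens cost 2.0 × 0.03 + 0.5 × 0.06 = $0.09, just under the $0.10 cap. A small sketch of that arithmetic (illustrative, not SDK code):

// Estimate request cost from token counts and per-1K-token rates.
function estimateCost(
  inputTokens: number,
  outputTokens: number,
  rates: { input: number; output: number } // $ per 1K tokens
): number {
  return (inputTokens / 1000) * rates.input + (outputTokens / 1000) * rates.output;
}

const cost = estimateCost(2000, 500, { input: 0.03, output: 0.06 });
console.log(cost.toFixed(2)); // "0.09"
console.log(cost <= 0.1);     // true → request allowed under max_cost_per_request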

Use Cases:

  • High-volume production applications
  • Budget-constrained projects
  • Simple queries that don’t require top-tier models

Always routes to the highest quality model regardless of cost.

llm_config: {
  providers: [
    { provider: "anthropic", model: "claude-opus-4", priority: 1 },
    { provider: "openai", model: "gpt-4", priority: 2, fallback: true }
  ],
  routing_strategy: "quality_first"
}

Use Cases:

  • Critical decision-making
  • High-stakes content generation
  • Complex reasoning tasks
  • Medical/legal/financial applications

Prioritizes fastest response time.

llm_config: {
  providers: [
    { provider: "openai", model: "gpt-3.5-turbo", priority: 1 },
    { provider: "anthropic", model: "claude-sonnet-4", priority: 2 }
  ],
  routing_strategy: "latency_first",
  latency_optimization: {
    enabled: true,
    max_latency_ms: 2000,
    timeout_ms: 5000,
    prefer_faster: true
  }
}

Use Cases:

  • Real-time chat applications
  • Interactive UIs requiring instant feedback
  • High-throughput APIs

Distributes requests evenly across providers for load balancing.

llm_config: {
  providers: [
    { provider: "openai", model: "gpt-4" },
    { provider: "anthropic", model: "claude-opus-4" },
    { provider: "azure", model: "gpt-4" }
  ],
  routing_strategy: "round_robin"
}

Use Cases:

  • Load distribution across multiple API keys
  • Testing and comparison
  • Avoiding rate limits

Combines outputs from multiple models with custom weights.

llm_config: {
  providers: [
    { provider: "openai", model: "gpt-5.2", weight: 0.8 },
    { provider: "anthropic", model: "claude-sonnet-4.5", weight: 0.2 }
  ],
  routing_strategy: "ponderation"
}

See LLM Ponderation section for details.

Queries multiple models and requires agreement.

llm_config: {
  providers: [
    { provider: "openai", model: "gpt-4" },
    { provider: "anthropic", model: "claude-opus-4" },
    { provider: "google", model: "gemini-ultra" }
  ],
  routing_strategy: "consensus",
  aggregation: {
    method: "consensus",
    min_responses: 3,
    selection_criteria: ["agreement_score", "confidence"]
  }
}

Use Cases:

  • High-stakes decisions
  • Verification and validation
  • Reducing hallucination risk

| Strategy       | Cost   | Quality | Speed   | Use Case                    |
| -------------- | ------ | ------- | ------- | --------------------------- |
| Single         | Low    | Varies  | Fast    | Simple applications         |
| Priority-Based | Medium | High    | Fast    | General purpose             |
| Specialized    | Medium | High    | Fast    | Task-specific workflows     |
| Cost-Optimized | Low    | Medium  | Fast    | High-volume, budget-limited |
| Quality-First  | High   | Highest | Medium  | Critical applications       |
| Latency-First  | Low    | Medium  | Fastest | Real-time interactive       |
| Round-Robin    | Medium | Varies  | Fast    | Load balancing              |
| Ponderation    | High   | Highest | Slow    | Ensemble quality            |
| Consensus      | High   | Highest | Slowest | Verification, high-stakes   |
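
The strategy is not fixed forever: the API reference at the end of this page lists setRoutingStrategy(), so an application could, for example, relax from quality_first to cost_optimized when traffic spikes. A hedged sketch of that idea (the load metric and thresholds are hypothetical):

// Hypothetical helper: switch routing strategy based on current request volume.
// `requestsPerMinute` stands in for whatever load metric the application already tracks.
function adjustStrategy(
  orchestrator: { setRoutingStrategy(strategy: string): void },
  requestsPerMinute: number
): void {
  if (requestsPerMinute > 500) {
    orchestrator.setRoutingStrategy("cost_optimized"); // protect the budget under heavy load
  } else {
    orchestrator.setRoutingStrategy("quality_first");  // default to best quality when quiet
  }
}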

When using multiple LLM providers simultaneously, ARAL can aggregate responses using various strategies.

Returns the first successful response.

aggregation: {
  method: "first"
}

Queries multiple models and selects the best response based on criteria.

aggregation: {
  method: "best_of_n",
  selection_criteria: ["creativity", "coherence", "engagement", "originality"],
  min_responses: 2,
  max_responses: 3
}

Example: Creative Specialist Persona

{
  "llm_config": {
    "providers": [
      { "provider": "anthropic", "model": "claude-opus-4", "weight": 0.5 },
      { "provider": "openai", "model": "gpt-4", "weight": 0.3 },
      { "provider": "anthropic", "model": "claude-sonnet-4", "weight": 0.2 }
    ],
    "routing_strategy": "specialized",
    "aggregation": {
      "method": "best_of_n",
      "selection_criteria": [
        "creativity",
        "coherence",
        "originality",
        "engagement"
      ],
      "min_responses": 2,
      "max_responses": 3
    }
  }
}

Selection Criteria Options:

  • accuracy - Factual correctness
  • creativity - Originality and novelty
  • coherence - Logical flow and structure
  • relevance - Alignment with prompt
  • engagement - Reader interest and appeal
  • brevity - Conciseness
  • detail - Comprehensiveness
  • clarity - Ease of understanding
  • technical_depth - Expert-level detail
  • originality - Unique perspectives
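
A useful mental model for best_of_n: score each candidate response per criterion, average the scores, and keep the highest-scoring response. How ARAL actually computes criterion scores is not specified here; the sketch below is purely illustrative:

// Illustrative best-of-N selection: average per-criterion scores and keep the top response.
interface Candidate {
  text: string;
  scores: Record<string, number>; // e.g. { creativity: 0.9, coherence: 0.8 }
}

function bestOfN(candidates: Candidate[], criteria: string[]): Candidate {
  const avg = (c: Candidate) =>
    criteria.reduce((sum, k) => sum + (c.scores[k] ?? 0), 0) / criteria.length;
  return candidates.reduce((best, c) => (avg(c) > avg(best) ? c : best));
}

const winner = bestOfN(
  [
    { text: "Draft A", scores: { creativity: 0.9, coherence: 0.7 } },
    { text: "Draft B", scores: { creativity: 0.8, coherence: 0.9 } },
  ],
  ["creativity", "coherence"]
);
console.log(winner.text); // "Draft B" (average 0.85 vs 0.80)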

Combines responses using provider weights.

aggregation: {
  method: "weighted_blend",
  selection_criteria: ["quality", "relevance"]
}

Uses the weight property from each provider configuration.

Selects the response supported by the majority of models.

aggregation: {
  method: "majority_vote",
  min_responses: 3
}

Requires high agreement between all models.

aggregation: {
  method: "consensus",
  selection_criteria: ["agreement_score", "confidence"],
  min_responses: 3
}

Combines strengths of multiple responses into a synthesized output.

aggregation: {
  method: "ensemble",
  selection_criteria: ["accuracy", "creativity", "completeness"],
  min_responses: 2,
  max_responses: 4
}

| Method         | Speed   | Cost    | Quality | Use Case                       |
| -------------- | ------- | ------- | ------- | ------------------------------ |
| First          | Fastest | Low     | Varies  | Simple, cost-effective         |
| Best-of-N      | Slow    | High    | High    | Quality-critical, creative     |
| Weighted Blend | Slow    | High    | High    | Balanced ensemble              |
| Majority Vote  | Medium  | High    | Medium  | Objective decisions            |
| Consensus      | Slowest | Highest | Highest | Mission-critical, verification |
| Ensemble       | Slowest | Highest | Highest | Comprehensive analysis         |

ARAL v1.2+ supports customizing prompts per LLM model, enabling specialized instructions that leverage each model’s unique strengths.

{
  "prompts": {
    "system": "You are a Creative Specialist...",
    "prefix": "Creative task:\n\n",
    "llm_specific": {
      "claude-opus-4": {
        "system": "You excel at deep, nuanced creative work with rich narrative exploration.",
        "prefix": "For this creative challenge, bring your deepest creative capabilities:\n\n"
      },
      "gpt-4": {
        "system": "You excel at structured creativity with clear organization.",
        "prefix": "Creative task requiring structured innovation:\n\n"
      },
      "claude-sonnet-4": {
        "system": "You provide balanced creative refinement, enhancing coherence and polish.",
        "prefix": "Creative refinement task:\n\n"
      }
    }
  }
}

1. Leveraging Model Strengths

// Claude Opus: Deep creative exploration
"claude-opus-4": {
  "system": "Focus on deep creative exploration with vivid imagery and emotional resonance."
}

// GPT-4: Structured problem-solving
"gpt-4": {
  "system": "Focus on clear structure, logical organization, and actionable insights."
}

// Claude Sonnet: Balance and refinement
"claude-sonnet-4": {
  "system": "Focus on balance, coherence, and polishing for professional quality."
}

2. Task-Specific Specialization

{
  "extensions": {
    "routing_rules": {
      "brainstorming": "claude-opus-4",
      "structuring": "gpt-4",
      "refinement": "claude-sonnet-4"
    }
  },
  "prompts": {
    "llm_specific": {
      "claude-opus-4": {
        "prefix": "BRAINSTORM MODE: Generate wild, creative ideas without constraints.\n\n"
      },
      "gpt-4": {
        "prefix": "STRUCTURE MODE: Organize and structure the content logically.\n\n"
      },
      "claude-sonnet-4": {
        "prefix": "REFINEMENT MODE: Polish and enhance for clarity and quality.\n\n"
      }
    }
  }
}

3. Output Format Customization

{
  "llm_specific": {
    "gpt-4": {
      "suffix": "\n\nProvide response in JSON format."
    },
    "claude-opus-4": {
      "suffix": "\n\nProvide response in markdown with rich formatting."
    }
  }
}


Ponderation combines outputs from multiple LLMs with custom weights, creating ensemble responses that leverage each model’s strengths.

Example: GPT-5.2 at weight 0.8 + Claude Sonnet 4.5 at weight 0.2

  • 80% weight to GPT-5.2 (strong reasoning)
  • 20% weight to Claude Sonnet 4.5 (creativity, nuance)

const orchestrator = new LLMOrchestrator({
  mode: "ponderation",
  ponderation: {
    enabled: true,
    weights: {
      "gpt-5.2": 0.8,
      "claude-sonnet-4.5": 0.2,
    },
    merge_strategy: "weighted_blend",
    normalize: true, // Auto-normalize weights to sum to 1.0
  },
});

const result = await orchestrator.complete({
  prompt: "Explain quantum computing in simple terms",
});
// Result: 80% GPT-5.2 + 20% Claude Sonnet 4.5 blended response

from aral import LLMOrchestrator, PonderationConfig

orchestrator = LLMOrchestrator(
    mode='ponderation',
    ponderation=PonderationConfig(
        enabled=True,
        weights={
            'gpt-5.2': 0.8,
            'claude-sonnet-4.5': 0.2
        },
        merge_strategy='weighted_blend',
        normalize=True
    )
)

response = await orchestrator.complete(
    "Explain quantum computing in simple terms"
)
# Response combines:
# - GPT-5.2's technical accuracy (80%)
# - Claude's clear explanations (20%)

Different weights for different task types:

const orchestrator = new LLMOrchestrator({
  mode: "ponderation",
  adaptive_weighting: {
    code_generation: {
      "gpt-4": 0.7,
      "claude-opus": 0.2,
      codellama: 0.1,
    },
    creative_writing: {
      "claude-opus": 0.6,
      "gpt-4": 0.3,
      "mistral-large": 0.1,
    },
    data_analysis: {
      "gpt-4": 0.5,
      "claude-sonnet": 0.3,
      "gemini-pro": 0.2,
    },
  },
});

// Automatically selects weights based on detected task type
const result = await orchestrator.complete({
  prompt: "Write a Python function to analyze sales data",
  task_type: "code_generation", // Uses code_generation weights
});

Adjust weights based on model confidence scores:

const orchestrator = new LLMOrchestrator({
  mode: "ponderation",
  dynamic_weighting: {
    enabled: true,
    base_weights: {
      "gpt-5.2": 0.5,
      "claude-sonnet-4.5": 0.5,
    },
    confidence_threshold: 0.8,
    adjustment_factor: 0.2, // Boost weight by 20% if confidence > threshold
  },
});
// If GPT-5.2 has confidence 0.95 and Claude has 0.75:
// GPT-5.2: 0.5 + (0.2 * 0.5) = 0.6
// Claude: 0.5 - (0.2 * 0.5) = 0.4
// (Automatically normalized to sum to 1.0)
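
The arithmetic in the comments above can be written out directly: a model above the confidence threshold gains adjustment_factor × its base weight, the other loses the same amount, and the result is renormalized. A standalone sketch of that calculation, assuming exactly two models (illustrative):

// Confidence-based adjustment for two models (illustrative).
function adjustWeights(
  base: { [model: string]: number },       // e.g. { "gpt-5.2": 0.5, "claude-sonnet-4.5": 0.5 }
  confidence: { [model: string]: number }, // e.g. { "gpt-5.2": 0.95, "claude-sonnet-4.5": 0.75 }
  confidenceThreshold = 0.8,
  adjustmentFactor = 0.2
): { [model: string]: number } {
  const adjusted: { [model: string]: number } = {};
  for (const model of Object.keys(base)) {
    const delta = adjustmentFactor * base[model];
    // Boost models above the threshold, penalize those below it.
    adjusted[model] = confidence[model] >= confidenceThreshold ? base[model] + delta : base[model] - delta;
  }
  // Renormalize so the weights sum to 1.0.
  const total = Object.values(adjusted).reduce((a, b) => a + b, 0);
  for (const model of Object.keys(adjusted)) adjusted[model] /= total;
  return adjusted;
}

console.log(
  adjustWeights({ "gpt-5.2": 0.5, "claude-sonnet-4.5": 0.5 }, { "gpt-5.2": 0.95, "claude-sonnet-4.5": 0.75 })
);
// ≈ { "gpt-5.2": 0.6, "claude-sonnet-4.5": 0.4 }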

Different weights for different stages of processing:

const pipeline = new LLMPipeline([
  {
    stage: "research",
    ponderation: {
      "gpt-4": 0.4,
      "claude-opus": 0.4,
      "gemini-pro": 0.2,
    },
  },
  {
    stage: "analysis",
    ponderation: {
      "gpt-5.2": 0.7,
      "claude-sonnet": 0.3,
    },
  },
  {
    stage: "synthesis",
    ponderation: {
      "claude-opus": 0.6,
      "gpt-4": 0.4,
    },
  },
]);

const result = await pipeline.execute(inputData);

| Strategy            | Description                        | Use Case               |
| ------------------- | ---------------------------------- | ---------------------- |
| weighted_blend      | Combine text outputs with weights  | General responses      |
| weighted_voting     | Vote on discrete choices           | Classification, yes/no |
| weighted_average    | Average numerical outputs          | Scoring, predictions   |
| confidence_weighted | Use confidence scores as weights   | Dynamic blending       |
| best_of_ensemble    | Select highest-confidence response | Quality filtering      |

// Query both models
const gptResponse = await gpt52.complete(prompt);
const claudeResponse = await claudeSonnet.complete(prompt);

// Blend with 80/20 weights
const blended = weightedBlend([
  { response: gptResponse, weight: 0.8 },
  { response: claudeResponse, weight: 0.2 },
]);

console.log(blended.text);
// Output combines 80% GPT + 20% Claude
// Preserves style/tone weighted toward GPT
// Incorporates Claude's insights at 20%

// 70% accuracy-focused, 30% creativity-focused
ponderation: {
  'gpt-4': 0.7,       // Strong reasoning
  'claude-opus': 0.3  // Creative insights
}

// 90% cheap model, 10% expensive for quality control
ponderation: {
  'gpt-3.5-turbo': 0.9, // Cheap
  'gpt-4': 0.1          // Quality check
}

// Medical diagnosis: blend specialist models
ponderation: {
  'medical-llm': 0.6,  // Domain-specific
  'gpt-4': 0.3,        // General reasoning
  'claude-opus': 0.1   // Safety check
}

// Translation with native speaker quality
ponderation: {
  'gpt-4': 0.5,                  // Technical accuracy
  'claude-multilingual': 0.3,    // Fluency
  'specialized-translator': 0.2  // Domain terms
}

Pros:

  • ✅ Better quality than single model
  • ✅ Reduces individual model biases
  • ✅ Leverages complementary strengths
  • ✅ More robust than single-model dependency

Cons:

  • ❌ Higher latency (queries multiple models)
  • ❌ Increased cost (multiple API calls)
  • ❌ Complex error handling

Optimization:

// Parallel execution for speed
const orchestrator = new LLMOrchestrator({
  ponderation: {
    weights: { "gpt-5.2": 0.8, "claude-sonnet": 0.2 },
    execution: "parallel", // Query simultaneously
    timeout_ms: 30000,
    cache_enabled: true, // Cache repeated queries
  },
});

// ARAL automatically validates and normalizes weights
const orchestrator = new LLMOrchestrator({
  ponderation: {
    weights: {
      "gpt-5.2": 4, // Raw weight
      "claude-sonnet": 1, // Raw weight
    },
    normalize: true, // Auto-converts to 0.8 and 0.2
  },
});
// Validation rules:
// - Weights must be positive
// - Sum must be > 0
// - Auto-normalized to sum to 1.0
// - Warning if imbalance > 0.95 (one model dominates)
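
Normalization itself is just dividing each raw weight by their sum, plus the sanity checks listed above. A minimal sketch (illustrative, not the SDK's validator):

// Normalize raw weights to sum to 1.0, mirroring the validation rules above (illustrative).
function normalizeWeights(raw: Record<string, number>): Record<string, number> {
  const entries = Object.entries(raw);
  if (entries.some(([, w]) => w <= 0)) throw new Error("Weights must be positive");
  const total = entries.reduce((sum, [, w]) => sum + w, 0);
  if (total <= 0) throw new Error("Weight sum must be > 0");

  const normalized = Object.fromEntries(entries.map(([k, w]) => [k, w / total]));
  if (Math.max(...Object.values(normalized)) > 0.95) {
    console.warn("One model dominates the blend (weight > 0.95)");
  }
  return normalized;
}

console.log(normalizeWeights({ "gpt-5.2": 4, "claude-sonnet": 1 }));
// { "gpt-5.2": 0.8, "claude-sonnet": 0.2 }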

const orchestrator = new LLMOrchestrator({
  moderation: {
    enabled: true,
    provider: 'openai', // Use OpenAI moderation API
    input_checks: {
      hate_speech: true,
      sexual_content: true,
      violence: true,
      self_harm: true,
      pii_detection: true
    },
    output_checks: {
      factual_accuracy: false, // Requires external fact-checker
      bias_detection: true,
      toxicity_threshold: 0.7
    },
    actions: {
      block_on_violation: true,
      log_violations: true,
      notify_admin: true
    }
  }
});

// Example: Blocked harmful request
try {
  await orchestrator.complete("How to make explosives");
} catch (e) {
  if (e instanceof ModerationViolationError) {
    console.log(`Blocked: ${e.category} (score: ${e.confidence})`);
    // Blocked: violence/dangerous-content (score: 0.95)
  }
}

from aral import LLMOrchestrator, Guardrail, GuardrailResult

class PIIGuardrail(Guardrail):
    """Detect and block personally identifiable information."""

    def check_input(self, text: str) -> GuardrailResult:
        # Check for SSN, credit cards, etc.
        if self.contains_pii(text):
            return GuardrailResult(
                blocked=True,
                reason="PII detected in input",
                violations=["ssn", "credit_card"]
            )
        return GuardrailResult(blocked=False)

    def check_output(self, text: str) -> GuardrailResult:
        # Redact any PII in output
        if self.contains_pii(text):
            return GuardrailResult(
                blocked=False,
                modified=True,
                text=self.redact_pii(text)
            )
        return GuardrailResult(blocked=False)

orchestrator = LLMOrchestrator(
    guardrails=[
        PIIGuardrail(),
        ToxicityGuardrail(threshold=0.8),
        BiasDetectionGuardrail()
    ]
)

const orchestrator = new LLMOrchestrator({
  tracking: {
    enabled: true,
    metrics: ["cost", "latency", "tokens", "errors"],
  },
});

// After some usage
const stats = await orchestrator.getStats();
console.log(stats);
// {
//   openai: {
//     requests: 150,
//     total_cost: 4.52,
//     avg_latency_ms: 1100,
//     tokens_used: 150000,
//     errors: 2
//   },
//   anthropic: {
//     requests: 50,
//     total_cost: 1.23,
//     avg_latency_ms: 850,
//     tokens_used: 82000,
//     errors: 0
//   }
// }

const orchestrator = new LLMOrchestrator({
  budget: {
    daily_limit: 100.0, // $100/day
    per_request_limit: 0.5, // $0.50/request max
    alert_threshold: 0.8, // Alert at 80% of budget
    action_on_exceed: "switch_to_cheapest", // or 'block'
  },
});

// When approaching limit:
orchestrator.on("budget_alert", (event) => {
  console.log(`Budget at ${event.percentage}% - switching to cheaper models`);
});
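
Conceptually, budget enforcement is a running daily total compared against the limit: an alert fires once spend crosses alert_threshold × daily_limit, and action_on_exceed applies once the limit itself is hit. The sketch below mirrors that bookkeeping outside the SDK (field names follow the config above; the implementation is illustrative):

// Minimal daily budget tracker (illustrative).
class BudgetTracker {
  private spentToday = 0;

  constructor(
    private dailyLimit: number,    // e.g. 100.0 ($/day)
    private alertThreshold: number // e.g. 0.8 (alert at 80%)
  ) {}

  record(cost: number): "ok" | "alert" | "exceeded" {
    this.spentToday += cost;
    if (this.spentToday >= this.dailyLimit) return "exceeded"; // apply action_on_exceed
    if (this.spentToday >= this.dailyLimit * this.alertThreshold) return "alert"; // fire budget_alert
    return "ok";
  }

  resetDay(): void {
    this.spentToday = 0;
  }
}

const budget = new BudgetTracker(100.0, 0.8);
console.log(budget.record(85.0)); // "alert" → switch to cheaper models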

Use different models for different stages:

const pipeline = new LLMPipeline([
  {
    stage: "data_collection",
    provider: "gpt-3.5", // Fast and cheap for simple tasks
    prompt: "Extract key facts from: {{input}}",
  },
  {
    stage: "analysis",
    provider: "gpt-4", // High quality for analysis
    prompt: "Analyze these facts: {{previous_output}}",
  },
  {
    stage: "report_writing",
    provider: "claude-opus", // Best for creative writing
    prompt: "Write a comprehensive report: {{previous_output}}",
  },
]);

const result = await pipeline.execute(inputData);

Query multiple models simultaneously:

const orchestrator = new LLMOrchestrator({
  mode: "parallel",
  providers: ["gpt-4", "claude-opus", "gemini-pro"],
});

const results = await orchestrator.completeParallel({
  prompt: "What are the top 3 risks in this contract?",
  aggregation: "merge_unique", // Combine unique insights
});
// Result contains insights from all 3 models

Compare model performance:

const orchestrator = new LLMOrchestrator({
  ab_testing: {
    enabled: true,
    variants: [
      { name: "control", provider: "gpt-4", traffic: 0.5 },
      { name: "variant", provider: "claude-opus", traffic: 0.5 },
    ],
    metrics: ["quality", "latency", "cost", "user_satisfaction"],
  },
});
// Automatically routes traffic and collects metrics

// ✅ Good: Always have fallback
fallback_chain: ["primary", "secondary", "tertiary"];
// ❌ Bad: Single point of failure
providers: ["gpt-4"]; // What if OpenAI is down?

// ✅ Good: Protect users and comply with regulations
moderation: { enabled: true, provider: 'openai' }
// ❌ Bad: No safety checks
moderation: { enabled: false } // Risky!

// ✅ Good: Set budget limits
budget: { daily_limit: 100.00, alert_threshold: 0.80 }
// ❌ Bad: Unlimited spending
// Could lead to unexpected bills

// ✅ Good: Ensure minimum quality
routing_strategy: 'cost_optimized',
quality_threshold: 0.85
// ❌ Bad: Cheapest without quality check
routing_strategy: 'cost_optimized' // Might use low-quality models

// ✅ Good: Use environment variables or key vaults
apiKey: process.env.OPENAI_API_KEY;
// ❌ Bad: Hardcoded keys
apiKey: "sk-1234..."; // Security risk!

// ✅ Good: Use providers with data protection
providers: [
  {
    name: "azure",
    data_residency: "eu-west", // GDPR compliant
    no_training: true, // Don't use for model training
  },
];

orchestrator.on("request", (event) => {
  auditLog.record({
    timestamp: event.timestamp,
    provider: event.provider,
    model: event.model,
    input_hash: hash(event.input), // Don't log sensitive data
    user_id: event.user_id,
    cost: event.cost,
  });
});

Multi-LLM orchestration complies with:

  • ARAL-CORE-1.0 (L4: Reasoning, requirements L4-013 to L4-020)
  • GDPR: Data residency, right to erasure
  • SOC 2: Audit logging, access controls
  • HIPAA: PHI protection, moderation
  • ISO 27001: Security controls

class LLMOrchestrator {
  constructor(config: OrchestratorConfig);

  async complete(
    prompt: string,
    options?: CompletionOptions
  ): Promise<LLMResponse>;

  async completeParallel(
    prompt: string,
    options?: ParallelOptions
  ): Promise<LLMResponse[]>;

  async completeWithConsensus(
    prompt: string,
    options?: ConsensusOptions
  ): Promise<ConsensusResult>;

  getStats(): ProviderStats;
  getProvider(name: string): LLMProvider;
  setRoutingStrategy(strategy: RoutingStrategy): void;
  on(event: OrchestratorEvent, handler: EventHandler): void;
}
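
A short usage sketch based on the signatures above (the configuration object is abbreviated, and the consensus result fields follow the comments in the consensus example earlier on this page):

// Illustrative usage of the orchestrator API listed above.
const orchestrator = new LLMOrchestrator({ /* providers, routing_strategy, ... */ });

// Standard completion with the configured routing strategy.
const answer = await orchestrator.complete("Summarize this contract clause.");

// Consensus completion for a high-stakes question.
const verdict = await orchestrator.completeWithConsensus("Is this clause enforceable?");
console.log(verdict.consensus_reached, verdict.agreement_score);

// Switch strategies and inspect per-provider usage at runtime.
orchestrator.setRoutingStrategy("cost_optimized");
console.log(orchestrator.getStats());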

Yes, ARAL fully supports multi-LLM orchestration and moderation!

Key capabilities:

  • ✅ Multiple provider support (OpenAI, Anthropic, Azure, etc.)
  • ✅ Intelligent routing (cost, quality, latency)
  • ✅ Automatic fallback chains
  • ✅ Load balancing
  • ✅ Consensus mode for critical decisions
  • ✅ Content moderation and safety guardrails
  • ✅ Cost tracking and budget limits
  • ✅ A/B testing and performance monitoring

Requirements: ARAL v1.2.0+ with Layer 4 (Reasoning) enhancements.



© 2026 ARAL Standard — CC BY 4.0