Research topic
Nov. 2025 - present
Design and analysis of multi-agent AI systems
This page covers my current research on building, operating, and evaluating LLM-based multi-agent systems. The central question is not whether agents can produce correct outputs in a demo, but whether a multi-agent architecture can be designed so that its failure modes are known, measurable, and addressable.
See also: timeline view for reliable agentic AI, with adjacent tracks for cloud security, network validation, and earlier realtime work.
Working thesis
LLM-based multi-agent systems fail more often than single-agent ones, not because the models are weaker but because coordination introduces a new class of failures: silent cognitive failures. In a silent cognitive failure, an agent produces a plausible output that is epistemically wrong. The action fires and the task metric succeeds, but the reasoning behind it was unsupported, contradicted, or incomplete. These failures are invisible to standard benchmarks, which measure only task success.
The goal is to make those failures visible, measurable, and structurally addressable.
Lines of work
MAS Framework
A Python framework for building, orchestrating, and evaluating LLM-based multi-agent systems. Three core packages:
- agent-runtime: single-agent lifecycle, contracts, design-pattern plugins, execution hooks. Declarative manifests (YAML) as the source of truth for every component.
- mas-ctl: multi-agent topology, flavour selection, inter-agent policies, and MAS execution. Agents are composed declaratively; the runtime enforces contracts and emits structured telemetry.
- mas-lab: evaluation lab, benchmarking infrastructure, IoC scoring, knowledge graph tooling, and a demo UI. The platform used to run experiments and produce interpretable traces.
Concrete applications built on the framework include Claris (a trip-planner MAS used as a tractable reference application) and an SRE-triage agent (incident triage scenario used as the primary evaluation target).
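To make the idea of a declarative manifest as the source of truth more concrete, here is a minimal sketch of how a parsed manifest could be turned into a typed agent spec whose contract is checked at runtime. The manifest fields (`kind`, `pattern`, `contract`, `hooks`) and the names `AgentSpec`, `Contract`, and `load_agent_spec` are illustrative assumptions, not the actual agent-runtime schema or API. The manifest is shown as the dict a YAML loader would produce, so the example has no external dependencies.

```python
from dataclasses import dataclass

# Hypothetical manifest, shown as the dict a YAML loader would return.
# All field names are assumptions made for this sketch.
MANIFEST = {
    "kind": "Agent",
    "name": "sre-triage",
    "pattern": "react",
    "contract": {"inputs": ["incident"], "outputs": ["diagnosis", "evidence"]},
    "hooks": ["telemetry"],
}


@dataclass(frozen=True)
class Contract:
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]

    def check_output(self, result: dict) -> list[str]:
        """Return the declared output fields that are missing from result."""
        return [name for name in self.outputs if name not in result]


@dataclass(frozen=True)
class AgentSpec:
    name: str
    pattern: str
    contract: Contract
    hooks: tuple[str, ...] = ()


def load_agent_spec(manifest: dict) -> AgentSpec:
    """Build a typed spec from a parsed manifest, rejecting unknown kinds."""
    if manifest.get("kind") != "Agent":
        raise ValueError(f"unsupported manifest kind: {manifest.get('kind')!r}")
    contract = manifest["contract"]
    return AgentSpec(
        name=manifest["name"],
        pattern=manifest["pattern"],
        contract=Contract(tuple(contract["inputs"]), tuple(contract["outputs"])),
        hooks=tuple(manifest.get("hooks", ())),
    )


spec = load_agent_spec(MANIFEST)
# An agent result that omits the declared "evidence" field violates the contract.
missing = spec.contract.check_output({"diagnosis": "disk pressure on node-3"})
print(missing)  # ['evidence']
```

The point of the sketch is the direction of dependency: the runtime derives what to enforce from the manifest, rather than each agent encoding its own checks.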
Internet of Cognition (IoC)
Outshift’s measurement and analysis program for multi-agent cognitive failures.
Seven challenge types (C1–C7) cover: common ground deficits, ontology mismatch, knowledge asymmetry between agents, unsupported assertions, communication failures, incomplete verification, and unverified action execution. Five outcome metrics (O1–O5) cover task success, efficiency, and evidence quality.
The IoC program defines how to detect each challenge type from OpenTelemetry traces, how to attribute them causally to architectural properties, and how to distinguish mitigation strategies using Shapley analysis across 2³ overlay combinations. A companion formal theory grounds the C1–C7 taxonomy in a Mealy machine model of agent behavior.
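As an illustration of Shapley analysis over 2³ overlay combinations, the following sketch computes exact Shapley values for three overlays from a table with one score per subset. The overlay names and all scores are toy values invented for this example; they are not IoC results, and the actual analysis may define the value function differently.

```python
from itertools import combinations
from math import factorial

# Hypothetical overlay names (assumptions for this sketch).
OVERLAYS = ("grounding", "verification", "shared_ontology")


def subset(*names):
    return frozenset(names)


# Toy value function: one score per overlay combination (2^3 = 8 entries).
# These numbers are made up for illustration, not measured.
VALUE = {
    subset(): 0.00,
    subset("grounding"): 0.10,
    subset("verification"): 0.20,
    subset("shared_ontology"): 0.00,
    subset("grounding", "verification"): 0.40,
    subset("grounding", "shared_ontology"): 0.10,
    subset("verification", "shared_ontology"): 0.20,
    subset("grounding", "verification", "shared_ontology"): 0.40,
}


def shapley(players, value):
    """Exact Shapley values: weighted average marginal contribution of each
    player over all subsets of the other players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition joined by p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (value[s | {p}] - value[s])
        phi[p] = total
    return phi


phi = shapley(OVERLAYS, VALUE)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.3f}")
```

With three overlays the exact computation needs only the eight measured combinations, so no sampling approximation is required. In the toy table, `shared_ontology` never changes the score, so its attribution is zero, and the three attributions sum to the difference between the all-overlays score and the no-overlay score.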
Measurement and interpretability platform
mas-lab is the platform that makes the above measurable in practice:
- IoC evaluation plugin: computes challenge and outcome metrics from traces automatically
- Trajectory metrics: semiring over (prompt, response) sequences enabling RCA, Shapley attribution, and semantic drift detection
- Knowledge graph backend: Neo4j-backed representation of agent interactions, tool calls, and evidence chains
- Benchmark infrastructure: reproducible overlays, fixture-based scenarios, and multi-run statistical summaries
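As a rough sketch of what computing a challenge metric from traces can look like, the example below flags actions that were executed without a prior verification of the same target, in the spirit of the unverified action execution challenge. The `Span` record only mimics the shape of an OpenTelemetry span; the span names `verify` and `action`, the `target` field, and the detection rule itself are assumptions made for illustration, not the IoC plugin's actual logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Simplified stand-in for a trace span (hypothetical fields)."""
    name: str    # "verify" or "action" under this sketch's naming convention
    agent: str
    target: str  # resource the span refers to
    start: int   # timestamps in arbitrary integer units
    end: int


def unverified_actions(spans):
    """Return action spans with no verification of the same target
    that completed before the action started."""
    verifications = [s for s in spans if s.name == "verify"]
    flagged = []
    for span in spans:
        if span.name != "action":
            continue
        verified = any(
            v.target == span.target and v.end <= span.start
            for v in verifications
        )
        if not verified:
            flagged.append(span)
    return flagged


trace = [
    Span("verify", "triage", "db-1", 0, 2),
    Span("action", "remediator", "db-1", 3, 5),
    Span("action", "remediator", "cache-2", 6, 7),  # acted before verifying
    Span("verify", "triage", "cache-2", 8, 9),
]

flagged = unverified_actions(trace)
actions = [s for s in trace if s.name == "action"]
rate = len(flagged) / len(actions)
print([s.target for s in flagged], rate)  # ['cache-2'] 0.5
```

Both actions in this trace could succeed at the task level, which is why a rule of this kind has to look at ordering and evidence in the trace rather than at the task outcome.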
Formal foundations
The system is modeled as a Mealy machine. Formal theory tracks cover:
- Liveness and deadlock (Petri net extension for concurrent agents)
- Hook algebra: plugin composition as a non-commutative monoid; phase-disjoint plugins commute
- Simulation relation: the spec machine simulates the controlled machine; flavour invariance follows as a corollary
- Trajectory metric semiring: RCA and Shapley as operations on the semiring
- Context management: working memory as an orthogonal Mealy machine
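The hook algebra item can be illustrated with a small sketch in which hooks are functions on an execution context and composition is the monoid product. The phase names, the `on_phase` helper, and the three example hooks are hypothetical; the sketch only demonstrates the algebraic claims (identity, non-commutativity in general, commutation for phase-disjoint hooks) and is not the framework's plugin API.

```python
from functools import reduce


def compose(*hooks):
    """Monoid product: apply hooks left to right. compose() is the identity."""
    return lambda ctx: reduce(lambda acc, hook: hook(acc), hooks, ctx)


def on_phase(phase, fn):
    """Lift fn to a hook that rewrites only ctx[phase] (hypothetical helper)."""
    def hook(ctx):
        return {**ctx, phase: fn(ctx[phase])}
    return hook


# Two hooks on the same phase, and one on a different phase.
redact = on_phase("pre_prompt", lambda s: s.replace("secret", "***"))
append_tag = on_phase("pre_prompt", lambda s: s + " secret-tag")
trim = on_phase("post_response", str.strip)

ctx = {"pre_prompt": "secret plan", "post_response": "  ok  "}

# Same phase: order matters, so composition is not commutative in general.
same_phase_ab = compose(redact, append_tag)(ctx)
same_phase_ba = compose(append_tag, redact)(ctx)
print(same_phase_ab["pre_prompt"])  # *** plan secret-tag
print(same_phase_ba["pre_prompt"])  # *** plan ***-tag

# Disjoint phases: each hook touches only its own key, so they commute.
disjoint_ab = compose(redact, trim)(ctx)
disjoint_ba = compose(trim, redact)(ctx)
print(disjoint_ab == disjoint_ba)  # True
```

Associativity comes for free from function composition, and the empty composition returns the context unchanged, which gives the monoid structure the list item refers to.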
Relationship with earlier work
This direction builds directly on the earlier work on agentic network change validation (Swisscom deployments, Aether MAS) and on cloud security (Panoptica, CNAPP). Those tracks raised the same question from a product angle: how do you build agentic workflows that remain grounded, interpretable, and trustworthy when the system state is partial, heterogeneous, and fast-moving? The current work addresses that question as a research program rather than a single product pipeline.
Publications and outputs
A paper targeting CAIS 2026 is in preparation, covering the MAS Framework design, the IoC taxonomy, and experimental results from the SRE-triage scenario.
Public outputs from the adjacent network validation case study:
- Outshift and Swisscom: Building Agentic AI Networks
- Cisco Live US 2025: Agentic AI and NetDevOps Session
- AI & Enterprise Workflow Forum 2025: Multi-Agent Systems and Knowledge Graphs
- Swisscom Whitepaper: Production Deployment Results
- Cisco Live EMEA 2026: Agentic Observability and Evaluation
Public material
- Recent project updates: news section
- Broader research context: research homepage
- Historical publications: publications page
- Public video: AIEWF — Multi-Agent AI and Network Knowledge Graphs
- Public video: Galileo — Architecting Reliable Agentic AI | Giovanna Carofiglio
- Public video: Cisco Live EMEA 2026 — Agentic Observability and Evaluation
Related artifacts
- 2026-03-24 video: aether, agentic-network-change-validation, realtime-experience-observability
- 2025-08-22 video: aether, agentic-network-change-validation, realtime-experience-observability
- 2026-03-25 video: agentic-network-change-validation, realtime-experience-observability
- 2025-07-09 video: agentic-network-change-validation, realtime-experience-observability