Design and evaluation of robust agentic AI applications

Research topic

Nov. 2025 - present

Design and analysis of multi-agent AI systems

This page covers my current research on building, operating, and evaluating LLM-based multi-agent systems. The central question is not whether agents can produce correct outputs in a demo, but whether a multi-agent architecture can be designed so that its failure modes are known, measurable, and addressable.

See also: timeline view for reliable agentic AI, with adjacent tracks for cloud security, network validation, and earlier realtime work.

Working thesis

LLM-based multi-agent systems fail more often than single-agent ones, not because the models are weaker, but because coordination introduces a new class of failures: silent cognitive failures. An agent produces a plausible output that is epistemically wrong. The action fires and the task metric succeeds, but the reasoning behind it was unsupported, contradicted, or incomplete. Standard benchmarks that measure only task success cannot see these failures.

The goal is to make those failures visible, measurable, and structurally addressable.

Lines of work

MAS Framework

A Python framework for building, orchestrating, and evaluating LLM-based multi-agent systems. Three core packages:

agent-runtime: single-agent lifecycle, contracts, design-pattern plugins, execution hooks. Declarative manifests (YAML) as the source of truth for every component.
mas-ctl: multi-agent topology, flavour selection, inter-agent policies, and MAS execution. Agents are composed declaratively; the runtime enforces contracts and emits structured telemetry.
mas-lab: evaluation lab, benchmarking infrastructure, IoC scoring, knowledge graph tooling, and a demo UI. The platform used to run experiments and produce interpretable traces.
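To make the idea of a declarative manifest with an enforced contract concrete, here is a minimal sketch. It does not reproduce the actual agent-runtime schema: the class AgentManifest, the function check_contract, and the manifest keys (name, outputs) are hypothetical names chosen for illustration, and a plain dict stands in for a parsed YAML manifest.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical mapping from type names used in a manifest to Python types.
TYPE_NAMES = {"str": str, "int": int, "float": float, "list": list}


@dataclass(frozen=True)
class AgentManifest:
    """Illustrative manifest: an agent name plus the output fields it promises."""
    name: str
    required_outputs: Dict[str, type]

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "AgentManifest":
        outputs = {key: TYPE_NAMES[tname] for key, tname in raw["outputs"].items()}
        return cls(name=raw["name"], required_outputs=outputs)


def check_contract(manifest: AgentManifest, output: Dict[str, Any]) -> List[str]:
    """Return a list of contract violations; an empty list means the output conforms."""
    violations = []
    for key, expected in manifest.required_outputs.items():
        if key not in output:
            violations.append(f"missing field: {key}")
        elif not isinstance(output[key], expected):
            violations.append(f"wrong type for {key}: expected {expected.__name__}")
    return violations


# Stand-in for a manifest loaded from YAML (assumed shape, not the real schema).
raw_manifest = {"name": "triage-agent", "outputs": {"severity": "str", "evidence": "list"}}
manifest = AgentManifest.from_dict(raw_manifest)

# An output that names a severity but carries no evidence violates the contract.
print(check_contract(manifest, {"severity": "high"}))
```

The point of the sketch is that the manifest, not the agent code, states what a valid output looks like, so a runtime can check every output the same way and report violations as structured data.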

Concrete applications built on the framework include Claris (a trip-planner MAS used as a tractable reference application) and an SRE-triage agent (incident triage scenario used as the primary evaluation target).

Internet of Cognition (IoC)

Outshift’s measurement and analysis program for multi-agent cognitive failures.

Seven challenge types (C1–C7) cover: common ground deficits, ontology mismatch, knowledge asymmetry between agents, unsupported assertions, communication failures, incomplete verification, and unverified action execution. Five outcome metrics (O1–O5) cover task success, efficiency, and evidence quality.

The IoC program defines how to detect each challenge type from OpenTelemetry traces, how to attribute them causally to architectural properties, and how to distinguish mitigation strategies using Shapley analysis across 2³ overlay combinations. A companion formal theory grounds the C1–C7 taxonomy in a Mealy machine model of agent behavior.
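The Shapley step can be sketched as follows. With three overlays there are 2³ = 8 on/off combinations, so exact Shapley values can be computed by enumeration. The overlay names and the scores in the value table are placeholders invented for this example, not experimental results, and shapley_values is an illustrative helper rather than the mas-lab implementation.

```python
from itertools import combinations
from math import factorial

# Hypothetical overlay names; the real overlays are not specified here.
OVERLAYS = ("grounding", "verification", "shared_ontology")


def coalition(*names):
    return frozenset(names)


# Placeholder scores for each of the 2^3 overlay combinations (illustrative only).
value = {
    coalition(): 0.50,
    coalition("grounding"): 0.60,
    coalition("verification"): 0.55,
    coalition("shared_ontology"): 0.50,
    coalition("grounding", "verification"): 0.70,
    coalition("grounding", "shared_ontology"): 0.60,
    coalition("verification", "shared_ontology"): 0.55,
    coalition("grounding", "verification", "shared_ontology"): 0.70,
}


def shapley_values(players, value):
    """Exact Shapley values: weighted average marginal contribution of each player."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value[s | {p}] - value[s])
        phi[p] = total
    return phi


phi = shapley_values(OVERLAYS, value)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

In this placeholder table the third overlay never changes the score, so its Shapley value is zero, and the three values sum to the gap between the all-overlays score and the baseline. That additivity is what makes the attribution usable for comparing mitigation strategies.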

Measurement and interpretability platform

mas-lab is the platform that makes the above measurable in practice:

IoC evaluation plugin: computes challenge and outcome metrics from traces automatically
Trajectory metrics: semiring over (prompt, response) sequences enabling root-cause analysis (RCA), Shapley attribution, and semantic drift detection
Knowledge graph backend: Neo4j-backed representation of agent interactions, tool calls, and evidence chains
Benchmark infrastructure: reproducible overlays, fixture-based scenarios, and multi-run statistical summaries
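As an illustration of computing a challenge metric from traces, the sketch below estimates how often an agent asserts something without linked evidence. The spans are plain dicts rather than OpenTelemetry SDK objects, and the attribute names ioc.kind and ioc.evidence_ids are assumptions made for this example, not the attribute schema used by the IoC evaluation plugin.

```python
from typing import Any, Dict, List

Span = Dict[str, Any]


def unsupported_assertion_rate(spans: List[Span]) -> float:
    """Fraction of assertion spans that carry no evidence references."""
    assertions = [s for s in spans if s["attributes"].get("ioc.kind") == "assertion"]
    if not assertions:
        return 0.0
    unsupported = [s for s in assertions if not s["attributes"].get("ioc.evidence_ids")]
    return len(unsupported) / len(assertions)


# A toy trace: three assertions, one of which has no supporting evidence.
spans = [
    {"name": "diagnose", "attributes": {"ioc.kind": "assertion", "ioc.evidence_ids": ["log-17"]}},
    {"name": "query_metrics", "attributes": {"ioc.kind": "tool_call"}},
    {"name": "blame_service", "attributes": {"ioc.kind": "assertion", "ioc.evidence_ids": []}},
    {"name": "summarise", "attributes": {"ioc.kind": "assertion", "ioc.evidence_ids": ["log-17", "metric-3"]}},
]

rate = unsupported_assertion_rate(spans)
print(f"unsupported assertion rate: {rate:.2f}")
```

A run like this can still report task success, which is why the metric is computed from the trace rather than from the final outcome.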

Formal foundations

The system is modeled as a Mealy machine. Formal theory tracks cover:

Liveness and deadlock (Petri net extension for concurrent agents)
Hook algebra: plugin composition as a non-commutative monoid; phase-disjoint plugins commute
Simulation relation: spec machine simulates controlled machine; flavour invariance as a corollary
Trajectory metric semiring: RCA and Shapley as operations on the semiring
Context management: working memory as an orthogonal Mealy machine
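The hook algebra item can be illustrated with a small sketch. It models a plugin hook as a function from state to state, with composition as the monoid operation and the identity hook as the unit. Treating a "phase" as a key in the state dict is an assumption made for this example, as are all the names below; the formal definition in the theory may differ.

```python
from functools import reduce
from typing import Callable, Dict

State = Dict[str, int]
Hook = Callable[[State], State]


def identity(state: State) -> State:
    """Unit of the monoid: leaves the state unchanged."""
    return dict(state)


def compose(first: Hook, second: Hook) -> Hook:
    """Monoid operation: run `first`, then `second`."""
    def run(state: State) -> State:
        return second(first(state))
    return run


def compose_all(*hooks: Hook) -> Hook:
    return reduce(compose, hooks, identity)


# Two hooks that write the same phase key: their order matters.
def add_one(state: State) -> State:
    return {**state, "pre": state.get("pre", 0) + 1}


def double(state: State) -> State:
    return {**state, "pre": state.get("pre", 0) * 2}


# A hook that only touches a different phase key.
def tag_post(state: State) -> State:
    return {**state, "post": state.get("post", 0) + 10}


start = {"pre": 1}
print(compose(add_one, double)(start))    # (1 + 1) * 2 -> pre is 4
print(compose(double, add_one)(start))    # 1 * 2 + 1   -> pre is 3
print(compose(add_one, tag_post)(start) == compose(tag_post, add_one)(start))
```

The first two results differ, which is the non-commutativity; the last comparison holds because the two hooks touch disjoint keys, which is the sense in which phase-disjoint plugins commute in this toy model.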

Relationship with earlier work

This direction builds directly on the earlier work on agentic network change validation (Swisscom deployments, Aether MAS) and on cloud security (Panoptica, CNAPP). Those tracks raised the same question from a product angle: how do you build agentic workflows that remain grounded, interpretable, and trustworthy when the system state is partial, heterogeneous, and fast-moving? The current work addresses that question as a research program rather than a single product pipeline.

Publications and outputs

A paper targeting CAIS 2026 is in preparation, covering the MAS Framework design, the IoC taxonomy, and experimental results from the SRE-triage scenario.

Public outputs from the adjacent network validation case study:

Public material

Recent project updates: news section
Broader research context: research homepage
Historical publications: publications page
Public video: AIEWF — Multi-Agent AI and Network Knowledge Graphs
Public video: Galileo — Architecting Reliable Agentic AI | Giovanna Carofiglio
Public video: Cisco Live EMEA 2026 — Agentic Observability and Evaluation

Related artifacts

AIEWF 2025 Talk — Ola Mabadeje, Cisco
2026-03-24 video aether, agentic-network-change-validation, realtime-experience-observability

Multi Agent AI and Network Knowledge Graphs for Change — Ola Mabadeje, Cisco
2025-08-22 video aether, agentic-network-change-validation, realtime-experience-observability

Agentic Observability and Evaluation | Cisco Live EMEA 2026
2026-03-25 video agentic-network-change-validation, realtime-experience-observability

Architecting Reliable Agentic AI | Cisco's Giovanna Carofiglio on the AGNTCY Collective
2025-07-09 video agentic-network-change-validation, realtime-experience-observability
