PhD Research Agenda

How Should Autonomous Systems Acquire, Monitor, and Act on Knowledge from Language-Rich Environments?

A programme of work developing the instrumentation, grounding, and behavioural control layers that enable autonomous agents to gather mission-critical information through dialogue and human interaction — efficiently, safely, and adaptively.

Dimitris Panagopoulos  ·  Cranfield University  ·  EPSRC iCASE with Thales UK

Part I

What Is the Problem?

Autonomous systems are increasingly expected to operate in environments where the information they need is not embedded in sensor readings — it is held by people, expressed in natural language, and scattered across unstructured verbal reports. Yet these systems have no principled way to acquire that knowledge through conversation, monitor whether the conversation is productive, or act on what they learn in real time.

Consider a search-and-rescue (SAR) robot arriving at a disaster scene. Sensor data tells it about terrain and obstacles, but the most actionable intelligence — where victims were last seen, which routes are blocked, what hazards have emerged — lives in the minds of witnesses, first responders, and bystanders. This information must be elicited through dialogue, not simply perceived.

SAR is the motivating domain, but the underlying challenge is universal. Any autonomous system that must gather structured knowledge from humans — medical intake robots, disaster coordinators, investigative agents, field survey systems — faces the same constellation of problems:

The Core Gap

There is no integrated framework for how autonomous agents should conduct, monitor, and adapt information-seeking conversations with humans — one that spans from understanding natural language reports, through measuring conversational efficiency in real time, to controlling the agent's questioning behaviour under domain constraints.

This gap sits at the intersection of multiple research frontiers — human-robot interaction, reinforcement learning, dialogue systems, representation engineering, and cognitive science — none of which, individually, provides a complete answer.

Part II

Why Is This Necessary?

The necessity is both practical and scientific.

Practical Necessity

In time-critical operations, wasted interview turns cost lives. An agent that cannot detect that it is asking redundant questions, or cannot adapt its questioning strategy when information flow degrades, or cannot translate a verbal hazard report into a navigational constraint, is not just inefficient — it is operationally dangerous. Human interviewers rely on decades of implicit feedback (micro-expressions, tonal shifts, meta-communication) to regulate the flow of information. Autonomous systems have none of this. They need explicit, computational substitutes — signals that monitor whether information is accumulating productively, not signals that diagnose the human's internal state.

Scientific Necessity

Across AI and robotics, there is a missing measurement and control layer for information-seeking dialogue. Navigation has pose estimation and SLAM. Manipulation has force-torque sensing. Dialogue-based knowledge acquisition has no equivalent telemetry. There are no standard turn-level signals for "how much useful information remains," "is this conversation stalling," or "is the agent's questioning behaviour protocol-compliant." Without such signals, no closed-loop control is possible — and without closed-loop control, autonomous dialogue remains open-loop and fragile.

Furthermore, the field faces a coupling problem. Most systems that use language embed it directly inside the learning or planning loop, entangling language-understanding failures with control failures and making diagnosis, update, and redeployment expensive. And most approaches to behavioural control in dialogue offer only coarse-grained, static guidance — inadequate for the multi-objective, multi-turn, protocol-constrained conversations that characterise professional information gathering.

Part III

The Thesis Statement

An autonomous agent can efficiently, safely, and adaptively gather mission-critical knowledge from humans through dialogue if three capabilities are developed as modular, interoperable layers: (1) a measurement layer that provides turn-level observables for conversational progress and health; (2) a grounding layer that translates streaming natural language into control-relevant signals decoupled from the decision-maker; and (3) a behavioural control layer that dynamically steers questioning strategy under multi-objective constraints using the measurement layer as feedback.

The key architectural insight is separation of concerns. Rather than building a monolithic "dialogue agent," the PhD decomposes autonomous knowledge acquisition into independently validated layers that can be combined, replaced, and transferred across domains and control paradigms.

Part IV

The Three Pillars — How We Address It

The PhD is structured around three interlocking research pillars. Each addresses a distinct research question, produces distinct contributions, and feeds the others.

Pillar 1 — Measurement
📡

Dialogue Telemetry: Instrumenting Conversations

RQ: What turn-level observables can instrument schema-grounded information-gathering dialogues so that progress and stalling can be monitored online?

What: Dialogue Telemetry (DT) — a framework producing two model-agnostic signals after each question-answer exchange: a Progress Estimator (PE) quantifying residual information potential per knowledge category (including an information-theoretic bits-based variant), and a Stalling Index (SI) detecting throughput degradation via repeated category probing with semantically similar, low-gain responses.

Why it's needed: No existing turn-level measurement layer is both model-agnostic and actionable. Prior work focuses on pre-turn question selection or post-hoc breakdown detection. DT fills the gap with after-each-turn telemetry designed for online monitoring and closed-loop control.

How: PE uses schema-grounded informativeness rates, completeness tracking, and semantic embeddings to estimate "how much useful information remains now" per category. SI fuses discrete repetition counts with embedding-space cosine similarity in a trailing window, flagging observable failure signatures without requiring causal diagnosis. Crucially, DT monitors information flow — whether knowledge is accumulating — not the human's internal cognitive or emotional state. Stalling may arise from stress, confusion, knowledge exhaustion, or simply poor questioning; DT detects the symptom and leaves causal attribution outside its scope. Both signals are validated in SAR-inspired interview simulations and integrated as RL policy observations.
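A minimal sketch of how an SI-style signal could be computed, following the description above (fusing repeated category probing with embedding-space similarity of low-gain answers in a trailing window). The thresholds, weights, and turn representation are illustrative assumptions, not the published formulation.

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def stalling_index(turns, window=4, sim_threshold=0.85):
    """Illustrative Stalling Index over a trailing window of turns.

    Each turn is a dict with keys:
      'category'  - knowledge category probed this turn
      'embedding' - answer embedding (list of floats)
      'gain'      - informativeness of the answer in [0, 1]
    Returns a score in [0, 1]; higher means more stalled.
    """
    recent = list(turns)[-window:]
    if len(recent) < 2:
        return 0.0
    # Repetition component: fraction of turns probing the modal category.
    counts = Counter(t['category'] for t in recent)
    repetition = counts.most_common(1)[0][1] / len(recent)
    # Similarity component: fraction of consecutive answer pairs that are
    # near-duplicates AND low-gain (the observable failure signature).
    pairs = list(zip(recent, recent[1:]))
    redundant = sum(
        1 for a, b in pairs
        if cosine(a['embedding'], b['embedding']) >= sim_threshold
        and max(a['gain'], b['gain']) < 0.2
    )
    similarity = redundant / len(pairs)
    return 0.5 * repetition + 0.5 * similarity
```

Note that, as in the framework, the score flags the symptom (repeated probing, near-duplicate low-gain answers) without attributing a cause.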

Pillar 2 — Grounding
🔗

LUCIFER: Language-to-Signal Middleware

RQ: Can online language grounding be externalised from a decision-maker via a stable signal contract, while still yielding safe and efficient behaviour?

What: LUCIFER (Language Understanding and Context-Infused Framework for Exploration and Behaviour Refinement) — a training-decoupled middleware that converts streaming verbal reports and client-agnostic telemetry into four standardised control-relevant signals: policy priors, reward potentials, admissible-action constraints, and action predictions.

Why it's needed: Most systems couple language processing inside the learner/planner, creating redeployment burden and diagnostic opacity. Messy, self-correcting language from stressed humans requires semantic reasoning, not pattern-matching. The field needs a principled interface boundary between language understanding and decision-making.

How: A Context Extractor (RAG-augmented LLM) maps verbal reports to structured semantic objects; a Signal Contract transduces these into mathematical quantities. An optional Discovery service (Exploration Facilitator) predicts high-value information-gathering actions from telemetry alone. Validated with two structurally distinct clients (hierarchical RL and hybrid A*+heuristic planner), demonstrating client-agnosticism: grounding→safety, discovery→efficiency, combination→both.
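To make the four-signal contract concrete, here is a toy rendering of its shape. The field names, the keyword rule standing in for the RAG-augmented LLM extractor, and the grid cell `(3, 4)` are all hypothetical — the point is the interface boundary, not the grounding method.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ControlSignals:
    """Hypothetical rendering of the four-signal contract."""
    policy_priors: Dict[str, float] = field(default_factory=dict)       # action -> prior weight
    reward_potentials: Dict[Tuple[int, int], float] = field(default_factory=dict)  # cell -> potential
    admissible_actions: List[str] = field(default_factory=list)         # hard constraint set
    action_predictions: List[str] = field(default_factory=list)         # discovery suggestions

def ground_report(report: str) -> ControlSignals:
    """Toy stand-in for the Context Extractor + Signal Contract. A real
    system would use semantic reasoning over messy language; a keyword
    rule here only illustrates the shape of the transduction."""
    signals = ControlSignals(admissible_actions=['north', 'south', 'east', 'west'])
    if 'blocked' in report and 'east' in report:
        signals.admissible_actions.remove('east')      # constraint: hazard route excluded
        signals.policy_priors['west'] = 0.6            # prior: bias away from the hazard
    if 'victim' in report:
        signals.reward_potentials[(3, 4)] = 1.0        # potential toward a (hypothetical) reported cell
        signals.action_predictions.append('ask_follow_up_victim_location')
    return signals
```

Any client — a hierarchical RL learner or an A*+heuristic planner — consumes the same `ControlSignals` object, which is what makes the middleware training-decoupled.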

Pillar 3 — Behavioural Control
🎛️

Steering Conversational Behaviour Under Constraints

RQ: How can we achieve fine-grained, adaptive, and interpretable behavioural control over multi-turn conversational AI for information-seeking dialogue under protocol constraints?

What: A systematic exploration of seven candidate control mechanisms for modulating interviewing behaviour — from Representation Engineering (RepE) with RL-controlled compositional concept weights, through hybrid sensor-actuator architectures, to lightweight Thompson Sampling bandits and RAG-based dynamic prompting — all evaluated against the same DT-based metrics.

Why it's needed: LLMs generate fluent text but offer limited mechanisms for controlling how they generate. Cognitive interviewing requires simultaneous balancing of openness, semantic novelty, probing depth, context reinstatement, and leading-tendency suppression — a compositional, dynamic, multi-turn control problem that no single existing method fully solves. The controlled variable is the agent's own questioning behaviour (strategy selection, category targeting, question phrasing), not the human's internal state. The steering policy regulates information flow by adapting what the agent does in response to observable flow signals, rather than by attempting to diagnose or modulate the interviewee's cognition.

How: The research formalises cognitive interviewing as a constrained MDP with DT signals as state observations and multi-objective rewards. Each approach makes different trade-offs between interpretability, sample efficiency, fine-grained control, and stability. The design space spans RepE activation steering, prompt engineering, latent action learning (CoLA), and RAG-based knowledge grounding — providing the community with a systematic map, not a single "best" solution.
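A minimal sketch of what "DT signals as state observations with multi-objective rewards" could look like in the constrained-MDP formalisation. The scalarisation weights and field names are illustrative assumptions, not values from the thesis.

```python
from dataclasses import dataclass

@dataclass
class DTObservation:
    # Turn-level telemetry exposed to the steering policy (names illustrative).
    pe_per_category: dict      # residual information potential per knowledge category
    si: float                  # Stalling Index in [0, 1]

def interview_reward(obs_prev, obs_next, asked_leading_question,
                     w_gain=1.0, w_stall=0.5, w_lead=2.0, turn_cost=0.05):
    """Hypothetical scalarised multi-objective reward: reward the drop in
    total residual potential, penalise stalling, leading questions (a
    protocol constraint rendered as a penalty), and per-turn cost."""
    gain = (sum(obs_prev.pe_per_category.values())
            - sum(obs_next.pe_per_category.values()))
    lead_penalty = w_lead if asked_leading_question else 0.0
    return w_gain * gain - w_stall * obs_next.si - lead_penalty - turn_cost
```

Each of the seven control mechanisms can then be compared on the same footing: they differ in how the policy is represented, not in the observation or reward interface.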

Part V

How the Pillars Interlock

The three pillars are not independent papers stapled together — they form a closed-loop architecture for autonomous knowledge acquisition:

Pillar 2 · Grounding (LUCIFER)

Agent arrives at an information source. Language middleware converts streaming verbal reports into navigational constraints, reward potentials, and query predictions — all through a client-agnostic signal contract.

Pillar 1 · Measurement (Dialogue Telemetry)

During the interview, DT instruments each turn. PE quantifies how much useful information remains per category. SI flags when questioning becomes unproductive. These signals are exposed as real-time observables.

Pillar 3 · Behavioural Control

An RL policy (or bandit, or RAG system) consumes DT signals as state observations, selects the next questioning strategy and concept emphasis, and generates the next question — adapting dynamically to maximise information gain while maintaining protocol adherence.

Closed Loop

The response updates the DT state, which informs the next control decision. Grounded language from new verbal reports can arrive at any time, dynamically updating the agent's constraints and priorities mid-conversation.
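The four stages above can be sketched as a single loop with all three layers stubbed. Everything here is schematic — the greedy "probe the largest gap" policy, the repetition counter standing in for the Stalling Index, and the `answer_gain` plant model are placeholders, not the thesis implementation.

```python
def run_interview(residual, answer_gain, max_turns=10, si_abort=3):
    """Schematic closed loop over the three pillars:
    - telemetry: `residual` potential per category (Pillar 1, PE-like)
    - controller: greedy 'probe the largest remaining gap' policy (Pillar 3)
    - plant: `answer_gain(category)` models the interviewee's reply
    A simple consecutive-repetition counter stands in for the SI cutoff.
    Returns the final residual map and the (category, gain) transcript.
    """
    residual = dict(residual)
    repeats, last = 0, None
    transcript = []
    for _ in range(max_turns):
        category = max(residual, key=residual.get)     # controller decision
        repeats = repeats + 1 if category == last else 1
        if repeats >= si_abort:                        # SI-style cutoff: stop stalling
            break
        gain = answer_gain(category)                   # plant response
        residual[category] = max(0.0, residual[category] - gain)
        transcript.append((category, gain))
        last = category
    return residual, transcript
```

Even in this toy form the loop exhibits the intended behaviour: it harvests the productive category, then abandons the unproductive one once the repetition signal fires instead of burning the remaining turn budget.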

This layered architecture mirrors mature control domains: Pillar 1 is the sensor suite (telemetry), Pillar 2 is the perception layer (language grounding), and Pillar 3 is the controller (behavioural policy). The contribution is not just each layer individually, but the demonstration that they compose into a functioning system.

The Evaluation Infrastructure Gap

Demonstrating the full closed loop requires an interviewee that reacts to how the agent asks questions. The current validation environment uses a static, pre-generated corpus — sufficient for component-level validation of DT and LUCIFER individually, but fundamentally unable to capture the interaction effect that makes adaptive steering meaningful: how you ask should change what you get back.

This is the classic controller-vs-plant problem. The PhD builds the controller; what was missing is a credible plant model. A collaborative MSc project (Summer 2026) is constructing a reactive, persona-configurable LLM interviewee simulator — grounded in distributional statistics from real task-oriented human communication — that exposes the same interface as the current static environment. If it succeeds, every downstream evaluation becomes stronger. If it partially succeeds, even approximate behavioural realism improves on deterministic lookup. If it fails entirely, the existing static corpus remains valid — all published results already use it. The intellectual contributions of the PhD stand independently of the simulator's fidelity; the simulator determines how convincingly the integration can be demonstrated, not whether the architecture works.

Part VI

The Research Outputs

Each pillar is realised through concrete, published or submitted works. Two additional pieces extend the programme into supporting capabilities.

Published · IEEE SMC 2024 · Pillar 2 — Grounding Foundation

Selective Exploration and Information Gathering in Search and Rescue Using Hierarchical Learning Guided by Natural Language Input

The foundational paper that established the architecture. Introduced the integration of LLMs with hierarchical RL for SAR, demonstrating how verbal inputs from human stakeholders can be transformed into actionable RL insights via an Information Space, Context Extractor, Strategic Decision Engine, and Attention Space. Showed that language-infused hierarchical agents outperform flat RL agents — especially in sparse-reward environments — and that domain-knowledge infusion via RAG produces properly grounded outputs.

Submitted · IEEE Journal · Pillar 2 — Grounding

LUCIFER: Grounding and Discovery Services for Language-Assisted Decision-Making — A Signal Contract Architecture

The full maturation of the grounding layer. Formalises language grounding as training-decoupled middleware with a four-signal contract (priors, potentials, constraints, action predictions). Validates robustness on messy, self-correcting language (91-100% accuracy where pattern-matching baselines collapse to 20-36%), and demonstrates client-agnosticism with two structurally distinct downstream consumers. Introduces the Discovery service for telemetry-based query prediction.

Submitted · IEEE Journal · Pillar 1 — Measurement

Dialogue Telemetry: Turn-Level Instrumentation for Autonomous Information Gathering

Defines and validates the measurement framework. Introduces PE (Progress Estimator) in two forms — heuristic expected-gain and Shannon-based expected information gain (bits) — and SI (Stalling Index) combining discrete repetition with semantic similarity analysis. Demonstrates that DT accurately distinguishes efficient from stalled dialogue traces (100% detection, 0% false positives in controlled scenarios) and improves RL policy performance when stalling carries operational costs.
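The Shannon-based PE variant can be illustrated as entropy bookkeeping over a category's unresolved slots. The slot-belief representation below is an assumption for illustration; the paper's exact formulation may differ.

```python
import math

def entropy_bits(probs):
    # Shannon entropy in bits of a discrete distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def residual_potential_bits(category_slots):
    """Illustrative bits-based PE: the residual potential of a category is
    the summed entropy of its still-unresolved slots. `category_slots`
    maps a slot name to the current belief over its possible values; a
    resolved slot has a degenerate distribution and contributes zero bits."""
    return sum(entropy_bits(dist.values()) for dist in category_slots.values())
```

An answer that narrows a slot's belief lowers the category's residual bits, giving the controller a principled "how much is left to learn here" signal.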

Submitted · IEEE Conference · Supporting — Priority Adaptation

CA-MIQ: Context-Aware Max-Information Q-Learning for Priority-Driven Information Gain

Addresses what happens after information is gathered: how does an agent adapt when mission priorities shift? Introduces a dual-critic RL framework where an intrinsic critic fuses state-novelty, information-location awareness, and real-time priority alignment, with a shift detector triggering transient exploration boosts and selective critic resets. Achieves nearly 4× higher mission success rates than baselines after priority shifts and 100% recovery where baselines fail.
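A sketch of the shift-detection and transient-exploration-boost idea, under stated assumptions: the L1-drift trigger and exponentially decaying boost are plausible renderings of the described mechanism, not the published one.

```python
def detect_priority_shift(prev_priorities, new_priorities, threshold=0.3):
    """Illustrative shift detector: flag a priority shift when the L1
    distance between consecutive mission-priority vectors exceeds a
    threshold (the CA-MIQ mechanism may differ)."""
    drift = sum(abs(new_priorities[k] - prev_priorities.get(k, 0.0))
                for k in new_priorities)
    return drift > threshold

def exploration_rate(base_eps, turns_since_shift, boost=0.5, decay=0.9):
    # Transient exploration boost after a detected shift,
    # decaying back toward the baseline rate.
    return base_eps + boost * (decay ** turns_since_shift)
```

On a detected shift, the boosted rate lets the agent re-explore the information space under the new priorities before settling back into exploitation.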

Draft · Supporting — Human Factors

ITLI: An LLM-Derived Intrinsic Task-Load Index for Cognitive Load Estimation

Bridges cognitive science and AI engineering. Proposes computing intrinsic cognitive load from textual information streams by having an LLM extract linguistically grounded features (entity count, relation count, temporal/conditional clauses, quantitative constraints, token complexity) mapped to Sweller's element-interactivity scale. Offers a scalable, non-intrusive alternative to physiological sensors or subjective surveys.
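The aggregation step might look like the following sketch, where LLM-extracted feature counts are combined into a single bounded index. The feature weights and the squashing onto a 0-10 scale are hypothetical, not taken from the draft.

```python
def itli_score(features, weights=None):
    """Illustrative ITLI aggregation: combine LLM-extracted linguistic
    feature counts into one intrinsic-load index (weights hypothetical).
    `features` maps a feature name to its raw count in the text."""
    weights = weights or {
        'entities': 1.0, 'relations': 1.5, 'temporal_clauses': 1.2,
        'conditional_clauses': 1.5, 'quantitative_constraints': 1.3,
        'token_complexity': 1.1,
    }
    raw = sum(weights[k] * features.get(k, 0) for k in weights)
    # Squash onto a bounded 0-10 element-interactivity-style scale.
    return 10 * raw / (raw + 10)
```

The bounded scale keeps the index comparable across reports of very different lengths, in the spirit of element interactivity rather than raw verbosity.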

Draft / Design Report · Pillar 3 — Behavioural Control

Behavioural Control for Information-Seeking Dialogue: A Systematic Design Space Exploration

The capstone design report. Formalises cognitive interviewing as a multi-objective constrained control problem and systematically maps seven candidate approaches spanning the full design space: (a) Original RepE with RL-controlled concept weights, (b) Hybrid RepE-Sensor + Prompt-Actuator, (c) RePS (enhanced vector identification with orthogonalisation), (d) CoLA (learned latent action codes), (e) Adaptive Style Modulation via Thompson Sampling, (f) Prompt-R1 (end-to-end RL prompt generation), and (g) RAG-Based Dynamic Prompting. Each is formalised as an MDP, with architectural diagrams, theoretical strengths/limitations, and connections to DT-based evaluation metrics.

Part VII

Transfer Beyond SAR — Why This Matters Generally

SAR is the testbed, not the boundary. Every contribution in this PhD addresses a general pattern that recurs wherever autonomous systems interact with language-rich, human-populated environments:

Dialogue Telemetry (PE + SI)
SAR instance: monitoring witness interviews for stalling and information coverage.
General pattern: turn-level instrumentation for any schema-grounded information-gathering dialogue.
Transfer domains: medical intake, insurance claims, investigative journalism, customer support, educational tutoring.

Signal Contract (LUCIFER)
SAR instance: converting hazard/victim reports into navigational constraints.
General pattern: training-decoupled middleware for converting streaming language into control-relevant signals.
Transfer domains: any robotic system receiving verbal instructions or reports — warehouse logistics, autonomous vehicles, agricultural robots.

Information Space + Priority Dynamics (CA-MIQ)
SAR instance: adapting to shifting SAR mission priorities.
General pattern: piecewise-stationary priority adaptation for information-directed exploration.
Transfer domains: dynamic resource allocation, adaptive surveying, intelligence analysis, clinical triage prioritisation.

Behavioural Control Design Space
SAR instance: steering cognitive interview behaviour.
General pattern: fine-grained, adaptive, interpretable control of LLM generation under multi-objective constraints.
Transfer domains: any high-stakes conversational AI — therapy chatbots, negotiation agents, legal deposition assistants, educational dialogues.

ITLI (Cognitive Load Estimation)
SAR instance: estimating operator cognitive load from mission reports.
General pattern: non-intrusive cognitive bandwidth estimation from textual information streams.
Transfer domains: adaptive UIs, variable autonomy, intelligent tutoring, accessibility systems.
The Unifying Principle

The PhD's deepest contribution is the recognition that autonomous knowledge acquisition from humans requires the same engineering discipline as physical control: you need sensors (DT), a perception layer (LUCIFER), and a controller (behavioural policy) — and they must be modular, composable, and independently validated. This architectural pattern transfers wholesale to any domain where agents must gather, monitor, and act on human-communicated knowledge.

Part VIII

What Didn't Exist Before

To be precise about the new knowledge this PhD produces:

New Contributions

1. Turn-level telemetry for goal-directed dialogue. PE and SI are the first model-agnostic, real-time signals designed for schema-grounded information gathering that work both as diagnostic readouts and as RL policy observations.

2. Training-decoupled language grounding via signal contracts. LUCIFER is the first middleware to formalise a client-agnostic mathematical interface between streaming language and heterogeneous decision-makers, validated across learning and non-learning paradigms.

3. Priority-adaptive information-gain exploration. CA-MIQ is the first information-gain-driven exploration strategy that responds to piecewise-stationary shifts in mission-level priorities.

4. Systematic design space for behavioural control of information-seeking dialogue. The seven-approach comparison, unified by DT-based evaluation metrics, provides the first systematic map of how to steer LLM questioning behaviour under multi-objective constraints.

5. LLM-based intrinsic cognitive load estimation. ITLI is the first attempt to computationally estimate element interactivity from textual information streams using LLMs.

What Is NOT Claimed

This PhD does not claim to have solved "the physics of dialogue" or created a universally optimal interviewing agent. It claims to have built practical, validated tools — telemetry signals, a grounding middleware, and a design space of controllers — that are useful, computable, principled, and transferable. The deeper formalisations (e.g., continuous manifold structure for dialogue dynamics) remain future work.

A critical distinction: this work regulates information flow, not human communicative capacity. The thesis formalisation includes a latent variable C(t) representing the human's effective transmission capacity, which can degrade under inappropriate querying. This is an empirically motivated abstraction drawn from neurocognitive and crisis-communication literature — not a variable the system claims to measure, estimate, or directly control.

The Information Extraction Paradox (that urgency degrades the human's ability to provide information) motivates why adaptive strategy selection is necessary. DT and steering implement adaptation by monitoring observable information-flow symptoms (residual gaps, revisitation, low gain) and responding with strategy changes. Whether those strategy changes also benefit the human's internal state is a plausible hypothesis supported by the interviewing literature, but proving it would require validated models of how interviewer behaviour affects interviewee cognition — a question at the intersection of computational psycholinguistics and control theory that lies beyond the scope of this engineering-focused thesis.

The system acts as a thermostat for information flow: it detects deviations from productive exchange and actuates strategy changes, without diagnosing why the deviation occurred.

Part IX

The Research Philosophy

This PhD was driven by a conviction that autonomous systems need to become active participants in the knowledge-gathering process, not passive recipients of pre-processed data. Humans naturally seek help when facing uncertainty — they ask questions, interview witnesses, consult experts. Robots should do the same, and they should do it well.

The path was not linear. The early inspiration came from an analogy between dialogue dynamics and fluid flow — the idea that conversations have "gradients" toward information and "circulation" when stuck. But formalising that analogy fully would require mathematical machinery beyond the scope of a single PhD. Instead, the work pivoted to computational proxies — PE and SI — that capture the functional intent of those metaphors using information theory and embedding geometry. This is the tractable, validated path rather than the imaginative but shallow one.

A related pivot shaped the thesis's relationship with psychology. The Information Extraction Paradox draws on well-established findings from neurocognition and crisis communication: that stress impairs recall, that aggressive questioning can reduce information quality, that interaction style affects communicative capacity. These findings motivate the engineering architecture but are not claims the thesis sets out to prove. The system monitors information flow — observable patterns in what the interviewee says — not the human's latent cognitive or emotional state. It regulates the agent's own behaviour (which questions to ask, in what style, targeting which knowledge gaps) based on those observable patterns. This is a deliberate engineering stance: treat the human side of the channel with respect for its complexity, instrument what is observable, and control what is controllable — namely, the agent's own actions. Validated models of how interviewer behaviour affects interviewee cognition would enable a deeper, truly co-regulatory architecture; this thesis identifies that as a frontier rather than claiming to have reached it.

The tools work. The deeper mathematics remains for the future. And that is a mature, honest research contribution.

Each component was designed with a principle: build frameworks others can extend, not just deliverables for a contracted project. The DT signals are model-agnostic. The Signal Contract is client-agnostic. The behavioural control design space is approach-agnostic. This modularity is intentional — it is what makes the work transferable beyond SAR, beyond any single application domain, and beyond the lifetime of this PhD.

The same philosophy extends to the research infrastructure. In any multidisciplinary PhD, the researcher is effectively building both the controller and the plant model — a situation where, in most engineering research, one of these pre-exists. Recognising this, the evaluation infrastructure (a reactive, behaviourally grounded interviewee simulator) has been scoped as a self-contained collaborative project: the contributor builds the plant model without needing to understand the control theory, while the PhD researcher retains sole ownership of the intellectual contributions. This separation is deliberate — it mirrors the same modularity principle that governs the technical architecture itself.

The evaluation strategy reflects this scope honestly. Component-level validation (DT monitoring accuracy, LUCIFER grounding fidelity, steering policy performance) uses controlled simulation environments. Integration-level validation uses a reactive, persona-configurable simulator where different interviewee profiles present different information landscapes, and the steering policy must adapt its trajectory accordingly. This tests adaptive information-space navigation — the claim that DT-guided steering efficiently harvests available information across diverse interviewee types. It does not test whether steering regulates the human's internal capacity, which would require persona dynamics models that evolve in response to interviewer behaviour — a capability identified as immediate future work requiring validated computational models of interviewer–interviewee co-regulation.

Closing

The Research Arc in One Sentence

This PhD develops the measurement, grounding, and control layers that enable autonomous agents to conduct, monitor, and adapt information-seeking conversations with humans — treating dialogue-based knowledge acquisition with the same engineering rigour applied to physical sensing and control.