NivTA: Towards a Naturally Interactable Edu-Metaverse Teaching Assistant for CAVE

Authors

Jia, Ye; Sin, Zackary P. T.; Wang, Xiangzhi Eric; Li, Chen; Ng, Peter H. F.; Huang, Xiao; Dong, Junnan; Wang, Yaowei; Baciu, George; Cao, Jiannong; Li, Qing

Published in

2024 IEEE International Conference on Metaverse Computing, Networking, and Applications (MetaCom) (2024)

Keywords

Metaverse, Large language models, Education, Collaboration, User interfaces, Mirrors, Feeds, virtual teaching assistant, LLM agents, cave automatic virtual environment, natural user interface

NivTA integrates an LLM agent, a tracking system, a virtual avatar, and the K-Cube CAVE-VR.

Abstract

Edu-metaverse is a specialized metaverse dedicated to interactive education in an immersive environment. Its main purpose is to immerse the learners in a digital environment and conduct learning activities that could mirror reality. Not only does it enable activities that may be difficult to perform in the real world, but it also extends the interaction to personalized and collaborative learning. This is a more effective pedagogical approach as it tends to enhance the motivation and engagement of students and it increases their active participation in lessons delivered. To this end, we propose to realize an interactive virtual teaching assistant called NivTA. To make NivTA easily accessible and engaging for multiple users simultaneously, we also propose to use a CAVE virtual environment (CAVE-VR) as a "metaverse window" into concepts, ideas, topics, and learning activities. The students simply need to step into the CAVE-VR and interact with a life-size teaching assistant that they can engage with naturally, as if they are approaching a real person. Instead of the text-based interaction currently developed for large language models (LLMs), NivTA is given additional cues regarding the users so it can react more naturally via a specific prompt design. For example, the user can simply point to an educational concept and ask NivTA to explain what it is. To guide NivTA onto the educational concept, the prompt is also designed to feed in an educational knowledge graph (KG) to provide NivTA with the context of the student’s question. The NivTA system is an integration of several components that are discussed in this paper. We further describe how the system is designed and implemented, along with potential applications and future work on interactive collaborative edu-metaverse environments dedicated to teaching and learning.

Why CAVE-VR for a Teaching Assistant?

Most VR-based learning systems target head-mounted displays (HMDs). NivTA targets a CAVE (Cave Automatic Virtual Environment) instead — a room-sized projection space where multiple users see the same virtual content without wearing headsets. This architectural choice is not incidental. A teaching assistant is inherently a shared resource: multiple students should be able to approach it, point at the same concept, and hear the same explanation simultaneously. In an HMD, each user is isolated in their own render; sharing an interaction with a virtual agent requires networking, avatar synchronization, and the social friction of not being able to see your classmates' real faces. In a CAVE, the virtual teaching assistant is simply there, life-size, on the wall, addressable by anyone in the room.

The CAVE becomes what the authors call a "metaverse window" — a portal into the digital knowledge space that preserves the social dynamics of a physical classroom. This is a meaningfully different interaction paradigm from both text-based LLM chatbots (ChatGPT, Claude) and HMD-based VR tutors.

How NivTA Works

NivTA integrates four components: a large language model (the reasoning engine), an educational knowledge graph, or KG (the domain grounding), a tracking system (for interpreting user gestures and pointing), and a life-size virtual avatar rendered in the CAVE.
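
This summary does not say how a pointing gesture is resolved to a concept, so the following is only a minimal sketch of one plausible approach: cast a ray from the tracked hand and pick the displayed concept with the smallest angular offset. `ConceptNode`, `pick_pointed_concept`, the coordinate convention, and the 10-degree tolerance are all assumptions made for illustration, not details of NivTA's tracking system.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ConceptNode:
    """Hypothetical record: a concept label and where it is drawn (metres, room frame)."""
    label: str
    position: Vec3


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _norm(a: Vec3) -> float:
    return math.sqrt(_dot(a, a))


def pick_pointed_concept(
    hand: Vec3,
    direction: Vec3,
    nodes: Sequence[ConceptNode],
    max_angle_deg: float = 10.0,  # assumed tolerance, not taken from the paper
) -> Optional[ConceptNode]:
    """Return the node closest in angle to the pointing ray, or None if none is within tolerance."""
    dir_len = _norm(direction)
    if dir_len == 0.0:
        return None  # no usable pointing direction
    best: Optional[ConceptNode] = None
    best_angle = max_angle_deg
    for node in nodes:
        to_node = _sub(node.position, hand)
        dist = _norm(to_node)
        if dist == 0.0:
            continue  # node coincides with the hand; angle is undefined
        cos_angle = _dot(direction, to_node) / (dir_len * dist)
        # Clamp to guard against floating-point drift outside [-1, 1].
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
        if angle <= best_angle:
            best, best_angle = node, angle
    return best


if __name__ == "__main__":
    wall = [
        ConceptNode("backpropagation", (0.0, 1.5, 3.0)),
        ConceptNode("gradient descent", (1.0, 1.5, 3.0)),
    ]
    target = pick_pointed_concept((0.0, 1.5, 0.0), (0.0, 0.0, 1.0), wall)
    print(target.label if target else "nothing selected")
```

An angular threshold, rather than an exact ray-geometry hit test, tolerates tracking jitter and lets a student standing far from the wall select a small node; a real system would likely also smooth the ray over time.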

The prompt design is the system's intellectual core. When a student points at a concept — say, a node labeled "backpropagation" in a displayed knowledge graph — the tracking system identifies the target, queries the KG for contextual information (what course is this from? what prerequisite concepts does it depend on? what follows from it?), and constructs an LLM prompt that includes: the student's question, the targeted concept, the KG-derived context, and instructions to respond as a teaching assistant at an appropriate level. The prompt constrains the LLM to the educational domain and to the specific concept the student indicated, preventing the kind of free-ranging responses that make generic chatbots unreliable for structured learning.
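
The prompt assembly described above can be sketched as follows. This is an illustrative template, not the authors' actual prompt: `ConceptContext`, `build_ta_prompt`, the field names, and the wording of the instructions are hypothetical, chosen only to show the four ingredients (question, targeted concept, KG-derived context, role instructions) combined into one string.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConceptContext:
    """Hypothetical container for what a KG lookup returns about the pointed-at concept."""
    concept: str
    course: str
    prerequisites: List[str] = field(default_factory=list)
    leads_to: List[str] = field(default_factory=list)


def build_ta_prompt(question: str, ctx: ConceptContext, level: str = "undergraduate") -> str:
    """Combine the student's question with KG-derived context into a single LLM prompt."""
    prereqs = ", ".join(ctx.prerequisites) or "none listed"
    follow_ons = ", ".join(ctx.leads_to) or "none listed"
    lines = [
        # Role and scope instructions (assumed wording).
        f"You are a teaching assistant for the course '{ctx.course}'.",
        f"Answer at a level suitable for a {level} student.",
        "Stay within the course material and the concept indicated below.",
        "",
        # Context derived from the pointing gesture and the knowledge graph.
        f"Concept the student is pointing at: {ctx.concept}",
        f"Prerequisite concepts: {prereqs}",
        f"Concepts that build on it: {follow_ons}",
        "",
        # The spoken question, already transcribed to text.
        f"Student question: {question}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    ctx = ConceptContext(
        concept="backpropagation",
        course="Introduction to Deep Learning",  # made-up course name
        prerequisites=["partial derivatives"],
        leads_to=["optimization algorithms"],
    )
    print(build_ta_prompt("What is this?", ctx))
```

Note that a deictic question such as "What is this?" only becomes answerable once the pointed-at concept is written into the prompt, which is the reason the gesture cue has to reach the LLM at all.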

The interaction is designed to be natural: point at something, ask about it, get an answer. There is no text input, no chat window, no typing in VR. The authors argue — correctly, from an HCI perspective — that the text-based interaction paradigm inherited from chatbots is a poor fit for an embodied learning environment. When a student approaches a human teaching assistant, they point at the whiteboard and ask; NivTA replicates this same embodied interaction pattern.

What This Enables That Chatbots Don't

LLM-based chatbots (including education-specialized ones) operate on a one-user-one-session model. Each student gets a private tutor. NivTA inverts this: the assistant is public, shared, and physically co-located with the learners. This changes the social dynamics of asking questions. In a private chatbot session, a confused student might hesitate to ask a "stupid question" for fear of being judged by the AI (or more realistically, by the instructor reviewing logs). In the CAVE, the assistant is part of the classroom environment — asking it a question is more like asking the teacher at the front of the room. Other students overhear the answer. Follow-up questions emerge organically. The interaction becomes a public good rather than a private utility.

The KG grounding is the other differentiator. A generic LLM, when asked "explain backpropagation," will produce a reasonable-sounding answer that may or may not align with the specific course's framing, prerequisites, and learning objectives. NivTA's prompt, enriched with the course's KG, constrains the response to the relevant conceptual neighborhood. If the KG says backpropagation depends on understanding partial derivatives and feeds into understanding optimization algorithms, NivTA's answer is anchored to those relationships. This is not a technical breakthrough — it's prompt engineering with a structured knowledge base — but it solves a real deployment problem: instructors won't trust an AI teaching assistant that might contradict their curriculum.
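
To make the "conceptual neighborhood" idea concrete, here is a small sketch of a dependency-style KG lookup. The `CourseKG` class and its edge semantics (a concept depends on its prerequisites) are assumptions for illustration; the paper's actual KG schema and storage may differ.

```python
from collections import defaultdict
from typing import Dict, List, Set


class CourseKG:
    """Hypothetical in-memory course knowledge graph with 'depends on' edges only."""

    def __init__(self) -> None:
        self._prereqs: Dict[str, Set[str]] = defaultdict(set)     # concept -> what it depends on
        self._dependents: Dict[str, Set[str]] = defaultdict(set)  # concept -> what builds on it

    def add_dependency(self, concept: str, prerequisite: str) -> None:
        """Record that `concept` depends on `prerequisite`."""
        self._prereqs[concept].add(prerequisite)
        self._dependents[prerequisite].add(concept)

    def neighborhood(self, concept: str) -> Dict[str, List[str]]:
        """Return the one-hop context used to anchor an answer about `concept`."""
        # .get avoids creating empty entries for unknown concepts.
        return {
            "prerequisites": sorted(self._prereqs.get(concept, ())),
            "leads_to": sorted(self._dependents.get(concept, ())),
        }


if __name__ == "__main__":
    kg = CourseKG()
    kg.add_dependency("backpropagation", "partial derivatives")
    kg.add_dependency("optimization algorithms", "backpropagation")
    print(kg.neighborhood("backpropagation"))
```

Restricting the lookup to one hop keeps the injected context short; whether NivTA uses one hop, a wider subgraph, or additional relation types is not stated in this summary.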

Boundaries

The paper is primarily a system description with architecture and design rationale. It does not include a user study evaluating NivTA against a baseline (e.g., a text-based LLM tutor, a human TA, or no assistant). Claims about engagement, learning effectiveness, and naturalness are design arguments, not empirical findings. The system is presented as a working prototype, but the evaluation section describes potential applications and future work rather than measured outcomes.

The CAVE-VR constraint is also a deployment constraint. CAVEs are expensive, space-intensive, and rare outside research universities. The design principles — KG-grounded prompts, embodied pointing-based interaction, shared public access — are portable to other display form factors (large touchscreens, AR glasses, HMDs with passthrough), but the specific implementation is tied to a hardware platform with limited reach. The authors acknowledge this implicitly by framing the paper as "towards" rather than "here is a validated system."