Human-AI Collaboration in the STEM Classroom: A Systematic Literature Review of GenAI as a Complement in Higher Education

Authors

Ye Jia, Chen Li, JAlexander Chan, Qing Li

Published in

The 8th International Conference on Technology in Education (2025)

Keywords

Large Language Models, Higher Education, STEM Education, Systematic Literature Review, Generative AI, Human-AI Collaboration

Beyond Tool and Tutor

The discourse around GenAI in education has been dominated by two frames: GenAI as tool (a calculator for writing, a search engine that synthesizes) and GenAI as tutor (a personalized instructor that adapts to the learner). Both frames position the AI as serving the human in a one-way relationship. This review examines a third, less explored frame: GenAI as collaborator — an entity that works alongside learners and instructors in a bidirectional relationship where both parties contribute, critique, and build on each other's output.

This distinction is not merely semantic. A tool is used and put down; a collaborator is engaged with, argued with, and learned from. The review argues that STEM education, with its emphasis on problem-solving, iterative refinement, and collaborative sensemaking, is the domain where the collaborator frame is most natural and most understudied.

The Review Method

The review followed PRISMA guidelines, systematically searching major databases for studies published between 2020 and 2024 that examined GenAI use in STEM higher education settings. Inclusion criteria required that the study treat GenAI as a collaborative partner (not merely as a tool or tutor) and that it report empirical data on learning processes or outcomes. The resulting corpus — drawn from computer science education, engineering, mathematics, and natural sciences — was analyzed through thematic synthesis to identify recurring patterns of human-AI collaboration, their effects on learning, and the conditions under which collaboration succeeds or fails.

What the Literature Shows

The review identifies several distinct patterns of human-AI collaboration in STEM classrooms:

Co-creation. Students and GenAI jointly produce artifacts — code, proofs, experimental designs, data analyses. The student proposes a direction; the AI generates a draft; the student critiques and refines; the AI incorporates feedback and regenerates. This iterative cycle, when it works, mirrors pair programming or collaborative writing between humans. When it fails — and the review documents failures — it's typically because the student accepts AI output without critique, collapsing collaboration into copying.

Socratic dialogue. Students use GenAI not to get answers but to be questioned. "Challenge my proof." "What assumptions am I making?" "Where is this argument weak?" The AI serves as a critical thinking partner, and the student must defend or revise their reasoning. This pattern is particularly relevant in mathematics and theoretical computer science, where the goal is developing rigorous reasoning rather than producing artifacts.

Explanation generation and critique. GenAI generates explanations of complex STEM concepts; students evaluate the explanations for accuracy, completeness, and clarity. The learning comes not from reading the explanation but from the evaluation — identifying what the AI got wrong, what it omitted, what it oversimplified. This inverts the typical AI-as-tutor model: the student tutors the AI by diagnosing its errors.

Scaffolded exploration. GenAI generates variations on a problem or concept ("what if we changed this parameter?" "how would this proof work for a different class of graphs?"), enabling students to explore the problem space more broadly than a fixed problem set allows. The AI does not solve problems; it generates new problems for the student to solve.

Design Principles That Emerge

Across these patterns, the review synthesizes several cross-cutting design principles for human-AI collaboration in STEM education:

Explainability over correctness. Systems designed for collaboration should prioritize showing their reasoning over guaranteeing correct answers. A wrong answer with visible reasoning is pedagogically more valuable than a correct answer with hidden reasoning, because the former gives the student something to critique.

Complementary capability. The AI should contribute what the human is bad at (generating variations, retrieving relevant examples, checking consistency), while the human does what the AI is bad at (evaluating relevance, exercising judgment, connecting to personal experience). Collaboration works best when the division of labor exploits comparative advantage.

Friction by design. Systems should introduce productive friction (prompts that ask "are you sure?" or "what about this counterexample?") rather than smoothing the interaction to maximize efficiency. The goal is learning, not task completion speed.

Transparency of limitations. Students who understand what GenAI cannot do (reason causally, verify facts, maintain consistency across long interactions) collaborate more effectively than students who treat it as omniscient.
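
One way to picture the "friction by design" principle is a thin wrapper around the model call that periodically appends a reflective question to the AI's reply. The sketch below is an illustration under stated assumptions, not a system described in the review: the names `respond_with_friction`, `fake_generate`, and `FRICTION_PROMPTS`, and the `every_n` schedule, are all hypothetical, and `fake_generate` merely stands in for a real GenAI call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical reflective prompts. The two strings are the examples quoted
# in the text; the review does not prescribe any particular wording.
FRICTION_PROMPTS = [
    "Are you sure?",
    "What about this counterexample?",
]


@dataclass
class Turn:
    speaker: str  # "student" or "assistant"
    text: str


def respond_with_friction(
    student_message: str,
    generate: Callable[[str], str],
    turn_index: int,
    every_n: int = 2,
) -> List[Turn]:
    """Return the assistant's turns for one student message.

    Every `every_n`-th turn (an assumed schedule), a reflective prompt is
    appended after the generated reply so the student has to pause and
    evaluate it instead of simply accepting it.
    """
    turns = [Turn("assistant", generate(student_message))]
    if turn_index % every_n == 0:
        # Cycle through the available prompts.
        prompt = FRICTION_PROMPTS[(turn_index // every_n) % len(FRICTION_PROMPTS)]
        turns.append(Turn("assistant", prompt))
    return turns


def fake_generate(message: str) -> str:
    # Stand-in for a real GenAI call (assumption for illustration only).
    return f"Draft response to: {message}"


if __name__ == "__main__":
    for i, msg in enumerate(["Here is my proof of the lemma.", "I fixed step 2."]):
        for turn in respond_with_friction(msg, fake_generate, turn_index=i):
            print(f"[{turn.speaker}] {turn.text}")
```

The fixed schedule is a deliberate simplification. A classroom system would more plausibly trigger the reflective prompt based on how the student engages, for example when output is accepted without any edits, but that policy is a design choice outside what the review specifies.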

The Award Context

This paper received the Excellent Paper Award at ICTE 2025. The recognition likely reflects the review's timeliness and systematic rigor: it arrives at a moment when universities are scrambling to develop GenAI policies, and it provides an evidence-based framework for moving beyond the binary "ban it or embrace it" debate toward more nuanced pedagogies of collaboration.

Boundaries

The review is limited by the youth of the field. Most included studies were conducted between 2023 and 2024, with GenAI models (GPT-3.5, GPT-4, Claude 2) that are already multiple generations behind the frontier. Whether the collaboration patterns identified here generalize to more capable models — which may be harder to critique because they make fewer obvious errors — is an open question. The review also captures primarily Western university contexts; cultural dimensions of human-AI collaboration (how students from different educational traditions approach critiquing an AI, for instance) are not addressed.

The "collaborator" frame itself has limitations. Calling an LLM a collaborator anthropomorphizes a system that has no intentions, no understanding, and no accountability. The review is careful to use "collaboration" to describe the interaction pattern, not the AI's ontological status — but the distinction may be lost on students, who already tend to over-attribute agency to AI systems. Managing this framing risk — encouraging productive collaboration without fostering inappropriate trust — is a practical challenge the review identifies but does not resolve.