Module-Level Human-Machine Collaborative Software Engineering Architecture Design

AI Software Engineering

👤 Software engineers, AI researchers, technical managers, and practitioners focused on human-machine collaboration and automated software development.

This paper addresses the poor quality, unclear boundaries, and slow speed of existing AI Agents in code module implementation by proposing a module-level human-machine collaborative software engineering architecture. The architecture produces a Protocol Spec through rapid intent alignment, then generates implementation, test, and benchmark specifications in parallel, and ensures implementation quality through a multi-level arbitration mechanism. Core designs include layered collaboration, specialized division of labor, and separation of concerns, with clear acceptance criteria (passing unit tests, no performance regression) that establish a trust mechanism and remove the human desire to control details. The paper also discusses unresolved issues such as improving Protocol Spec quality and avoiding arbitration loops, and envisions using a higher-level AI to replace human supervision.

✨ Existing AI Agents have issues with poor quality, unclear boundaries, and slow speed in code module implementation

✨ Proposes a module-level human-machine collaborative architecture that generates Protocol Spec through rapid intent alignment

✨ The architecture adopts layered collaboration, generating the Implementation Spec, Test Spec, and Benchmark Spec in parallel

✨ Ensures implementation quality through multi-level arbitration mechanisms, reducing human intervention

✨ Defines clear acceptance criteria (passing unit tests and no performance regression) to establish a trust mechanism

📅 2026-01-05 · 1,384 words · ~7 min read

Human-Machine Collaboration
Software Engineering
LLM
AI Agent
Module Design
Arbitration Mechanism
Automated Development

Module-Level Human-AI Collaborative Software Engineering Architecture

Problem Background

The goal is to design an LLM-based architecture for module-level human-AI collaborative engineering that efficiently completes the design, implementation, and iteration of industrial-grade application modules while reducing the cost of human intervention.

Existing AI Agents (Claude Code, CodeX) produce poor-quality code module implementations, still requiring significant human intervention, rework, and review.
Existing AI Agents struggle to construct clear module boundaries during implementation, leading to code with unnecessary complexity.
Existing AI Agents are too slow; a task from assignment to acceptance takes 10-30 minutes.

Problem Insights

According to the viewpoint in this article, the human desire for control stems from rational concerns about losing control over outcomes. Establishing a mechanism for controllable trust is the solution.
According to the viewpoint in this article, the physical and economic mechanisms of LLMs inherently make it difficult for them to complete all work perfectly in one go.

The key to liberating human productivity lies in eliminating the human desire to control details. Once that happens, humans, adopting a "good enough" mentality, will no longer excessively scrutinize the AI's output.

So, after which checks will a human conclude that they can no longer usefully intervene, or that further action is unnecessary?

The conceptual naming and style of the module's external interfaces align with requirements. This alleviates concerns about poorly designed interfaces propagating downstream in the system.
Passing unit tests. This alleviates concerns about whether the module functions correctly.
Optimization or no regression in benchmark tests. This alleviates concerns about the module's efficiency.

The first point can be identified early on, while the latter two are only known after experimentation. If all three are satisfied, humans have little reason to forcibly intervene in the AI's completed work.
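The three checks can be read as a simple gate. The sketch below is only an illustration of that reading: the `AcceptanceChecks` type, its field names, and the `open_concerns` helper are hypothetical and not part of any existing tool. It maps each unmet check to the concern it is meant to alleviate.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AcceptanceChecks:
    """Hypothetical record of the three checks; names are illustrative."""
    interface_approved: bool        # known early, during intent alignment
    unit_tests_passed: bool         # known only after the implementation runs
    benchmark_not_regressed: bool   # known only after benchmarks run


def open_concerns(checks: AcceptanceChecks) -> List[str]:
    """Return the concerns that still justify human intervention."""
    concerns: List[str] = []
    if not checks.interface_approved:
        concerns.append("poorly designed interfaces may propagate downstream")
    if not checks.unit_tests_passed:
        concerns.append("the module may not function correctly")
    if not checks.benchmark_not_regressed:
        concerns.append("the module may be inefficient")
    return concerns
```

An empty list corresponds to the case where humans have little reason to intervene.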

As for whether this module can truly handle real-world data patterns, it must be tested with production environment data. Then, humans can summarize the patterns and construct a new module via intent to solve new problems. This issue is temporarily outside the scope of this article.

Priority Objectives

Reduce human intervention.
Reduce runtime, improve speed.
Reduce Token usage, lower LLM costs.

Design

graph TD

   subgraph Agent
   A_1[Protocol Spec]
   A_1 --> A_2[Protocol Code]
   A_1 --> A_A[Implementation Spec]
   A_1 --> A_B[Test Spec]
   A_1 --> A_C[Benchmark Spec]
   A_A --> A_A_1[Implementation Code]
   A_B --> A_B_1[Test Code]
   A_C --> A_C_1[Benchmark Code]
   A_A_1 --> A_D[Report]
   A_B_1 --> A_D
   A_C_1 --> A_D
   end

   subgraph Human
   H_1[Intention] -->|Dispatch| A_1
   A_D -->|Review| H_1
   end

Rapid Intent Alignment

The human quickly aligns the module's functional requirements with the Agent through an intent description, resulting in a Protocol Spec.

This Protocol Spec includes the module's interface definitions, input/output data formats, functional descriptions, etc., essentially similar to an RFC document. The human needs to focus on interface definitions and functional descriptions to ensure clear module boundaries, particularly paying attention to the style and taste of the interfaces.

This process can involve multiple rounds of interaction. The Agent will continuously revise the Protocol Spec based on human feedback until approved by the human.
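As a rough sketch of this revision loop, assuming the Agent and the human can each be modeled as a plain function: `draft_spec`, `human_review`, and the `max_rounds` cap are all hypothetical names and choices introduced here for illustration.

```python
from typing import Callable, List, Optional


def align_intent(
    intent: str,
    draft_spec: Callable[[str, List[str]], str],
    human_review: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> str:
    """Revise the Protocol Spec until the human approves it.

    draft_spec(intent, feedback_so_far) stands in for the Agent.
    human_review(spec) returns None to approve, or a feedback string.
    """
    feedback_so_far: List[str] = []
    for _ in range(max_rounds):
        spec = draft_spec(intent, feedback_so_far)
        feedback = human_review(spec)
        if feedback is None:
            return spec                      # approved: automation can start
        feedback_so_far.append(feedback)     # carry all feedback into the next draft
    raise RuntimeError("Protocol Spec was not approved within max_rounds")
```

The cap on rounds is an assumption of this sketch; the process described above simply continues until the human approves.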

Following this, there will be a lengthy automated implementation process during which human intervention is not required. Two outcomes are possible: 1. The module implementation succeeds, generating a final report for human review; 2. The module implementation fails, generating an arbitration request for human intervention.

Generate Protocol Code from Protocol Spec

The Agent generates the skeleton code (Protocol Code) for the module based on the Protocol Spec, including interface definitions and comments. The Protocol Code will be used for subsequent implementation, test, and benchmark code generation. Its primary purpose is to ensure clear module boundaries and avoid unnecessary complexity during implementation.
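To make "skeleton code" concrete, here is what Protocol Code might look like in Python, using `typing.Protocol` for the interface. The `KeyValueStore` module and its three methods are an invented example, not something prescribed by the architecture.

```python
from typing import Optional, Protocol


class KeyValueStore(Protocol):
    """Hypothetical module interface; the module and its methods are illustrative."""

    def get(self, key: str) -> Optional[bytes]:
        """Return the value stored under key, or None if the key is absent."""
        ...

    def put(self, key: str, value: bytes) -> None:
        """Store value under key, replacing any existing value."""
        ...

    def delete(self, key: str) -> bool:
        """Remove key; return True if it existed, False otherwise."""
        ...
```

The skeleton contains only signatures and comments, so implementation, test, and benchmark code can all be written against the same boundary.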

Generate Implementation Spec, Test Spec, and Benchmark Spec in Parallel from Protocol Spec

Different Agents are tasked to generate the Implementation Spec, Test Spec, and Benchmark Spec based on the Protocol Spec, describing the module's implementation details, test cases, and benchmark testing plan, respectively.

Generate Test Code from Test Spec

A specialized testing Agent generates the module's unit test code (Test Code) based on the Protocol Spec and Test Spec, including various test cases and assertions. Interface-based testing methods must be used to avoid coupling with implementation details.
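A minimal sketch of interface-based testing, assuming a hypothetical `KeyValueStore` interface: the tests receive a factory and touch only the public methods, so any implementation can be swapped in. `make_contract_tests` and `DictStore` are illustrative names.

```python
import unittest
from typing import Callable, Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """Hypothetical interface under test."""
    def get(self, key: str) -> Optional[bytes]: ...
    def put(self, key: str, value: bytes) -> None: ...


def make_contract_tests(factory: Callable[[], KeyValueStore]) -> type:
    """Build a TestCase that exercises any implementation through the interface only."""

    class ContractTests(unittest.TestCase):
        def setUp(self) -> None:
            self.store = factory()   # fresh instance per test, no shared state

        def test_missing_key_returns_none(self) -> None:
            self.assertIsNone(self.store.get("absent"))

        def test_put_then_get_round_trips(self) -> None:
            self.store.put("k", b"v")
            self.assertEqual(self.store.get("k"), b"v")

    return ContractTests


class DictStore:
    """Stand-in implementation used only to show the tests running."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value
```

Since the tests never inspect `_data` or any other internal, regenerating the Implementation Code does not require regenerating the Test Code.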

Generate Benchmark Code from Benchmark Spec

A specialized benchmarking Agent generates the module's benchmark test code (Benchmark Code) based on the Protocol Spec and Benchmark Spec, including performance test cases and measurement metrics. Interface-based testing methods must be used to avoid coupling with implementation details.
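The same idea applies to benchmarks. The sketch below assumes a store-like module exposing `put` and `get`; the function name, the workload, and the reported metrics are all illustrative choices rather than a prescribed Benchmark Spec.

```python
import time
from typing import Any, Callable, Dict


def benchmark_put_get(
    make_store: Callable[[], Any],
    n_ops: int = 10_000,
    repeats: int = 5,
) -> Dict[str, float]:
    """Time n_ops put/get pairs through the public interface; report the best run."""
    best = float("inf")
    for _ in range(repeats):
        store = make_store()              # fresh instance per run, built via the factory
        start = time.perf_counter()
        for i in range(n_ops):
            key = f"k{i}"                 # key formatting is included in the timing
            store.put(key, b"x")
            store.get(key)
        best = min(best, time.perf_counter() - start)
    return {
        "n_ops": float(n_ops),
        "best_seconds": best,
        "ns_per_op": best / n_ops * 1e9,
    }
```

Taking the best of several runs is one common way to reduce noise; a real Benchmark Spec might instead call for medians or percentiles.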

Generate Implementation Code from Implementation Spec

A specialized implementation Agent generates the module's implementation code (Implementation Code) based on the Protocol Spec, Implementation Spec, Test Spec, and Benchmark Spec. Once implementation is complete, unit tests are run immediately.

If unit tests fail, analyze the cause.
- If the issue is believed to be with the Implementation, modify the Implementation Spec and regenerate the Implementation Code. Repeat this process.
- If the issue is believed to be with the Test, collect details of the test failure and compile them into a counter-argument. This will then be handled by a higher-level arbitration Agent.
  - If the counter-argument is accepted, the arbitration Agent can choose to modify the Test Spec and rerun the tests. Repeat this process.
  - If the counter-argument is rejected, the arbitration Agent generates an explanatory opinion, instructing the implementation Agent to modify the Implementation Spec and restart the implementation process. Repeat this process.
  - If the arbitration Agent deems itself unable to judge, it will request human intervention for arbitration.
If unit tests pass, proceed to benchmark testing.
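The branching above can be sketched as a small loop. Everything here is a hypothetical model: the enums, the callback names, and in particular the `max_attempts` cap and the choice to escalate to the human once it is hit are assumptions of this sketch.

```python
from enum import Enum, auto
from typing import Callable


class Blame(Enum):
    IMPLEMENTATION = auto()   # the implementation Agent suspects its own code
    TEST = auto()             # the implementation Agent suspects the test


class Verdict(Enum):
    ACCEPTED = auto()         # counter-argument accepted: revise the Test Spec
    REJECTED = auto()         # counter-argument rejected: revise the Implementation Spec
    UNDECIDED = auto()        # arbitration Agent cannot judge


class Outcome(Enum):
    PASSED = auto()
    NEEDS_HUMAN = auto()


def run_unit_test_loop(
    tests_pass: Callable[[], bool],
    diagnose: Callable[[], Blame],
    arbitrate: Callable[[], Verdict],
    revise_implementation: Callable[[], None],
    revise_tests: Callable[[], None],
    max_attempts: int = 5,
) -> Outcome:
    """Iterate implement/test/arbitrate until tests pass or a human is needed."""
    for _ in range(max_attempts):
        if tests_pass():
            return Outcome.PASSED
        if diagnose() is Blame.IMPLEMENTATION:
            revise_implementation()
            continue
        verdict = arbitrate()             # the counter-argument goes to arbitration
        if verdict is Verdict.ACCEPTED:
            revise_tests()
        elif verdict is Verdict.REJECTED:
            revise_implementation()
        else:
            return Outcome.NEEDS_HUMAN
    return Outcome.NEEDS_HUMAN            # assumption: escalate once the cap is reached
```

Without some cap of this kind, the implementation and arbitration Agents could in principle keep revising indefinitely.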

Run Benchmark Tests

The Implementation Code that passed unit tests can now run benchmark tests.

If no other comparable implementation version exists, mark the current implementation as the baseline version, run the benchmark tests, record performance metrics, and pass the benchmark test.

If other comparable implementation versions exist, run the benchmark tests and record performance metrics. Generate a comparison report for the Agent to analyze the performance changes of the current implementation version.
- If the current implementation version's performance regresses, analyze the cause of regression.
  - If the issue is believed to be with the Implementation, modify the Implementation Spec and regenerate the Implementation Code. Repeat this process.
  - If the issue is believed to be with the Benchmark, collect details of the benchmark test failure and compile them into a counter-argument. This will then be judged by a higher-level arbitration Agent.
    - If the counter-argument is accepted, the arbitration Agent can choose to modify the Benchmark Spec and rerun the benchmark tests. Repeat this process.
    - If the counter-argument is rejected, the arbitration Agent either returns it to the implementation Agent, instructing it to modify the Implementation Spec and restart the implementation process (repeat this process), or declares the task failed and generates a final report for human review.
    - If the arbitration Agent deems itself unable to judge, it will request human intervention for arbitration.
- If the current implementation version's performance does not regress, the benchmark test is passed.
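The baseline and regression rules can be sketched as a single comparison function. This assumes all metrics are lower-is-better (for example, nanoseconds per operation) and uses a 5% noise tolerance; both are assumptions of this sketch, and `compare_to_baseline` is a hypothetical name.

```python
from typing import Dict, Optional


def compare_to_baseline(
    current: Dict[str, float],
    baseline: Optional[Dict[str, float]],
    tolerance: float = 0.05,
) -> Dict[str, object]:
    """Compare lower-is-better metrics against a baseline.

    With no baseline, the current run becomes the baseline and passes.
    The tolerance absorbs measurement noise and is an illustrative default.
    """
    if baseline is None:
        return {"passed": True, "is_baseline": True, "regressions": {}}
    regressions = {
        name: value / baseline[name] - 1.0    # relative slowdown, e.g. 0.2 = 20% slower
        for name, value in current.items()
        if name in baseline
        and baseline[name] > 0
        and value > baseline[name] * (1.0 + tolerance)
    }
    return {"passed": not regressions, "is_baseline": False, "regressions": regressions}
```

The `regressions` mapping is the kind of data a comparison report could hand to the Agent for analysis.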

Generate Final Report

Once the Implementation Code passes both unit tests and benchmark tests, generate a final report containing implementation details, test results, and benchmark test results. The final report is submitted to the human for review. If the human approves the current implementation, the task is complete; otherwise, collect the human's feedback, compile it into a counter-argument, and handle it via a higher-level arbitration Agent. If the counter-argument is accepted, the arbitration Agent can choose to modify the Protocol Spec and restart the entire implementation process. Repeat this process.

Summary

The core of the architecture is layered collaboration, specialized division of labor, and separation of concerns.
A multi-level arbitration mechanism ensures implementation quality and reduces human intervention.
Clear acceptance criteria (passing unit tests, no performance regression) establish a trust mechanism, eliminating the human desire for control.

Some unresolved issues remain:

How to improve the quality of the Protocol Spec to ensure clear module boundaries? For example, add an automatic review step.
How to avoid infinite arbitration loops? For example, set a maximum limit on automatic arbitration attempts.
How to control actual execution time and Token usage within reasonable limits? Measure first, then optimize.
How to guarantee the taste of interface design? For example, incorporate a team style guide.

Some prospects:

The human's role need not be filled by a human; it is essentially a Supervisor role. In the future, could a higher-level AI replace humans for intent alignment and final review? This would further reduce human intervention and improve efficiency.
Can this approach be extended beyond module-level tasks to larger-scale system design and implementation? For example, full-stack development tasks involving frontend, backend, and database? This would significantly enhance the application value of AI in the field of software engineering.

RE:CZ