Coding Agent Second Half: From Individual Efficiency to Organization-Level R&D System

These days, developers who still write every line by hand are close to becoming living "intangible cultural heritage"; the vast majority already use Coding Agents such as Claude Code and Cursor. The direction is right, but the right solution depends on the scenario: a developer installing an AI assistant locally to boost personal efficiency and an organization building an AI-driven R&D collaboration system are problems of entirely different dimensions. The former already has mature products; the latter is just getting started. This article is about the latter.

What Is an Organization-Level Coding Agent, and Who Is Building Them?

From late 2025 to early 2026, something interesting happened: Stripe, Ramp, and Coinbase publicly described their internal Coding Agents almost simultaneously. Stripe calls theirs Minions, Ramp calls theirs Inspect, and Coinbase calls theirs Cloudbot. Although the three companies built them independently, without referencing one another, they converged on almost identical architectures.

This is no coincidence. When you upgrade a Coding Agent from "one person using it in a terminal" to "the whole team triggering it at any time via Slack / GitHub Issue," you will be pushed down the same path by the same set of engineering problems — you need sandbox-isolated execution environments, you need agents to resume prior work after being interrupted, you need to support various entry points like Slack, GitHub, and Feishu, and you need to prevent a single user's runaway loop from burning through the entire company's model quota.

The LangChain team released Open SWE in March 2026, a project that distills the patterns shared by Stripe, Ramp, and Coinbase into an open-source framework. The Open SWE README gets straight to the point:

Elite engineering orgs like Stripe, Ramp, and Coinbase are building their own internal coding agents — Slackbots, CLIs, and web apps that meet engineers where they already work.

"Meet engineers where they already work" — this phrase highlights the core design philosophy of organization-level Coding Agents: instead of asking engineers to learn a new tool, let the agent integrate into the Slack channels, GitHub Issues, and IM chats they are already using, becoming part of the team workflow.

The AgentScope Harness module of AgentScope Java 2.0 follows the same path. This article uses the official example agentscope-examples/agents/agentscope-codingagent as a running thread to explain how a production-grade Coding Agent is assembled with Harness: what problem each piece of configuration solves, and how the agent evolves from a local CLI into an enterprise service running behind a GitHub Webhook.

Let's Clarify the Positioning First

Before going further, we must differentiate "what we are doing" from "local tools like Claude Code / Cursor."

Claude Code optimizes for "me writing code faster by myself" — you type, it works, you watch it work, and you interrupt and correct it at any time. The state is on your local machine, the trigger is yourself, and the trust boundary is that you trust your own machine.

What we are building in this article solves another problem: "I don't even need to look at certain small tasks in the team; I just toss them to the agent, and review the PR once it's done." The trigger could be anyone commenting on an Issue, and the agent runs remotely for fifteen minutes to an hour without anyone watching. An engineer at Stripe tags @Minions in Slack saying "help me fix this bug," and later receives a draft PR — this is what an organization-level Coding Agent should look like.

These two forms overlap in terms of feature sets — both can write code, run commands, and modify files — but their underlying engineering constraints are completely different. An analogy: Claude Code is your own private car, and because you trust the driver (yourself), you don't need protections other than airbags. An organization-level Coding Agent is a taxi fleet vehicle — the passengers (triggers) are not the owners, the driving (execution) happens remotely, so you need dashcams, GPS tracking, mileage limits, emergency braking, and you must ensure one broken car doesn't impact the whole fleet.

Open SWE summarizes this philosophy in one sentence: "Isolate first, then give full permissions inside the boundary." The design of AgentScope Harness is exactly the same.

What About Cloud Agents from Vendors?

In fact, many vendors are also offering SaaS products. For example, GitHub Copilot Coding Agent can already be triggered by assigning it on an Issue, running automatically in the cloud to open a draft PR; Claude Code also has a headless mode, which can be programmatically invoked in CI.

There is no fundamental difference in philosophy — sandbox isolation, asynchronous triggers, and PR-driven outputs — vendors have productized the patterns validated by leading companies into out-of-the-box SaaS services. In contrast, companies like Stripe, Ramp, and Coinbase chose to build in-house, mostly due to the uniqueness of their own engineering systems: deep integration with internal systems, data compliance requirements, and the level of workflow customization led them down the self-built path.

These two paths are not contradictory; which one is more suitable depends on the organization's own constraints and needs. What AgentScope Harness aims to do is to abstract the engineering problems of implementing this system (such as sandboxing, session recovery, multi-channel access, long-term memory, etc.) into composable base capabilities, so that teams choosing to build in-house do not have to start from scratch.

Run in 5 Minutes: Get an Intuitive Feel First

The fastest path to experience it — one environment variable, one Maven command, running an interactive REPL on the local file system. No Docker, no webhook, no GitHub App required.

# 1. Set the model API key (Default: DashScope; OpenAI / Anthropic are also supported)
export DASHSCOPE_API_KEY=sk-...

# 2. Build dependencies in the repository root directory (can be skipped in subsequent runs)
cd agentscope-java
mvn install -pl agentscope-examples/agents/agentscope-codingagent -am -DskipTests -q

# 3. Start the CLI
mvn exec:java -pl agentscope-examples/agents/agentscope-codingagent

After startup, a banner appears, followed by a You> prompt. The agent works in its own workspace, ~/.agentscope/codingagent/workspace/; the usual approach is to have it clone your target repository into that workspace before working on it:

You> write hello.txt with a haiku about Java
You> clone https://github.com/owner/repo into the workspace and tell me what it does
You> review https://github.com/owner/repo/pull/42
You> /exit

Even with zero configuration, you get a complete workspace, session persistence, and long-term memory right out of the box. This is the first level of value provided by AgentScope Harness.

Want to isolate each session into a Docker sandbox? Just one more step:

docker build \
  -t agentscope/coding-sandbox:latest \
  agentscope-examples/agents/agentscope-codingagent/src/main/docker/coding-sandbox/

export SANDBOX_TYPE=docker
mvn exec:java -pl agentscope-examples/agents/agentscope-codingagent

This leads us to the true engineering core of organization-level Coding Agents.

The Real Challenge: From "Running Once" to "Serving a Team 24/7"

Running a demo is quick. The hard part is letting it stably serve an entire team in a production environment, handling dozens of Issues and PRs a day, running each to completion without mixing data, running out of memory, or burning through API quotas.

Stripe, Ramp, and Coinbase each worked through these engineering challenges the hard way, Open SWE abstracted them at the framework level, and AgentScope Harness provides its own solution. The sections below break them down by problem domain.

Sandboxing: Letting Agents `rm -rf` with Peace of Mind

The biggest engineering conflict of a Coding Agent is: you want the model to have real execution capabilities — git clone, npm install, mvn test, arbitrary shell commands — but you cannot let it damage the host machine.

Coinbase solves this problem with its own in-house sandbox infrastructure. Ramp uses Modal's cloud containers. Open SWE provides an abstraction layer supporting multiple backends like Modal, Daytona, Runloop, and more. AgentScope Harness implements the same abstraction — FilesystemSpec is the unified interface, where Docker containers, remote KV stores, and local file systems are all pluggable implementations. Taking the Docker backend as an example:

HarnessAgent agent = HarnessAgent.builder()
    .name("coding")
    .model(model)
    .workspace(workspace)
    .filesystem(new DockerFilesystemSpec()
        .image("agentscope/coding-sandbox:latest")
        .isolationScope(IsolationScope.SESSION))
    .build();

With this single .filesystem(...) call, built-in tools such as read_file, write_file, and execute are automatically routed through the sandbox backend, and the agent code itself needs no changes. IsolationScope.SESSION gives each GitHub Issue, PR, or IM conversation its own sandbox, which is the most natural and secure choice.
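To see why the agent code does not change when the backend does, here is a minimal Java sketch of the pattern. It is not the AgentScope API: SandboxBackend, InMemoryBackend, and BuiltinTools are hypothetical names. The tools depend only on an interface, so a Docker, remote-KV, or local implementation can sit behind it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction standing in for a FilesystemSpec-style backend.
interface SandboxBackend {
    String readFile(String path);
    void writeFile(String path, String content);
}

// Simplest possible backend: files live in a map.
// A Docker or remote-KV backend would implement the same interface.
final class InMemoryBackend implements SandboxBackend {
    private final Map<String, String> files = new HashMap<>();

    @Override
    public String readFile(String path) {
        String content = files.get(path);
        if (content == null) {
            throw new IllegalArgumentException("No such file: " + path);
        }
        return content;
    }

    @Override
    public void writeFile(String path, String content) {
        files.put(path, content);
    }
}

// Built-in tools depend only on the interface, so swapping backends needs no tool changes.
final class BuiltinTools {
    private final SandboxBackend backend;

    BuiltinTools(SandboxBackend backend) {
        this.backend = backend;
    }

    String writeFile(String path, String content) {
        backend.writeFile(path, content);
        return "Wrote " + content.length() + " chars to " + path;
    }

    String readFile(String path) {
        return backend.readFile(path);
    }
}
```

Swapping InMemoryBackend for a container-backed implementation would leave BuiltinTools untouched, which is the property the .filesystem(...) configuration relies on.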

Cross-Invocation Recovery: The Second Round of call() is the True Test

A user leaves a comment on a PR: "add another test." The agent must be able to continue from the environment of the previous round — nobody wants to wait five minutes to git clone + npm install all over again.

This is the problem Open SWE solves with "persistent sandboxes": follow-up messages in the same thread reuse the same sandbox. AgentScope Harness goes a step further. When each call() ends, the sandbox packages the workspace state into a snapshot and saves it, then restores it on demand at the next call, in this order:

Container still exists → reuse it directly (fastest)
Container is gone → spin up a new one using the snapshot and restore the workspace
No snapshot → perform a full initialization (cold start)
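The restore order above amounts to a small decision function. A hypothetical sketch (RestorePath and SandboxRecovery are illustrative names, not Harness classes):

```java
// Hypothetical sketch of the three-tier restore order described above.
enum RestorePath { REUSE_CONTAINER, RESTORE_FROM_SNAPSHOT, COLD_START }

final class SandboxRecovery {
    private SandboxRecovery() {}

    /**
     * Decide how to bring up the sandbox for the next call().
     *
     * @param containerAlive  whether the container from the previous call is still running
     * @param snapshotExists  whether a snapshot was saved when the previous call() ended
     */
    static RestorePath choose(boolean containerAlive, boolean snapshotExists) {
        if (containerAlive) {
            return RestorePath.REUSE_CONTAINER;       // fastest: nothing to rebuild
        }
        if (snapshotExists) {
            return RestorePath.RESTORE_FROM_SNAPSHOT; // new container + unpacked workspace
        }
        return RestorePath.COLD_START;                // full clone + dependency install
    }
}
```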

Pluggable snapshot backends include LocalSnapshotSpec (single local machine), OssSnapshotSpec (S3-compatible, for multi-replica scenarios), and RedisSnapshotSpec (low-latency, for small workspaces). Adding a line of configuration in production is all it takes:

.filesystem(new DockerFilesystemSpec()
    .image("agentscope/coding-sandbox:latest")
    .snapshotSpec(new OssSnapshotSpec(ossClient, "my-bucket", "agentscope/")))
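What goes into a snapshot is up to the backend. As a minimal sketch, assuming the workspace can be modeled as a map from relative path to file content, a zip archive works as a snapshot format. WorkspaceSnapshot is a hypothetical helper, not part of AgentScope; a real implementation would stream files from disk and hand the bytes to local disk, OSS, or Redis.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: pack a workspace (path -> content) into a zip snapshot and restore it.
final class WorkspaceSnapshot {
    private WorkspaceSnapshot() {}

    // Called when call() ends: serialize every workspace file into one byte array.
    static byte[] pack(Map<String, String> workspace) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> file : workspace.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // zip is closed (and finished) at this point
    }

    // Called when a fresh container needs the previous workspace back.
    static Map<String, String> restore(byte[] snapshot) {
        Map<String, String> workspace = new TreeMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(snapshot))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // readAllBytes reads only the current entry's data
                workspace.put(entry.getName(), new String(zip.readAllBytes(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return workspace;
    }
}
```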

Long Session Memory: The Context Window Is Not Infinite

A long-running Issue can accumulate dozens of conversation rounds, a single git diff can output tens of thousands of characters, and mvn test logs can run to dozens of KB. The model's context window fills up quickly.

AgentScope Harness addresses this with four independent, composable mechanisms.

Dialogue Summary Compression triggers automatically when the message count grows too large: it keeps the most recent messages verbatim and compresses earlier ones into a summary.

Large Tool Result Eviction writes any tool output longer than 80K characters to a workspace file and keeps only about 2K characters from its beginning and end in the context, plus a hint with the read_file path, so the agent can read the full content again if it needs to.

Argument Truncation shortens oversized write_file arguments in the history, since that content has already been written to a file and is not needed in later turns.

Overflow Fallback performs emergency compression and retries the call when the model returns a context_length_exceeded error.
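The eviction rule can be sketched in a few lines of Java. This is an illustration under stated assumptions, not Harness code: the 80K threshold comes from the description above, while splitting the roughly 2K preview evenly between head and tail, the tool_results/ path layout, and the WorkspaceWriter interface are all hypothetical.

```java
// Hypothetical sink for evicted outputs; a real agent would write into the workspace filesystem.
interface WorkspaceWriter {
    void write(String path, String content);
}

final class ToolResultEviction {
    static final int EVICT_THRESHOLD = 80_000; // chars, per the description above
    static final int PREVIEW_CHARS = 2_000;    // assumption: split evenly between head and tail

    private ToolResultEviction() {}

    /** Returns the text that goes into the model context for one tool result. */
    static String evictIfLarge(String toolCallId, String output, WorkspaceWriter writer) {
        if (output.length() <= EVICT_THRESHOLD) {
            return output; // small enough: keep verbatim
        }
        String path = "tool_results/" + toolCallId + ".txt"; // hypothetical layout
        writer.write(path, output);                          // full output stays recoverable
        int half = PREVIEW_CHARS / 2;
        String head = output.substring(0, half);
        String tail = output.substring(output.length() - half);
        return head
                + "\n... [" + (output.length() - PREVIEW_CHARS) + " chars omitted; full output saved to "
                + path + ", use read_file to view it] ...\n"
                + tail;
    }
}
```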

HarnessAgent.builder()
    .compaction(CompactionConfig.builder()
        .triggerMessages(50)
        .keepMessages(20)
        .truncateArgs(CompactionConfig.TruncateArgsConfig.builder()
            .maxArgLength(2000).build())
        .build())
    .toolResultEviction(ToolResultEvictionConfig.defaults())
    .build();

In practice these settings are not optional. Coding Agents inevitably run long sessions and produce massive diffs; without compaction and tool result eviction enabled, you will hit the context limit sooner or later.
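The summary-compression step can be sketched with the numbers from the configuration above (triggerMessages 50, keepMessages 20). This is a hypothetical illustration, not the Harness implementation: whether compaction fires at exactly 50 messages or only above 50 is an assumption, and the summarizer, which in practice would be a model call, is passed in as a function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class Compaction {
    private Compaction() {}

    /**
     * Once history reaches triggerMessages, summarize everything except the last
     * keepMessages entries and put the summary at the front.
     */
    static List<String> compact(List<String> history, int triggerMessages, int keepMessages,
                                Function<List<String>, String> summarizer) {
        if (keepMessages >= triggerMessages) {
            throw new IllegalArgumentException("keepMessages must be smaller than triggerMessages");
        }
        if (history.size() < triggerMessages) {
            return history; // below the trigger: nothing to do
        }
        int split = history.size() - keepMessages;
        List<String> older = history.subList(0, split);
        List<String> compacted = new ArrayList<>();
        compacted.add("[summary] " + summarizer.apply(older));   // older turns collapse into one entry
        compacted.addAll(history.subList(split, history.size())); // recent turns stay verbatim
        return compacted;
    }
}
```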

Meanwhile, long-term facts are periodically consolidated from the daily conversation logs into MEMORY.md. After a Coding Agent has run for a while, MEMORY.md might contain entries like:

- The test command for the repository `owner/repo` is `mvn -pl module test`. Avoid using `mvn test` in the root directory because it is too slow.
- The `main` branch is protected and changes must be merged via PRs; the naming convention for feature branches is `feat/`.
- GitHub Actions is used for CI, and the configuration file is located at `.github/workflows/ci.yml`.

The agent learns the team's rules on its own and won't need to ask again next time. All conversations sharing the same workspace will benefit.
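How the facts are extracted is not specified here; in practice it would be a model-driven summarization step. The sketch below covers only the final merge, assuming each fact is one bullet line in MEMORY.md and that exact-text duplicates should be dropped. MemoryMerge is a hypothetical name.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class MemoryMerge {
    private MemoryMerge() {}

    /** Merge new fact lines into MEMORY.md content, keeping order and dropping duplicates. */
    static String merge(String memoryMd, Iterable<String> newFacts) {
        Set<String> facts = new LinkedHashSet<>(); // insertion order = order facts were learned
        for (String line : memoryMd.split("\n")) {
            String fact = normalize(line);
            if (!fact.isEmpty()) facts.add(fact);
        }
        for (String candidate : newFacts) {
            String fact = normalize(candidate);
            if (!fact.isEmpty()) facts.add(fact); // already-known facts are ignored
        }
        StringBuilder out = new StringBuilder();
        for (String fact : facts) out.append("- ").append(fact).append('\n');
        return out.toString();
    }

    // Strip a leading "- " bullet and surrounding whitespace so "- x" and "x" compare equal.
    private static String normalize(String line) {
        String s = line.strip();
        return s.startsWith("- ") ? s.substring(2).strip() : s;
    }
}
```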

Session Persistence: Conversations Must Not Break If Nodes Crash

An organization-level Coding Agent is a long-lived application. An Issue session might span from morning to night, during which services might undergo rolling updates, scaling, or replica switchovers — but users should perceive that "conversations never drop."

By default, AgentScope Harness stores the state in local files, which is sufficient for development. Switch to Redis for multi-replica production with a single line of configuration:

HarnessAgent.builder()
    .stateStore(RedisAgentStateStore.builder().lettuceClient(redisClient).build())
    .build();

After switching to Redis, a session fails over to another node if its node crashes; during rolling updates, old pods save state automatically and new pods restore it. You can even start a conversation on a GitHub Issue and continue it in DingTalk: as long as the sessionId stays the same, the memory is preserved.

Organization-Specific Engineering Issues

The sandboxing, recovery, memory, and persistence discussed above constitute the infrastructure for letting a Coding Agent "run reliably in production." However, there are unique problems to solve in organization-level scenarios.

Multi-Channel Access: One Agent Handles All Entrances

Stripe's Minions use Slack, Coinbase's Cloudbot also uses Slack, and Open SWE integrates Slack, Linear, and GitHub simultaneously. A consensus for organization-level Coding Agents is: don't make users switch to a new interface to find the agent; let the agent appear where users already work.

The Coding Agent example adds a Channel Adapter layer on top of AgentScope Harness that maps events from every entry point to a uniform (threadId, message) pair:

github:issue:owner/repo#42   → SHA-256 → UUID → coding agent thread
dingtalk:<appKey>:<staffId>  → SHA-256 → UUID → coding agent thread
feishu:<tenantKey>:<chatId>  → SHA-256 → UUID → coding agent thread

This deterministic mapping guarantees that all comments under the same Issue route to the exact same agent session — with conversation history automatically restored, without any manual effort from the user.
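A minimal sketch of the mapping, assuming the UUID is built from the first 16 bytes of the SHA-256 digest. The article does not say how the 32-byte digest is folded into a UUID, so stamping RFC 9562 version-8 (custom) and variant bits here is an assumption, and ThreadIds is a hypothetical class.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

final class ThreadIds {
    private ThreadIds() {}

    /** Deterministically map a channel key such as "github:issue:owner/repo#42" to a thread UUID. */
    static UUID threadIdFor(String channelKey) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(channelKey.getBytes(StandardCharsets.UTF_8));
            hash[6] = (byte) ((hash[6] & 0x0f) | 0x80); // version 8 (custom) in the high nibble
            hash[8] = (byte) ((hash[8] & 0x3f) | 0x80); // RFC variant bits "10"
            ByteBuffer first16 = ByteBuffer.wrap(hash, 0, 16); // keep the first 16 of 32 bytes
            return new UUID(first16.getLong(), first16.getLong());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e); // required on every JVM
        }
    }

    /** Build the GitHub Issue key in the format shown above. */
    static String githubIssueKey(String owner, String repo, int number) {
        return "github:issue:" + owner + "/" + repo + "#" + number;
    }
}
```

Because the function is pure, every replica computes the same thread ID for github:issue:owner/repo#42 without any shared lookup table.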

Multi-Tenant Isolation: Who Must Not Conflict with Whom

Personal tools do not need to consider this problem — with only one user, all states are naturally isolated. Organization-level services, however, are multi-tenant from day one: dozens of Issues, PRs, and IM conversations are running at the same time, each with its own repository, dependency directory, search/chat history, and long-term memory, and they must never interfere with one another.

AgentScope Harness controls the level of isolation using IsolationScope. SESSION (default) isolates a sandbox for each sessionId — meaning for Coding Agents, each Issue / PR / IM conversation runs on its own, which is the most natural and secure. USER allows multiple conversations from the same user to share the same repository clone, suitable for "personal workbench" scenarios. Isolation isn't just at the sandbox level — session state, memory, and sub-agent tasks are all isolated with the same granularity, saving developers from worrying about it.
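The effect of the two scopes can be shown as a key-derivation function: everything partitioned by the returned key (sandbox, session state, memory) is isolated at that granularity. Scope and IsolationKeys are hypothetical names used for illustration, not the Harness IsolationScope implementation.

```java
// Hypothetical sketch: the scope decides which key resources are partitioned by.
enum Scope { SESSION, USER }

final class IsolationKeys {
    private IsolationKeys() {}

    static String sandboxKey(Scope scope, String userId, String sessionId) {
        switch (scope) {
            case SESSION:
                return "session:" + sessionId; // one sandbox per Issue / PR / IM conversation
            case USER:
                return "user:" + userId;       // all of a user's conversations share one clone
            default:
                throw new IllegalArgumentException("Unknown scope: " + scope);
        }
    }
}
```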

Workspace: Personalities, Memories, and Skills Are All Files

AgentScope Harness organizes everything that needs to be preserved across calls and restarts into a single directory — the workspace. In the industry, this kind of design is now referred to as "Context Engineering." Interestingly, almost all mainstream Coding Agents have independently arrived at the same pattern: Claude Code has CLAUDE.md, GitHub Copilot has .github/copilot-instructions.md, and Open SWE has AGENTS.md — repository-level conventions should not be hard-coded in system prompts; they should be files that can be versioned, code-reviewed, and updated independently.

~/.agentscope/codingagent/workspace/
├── AGENTS.md            ← Persona + behavioral guidelines
├── MEMORY.md            ← Long-term memory
├── skills/              ← Reusable skills (SOPs like commit conventions, testing guidelines, etc.)
├── subagents/           ← Sub-agent declarations
├── knowledge/           ← Domain knowledge (API docs, coding standards)
└── plans/               ← Plan files for Plan Mode

This design brings three engineering benefits:

Team conventions take effect as files. Want all PRs to follow commit message standards? Write a skill and place it in skills/commit-style/SKILL.md. All agent instances will apply it in their next call(), with no need for restarts or code modifications.

The agent understands the team better the more it is used. The first time, it asks "which testing framework do we use?" and you answer "JUnit 5 + Mockito." On the next call it remembers, and every conversation sharing the same workspace benefits.

Manage the workspace with Git. AGENTS.md + skills/ + subagents/ + knowledge/ serve as the agent's "configuration repository": managed with Git, validated by CI, and hydrated into every replica at deployment time. What should change frequently is the content of the workspace, not the Java code.
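The first benefit, conventions taking effect on the next call() without a restart, follows naturally if the skills directory is re-scanned at the start of every call. A hypothetical sketch assuming the skills/<name>/SKILL.md layout shown in the tree above; SkillLoader is not an AgentScope class.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical sketch: re-scan workspace/skills/*/SKILL.md at the start of every call(),
// so a newly committed skill takes effect without a restart.
final class SkillLoader {
    private SkillLoader() {}

    static Map<String, String> loadSkills(Path workspace) {
        Map<String, String> skills = new TreeMap<>(); // skill name -> SKILL.md content
        Path skillsDir = workspace.resolve("skills");
        if (!Files.isDirectory(skillsDir)) {
            return skills; // no skills yet
        }
        try (Stream<Path> dirs = Files.list(skillsDir)) {
            for (Path dir : (Iterable<Path>) dirs::iterator) {
                Path skillFile = dir.resolve("SKILL.md");
                if (Files.isRegularFile(skillFile)) {
                    skills.put(dir.getFileName().toString(),
                            Files.readString(skillFile, StandardCharsets.UTF_8));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return skills;
    }
}
```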

Sub-agents: Delegating Independent Tasks

Open SWE uses the task tool from Deep Agents for sub-agent dispatching, Stripe's Minions use Blueprints for orchestration, and Ramp's Inspect uses Sessions + Child Sessions. AgentScope Harness also supports sub-agents, and the usage is lightweight — just write a markdown file in the workspace:

# workspace/subagents/researcher.md
---
description: Research sub-agent. Used when you need to understand an external repository or documentation before making modifications.
workspace:
  mode: isolated
tools: [read_file, grep_files, fetch_url, web_search]
---

You are a research assistant. Use fetch_url / web_search to gather materials, and read_file / grep_files to inspect the code. Provide the main agent with a briefing containing key points and citations.

The main agent invokes agent_spawn agent_id="researcher" task="investigate key upgrades for ABC library v2", and the sub-agent runs in an isolated context and returns its result to the main agent. Calling it in the background with timeout_seconds=0 keeps the main agent from blocking; once the sub-agent finishes, the framework automatically injects its result into the next round of inference.
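The non-blocking spawn-then-inject flow can be sketched with a CompletableFuture and a queue. This is a hypothetical illustration: BackgroundSpawner, spawn, and drainForNextRound are invented names, and the sub-agent is modeled as a plain function from task to result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class BackgroundSpawner {
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final ConcurrentLinkedQueue<String> pendingResults = new ConcurrentLinkedQueue<>();

    // Start a sub-agent without blocking the caller; its result is queued when it finishes.
    CompletableFuture<Void> spawn(String agentId, String task, Function<String, String> subAgent) {
        return CompletableFuture
                .supplyAsync(() -> subAgent.apply(task), pool)
                .thenAccept(result -> pendingResults.add("[" + agentId + "] " + result));
    }

    // Called at the start of each inference round: drain finished results into the context.
    List<String> drainForNextRound() {
        List<String> injected = new ArrayList<>();
        String r;
        while ((r = pendingResults.poll()) != null) {
            injected.add(r);
        }
        return injected;
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

The key design choice is that finished results are not pushed into the main agent mid-turn; they wait in the queue until the next inference round drains them.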

Plan Mode: Think Clearly Before Making Big Changes

Letting a Coding Agent dive straight into a high-risk task such as "refactor the entire authentication module" is dangerous: it may start editing before it has thought the change through and break a whole area of the code. AgentScope Harness's Plan Mode formalizes this into a "think first → write a plan → human confirms → then execute" workflow. Once Plan Mode is on, the agent enters a read-only phase in which it may call only read tools and four allow-listed plan tools, and leaving the plan phase requires human confirmation.

This is similar to Coinbase Cloudbot's "Agent Councils" concept — introducing human approval steps prior to high-risk operations, relying on process constraints rather than "praying that the model doesn't make mistakes."
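A Plan Mode gate reduces to an allow-list check in front of every tool call plus a flag that only human approval can clear. In this hypothetical sketch the plan tool name write_plan is a placeholder, since the four plan tools are not named here; PlanModeGate is not an AgentScope class.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a Plan Mode gate: while planning, only allow-listed tools may run,
// and only an explicit human approval switches the agent back to normal execution.
final class PlanModeGate {
    private final Set<String> allowedWhilePlanning;
    private boolean planning = true;

    PlanModeGate(Set<String> readTools, Set<String> planTools) {
        Set<String> allowed = new HashSet<>(readTools);
        allowed.addAll(planTools);
        this.allowedWhilePlanning = Set.copyOf(allowed);
    }

    /** Checked before every tool invocation. */
    boolean isAllowed(String toolName) {
        return !planning || allowedWhilePlanning.contains(toolName);
    }

    /** Only a human confirmation ends the read-only phase. */
    void approvePlan(boolean humanConfirmed) {
        if (humanConfirmed) {
            planning = false;
        }
    }

    boolean isPlanning() {
        return planning;
    }
}
```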

Curation of Tools and Deterministic Backstops

When Stripe publicly shared lessons from Minions, it noted that the agent had about 500 tools available but emphasized that "tool curation matters more than tool quantity": more tools are not always better, and curating and maintaining them matters more than piling up numbers. Open SWE adopts the same philosophy and exposes only about 15 core tools. Harness takes a similar approach, limiting the built-in toolset to file operations, shell execution, and memory retrieval, and registering business-specific tools as needed via toolkit.register(...).

Another industry consensus is: you cannot simply rely on prompts to tell the model "remember to run tests"; critical steps must be guaranteed by deterministic logic. GitHub Copilot Coding Agent relies on the repository's existing CI pipeline for validation; Open SWE has an open_pr_if_needed middleware as a safety net — if the agent forgets to open a PR, the middleware does it automatically. Harness's middleware mechanism (MessageQueueHook, ThreadBudgetHook, etc.) follows the same mindset: clearly draw the line on what is left to the model's decision and what is guaranteed by deterministic code.
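A hypothetical sketch of an open_pr_if_needed-style backstop, run after every agent turn: if the agent committed work but did not open a PR, deterministic code opens a draft PR. OpenPrBackstop and the openDraftPr hook are illustrative names, not Open SWE or Harness APIs.

```java
import java.util.Optional;
import java.util.function.Function;

final class OpenPrBackstop {
    private OpenPrBackstop() {}

    /**
     * @param branch        the working branch the agent committed to
     * @param commitsAhead  commits on the branch that are not on the base branch
     * @param existingPrUrl the PR the agent already opened, if any
     * @param openDraftPr   hypothetical hook that opens a draft PR for a branch and returns its URL
     */
    static Optional<String> ensureDraftPr(String branch, int commitsAhead, Optional<String> existingPrUrl,
                                          Function<String, String> openDraftPr) {
        if (existingPrUrl.isPresent()) {
            return existingPrUrl;                       // the model already did it
        }
        if (commitsAhead == 0) {
            return Optional.empty();                    // nothing to propose
        }
        return Optional.of(openDraftPr.apply(branch));  // deterministic safety net
    }
}
```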

Another point worth mentioning: Draft PR as an Output Contract. Whether it is Copilot Coding Agent, Open SWE, or Stripe Minions, the final output of the agent is a draft PR, which always requires human review before merging. The agent does not directly modify production code — this is a basic safety premise of organization-level Coding Agents.

From Single Machine to Enterprise: An Evolutionary Path

AgentScope Harness lets you start with the most basic form and upgrade on demand: the same agent code gains new capabilities as its configuration is upgraded.

Stage 1: Local CLI. Zero configuration. execute runs commands on the host via sh -c, and state is stored in local files. Use it only on a trusted local machine: at this stage it is a supercharged local Coding Agent with built-in memory and skill loading.

Stage 2: Local + Docker Sandboxing. Add a single .filesystem(new DockerFilesystemSpec()...) call to route all execution into containers. This stage is designed for the GitHub Webhook mode: each Issue/PR gets an ephemeral container, so commands triggered from outside never run directly on the host.

Stage 3: Multi-Replica + Distributed. Switch stateStore to Redis, store sandbox snapshots in OSS, and add executionGuard for concurrency control. At this step, the Coding Agent scales horizontally — running N replicas behind a load balancer, with any replica capable of handling any conversation from any user.

.filesystem(new DockerFilesystemSpec()
    .image("agentscope/coding-sandbox:latest")
    .isolationScope(IsolationScope.USER)
    .snapshotSpec(new OssSnapshotSpec(ossClient, "bucket", "prefix/"))
    .executionGuard(RedisSandboxExecutionGuard.builder(jedis)
        .leaseTtl(Duration.ofMinutes(30)).build()))
.stateStore(RedisAgentStateStore.builder().lettuceClient(redisClient).build())
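The source describes executionGuard as concurrency control. One common way to implement that is a lease per sandbox key that expires after a TTL (30 minutes in the configuration above), so two replicas never drive the same sandbox at once and a crashed replica cannot hold it forever. LeaseGuard below is a hypothetical in-memory stand-in; a Redis-backed guard would typically acquire with an atomic SET key value NX PX <ttl> and release with a compare-and-delete script.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical in-memory lease guard: at most one owner per sandbox key until the lease expires.
final class LeaseGuard {
    private static final class Lease {
        final String owner;
        final Instant expiresAt;

        Lease(String owner, Instant expiresAt) {
            this.owner = owner;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Lease> leases = new HashMap<>();
    private final Duration ttl;
    private final Supplier<Instant> now; // injectable clock, handy for tests

    LeaseGuard(Duration ttl, Supplier<Instant> now) {
        this.ttl = ttl;
        this.now = now;
    }

    /** Succeeds if the key is free, the old lease has expired, or we already own it (renewal). */
    synchronized boolean tryAcquire(String sandboxKey, String owner) {
        Instant t = now.get();
        Lease current = leases.get(sandboxKey);
        if (current != null && current.expiresAt.isAfter(t) && !current.owner.equals(owner)) {
            return false; // another replica holds a live lease
        }
        leases.put(sandboxKey, new Lease(owner, t.plus(ttl)));
        return true;
    }

    /** Releases only if the caller still owns the lease. */
    synchronized void release(String sandboxKey, String owner) {
        Lease current = leases.get(sandboxKey);
        if (current != null && current.owner.equals(owner)) {
            leases.remove(sandboxKey);
        }
    }
}
```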

Stage 4: Observability and Rate-Limiting. Spring Boot Actuator exposes health probes and Prometheus metrics, ThreadBudgetHook and ModelCallLimitHook guard the model budget, and FallbackModel handles upstream rate limiting. Together these provide the solid foundation a Coding Agent needs to run stably after deployment.
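As a minimal sketch of the last two ideas, assuming a model can be treated as a function from prompt to completion: a per-thread cap on model calls (in the spirit of ThreadBudgetHook and ModelCallLimitHook) and a fallback model used when the primary is rate-limited (in the spirit of FallbackModel). GuardedModel and RateLimitedException are hypothetical names, not the Harness classes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class GuardedModel {
    /** Hypothetical signal that the upstream provider throttled the request. */
    static final class RateLimitedException extends RuntimeException {
        RateLimitedException(String message) {
            super(message);
        }
    }

    private final Function<String, String> primary;
    private final Function<String, String> fallback;
    private final int maxCallsPerThread;
    private final ConcurrentHashMap<String, Integer> callsByThread = new ConcurrentHashMap<>();

    GuardedModel(Function<String, String> primary, Function<String, String> fallback, int maxCallsPerThread) {
        this.primary = primary;
        this.fallback = fallback;
        this.maxCallsPerThread = maxCallsPerThread;
    }

    String call(String threadId, String prompt) {
        // Atomic per-thread counter; a fallback retry counts as part of the same call.
        int used = callsByThread.merge(threadId, 1, Integer::sum);
        if (used > maxCallsPerThread) {
            throw new IllegalStateException("Model call budget exhausted for thread " + threadId);
        }
        try {
            return primary.apply(prompt);
        } catch (RateLimitedException e) {
            return fallback.apply(prompt); // upstream throttled us: degrade instead of failing
        }
    }
}
```

Counting per thread means one runaway Issue exhausts only its own budget instead of the organization's shared quota.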

Summary

To recap the projects mentioned in this post — Stripe Minions, Ramp Inspect, Coinbase Cloudbot, LangChain Open SWE, GitHub Copilot Coding Agent, Claude Code, and AgentScope Harness — although they differ in programming languages, ecosystems, and deployment models, they are highly aligned on core architectural decisions: per-session isolated sandboxes, deterministic thread ID routing, middleware interception chains, agent runtime message queue injection, repository-level instruction files, and draft PRs as output contracts.

The first half of the Coding Agent evolutionary path was about personal efficiency — smarter models, better autocomplete, and smoother local tools. The battleground for the second half has shifted to engineering: how to transform "running a demo once" into "stably serving an entire team 24/7." From Stripe to GitHub, and from LangChain to AgentScope, everyone is marching towards the same architecture from different starting points. This convergence itself is the best guidepost.

The Coding Agent example discussed here is complete and readable, but it is still far from a production-ready product. We recommend cloning it and running it once before diving into the source code: it maps each engineering problem discussed in this article directly onto real code.

We highly recommend checking the official AgentScope 2.0 documentation: https://github.com/agentscope-ai/agentscope-java

Alibaba Cloud Native Community

You may also like

Comments

Alibaba Cloud Native Community

Related Products

Alibaba Cloud Model Studio

Qwen

CloudMonitor

AI Acceleration Solution

Community

Coding Agent Second Half: From Individual Efficiency to Organization-Level R&D System

What Is an Organization-Level Coding Agent, and Who Is Building Them?

Let's Clarify the Positioning First

What About Cloud Agents from Vendors?

Run in 5 Minutes: Get an Intuitive Feel First

The Real Challenge: From "Running Once" to "Serving a Team 24/7"

Sandboxing: Letting Agents rm -rf with Peace of Mind

Cross-Invocation Recovery: The Second Round of call() is the True Test

Long Session Memory: The Context Window Is Not Infinite

Session Persistence: Conversations Must Not Break If Nodes Crash

Organization-Specific Engineering Issues

Multi-Channel Access: One Agent Handles All Entrances

Multi-Tenant Isolation: Who Must Not Conflict with Whom

Workspace: Personalities, Memories, and Skills Are All Files

Sub-agents: Delegating Independent Tasks

Plan Mode: Think Clearly Before Making Big Changes

Curation of Tools and Deterministic Backstops

From Single Machine to Enterprise: An Evolutionary Path

Summary

Read previous post:

Read next post:

Alibaba Cloud Native Community

You may also like

Comments

Alibaba Cloud Native Community

Related Products

Alibaba Cloud Model Studio

Qwen

CloudMonitor

AI Acceleration Solution

Sandboxing: Letting Agents `rm -rf` with Peace of Mind