Managing Agents or Seeing Vibe Fairies?
A zero-code experiment in distinguishing AI hallucinations from hard truths
I take pride in zero-hype, zero-BS posts. Yet this article is about trying to build production-level B2B software, with AI, solo and without coding a single line.
Everything about this is uncomfortable, and in many ways that is the point.
Transformation is Uncertainty and Discomfort
I've spent twenty years in transformation, the last two running accelerators coaching teams on the difference between incremental change and fundamental reframes. Here's the pattern I keep seeing: genuine transformation is uncomfortable, uncertain, and usually preceded by the phrase "this might not work."
Convergent, or iterative, change is where organisations spend most of their time. It is comfortable. You change things without questioning anything. Move reports from spreadsheets to dashboards. Automate processes to do the same work faster. It's organisational busywork dressed up as progress.
But here’s the issue we don’t discuss. Convergent change is exhausting precisely because it is comfortable. You do fundamentally the same work, just more of it. There is little sense of progress and you’re working toward a finish line that keeps moving as efficiency gains create new expectations.
Most LLM use cases are textbook examples of this:
Customer Support AI: Makes problem-solving cheaper but doesn't question why problems exist in the first place.
Content Generation: Produces more content faster without reimagining what marketing relationships could be. We drown people in personalised noise.
Code Copilots: Accelerate coding for developers without really changing how software is conceived, designed, built.
The pattern is clear: we're using transformative technology for convergent ends. LLMs have made so many more problems solvable, yet we keep working on the same ones.
Falling for the Convergent Trap
I started Datent when I became a Dad and needed a break from six years of running transformation. The goal was to shake things up. To have an accelerator that taught teams to run transformation themselves and break the dependency many organisations have on external consultancies for change.
It's gone well. We've had outstanding feedback since day one, won repeat clients, and worked with the cohorts to define and iterate common transformation patterns that are now enabling us to automate lots of strategy work and make transformation easier.
But unless something changes we’ve fallen for the trap we coach others against. Strategy automation has transformative potential but also a huge risk of just doing the same work faster. Now that my second child has started nursery, I have the energy and appetite for genuine transformation again (and the headspace to be back on Substack).
So I set myself a test that would force discontinuous change: Could I build production B2B software without writing a single line of code? Not as a technical exercise, but to both test the limits of working with LLMs and challenge one of my deepest limiting beliefs - that I’m not technical enough to do this.
And really, I shouldn't be. My coding experience is limited to VBA macros, a single Python course and basic (embarrassingly basic for a former CDO) SQL.
The Test: Constraints Forcing New Approaches
The parameters were deliberately absurd:
0% human code (if I need to rework 1 line, the experiment fails)
Production-ready, not a prototype
B2B compliant (auth, persistent secure data, path to SOC2/ISO 27001)
Built solo, no technical team
This isn't about proving AI can help developers code faster. We know it can. This is about building things I don't know how to build.
This isn't about building blind either. Like many people I’m convinced LLMs have a lot more potential than is currently being realised. There is a clear product vision.
And the transformation experiment isn't limited to reducing software development costs. I'm convinced (and the experiments are starting to show) that if this is possible, it also opens up opportunities for new kinds of software.
Experiment 0: The Non-Experiment Experiments
First off, I should mention that before starting the experiment I investigated low-code tools and auto-build tooling. This could be an article in itself, so briefly, the conclusions were:
Low-code tooling (Softr, Bubble, Retool etc.) fails from day zero: because it is config-heavy, you can't realise the benefits of AI ways of working, and it is fundamentally limited, i.e. you hit capability constraints fast.
Auto-build tooling (Replit, Base44 etc.) is an absolute marvel and I highly recommend a trial if you haven't used it. But whilst these tools are great for prototypes, software needs ongoing development, and that is very hard when the original build process is a black box.
Lastly, at the outset I ruled out building one function or module at a time, as it felt both too slow and too close to traditional software development to be a valid transformation test.
Experiment 1: The Prototype
Wanting to dip my toe in, the first test was to see how quickly it would be possible to build a secure, functioning multi-LLM chat. Nothing revolutionary, but enough complexity to stress-test the concept of building with AI.
After three days of planning, making tech decisions and researching how to build with AI, I had narrowed down to working with Claude Code and had a set of tickets for the agent to build out a codebase ready to deploy on Vercel and Railway.
1.5 hours later I had a multi-LLM chat interface behind an auth layer, complete with bonus conversation search functionality. Search wasn't in scope, but I'd designed tickets that gave the agent context on why we were building this, including a list of frustrations users have with current LLM chat apps (like no search), and the agent decided to deliver it whether it was in scope or not.
A review of the architecture documents by other LLMs guessed that, with complexities like 'atomic Lua scripts' for cost control, this was 2-3 months' work.
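(For readers who haven't met the pattern: an 'atomic Lua script' usually means a small script run inside Redis so that a budget check and the matching spend update happen as one indivisible step. The original architecture isn't reproduced here, so the snippet below is only a hypothetical sketch of the idea using ioredis; the key names and budget logic are my own illustration, not the agent's actual code.)

```typescript
import Redis from "ioredis";

const redis = new Redis(); // assumes a local Redis instance

// Budget check and spend update happen in one atomic step inside Redis,
// so two concurrent requests can't both slip under the cap.
// A real version would also handle resetting the budget period (e.g. a TTL).
const BUDGET_SCRIPT = `
local spent = tonumber(redis.call('GET', KEYS[1]) or '0')
if spent + tonumber(ARGV[1]) > tonumber(ARGV[2]) then
  return 0
end
redis.call('INCRBYFLOAT', KEYS[1], ARGV[1])
return 1
`;

// Returns 1 if this call fits within the user's budget (and records the spend), 0 if not.
async function allowSpend(userId: string, costUsd: number, capUsd: number): Promise<number> {
  return (await redis.eval(BUDGET_SCRIPT, 1, `llm-budget:${userId}`, costUsd, capUsd)) as number;
}
```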
Just one problem: whilst the GUI was lovely, half the features didn't work. What took 1.5 hours to "build" took 15 hours to debug as a result of missing components and APIs that weren't actually wired up.
Conclusion: LLMs can build high-spec prototypes. But speed without structure creates technical debt, not value.
This wasn’t a failure. It increased my confidence that building with AI is possible, it just needed more planning and control.
Experiment 2: The Infrastructure Lesson
Full of confidence from the first experiment, I decided to jump from building prototypes straight to building production-level apps on owned cloud infrastructure.
This was a huge jump, but perhaps LLMs could develop the code needed to set everything up through infrastructure-as-code approaches?
Lots more research went into this phase, and, as any good product manager would, I based my tech decisions this time on the team's preferences. Whilst experiment 1 was done in Python, research showed LLMs have far lower error rates in TypeScript.
Within two hours, infrastructure-as-code had set up a complete GCP project. I was blown away. Two days later I'd yet to deploy the most basic functionality, as every attempt to get one part of the app to speak to another resulted in another round of debugging permission errors.
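To give a sense of what 'one part of the app speaking to another' involves: every service needs its own identity plus an explicit IAM grant for each resource it touches. The experiment's actual infrastructure code isn't shown here, so the sketch below is purely illustrative, written with Pulumi's TypeScript SDK (my choice of tool, not necessarily what the agent produced), and the resource names are hypothetical.

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";

// A dedicated identity for the app's backend service.
const appSa = new gcp.serviceaccount.Account("app-sa", {
  accountId: "app-backend",
  displayName: "App backend service account",
});

// A bucket the backend needs to read and write.
const dataBucket = new gcp.storage.Bucket("app-data", {
  location: "EU",
  uniformBucketLevelAccess: true,
});

// The grant that actually lets the two talk to each other. Miss one of these
// per service/resource pair and you get the permission errors described above.
new gcp.storage.BucketIAMMember("app-sa-object-admin", {
  bucket: dataBucket.name,
  role: "roles/storage.objectAdmin",
  member: pulumi.interpolate`serviceAccount:${appSa.email}`,
});
```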
With timelines running out, I pulled the plug on GCP around one Thursday lunchtime.
Conclusion: Complex cloud infrastructure is still out of reach for a solo developer working with AI.
Experiment 3: Emerging Way of Working
AI takes the emotion out of pivoting. There are no hard feelings about ditching work you've put time and effort into. By Thursday afternoon we'd landed on and set up a new platform: Render offered both simpler infrastructure and SOC 2 and ISO 27001 compliance.
By early Friday afternoon, the whole app was up and working. This time it took three hours of development and 1.5 hours of debugging: 4.5 hours in total, compared to 16.5 in experiment 1.
We finished early enough to look at the backlog from sprint 2 and run a test to see if we could get Google Workspace integration working (loading files from Google Drive, saving to Google Drive etc.). It took 30 minutes!
There were a lot of differences in approach between experiments 1 and 3, and I'll go into the details in future articles, but mostly it comes down to two things:
Agent Specialisation
In experiment 1 I worked within one LLM window to design the whole app and the tickets to build it. In experiment 3 I was moving between four windows (architect, security, front end and back end). This led to more focus within each specific context, and to a very bizarre incident.
In the planning phase, when we were trying to align on the tech stack, the back-end window told me: “these are typical requests from a front end team. We’ve not had to work with them before so I propose we continue with our existing tech decisions.”
These windows had never worked together, and no code existed yet. The agents were hallucinating team dynamics before a single line had been written.
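In the experiment, the 'windows' were literally separate chat sessions, each with its own brief. For anyone who prefers to see the idea as code, here is a purely hypothetical sketch of the same separation of contexts using the Anthropic TypeScript SDK; the role briefs, model alias and function names are my own illustration, not what the experiment actually ran.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // expects ANTHROPIC_API_KEY in the environment

// Each "window" is an isolated conversation: its own role brief (system prompt)
// and its own running history. Nothing is shared unless the human PM carries it across.
const roleBriefs = {
  architect: "You are the solution architect. Own the overall design and challenge scope creep.",
  security: "You are the security reviewer. Assume SOC 2 and ISO 27001 are target certifications.",
  frontend: "You are the front-end lead. Favour simple, accessible UI patterns.",
  backend: "You are the back-end lead. Favour boring, well-documented infrastructure.",
};

type Role = keyof typeof roleBriefs;

const histories: Record<Role, Anthropic.Messages.MessageParam[]> = {
  architect: [], security: [], frontend: [], backend: [],
};

// Ask one specialist a question within its own context only.
async function ask(role: Role, question: string): Promise<string> {
  histories[role].push({ role: "user", content: question });
  const reply = await client.messages.create({
    model: "claude-3-5-sonnet-latest", // placeholder model alias
    max_tokens: 1024,
    system: roleBriefs[role],
    messages: histories[role],
  });
  const text = reply.content[0].type === "text" ? reply.content[0].text : "";
  histories[role].push({ role: "assistant", content: text });
  return text;
}
```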
Process for the human PM role
Experiment 1 was a proof of concept that I had no plans to maintain, so I let the LLMs guide me entirely on how to approach it. In experiment 3 I saw myself as the product manager and ensured I understood and signed off on each module, contract, decision etc. It was an exhausting process (62 documented decisions in the first 12 planning hours), but it improved controls and was the start of a data strategy that is solving the amnesia challenge of working with agents.
Conclusion: Context windows don't just hold information; they develop perspectives. Set up competing contexts and manage the alignment process. Then, once you're clear on objectives, move to atomic tasks and tunnel vision.
The Reality Check
I'm claiming it is possible to achieve months' worth of software development work in hours with little coding experience - so where's the catch?
First, this is not easy. Managing AI agents as they plan, code and debug is intensive. The closest analogy I can give is manning a conference stand for 8 hours straight - fielding complex questions every 30 seconds while maintaining context across multiple conversations.
And you do need to stay involved. I'm continuing to refine the rules for how Claude Code works, but I'm still jumping in regularly when I see a note like 'That API is still failing but it isn't part of the current scope so I will apply a quick fix to get this deployed'. Nice, but undocumented quick fixes quickly build swamps of tech debt.
Second, after experiment 3 I realised I needed to change some of the back-end architecture for new functionality. Refactoring with AI under our current processes led to another painful multi-day debugging session.
In short, this isn't easy, and software development is about more than building v1s. But so far I've not come across anything to say these aren't also solvable problems, and owning the entire build process while moving at this pace is opening up lots of interesting new opportunities.
What’s Next?
The experiment is not over. I’m currently 55% confident it will lead to a valuable, viable, secure B2B SaaS offer and 75% confident it will result in a platform that enables us to offer transformational services.
This is why transformation is uncomfortable. I'm investing heavily in something despite a 45% chance that I will have nothing to sell at the end.
I honestly don’t know if I am actually moving to managing agents (a phrase that just 6 months ago I would have been very cynical of) or whether I’m attempting the impossible and seeing fairies here.
But when I look back on the most transformational work I’ve done before, I wasn’t certain of outcomes and success at the start either.
Either way, there are two commitments I'm making:
Building in public: This is the first of a series of weekly posts sharing what actually happens when you lean into working with AI. No hype, just the reality of attempting something that I would previously have said shouldn't be possible.
Final Accelerator: October is our last cohort because we're betting on this experiment. As a minimum, we've already had enough product success to run the last accelerator with lots of AI support, from human-in-the-loop (HITL) research packs that we deliver through to AI strategy write-ups based on the workshops. More on this soon.
And now the experiment continues. Each week I'll share what breaks, what emerges, and what shouldn't be possible but is.
Because if transformation doesn't make you question your sanity occasionally, you're probably still in your comfort zone.


