Eight questions to ask before you hire someone to build an AI agent for your business

4 March 2026 · Dark Oak · 8 min read

We’ve been on both sides of this conversation. We’ve sold AI builds to SMEs in ecommerce, manufacturing and professional services. We’ve also bought work from other shops, got it wrong, and spent a few painful months recovering projects we would rather have scoped properly the first time. Here are the eight questions we wish more people asked us before signing. Asking them ourselves would have saved us a few of those recoveries when we were on the buying side.

We’re aware that we’re one of the shops you might be considering. We’ll answer the questions honestly anyway. If anything, we’d rather you turn up to a discovery call with this list in hand and put it to us directly. The shops that genuinely ship will welcome the questions. The ones that don’t will get visibly uncomfortable around question two or three, and that itself is useful information.

1. What specific workflow is this agent going to live inside?

The single biggest predictor of whether an AI build goes well isn’t the model, the tooling, or the budget. It’s whether anyone on either side can name the workflow in one sentence. If the answer to this question is “improve customer experience” or “make the team more efficient”, you don’t have a project yet — you have an aspiration. Aspirations don’t have inputs, outputs, or a definition of done, which means they can’t be scoped at a fixed price or evaluated when they ship.

A good answer sounds like:

One named workflow that already exists, with a person or team currently doing it
A clear input (an email, an invoice PDF, a customer enquiry, a row in a system) and a clear output (a routed message, an extracted JSON object, a draft reply for a human to approve)
A rough volume figure — even a back-of-envelope “about 200 of these a week”
An honest answer to “what happens to this work today, and what’s wrong with that”

Red flags: a list of five workflows, none specified; a pitch deck that talks about platforms before workflows; or a vendor who tries to expand the scope in the first meeting rather than narrow it.

2. How will you know it’s working — what’s your evaluation harness?

Every agent is going to be wrong some of the time. The question isn’t whether it makes mistakes; it’s whether you can measure how often, and tell whether a change to the prompt or the model made things better or just felt like it did. Without an evaluation set, “it works on my laptop” turns into “it broke in production”, and nobody can tell you when it broke or why.

A good answer sounds like:

An eval set of 50 to 100 representative examples, with the expected output for each one, built before any prompts are written
A scoring approach appropriate to the task: exact-match for extraction, per-class accuracy for classification, a rubric with a small human review pass for generative outputs
A run of the full eval suite on every prompt change and every model swap, not just at the end of the build
Numbers that are shared with you, not just kept internally

Red flags: “We’ll know it’s working when you tell us it is.” Or worse: a vendor who doesn’t understand the question.
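
To make this concrete, here is a minimal sketch of an exact-match evaluation harness for an extraction task, written in Python. Everything in it is hypothetical: EvalCase, run_eval and the toy extract_invoice stand-in are illustrative names, the three cases are made up, and a real build would call the actual agent where the stand-in sits.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One labelled example: an input and the output we expect for it."""
    case_id: str
    input_text: str
    expected: dict


def run_eval(agent: Callable[[str], dict], cases: list[EvalCase]) -> dict:
    """Score an agent by exact match and report which cases failed."""
    failures = []
    for case in cases:
        try:
            actual = agent(case.input_text)
        except Exception:
            actual = None  # count a crash as a failure instead of aborting the run
        if actual != case.expected:
            failures.append(case.case_id)
    passed = len(cases) - len(failures)
    return {
        "total": len(cases),
        "passed": passed,
        "accuracy": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


def extract_invoice(text: str) -> dict:
    """Toy stand-in for the real agent: parses 'supplier; total' strings."""
    supplier, total = text.split(";")
    return {"supplier": supplier.strip(), "total": float(total)}


# Hypothetical eval set; a real one would hold 50 to 100 representative cases.
CASES = [
    EvalCase("inv-001", "Acme Ltd; 120.50", {"supplier": "Acme Ltd", "total": 120.50}),
    EvalCase("inv-002", "Bolt & Co; 80", {"supplier": "Bolt & Co", "total": 80.0}),
    EvalCase("inv-003", "Total due 99.00", {"supplier": "Unknown", "total": 99.0}),
]

report = run_eval(extract_invoice, CASES)
print(f"{report['passed']}/{report['total']} passed, failures: {report['failures']}")
```

The point of the design is that the report names the failing cases rather than just giving a percentage, so a prompt change or model swap can be compared case by case against the previous run.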

3. What’s the plan for the cases the agent gets wrong?

If the eval set tells you the agent is right 92% of the time, the entire build hinges on what happens to the other 8%. An 8% silent error rate against a system of record is a disaster. An 8% rate that lands cleanly in a human review queue is just a normal, healthy production agent.

A good answer sounds like:

A human-in-the-loop review queue for any consequential action — a write to the ledger, an outbound message, a record update
A specific target escalation rate the vendor will commit to, and a plan for what they’ll do if it drifts
A clear answer to “who reviews the queue, on what cadence, and how do they push corrections back into the system”
A distinction between low-risk reads (where full autonomy is fine) and consequential writes (where it isn’t)

Red flags: a pitch for an “autonomous” agent with no human-in-the-loop at all on workflows that touch real money or real customers.
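
As an illustration of the read/write split, here is a small Python sketch of a router that executes confident low-risk reads and sends everything else to a human review queue. The names (ProposedAction, Router), the confidence score and the 0.8 threshold are all assumptions made for the example, not a recommended configuration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ActionKind(Enum):
    READ = "read"    # low-risk lookup, nothing is changed
    WRITE = "write"  # consequential: ledger entry, outbound message, record update


@dataclass
class ProposedAction:
    description: str
    kind: ActionKind
    confidence: float  # assumed self-reported score in [0, 1]


@dataclass
class Router:
    """Executes confident reads; queues writes and uncertain reads for a human."""
    min_read_confidence: float = 0.8  # illustrative threshold only
    review_queue: list[ProposedAction] = field(default_factory=list)
    executed: list[ProposedAction] = field(default_factory=list)

    def route(self, action: ProposedAction) -> str:
        is_confident_read = (
            action.kind is ActionKind.READ
            and action.confidence >= self.min_read_confidence
        )
        if is_confident_read:
            self.executed.append(action)
            return "executed"
        # Writes always land here, however confident the agent claims to be.
        self.review_queue.append(action)
        return "queued"

    def escalation_rate(self) -> float:
        """Share of all routed actions that went to human review."""
        total = len(self.executed) + len(self.review_queue)
        return len(self.review_queue) / total if total else 0.0


router = Router()
router.route(ProposedAction("Look up order status", ActionKind.READ, 0.95))
router.route(ProposedAction("Post invoice to ledger", ActionKind.WRITE, 0.99))
router.route(ProposedAction("Look up stock level", ActionKind.READ, 0.40))
router.route(ProposedAction("Look up delivery date", ActionKind.READ, 0.90))
print(f"Escalation rate: {router.escalation_rate():.0%}")
```

Tracking the escalation rate alongside the queue is what lets a vendor commit to a target and notice when it drifts.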

4. Where does our data go, and what’s your EU-data position?

For Irish and EU businesses this isn’t a nice-to-have. Between GDPR and the EU AI Act, you need to be able to say — on paper, to your auditor or your insurer — where your customer data is processed, by whom, and under what data-processing agreement. A vendor who waves their hand at this question is creating a problem you’ll inherit.

A good answer sounds like:

Explicit EU-region hosting for the routine, higher-volume parts of the workflow — typically open-weights models running on EU infrastructure
Named providers for any flagship hosted model used in the build (Anthropic, OpenAI, Mistral or similar), with the specific region disclosed
Data-processing addenda already in place with those providers, not “we’ll sort that later”
A clear position on training: your data is not used to train anyone’s model

Red flags: any version of “the API calls go to the US but it’s fine”; or a vendor who hasn’t read the sections of the GDPR and the AI Act that apply to them.
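
One way to keep those answers on paper is to record each processing step with its provider, region and contract status, then check the list mechanically. The Python sketch below is illustrative only: the provider names, the region labels and the rule that an EU region starts with "eu-" are assumptions for the example, and passing a check like this is not a substitute for legal review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingStep:
    name: str
    provider: str                  # hypothetical provider label
    region: str                    # hypothetical region label, e.g. "eu-west"
    dpa_signed: bool               # data-processing addendum already in place
    trains_on_customer_data: bool  # whether the provider may train on your data


def residency_issues(steps: list[ProcessingStep]) -> list[str]:
    """List every step that breaks one of the three rules checked below."""
    issues = []
    for step in steps:
        if not step.region.startswith("eu-"):  # assumed naming convention
            issues.append(f"{step.name}: processed outside the EU ({step.region})")
        if not step.dpa_signed:
            issues.append(f"{step.name}: no DPA in place with {step.provider}")
        if step.trains_on_customer_data:
            issues.append(f"{step.name}: {step.provider} may train on your data")
    return issues


# Made-up example workflow with one compliant step and one problem step.
STEPS = [
    ProcessingStep("invoice extraction", "self-hosted open-weights", "eu-west", True, False),
    ProcessingStep("reply drafting", "example-flagship-provider", "us-east", False, False),
]

issues = residency_issues(STEPS)
for issue in issues:
    print(issue)
```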

5. What happens when the model provider changes their pricing or deprecates a model?

Hosted model providers change pricing, deprecate older models, and occasionally rate-limit or change terms in ways that ripple through to your bottom line. If the entire build is a thin wrapper around one provider’s API, you’ve taken on that provider’s commercial risk by proxy. The shops that have been doing this for a few years tend to build with that switching cost in mind from the start.

A good answer sounds like:

A model-agnostic architecture, with the model behind an interface the rest of the system doesn’t care about
A demonstrated ability to swap providers — or swap between a hosted flagship and a self-hosted open-weights model — without rewriting the business logic
Honest reasoning about why a particular model is the right call for a particular step, rather than one provider used for everything because it’s familiar
An eye on running cost, not just per-call cost — including what happens if traffic doubles

Red flags: a vendor who’s only ever shipped on one provider, or who treats this question as paranoid.
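
A model-agnostic architecture can be as simple as one small interface that the business logic depends on. The following Python sketch shows the shape of it; TextModel, the two placeholder model classes and classify_enquiry are hypothetical names, and the placeholders return canned text where a real implementation would call a provider SDK or a self-hosted inference server.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only thing the business logic knows about a model."""

    def complete(self, prompt: str) -> str: ...


class HostedFlagshipModel:
    """Placeholder for a hosted provider; a real one would call its SDK here."""

    def complete(self, prompt: str) -> str:
        return "category: billing"  # canned reply for the sketch


class SelfHostedOpenWeightsModel:
    """Placeholder for an open-weights model on your own infrastructure."""

    def complete(self, prompt: str) -> str:
        return "category: billing"  # canned reply for the sketch


def classify_enquiry(model: TextModel, enquiry: str) -> str:
    """Business logic: depends on the interface, not on any one provider."""
    reply = model.complete(f"Classify this customer enquiry: {enquiry}")
    return reply.removeprefix("category:").strip()


# Swapping providers is a one-line change at the call site.
for model in (HostedFlagshipModel(), SelfHostedOpenWeightsModel()):
    print(type(model).__name__, "->", classify_enquiry(model, "I was charged twice"))
```

Because classify_enquiry never imports a provider library, the same evaluation suite can be run against each candidate model before a swap is made.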

6. How are you going to maintain it after you ship?

This is the question that separates shops that ship from shops that present. Production AI agents degrade silently as the upstream data shifts — a supplier changes their invoice layout, a category gets renamed in the ERP, a model update moves an edge case. Without monitoring and a clear maintenance owner, an abandoned agent quietly gets worse and worse until somebody notices it’s been costing money for six weeks.

A good answer sounds like:

A flat-fee Care plan with defined response times for issues, not hourly fire-fighting after the fact
Active monitoring — error rates, escalation rates, latency, cost — with someone watching the dashboards
Quarterly re-runs of the evaluation suite, with results shared with you
A clear handover plan if you ever want to take maintenance in-house

Red flags: “Maintenance is on a time-and-materials basis.” Translation: nobody’s watching unless you call.
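
To show what “someone watching the dashboards” can mean in practice, here is a minimal Python sketch of a drift check that compares this week’s escalation and error rates with a baseline. The WeeklyStats fields, the figures and the five-point tolerance are assumptions chosen for illustration; a real build would pull these numbers from its own logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyStats:
    week: str
    handled: int    # items the agent processed
    escalated: int  # items sent to human review
    errors: int     # items later found to be wrong


def rate(part: int, whole: int) -> float:
    return part / whole if whole else 0.0


def drift_alerts(
    baseline: WeeklyStats, current: WeeklyStats, tolerance: float = 0.05
) -> list[str]:
    """Flag each rate that rose more than `tolerance` (absolute) above baseline."""
    alerts = []
    for name in ("escalated", "errors"):
        before = rate(getattr(baseline, name), baseline.handled)
        now = rate(getattr(current, name), current.handled)
        if now - before > tolerance:
            alerts.append(
                f"{name} rate rose from {before:.1%} to {now:.1%} in {current.week}"
            )
    return alerts


# Made-up figures: escalations jump, errors barely move.
baseline = WeeklyStats("2026-W01", handled=200, escalated=16, errors=4)
current = WeeklyStats("2026-W09", handled=200, escalated=38, errors=6)

alerts = drift_alerts(baseline, current)
for alert in alerts:
    print(alert)
```

A check like this is cheap to run on a schedule, which is the difference between noticing a changed invoice layout within a week and noticing it after six.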

7. What’s the smallest version of this you’d ship in four weeks?

A vendor’s ability to scope down is the strongest signal we know of for whether they actually ship things. Anyone can describe a six-month programme of work. Far fewer can describe the four-week version that proves the workflow and earns the right to do more. If your vendor can’t compress the pitch, the build will sprawl.

A good answer sounds like:

A single named workflow, one document type, one user role, one integration point
A working environment you can use yourself within a month — not a slide deck, not a demo video
An honest list of what’s deliberately out of scope for that first slice, with a plan for what gets added in slice two
A fixed price for the first slice, so the commercial conversation doesn’t get in the way of the scoping conversation

Red flags: an inability to say no to anything during the scoping conversation, or a four-week pitch that’s secretly a six-month pitch with the difficult bits postponed.

8. Who specifically is going to do the work, and have they done it before?

In a lot of agency proposals, the people in the meeting and the people on the keyboard are different people. “We have a team of fifty” frequently translates to “your project will be done by someone who started this morning, supervised at a distance by someone who’s busy on three other accounts”. For an SME-sized build, where the margins for misunderstanding are tight, you want to meet the hands-on builders before you sign anything.

A good answer sounds like:

The specific people who will do the work, introduced before contract — not the partner or account director who pitched
A portfolio of comparable agents that those specific people shipped to production, with references where appropriate
An honest answer to “have you done this exact thing before, or is some of it new for you” — both are fine, but you need to know
A small team — three to five people on a typical SME build — with named roles, not an opaque resource pool

Red flags: a refusal to name the people, or a sudden swap-out between the pitch team and the delivery team after the contract is signed.

How we’d answer these ourselves

For the avoidance of doubt: every Dark Oak build is delivered by a named, hands-on team you meet before you sign. Every engagement is fixed-price, scoped to a single named workflow, with the smallest viable version as the first deliverable — typically four weeks from contract to a working environment you can use. Every shipped agent goes onto a Care plan from day one. The evaluation harness is built in week one, before any prompts are written. EU data residency is the default, not an option. We use a mix of EU-hosted open-weights models and named flagship providers, behind a model-agnostic interface, so we can swap when the economics shift.

We don’t claim to get every one of these right every time. We do claim that we’d rather have the conversation about each of them at the start of an engagement, openly, than have it as a series of awkward emails six months in.

If you’d like to put these questions to us — or to anyone else you’re talking to — we’d rather you did. Bring the list to a discovery call with us at /#contact and ask us each one in turn. We’ll give you the honest answers, including the ones where we’d point you somewhere else.