A year of putting AI agents into Irish SME workflows — what worked, what didn't

14 January 2025 · Dark Oak · 8 min read

Looking back at the agents we shipped in 2024, the most useful thing we can do is be honest about which ones we’d build the same way again — and which we’d refuse altogether. We’ve spent the year putting agents into invoice processing, enquiry routing, internal Q&A, and a handful of other workflows for SMEs in ecommerce, manufacturing, and professional services. Some of those builds are still humming along nicely. Some had to be quietly rebuilt. A couple we wouldn’t take on again at all.

This isn’t a victory lap. It’s the kind of post we wish we’d been able to read in January 2024, before we made some of the calls we made. If you’re an SME owner or operations lead thinking about putting an agent into a real workflow this year, hopefully a bit of this saves you a quarter.

What worked

The wins were less glamorous than we expected. The agents that paid for themselves quickly weren’t the impressive ones we could demo at a meet-up. They were the small, narrow, slightly boring ones that took a job a human did badly and made it boring in a good way.

A few patterns stood out:

Document-extraction agents. Invoices, dockets, purchase orders, delivery notes. Narrow scope, clear inputs, clear outputs, and a finance or ops person who already knew exactly what the data should look like. These were our fastest path to ROI, the easiest to scope at a fixed price, and the most stable in production. If a client came to us with “we want AI” and nothing more, this was usually where we landed after the first conversation.
Inbound-enquiry classification and routing. A small build by our standards — usually a couple of weeks — but the impact was disproportionate. Enquiries getting to the right person on the same day instead of three days later. Sales leads no longer sitting in a shared inbox over a bank holiday. Low risk, because the worst case is a misrouted email rather than a wrong invoice in the ledger.
Knowledge-base agents over a defined corpus. Internal Q&A over a company’s own docs, SOPs, supplier terms, or HR handbook. The trick here was being ruthless about the corpus. Point it at one well-maintained SharePoint folder and it works. Point it at “everything we have” and it stops being trustworthy within a month.
The agent proposes, a person decides. Every build that had a human review queue in front of any consequential write went well. Every build where we initially skipped that step had to have one retrofitted. We now treat it as the default, not the cautious option.
Care plans from day one. We started putting every shipped agent onto a Care plan from the day we deployed it. That’s how we caught the small drifts: a supplier changing their invoice layout, a model update changing how one borderline case was classified, an upstream system slightly changing a date format. Without it, we wouldn’t have known until a client noticed, which is too late.

The common thread across all of these is that none of them are trying to be impressive. They take a piece of work that was already happening — slowly, expensively, or inconsistently — and make it faster, cheaper, or more consistent. The agent isn’t the point. The workflow is the point. The agent is just the bit that finally made it tractable.
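
For the technically minded, here is a minimal sketch of the “agent proposes, a person decides” pattern from the list above. It is an illustration under assumptions, not our production code: the names ProposedAction, ReviewQueue and apply_write are hypothetical, and the queue lives in memory where a real build would use a database and a review screen.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ProposedAction:
    """A write the agent wants to make, held until a person reviews it."""
    target: str   # hypothetical identifier, e.g. "crm.contact:123"
    change: dict  # fields the agent proposes to set
    reason: str   # the agent's explanation, shown to the reviewer
    status: Status = Status.PENDING


@dataclass
class ReviewQueue:
    # The only path to the system of record (assumed callback, supplied by the caller).
    apply_write: Callable[[str, dict], None]
    items: list[ProposedAction] = field(default_factory=list)

    def propose(self, action: ProposedAction) -> int:
        """Agent side: queue the action and return its id. Nothing is written yet."""
        self.items.append(action)
        return len(self.items) - 1

    def decide(self, action_id: int, approve: bool) -> None:
        """Human side: approve (and apply) or reject a pending action."""
        action = self.items[action_id]
        if action.status is not Status.PENDING:
            raise ValueError("action already decided")
        if approve:
            self.apply_write(action.target, action.change)
            action.status = Status.APPROVED
        else:
            action.status = Status.REJECTED
```

The design choice that matters is that the agent only ever calls propose, and the write callback is reachable only through decide. Even a thin queue like this means a confident mistake shows up as a batch of pending items instead of a batch of changed records.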

What didn’t

The misses are more interesting and more useful to talk about, so we’ll spend a bit more time here.

The “do everything” agent. Every time we tried to build a single agent that handled multiple workflows (say, an ops assistant that did enquiry routing, invoice triage, and internal Q&A), it degraded faster than the equivalent small, single-purpose agents would have. The reasoning got fuzzier. The prompts got harder to maintain. Edge cases in one workflow started bleeding into another. We’re now firmly in the camp of small, well-scoped agents that do one job each. If a client asks for a single front-door agent, we’ll often put a thin router in front of several narrow agents rather than one big one.
Agents on top of unstable upstream systems. We had a couple of builds where the underlying data was a mess — duplicate customer records, inconsistent product codes, half-migrated legacy fields — and we agreed to put an agent on top of it anyway because the client was keen to get moving. In every case, the agent ended up looking broken even when the model and the prompts were doing exactly what they were asked. We’ve learned to push back here. If the system underneath isn’t stable, we’d rather do a smaller piece of remediation first and put the agent in once there’s something solid to stand on.
Voice agents for customer-facing use. We tried two voice agent builds for outbound customer contact in 2024, and we’d defer both if we were starting again now. The technology is improving fast, but the gap between “demo-good” and “production-good” for a customer-facing voice agent that has to handle Irish accents is still wider than we’d be comfortable shipping. We expect to revisit this during 2025, but for customer-facing voice work in 2024 the honest answer was usually “not yet”.
Anything autonomous writing to a system of record. We had one build where, against our better judgement, an agent was allowed to update records in a CRM directly without a proposed-action step. It worked fine for several weeks, then made a confident, plausible-looking mistake on roughly forty contacts in an afternoon. Recoverable, but exactly the kind of thing that erodes trust. We will not take on builds like that again. Every consequential write goes through a human queue, even if it’s a thin one.

There’s a pattern in those misses. In each case, we knew at the start of the build that we were taking a shortcut on something — scoping, data quality, autonomy. The model didn’t fail us. Our own discipline did.
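
On the “do everything” agent: the thin router in front of several narrow agents can be sketched roughly as below. Everything here is a stand-in. The three agent functions and the keyword-based classify are hypothetical placeholders so the example runs on its own; in a real build the classifier would more likely be a small model call and each agent its own scoped prompt and toolset.

```python
from typing import Callable

# Each narrow agent is a plain function: text in, result out (stand-ins for real agents).
Agent = Callable[[str], str]


def route_enquiry(text: str) -> str:
    return f"enquiry-routing handled: {text}"


def triage_invoice(text: str) -> str:
    return f"invoice-triage handled: {text}"


def answer_question(text: str) -> str:
    return f"internal-qa handled: {text}"


AGENTS: dict[str, Agent] = {
    "enquiry": route_enquiry,
    "invoice": triage_invoice,
    "qa": answer_question,
}


def classify(text: str) -> str:
    """Hypothetical classifier; keyword rules keep the sketch runnable."""
    lowered = text.lower()
    if "invoice" in lowered or "purchase order" in lowered:
        return "invoice"
    if lowered.rstrip().endswith("?"):
        return "qa"
    return "enquiry"


def dispatch(text: str) -> str:
    """The router only picks an agent; it does none of the work itself."""
    label = classify(text)
    agent = AGENTS.get(label)
    if agent is None:
        raise KeyError(f"no agent registered for label {label!r}")
    return agent(text)
```

Because the router holds no workflow logic of its own, an edge case in invoice triage stays inside the invoice agent, and each narrow agent can be tested, changed, or replaced without touching the others.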

Where we changed our minds

A few things we genuinely believed at the start of 2024 that we no longer believe.

Evaluation harnesses are not a nice-to-have

We used to treat evaluation as something we’d add in once we had a working agent. By mid-year we’d flipped that completely. We now insist on a basic evaluation harness in week one of a build — a small set of representative inputs with expected outputs, run automatically, so we can tell whether a prompt change or a model swap actually made things better or just felt like it did. It sounds obvious written down. It wasn’t obvious to us in January.
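
To make that concrete, here is a minimal sketch of the kind of week-one harness we mean, assuming the agent can be called as a function from text to a label. The Case and run_eval names, the three example cases, and the keyword-based baseline_agent are all hypothetical; a real case set would come from the client’s own data, and the agent under test would be the actual build.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    name: str
    input_text: str
    expected: str


def run_eval(agent: Callable[[str], str], cases: list[Case]) -> dict:
    """Run every case through the agent and report accuracy plus the failures."""
    failures = []
    for case in cases:
        actual = agent(case.input_text)
        if actual != case.expected:
            failures.append((case.name, case.expected, actual))
    passed = len(cases) - len(failures)
    return {
        "total": len(cases),
        "passed": passed,
        "accuracy": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


# Illustrative cases only; a real set would come from the client's own data.
CASES = [
    Case("sales lead", "Can you quote for 200 units?", "sales"),
    Case("support", "My order arrived damaged", "support"),
    Case("accounts", "Please resend invoice 1042", "accounts"),
]


def baseline_agent(text: str) -> str:
    """Stand-in for the agent under test (a keyword classifier, not a model)."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "accounts"
    if "quote" in lowered:
        return "sales"
    return "support"
```

Running run_eval(baseline_agent, CASES) before and after a prompt change or model swap gives two reports to compare, so “better” becomes a number and a list of named failures instead of a feeling.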

“The model will improve” is not a scoping strategy

We saw a lot of loose scoping in early 2024, including from us, on the assumption that capability would catch up before the client noticed. It didn’t, not in the ways we needed it to. Some things got dramatically better over the year; other things barely moved. Scoping has to be tight from day one, on the assumption that the model you’re shipping on is the model you’ll have for the next twelve months. Anything you get for free on top of that is a bonus.

The best clients aren’t necessarily the ones with the strongest IT teams

This one surprised us. We assumed clients with a strong in-house IT function would be the easiest to deliver into. Sometimes that was true. But our most successful builds in 2024 were the ones with one or two willing process owners on the client side (usually an ops manager or a finance lead), regardless of how deep the IT team was. A willing process owner who knows the workflow inside out is worth more to an agent build than a full IT department that doesn’t. We weight that heavily now when we’re sizing up new work.

What we’re doing differently in 2025

A short list of changes we’ve already made in how we engage.

Care plans are the default, not an upsell. Every agent we ship now goes onto a Care plan from day one. If a client genuinely doesn’t want one, we’ll have a conversation about whether the build should go ahead at all.
No builds without a clear human-in-the-loop pattern. We’ll happily build agents that propose actions, draft replies, classify, extract, summarise, or route. We won’t build ones that take consequential actions on a system of record without a person in the loop. We’re comfortable losing the occasional piece of work over this.
Pushing back harder on undefined briefs. “We want AI but don’t know for what yet” is a perfectly reasonable starting point, but it’s not a build. Our answer to it is now a €1.5k Readiness Sprint to figure out the right shape of the work, or a polite no. We’ve stopped trying to convert vague enthusiasm into a project plan in the first meeting.
More open-weights models hosted in EU regions for routine traffic. For the higher-volume, lower-risk parts of a workflow (basic classification, extraction, routing), we’re hosting more open-weights models in EU regions ourselves. It keeps the cost predictable, keeps the data inside the EU residency commitments our clients care about, and reduces our exposure to pricing or availability changes on any single vendor. The flagship hosted models still earn their keep for the harder reasoning bits; they’re just no longer the default for everything.

That’s the year. We shipped more agents than we expected to, refused more than we expected to, and changed our minds on a handful of things we’d have argued about a year ago. The headline lesson from twelve months of this is unromantic: small, scoped, human-supervised agents on stable upstream systems, monitored after they ship. Most of what went well in 2024 came from doing that. Most of what didn’t came from straying from it.

If you’re an Irish SME thinking about where an agent might actually earn its keep in your business this year — or about whether to put one on a workflow you’ve already been quietly frustrated with — we’d be glad to have a conversation. We’re happy to tell you when it’s not the right call, too; we’ve had plenty of practice. Get in touch at /#contact and we’ll set up a chat.