The Go-Live Failure: What We Learned from a Three-Month Delay

The go-live was scheduled for the first of the month. We went live ninety-three days later. The delay cost the client ~$180K in extended implementation fees, parallel system costs, and the time of four internal staff who spent three months doing double data entry. The root cause was not a technical failure. It was a user acceptance testing (UAT) process that was designed to find zero issues.

ERP go-live failures are almost never caused by the software. The ERP works. The modules are configured. The integrations run. The failure comes from a UAT process that validates happy-path scenarios written by the implementation team, a data migration that was never properly stress-tested on full data, and a go/no-go decision made by people who did not understand the risk they were signing off on.

We have seen this pattern three times. The first time, we did not catch it in time. The second time, we caught it six weeks in and delayed go-live ourselves rather than let it reach cutover. The third time, we redesigned the UAT process before build started. This is what we learned.

UAT that was designed to pass

Most ERP UAT processes test the scenarios the implementation team designed the system around. Business users click through flows that work correctly on clean data under controlled conditions. The scripts cover standard purchase orders, clean invoices, and straightforward delivery receipts. They were written by the team that built the system, not by the people who will use it under operational pressure.

Nobody tests what happens when a purchase order has a delivery split across three partial shipments, one of which arrived before the PO was approved due to a verbal authorization that was formalized later. That scenario does not exist in the UAT script. It exists every week in production at a busy warehouse.

We now require that at least 30% of UAT test cases come directly from the operational team's list of "the weird things that happen" — not from the implementation team's scope document. A logistics warehouse manager who has been processing deliveries for seven years can name fifteen edge cases in an afternoon. None of them were in the original UAT script.
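The 30% rule is easy to check mechanically before UAT sign-off. The following is a minimal sketch, assuming each test case is tagged with the team that proposed it; the case IDs, the `source` labels, and both function names are hypothetical and not taken from any particular test-management tool.

```python
from collections import Counter


def operational_share(test_cases):
    """Return the fraction of test cases contributed by the operational team."""
    if not test_cases:
        return 0.0
    counts = Counter(source for _, source in test_cases)
    return counts["operations"] / len(test_cases)


def meets_uat_mix(test_cases, minimum=0.30):
    """True when at least `minimum` of the cases come from operations."""
    return operational_share(test_cases) >= minimum


# Hypothetical (case_id, source) records for illustration only.
cases = [
    ("UAT-001", "implementation"),  # standard purchase order
    ("UAT-002", "implementation"),  # clean invoice
    ("UAT-003", "operations"),      # PO split across three partial shipments
    ("UAT-004", "implementation"),  # straightforward delivery receipt
]

print(f"{operational_share(cases):.0%} from operations, ok={meets_uat_mix(cases)}")
# 25% from operations, ok=False
```

A check like this only counts where cases came from. It says nothing about whether the operational cases are the hard ones, so it complements a review of the list rather than replacing it.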

93 days later than scheduled: caused by UAT designed to pass, a data migration never stress-tested on full data, and 14 open issues approved for go-live

The data migration that was never stress-tested

The client had 11 years of data. The migration was tested on a 5% sample. This is the single most common data migration mistake we see — sample-based testing on migrations that will run on full datasets at cutover.

The full migration, run for the first time on go-live weekend, took 31 hours instead of the estimated 6. It also produced 4,200 records with referential integrity errors that had never appeared in the sample. The cutover window was 48 hours. By hour 36 we were triaging which errors could be fixed post-go-live and which would block operations. The AP workflow was in the second category.

The fix is not complicated: run the full migration, on full data, at least twice before go-live. The first run is a diagnostic — you find the edge cases, the timing issues, the records that break the migration scripts. The second run is a rehearsal — you know the problems and you confirm your fixes worked.

Watch out

Never estimate migration time from a sample run. The relationship between sample size and migration time is not linear — large datasets surface contention, lock timeouts, and resource constraints that never appear in a 5% test. The only reliable timing estimate comes from a full migration run.
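A diagnostic run is more useful when it reports both how long a step took and which records broke. The sketch below, assuming a simple vendors/invoices schema invented for illustration, uses Python's built-in `sqlite3` module to find child rows whose foreign key points at no parent, which is one common form of referential integrity error. The table names, column names, and the `find_orphans` helper are hypothetical; a real migration would run the equivalent query against the target database.

```python
import sqlite3
import time


def find_orphans(conn, child, fk, parent, pk):
    """Return (rowid, fk_value) for child rows with no matching parent row.

    Identifiers are interpolated into the SQL, so pass only trusted,
    hard-coded table and column names.
    """
    query = (
        f"SELECT c.rowid, c.{fk} FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{fk} = p.{pk} "
        f"WHERE c.{fk} IS NOT NULL AND p.{pk} IS NULL"
    )
    return conn.execute(query).fetchall()


# Tiny in-memory stand-in for a migrated dataset (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vendors (vendor_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoices (
        invoice_id INTEGER PRIMARY KEY, vendor_id INTEGER, amount TEXT
    );
    INSERT INTO vendors VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO invoices VALUES
        (10, 1, '1200.00'), (11, 2, '830.50'), (12, 7, '415.00');
""")

started = time.perf_counter()
orphans = find_orphans(conn, "invoices", "vendor_id", "vendors", "vendor_id")
elapsed = time.perf_counter() - started

# Invoice 12 references vendor 7, which was never migrated.
print(f"{len(orphans)} orphaned invoice rows in {elapsed:.3f}s: {orphans}")
```

On three rows the timing is meaningless, which is the point of the warning above: the number only becomes informative when the same check runs against the full dataset.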

The parallel run that did not happen

The plan included a two-week parallel run — operating both the old and new system simultaneously to catch discrepancies between what each system calculated. It was cut to five days during scope negotiations because the client's operations team said they could not sustain double data entry for two weeks. We agreed. This was a mistake.

The parallel run caught three critical calculation differences in the last four days. If it had run for two weeks, we would have found them before go-live. Instead, we found them two weeks after go-live when a supplier payment came out wrong. The payment was for $47,000. The discrepancy was a rounding logic difference in the tax calculation that only appeared on invoices above $40,000.

The supplier got a corrected payment. The relationship survived. The lesson did not cost us a critical client. It cost us credibility and a painful three-week post-launch period of remediation. We have not negotiated a parallel run shorter than two weeks since.
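The core of a parallel run is a daily reconciliation: take what each system calculated for the same document and flag every difference, however small. Here is a minimal sketch, assuming both systems can export a per-invoice total. The invoice IDs, the amounts, and the `reconcile` function are all made up for illustration and do not reproduce the actual discrepancy from this project.

```python
from decimal import Decimal


def reconcile(legacy, new, tolerance=Decimal("0.00")):
    """Compare per-invoice totals from both systems and return mismatches."""
    mismatches = []
    for invoice_id in sorted(legacy.keys() | new.keys()):
        old_amt = legacy.get(invoice_id)
        new_amt = new.get(invoice_id)
        if old_amt is None or new_amt is None:
            mismatches.append((invoice_id, old_amt, new_amt, "missing in one system"))
        elif abs(old_amt - new_amt) > tolerance:
            diff = abs(old_amt - new_amt)
            mismatches.append((invoice_id, old_amt, new_amt, f"differs by {diff}"))
    return mismatches


# Illustrative exports; Decimal avoids float rounding noise in the comparison.
legacy_totals = {
    "INV-1001": Decimal("1250.00"),
    "INV-1002": Decimal("47000.00"),
    "INV-1003": Decimal("310.40"),
}
new_totals = {
    "INV-1001": Decimal("1250.00"),
    "INV-1002": Decimal("47000.12"),
    "INV-1004": Decimal("99.00"),
}

for row in reconcile(legacy_totals, new_totals):
    print(row)
```

A comparison like this can only flag invoices that actually pass through both systems during the run. If large invoices are rare, a short parallel period may never include one, which is one argument for keeping the full two weeks.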

The go/no-go meeting that said go

The go/no-go meeting had 14 open issues on the tracker. Eight were rated Low, four were Medium, two were High. The project manager presented them as "manageable post-go-live items." The client stakeholders nodded. We went live.

The two High issues were in the AP workflow. They had been rated High for a reason: they affected the invoice matching logic for a category of vendor invoices that represented about 40% of monthly payment volume. Everyone in the room knew they were High. Nobody wanted to delay the go-live that had already been planned, communicated, and prepped for. The pressure to go was social, not technical.

The AP workflow broke on day three when the first high-volume payment run hit the unresolved issues. The rule we use now: no High issues at go/no-go, no exceptions, no waivers. If a stakeholder wants to waive a High issue, they write it down with their name on it and a statement of what they are accepting. Nobody has signed that paper yet.

Our take

Rate issues by business impact, not by fix complexity. A High issue does not mean "hard to fix." It means "if this is wrong at go-live, operations stop." Keep that definition consistent across the project so the go/no-go rating is not negotiated based on schedule pressure.
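The go/no-go rule can be written down as a gate that nobody in the room has to argue with. The sketch below is one possible encoding, assuming a tracker export with a severity field and optional written-waiver fields. The `Issue` class, its field names, and the issue keys are hypothetical; the counts mirror the 14 open issues described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Issue:
    key: str
    severity: str                        # "Low", "Medium", or "High"
    waived_by: Optional[str] = None      # named stakeholder, if waived in writing
    accepted_risk: Optional[str] = None  # their statement of what they accept


def go_no_go(issues):
    """Return the decision and the High issues that block go-live.

    A High issue blocks unless it carries both a name and a written
    statement of the accepted risk.
    """
    blockers = [
        i for i in issues
        if i.severity == "High" and not (i.waived_by and i.accepted_risk)
    ]
    return ("NO-GO" if blockers else "GO", blockers)


# Eight Low, four Medium, two High, as on the tracker at the meeting.
tracker = (
    [Issue(f"LOW-{n}", "Low") for n in range(1, 9)]
    + [Issue(f"MED-{n}", "Medium") for n in range(1, 5)]
    + [Issue("AP-1", "High"), Issue("AP-2", "High")]
)

decision, blockers = go_no_go(tracker)
print(decision, [i.key for i in blockers])
# NO-GO ['AP-1', 'AP-2']
```

The gate is only as good as the severity ratings fed into it, which is why the definition of High has to be fixed before schedule pressure starts.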

What a three-month delay actually costs

Extended implementation fees are the visible cost and the one that appears on invoices. The hidden costs are larger and harder to attribute.

Staff doing double data entry: four people averaging 2 extra hours per day for 11 weeks is roughly 440 person-hours at five working days a week. At a loaded cost of ~$40/hour, that is about $17,600 in internal time that appears nowhere on a project budget. Supplier relationships came under strain from delayed payments during the broken AP period; one supplier put the account on hold for two weeks. Finance could not close the month properly and needed manual journal entries to reconcile the parallel period: another 120 person-hours.

A delayed go-live also delays the ROI timeline. Every month of delay is a month in which the efficiency gains the project was meant to deliver do not arrive. The client had projected $280K in annual efficiency savings. Three months of delay cost them $70K in unrealized benefits on top of everything else. The ~$180K figure we quote at the top of this article understates the real cost by a significant margin.
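The unrealized-benefit figure is a simple pro-rating of the projected annual savings, and it is worth making explicit so it can be rerun for other delay lengths. A minimal sketch follows, assuming savings accrue evenly across the year (a simplification); the function name is hypothetical.

```python
def unrealized_benefit(annual_savings, months_delayed):
    """Savings that do not arrive during the delay, pro-rated by month."""
    return annual_savings * months_delayed / 12


delayed = unrealized_benefit(280_000, 3)
quoted_cost = 180_000  # the ~$180K direct cost quoted for the delay

print(f"Unrealized benefits: ${delayed:,.0f}")
# Unrealized benefits: $70,000
print(f"Quoted cost plus unrealized benefits: ${quoted_cost + delayed:,.0f}")
# Quoted cost plus unrealized benefits: $250,000
```

The second line is a rough floor rather than a full accounting, since strain on supplier relationships and similar effects do not reduce to a single number.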

A UAT process that has never found a show-stopping issue in the first week of testing has never been properly designed. If every UAT script passes cleanly on first run, the scripts were written by people who wanted them to pass — not by people who were trying to break the system.