Recovering a Ghost blog from a migration failure on Fly.io

Thursday, 4th of January 2024

First, let me blow off some steam

I maintain some Ghost blogs, and two of them are hosted on Fly. In the beginning I liked both Ghost and Fly, but now I kinda… hate Ghost, and I think I’ll soon be moving all my stuff off Fly.

Fly is too unstable for what it wants to be. Things change fast, and a lot.

Response times are all over the place, ranging from 15ms one day to 2-3 seconds the next. That’s with properly scaled machines hosting Go services that have had two concurrent users at most. It’s definitely something wonky in their networking, and it happens randomly.

Using the CLI feels like interacting with something that was built in defiance of any “do what I mean” philosophy.

You get machines stuck in weird states, where they can’t be started, stopped, and sometimes even destroyed.

If you want support as a small user, you have to go to the forum and create a new thread. And then post information that you shouldn’t (and would rather not) post for the whole world to see.

Ghost is… open-source (MIT), and you can freely host it yourself.

But if you do, you will have to squeeze your butt cheeks and pray to the bad code gods that the upgrade you’re trying to do won’t fail, and that you won’t have to spend another hour troubleshooting it. It’s not like this in the beginning, but you’ll get there.

Ghost will also run out of memory for no reason, sometimes when it has no traffic at all and is only running its internal jobs. And if you tell me I need to give it 2GB of memory to run it properly… I have things to say to you, but they’re not kind.

Yes, I feel better now.

The problem

  1. User deploys ghost:5-alpine on Fly.
  2. User waits for a few months and re-deploys it.
  3. Migration fails and the machine enters a hopeless restart loop.

[2024-01-03 23:58:49] ERROR Migration lock was never released or currently a migration is running.

 Migration lock was never released or currently a migration is running.
 "If you are sure no migration is running, check your data and if your database is in a broken state, you could run `yarn knex-migrator rollback`."
 Error ID:
     500
 ----------------------------------------
 MigrationsAreLockedError: Migration lock was never released or currently a migration is running.
     at /var/lib/ghost/versions/5.75.2/node_modules/knex-migrator/lib/locking.js:62:23
     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
     at async DatabaseStateManager.getState (/var/lib/ghost/versions/5.75.2/core/server/data/db/DatabaseStateManager.js:40:13)
     at async DatabaseStateManager.makeReady (/var/lib/ghost/versions/5.75.2/core/server/data/db/DatabaseStateManager.js:73:25)
     at async initDatabase (/var/lib/ghost/versions/5.75.2/core/boot.js:69:5)
     at async bootGhost (/var/lib/ghost/versions/5.75.2/core/boot.js:503:9)

The solution

The fix is indeed to run yarn knex-migrator rollback, but you won’t be able to SSH into the container because your machine is in a restart loop.

What you need to do is to leverage the experimental exec option to override the entrypoint with the rollback command.

Make sure you have an [experimental] section in your fly.toml and configure exec like this:

[experimental]
  exec = ["sh", "-c", "su node -c 'cd /var/lib/ghost/current && yarn knex-migrator rollback'"]

Then issue another fly deploy, squeeze your butt cheeks, and hopefully the rollback will complete successfully. Afterwards, remove or comment out the exec option, and perform the butt cheek & fly deploy ritual again.
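For reference, here is a rough sketch of what that section of fly.toml might look like after the rollback, assuming you comment the override out rather than delete it. TOML comments start with #, so the line sticks around for the next time this happens but is ignored on deploy. The explanatory comments are mine; the rest of your fly.toml stays as it was.

```toml
[experimental]
  # Temporary override, only needed for the one-off rollback deploy.
  # Commented out so the next deploy starts Ghost normally again.
  # exec = ["sh", "-c", "su node -c 'cd /var/lib/ghost/current && yarn knex-migrator rollback'"]
```

If the override is still active when you redeploy, the machine will just run the rollback command again instead of starting Ghost, so double-check this before the second fly deploy.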

So yeah, you try to upgrade, the migration fails, but if you roll it back and apply it again… it works. That’s bad software.

If it worked, I’m happy for you, and if it didn’t… good luck.