Friday 21 February 2014

A Continuous Integration Train Smash

If you are doing it right, then very soon your Jenkins server will become essential to the functioning of every element of your business.

Jenkins will handle the testing, measurement, packaging and deployment of your code.

We have one Jenkins installation, which has grown in capability as it has in importance.

Like most organisations we grew our CI infrastructure organically. Developers from the Android, iOS, Java core, front-end and sysadmin teams all added jobs, and plugins to support them. Some jobs are built on the server and some on slaves.

No record was kept of who installed which plugins, when, or why.

We were aware that we needed to back up this crucial element of infrastructure, though we never did a recovery dry run.

We decided to add the configuration of the server to git, initially manually and then using the SCM Sync configuration plugin; however, we did not test restoring from it.
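
For what it is worth, the manual approach is little more than committing the XML configuration files under the Jenkins home directory; a minimal sketch, assuming the default Debian JENKINS_HOME of /var/lib/jenkins and ignoring build output:

cd /var/lib/jenkins
git init
printf 'workspace/\nbuilds/\n*.log\n' > .gitignore   # keep build output and logs out of the repository
git add *.xml jobs/*/config.xml                      # the server config plus each job's config
git commit -m "Snapshot of Jenkins configuration"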

After a while we noticed errors about git in the Jenkins logs and on screen. The errors in the logs came from some bad git config files, which were fixed manually. The problems with the SCM Sync configuration plugin were worse, and were tracked down to "Renaming job doesn't work with Git". The workaround given there does work, but the plugin breaks in a very nasty and difficult-to-fix way which requires the server to be restarted. We decided to remove the plugin, even after fixing the current symptoms.

All was good to go: we had a working server, no errors in the logs, all clean and up to date.

Snatching defeat from the jaws of victory

Prior to the restart I had updated all plugins to their latest versions. This is something I have done many times over the last five years and it has never failed. As the first question one is asked in forums is "Have you updated to the latest version?", it is a step I have, until now, taken for granted.

After running a few builds the following day, Jenkins froze.

The last build to be run was a new, complex, important one, involving Virtual Machines, written by my boss.

I restarted the server, taking the opportunity to update three more plugins as I went.

Again it limped on for a while, then the UI froze.

We disabled all plugins (a very competent colleague had joined me by this point) by creating a .disabled file for each one in the plugins directory:

for x in *.jpi; do touch "$x.disabled"; done

Then we set about re-enabling them one letter at a time, repeating for a-z:


rm -v a*.jpi.disabled
sudo /etc/init.d/jenkins restart
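
In hindsight the repetition could have been scripted; a rough sketch in bash, assuming the default plugins directory of /var/lib/jenkins/plugins and a manual health check after each restart:

#!/bin/bash
# Re-enable plugins one initial letter at a time, restarting Jenkins after each batch
cd /var/lib/jenkins/plugins || exit 1
for letter in {a..z}; do
    rm -v "${letter}"*.jpi.disabled 2>/dev/null    # nothing to do if no plugin starts with this letter
    sudo /etc/init.d/jenkins restart
    read -r -p "Is Jenkins still healthy after re-enabling '${letter}*'? (y/n) " ok
    [ "$ok" = "y" ] || { echo "The culprit starts with '${letter}'"; break; }
done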

This revealed that the problem was in a plugin starting with t, one of:

tasks.jpi
thinBackup.jpi
throttle-concurrents.jpi
translation.jpi
token-macro.jpi
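
The same trick narrows things down further, one plugin per restart; a sketch, again from the plugins directory, starting with the most likely suspect:

for x in t*.jpi; do touch "$x.disabled"; done    # disable the t plugins again
rm -v token-macro.jpi.disabled                   # re-enable just one of them
sudo /etc/init.d/jenkins restart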

Whilst it looked like it might be token-macro.jpi, it then appeared not to be, and the restarts were taking an increasing length of time.

At this point we decided that it would be better to revert to a backup.

The sysadmin team initiated the restore from backup, then discovered that there was still a process spinning at 100% CPU, and that it came from the throttle-concurrent plugin.
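
Tying a spinning Java process back to a particular plugin is worth a note; a minimal sketch using standard JDK and shell tools, assuming Jenkins was launched from jenkins.war and that jstack is installed:

JPID="$(pgrep -f jenkins.war | head -n1)"          # PID of the Jenkins JVM
top -b -n1 -H -p "$JPID" | head -n20               # per-thread view shows which thread is burning CPU
sudo -u jenkins jstack "$JPID" > /tmp/threads.txt  # full thread dump, taken as the jenkins user
grep -i throttle /tmp/threads.txt                  # stack frames from the suspect plugin, if present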

A quick google led to JENKINS-21044, a known Blocker issue. On the wiki this is flagged:

Warning!
The version has a "blocker" issue caused by threads concurrency. See JENKINS-21044 for more info.

It was, however, too late to stop the restore. The backup failed at 8.00pm.

By 7.00pm the following evening, Friday, after a day of configuration by most of the developers, we were back to where we had been on Wednesday night.

The long tail of the event continues through Monday.
