It’s Tuesday… Jenkins is down
I woke up Tuesday morning to an email from AWS reporting malicious activity on one of our instances. The report described activity resembling “scanning remote hosts on the Internet…”. This confirmed our suspicion that something was wrong with the CI instance, which hosted our Jenkins (v2.32) server and some of our internal tools.
Over the weekend (the Monday was a public holiday) Jenkins had been misbehaving. On Sunday morning I had been using Jenkins, but by the afternoon one of my colleagues complained to me that it was inaccessible. So I opened my browser to check, and it wasn’t reachable. After confirming that other Internet-facing applications running on the same instance were up, I SSHed into the instance and started Jenkins, or so I thought. The browser showed “Jenkins is starting…” and I was like, ok, Jenkins is running. Little did I know that Jenkins would be “starting” for the rest of the weekend.
Back to Tuesday morning. Rolling out of bed, I refreshed my email to see the mail from AWS and that the CI instance had been stopped. Let me take a moment to explain what that means. Without the CI instance, all committed code stays in the repo without being built and shipped to our clients (continuous delivery and deployment stop), our databases don’t get backed up, and the automated CI checks, tests, builds and everything else that happens when a developer pushes code don’t run. Also, our internal tool that simplifies setting up new tenants for our products, among other things (more on this in another post), was down.
Again, back to Tuesday morning. The email, now a thread, had a reply from another colleague with details of his investigation. Having SSHed into the instance, he was able to see all connections to the instance, the program making the suspicious connections, the location the connections were initiated from with the help of an IP lookup (and as you can guess, it’s…), and where on the file system the program was executing. This all happened before day even broke: the emails were exchanged around 5–6am and the instance was shut down. As the day progressed, we found the Jenkins vulnerability involved (SECURITY-429 / CVE-2017-1000353), which upgrading to the latest version would have fixed.
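A quick way to tell whether a given Jenkins version predates that fix can be sketched in Python. This is a minimal sketch, not something we ran during the incident. It assumes, per my reading of the Jenkins security advisory, that the fix landed in weekly release 2.57 and LTS release 2.46.2, and that LTS versions are the ones with three numeric components. The function names are made up for illustration.

```python
from typing import Tuple

# Assumption: first releases containing the SECURITY-429 fix,
# as I understand the Jenkins security advisory.
FIXED_WEEKLY = (2, 57)
FIXED_LTS = (2, 46, 2)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string such as 'v2.32' or '2.46.1' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().lstrip("vV").split("."))


def is_vulnerable(version: str) -> bool:
    """Return True if the version is older than the first fixed release."""
    parsed = parse_version(version)
    # Assumption: three components means an LTS release, two means weekly.
    fixed = FIXED_LTS if len(parsed) >= 3 else FIXED_WEEKLY
    return parsed < fixed


if __name__ == "__main__":
    for candidate in ("2.32", "2.46.1", "2.46.2", "2.57"):
        print(candidate, "vulnerable" if is_vulnerable(candidate) else "patched")
```

Our v2.32 falls on the vulnerable side of this check, which matches what we found that Tuesday.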
What actually happened? An attacker connecting from a Russian IP address exploited the Jenkins vulnerability above (a Java deserialization flaw in the Jenkins CLI, rather than an Apache Struts one) to upload a malicious program. The program scans all ports on the network to try to replicate itself onto other machines, and it uses every available resource on the machine to run kworker34 (malware that mines Bitcoin). That was why Jenkins couldn’t get enough resources to start. Why it stopped in the first place, I don’t know.
And again, back to Tuesday. Now the race was on to set up Jenkins with all the previous jobs. The first thing we did was to set up a new Jenkins, but this time on GKE (I’ve been wanting to move Jenkins to Google Cloud because the EC2 ephemeral slave isn’t cost effective, and using Kubernetes to spin up JNLP slaves seems like a better idea). Then we set up user access control on Jenkins and started configuring the jobs. This was what we did for days, hitting one issue after another. For example, the JNLP Docker slaves on k8s weren’t building our Docker images (docker-in-docker). They also didn’t have Flyway, the AWS CLI, etc., so, unfortunately, as a temporary fix I had to set up an EC2 slave. Then I built a Flyway Docker image FROM jenkinsci/jnlp-slave.
Again, I woke up to an email, but this time it was a week later. The email thanked us for removing the shackles placed on developers by the lack of an automation server and other internal tools during this incident. After struggling to properly configure the new GCP Jenkins and our jobs, we had switched back to an EC2 instance to deploy our CI tools. Fortunately, we had been able to back up the Jenkins home directory. The new plan was then to containerize Jenkins (mounting the Jenkins home backup, of course) and set up SSL (which was missing on the old Jenkins). A post on this is coming up later, IsA.
* GCP is hard to switch to from AWS.
* It’s very important to update Jenkins (or any software) as soon as a stable version is released, especially if it contains a security fix.
* Take security seriously; it could have been a lot worse. If the malware had infected all our instances, all our products would have been down.
* Also, ensure only those who NEED access get it. Don’t casually share SSH keys, allow network access from all IPs, put multiple instances on the same subnet, etc.
* I’ll still try to set up Jenkins on GKE, although my free credit is exhausted. (PS: Google, I could do with a hand-holding guide and some free credit ¯\_(ツ)_/¯ )
* Add SSL to all Internet-facing applications. (Let’s Encrypt provides free SSL certificates.)
* Always set up backups to S3 for the home/data directory of all apps and services. (Another future post, maybe.)
* Ensure all deployments are reproducible at any time. (Ansible or Terraform, maybe.)
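The point above about not allowing network access from all IPs is easy to check mechanically. Here is a minimal sketch in Python, assuming the inbound firewall rules have already been exported into a simple list of dictionaries with `port` and `cidr` keys. That shape and the function name are hypothetical, not the format of any AWS API.

```python
import ipaddress
from typing import Dict, List


def world_open_rules(rules: List[Dict]) -> List[Dict]:
    """Return the rules whose CIDR covers every address (0.0.0.0/0 or ::/0)."""
    flagged = []
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"])
        # A prefix length of 0 means the rule matches any source IP.
        if network.prefixlen == 0:
            flagged.append(rule)
    return flagged


if __name__ == "__main__":
    # Hypothetical rules, for illustration only.
    example_rules = [
        {"port": 22, "cidr": "0.0.0.0/0"},
        {"port": 443, "cidr": "10.0.0.0/16"},
        {"port": 8080, "cidr": "::/0"},
    ]
    for rule in world_open_rules(example_rules):
        print("open to the world: port", rule["port"], "from", rule["cidr"])
```

Anything this flags is worth a second look, especially SSH and admin ports like the one Jenkins listens on.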
While doing some research for this post, I found out that this exploit was widespread around the time it hit our server. It affected many servers running old versions of Jenkins.
This was first published on Medium.