QEMU Live Patching

Sysadmins know the risks of running unpatched services. Given the choice, and unlimited resources, most hardworking administrators would ensure that all systems and services are patched consistently.

But things are rarely that simple. Technical resources are limited, and patching can often be more complicated than it appears at first glance. Worse, some services are so hidden in the background that they just don't make it onto the list of things to be patched.

QEMU is one of those services that tend to create difficulties with patching. It works away in the background and is easy to take for granted. Plus, patching QEMU involves significant technical and practical challenges – while requiring enormous resources.

In this article, we'll address some of the difficulties around patching QEMU, and point to a solution that takes the toughest bits out of QEMU patching.

Ignoring QEMU patching is a big risk

You'll probably know about it if you're using QEMU – short, of course, for Quick EMUlator – because QEMU will be delivering critical virtualization capabilities that support your workloads. That said, what you might not realize is that just like the host OS, the virtualized OS, and all your applications, QEMU also needs to be updated on a regular basis – even though it works in the background.

It's not just a scare story. QEMU has been proven to be just as vulnerable as any other service, library, or component. For example, in 2015, the virtual floppy controller in QEMU was found to be vulnerable. The flaw, known as the Venom bug (CVE-2015-3456), affected systems whether or not the QEMU virtual floppy was in use.

Likewise, in 2019, organizations that use the KVM/QEMU hypervisor to run Linux instances were at the receiving end of a security flaw that put countless systems at risk. And, just like any other commonly used software, it's likely that more flaws will be discovered in QEMU.

In other words, if you don't patch, your systems will be at risk. But there's a problem: patching QEMU isn't straightforward, because it affects the virtualized workloads that run on top of it. While QEMU is stopped and restarted, the virtual workload must stop too.
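A practical consequence is that upgrading the QEMU package on disk does not, by itself, patch guests that are already running: each running QEMU process keeps executing the old binary until it is restarted. As a minimal sketch, assuming a Linux host (where the `/proc/<pid>/exe` link gains a ` (deleted)` suffix once the original binary has been replaced), the following Python lists QEMU processes that are still running a replaced binary. The helper names and the binary name prefixes are illustrative assumptions, not part of any QEMU tooling.

```python
import os
from typing import List

# Linux appends this marker to the /proc/<pid>/exe target once the
# original file has been unlinked, e.g. by a package upgrade.
DELETED_SUFFIX = " (deleted)"

# Assumed binary name prefixes; adjust for your distribution.
QEMU_PREFIXES = ("qemu-system", "qemu-kvm")


def is_stale_exe(exe_link: str) -> bool:
    """True if the exe link points at a binary that was replaced on disk."""
    return exe_link.endswith(DELETED_SUFFIX)


def is_qemu_exe(exe_link: str) -> bool:
    """True if the exe link looks like a QEMU binary (stale or not)."""
    path = exe_link
    if is_stale_exe(path):
        path = path[: -len(DELETED_SUFFIX)]
    return os.path.basename(path).startswith(QEMU_PREFIXES)


def find_stale_qemu(proc_root: str = "/proc") -> List[int]:
    """Return PIDs of QEMU processes still running a replaced binary."""
    stale = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            link = os.readlink(os.path.join(proc_root, entry, "exe"))
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        if is_qemu_exe(link) and is_stale_exe(link):
            stale.append(int(entry))
    return sorted(stale)


if __name__ == "__main__":
    for pid in find_stale_qemu():
        print(f"PID {pid} is still running a replaced QEMU binary")
```

Reading other users' `/proc/<pid>/exe` links generally requires root, so an empty result from an unprivileged run is not proof that every guest is on the patched binary.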

Your options for patching QEMU

Patching a single service on a single system usually isn't an issue, assuming you remember to do it. Even patching a single OS isn't that hard, as you can usually cope with a single restart, though it is disruptive nonetheless because every application restarts. Patching a fleet of operating systems is a lot tougher, because it could imply thousands of restarts and disruption to countless apps.

Because QEMU is a virtualization service, patching it has far bigger implications than patching an ordinary application. Patch QEMU and you have to restart the guest operating systems that run on it.

In other words, applying a patch to a single service, QEMU, can lead to the forced restart of thousands of operating systems. That significantly complicates QEMU patching, and it can mean that tech teams sometimes delay patching QEMU, accepting the risk of unpatched vulnerabilities because they view the disruption as too big.

Patching is a must, however. There are, of course, shortcuts when it comes to updating QEMU, and there is a right way to do it. Here are some of your options.

The quick but very risky method

Your simplest, but most disruptive, option is to apply a patch, restart, and see what happens. If it's just a single machine, you could be okay. After all, you'll be aware that you're going to need to restart your workload.

However, if you're managing QEMU across a server fleet, or in environments where there are external stakeholders involved, simply patching and triggering reboots across all the machines will, without a doubt, lead to many upset people.

A sensible approach

Instead of just restarting, most level-headed sysadmins will go and add a bit more planning to the above procedure. To start off with, you'll notify everyone affected by setting up a planned maintenance window with scheduled downtime – say, a month in advance. The problem is, of course, that you'd have to hope you are not hacked within that month.

However, during the maintenance window, you'll have an opportunity to patch without upsetting anyone, provided a few hours without service can be tolerated. Once you restart QEMU, all the virtual machines should restart, and you can inform the stakeholders that patching is complete.

Nonetheless, you're likely going to set yourself up for a fair period of troubleshooting after the restarts, and though you won't get anything thrown at you, even planned maintenance windows are challenging for everyone involved. There are also many scenarios where planned maintenance that involves actual downtime simply wouldn't be acceptable.

Enterprise-grade approach

Some workloads won't deal well with the disruption caused by operating system restarts. In enterprise environments, you'll need another plan. You'll need to take a much more involved approach: a live migration of the QEMU workload.

You can only do this if your workload is already split across multiple hosts and you have high availability activated across those nodes. You then kick off patching by informing your stakeholders that a maintenance window is coming up, one that will affect performance but shouldn't affect availability.

Relying on your high availability operation, you migrate the virtual machines across, then stop QEMU, patch it, and restart it. After the restart, you migrate the virtual machines back to the patched QEMU instances.

Done correctly, patching by migration ensures that your QEMU instances are safely patched without upsetting stakeholders through real downtime.
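The drain, patch, and migrate-back sequence described above can be sketched as a simple plan generator. This is only an illustration under stated assumptions: it produces an ordered list of steps and does not call any hypervisor or QEMU API, and the host names, VM names, and the "park VMs on the next host" placement rule are all hypothetical. A real rollout would also need capacity checks on the receiving host.

```python
from typing import Dict, List


def rolling_patch_plan(hosts: Dict[str, List[str]]) -> List[str]:
    """Build an ordered list of steps that patches one host at a time.

    `hosts` maps a host name to the names of the VMs it currently runs.
    """
    if len(hosts) < 2:
        raise ValueError("live migration needs at least two hosts")

    names = list(hosts)
    steps: List[str] = []
    for i, host in enumerate(names):
        # Hypothetical placement rule: park VMs on the next host in the list.
        spare = names[(i + 1) % len(names)]
        # 1. Drain the host by migrating its VMs away.
        for vm in hosts[host]:
            steps.append(f"migrate {vm}: {host} -> {spare}")
        # 2. With no guests left on it, the host can be patched safely.
        steps.append(f"patch and restart QEMU on {host}")
        # 3. Move the VMs back onto the patched host.
        for vm in hosts[host]:
            steps.append(f"migrate {vm}: {spare} -> {host}")
    return steps


if __name__ == "__main__":
    plan = rolling_patch_plan({"host-a": ["vm1", "vm2"], "host-b": ["vm3"]})
    for number, step in enumerate(plan, start=1):
        print(f"{number}. {step}")
```

Patching one host at a time keeps every VM running somewhere throughout, which is why the approach affects performance rather than availability.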

The problem with QEMU migration

We've talked about three different approaches to patching QEMU, and the migration route is, without a doubt, the best option for organizations that rely on QEMU to drive large workloads. But even this enterprise-grade approach carries risks. You are performing a very complex procedure that, like all complex procedures, can fail.

Some of the things that can go wrong include:

  • Performance may be significantly degraded during migration – which may impact stakeholder and user satisfaction, particularly where migration takes longer than expected.
  • Coordinating a maintenance window, which is nonetheless required due to possible performance disruption, can still be challenging and time-consuming – while leading to a degree of annoyance for stakeholders.
  • During the migration operation, minor network packet loss is usually tolerable, but some workloads are sensitive to it, which can cause significant problems.
  • You need to test and verify post-migration – you can't assume that everything has migrated smoothly, and you may need to involve stakeholders through this testing process.
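
The verification step in the last bullet can be partly automated. As a rough sketch, assuming you can capture a snapshot of which VM runs on which host before and after the maintenance (how that snapshot is collected depends on your tooling and isn't shown here), the following Python compares the two and reports differences. The function name and the snapshot format are hypothetical.

```python
from typing import Dict, List


def verify_migration(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Compare VM -> host snapshots taken before and after patching.

    Returns a list of human-readable problems. An empty list means every
    VM is back on the host it started on and no VM appeared or vanished.
    """
    problems: List[str] = []
    for vm, host in sorted(before.items()):
        if vm not in after:
            problems.append(f"{vm}: missing after migration")
        elif after[vm] != host:
            problems.append(f"{vm}: expected on {host}, found on {after[vm]}")
    for vm in sorted(set(after) - set(before)):
        problems.append(f"{vm}: unexpected VM present after migration")
    return problems


if __name__ == "__main__":
    before = {"vm1": "host-a", "vm2": "host-b"}
    after = {"vm1": "host-a", "vm2": "host-a"}
    for problem in verify_migration(before, after):
        print(problem)
```

A check like this only confirms placement. It says nothing about whether the applications inside each guest are healthy, so it complements stakeholder testing without replacing it.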

Performing QEMU updates through a migration process limits disruption, but your team nonetheless needs to invest significant amounts of time in the process. The risk that something goes wrong remains – and there's a small risk of catastrophic failure.

So, while it's unlikely your stakeholders will see significant disruption, your team will need to do careful planning. Finally, it's worth considering that any adverse outcome of the migration process – small as the risk may be – will reflect negatively on you and your team.

Live patching as an alternative

In the past, patching always depended on a stop, patch, restart process. Yes, migration helps by ensuring the instances that require restarting are available. But a fresher approach has become increasingly common: patching on the fly, without restarting the software that's being patched.

Called live patching, this approach significantly simplifies the patching process. Instead of requiring a restart, live patching updates your server, or the service that needs patching, on the go. That's the case for QEMU live patching too: you can now install the latest patches for QEMU without setting up a maintenance window and without having to plan and execute a migration.

That's why QEMUCare, from TuxCare, is a game-changer for teams that run workloads on QEMU. QEMUCare doesn't just make the update and migration process easier: it takes it away completely. Your QEMU/KVM instances are patched instantly, with no impact on the virtual machines they run.

Choosing the live patching route brings a whole range of advantages:

  • Consistent patching. A good live patching solution such as QEMUCare will automatically detect the release of a new patch and initiate the patching process. Your team doesn't even need to monitor for patch releases: QEMUCare just takes care of it. That means that your team patches more consistently – reducing the risk that your QEMU instances are vulnerable to a new exploit.
  • Happy stakeholders. Because QEMUCare works in the background, automatically patching without rebooting QEMU, your stakeholders – including internal users and your customers or clients – won't even know that you're performing patching. It all happens seamlessly without the need for planned maintenance windows.
  • Eliminates labor hours. While you have the option of trying to take a shortcut, the enterprise-grade, migration-driven process for patching we described before is your only realistic choice. It is very labor-intensive, however, consuming lots of hours from your team – whereas QEMUCare consumes almost zero hours from your team.
  • Minimizes risk of error. Because you don't have to migrate your workloads manually, there is less risk that patching QEMU will cause you significant problems. There are no migration glitches or network errors to worry about, and you and your team members don't need to worry about your jobs.

Clearly, live patching greatly simplifies the process of keeping your QEMU instances up to date: it happens automatically, you don't need to worry about anything going wrong – and you don't need to invest a lot of time to get it done.

QEMU patching is essential – and live patching makes it much easier

QEMU may be quietly doing its job in the background, but you can't ignore it from a cybersecurity perspective.

You must patch QEMU, but it's understandable that your team may be daunted by the prospect.

While thorough planning and a maintenance window will get you there, live patching just makes it so much easier – you can patch more frequently, and with less effort. So, if you're dependent on QEMU for your workload, consider how live patching from TuxCare can benefit your team.

This article is a contributed piece from one of our valued partners.