Virtio-IOMMU interrupt remapping design
Virtio-IOMMU interrupt remapping turned out to be much harder than I realized. The main problem is that interrupt remapping is set up very early in boot. In fact, Linux calls the interrupt remapping probe function from the APIC initialization code: x86_64_probe_apic -> enable_IR_x2apic -> irq_remapping_prepare(). This is almost certainly much before PCI has been initialized. Also, the order in which devices will be initialized is not something Linux guarantees at all, which is a problem because interrupt remapping must be initialized before drivers start setting up interrupts. Otherwise, the interrupt remapping table won't include entries for already-existing interrupts, and things will either break badly, not get the benefit of interrupt remapping security-wise, or both.
The reason I expect this doesn't cause problems for address translation is that the IOMMU probably starts in bypass mode by default, meaning that all DMA is permitted. If the IOMMU is only used by VFIO or IOMMUFD, it will not be needed until userspace starts up, which is after the IOMMU has been initialized. This isn't ideal, though, as it means that kernel drivers operate without DMA protection.
Is a paravirtualized IOMMU with interrupt remapping something that makes sense? Absolutely! However, the IOMMU should be considered a platform device that must be initialized very early in boot. Using virtio-IOMMU with MMIO transport as the interface might be a reasonable option, but the IOMMU needs to be enumerated via ACPI, device tree, or kernel command line argument. This allows it to be brought up before anything capable of DMA is initialized.
Is this the right path to go down? What do others think about this?
--
Sincerely,
Demi Marie Obenour (she/her/hers)
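The ordering hazard described in the message above can be illustrated with a toy model. This is only a sketch: `ToyRemapper`, `boot`, and the device names are hypothetical and do not correspond to any kernel interface. It assumes that an interrupt set up before remapping is enabled never gets a remapping-table entry.

```python
class ToyRemapper:
    """Hypothetical stand-in for an interrupt remapping unit."""

    def __init__(self):
        self.enabled = False
        self.table = {}  # device name -> toy remapping entry

    def enable(self):
        self.enabled = True

    def setup_irq(self, device):
        """Called when a driver sets up an interrupt.

        Returns True if the interrupt received a remapping entry,
        False if remapping was not enabled yet.
        """
        if not self.enabled:
            return False
        self.table[device] = "entry:" + device
        return True


def boot(order):
    """Walk an initialization order and return devices left unremapped.

    The string "iommu" marks the point where remapping is enabled;
    every other string is a device whose driver sets up an interrupt.
    """
    remapper = ToyRemapper()
    unremapped = []
    for step in order:
        if step == "iommu":
            remapper.enable()
        elif not remapper.setup_irq(step):
            unremapped.append(step)
    return unremapped
```

With `boot(["iommu", "nic", "nvme"])` every interrupt is covered, while `boot(["nic", "iommu", "nvme"])` leaves `nic` outside the table, which is the situation an unordered device probe cannot rule out.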
On Sun, Jun 15, 2025 at 02:47:15PM -0400, Demi Marie Obenour wrote:
Is a paravirtualized IOMMU with interrupt remapping something that makes sense?
IMHO linking interrupt remapping to the iommu is a poor design, interrupt routing belongs in the irq subsystem, not in the iommu.
The fact AMD and Intel both coupled their interrupt routing to their iommu hardware is just a weird design decision. ARM didn't do this, for instance.
So I would not try to do this at all, you should have a para-virtualized IRQ interface, not an extension to virtio-iommu adding interrupt handling. :\
AFAIK hyperv shows how to build something like this.
Jason
On 6/16/25 09:20, Jason Gunthorpe wrote:
On Sun, Jun 15, 2025 at 02:47:15PM -0400, Demi Marie Obenour wrote:
Is a paravirtualized IOMMU with interrupt remapping something that makes sense?
IMHO linking interrupt remapping to the iommu is a poor design, interrupt routing belongs in the irq subsystem, not in the iommu.
I agree.
The fact AMD and Intel both coupled their interrupt routing to their iommu hardware is just a weird design decision. ARM didn't do this, for instance.
Arm did the right thing here, IMO.
So I would not try to do this at all, you should have a para-virtualized IRQ interface, not an extension to virtio-iommu adding interrupt handling. :\
I don't disagree at all.
AFAIK hyperv shows how to build something like this.
Would this need KVM patches? I'm concerned that implementing this in userspace would interact badly with the irqfd fast path.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
On Mon, Jun 16, 2025 at 12:53:40PM -0400, Demi Marie Obenour wrote:
AFAIK hyperv shows how to build something like this.
Would this need KVM patches? I'm concerned that implementing this in userspace would interact badly with the irqfd fast path.
I don't know. I think you get the same issues even if you did virtio-iommu irq handling, it shouldn't be any different.
I'm not sure there even is a fast path here, remapping happens during initial vector setup/affinity change only. That isn't fast path. So long as the MSI is delivered to the correct CPU vector entirely in KVM it seems OK.
And the hyperv approach of asking the hypervisor for the addr/data pair to achieve certain parameters will work a lot better with existing Linux than trying to build an iommu emulation where the guest is building its own private addr/data pairs :\
Jason
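The idea of asking the hypervisor for an addr/data pair, rather than having the guest compose one, can be sketched abstractly as follows. Everything here is hypothetical: `ToyHypervisor`, `request_msi`, `retarget`, and the token-in-data encoding are illustrative assumptions and are not the Hyper-V or KVM interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MsiMessage:
    """An opaque addr/data pair that the guest programs into a device."""
    addr: int
    data: int


class ToyHypervisor:
    """Hypothetical paravirtual IRQ interface.

    The guest states the outcome it wants (vCPU and vector); the
    hypervisor picks the encoding and keeps the routing table itself.
    """

    def __init__(self):
        self._routes = {}     # token -> (vcpu, vector)
        self._next_token = 0

    def request_msi(self, vcpu, vector):
        token = self._next_token
        self._next_token += 1
        self._routes[token] = (vcpu, vector)
        # The address is an arbitrary placeholder; only the hypervisor
        # interprets the pair, so the guest treats it as opaque.
        return MsiMessage(addr=0xFEE00000, data=token)

    def retarget(self, msg, vcpu, vector):
        """Affinity change: the pair programmed in the device stays the same."""
        self._routes[msg.data] = (vcpu, vector)

    def deliver(self, msg):
        """Resolve a device-written message to its current target."""
        return self._routes[msg.data]
```

In this model routing changes happen only in `request_msi` and `retarget`, which matches the observation above that remapping is touched at vector setup and affinity change rather than on every interrupt.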
On Mon, Jun 16, 2025 at 10:20:31AM -0300, Jason Gunthorpe wrote:
On Sun, Jun 15, 2025 at 02:47:15PM -0400, Demi Marie Obenour wrote:
Is a paravirtualized IOMMU with interrupt remapping something that makes sense?
IMHO linking interrupt remapping to the iommu is a poor design, interrupt routing belongs in the irq subsystem, not in the iommu.
The fact AMD and Intel both coupled their interrupt routing to their iommu hardware is just a weird design decision. ARM didn't do this, for instance.
why does it matter in which device it resides?
Way I see it, there is little reason to remap interrupts without also using an iommu, so why not a single device. what did I miss?
So I would not try to do this at all, you should have a para-virtualized IRQ interface, not an extension to virtio-iommu adding interrupt handling. :\
AFAIK hyperv shows how to build something like this.
Jason
On Tue, Jun 17, 2025 at 03:44:20PM -0400, Michael S. Tsirkin wrote:
On Mon, Jun 16, 2025 at 10:20:31AM -0300, Jason Gunthorpe wrote:
On Sun, Jun 15, 2025 at 02:47:15PM -0400, Demi Marie Obenour wrote:
Is a paravirtualized IOMMU with interrupt remapping something that makes sense?
IMHO linking interrupt remapping to the iommu is a poor design, interrupt routing belongs in the irq subsystem, not in the iommu.
The fact AMD and Intel both coupled their interrupt routing to their iommu hardware is just a weird design decision. ARM didn't do this, for instance.
why does it matter in which device it resides?
It would clean up the boot process if the IRQ components were available at the same time as the IRQ drivers instead of much later when the iommu gets plugged in.
Way I see it, there is little reason to remap interrupts without also using an iommu, so why not a single device. what did I miss?
Remapping interrupts can be understood to be virtualizing the MSI addr/data pair space so that the CPU controls where the interrupt goes through its internal tables, not the device through the addr/data.
On x86 you also need to use remapping to exceed the max CPU count that can be encoded in the MSI, no iommu required to need this.
There is also some stuff related to IMS that could get improved here.
You don't need an iommu to enjoy those benefits.
Jason
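The CPU-count limit mentioned above follows from the x86 compatibility-format MSI address, which carries the destination APIC ID in an 8-bit field (address bits 19:12) under the fixed 0xFEE prefix. The sketch below shows that encoding next to a toy remapping table; `RemapTable` and its opaque handle are hypothetical and deliberately do not model the real remappable-format bit layout.

```python
MSI_ADDRESS_BASE = 0xFEE00000  # fixed prefix in address bits 31:20
DEST_ID_SHIFT = 12             # destination APIC ID occupies bits 19:12
DEST_ID_BITS = 8


def compat_msi_address(dest_apic_id):
    """Encode a compatibility-format MSI address for one destination."""
    if not 0 <= dest_apic_id < (1 << DEST_ID_BITS):
        raise ValueError("destination APIC ID does not fit in 8 bits")
    return MSI_ADDRESS_BASE | (dest_apic_id << DEST_ID_SHIFT)


class RemapTable:
    """Toy remapping table (hypothetical).

    The device only carries a handle; the destination and vector live
    in a table on the CPU side, so they are not limited by the width
    of the field in the MSI address.
    """

    def __init__(self):
        self._entries = []

    def allocate(self, dest_apic_id, vector):
        self._entries.append((dest_apic_id, vector))
        return len(self._entries) - 1  # handle given to the device

    def resolve(self, handle):
        return self._entries[handle]
```

A destination such as APIC ID 1000 cannot be passed to `compat_msi_address`, but it can be stored in a `RemapTable` entry and reached through a small handle, with no DMA translation involved at any point.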
On Tue, Jun 17, 2025 at 04:57:20PM -0300, Jason Gunthorpe wrote:
On Tue, Jun 17, 2025 at 03:44:20PM -0400, Michael S. Tsirkin wrote:
On Mon, Jun 16, 2025 at 10:20:31AM -0300, Jason Gunthorpe wrote:
On Sun, Jun 15, 2025 at 02:47:15PM -0400, Demi Marie Obenour wrote:
Is a paravirtualized IOMMU with interrupt remapping something that makes sense?
IMHO linking interrupt remapping to the iommu is a poor design, interrupt routing belongs in the irq subsystem, not in the iommu.
The fact AMD and Intel both coupled their interrupt routing to their iommu hardware is just a weird design decision. ARM didn't do this, for instance.
why does it matter in which device it resides?
It would clean up the boot process if the IRQ components were available at the same time as the IRQ drivers instead of much later when the iommu gets plugged in.
Way I see it, there is little reason to remap interrupts without also using an iommu, so why not a single device. what did I miss?
Remapping interrupts can be understood to be virtualizing the MSI addr/data pair space so that the CPU controls where the interrupt goes through its internal tables, not the device through the addr/data.
On x86 you also need to use remapping to exceed the max CPU count that can be encoded in the MSI, no iommu required to need this.
More of an x86 quirk though, isn't it?
There is also some stuff related to IMS that could get improved here.
You don't need an iommu to enjoy those benefits.
Jason
On Tue, Jun 17, 2025 at 04:01:53PM -0400, Michael S. Tsirkin wrote:
On x86 you also need to use remapping to exceed the max CPU count that can be encoded in the MSI, no iommu required to need this.
More of an x86 quirk though, isn't it?
Yes, but so is bundling IOMMU and remapping HW together <shrug>
GIC fully integrates it into the interrupt controller architecture.
Jason
On Sun, Jun 15, 2025 at 02:47:15PM -0400, Demi Marie Obenour wrote:
Virtio-IOMMU interrupt remapping turned out to be much harder than I realized. The main problem is that interrupt remapping is set up very early in boot. In fact, Linux calls the interrupt remapping probe function from the APIC initialization code: x86_64_probe_apic -> enable_IR_x2apic -> irq_remapping_prepare(). This is almost certainly much before PCI has been initialized. Also, the order in which devices will be initialized is not something Linux guarantees at all, which is a problem because interrupt remapping must be initialized before drivers start setting up interrupts. Otherwise, the interrupt remapping table won't include entries for already-existing interrupts, and things will either break badly, not get the benefit of interrupt remapping security-wise, or both.
The reason I expect this doesn't cause problems for address translation is that the IOMMU probably starts in bypass mode by default, meaning that all DMA is permitted. If the IOMMU is only used by VFIO or IOMMUFD, it will not be needed until userspace starts up, which is after the IOMMU has been initialized. This isn't ideal, though, as it means that kernel drivers operate without DMA protection.
Is a paravirtualized IOMMU with interrupt remapping something that makes sense? Absolutely! However, the IOMMU should be considered a platform device that must be initialized very early in boot. Using virtio-IOMMU with MMIO transport as the interface might be a reasonable option, but the IOMMU needs to be enumerated via ACPI, device tree, or kernel command line argument. This allows it to be brought up before anything capable of DMA is initialized.
Is this the right path to go down? What do others think about this? -- Sincerely, Demi Marie Obenour (she/her/hers)
The proper place for this discussion is also virtio-comment; this ML is for driver work.