Moving the implementation to userspace allows us more flexibility, and
more consistency in the implementation of timekeeping for the various
clock chips; it becomes easier to follow the nuances of real hardware
in this area.
Interestingly, while the IOAPIC/PIC code was being written we proposed
making it independent of the local APIC; had we done so, the move
would have been much easier (simply dropping the existing code).
Advantages of a move
====================
1. Reduced kernel footprint
Good for security, and allows fixing bugs without reboots.
2. Centralized timekeeping
Instead of having one solution for PIT timekeeping, and another for
RTC and HPET timekeeping, we can have all timer chips in userspace.
The local APIC timer still needs to be in the kernel - it is much too
high bandwidth to be in userspace; but on the other hand it is very
different from the other timer chips.
3. Flexibility
Easier to have weird board layouts (multiple IOAPICs, etc.). Not a
very strong advantage.
Disadvantages
=============
1. Still need to keep the old code around for a long while
We can't just rip it out - old userspace depends on it. So the
security advantages are only with cooperating userspace, and the other
advantages only show up with new userspace.
2. Need to bring the qemu code up to date
The current qemu ioapic code lags some way behind the kernel; it also
needs PIT timekeeping.
3. May need kernel support for interval-timer-follows-thread
Currently the timekeeping code has an optimization which causes the
hrtimer that models the PIT to follow the BSP (which is most likely to
receive the interrupt); this reduces cpu cross-talk.
I don't think the kernel interval timer code has such an optimization;
we may need to implement it.
4. Much churn
This is a lot of work.
Proposed interface
==================
1. KVM_SET_LINT_PIN (vcpu ioctl)
Sets the value (0 or 1) that a vcpu's LINT0 or LINT1 senses.
Note: problematic; may be high frequency but ignored due to masking at
the local APIC LVT level. Will also be broadcast across all vcpus by
userspace with typical configurations. We may need a way to tell
userspace we'll be ignoring those signals.
May also be extended to emulate thermal interrupts if someone feels
the need.
An alternative is a couple of new fields in kvm_run which are sampled
on every entry (unless masked).
2. KVM_EXIT_REASON_INTACK (kvm_run exit reason)
Informs userspace that the vcpu is running an INTACK cycle; userspace
should provide the interrupt vector on the next KVM_RUN.
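To illustrate what userspace would have to do on such an exit, here is
a minimal sketch of resolving one INTACK cycle against a userspace PIC
model. Everything in it is an assumption made for illustration: struct
pic_model and pic_intack() are hypothetical names, the model is a
heavily simplified 8259 (fixed priority, edge-triggered, no cascade),
and the kvm_run field that would carry the vector back is not defined
by this proposal, so the sketch stops at computing the vector.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified 8259-style model kept in userspace. */
struct pic_model {
    uint8_t irr;          /* requests latched from the IRQ lines */
    uint8_t imr;          /* 1 = line masked */
    uint8_t isr;          /* interrupts currently in service */
    uint8_t vector_base;  /* programmed base; low 3 bits select the IRQ */
};

/* Lowest set bit index (IRQ0 has the highest priority), or 8 if none. */
static int pic_highest_priority(uint8_t bits)
{
    for (int irq = 0; irq < 8; irq++)
        if (bits & (1u << irq))
            return irq;
    return 8;
}

/* Resolve one INTACK cycle and return the vector to hand to the vcpu. */
static uint8_t pic_intack(struct pic_model *pic)
{
    int req = pic_highest_priority(pic->irr & (uint8_t)~pic->imr);
    int svc = pic_highest_priority(pic->isr);

    /* Nothing deliverable (or blocked by an in-service IRQ): spurious IRQ7. */
    if (req >= svc)
        return (uint8_t)((pic->vector_base & 0xf8) | 7);

    assert(req >= 0 && req < 8);
    pic->irr &= (uint8_t)~(1u << req);  /* edge-triggered: consume request */
    pic->isr |= (uint8_t)(1u << req);   /* mark in service until EOI */
    return (uint8_t)((pic->vector_base & 0xf8) | req);
}
```

In a run loop, pic_intack() would be called when the exit reason is
KVM_EXIT_REASON_INTACK, and its result passed back before re-entering
the guest.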
3. KVM_APIC_MESSAGE (vm ioctl)
Sends an APIC message on the APIC message bus, if the destination is
in the kernel (typically IOAPIC interrupt messages).
4. KVM_EXIT_REASON_APIC_MESSAGE (kvm_run exit reason)
Delivers to userspace an APIC message whose destination is not in the
kernel (typically IOAPIC EOI messages).
The above are all architectural, and correspond to wires on physical
systems. This increases the confidence that they are correct.
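As a rough sketch of the userspace side of KVM_APIC_MESSAGE, the code
below turns an IOAPIC redirection table entry into the message a
userspace IOAPIC would put on the bus. The bit positions follow the
usual IOAPIC redirection entry layout; struct apic_message and
ioapic_rte_to_message() are hypothetical names, since the proposal
does not define the ioctl argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical payload for the proposed KVM_APIC_MESSAGE ioctl. */
struct apic_message {
    uint8_t dest;           /* destination APIC ID or logical mask */
    uint8_t dest_mode;      /* 0 = physical, 1 = logical */
    uint8_t delivery_mode;  /* 0 = fixed, 1 = lowest priority, ... */
    uint8_t vector;
    uint8_t trigger_mode;   /* 0 = edge, 1 = level */
};

/*
 * Decode a 64-bit IOAPIC redirection table entry into a message.
 * Returns false (and leaves *msg untouched) if the entry is masked,
 * in which case nothing should be sent.
 */
static bool ioapic_rte_to_message(uint64_t rte, struct apic_message *msg)
{
    assert(msg);

    if (rte & (1ull << 16))               /* mask bit */
        return false;

    msg->vector        = rte & 0xff;         /* bits 0-7 */
    msg->delivery_mode = (rte >> 8) & 0x7;   /* bits 8-10 */
    msg->dest_mode     = (rte >> 11) & 0x1;  /* bit 11 */
    msg->trigger_mode  = (rte >> 15) & 0x1;  /* bit 15 */
    msg->dest          = (rte >> 56) & 0xff; /* bits 56-63 */
    return true;
}
```

For level-triggered entries, the matching EOI would come back through
KVM_EXIT_REASON_APIC_MESSAGE so that userspace can clear remote IRR.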
5. KVM_REQUEST_EOI (vcpu ioctl) / KVM_EXIT_EOI (kvm_run exit reason)
We will get EOI messages via KVM_EXIT_REASON_APIC_MESSAGE for
level-triggered interrupts. However, for timekeeping we will also
need an EOI for edge-triggered interrupts (if we choose the ack
notifier method for timekeeping).
6. KVM_EXIT_REASON_LVT_MASK (kvm_run exit reason)
A notification that the LVT LINT0 or LVT LINT1 mask bit has changed,
so that userspace can avoid issuing useless KVM_SET_LINT_PIN ioctls;
also useful for timekeeping (can disable PIT if configured with ExtInt
mode or lapic disabled).
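One way userspace might combine this notification with
KVM_SET_LINT_PIN is to cache the mask state per vcpu, drop pin updates
while the pin is masked, and resynchronize the level on unmask. The
sketch below assumes exactly that; struct lint_pin_state and the
helper names are hypothetical, and the initial "masked" state assumes
the LVT entries start out masked, as they do after a local APIC reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vcpu cache of LINT0/LINT1 state kept by userspace. */
struct lint_pin_state {
    bool masked[2];  /* last mask bit reported for LINT0, LINT1 */
    int  wire[2];    /* current level driven by the device model */
    int  sent[2];    /* last level forwarded to the kernel, -1 if none */
};

static void lint_pin_init(struct lint_pin_state *s)
{
    for (int pin = 0; pin < 2; pin++) {
        s->masked[pin] = true;  /* assumed masked until told otherwise */
        s->wire[pin] = 0;
        s->sent[pin] = -1;
    }
}

/*
 * Record a new level on the pin. Returns true if the caller should
 * issue the proposed KVM_SET_LINT_PIN ioctl with that level.
 */
static bool lint_pin_update(struct lint_pin_state *s, int pin, int level)
{
    assert(pin == 0 || pin == 1);
    assert(level == 0 || level == 1);

    s->wire[pin] = level;
    if (s->masked[pin] || s->sent[pin] == level)
        return false;           /* ignored by the kernel, or no change */
    s->sent[pin] = level;
    return true;
}

/*
 * Handle a mask-change notification. Returns true if the current wire
 * level must now be forwarded because the kernel's view is stale.
 */
static bool lint_pin_set_mask(struct lint_pin_state *s, int pin, bool masked)
{
    assert(pin == 0 || pin == 1);

    s->masked[pin] = masked;
    if (masked || s->sent[pin] == s->wire[pin])
        return false;
    s->sent[pin] = s->wire[pin];
    return true;
}
```

With this in place, a PIT ticking into a masked LINT0 costs userspace a
couple of memory writes per tick rather than an ioctl per vcpu.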
7. KVM_EXIT_REASON_APIC_MESSAGE_ACK (kvm_run exit reason)
If we use the current timekeeping method of detecting coalesced
interrupts, we'll need an acknowledge when an APIC message is accepted
by a local APIC, with the result (interrupt queued or interrupt
coalesced). This will need to be selectable by vcpu and vector number.
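For illustration, the bookkeeping userspace would do with such
acknowledgements might look like the sketch below: count ticks that
were coalesced and reinject them one at a time when the guest is ready
for another interrupt (for example on EOI). The enum values, struct
tick_account, and both helpers are hypothetical; the proposal does not
specify how the result is reported.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result carried by KVM_EXIT_REASON_APIC_MESSAGE_ACK. */
enum apic_ack_result {
    APIC_ACK_QUEUED,     /* interrupt was accepted and queued */
    APIC_ACK_COALESCED,  /* merged with an already pending interrupt */
};

/* Per-timer accounting of delivered and lost ticks. */
struct tick_account {
    unsigned int delivered;  /* ticks the guest will actually see */
    unsigned int backlog;    /* coalesced ticks still owed to the guest */
};

/* Process the acknowledgement for one timer interrupt message. */
static void tick_ack(struct tick_account *t, enum apic_ack_result result)
{
    assert(t);
    if (result == APIC_ACK_QUEUED)
        t->delivered++;
    else
        t->backlog++;
}

/*
 * Called when the guest can take another interrupt (e.g. after EOI).
 * Returns true if the caller should send one more timer message; the
 * owed tick is consumed here and re-added by tick_ack() if the
 * reinjected message is coalesced again.
 */
static bool tick_should_reinject(struct tick_account *t)
{
    assert(t);
    if (t->backlog == 0)
        return false;
    t->backlog--;
    return true;
}
```

Because only timer vectors need this, selecting the acknowledgement by
vcpu and vector keeps the extra exits off every other interrupt.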
8. KVM_CREATE_IRQCHIP (vm ioctl)
A new flag that tells kvm not to create a PIC and IOAPIC.