-
Notifications
You must be signed in to change notification settings - Fork 158
Properly CoW page tables to the scratch region #1198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Previously, this was inconsistent. Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
08e43a5 to
7e3f9a5
Compare
Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
whp can now map any region as long as it contains a file handle (which, presently, only snapshot and scratch regions do---not user-provided regions). Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
This fixes an error introduced by a merge conflict in a previous PR. Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
1b10f54 to
22173e0
Compare
Previously, in 7ca0a78, MAP_SHARED was replaced with MAP_PRIVATE, primarily to bring the Miri and non-Miri code paths closer together, since it did not appear to have any detrimental effect. However, this seems to cause issues with the scratch mapping on mshv on Linux. It is not entirely clear what mshv is doing here; possibly something like "mapping a page which has never been written to" causes the problem. Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
As per the comments preexisting in src/hyperlight_common/src/layout.rs. This avoids some fairly cumbersome proliferation of #[cfg(feature = "init-paging")], and may be developed further in the future. Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
Instead of updating page tables in place, and using the heap region memory for new page tables, this changes the hyperlight_guest virtual memory subsystem to treat snapshot page tables as copy-on-write and to allocate all new page tables in the scratch region. Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
These were already being CoW'd. Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
Now that some important things are being written to the scratch region, this makes the snapshot process aware of it. In particular, rather than just copying the current snapshot region into a new snapshot, the snapshot process now traverses the guest page tables and picks all pages that are mapped into guest virtual memory, placing them into the new snapshot along with new page tables which account for all physical memory having moved to the snapshot region. Several tests are temporarily disabled in this commit. One, `snapshot_restore_handles_remapping_correctly`, will be re-enabled in the next commit; its failure is due to an issue with state in the guest page mapping process not being reset on snapshot. Several others, related to the stack guard page, are also disabled, because the snapshot process loses the context, kept in the host, of which physical page is the stack guard page. These will be re-enabled in an upcoming patch series, which moves stack management fully into the guest. This commit also disables an optimisation around restoring to a just-taken snapshot in certain cases; see the comment at the head of `MultiUseSandbox::restore` for details. Note that the optimisation of taking a new snapshot immediately after a restore remains intact. Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
Previously, the guest tried to infer the addresses of the snapshot page tables from the values of cr3 that it saw at various times, but this was brittle and irritating to do without writing to the wrong address at times. This replaces that mechanism with an extra word in the metadata portion of the scratch region that explicitly provides this information. Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
22173e0 to
350a605
Compare
Remove the unused constant `SCRATCH_TOP_USED_OFFSET`. Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
andreiltd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great improvement! I'm looking forward to removing more special regions from shared memory. Having an extra pair of eyes on the changes would be great also.
dblnz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! I left some small comments
| pub use arch::{PAGE_SIZE, PAGE_TABLE_SIZE, PageTableEntry, PhysAddr, VirtAddr}; | ||
| /// This is always the page size that the /guest/ is being compiled | ||
| /// for, which may or may not be the same as the host page size. | ||
| pub use arch::PAGE_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the page size differs? Does the host signal it somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal is to use each of these in the correct places (although I'm not sure whether or not the host page size is important to the guest---maybe in the future as a hint for zerocopy buffer alignment?), but as with some of the other portability stuff that's incidental to this series it's more aspirational/laying the groundwork for the future than "all issues related to this fixed".
| regions.extend(self.get_mapped_regions().cloned()); | ||
| // Include dynamically mapped regions | ||
| // TODO: include the snapshot and scratch regions | ||
| let regions: Vec<MemoryRegion> = self.get_mapped_regions().cloned().collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for the crashdump to work, we need to generate the MemoryRegions for snapshot and scratch.
We'd have to walk the page tables for that, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether you would want the crash dump to be virtually indexed or physically indexed. If the former, yeah, this should be replaced by code that does the same page table walk as snapshot.rs does right now, I think.
src/hyperlight_host/src/mem/mgr.rs
Outdated
| #[allow(clippy::unwrap_used)] | ||
| self.scratch_mem | ||
| .write::<u64>(size_offset, scratch_size as u64) | ||
| .unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just return an error and Poison the sandbox?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree that we should not crash the host if we can absolutely avoid it, maybe we need a new error type besides Poisoned that means something along the lines of "something very bad happened and you should consider restarting your process because we'd really like to panic here"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I agree about this. Things that are a clearly impossible should panic because who knows what is wrong / whether something else in either hyperlight or the host has been corrupted. See also this thread on the previous PR: #1182 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even in the case where this should never happen, I get nervous about being in the failure path as a library that could cause down time. If we poison the sandbox and return an error that is specific to things that should not happen then we provide the consumer the ability to handle it fit to their particular needs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR is not the right place to have the discussion about panicking on invariant violations in general, so while I still maintain my position that we should, I have for now added fallibility to this function in 54cd065.
A close inspection of the code has however convinced me that there is one caller (SandboxMemoryManager<ExclusiveSharedMemory>::build) for which panicking (at least from our code) is more completely impossible. I have added a slightly more detailed comment to that function arguing why it is morally infallible, but have added a Result type to its return for now anyway in the interests of expediency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on Lucy's side on this one in general. The tricky part here is that one may modify write function in the future without realizing it is breaking upstream assumption. If we decide to keep an error I think having debug_assert is a good way to document violation of invariant.
| } | ||
| let (mem, off) = access_gpa(snap, scratch, scratch_size, gpa); | ||
| if off + PAGE_SIZE <= mem.as_slice().len() { | ||
| Some(&mem.as_slice()[off..off + PAGE_SIZE]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the observation in the comment above happen here also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure which comment above you are referring to---if it's the one about page-unaligned accesses reading beyond memory, then I don't think so since the if condition checks for that here?
simongdavies
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A fwe questions/comments, I think there are a few places where there are types that are pub that could be pub (crate) , didn't look for them all, but possibly worth a pass of the whole PR for this.
|
|
||
| pub const MAIN_STACK_TOP_GVA: usize = 0xffff_feff_ffff_f000; | ||
|
|
||
| pub fn scratch_size() -> u64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this and the following functions be pub(crate) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following two functions are definitely used in hyperlight_guest_bin. This one isn't currently but I see little downside to making it match.
src/hyperlight_host/src/mem/mgr.rs
Outdated
| #[allow(clippy::unwrap_used)] | ||
| self.scratch_mem | ||
| .write::<u64>(size_offset, scratch_size as u64) | ||
| .unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to agree that we should not crash the host if we can absolutely avoid it, maybe we need a new error type besides Poisoned that means something along the lines of "something very bad happened and you should consider restarting your process because we'd really like to panic here"
| // impossible unless an extremely significant invariant violation | ||
| // has occurred. | ||
| #[allow(clippy::panic)] | ||
| unsafe fn alloc_table(&self) -> u64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can ew split TableOps into a TableOpsRead and TableOpsWrite to avoid this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably could, but I didn't think it was worth the effort at the moment. It is pretty clear semantically that these implementations could not possibly be r/w and I can't see any natural way in which host code would want to use them that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer compile time checks to make this impossible but OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a downside to splitting the trait into read and write?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, and I agree that it is the correct thing to do, I just wasn't sure if it was worth the hassle. Since it seems like there is more interest in getting right, I went ahead and did it in a2ece56.
| // should possibly ever call an update_parent. Therefore, this is | ||
| // impossible unless an extremely significant invariant violation | ||
| // has occurred. | ||
| #[allow(clippy::panic)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above can we split up the TableOps trait into read/write and then avoid this entirely?
Make virt_to_phys use the aligned and un-canonicalised version of its argument for traversal. This does not actually matter, because only certain bits are actually taken from vmin by the (chained) `modify_ptes` iterator(s). However, since it doesn't matter, this is probably better for clarity. Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
Fix bug that resulted in the first page of scratch memory being unconditionally copied into the snapshot. Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
Be slightly more careful about handling offsets from the guest in snapshot.rs: previously, a maliciously-constructed sandbox could possibly cause underflow panics, and another function could be misused in a way that would result in a slice out-of-bounds panic (although it was not currently being misused in such a way). Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
1fa74ff to
2a2b444
Compare
|
Thanks for the comments everyone! I have added three fixups that address some of them. Sorry for the commit message summaries---the first line is used by autosquash so has to just be "which commit this is amending", although I did add a few words tying each of them to an issue identified in the review.
I did a quick grep for added lines starting with |
ludfjig
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok to me :D
| if let Some(snap) = &self.snapshot | ||
| && snap.as_ref() == snapshot.as_ref() | ||
| { | ||
| // If the snapshot is already the current one, no need to restore | ||
| return Ok(()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove the self.snapshot or is it used somewhere else? (Maybe in future PR if deemed OK?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still used for the optimisation of taking a snapshot immediately after a restore
…tch region Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
This splits `hyperlight_common::vmem::TableOps` into `TableReadOps` and `TableOps`, so that implementations like the one in snapshot.rs that cannot support writes don't have to provide the (extremely partial!) write functions. This is slightly involved, because we need to keep to track of the right information in the type system about whether writes are possible, and, if so, whether they can move existing page tables. The implementation of of writing to cr3 to update page tables is also moved into `hyperlight_guest_bin::paging`, where it belongs, which is a simple change. Signed-off-by: Lucy Menon <168595099+syntactically@users.noreply.github.com>
This combines a few commits from the original guest-CoW PR, all of which involve properly CoW'ing the page tables into the scratch region, dealing with this in snapshots, etc.