#Mongoose

A small toy OS written in Rust, just to take a break from other stuff. I want to go intentionally minimalist and somewhat oldschool, so this will probably look kinda like old OS's, and kinda like embedded RTOS's. Direct-ish inspirations include DOS, classic MacOS, Oberon, GEOS, GEM, and others.

Design goals:

  • No virtual memory
  • No preemptive multitasking(?)
  • More than DOS, less than Unix
  • Cyberpunk movie UI
  • Targets RISC-V 64-bit ISA via QEMU (5.2.0 specifically)

#Current state

I worked on this for a month and am reluctantly leaving it to go back to more important things. I definitely want to come back to it from time to time though. Here's where it's at:

  • It has the nascent skeleton of a HAL and the start of a few hardware drivers
  • It can allocate memory for the kernel and userland processes, though the kernel allocator has a minimum block size of 1 kB and doesn't actually coalesce freed blocks, so it kinda needs some love
  • It can handle interrupts and it would be easy to add a proper ISR table if I ever got around to it
  • There is a buffered kernel log stream
  • It can load ELF executables and start processes, though there's no block device to load 'em from, so at the moment it's just the exe files being stuck into the kernel through include_bytes!()
  • There are two system calls and only one of them does anything (i.e., prints "hello world")
  • I tried to get both cooperative and preemptive scheduling working but didn't have time to make it work right, so right now it's essentially single-tasking. Userland processes get run to completion, in order.
  • No SMP yet, only starts one core

Next steps for when I pick this up again:

  • Go through and give the whole thing a good read, fix any TODO's and BUGGO's that inspire you
  • Add a proper ISR table so doing more with interrupts is easier
  • Userland: Make context switching from kernel to userland work. This is tricky because you have to swap every single register on the way from kernel to userland. You might have to do it as part of an interrupt or ecall instruction(???), or stash the necessary state in thread-local storage or something weird like that. If doing it as an exit from an interrupt, you might need to have separate functions for "start process" and "resume process", which is inconvenient but not the worst. Failing that, go for cooperative multithreading with a dedicated yield syscall. Maybe do that anyway just for fun.
  • I think my problem with (cooperative) context switches was that I did things the wrong way around. I wrote a Rust function to do it, then tried to use inline asm to juggle the registers, but the Rust function was already using some of the registers, so things got complicated. Should have just written an extern "C" function that takes a couple of register structures as arguments and does the switching itself; that way the C calling convention means you never need to save the a or t registers. See the sketch after this list.
  • The current syscall-as-vtable implementation is broken, see associated BUGGO comment
  • Hardware: Adding more drivers should be pretty straightforward. Make them query the device tree and set interrupt handlers appropriately instead of having interrupt numbers hardcoded. Make a block device driver, then that should make it easy to add a virtio raster framebuffer.
  • Userland: idk, more system calls or something
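
To make that concrete, here's a minimal sketch of the extern "C" approach from a couple of bullets up, in the style of xv6-riscv's swtch.S. Context and switch_ctx are hypothetical names, not code that exists in the repo yet. The point is that since switch_ctx follows the C calling convention, the compiler already treats a0-a7 and t0-t6 as clobbered across the call, so only ra, sp, and s0-s11 need saving:

    use core::arch::global_asm;

    /// Callee-saved state for a cooperative switch. Everything else is
    /// caller-saved under the C ABI, so the compiler preserves it for us.
    #[repr(C)]
    #[derive(Default)]
    pub struct Context {
        pub ra: usize,      // where the task resumes when switched back in
        pub sp: usize,
        pub s: [usize; 12], // s0-s11
    }

    global_asm!(r#"
    .global switch_ctx
    switch_ctx:             # a0 = &mut old Context, a1 = &Context to run
        sd ra,    0(a0)
        sd sp,    8(a0)
        sd s0,   16(a0)
        sd s1,   24(a0)
        sd s2,   32(a0)
        sd s3,   40(a0)
        sd s4,   48(a0)
        sd s5,   56(a0)
        sd s6,   64(a0)
        sd s7,   72(a0)
        sd s8,   80(a0)
        sd s9,   88(a0)
        sd s10,  96(a0)
        sd s11, 104(a0)
        ld ra,    0(a1)
        ld sp,    8(a1)
        ld s0,   16(a1)
        ld s1,   24(a1)
        ld s2,   32(a1)
        ld s3,   40(a1)
        ld s4,   48(a1)
        ld s5,   56(a1)
        ld s6,   64(a1)
        ld s7,   72(a1)
        ld s8,   80(a1)
        ld s9,   88(a1)
        ld s10,  96(a1)
        ld s11, 104(a1)
        ret                 # "returns" into the new context's saved ra
    "#);

    extern "C" {
        /// Saves the current callee-saved registers into `old`, loads them
        /// from `new`, and returns into `new`'s saved ra. Callable from Rust.
        pub fn switch_ctx(old: *mut Context, new: *const Context);
    }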

#Toolchain setup

rustup install nightly
rustup +nightly component add rust-src
rustup target add riscv64gc-unknown-none-elf
rustup update

apt install qemu-system-misc binutils-riscv64-unknown-elf
# Non-critical but useful tools
apt install device-tree-compiler

#Hardware Resources

#Docs/tutorials

#License

GPL-2.0 and not interested in arguing about it

#Target machine notes

Target machine: qemu-system-riscv64 -machine virt

We should at least vaguely try to have support for different machines. ...Eh, maybe later.

Hardware:

qemu virt machine memory map, for qemu 5.2:

address-space: memory
  0000000000000000-ffffffffffffffff (prio 0, i/o): system
    0000000000001000-000000000000ffff (prio 0, rom): riscv_virt_board.mrom
    0000000000100000-0000000000100fff (prio 0, i/o): riscv.sifive.test
    0000000000101000-0000000000101023 (prio 0, i/o): goldfish_rtc
    0000000002000000-000000000200ffff (prio 0, i/o): riscv.sifive.clint
    0000000003000000-000000000300ffff (prio 0, i/o): gpex_ioport
    000000000c000000-000000000c20ffff (prio 0, i/o): riscv.sifive.plic
    0000000010000000-0000000010000007 (prio 0, i/o): serial
    0000000010001000-00000000100011ff (prio 0, i/o): virtio-mmio
    0000000010002000-00000000100021ff (prio 0, i/o): virtio-mmio
    0000000010003000-00000000100031ff (prio 0, i/o): virtio-mmio
    0000000010004000-00000000100041ff (prio 0, i/o): virtio-mmio
    0000000010005000-00000000100051ff (prio 0, i/o): virtio-mmio
    0000000010006000-00000000100061ff (prio 0, i/o): virtio-mmio
    0000000010007000-00000000100071ff (prio 0, i/o): virtio-mmio
    0000000010008000-00000000100081ff (prio 0, i/o): virtio-mmio
    0000000020000000-0000000021ffffff (prio 0, romd): virt.flash0
    0000000022000000-0000000023ffffff (prio 0, romd): virt.flash1
    0000000030000000-000000003fffffff (prio 0, i/o): alias pcie-ecam @pcie-mmcfg-mmio 0000000000000000-000000000fffffff
    0000000040000000-000000007fffffff (prio 0, i/o): alias pcie-mmio @gpex_mmio 0000000040000000-000000007fffffff
    0000000080000000-000000009fffffff (prio 0, ram): riscv_virt_board.ram

Tools:

Apparently qemu provides you a pointer to the device tree blob in the a1 register?
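
In case it's useful, a sketch of what the kernel's Rust-side entry might look like under that convention. With OpenSBI on the virt board, a0 is supposed to hold the hart id and a1 the physical address of the flattened device tree; kmain is a hypothetical name:

    /// Hypothetical Rust-side entry point, jumped to from the boot asm.
    /// Under the C ABI, a0 and a1 arrive as the first two arguments.
    #[no_mangle]
    pub extern "C" fn kmain(hartid: usize, device_tree: *const u8) -> ! {
        // Parse the flattened device tree at `device_tree` to find hardware...
        let _ = (hartid, device_tree);
        loop {}
    }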

SBI seems a little redundant, and seems to make it difficult to get to M mode which is where I want to be if I want to play with PMP's. BUT, on the flip side, I don't need to play with PMP's, certainly not yet. Then, on the gripping hand, if qemu is going to make me use SBI anyway then I might want to use a known and controlled version of it. Ditching SBI is appealing and I don't think would be that hard... but it would also make life harder for porting the OS to new systems as well, since there would then need to be a large(r) amount of platform support code involved. So, for now let's just use the SBI it gives us and keep on rocking.

Useful qemu monitor commands:

  • info mtree -- memory map
  • info pci -- PCI tree
  • info qtree -- device tree
  • info roms -- ROMs, plus loaded files such as kernels/bootloaders
  • info mem -- Virtual memory info
  • qemu-system-riscv64 -device help -- prints out IDs for all supported devices
  • qemu-system-riscv64 -machine virt -machine dumpdtb=whatever.dtb -- Dump device tree blob -- use dtc to decompile it to text.

#Design

Why is it the way it is? For fun.

#Memory

No virtual memory. No memory protection. Kernel is loaded at a fixed location and has a fixed-size stack and heap; the kernel should be designed so that it hopefully never needs more than a given amount of heap memory. All userland executables are position-independent code.

#Processes

There is (probably) one privileged process, the Monitor, which is the one the user generally interacts with. It forms the "shell" that handles user interaction and has special privileges.

The Monitor can start background processes, which we'll call Tasks. They are less privileged; they cannot start other Tasks (for now), have fixed-size memory allocations (for now), and can only access system calls/hardware resources/capabilities that the Monitor gives them. I think of them a little like jobs on old batch-processing mainframes, where they just do a bunch of data crunching and produce a fixed output. There may be a fixed maximum number of Tasks.

All processes just run in Supervisor mode, same as the OS. I considered making them run in User mode, but without memory protection there aren't really any benefits to doing so. A malicious program could always just overwrite the OS's interrupt handler with its own and gain privilege that way.

Each process has a single thread and runs on a single CPU. (For now.) Processes get pre-empted by timer interrupts and those may schedule different processes to run, though it probably won't exactly be sophisticated. I wasn't originally planning on doing it this way, but frankly it ended up being so obvious it was kind of a shoo-in.

A process gets a fixed chunk of memory when it is started, and does not get more ever. The amount of memory to give to a process is an input to the "spawn process" function call, and that function call just fails if there's not enough memory available. The process gets loaded at the beginning of this section of memory, and canonically the stack starts at the end of it and grows downward. The process doesn't need to actually use a stack, technically, but system calls are functions following the C ABI for the platform so really you're gonna need a stack. Interrupts also use the stack, so.
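
As a sketch of that shape (spawn, Pid, and alloc_contiguous are all made-up names for illustration, not the real API):

    pub struct Pid(pub u32);

    #[derive(Debug)]
    pub enum SpawnError {
        OutOfMemory,
        BadExecutable,
    }

    /// Hypothetical shape of the "spawn process" call described above.
    pub fn spawn(image: &[u8], mem_size: usize) -> Result<Pid, SpawnError> {
        if image.len() > mem_size {
            return Err(SpawnError::BadExecutable);
        }
        // The whole allocation is requested up front; no growing later.
        let block = alloc_contiguous(mem_size).ok_or(SpawnError::OutOfMemory)?;
        // The (position-independent) executable is loaded at the bottom...
        block[..image.len()].copy_from_slice(image);
        // ...and the initial stack pointer is the top, growing downward.
        let _initial_sp = block.as_ptr() as usize + mem_size;
        Ok(Pid(0)) // real code would build and enqueue a process structure
    }

    fn alloc_contiguous(_size: usize) -> Option<&'static mut [u8]> {
        None // stub standing in for the kernel allocator
    }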

Because there's no virtual memory, all programs must be position-independent code. I was considering having essentially a compacting GC for processes that moved them around in memory to minimize fragmentation, but that would still need fix-ups of non-PC-relative pointers that might get created during the execution of the process, so it will probably never happen.

Executable format: ELF64. It's Fine. No dynamic linking, at least not implemented by the kernel loader itself.

TODO: What about Tasks whose job is unbounded I/O, such as network servers? Ponder this. Those require some kind of blocking to be efficient, which means context switching.

#Monitor

The Monitor as I'm envisioning it currently is a tabbed frame with some status info on it, a little like a tmux session. There's one tab that is the kernel console, one tab (at least?) which is a (privileged?) command line, and programs can exist in other tabs. Each tab is either a (rich) text widget command line or a raster display. When a program (task?) is started, it gets exclusive access over one tab, and releases it when it is done.

We will see as we go how much it actually ends up being a special-case privileged thing, or whether it's just a normal process that has been given a few extra permissions.

#System calls and interrupts

There is a struct that is passed to each process when it is started, which contains function pointers to various system calls. We'll call it the "system call table". The function pointers are just normal functions using extern "C" ABI.

Each process gets its own copy of the system call table, and any or all of the functions in it may be stubs that just trap into the OS when called. The OS may then do whatever it feels like. This lets the kernel limit what a process can do, by not providing it certain system calls, or maybe someday even giving that process specialized versions of some system calls. If it wants, the process can overwrite function pointers in its own table with whatever it feels like -- for example, zeroing them out to implement something similar to OpenBSD's pledge() mechanism.
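
A sketch of roughly what that might look like; the field names are made up for illustration:

    /// One copy of this is handed to each process when it starts; the
    /// kernel fills in only the calls that process is allowed to make.
    #[repr(C)]
    pub struct SyscallTable {
        /// The one that currently works: prints "hello world".
        pub hello: Option<extern "C" fn() -> i32>,
        /// The other one, which doesn't do anything yet.
        pub nop: Option<extern "C" fn() -> i32>,
    }

Since Rust guarantees that an Option of an extern "C" fn is a plain nullable function pointer, "zero out the entry" and "set it to None" are the same thing, which makes the pledge() trick above fall out naturally.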

A system call involves a context switch from the user program to the operating system. It saves the user program's registers to the process structure and switches to a per-CPU OS stack, does its stuff, then restores the user's registers when it's done. Because juggling registers is a pain in the ass and gets exponentially worse the more you do it, interrupts are disabled during system calls. So we should design system calls to be short. An async programming model for system calls is a good idea anyway, even if it's not exactly oldschool.

Interrupts spill registers onto whatever stack they happen to be on, then restore them when they're done. We could have an interrupt-specific stack per CPU, and in fact did once upon a time, but it was kind of a pain to set up and wasn't very big, so the actual benefits were minimal.

#HAL

Hardware abstraction layer currently just consists of a single serial device that may or may not exist, which is designated the "kernel console". This is where kernel debug messages get output to.

Devices we want to support:

  • [x] Bootloader via SBI
  • [x] Serial devices (uart16550)
  • [ ] Ethernet devices (rtl8139)
  • [ ] Hard drives via the virtio block device -- GPT plus some kind of FAT32 library at least, the fat32 or fatfs crate or such
  • [ ] 2D raster graphics of some kind via the qemu virtio-gpu-device driver and maybe the embedded-graphics crate.
  • [ ] Realtime clock (goldfish-rtc or whatever qemu gives us)

Supporting more than one kind of any particular device is a Power Goal we are not going to worry about right now. An abstract interface for devices is something we should have but isn't necessary for now.

#Storage/Filesystem

Content-addressed. That's all I got right now. Probably an append-only (but garbage-collected) content-addressed block store, and a more user-friendly index over it that lets you assign names to things and stuff like that.
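
In sketch form, assuming hash-sized block IDs (all names hypothetical):

    /// A block's name is the hash of its contents, so blocks are immutable
    /// and deduplicate for free; "mutation" means writing a new block.
    pub struct BlockId(pub [u8; 32]); // e.g. a SHA-256 digest

    pub trait BlockStore {
        /// Hash `data`, append it if unseen, and return its id.
        fn put(&mut self, data: &[u8]) -> BlockId;
        /// Look a block up by id, if it's still live (not garbage-collected).
        fn get(&self, id: &BlockId) -> Option<&[u8]>;
    }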

#Limitations

  • Runs on 64-bit platforms only.
  • Runs on RISC-V only, though at least some attempts have been made to make it remotely possible to port to other platforms. (An AArch64 port might be fun someday, either on virt or Raspberry Pi hardware.)
  • Minimum memory: the kernel tries to get a 4 MB heap all to itself, so that plus whatever the code size is, plus some for userland. Currently the VM is set to have 128 MB of memory, which is way more than enough.
  • Maximum memory: whatever fits in the physical memory map.
  • Maximum number of CPUs/harts: 8, the max the qemu virt platform supports, though changing it should only involve changing one constant and recompiling.

#Random Ideas

RISC-V has a fairly simple, microcontroller-like "physical memory protection" (PMP) feature that lets you portion out access control for up to 16 contiguous memory regions, without involving paging. It might be fun to play with that, if we feel like it. I was originally considering only having a max of 16 processes and giving each one its own physical memory region, but upon contemplation, if the goal is to not let processes touch each other, then just using a few entries to cover all memory outside the current process is probably the best approach.
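
For future reference, a sketch of what poking PMP could look like, going by the encodings in the privileged spec. pmp_allow_region is a hypothetical helper, nothing that exists in the repo, and this only works from M mode:

    use core::arch::asm;

    /// Grant R/W/X access to [base, top) using PMP entries 0 and 1 in
    /// TOR (top-of-range) mode. The pmpaddr CSRs hold the address >> 2.
    unsafe fn pmp_allow_region(base: usize, top: usize) {
        asm!("csrw pmpaddr0, {}", in(reg) base >> 2);
        asm!("csrw pmpaddr1, {}", in(reg) top >> 2);
        // Entry 0 stays OFF and just supplies the lower bound; entry 1 is
        // A=TOR (0b01 in bits 4:3) with R|W|X set, in byte 1 of pmpcfg0.
        let cfg: usize = ((0b01 << 3) | 0b111) << 8;
        asm!("csrw pmpcfg0, {}", in(reg) cfg);
    }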

Make this single-tasking, more or less. There can probably be separate programs that run on other cores/harts, but they will probably be basically non-interactive "background" or "service" processes and might need a different API (for example, they may not be able to spawn other background processes). Probably no actual userland threads; each program probably gets one hart or is put to sleep. Maybe play with cooperative multithreading or such.

Maybe have the user shell be a unique "console" or "supervisor" program with more privileges than background processes.

I kinda want kernel system calls to just be a table of function pointers, 'cause early MacOS did that and it always seemed cute. (When combined with cooperative multitasking this makes the kernel look a hell of a lot like a DLL that the user programs just call into.)

It would be cool to be able to run rustc on this, but oh gods. For now let's not even try to go there.

Webassembly? Don't tempt me.

Would be cool to have a memory debugger type thing that could actually draw the memory map for the OS as a whole, and also the different sections of each process.

#Filesystem

Good effin' question. FAT-32? Image-based a la Smalltalk? Something different and weird? I'm not particularly interested in files for this particular use case.

Maybe an exclusive single-level space per-process, with the Monitor being the one that lets files move around between them. I do like the idea of file access being deny-by-default, programs can only access their own little sandbox unless explicitly permitted otherwise.

CONTENT ADDRESSING! OF COURSE!

#UI

I'm imagining a simple tab-based GUI with a Lisp-based shell. Maybe it hooks into the stdout-equivalent of other background programs for faux-multiprocessing.

I do like the idea of having the "shell" be something richer than just stdin+stdout. Maybe an input stream, output stream, (structured?) logging stream, raster canvas each program can draw on, tabs/controls to switch between them, and around it lie a few basic control and status things. Imagine a tabbed GUI shoved into a particularly blinged-out tmux instance.

References for Lisp impl's and other embeddable languages in Rust. Some might be useful, some might simply be inspiration.

#Networking

It would be nice to be able to talk to other things over ethernet. Not super interesting to me yet, but eventually. Can we run an ethernet driver as a background task?

#Lessons Learned

#RISC-V stuff

Using RISC-V on qemu's virt machine setup is way, way, way, way nicer than any real hardware, but is close enough to real hardware that you are forced to pay attention to the details. I would highly recommend it for any budding OS dev. You get to the interesting stuff way faster, but the hardware peripherals are real enough to be useful and give you practice for dealing with hairier real hardware someday in the future.

Turns out RISC-V is in some ways lower level than x86. (Some would say just "simpler".) Interrupts are a good example: x86 has this complicated machinery of descriptors and reserved memory spaces and stuff that handles context swaps on interrupts. (Traps, to use the RV term.) You give it a place to stick stack pointers and spill registers and all this random stuff, then still have to write code to handle the edge cases. RV on the other hand basically just gives you a single register on each hart, sscratch, and says "yeah that oughta let you do whatever you need to do". On the one side it's easy to understand and set up at a basic level, on the other side it would be sort of nice to have more than one register available and not have to hand-roll your own interrupt stack frames and such. This has actually forced my design a little bit, it makes context switching outside of interrupts difficult.

Writing assembly for a RISC CPU is occasionally a little bit of a pain in the ass, mainly when you're doing "weird" things like making long jumps to arbitrary addresses or writing interrupt handlers. The consistent instructions and convenient calling conventions are really nice though. Amusingly, 32 registers is more than I can pay attention to at any given time, so if I were actually writing user code in assembly a compiler would probably do a better job of register allocation than I would.

Also, assuming RISC-V is a good example of what 2020-era hardware likes, the original RISC principle of "each instruction only does one thing" is now out of date. An example is the compare-and-branch instructions, which do two things: compare, and branch. But the extra difficulty of making the CPU handle the control dependency is worth it because it eliminates a data dependency; there is an extra bit of internal state in executing the instruction, but there is no flags register that multiple instructions have to try to read and write at once. Data flow dictates what the best form for the code is. Similarly, no arithmetic instructions cause traps/interrupts, even on invalid input (such as 1/0); they instead produce a guard value that can be checked against. An implicit global data dependency (do I trap or not?) has been replaced by an explicit local one (check this one register against a guard value).
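
To make the second point concrete: Rust's / operator inserts its own divide-by-zero check (and panics), so you need raw asm to observe the hardware behavior. A quick sketch:

    use core::arch::asm;

    /// Issue a raw `div`, bypassing Rust's own divide-by-zero check.
    fn raw_div(a: i64, b: i64) -> i64 {
        let out: i64;
        unsafe { asm!("div {0}, {1}, {2}", out(reg) out, in(reg) a, in(reg) b) };
        out
    }

    // raw_div(1, 0) == -1 (all ones, per the spec) and no trap is taken;
    // the caller checks the guard value instead of the CPU raising anything.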

Turns out most/all(?) multi-core CPU's have a distinction between on-core traps, such as invalid instruction faults, and off-core interrupts, such as a serial device becoming ready to read/write stuff. The on-core ones generally go to a per-core interrupt controller, while the off-core ones go through a separate external interrupt controller that decides which core to route it to.

Interestingly, I think that doing a context switch in RISC-V is actually impossible outside of an interrupt. You need a pointer in a register to load the registers of the task you are switching to, which is doable, but then you also need an address in a register specifying the instruction to jump to when you exit the context switch. I'm no expert, but I haven't been able to juggle registers around properly; I always end up needing a free register somewhere that I don't have one. I thought cooperative context switches could be done using a function call that just returns to the value in the ra register, but I haven't been able to make that work in practice either. Inside an interrupt it becomes much more feasible, because you have the sscratch register (and know nothing else is gonna be using it), which can hold a value for whatever you want (and the csrrw instruction swaps its value with another register, so reading it loses no state), and you have the sepc register, which lets you tell the CPU where to return to without altering the ra register of the context you're about to return to anyway. Still, I think this deserves a bit more work; cooperative context switches really should be possible, just gotta make them work.
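
As a concrete illustration of the sscratch trick, a minimal sketch of a trap entry stub. trap_entry is a hypothetical name; it assumes stvec points at it and sscratch was loaded with a per-hart scratch-area pointer at boot, and the handler body is elided down to "spill the registers we use, do nothing, return":

    use core::arch::global_asm;

    global_asm!(r#"
    .align 4
    .global trap_entry
    trap_entry:
        # Swap a0 with sscratch: a0 now points at the per-hart scratch area
        # and the interrupted code's a0 is parked in sscratch. No state lost.
        csrrw a0, sscratch, a0
        sd    a1, 0(a0)          # spill a1 so we have two registers to work with
        # ... real trap handling would go here; sepc holds the return address ...
        ld    a1, 0(a0)          # restore a1
        csrrw a0, sscratch, a0   # put the scratch pointer back, recover a0
        sret                     # returns to sepc without touching ra
    "#);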

https://wiki.alopex.li/RiscVNotes in general.

#OS stuff

I would really like to be able to put guard pages at the ends of my stacks. I guess there are some drawbacks to rocking it old school. Adding canary values at the end of the stacks and checking them routinely (on interrupt exit, for example) has actually caught some bugs, though.
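
A minimal sketch of that canary check (names hypothetical):

    /// Magic value written to the far end of each stack when it is set up.
    const STACK_CANARY: u64 = 0xDEAD_BEEF_CAFE_F00D;

    /// `stack_end` points at the canary word at the far end of the stack;
    /// call this routinely, e.g. on interrupt exit.
    unsafe fn check_canary(stack_end: *const u64) {
        if stack_end.read_volatile() != STACK_CANARY {
            panic!("stack overflow: canary clobbered");
        }
    }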

Memory is kind of more important than CPU power, in terms of OS design. Having tiny amounts of memory warps an OS's design way more than having a slow CPU does. Same with a tiny address space. You can always just drop in a faster CPU of the same architecture, at least since the 70's or so, but adding more memory or a bigger address space can fundamentally change how your OS works. Even though I'm on a deliberately restricted system, a slow VM with 128 MB of memory and no virtual mem, I have a bigass 64-bit address space. This means a lot of my OS looks kinda modern, without a lot of the weird hacks and limitations that were necessary on 16-bit or 24-bit systems.

This sort of extends to hardware, too. The Intel 8080 had 16 bits of address space, plus 256 I/O ports: basically a separate 8-bit address space dedicated to I/O. This meant the whole 64k address space could be dedicated to RAM, with no space needed for memory-mapped I/O. It would have been nicer all around to have a 17-bit address space with some of it devoted to I/O... maybe even nicer on the hardware side, since you could reuse some address and I/O pins. But you would have needed bigger registers to be able to address that space, and that wasn't going to happen, so they didn't do it, and you get this less-clean design with its dedicated I/O space... but also the full 64k address space devoted to RAM. These days, when 32-bit microcontrollers can cost a few cents and have a full 32-bit address space even if there's no way you can attach that much memory to them, having big chunks of your address space devoted to memory-mapped I/O is easy. You can have an all-around better design, because you don't need to work around the limitations implicit in a tiny address space, because it's easy to throw enough transistors at the problem to support 32-bit registers and ALUs and such everywhere.

The oldschool style of design is synchronous and single-threaded, because single-threaded is simple, and if you only have one thread happening anyway then having system calls and such block is usually pretty good. The modern style of design, which has really picked up pace since 2010 or so, is asynchronous and multi-threaded, which similarly go well together. If you have lots of cores you need lots of threads to use them, and if you have lots of threads then you want to minimize the cost involved with context switches, shuffling data between them, and the accounting involved. So a non-blocking model becomes more and more useful. Networked systems are an extension of this; having a system call send an event from one thread to another on separate cores is structurally very similar to one computer sending an RPC call to another. The API's that make both efficient are similar, it's just a matter of the magnitudes of the overhead.

This trend towards non-blocking code is not a fundamental law however, it is an implementation detail. (I prefer the term "non-blocking" to "asynchronous" since it's more descriptive and "async" often refers to particular feature sets in particular languages.) If we could always replace two computers with one twice as fast, we usually would. In practice though, Moore's Law has run out and making a computer twice as fast is much more than twice as difficult/expensive, so we end up with datacenters full of thousands of small computers working together instead of a few computers the size of shipping containers.

#Known bugs

  • None, any behavior is currently a feature.
  • In reality, just about every usage of static mut somewhere is evil, and there's an inconvenient number of them.