#Mongoose

A small toy OS written in Rust, just to take a break from other stuff. I want to go intentionally minimalist and somewhat oldschool, so this will probably look kinda like old OS's, and kinda like embedded RTOS's. Direct-ish inspirations include DOS, classic MacOS, Oberon, GEOS, GEM, and others.

Design goals:

  • No virtual memory
  • No preemptive multitasking(?)
  • More than DOS, less than Unix
  • Cyberpunk movie UI
  • Targets RISC-V 64-bit ISA via QEMU (5.2.0 specifically)

#Current state

Total time put into this: approx 2.5 months. I'm a little sick of it right now, so it's time to put it down for a little while again. Interrupts and hardware driver support are decent, and we have a real video driver even if it doesn't do much yet. Here's where it's at:

  • It has a few hardware drivers, including a virtio 2D display driver that can show a fixed image.
  • It can allocate memory for kernels and userland processes, though the kernel allocator has some subtle bugs that need to be tracked down.
  • There is a buffered kernel log stream, which deadlocks really easily and generally sucks.
  • It can start all the harts on a system, but they deadlock when they try to write to the kernel log.
  • It can load ELF executables and start processes, though there's no block device to load 'em from, so at the moment it's just the exe files being stuck into the kernel through include_bytes!()
  • Processes and system calls kinda theoretically exist but in reality need to be rewritten. System calls should just be a fixed table in kernel space, give up on the cute "vtable per process" thing for now.

Next steps for when I pick this up again:

  • We really need a real MPMC type queue for both the kernel log and for device drivers in general, so we can have interrupts producing/consuming events and CPU's consuming/producing events both at the same time without deadlocking each other. (See the sketch after this list.)
  • Go through and give the whole thing a good read, fix any TODO's and BUGGO's that inspire you
  • The kmem allocator has bugs I don't understand yet. Make sure the size of block it returns to you is actually correct, and make sure it does something sensible when a memory request can't be fulfilled. I triggered some of these bugs by reducing the size of the kernel heap to 4 or 8 MB and trying to allocate a framebuffer.
  • Add block device driver, and maybe also a virtio input device.
  • Graphics driver: doesn't handle multiple virtqueues, doesn't handle the mouse cursor, lots of things are hardwired, and it doesn't handle interrupts/flips at all (it just sets the framebuffer to a fixed image). Needs double buffering, and a queue to let the kernel receive/send notifications when the current buffer is ready to be flipped.
  • Might need to refactor and/or add locks to the hardware interface structures
  • We might be able to dynamically allocate HartInfo's, apparently qemu 7 supports 32 RISC-V harts now.
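
The first item above, sketched out: a minimal bounded MPMC queue in the style of Dmitry Vyukov's array queue, usable from both interrupt and thread context because neither push nor pop ever blocks. The names, capacity, and details here are illustrative, not what the kernel actually has.

use core::cell::UnsafeCell;
use core::mem::MaybeUninit;
use core::sync::atomic::{AtomicUsize, Ordering};

const CAP: usize = 64; // must be a power of two

struct Slot<T> {
    seq: AtomicUsize,
    data: UnsafeCell<MaybeUninit<T>>,
}

pub struct MpmcQueue<T> {
    slots: [Slot<T>; CAP],
    enqueue_pos: AtomicUsize,
    dequeue_pos: AtomicUsize,
}

// Safe to share between harts/interrupt handlers as long as T itself is Send.
unsafe impl<T: Send> Sync for MpmcQueue<T> {}

impl<T> MpmcQueue<T> {
    pub fn new() -> Self {
        Self {
            slots: core::array::from_fn(|i| Slot {
                seq: AtomicUsize::new(i),
                data: UnsafeCell::new(MaybeUninit::uninit()),
            }),
            enqueue_pos: AtomicUsize::new(0),
            dequeue_pos: AtomicUsize::new(0),
        }
    }

    /// Never blocks; returns Err(value) instead of waiting when the queue is full.
    pub fn push(&self, value: T) -> Result<(), T> {
        let mut pos = self.enqueue_pos.load(Ordering::Relaxed);
        loop {
            let slot = &self.slots[pos & (CAP - 1)];
            let seq = slot.seq.load(Ordering::Acquire);
            let dif = seq as isize - pos as isize;
            if dif == 0 {
                // Slot is empty; try to claim it.
                match self.enqueue_pos.compare_exchange_weak(
                    pos, pos + 1, Ordering::Relaxed, Ordering::Relaxed) {
                    Ok(_) => {
                        unsafe { (*slot.data.get()).write(value) };
                        slot.seq.store(pos + 1, Ordering::Release);
                        return Ok(());
                    }
                    Err(p) => pos = p,
                }
            } else if dif < 0 {
                return Err(value); // full
            } else {
                pos = self.enqueue_pos.load(Ordering::Relaxed);
            }
        }
    }

    /// Never blocks; returns None when the queue is empty.
    pub fn pop(&self) -> Option<T> {
        let mut pos = self.dequeue_pos.load(Ordering::Relaxed);
        loop {
            let slot = &self.slots[pos & (CAP - 1)];
            let seq = slot.seq.load(Ordering::Acquire);
            let dif = seq as isize - (pos + 1) as isize;
            if dif == 0 {
                match self.dequeue_pos.compare_exchange_weak(
                    pos, pos + 1, Ordering::Relaxed, Ordering::Relaxed) {
                    Ok(_) => {
                        let value = unsafe { (*slot.data.get()).assume_init_read() };
                        slot.seq.store(pos + CAP, Ordering::Release);
                        return Some(value);
                    }
                    Err(p) => pos = p,
                }
            } else if dif < 0 {
                return None; // empty
            } else {
                pos = self.dequeue_pos.load(Ordering::Relaxed);
            }
        }
    }
}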

#Toolchain setup

rustup install nightly
rustup +nightly component add rust-src
rustup target add riscv64gc-unknown-none-elf
rustup update

apt install qemu-system-misc binutils-riscv64-unknown-elf
# Non-critical but useful tools
apt install device-tree-compiler
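
Assuming SBI (OpenSBI) as the bootloader, a typical way to boot the thing looks something like the following; the repo's run script is the real authority here, and the kernel path is just a placeholder.

# Representative invocation, not the actual run script
qemu-system-riscv64 -machine virt -smp 8 -m 128M -bios default \
    -serial mon:stdio -device virtio-gpu-device \
    -kernel target/riscv64gc-unknown-none-elf/debug/kernel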

#Hardware Resources

#Docs/tutorials

#License

GPL-2.0 and not interested in arguing about it

#Target machine notes

Target machine: qemu-system-riscv64 -machine virt

We should at least vaguely try to have support for different machines. ...Eh, maybe later.

Hardware:

qemu virt machine memory map, for qemu 5.2:

address-space: memory
  0000000000000000-ffffffffffffffff (prio 0, i/o): system
    0000000000001000-000000000000ffff (prio 0, rom): riscv_virt_board.mrom
    0000000000100000-0000000000100fff (prio 0, i/o): riscv.sifive.test
    0000000000101000-0000000000101023 (prio 0, i/o): goldfish_rtc
    0000000002000000-000000000200ffff (prio 0, i/o): riscv.sifive.clint
    0000000003000000-000000000300ffff (prio 0, i/o): gpex_ioport
    000000000c000000-000000000c20ffff (prio 0, i/o): riscv.sifive.plic
    0000000010000000-0000000010000007 (prio 0, i/o): serial
    0000000010001000-00000000100011ff (prio 0, i/o): virtio-mmio
    0000000010002000-00000000100021ff (prio 0, i/o): virtio-mmio
    0000000010003000-00000000100031ff (prio 0, i/o): virtio-mmio
    0000000010004000-00000000100041ff (prio 0, i/o): virtio-mmio
    0000000010005000-00000000100051ff (prio 0, i/o): virtio-mmio
    0000000010006000-00000000100061ff (prio 0, i/o): virtio-mmio
    0000000010007000-00000000100071ff (prio 0, i/o): virtio-mmio
    0000000010008000-00000000100081ff (prio 0, i/o): virtio-mmio
    0000000020000000-0000000021ffffff (prio 0, romd): virt.flash0
    0000000022000000-0000000023ffffff (prio 0, romd): virt.flash1
    0000000030000000-000000003fffffff (prio 0, i/o): alias pcie-ecam @pcie-mmcfg-mmio 0000000000000000-000000000fffffff
    0000000040000000-000000007fffffff (prio 0, i/o): alias pcie-mmio @gpex_mmio 0000000040000000-000000007fffffff
    0000000080000000-000000009fffffff (prio 0, ram): riscv_virt_board.ram

Tools:

Useful qemu monitor commands:

  • Getting into the monitor: Ctrl-A C
  • Force-quit emulator: Ctrl-A X
  • xp /LEN 0xADDR -- Dump physical memory at address
  • info mtree -- memory map
  • info pci -- PCI tree
  • info qtree -- device tree
  • info roms -- ROMs, plus loaded files such as kernels/bootloaders
  • info mem -- Virtual memory info
  • qemu-system-riscv64 -device help prints out ID's for all supported devices
  • qemu-system-riscv64 -machine virt -machine dumpdtb=whatever.dtb -- Dump device tree blob -- use dtc to decompile it to text.
  • Docs: https://www.qemu.org/docs/master/system/mux-chardev.html and https://www.qemu.org/docs/master/system/monitor.html
  • You can also use the included script to connect GDB to the qemu instance, it's enabled by default.

#Design

Why is it the way it is? For fun.

#Memory

No virtual memory. No memory protection. Kernel is loaded at a fixed location and has a fixed-size stack and heap; the kernel should be designed so that it hopefully never needs more than a given amount of heap memory. All userland executables are position-independent code.
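
Concretely, the fixed layout amounts to a handful of constants like these. The numbers are illustrative only; the real values live wherever the kernel defines its layout. (qemu's virt board puts RAM at 0x8000_0000, and OpenSBI conventionally jumps to the next stage at 0x8020_0000, so that's a natural load address.)

// Illustrative only -- not the kernel's actual constants.
pub const KERNEL_LOAD_ADDR: usize = 0x8020_0000;
pub const KERNEL_HEAP_SIZE: usize = 4 * 1024 * 1024; // fixed-size kernel heap
pub const KERNEL_STACK_SIZE: usize = 64 * 1024;      // fixed-size stack, per hart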

#Processes

There is (probably) one privileged process, the Monitor, which is the one the user generally interacts with. It forms the "shell" that handles user interaction and has special privileges.

The Monitor can start background processes, we'll call them Tasks. They are less privileged; they cannot start other Tasks (for now), have fixed size memory allocations (for now), and can only access system calls/hardware resources/capabilities that the Monitor gives them. I think of them a little like processes in old batch-processing mainframes, where they just do a bunch of data crunching and have a fixed output. There may be a fixed maximum number of Tasks.

All processes just run in Supervisor mode, same as the OS. I considered making them run in User mode, but without memory protection there aren't really any benefits to doing so. A malicious program could always just overwrite the OS's interrupt handler with its own and gain privilege that way.

Each process has a single thread and runs on a single CPU. (For now.) Processes get pre-empted by timer interrupts and those may schedule different processes to run, though it probably won't exactly be sophisticated. I wasn't originally planning on doing it this way, but frankly it ended up being so obvious it was kind of a shoo-in.

A process gets a fixed chunk of memory when it is started, and does not get more ever. The amount of memory to give to a process is an input to the "spawn process" function call, and that function call just fails if there's not enough memory available. The process gets loaded at the beginning of this section of memory, and canonically the stack starts at the end of it and grows downward. The process doesn't need to actually use a stack, technically, but system calls are functions following the C ABI for the platform so really you're gonna need a stack. Interrupts also use the stack, so.
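
A sketch of the bookkeeping that implies; the field names are made up for illustration and are not the kernel's actual types.

/// Everything the kernel needs to remember about a process under the
/// fixed-chunk scheme above. (Illustrative; not Mongoose's real struct.)
pub struct Process {
    /// Start of the process's single contiguous memory chunk; the
    /// (position-independent) executable is loaded at this address.
    mem_start: usize,
    /// Size of the chunk, fixed at spawn time and never grown.
    mem_size: usize,
    /// Saved general-purpose registers for when the process isn't running.
    saved_regs: [usize; 32],
}

impl Process {
    /// Canonically the stack starts at the very end of the chunk and grows down.
    pub fn initial_stack_pointer(&self) -> usize {
        self.mem_start + self.mem_size
    }
}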

Because there's no virtual memory, all programs must be position-independent code. I was considering having essentially a compacting GC for processes that moved them around in memory to minimize fragmentation, but that would still need fix-ups of non-PC-relative pointers that might get created during the execution of the process, so it will probably never happen.

Executable format: ELF64. It's Fine. No dynamic linking, at least not implemented by the kernel loader itself.

TODO: What about Tasks whose job is unbounded I/O, such as network servers? Ponder this. Those require some kind of blocking to be efficient, which means context switching.

#Monitor

The Monitor as I'm envisioning it currently is a tabbed frame with some status info on it, a little like a tmux session. There's one tab that is the kernel console, one tab (at least?) which is a (privileged?) command line, and programs can exist in other tabs. Each tab is either a (rich) text widget command line or a raster display. When a program (task?) is started, it gets exclusive access over one tab, and releases it when it is done.

We will see as we go how much it actually ends up being a special-case privileged thing, or whether it's just a normal process that has been given a few extra permissions.

#System calls and interrupts

There is a struct that is passed to each process when it is started, which contains function pointers to various system calls. We'll call it the "system call table". The function pointers are just normal functions using extern "C" ABI. It should probably just be a single struct in the kernel, shared between processes; there's not much point to making processes have separate copies, and it gets a little weird 'cause process structures may move around.
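
A minimal sketch of what that single shared table could look like; the specific calls and names here are placeholders rather than the real API.

/// One instance of this lives in kernel space; every process gets a pointer
/// to it at startup. (Call names are illustrative.)
#[repr(C)]
pub struct SyscallTable {
    pub exit: extern "C" fn(code: usize) -> !,
    pub log_write: extern "C" fn(ptr: *const u8, len: usize) -> isize,
    pub spawn_task: extern "C" fn(image: *const u8, len: usize, mem_size: usize) -> isize,
}

extern "C" fn sys_exit(_code: usize) -> ! {
    loop {} // stub
}
extern "C" fn sys_log_write(_ptr: *const u8, _len: usize) -> isize {
    0 // stub
}
extern "C" fn sys_spawn_task(_image: *const u8, _len: usize, _mem_size: usize) -> isize {
    -1 // stub
}

pub static SYSCALLS: SyscallTable = SyscallTable {
    exit: sys_exit,
    log_write: sys_log_write,
    spawn_task: sys_spawn_task,
};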

A system call involves a context switch from the user program to the operating system. It saves the user program's registers to the process structure and switches to a per-CPU OS stack, does its stuff, then restores the user's registers back when it's done. Because juggling registers is a pain in the ass and gets exponentially worse the more you do it, interrupts are disabled during system calls. So we should design system calls to be short. An async programming model for system calls is a good idea anyway, even if it's not exactly oldschool.

Interrupts spill registers onto whatever stack they happen to be on, then restore them when it's done. We could have an interrupt-specific stack per CPU, and in fact did once upon a time, but it was kind of a pain to set up and was not large, so the actual benefits were minimal.

#HAL

Hardware abstraction layer currently just consists of a single serial device that may or may not exist, which is designated the "kernel console". This is where kernel debug messages get output to.

Devices we want to support:

  • [x] Bootloader via SBI
  • [x] Serial devices (uart16550)
  • [ ] Ethernet devices (virtio network device)
  • [ ] Hard drives via the virtio block device -- GPT and some kind of FAT32 library at least (fat32 or fatfs or such)
  • [x] 2D raster graphics of some kind via the qemu virtio-gpu-device driver and maybe the embedded-graphics crate.
  • [ ] Realtime clock (goldfish-rtc or whatever qemu gives us)
  • [ ] virtio input device

Supporting more than one kind of any particular device is a Power Goal we are not going to worry about right now. An abstract interface for devices is something we should have but isn't necessary for now.
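
For when that abstract interface does become worth having, here's one possible shape for the serial/console end of it; the names are just a sketch, not the current HAL.

/// Hypothetical device-abstraction trait, sketched for the kernel console only.
pub trait ConsoleDevice: Send {
    /// Blocking write of a single byte to the device.
    fn write_byte(&mut self, byte: u8);
    /// Non-blocking read; None means no byte is currently pending.
    fn read_byte(&mut self) -> Option<u8>;
}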

#Storage/Filesystem

Content-addressed. That's all I got right now. Probably an append-only (but garbage-collected) content-addressed block store, and a more user-friendly index over it that lets you assign names to things and stuff like that. Maybe tag-based indexing system.
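
The core idea as a toy: a block's name is a hash of its contents, so identical blocks dedupe automatically and a name can never lie about what it points to. (FNV-1a here is a stand-in for a real cryptographic hash, and std collections are used purely for brevity; none of this is the actual design.)

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct BlockId(u64);

#[derive(Default)]
struct BlockStore {
    blocks: HashMap<BlockId, Vec<u8>>,
}

impl BlockStore {
    fn put(&mut self, data: Vec<u8>) -> BlockId {
        let id = BlockId(fnv1a(&data));
        self.blocks.entry(id).or_insert(data); // identical content dedups itself
        id
    }
    fn get(&self, id: BlockId) -> Option<&[u8]> {
        self.blocks.get(&id).map(Vec::as_slice)
    }
}

fn fnv1a(bytes: &[u8]) -> u64 {
    bytes.iter().fold(0xcbf2_9ce4_8422_2325u64, |h, &b| {
        (h ^ b as u64).wrapping_mul(0x100_0000_01b3)
    })
}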

#Limitations

Runs on 64-bit platforms only. Runs on RISC-V only, though at least some attempts have been made to make it remotely possible to port to other platforms. (An AArch64 port might be fun someday, either on virt or Raspberry Pi hardware.)

  • Minimum memory: the kernel tries to get a 4 MB heap all to itself, so that plus whatever the code size is, plus some for userland. Currently the VM is set to have 128 MB of memory, which is way more than enough.
  • Maximum memory: whatever fits in the physical memory map.
  • Maximum number of CPU's/harts: 8, the max the qemu virt platform supports, though changing it should only involve changing one constant and recompiling.

#Random Ideas

Make this single processing, more or less. There can probably be separate programs that run on other cores/harts, but they will probably be basically non-interactive "background" or "service" processes and might need a different API (for example, may not be able to spawn other background processes). Probably no actual userland threads, each program probably gets one hart or is put to sleep. Maybe play with cooperative multithreading or such.

Maybe have the user shell be a unique "console" or "supervisor" program with more privileges than background processes.

I kinda want kernel system calls to just be a table of function pointers, 'cause early MacOS did that and it always seemed cute. (When combined with cooperative multitasking this makes the kernel look a hell of a lot like a DLL that the user programs just call into.)

It would be cool to be able to run rustc on this, but oh gods. For now let's not even try to go there.

Webassembly? Don't tempt me.

Would be cool to have a memory debugger type thing that could actually draw the memory map for the OS as a whole, and also the different sections of each process.

#Filesystem

Good effin' question. FAT-32? Image-based a la Smalltalk? Something different and weird? I'm not particularly interested in files for this particular use case.

Maybe an exclusive single-level space per-process, with the Monitor being the one that lets files move around between them. I do like the idea of file access being deny-by-default, programs can only access their own little sandbox unless explicitly permitted otherwise.

CONTENT ADDRESSING! OF COURSE!

#UI

I'm imagining a simple tab-based GUI with a Lisp-based shell. Maybe it hooks into the stdout-equivalent of other background programs for faux-multiprocessing.

I do like the idea of having the "shell" be something richer than just stdin+stdout. Maybe an input stream, output stream, (structured?) logging stream, raster canvas each program can draw on, tabs/controls to switch between them, and around it lie a few basic control and status things. Imagine a tabbed GUI shoved into a particularly blinged-out tmux instance.

Steal some of the UI aesthetic from the game Paradise Killer. In all seriousness there's plenty of room for some basic aesthetics and display hacks.

I kinda want to come up with a Better Terminal Protocol. Handle mouse, events, window resizes etc. entirely with in-band messages. Basically developing a terminal-like render abstraction and making an inline command stream to make it Do Stuff, then just having everything run in a TUI. ...Maybe not worth it for this project.

Inspiration for old IBM user consoles can be found here: https://www.righto.com/2019/04/iconic-consoles-of-ibm-system360.html Go ahead and find other old consoles from Burroughs or PDP systems too. Having the default pane be a system status display that can drop you into a debugger is somewhat appealing.

References for Lisp impl's and other embeddable languages in Rust. Some might be useful, some might simply be inspiration.

Just using Fennel would not be a bad choice though! It does need a Lua runtime underneath, which means retargeting a C compiler.

Here, this is the most beautiful TUI I've ever seen: https://github.com/ggerganov/imtui Maybe something like that by default would be a way to go for GUI stuff.

#Networking

It would be nice to be able to talk to other things over ethernet. Not super interesting to me yet, but eventually. Can we run an ethernet driver as a background task?

#Userland

Fennel for shell. EDN for config files, or maybe Dhall if I'm feeling masochistic enough.

#Lessons Learned

#RISC-V stuff

Using RISC-V on qemu's virt machine setup is way, way, way, way nicer than any real hardware, but is close enough to real hardware that you are forced to pay attention to the details. I would highly recommend it for any budding OS dev. You get to the interesting stuff way faster, but the hardware peripherals are real enough to be useful and give you practice for dealing with hairier real hardware someday in the future.

Turns out RISC-V is in some ways lower level than x86. (Some would say just "simpler".) Interrupts are a good example: x86 has this complicated machinery of descriptors and reserved memory spaces and stuff that handles context swaps on interrupts. (Traps, to use the RV term.) You give it a place to stick stack pointers and spill registers and all this random stuff, then still have to write code to handle the edge cases. RV on the other hand basically just gives you a single register on each hart, sscratch, and says "yeah that oughta let you do whatever you need to do". On the one side it's easy to understand and set up at a basic level, on the other side it would be sort of nice to have more than one register available and not have to hand-roll your own interrupt stack frames and such. This has actually forced my design a little bit, it makes context switching outside of interrupts difficult.

Writing assembly for a RISC CPU is occasionally a little bit of a pain in the ass, mainly when you're doing "weird" things like making long jumps to arbitrary addresses or writing interrupt handlers. The consistent instructions and convenient calling conventions are really nice though. Amusingly, 32 registers is more than I can pay attention to at any given time, so if I were actually writing user code in assembly a compiler would probably do a better job of register allocation than I would.

Also, assuming RISC-V is a good example of what 2020-era hardware likes, the original RISC principle of "each instruction only does one thing" is now out of date. An example is the compare-and-branch instructions, which do two things: compare, and branch. But the extra difficulty of making the CPU handle the control dependency is worth it because it eliminates a data dependency; there is an extra bit of internal state in executing the instruction, but there is no flags register that multiple instructions have to try to read and write at once. Data flow dictates what the best form for the code is. Similarly, no arithmetic instructions cause traps/interrupts, even on invalid input (such as 1/0), but rather produce a guard value that can be checked against. An implicit global data dependency (do I trap or not?) has been replaced by an explicit local one (check this one register against a guard value).
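
For instance (a sketch, riscv64 only, purely to show the no-trap behaviour): divu on a zero divisor just yields all-ones, which the caller checks for instead of catching a trap.

/// DIVU never traps; it returns all-ones when the divisor is zero.
/// (Illustrative sketch; riscv64 with the M extension.)
#[cfg(target_arch = "riscv64")]
pub fn checked_divu(a: u64, b: u64) -> Option<u64> {
    let q: u64;
    unsafe {
        core::arch::asm!("divu {0}, {1}, {2}", out(reg) q, in(reg) a, in(reg) b);
    }
    // q == u64::MAX is the guard value for division by zero. (A real check
    // still looks at b, since a legitimate division can also produce MAX.)
    if b == 0 { None } else { Some(q) }
}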

Entertainingly, there's now even a proposal for RISC-V to have multi-register push/pop instructions (The Zce proposal), which is pretty damn far from the "RISC philosophy". Pure models don't often map to reality well, at least not for long!

Turns out most/all(?) multi-core CPU's have a distinction between on-core traps, such as invalid instruction faults, and off-core interrupts, such as a serial device becoming ready to read/write stuff. The on-core ones generally go to a per-core interrupt controller, while the off-core ones go through a separate external interrupt controller that decides which core to route it to.

Interestingly, I think that doing a context switch in RISC-V is actually impossible outside of an interrupt. You need a pointer in a register to load the registers of the task you are switching to, which is doable, but then you also need an address in a register specifying the instruction to jump to when you exit the context switch. I'm no expert, but I haven't been able to juggle registers around properly; I always end up needing a free register somewhere that I don't have. I thought cooperative context switches could be done using a function call that just returns to the value in the ra register, but I haven't been able to make that work in practice either. Inside an interrupt it becomes much more feasible because you have the sscratch register (and know something else isn't gonna be using it), which can hold a value for whatever you want (and the csrrw instruction swaps its value with another register, so reading it loses no state), and you have the sepc register, which lets you tell the CPU where to return to without altering the ra register of the context you're about to return to anyway. Still, I think this deserves a bit more work; cooperative context switches really should be possible, just gotta make them work.

https://wiki.alopex.li/RiscVNotes in general.

#OS stuff

I really miss having guard pages at the ends of my stacks. Otherwise you basically don't detect stack overflows until something dies horribly. I guess there's some drawbacks to rocking it old school. Adding canary values at the end of the stacks and checking them routinely (on interrupt exit, for example) has actually caught some bugs though, so it gives you a fighting chance.
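
A minimal sketch of that canary trick; the constant and placement are illustrative.

const STACK_CANARY: u64 = 0xDEAD_BEEF_CAFE_F00D;

/// Write the canary at the low end (overflow end) of a downward-growing stack.
unsafe fn arm_canary(stack_bottom: *mut u64) {
    stack_bottom.write_volatile(STACK_CANARY);
}

/// Called routinely (e.g. on interrupt exit) to catch an overflow after the fact.
unsafe fn check_canary(stack_bottom: *const u64) {
    if stack_bottom.read_volatile() != STACK_CANARY {
        panic!("kernel stack overflow detected");
    }
}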

Memory is kind of more important than CPU power, in terms of OS design. Having tiny amounts of memory warps an OS's design way more than having a slow CPU does. Same with a tiny address space. You can always just drop in a faster CPU of the same architecture, at least since the 70's or so, but adding more memory or a bigger address space can fundamentally change how your OS works. Even though I'm on a deliberately restricted system, a slow VM with 128 MB of memory and no virtual mem, I have a bigass 64-bit address space. This means a lot of my OS looks kinda modern, without a lot of the weird hacks and limitations that were necessary on 16-bit or 24-bit systems.

This sort of extends to hardware, too. The Intel 8080 had 16 bits of address space, plus 256 I/O ports. The I/O ports are basically a separate 8-bit address space dedicated to I/O. This means the whole 64k address space could be dedicated to RAM, with no space needed for mem-mapped I/O. It would have been nicer all around to have a 17-bit address space with some of it devoted to mem-mapped I/O... maybe even nicer on the hardware side, since you could reuse some address and I/O pins. But you would have needed bigger registers to be able to address that space, and that wasn't going to happen, so they didn't do that, and so you have this less-clean design with its dedicated I/O space... but you get to use the full 64k address space devoted to RAM. These days, where 32-bit microcontrollers can cost a few cents and have 32-bit registers and a full 32-bit address space even if there's no way you can attach that much memory to it, having big chunks of your address space devoted to mem-mapped I/O is easy. You can have an all-around better design, because you don't need to work around the limitations implicit in a tiny address space, because it's easy to throw enough transistors at the problem to support 32-bit registers and ALU's and such everywhere.

The oldschool style of design is synchronous and single-threaded, because single-threaded is simple, and if you only have one thread happening anyway then having system calls and such block is usually pretty good. The modern style of design, which has really picked up pace since 2010 or so, is asynchronous and multi-threaded, which similarly go well together. If you have lots of cores you need lots of threads to use them, and if you have lots of threads then you want to minimize the cost involved with context switches, shuffling data between them, and the accounting involved. So a non-blocking model becomes more and more useful. Networked systems are an extension of this; having a system call send an event from one thread to another on separate cores is structurally very similar to one computer sending an RPC call to another. The API's that make both efficient are similar, it's just a matter of the magnitudes of the overhead.

This trend towards non-blocking code is not a fundamental law however, it is an implementation detail. (I prefer the term "non-blocking" to "asynchronous" since it's more descriptive and "async" often refers to particular feature sets in particular languages.) If we could always replace two computers with one twice as fast, we usually would. In practice though, Moore's Law has run out and making a computer twice as fast is much more than twice as difficult/expensive, so we end up with datacenters full of thousands of small computers working together instead of a few computers the size of shipping containers.

#Known bugs

  • None, any behavior is currently a feature.
  • The kernel memory allocator is a little hecked, see the "next steps" section