A small toy OS written in Rust, just to take a break from other stuff. I want to go intentionally minimalist and somewhat oldschool, so this will probably look kinda like old OS's, and kinda like embedded RTOS's. Direct-ish inspirations include DOS, classic MacOS, Oberon, GEOS, GEM, and others.
Design goals:
Total time put into this: approx 2.5 months. I'm a little sick of it right now, so it's time to put it down for a little while again. Interrupts and hardware driver support are decent, and we have a real video driver even if it doesn't do much yet. Here's where it's at:
include_bytes!()
Next steps for when I pick this up again:
rustup install nightly
rustup +nightly component add rust-src
rustup target add riscv64gc-unknown-none-elf
rustup update
apt install qemu-system-misc binutils-riscv64-unknown-elf
# Non-critical but useful tools
apt install device-tree-compiler
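Building is then the usual cross-compiling cargo invocation. The exact flags depend on the project's cargo config, but something like this is typical for a no_std RISC-V kernel (the -Z build-std part only applies if core/alloc get built from source, which is what the rust-src component above is for):

cargo +nightly build --release --target riscv64gc-unknown-none-elf -Z build-std=core,alloc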
GPL-2.0 and not interested in arguing about it
Target machine: qemu-system-riscv64 -machine virt
We should at least vaguely try to have support for different machines. ...Eh, maybe later.
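For reference, booting it under qemu looks roughly like this; the kernel path is an illustrative guess, and qemu's default OpenSBI firmware handles the early setup:

qemu-system-riscv64 -machine virt -smp 4 -m 128M -nographic -kernel target/riscv64gc-unknown-none-elf/release/kernel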
Hardware:
The virt machine for RV provides a reasonable selection of hardware, but no graphics device. Also note that per the qemu docs, the virt machine is guaranteed to be stable between minor versions of qemu but not between major versions, so the qemu 6.x virt machine may be different.

The qemu virt machine memory map, for qemu 5.2:
address-space: memory
0000000000000000-ffffffffffffffff (prio 0, i/o): system
0000000000001000-000000000000ffff (prio 0, rom): riscv_virt_board.mrom
0000000000100000-0000000000100fff (prio 0, i/o): riscv.sifive.test
0000000000101000-0000000000101023 (prio 0, i/o): goldfish_rtc
0000000002000000-000000000200ffff (prio 0, i/o): riscv.sifive.clint
0000000003000000-000000000300ffff (prio 0, i/o): gpex_ioport
000000000c000000-000000000c20ffff (prio 0, i/o): riscv.sifive.plic
0000000010000000-0000000010000007 (prio 0, i/o): serial
0000000010001000-00000000100011ff (prio 0, i/o): virtio-mmio
0000000010002000-00000000100021ff (prio 0, i/o): virtio-mmio
0000000010003000-00000000100031ff (prio 0, i/o): virtio-mmio
0000000010004000-00000000100041ff (prio 0, i/o): virtio-mmio
0000000010005000-00000000100051ff (prio 0, i/o): virtio-mmio
0000000010006000-00000000100061ff (prio 0, i/o): virtio-mmio
0000000010007000-00000000100071ff (prio 0, i/o): virtio-mmio
0000000010008000-00000000100081ff (prio 0, i/o): virtio-mmio
0000000020000000-0000000021ffffff (prio 0, romd): virt.flash0
0000000022000000-0000000023ffffff (prio 0, romd): virt.flash1
0000000030000000-000000003fffffff (prio 0, i/o): alias pcie-ecam @pcie-mmcfg-mmio 0000000000000000-000000000fffffff
0000000040000000-000000007fffffff (prio 0, i/o): alias pcie-mmio @gpex_mmio 0000000040000000-000000007fffffff
0000000080000000-000000009fffffff (prio 0, ram): riscv_virt_board.ram
Tools:
dtc, Debian package device-tree-compiler
Useful qemu monitor commands:
xp /LEN 0xADDR -- Dump physical memory at address
info mtree -- memory map
info pci -- PCI tree
info qtree -- device tree
info roms -- ROMs, plus loaded files such as kernels/bootloaders
info mem -- Virtual memory info
qemu-system-riscv64 -device help -- prints out ID's for all supported devices
qemu-system-riscv64 -machine virt -machine dumpdtb=whatever.dtb -- Dump device tree blob; use dtc to decompile it to text.

Why is it the way it is?
For fun.
No virtual memory. No memory protection. Kernel is loaded at a fixed location and has a fixed-size stack and heap; the kernel should be designed so that it hopefully never needs more than a given amount of heap memory. All userland executables are position-independent code.
There is (probably) one privileged process, the Monitor, which is the one the user generally interacts with. It forms the "shell" that handles user interaction and has special privileges.
The Monitor can start background processes, we'll call them Tasks. They are less privileged; they cannot start other Tasks (for now), have fixed size memory allocations (for now), and can only access system calls/hardware resources/capabilities that the Monitor gives them. I think of them a little like processes in old batch-processing mainframes, where they just do a bunch of data crunching and have a fixed output. There may be a fixed maximum number of Tasks.
All processes just run in Supervisor mode, same as the OS. Considered making them run in User mode, but without memory protection there's not really any benefit to doing so. A malicious program could always just overwrite the OS's interrupt handler with its own and gain privilege that way.
Each process has a single thread and runs on a single CPU. (For now.) Processes get pre-empted by timer interrupts and those may schedule different processes to run, though it probably won't exactly be sophisticated. I wasn't originally planning on doing it this way, but frankly it ended up being so obvious it was kind of a shoo-in.
A process gets a fixed chunk of memory when it is started, and does not get more ever. The amount of memory to give to a process is an input to the "spawn process" function call, and that function call just fails if there's not enough memory available. The process gets loaded at the beginning of this section of memory, and canonically the stack starts at the end of it and grows downward. The process doesn't need to actually use a stack, technically, but system calls are functions following the C ABI for the platform so really you're gonna need a stack. Interrupts also use the stack, so.
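As a rough sketch of what that call implies (the names, error cases, and signature here are made up for illustration, not the actual kernel API):

/// Handle for a running process.
pub struct ProcessId(pub usize);

/// Why a spawn can fail.
pub enum SpawnError {
    /// Not enough contiguous memory for the requested allocation.
    OutOfMemory,
    /// The ELF image could not be parsed or is not position-independent.
    BadExecutable,
}

/// Start a new process from an ELF image, giving it `mem_size` bytes total.
pub fn spawn_process(elf_image: &[u8], mem_size: usize) -> Result<ProcessId, SpawnError> {
    // A real implementation would allocate `mem_size` bytes, copy the
    // position-independent executable to the start of the allocation, point
    // the initial stack pointer at the end of it, and hand the process to
    // the scheduler. Here we only show the shape of the call.
    let _ = (elf_image, mem_size);
    Err(SpawnError::OutOfMemory)
}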
Because there's no virtual memory, all programs must be position-independent code. I was considering having essentially a compacting GC for processes that moved them around in memory to minimize fragmentation, but that would still need fix-ups of non-PC-relative pointers that might get created during the execution of the process, so it will probably never happen.
Executable format: ELF64. It's Fine. No dynamic linking, at least not implemented by the kernel loader itself.
TODO: What about Tasks whose job is unbounded I/O, such as network servers? Ponder this. Those require some kind of blocking to be efficient, which means context switching.
The Monitor as I'm envisioning it currently is a tabbed frame with some status info on it, a little like a tmux session. There's one tab that is the kernel console, one tab (at least?) which is a (privileged?) command line, and programs can exist in other tabs. Each tab is either a (rich) text widget command line or a raster display. When a program (task?) is started, it gets exclusive access over one tab, and releases it when it is done.
We will see as we go how much it actually ends up being a special-case privileged thing, or whether it's just a normal process that has been given a few extra permissions.
There is a struct that is passed to each process when it is started, which contains function pointers to various system calls. We'll call it the "system call table". The function pointers are just normal functions using extern "C" ABI. It should probably just be a single struct in the kernel, shared between processes; there's not much point to making processes have separate copies, and it gets a little weird 'cause process structures may move around.
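A minimal sketch of what such a table might look like; the field names and signatures here are assumptions for illustration, not the kernel's actual interface:

/// Table of kernel entry points handed to every process at startup.
/// One shared copy lives in the kernel; each process just gets a pointer.
#[repr(C)]
pub struct SyscallTable {
    /// Write bytes somewhere visible (e.g. the process's console tab);
    /// returns the number of bytes accepted.
    pub console_write: extern "C" fn(buf: *const u8, len: usize) -> usize,
    /// Give up the rest of the current timeslice.
    pub yield_now: extern "C" fn(),
    /// Terminate the calling process with the given exit code.
    pub exit: extern "C" fn(code: i32) -> !,
}

/// The entry point a process exposes, also plain C ABI: it receives the
/// table and uses it for everything it needs from the kernel.
pub type ProcessEntry = extern "C" fn(syscalls: &'static SyscallTable) -> i32;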
A system call involves a context switch from the user program to the operating system. It saves the user program's registers to the process structure and switches to a per-CPU OS stack, does its stuff, then restores the user's registers back when it's done. Because juggling registers is a pain in the ass and gets exponentially worse the more you do it, interrupts are disabled during system calls. So we should design system calls to be short. An async programming model for system calls is a good idea anyway, even if it's not exactly oldschool.
Interrupts spill registers onto whatever stack they happen to be on, then restore them when it's done. We could have an interrupt-specific stack per CPU, and in fact did once upon a time, but it was kind of a pain to set up and was not large, so the actual benefits were minimal.
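The saved state implied by all that register juggling is roughly the following (an illustrative layout, not necessarily the kernel's actual process structure):

/// Everything that has to be spilled to switch away from a running context.
#[repr(C)]
pub struct SavedRegs {
    /// x1..x31; x0 is hardwired to zero on RISC-V so it never needs saving.
    pub regs: [u64; 31],
    /// Program counter to resume at (what sepc held when the trap was taken).
    pub pc: u64,
}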
Hardware abstraction layer currently just consists of a single serial device that may or may not exist, which is designated the "kernel console". This is where kernel debug messages get output to.
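Something like the following is the level of abstraction that implies; the trait and method names here are made up for illustration:

/// A byte-at-a-time console device; the kernel logs to one of these if a
/// serial port was found, and drops output otherwise.
pub trait KernelConsole {
    /// Blocking write of a single byte.
    fn write_byte(&mut self, b: u8);
    /// Non-blocking read; None if nothing is waiting.
    fn read_byte(&mut self) -> Option<u8>;
}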
Devices we want to support:
virtio block device -- gpt and some kind of fat32 library at least, fat32 or fatfs or such
virtio-gpu-device driver and maybe the embedded-graphics crate.

Supporting more than one kind of any particular device is a Power Goal we are not going to worry about right now. An abstract interface for devices is something we should have but isn't necessary for now.
Content-addressed. That's all I got right now. Probably an append-only (but garbage-collected) content-addressed block store, and a more user-friendly index over it that lets you assign names to things and stuff like that. Maybe tag-based indexing system.
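To make that a bit more concrete, one possible shape for those two layers (hash size, trait names, and methods are all guesses, not a design commitment):

/// A block is addressed by the hash of its contents.
pub struct BlockId(pub [u8; 32]);

/// The append-only, garbage-collected store underneath.
pub trait BlockStore {
    /// Append a block and get back the hash it is now addressed by.
    fn put(&mut self, data: &[u8]) -> BlockId;
    /// Fetch a block by hash, if it exists and has not been collected.
    fn get(&self, id: &BlockId) -> Option<&[u8]>;
}

/// The friendlier index on top: human names (or tags) pointing at hashes.
pub trait NameIndex {
    fn bind(&mut self, name: &str, id: BlockId);
    fn lookup(&self, name: &str) -> Option<BlockId>;
}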
Runs on 64 bit platforms only. Runs on RISC-V only, though at least some attempts have been made to make it remotely possible to port to other platforms. (An AArch64 port might be fun someday, either on virt or raspberry pi hardware.) Minimum memory: The kernel tries to get a 4 MB heap all to itself, so that plus whatever the code size is plus some for userland. Currently the VM is set to have 128 MB of memory, which is way more than enough. Maximum memory: Whatever fits in the physical memory map. Maximum number of CPU's/harts: 8 -- the max the qemu virt platform supports, though changing it should only involve changing one constant and recompiling.
Make this single processing, more or less. There can probably be separate programs that run on other cores/harts, but they will probably be basically non-interactive "background" or "service" processes and might need a different API (for example, may not be able to spawn other background processes). Probably no actual userland threads, each program probably gets one hart or is put to sleep. Maybe play with cooperative multithreading or such.
Maybe have the user shell be a unique "console" or "supervisor" program with more privileges than background processes.
I kinda want kernel system calls to just be a table of function pointers, 'cause early MacOS did that and it always seemed cute. (When combined with cooperative multitasking this makes the kernel look a hell of a lot like a DLL that the user programs just call into.)
It would be cool to be able to run rustc on this, but oh gods. For now let's not even try to go there.
Webassembly? Don't tempt me.
Would be cool to have a memory debugger type thing that could actually draw the memory map for the OS as a whole, and also the different sections of each process.
Good effin' question. FAT-32? Image-based a la Smalltalk? Something different and weird? I'm not particularly interested in files for this particular use case.
Maybe an exclusive single-level space per-process, with the Monitor being the one that lets files move around between them. I do like the idea of file access being deny-by-default, programs can only access their own little sandbox unless explicitly permitted otherwise.
CONTENT ADDRESSING! OF COURSE!
I'm imagining a simple tab-based GUI with a Lisp-based shell. Maybe it hooks into the stdout-equivalent of other background programs for faux-multiprocessing.
I do like the idea of having the "shell" be something richer than just stdin+stdout. Maybe an input stream, output stream, (structured?) logging stream, raster canvas each program can draw on, tabs/controls to switch between them, and around it lie a few basic control and status things. Imagine a tabbed GUI shoved into a particularly blinged-out tmux instance.
Steal some of the UI aesthetic from the game Paradise Killer. In all seriousness there's plenty of room for some basic aesthetics and display hacks.
I kinda want to come up with a Better Terminal Protocol. Handle mouse, events, window resizes etc. entirely with in-band messages. Basically developing a terminal-like render abstraction and making an inline command stream to make it Do Stuff, then just having everything run in a TUI. ...Maybe not worth it for this project.
Inspiration for old IBM user consoles can be found here: https://www.righto.com/2019/04/iconic-consoles-of-ibm-system360.html Go ahead and find other old consoles from Burroughs or PDP systems too. Having the default pane be a system status display that can drop you into a debugger is somewhat appealing.
References for Lisp impl's and other embeddable languages in Rust. Some might be useful, some might simply be inspiration.
Just using Fennel is not going to be a bad choice though! It does need a C compiler retargeted to this platform, however.
Here, this is the most beautiful TUI I've ever seen: https://github.com/ggerganov/imtui Maybe something like that by default would be a way to go for GUI stuff.
It would be nice to be able to talk to other things over ethernet. Not super interesting to me yet, but eventually. Can we run an ethernet driver as a background task?
Fennel for shell. EDN for config files, or maybe Dhall if I'm feeling masochistic enough.
Using RISC-V on qemu's virt machine setup is way, way, way, way nicer than any real hardware, but is close enough to real hardware that you are forced to pay attention to the details. I would highly recommend it for any budding OS dev. You get to the interesting stuff way faster, but the hardware peripherals are real enough to be useful and give you practice for dealing with hairier real hardware someday in the future.
Turns out RISC-V is in some ways lower level than x86. (Some would say just "simpler".) Interrupts are a good example: x86 has this complicated machinery of descriptors and reserved memory spaces and stuff that handles context swaps on interrupts. (Traps, to use the RV term.) You give it a place to stick stack pointers and spill registers and all this random stuff, then still have to write code to handle the edge cases. RV on the other hand basically just gives you a single register on each hart, sscratch, and says "yeah, that oughta let you do whatever you need to do". On the one side it's easy to understand and set up at a basic level; on the other side it would be sort of nice to have more than one register available and not have to hand-roll your own interrupt stack frames and such. This has actually forced my design a little bit, since it makes context switching outside of interrupts difficult.
Writing assembly for a RISC CPU is occasionally a little bit of a pain in the ass, mainly when you're doing "weird" things like making long jumps to arbitrary addresses or writing interrupt handlers. The consistent instructions and convenient calling conventions are really nice though. Amusingly, 32 registers is more than I can pay attention to at any given time, so if I were actually writing user code in assembly a compiler would probably do a better job of register allocation than I would.
Also, assuming RISC-V is a good example of what 2020-era hardware likes, the original RISC principle of "each instruction only does one thing" is now out of date. An example is the compare-and-branch instructions, which do two things: compare, and branch. But the extra difficulty of making the CPU handle the code dependency is worth it because it eliminates a data dependency; there is an extra bit of internal state in executing the instruction, but there is no flags register that multiple instructions have to try to read and write at once. Data flow dictates what the best form for the code is. Similarly, no arithmetic instructions cause traps/interrupts, even on invalid input (such as 1/0), but rather produce a guard value that can be checked against. An implicit global data dependency (do I trap or not?) has been replaced by an explicit local one (check this one register against a guard value).
Entertainingly, there's now even a proposal for RISC-V to have multi-register push/pop instructions (the Zce proposal), which is pretty damn far from the "RISC philosophy". Pure models don't often map to reality well, at least not for long!
Turns out most/all(?) multi-core CPU's have a distinction between on-core traps, such as invalid instruction faults, and off-core interrupts, such as a serial device becoming ready to read/write stuff. The on-core ones generally go to a per-core interrupt controller, while the off-core ones go through a separate external interrupt controller that decides which core to route it to.
Interestingly, I think that doing a context switch in RISC-V is actually impossible outside of an interrupt. You need a pointer in a register to load the registers of the task you are switching to, which is doable, but then you also need an address in a register specifying the instruction to jump to when you exit the context switch. I'm no expert, but I haven't been able to juggle registers around properly; I just always need a free register somewhere that I don't have one. I thought cooperative context switches could be done using a function call that just returns to the value in the ra register, but I haven't been able to make that work in practice either. Inside an interrupt it becomes much more feasible because you have the sscratch register (and know something else isn't gonna be using it), which can hold a value for whatever you want (and the csrrw instruction swaps its value with another register, so reading it loses no state), and you have the sepc register which lets you tell the CPU where to return to without altering the ra register of the context you're about to return to anyway. Still, I think this deserves a bit more work; cooperative context switches really should be possible, just gotta make them work.
https://wiki.alopex.li/RiscVNotes in general.
I really miss having guard pages at the ends of my stacks. Otherwise you basically don't detect stack overflows until something dies horribly. I guess there are some drawbacks to rocking it old school. Adding canary values at the ends of the stacks and checking them routinely (on interrupt exit, for example) has actually caught some bugs though, so it gives you a fighting chance.
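The canary trick is simple enough to sketch in a few lines (illustrative, not the project's actual code):

// A recognizable pattern written at the far (low) end of each stack.
const STACK_CANARY: u64 = 0xDEAD_BEEF_CAFE_F00D;

/// Arm the canary when the stack is created; `stack_bottom` is the lowest
/// address of the stack's memory region.
unsafe fn arm_canary(stack_bottom: *mut u64) {
    stack_bottom.write_volatile(STACK_CANARY);
}

/// Check it routinely, e.g. on interrupt exit; if the pattern is gone, the
/// stack grew down into it and we at least die loudly instead of silently
/// corrupting whatever lives below.
unsafe fn check_canary(stack_bottom: *const u64) {
    if stack_bottom.read_volatile() != STACK_CANARY {
        panic!("stack overflow detected: canary smashed");
    }
}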
Memory is kind of more important than CPU power, in terms of OS design. Having tiny amounts of memory warps an OS's design way more than having a slow CPU does. Same with a tiny address space. You can always just drop in a faster CPU of the same architecture, at least since the 70's or so, but adding more memory or a bigger address space can fundamentally change how your OS works. Even though I'm on a deliberately restricted system, a slow VM with 128 MB of memory and no virtual mem, I have a bigass 64-bit address space. This means a lot of my OS looks kinda modern, without a lot of the weird hacks and limitations that were necessary on 16-bit or 24-bit systems.
This sort of extends to hardware, too. The Intel 8080 had 16 bits of address space, plus 256 I/O ports. The I/O ports are basically a separate 8-bit address space dedicated to I/O. This means the whole 64k address space could be dedicated to RAM, with no space needed for mem-mapped I/O. It would have been nicer all around to have a 17-bit address space with some of it devoted to mmapped I/O... maybe even nicer on the hardware side, since you could reuse some address and I/O pins. But you would have needed bigger registers to be able to address that space, and that wasn't going to happen, so they didn't do that, and so you have this less-clean design with its dedicated I/O space... but you get to use the full 64k address space devoted to RAM. These days, where 32-bit microcontrollers can cost a few cents and have 32-bit registers and a full 32-bit address space even if there's no way you can attach that much memory to it, having big chunks of your address space devoted to memmapped I/O is easy. You can have an all-around better design, because you don't need to work around the limitations implicit in a tiny address space, because it's easy to throw enough transistors at the problem to support 32-bit register and ALU's and such everywhere.
The oldschool style of design is synchronous and single-threaded, because single-threaded is simple, and if you only have one thread happening anyway then having system calls and such block is usually pretty good. The modern style of design, which has really picked up pace since 2010 or so, is asynchronous and multi-threaded, which similarly go well together. If you have lots of cores you need lots of threads to use them, and if you have lots of threads then you want to minimize the cost involved with context switches, shuffling data between them, and the accounting involved. So a non-blocking model becomes more and more useful. Networked systems are an extension of this; having a system call send an event from one thread to another on separate cores is structurally very similar to one computer sending an RPC call to another. The API's that make both efficient are similar, it's just a matter of the magnitudes of the overhead.
This trend towards non-blocking code is not a fundamental law however, it is an implementation detail. (I prefer the term "non-blocking" to "asynchronous" since it's more descriptive and "async" often refers to particular feature sets in particular languages.) If we could always replace two computers with one twice as fast, we usually would. In practice though, Moore's Law has run out and making a computer twice as fast is much more than twice as difficult/expensive, so we end up with datacenters full of thousands of small computers working together instead of a few computers the size of shipping containers.