BUAA 2023 Spring OS


Prologue

In this lab, we’ll have a glance at system call for the first time.


A “Call” to Answer

For security reasons, certain instructions could not be accessed by user process. However, user process still need to complete certain tasks. Therefore, we have to “call” kernel to do that for us. And we, again, as “kernel”, have to answer such a “call”.

Well, this process is easy to understand since… PassBash also uses this pattern.

User Call

As user, we got a “phone book” to look for functions to call. And this phone book is user/lib/syscall_lib.c. We can see a lot of functions with prefix syscall_, indicating a system call. When we want a service, we can just call specific system call.

But, how can we really call kernel? Well, we do this by executing syscall instruction. By calling this, we’ll sink into kernel immediately, thus make kernel answer. To make this simpler, we wrap system call into one unified interface, which is the so called msyscall.

1
2
3
4
5
// An example system call for user.
void syscall_putchar(int ch)
{
msyscall(SYS_putchar, ch);
}

And for msyscall, we just do syscall to sink into kernel.

1
2
3
4
LEAF(msyscall)
syscall
j ra
END(msyscall)

Kernel Answer

When user calls, we kernel will answer. syscall actually invoke exception No.8 - handle_sys, so our answer starts from this specific function. However, before we actually get into exception handler, we first jump to exception entry, which is a fixed address for MIPS. We set exception entry point in kernel.lds, and exception handler entry function will be loaded here.

1
2
3
4
5
6
7
# kernel.lds
SECTIONS {
# ...
. = 0x80000080;
.exc_gen_entry : { *(.text.exc_gen_entry) }
# ...
}

This will be executed every time there is an exception.

1
2
3
4
5
6
7
8
# kern/entry.S
.section .text.exc_gen_entry
exc_gen_entry:
SAVE_ALL # save all registers to Trapframe
mfc0 t0, CP0_CAUSE
andi t0, 0x7c
lw t0, exception_handlers(t0) # call exception handler
jr t0

So, what is exception_handler? We defined this previously kern/traps.c.

1
2
3
4
5
6
7
void (*exception_handlers[32])(void) = {
[0 ... 31] = handle_reserved,
[0] = handle_int,
[2 ... 3] = handle_tlb,
[1] = handle_mod,
[8] = handle_sys,
};

So we just jump to these specific handlers to handle our exceptions. You may notice that CP0.Cause register is used to get the handlers, so what is it?

image-20230420204633657

We can see that, by andi t0, 0x7c, we get the ExcCode bits in it, which indicates the type of exception to handle.

Notice that, for exception_handlers(t0), t0 is not a index in C. Instead, it is a raw offset! So the lowest 2 bits are zero to make it byte aligned.

Exception Handler

Now that we can jump to exception handlers, let’s have a closer look at them, for example handle_sys here. It is declared using a macro. The actual function is do_syscall.

1
2
# kern/genex.S
BUILD_HANDLER sys do_syscall

And then the declaration for do_syscall in kern/syscall_all.c.

1
void do_syscall(struct Trapframe* tf);

Since all registers (including pc, sp, etc.) will be saved when exception is triggered, we can get the parameters user passed to us from corresponding Trapframe. More specifically, one register for handler type, and five for actual parameters. For type, we also have a table in kernel for handlers in kern/syscall_all.c.

1
2
3
4
5
void* syscall_table[MAX_SYSNO] = {
[SYS_putchar] = sys_putchar,
[SYS_exofork] = sys_exofork,
// ...
};

Register $a0 indicates the type of the call, and $a1 to $a3 and the other 2 in stack are the real parameters. Then, we can get the correct handler for the call and pass parameters to it.

In fact, user could passed at most 6 parameters to system call. However, the first one is the type of call, rather than a valid parameter.

After handling, we set $v0 ($2) to the return value of system call, to send it back to user.

1
tf->regs[2] = func(arg1, arg2, arg3, arg4, arg5);

Here is the complete code for do_syscall.

do_syscall
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
/* Overview:
* Call the function in 'syscall_table' indexed at 'sysno' with arguments from
* user context and stack.
*
* Hint:
* Use sysno from $a0 to dispatch the syscall.
* The possible arguments are stored at $a1, $a2, $a3, [$sp + 16 bytes],
* [$sp + 20 bytes] in order.
* Number of arguments cannot exceed 5.
*/
void do_syscall(struct Trapframe* tf)
{
int (*func)(u_int, u_int, u_int, u_int, u_int); // handler
int sysno = tf->regs[4]; // $a0
if (sysno < 0 || sysno >= MAX_SYSNO)
{
tf->regs[2] = -E_NO_SYS;
return;
}

/* Step 1: Add the EPC in 'tf' by a word (size of an instruction). */
tf->cp0_epc += 4; // to skip syscall instruction

/* Step 2: Use 'sysno' to get 'func' from 'syscall_table'. */
func = syscall_table[sysno];

/* Step 3: First 3 args are stored in $a1, $a2, $a3. */
u_int arg1 = tf->regs[5];
u_int arg2 = tf->regs[6];
u_int arg3 = tf->regs[7];

/* Step 4: Last 2 args are stored in stack at [$sp + 16 bytes], [$sp + 20 bytes]. */
u_int arg4, arg5;
arg4 = *(u_int*)(tf->regs[29] + 16); // $sp is $29
arg5 = *(u_int*)(tf->regs[29] + 20);

/*
* Step 5: Invoke 'func' with retrieved arguments and store its return value
* to $v0 in 'tf'.
*/
tf->regs[2] = func(arg1, arg2, arg3, arg4, arg5);
}

Notice that, for user, system call starts with syscall_, it at last calls the real system call in kernel mode, which starts with sys_.


Basic System Calls

Before we get into some fancy system calls, lets have a look at some fundamental ones. They all located in kern/syscall_all.c.

Identify Process

To know which process to manipulate, we use process id, which, in MOS, is Env::env_id. We simply get a Env from a envid via envid2env function. Its located in kern/env.c. If checkperm is set, we can only get Env that with certain permissions (relation with current env).

1
int envid2env(u_int envid, struct Env** penv, int checkperm);

Notice that, if envid is zero, then this function will degenerate into getting current Env only!

By the way, this is why Env::env_id won’t be zero, since zero indicate current Env.

Here is the complete definition of it.

envid2env
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
/* Overview:
* Convert an existing 'envid' to an 'struct Env *'.
* If 'envid' is 0, set '*penv = curenv', otherwise set '*penv = &envs[ENVX(envid)]'.
* In addition, if 'checkperm' is non-zero, the requested env must be either
* 'curenv' or its immediate child.
*
* Pre-Condition:
* 'penv' points to a valid 'struct Env *'.
*
* Post-Condition:
* return 0 on success, and set '*penv' to the env.
* return -E_BAD_ENV on error (invalid 'envid' or 'checkperm' violated).
*/
int envid2env(u_int envid, struct Env** penv, int checkperm)
{
struct Env* e;

/* Step 1: Assign value to 'e' using 'envid'. */
if (envid == 0)
{
/*
* If envid is zero, this function will degenerate to get current
* env only. If not, envid is bound to be different from e->env_id, since
* e->envid won't be zero. Hence, we must return here.
*/
*penv = curenv;
return 0;
}
else
e = &envs[ENVX(envid)];

if (e->env_status == ENV_FREE || e->env_id != envid)
{
return -E_BAD_ENV;
}

/* Step 2: Check when 'checkperm' is non-zero. */
/* Hints:
* Check whether the calling env has sufficient permissions to manipulate
* the specified env, i.e. 'e' is either 'curenv' or its immediate child.
* If violated, return '-E_BAD_ENV'.
*/
if (checkperm)
{
if (!((e->env_id == curenv->env_id) || (e->env_parent_id == curenv->env_id)))
return -E_BAD_ENV;
}

/* Step 3: Assign 'e' to '*penv'. */
*penv = e;

return 0;
}

For this line, we check e->env_id and envid, but why, we already get one?

1
if (e->env_status == ENV_FREE || e->env_id != envid)

We have 1024 Envs, and the low 10 bits are to present the physical offset of a env to the base address envs. If ew have more than 1024 Envs, the high 22 bits are to ensure uniqueness of Env.

Here, we get e by the lowest 10 bits from envid and ignores the high bits, thus make two different Envs with the same low 10 bits possible. This is perhaps because of getting a out of data Env, or simply a bad envid.

Memory Management

Memory management is important to kernel, there’s no doubt.

sys_mem_alloc

Well, just like malloc, this function requires kernel to allocate a physical page, to make access at given virtual address legal.

1
int sys_mem_alloc(u_int envid, u_int va, u_int perm);

There’s not much thing to say about this one. It just allocate a new page, and use page_insert to map it to the given va. Though trivial, previous page that mapped to va will be unmapped. (see page_insert)

sys_mem_alloc
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
/* Overview:
* Allocate a physical page and map 'va' to it with 'perm' in the address space
* of 'envid'.
* If 'va' is already mapped, that original page is silently unmapped.
* 'envid2env' should be used with 'checkperm' set, like in most syscalls, to
* ensure the target is either the caller or its child.
*
* Post-Condition:
* Return 0 on success.
* Return -E_BAD_ENV: 'checkperm' of 'envid2env' fails for 'envid'.
* Return -E_INVAL: 'va' is illegal (should be checked using 'is_illegal_va').
* Return the original error: underlying calls fail (you can use 'try' macro).
*
* Hint:
* You may want to use the following functions:
* 'envid2env', 'page_alloc', 'page_insert', 'try' (macro)
*/
int sys_mem_alloc(u_int envid, u_int va, u_int perm)
{
struct Env* env;
struct Page* pp;

/* Step 1: Check if 'va' is a legal user virtual address using 'is_illegal_va'. */
if (is_illegal_va(va))
return -E_INVAL;

/* Step 2: Convert the envid to its corresponding 'struct Env *'
using 'envid2env'. */
/* Hint: **Always** validate the permission in syscalls! */
try(envid2env(envid, &env, 1));

/* Step 3: Allocate a physical page using 'page_alloc'. */
try(page_alloc(&pp));

/* Step 4: Map the allocated page at 'va' with permission 'perm' using 'page_insert'. */
return page_insert(env->env_pgdir, env->env_asid, pp, va, perm);
}

You may be curious about is_illegal_va, it just check if va is within user’s space or not.

1
2
3
4
static inline int is_illegal_va(u_long va)
{
return va < UTEMP || va >= UTOP;
}

Breaking News! Understanding of how page_insert works. Basically, it just fill the address field of the page table entry for va with the physical page number of the page. Thus we can use the physical memory though this va. It can be divided into two steps.

  1. First, it ensures page table entry for va exists. If such entry exists, but not mapped to the given page, it will remove the page. If mapped to, it will flush TLB and reset permission. If doesn’t exist, it will create one.
  2. Then, it fills the page table entry with the physical address of the page.

Important! When we talk about mapping or un-mapping a page to virtual memory, we simply fill or clear the page table entry of the given virtual memory.

Again! page_insert does not create new page! It only add the physical address of the given page to a specific page table entry. The same as page_remove.

For a more detailed understanding of virtual memory and page table entry, you can refer to my previous post here: Page Directory Self Mapping.

sys_mem_map

This is a tricky function. What it does is mapping the physical memory, that mapped to source Env with id srcid at virtual address srcva to the destination Env with id dstid at virtual address dstva. That is to say, make these two Env share a same page.

1
int sys_mem_map(u_int srcid, u_int srcva, u_int dstid, u_int dstva, u_int perm);
sys_mem_map

Here is the realization of it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
/* Overview:
* Find the physical page mapped at 'srcva' in the address space of env 'srcid',
* and map 'dstid''s 'dstva' to it with 'perm'.
*
* Post-Condition:
* Return 0 on success.
* Return -E_BAD_ENV: 'checkperm' of 'envid2env' fails for 'srcid' or 'dstid'.
* Return -E_INVAL: 'srcva' or 'dstva' is illegal, or 'srcva' is unmapped in 'srcid'.
* Return the original error: underlying calls fail.
*
* Hint:
* You may want to use the following functions:
* 'envid2env', 'page_lookup', 'page_insert'
*/
int sys_mem_map(u_int srcid, u_int srcva, u_int dstid, u_int dstva, u_int perm)
{
/*
* Step 1: Check if 'srcva' and 'dstva' are legal user virtual addresses using
* 'is_illegal_va'.
*/
if (is_illegal_va(srcva) || is_illegal_va(dstva))
return -E_INVAL;

struct Env* srcenv;
struct Env* dstenv;

/* Step 2: Convert the 'srcid' to its corresponding 'struct Env *' using 'envid2env'. */
try(envid2env(srcid, &srcenv, 1));

/* Step 3: Convert the 'dstid' to its corresponding 'struct Env *' using 'envid2env'. */
try(envid2env(dstid, &dstenv, 1));

/* Step 4: Find the physical page mapped at 'srcva' in the address space of 'srcid'. */
/* Return -E_INVAL if 'srcva' is not mapped. */
struct Page* pp = page_lookup(srcenv->env_pgdir, srcva, NULL);
if (!pp)
return -E_INVAL;

/* Step 5: Map the physical page at 'dstva' in the address space of 'dstid'. */
return page_insert(dstenv->env_pgdir, dstenv->env_asid, pp, dstva, perm);
}

sys_mem_unmap

Well, sys_mem_unmap is similar to its map brother. Just un-map the physical address of given virtual memory, making it no longer valid to access.

1
int sys_mem_unmap(u_int envid, u_int va);
sys_mem_unmap

Just, quite simple.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
/* Overview:
* Unmap the physical page mapped at 'va' in the address space of 'envid'.
* If no physical page is mapped there, this function silently succeeds.
*
* Post-Condition:
* Return 0 on success.
* Return -E_BAD_ENV: 'checkperm' of 'envid2env' fails for 'envid'.
* Return -E_INVAL: 'va' is illegal.
* Return the original error when underlying calls fail.
*/
int sys_mem_unmap(u_int envid, u_int va)
{
struct Env* e;

/* Step 1: Check if 'va' is a legal user virtual address using 'is_illegal_va'. */
if (is_illegal_va(va))
return -E_INVAL;

/* Step 2: Convert the envid to its corresponding 'struct Env *' using 'envid2env'. */
try(envid2env(envid, &e, 1));

/* Step 3: Unmap the physical page at 'va' in the address space of 'envid'. */
page_remove(e->env_pgdir, e->env_asid, va);

return 0;
}

Schedule

Schedule is also an important part of OS. And yield is to give up possession of CPU, due to inability to continue, or some other reasons. It is simple, just make kernel schedule again is OK.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
/* Overview:
* Give up remaining CPU time slice for 'curenv'.
*
* Post-Condition:
* Another env is scheduled.
*
* Hint:
* This function will never return.
*/
void __attribute__((noreturn)) sys_yield(void)
{
// Hint: Just use 'schedule' with 'yield' set.
schedule(1);
}

Notice that it will not return, since schedule is a no-return.


Inter-Process Communication

Well, IPC, huh? The fundamental support of IPC lies within Env, so let’s have a look.

1
2
3
4
5
6
7
8
9
struct Env
{
// ... other members
u_int env_ipc_value; // data value sent to us
u_int env_ipc_from; // envid of the sender
u_int env_ipc_recving; // env receiving status (block 0 or accept 1)
u_int env_ipc_dstva; // va at which to map received page
u_int env_ipc_perm; // perm of page mapping received
};

As receiver, we can decide whether to receive message or not. e.g. We can set Env::env_ipc_receiving to 1, which indicate that we are receiving message, or 0 to block any message, and make sender’s sending request fail. Plus, we need Env::env_ipc_from to know the source of the message.

If we only send a value, we only need Env::env_ipc_value, and set Env::env_ipc_dstva to 0. Otherwise, a valid virtual address is required.

As sender, we can set the receiver’s Env::env_ipc_value and Env::env_ipc_from, and, if there is, map message page to Env::env_ipc_dstva.

Now, let’s see some core functions of IPC.

sys_ipc_recv

This system call make current process starts to receive a message. Such action will block current process from continue, until it receives the desired message. Here is its declaration.

1
int sys_ipc_recv(u_int dstva);

The caller of this system call is current process. If we only want to receive a value, we simply set dstva to 0. Otherwise, we should set it to our desired virtual address. Then, since the process will be blocked, we’ll set its status to ENV_NOT_RUNNABLE and remove it from env_sched_list. At last, we should set the default return value 0 to $v0.

sys_ipc_recv
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
/* Overview:
* Wait for a message (a value, together with a page if 'dstva' is not 0) from
* other envs.
* 'curenv' is blocked until a message is sent.
*
* Post-Condition:
* Return 0 on success.
* Return -E_INVAL: 'dstva' is neither 0 nor a legal address.
*/
int sys_ipc_recv(u_int dstva)
{
/* Step 1: Check if 'dstva' is either zero or a legal address. */
if ((dstva != 0) && is_illegal_va(dstva))
return -E_INVAL;

/* Step 2: Set 'curenv->env_ipc_recving' to 1. */
curenv->env_ipc_recving = 1;

/* Step 3: Set the value of 'curenv->env_ipc_dstva'. */
curenv->env_ipc_dstva = dstva;

/*
* Step 4: Set the status of 'curenv' to 'ENV_NOT_RUNNABLE' and remove it from
* 'env_sched_list'.
*/
curenv->env_status = ENV_NOT_RUNNABLE;
TAILQ_REMOVE(&env_sched_list, curenv, env_sched_link);

/* Step 5: Give up the CPU and block until a message is received. */
((struct Trapframe*)KSTACKTOP - 1)->regs[2] = 0;
schedule(1); // with yeild set
}

sys_ipc_try_send

He’s waiting for us! Let’s send him a message! It is quite straightforward, with all parameters receiver need. Notice that here we use ‘try’ send, because we may fail to send due to target process is not receiving or other errors.

1
int sys_ipc_try_send(u_int envid, u_int value, u_int srcva, u_int perm);

Since the receiver was suspended because of waiting for message, we need to wake him up after we send the message. If we also send a page, we should map our page to Env::env_ipc_dstva of our receiver using page_insert. And set its Env::env_ipc_recving 0 to mark completion.

The same with page_insert, we only link sender’s page to receiver’s page table entry. Only.

One more thing, in this system call, we do not restrict permission in envid2env, since message can be exchanged between processes with no relations.

sys_ipc_try_send
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
/* Overview:
* Try to send a 'value' (together with a page if 'srcva' is not 0) to the target
* env 'envid'.
*
* Post-Condition:
* Return 0 on success, and the target env is updated as follows:
* - 'env_ipc_recving' is set to 0 to block future sends.
* - 'env_ipc_from' is set to the sender's envid.
* - 'env_ipc_value' is set to the 'value'.
* - 'env_status' is set to 'ENV_RUNNABLE' again to recover from 'ipc_recv'.
* - if 'srcva' is not NULL, map 'env_ipc_dstva' to the same page mapped at 'srcva' in 'curenv'
* with 'perm'.
*
* Return -E_IPC_NOT_RECV if the target has not been waiting for an IPC message with
* 'sys_ipc_recv'.
* Return the original error when underlying calls fail.
*/
int sys_ipc_try_send(u_int envid, u_int value, u_int srcva, u_int perm)
{
struct Env* e;
struct Page* p;

/* Step 1: Check if 'srcva' is either zero or a legal address. */
if ((srcva != 0) && is_illegal_va(srcva))
return -E_INVAL;

/* Step 2: Convert 'envid' to 'struct Env *e'. */
/* This is the only syscall where the 'envid2env' should be used with 'checkperm' UNSET,
* because the target env is not restricted to 'curenv''s children. */
try(envid2env(envid, &e, 0));

/* Step 3: Check if the target is waiting for a message. */
if (!e->env_ipc_recving)
return -E_IPC_NOT_RECV;

/* Step 4: Set the target's ipc fields. */
e->env_ipc_value = value;
e->env_ipc_from = curenv->env_id;
e->env_ipc_perm = PTE_V | perm;
e->env_ipc_recving = 0;

/* Step 5: Set the target's status to 'ENV_RUNNABLE' again and insert it to
* the tail of 'env_sched_list'. */
e->env_status = ENV_RUNNABLE;
TAILQ_INSERT_TAIL(&env_sched_list, e, env_sched_link);

/* Step 6: If 'srcva' is not zero, map the page at 'srcva' in 'curenv' to
* 'e->env_ipc_dstva' in 'e'. */
/* Return -E_INVAL if 'srcva' is not zero and not mapped in 'curenv'. */
if (srcva != 0)
{
p = page_lookup(curenv->env_pgdir, srcva, NULL);
if (!p)
return -E_INVAL;
/*
* Though we didn't check if receiver accepts page or not, we can still
* check it indirectly by `page_insert` since this functino will return
* error in that case.
*/
try(page_insert(e->env_pgdir, e->env_asid, p, e->env_ipc_dstva, perm));
}

return 0;
}

You may wonder, what if receiver’s Env::env_ipc_dstva is 0, and it is a valid virtual address? Haha, don’t worry, you can check include/mmu.h, address from 0x0 to UTEMP, which is larger than one page, is invalid memory. :P


Fork

As we know before, fork is a useful function to duplicate process. It is fascinating, as we’ve been curious about its realization long before when we first encountered it here Process. You could use this as application of fork first.

In this part, I’d like to demonstrate stuffs from top to bottom. :)

Fork in Kernel Space

Well, fork is actually a lib function in user space, and the core of it is a system call named syscall_exofork, which eventually calls sys_exofork, and this system call explains how we get a new process out of nowhere. Now, let’s have a look of it. It’s surprisingly not that long, but each line has its meaning.

First, it’s declaration. Err… really simple.

1
int sys_exofork(void);

First, we really create a PCB (Env) from nowhere.

1
2
3
/* Step 1: Allocate a new env using 'env_alloc'. */
struct Env* e;
try(env_alloc(&e, curenv->env_id));

Then, the most important step, we copy current PCB’s Trapframe to it.

1
2
/* Step 2: Copy the current Trapframe below 'KSTACKTOP' to the new env's 'env_tf'. */
e->env_tf = *((struct Trapframe*)KSTACKTOP - 1);

Important! By copying current Trapframe, we copy the break point of current process to its child, which makes it resume at the same point with its parent! Notice that, Trapframe is set at once when process makes the system call, and the system call dispatcher will increase CP0.epc by 4 bytes before entering the handler, which make it jump over syscall. Then, we enter the real hander and do the fork. So, anything we do in system call does not affect the break point (mainly focus on PC), and child process will not execute system call again!

Here, why we don’t just copy curenv->env_tf to it? This is a little complicated, you can expand it optionally.

which one to copy

Go carefully about the source code, and you’ll find that, for system calls that don’t make schedule, we only use KSTACKTOP - TF_SIZE to KSTACK_TOP to save and restore process break point. Env::env_tf is only used when schedule happens, to store process status and use it to restore that memory block next time it runs. So Env::env_tf is usually out of date.

As a further explanation, system call use ret_from_exception to restore registers with KSTACK_TOP - TF_SIZE to KSTACK_TOP, which also resets PC so it will go directly to the user process. This is what most system calls do at last. However, for system calls that involves schedule, be fore ret_from_exception, it will use env_pop_tf to restore KSTACK_TOP - TF_SIZE with Env::env_tf first, and then call ret_from_exception to jump to the new process.

To distinguish between parent and child, we set child process’s return value to 0.

1
2
/* Step 3: Set the new env's 'env_tf.regs[2]' to 0 to indicate the return value in child. */
e->env_tf.regs[2] = 0;

Notice that, parent process hasn’t got its return value yet.

At last, since child process is not ready to run (no memory allocated), we temporarily set its status to ENV_NOT_RUNNABLE, and priority the same as its parent.

1
2
3
/* Step 4: Set up the new env's 'env_status' and 'env_pri'.  */
e->env_status = ENV_NOT_RUNNABLE;
e->env_pri = curenv->env_pri;

As all these things are executed in parent process, though in kernel, the return value will also be returned to parent process directly. So here we just return child process’ Env::env_id.

1
return e->env_id;

So what’s the difference between the return value of parent and child? Well, parent process is a complete process then, yet child process only has a PCB. So child could not receive return value from function, so we have to ‘make’ its return value from nowhere.

A complete view of this function.

sys_exofork
1
2
3
4
5
6
7
8
9
10
11
12
13
int sys_exofork(void)
{
struct Env* e;
try(env_alloc(&e, curenv->env_id));

e->env_tf = *((struct Trapframe*)KSTACKTOP - 1);
e->env_tf.regs[2] = 0;

e->env_status = ENV_NOT_RUNNABLE;
e->env_pri = curenv->env_pri;

return e->env_id;
}

It doesn’t do much. Yeah, I though it does everything. But it does not, other stuffs are done in user space.

Fork in User Space

You can find user space fork in user/include/lib.h and user/lib/fork.c. But hold your horse, please. There are some basic concepts to understand before we meet fork.

Copy on Write

To reduce physical memory use, when we fork a child, we first make child share parent’s physical pages. If both parent and child do not write them, then we can save much memory. However, if one of them write one of the pages, it will invoke a TLB mod exception, which will then duplicate the page.

To mark such pages, we use PTE_COW. PTE_COW and PTE_D (write) is mutual exclusive. That is to say, such page can’t be written before copy.

However, though it is a kernel exception, the actual work is done by a user space function, which is stored in Env::env_user_tlb_mod_entry, and this entry is set to a user space handler cow_entry. This is a little tricky, and I’ll elaborate on it later, real soon.

What this entry does is just make a copy of the page, and jump back to the instruction that cause this exception.

cow_entry
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
/* Overview:
* Map the faulting page to a private writable copy.
*
* Pre-Condition:
* 'va' is the address which led to the TLB Mod exception.
*
* Post-Condition:
* - Launch a 'user_panic' if 'va' is not a copy-on-write page.
* - Otherwise, this handler should map a private writable copy of
* the faulting page at the same address.
*/
static void __attribute__((noreturn)) cow_entry(struct Trapframe* tf)
{
u_int va = tf->cp0_badvaddr;
u_int perm;

/* Step 1: Find the 'perm' in which the faulting address 'va' is mapped. */
/* Hint: Use 'vpt' and 'VPN' to find the page table entry. If the 'perm'
* doesn't have 'PTE_COW', launch a 'user_panic'. */
perm = PTE_PERM(vpt[VPN(va)]);
panic_on(!(perm & PTE_COW));

/* Step 2: Remove 'PTE_COW' from the 'perm', and add 'PTE_D' to it. */
perm = PTE_CLR(perm, PTE_COW);
perm = PTE_SET(perm, PTE_D);

/* Step 3: Allocate a new page at 'UCOW'. */
syscall_mem_alloc(0, (void*)UCOW, perm);

/* Step 4: Copy the content of the faulting page at 'va' to 'UCOW'. */
/* Hint: 'va' may not be aligned to a page! */
memcpy((void*)UCOW, (void*)ROUNDDOWN(va, BY2PG), BY2PG);

// Step 5: Map the page at 'UCOW' to 'va' with the new 'perm'.
syscall_mem_map(0, (void*)UCOW, 0, (void*)ROUNDDOWN(va, BY2PG), perm);

// Step 6: Unmap the page at 'UCOW'.
syscall_mem_unmap(0, (void*)UCOW);

// Step 7: Return to the faulting routine.
int r = syscall_set_trapframe(0, tf);
user_panic("syscall_set_trapframe returned %d", r);
}

If you are confused about the jump between these functions, you can expand the following block.

tlb_mod_exception

First, we have to review do_tlb_mod. When a TLB mod exception happens, we’ll first jump here.

do_tlb_mod
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
void do_tlb_mod(struct Trapframe* tf)
{
struct Trapframe tmp_tf = *tf;

/*
* Set $sp to user exception stack top. Since the whole trapframe will
* be restored to what it was on exception, we can change it here.
*/
if (tf->regs[29] < USTACKTOP || tf->regs[29] >= UXSTACKTOP)
tf->regs[29] = UXSTACKTOP;

/*
* Save current Trapframe to user exception stack. Since we have altered
* base address of the stack, the user process won't feel the difference.
*/
tf->regs[29] -= sizeof(struct Trapframe);
*(struct Trapframe*)tf->regs[29] = tmp_tf;

// User must have such handler.
if (curenv->env_user_tlb_mod_entry)
{
/*
* Set parameter ($a0) of the following function call, this is the
* parameter in env_user_tlb_mod_entry function, which is set to
* cow_entry!
*/
tf->regs[4] = tf->regs[29]; // $a0 = &temp_tf !
tf->regs[29] -= sizeof(tf->regs[4]); // push $a0 to stack as param.

/*
* Make uesr process execute its TLB mod handler next, instead of
* the normal order.
*/
tf->cp0_epc = curenv->env_user_tlb_mod_entry;
}
else
{
panic("TLB Mod but no user handler registered");
}

/*
* So, you can see that, this kernel exception handler does nothing,
* but changes user process' execution order only.
*/
}

After this, user process will continue to execute cow_entry after getting out of kernel. Notice that, user process still uses exception stack in cow_entry in order not to crash previous stack.

You may wonder, how do we get back to the right track? The answer is the parameter we passed to env_user_tlb_mod_entry. We stored the previous trapframe in exception stack to preserve it, and then from cow_entry, we can see that it uses this trapframe to jump back to where the exception happens! Brilliant!

This is just like setjmp() and longjmp in C. :)

Duplicate Pages

To achieve COW, we first need to make child process share those pages, and this is done by duplicating the link to them. And this is done by duppage. This function is relatively simple.

duppage
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
static void duppage(u_int envid, u_int vpn)
{
u_int addr;
u_int perm;

/* Step 1: Get the permission of the page. */
/* Hint: Use 'vpt' to find the page table entry. */
addr = vpn << PGSHIFT;
perm = PTE_PERM(vpt[vpn]);

/*
* Step 2: If the page is writable, and not shared with children, and not
* marked as COW yet, then map it as copy-on-write, both in the parent (0)
* and the child (envid).
*/
int remap = 0;
if ((perm & PTE_D) && !(perm & PTE_LIBRARY))
{
perm = PTE_CLR(perm, PTE_D);
perm = PTE_SET(perm, PTE_COW);
remap = 1;
}

syscall_mem_map(0, addr, envid, addr, perm);

if (remap)
syscall_mem_map(0, addr, 0, addr, perm);
}

Don’t be afraid of accessing virtual memory, they will be intercepted by MMU at last and translated to physical memory.

One problem, why map child before remap to parent?

Because, once we remap to parent, the pages will become read only, but we still need to write pages because of function calling stack!

Fork

Finally, we could have a glance at fork in user space. Now, it should be easy to understand. Just about to exit fork, we should set child process’ TLB mod entry. And as we have prepared everything for it, we can set it status to RUNNABLE.

Notice that, for child process, we have to set its env manually, since env is set at the start of a process to the Env it corresponds to, but child process doesn’t start normally, which makes that the same as its parent.

fork
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
int fork(void)
{
extern volatile struct Env* env;

u_int child;

/* Step 1: Set our TLB Mod user exception entry to 'cow_entry' if not done yet. */
if (env->env_user_tlb_mod_entry != (u_int)cow_entry)
try(syscall_set_tlb_mod_entry(0, cow_entry));

/* Step 2: Create a child env that's not ready to be scheduled. */
// Hint: 'env' should always point to the current env itself, so we should fix
// it to the correct value.
child = syscall_exofork();
if (child == 0)
{
env = envs + ENVX(syscall_getenvid());
return 0;
}

/* Step 3: Map all mapped pages below 'USTACKTOP' into the child's address space. */
// Hint: You should use 'duppage'.
for (u_int i = 0; i < USTACKTOP; i += BY2PG)
{
u_int _vpn = VPN(i);
u_int _vpd = VPD(i);
if ((vpd[_vpd] & PTE_V) && (vpt[_vpn] & PTE_V))
duppage(child, _vpn);
}

/* Step 4: Set up the child's tlb mod handler and set child's 'env_status' to
* 'ENV_RUNNABLE'. */
/* Hint:
* You may use 'syscall_set_tlb_mod_entry' and 'syscall_set_env_status'
* Child's TLB Mod user exception entry should handle COW, so set it to 'cow_entry'
*/
try(syscall_set_tlb_mod_entry(child, cow_entry));
try(syscall_set_env_status(child, ENV_RUNNABLE));

return child;
}

Epilogue

Well, I guess, this is it… Too many words. :(