BUAA 2023 Spring OS


Prologue

Well, we finally came to the last lab. To be honest, this one is quite easy.


1. Pipe

The first thing to deal with is pipe. You know, in Linux, we can use | operator to redirect one process’s output to another process’s input, which is quite important for shell. So… Here, let’s do it.

1.1 Pipe Structure

Well, pipe is not a actual file, it is, a circular queue. Write end keeps writing things to the back, while read end keeps reading things from the front. Both will yield when the pipe is empty or full. So here is how it looks like.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
struct Pipe
{
u_int p_rpos; // read position
u_int p_wpos; // write position
u_char p_buf[BY2PIPE]; // data buffer
};

static int _pipe_is_empty(struct Pipe* p)
{
return p->p_rpos >= p->p_wpos;
}

static int _pipe_is_full(struct Pipe* p)
{
return p->p_wpos - p->p_rpos >= BY2PIPE;
}

1.2 Create Pipe

So, how to create a pipe? First, let’s make it clear that, the pipe we create here, is anonymous. When I say anonymous, I mean, anonymous. :P

The difference of pipe and regular file is that, the data part of pipe is not raw data, it is what I just mentioned above, the so called struct Pipe. So this is it, a small buffer, which only takes up one page. (Actually not one page, but we don’t divide a page into several parts.)

Then, considering the nature of sharing, the data part of pipe file is shared by two file descriptors. So we just map the same page of physical memory to both read end and write end. A little confusing, huh? Here are some figure to further demonstrate this.

First, we use pipe to create a pipe in parent process. We can see that two file descriptors are mapped to two different pages, but both of theirs data are mapped to the same, pipe page.

Green is for read and red is for write. If you feel not familiar with this layout, check my previous post.

create_01

Well, pipe always comes with fork. So let’s see how it changes after fork. So… like a mirror, so symmetrical, isn’t it? Now we can see that all physical pages’ references doubled.

Notice: Here, the file descriptor and file data memory are marked PTE_LIBRARY, so they won’t be affected by COW mechanism.

create_02

Then, assume that parent process writes and child process reads. And below is the final (almost final) form of pipe. The grey hatch means the file descriptor was closed.

Important! So are we getting it? Are we getting it? The sum of pageref of read end file descriptor and write end one, is exactly the pageref of the pipe page!

create_03

If you care about source code, then here it is.

pipe

Here, it uses many goto for error handling. Well, not bad.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
int pipe(int pfd[2])
{
int r;
void* va;
struct Fd* fd0, * fd1;

/* Step 1: Allocate the file descriptors. */
if ((r = fd_alloc(&fd0)) < 0 || (r = syscall_mem_alloc(0, fd0, PTE_D | PTE_LIBRARY)) < 0)
{
goto err;
}

if ((r = fd_alloc(&fd1)) < 0 || (r = syscall_mem_alloc(0, fd1, PTE_D | PTE_LIBRARY)) < 0)
{
goto err1;
}

/* Step 2: Allocate and map the page for the 'Pipe' structure. */
va = fd2data(fd0);
if ((r = syscall_mem_alloc(0, (void*)va, PTE_D | PTE_LIBRARY)) < 0)
{
goto err2;
}
if ((r = syscall_mem_map(0, (void*)va, 0, (void*)fd2data(fd1), PTE_D | PTE_LIBRARY)) < 0)
{
goto err3;
}

/* Step 3: Set up 'Fd' structures. */
fd0->fd_dev_id = devpipe.dev_id;
fd0->fd_omode = O_RDONLY;

fd1->fd_dev_id = devpipe.dev_id;
fd1->fd_omode = O_WRONLY;

#ifdef MOS_VERBOSE
debugf("[%08x] pipecreate \n", env->env_id, vpt[VPN(va)]);
#endif

/* Step 4: Save the result. */
pfd[0] = fd2num(fd0);
pfd[1] = fd2num(fd1);
return 0;

err3:
syscall_mem_unmap(0, (void*)va);
err2:
syscall_mem_unmap(0, fd1);
err1:
syscall_mem_unmap(0, fd0);
err:
return r;
}

1.3 Read & Write Pipe

Before we start reading and writing, there’s one problem remaining - how to determine our writing has a receiver, or we are not waiting for nobody?

So we have to figure out a way to determine whether the other end of pipe is closed or not. Any idea? Correct, the equation we mentioned just now.
$$
pageref(read \space fd) + pageref(write \space fd) \equiv pageref(pipe)
$$
When we are at write end, if $ pageref(write fd) = pageref(pipe) $, it means the read end is closed. It looks just like this, and one physical page is freed, too. The same goes with write end.

create_04

So we can write a test function as below.

You know, pipe only works between two process, parent and child. So there shouldn’t be a third party. And yet another thing to notice is that, we must make sure this check is an atomic operation, because it is in user space, and schedule will happen anytime.

1
2
3
4
5
6
7
8
9
10
11
12
static int _pipe_is_closed(struct Fd* fd, struct Pipe* p)
{
int fd_ref, pipe_ref, runs;
do
{
runs = env->env_runs;
fd_ref = pageref(fd);
pipe_ref = pageref(p);
} while (runs != env->env_runs);

return fd_ref == pipe_ref;
}

Reminder, this env is a global volatile variable in user space.

With these in mind, it will be relatively easier to understand pipe read and write.

pipe read/write

Notice that, even if one end is closed, the other end can still access pipe data. So it is OK to use _pipe_is_empty or _pipe_is_full, though the pipe is sure to be empty after the last read.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
static int pipe_read(struct Fd* fd, void* vbuf, u_int n, u_int offset)
{
struct Pipe* p = (struct Pipe*)fd2data(fd);
int nbytes = 0;
for (; ; )
{
while (!_pipe_is_empty(p) && nbytes < n)
{
*(u_char*)(vbuf + nbytes++) = p->p_buf[p->p_rpos++ % BY2PIPE];
}
if ((nbytes > 0) || _pipe_is_closed(fd, p))
return nbytes;
syscall_yield();
}
}

static int pipe_write(struct Fd* fd, const void* vbuf, u_int n, u_int offset)
{
struct Pipe* p = (struct Pipe*)fd2data(fd);
int nbytes = 0;
for (; ; )
{
while (!_pipe_is_full(p) && nbytes < n)
{
p->p_buf[p->p_wpos++ % BY2PIPE] = *(u_char*)(vbuf + nbytes++);
}
if (nbytes == n || _pipe_is_closed(fd, p))
return nbytes;
syscall_yield();
}
}

2. Spawn

Remember, our ultimate goal is to make a Shell. So it is essential to load executables from disk, and spawn is what makes it possible.

This part is… a little complicated. But basic principle is simple: load each segment of ELF file into memory. What to notice is that we have to initialize the calling stack for target main function, and map those pages with PTE_LIBRARY flags.

Interestingly, memory segment UTEMP is used here. :)

spawn
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
int spawn(char* prog, char** argv)
{
// Step 1: Open the file 'prog' (the path of the program).
int fd;
if ((fd = open(prog, O_RDONLY)) < 0)
return fd;

// Step 2: Read the ELF header (of type 'Elf32_Ehdr') from the file into
// 'elfbuf' using 'readn()'.
// If that fails (where 'readn' returns a different size than expected),
// set 'r' and 'goto err' to close the file and return the error.
int r;
u_char elfbuf[512];
if (readn(fd, elfbuf, sizeof(Elf32_Ehdr)) != sizeof(Elf32_Ehdr))
{
r = -E_NOT_EXEC;
goto err;
}

const Elf32_Ehdr* ehdr = elf_from(elfbuf, sizeof(Elf32_Ehdr));
if (!ehdr)
{
r = -E_NOT_EXEC;
goto err;
}
u_long entrypoint = ehdr->e_entry;

// Step 3: Create a child using 'syscall_exofork()' and store its envid in
// 'child'. If the syscall fails, set 'r' and 'goto err'.
u_int child;
r = syscall_exofork();
if (r < 0)
goto err;
child = r;

// Step 4: Use 'init_stack(child, argv, &sp)' to initialize the stack of the
// child. 'goto err1' if that fails.
u_int sp;
if ((r = init_stack(child, argv, &sp)) != 0)
goto err1;

// Step 5: Load the ELF segments in the file into the child's memory.
// This is similar to 'load_icode()' in the kernel.
size_t ph_off;
ELF_FOREACH_PHDR_OFF(ph_off, ehdr)
{
// Read the program header in the file with offset 'ph_off' and length
// 'ehdr->e_phentsize' into 'elfbuf'.
// 'goto err1' on failure.
seek(fd, ph_off);
if (readn(fd, elfbuf, ehdr->e_phentsize) != ehdr->e_phentsize)
{
r = -E_NOT_EXEC;
goto err1;
}

Elf32_Phdr* ph = (Elf32_Phdr*)elfbuf;
if (ph->p_type == PT_LOAD)
{
void* bin;
// Read and map the ELF data in the file at 'ph->p_offset' into our
// memory using 'read_map()'. 'goto err1' if that fails.
if ((r = read_map(fd, ph->p_offset, &bin)) != 0)
goto err1;

// Load the segment 'ph' into the child's memory using 'elf_load_seg()'.
// Use 'spawn_mapper' as the callback, and '&child' as its data.
// 'goto err1' if that fails.
if ((r = elf_load_seg(ph, bin, spawn_mapper, &child)) != 0)
goto err1;
}
}
close(fd);

struct Trapframe tf = envs[ENVX(child)].env_tf;
tf.cp0_epc = entrypoint;
tf.regs[29] = sp;
if ((r = syscall_set_trapframe(child, &tf)) != 0)
goto err2;

// Pages with 'PTE_LIBRARY' set are shared between the parent and the child.
for (u_int pdeno = 0; pdeno <= PDX(USTACKTOP); pdeno++)
{
if (!(vpd[pdeno] & PTE_V))
continue;
for (u_int pteno = 0; pteno <= PTX(~0); pteno++)
{
u_int pn = (pdeno << 10) + pteno;
u_int perm = vpt[pn] & ((1 << PGSHIFT) - 1);
if ((perm & PTE_V) && (perm & PTE_LIBRARY))
{
void* va = (void*)(pn << PGSHIFT);
if ((r = syscall_mem_map(0, va, child, va, perm)) < 0)
{
debugf("spawn: syscall_mem_map %x %x: %d\n", va, child, r);
goto err2;
}
}
}
}

if ((r = syscall_set_env_status(child, ENV_RUNNABLE)) < 0)
{
debugf("spawn: syscall_set_env_status %x: %d\n", child, r);
goto err2;
}

return child;

err2:
syscall_env_destroy(child);
return r;
err1:
syscall_env_destroy(child);
err:
close(fd);
return r;
}

One important procedure is to initialize stack.

init_stack
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
int init_stack(u_int child, char** argv, u_int* init_sp)
{
int argc, i, r, tot;
char* strings;
u_int* args;

// Count the number of arguments (argc)
// and the total amount of space needed for strings (tot)
tot = 0;
for (argc = 0; argv[argc]; argc++)
tot += strlen(argv[argc]) + 1;

// Make sure everything will fit in the initial stack page
if (ROUND(tot, 4) + 4 * (argc + 3) > BY2PG)
return -E_NO_MEM;

// Determine where to place the strings and the args array
strings = (char*)(UTEMP + BY2PG) - tot;
args = (u_int*)(UTEMP + BY2PG - ROUND(tot, 4) - 4 * (argc + 1));
if ((r = syscall_mem_alloc(0, (void*)UTEMP, PTE_D)) < 0)
return r;

// Copy the argument strings into the stack page at 'strings'
char* ctemp, * argv_temp;
u_int j;
ctemp = strings;
for (i = 0; i < argc; i++)
{
argv_temp = argv[i];
for (j = 0; j < strlen(argv[i]); j++)
{
*ctemp = *argv_temp;
ctemp++;
argv_temp++;
}
*ctemp = 0;
ctemp++;
}

// Initialize args[0 ... argc - 1] to be pointers to these strings
// that will be valid addresses for the child environment
// (for whom this page will be at USTACKTOP-BY2PG!).
ctemp = (char*)(USTACKTOP - UTEMP - BY2PG + (u_int)strings);
for (i = 0; i < argc; i++)
{
args[i] = (u_int)ctemp;
ctemp += strlen(argv[i]) + 1;
}

// Set args[argc] to 0 to null-terminate the args array.
ctemp--;
args[argc] = (u_int)ctemp;

// Push two more words onto the child's stack below 'args',
// containing the argc and argv parameters to be passed
// to the child's main() function.
u_int* pargv_ptr;
pargv_ptr = args - 1;
*pargv_ptr = USTACKTOP - UTEMP - BY2PG + (u_int)args;
pargv_ptr--;
*pargv_ptr = argc;

// Set *init_sp to the initial stack pointer for the child
*init_sp = USTACKTOP - UTEMP - BY2PG + (u_int)pargv_ptr;

if ((r = syscall_mem_map(0, (void*)UTEMP, child, (void*)(USTACKTOP - BY2PG), PTE_D)) < 0)
goto error;
if ((r = syscall_mem_unmap(0, (void*)UTEMP)) < 0)
goto error;

return 0;

error:
syscall_mem_unmap(0, (void*)UTEMP);
return r;
}

Also, we need a special spawn_mapper here.

spawn_mapper
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
static int spawn_mapper(
void* data,
u_long va,
size_t offset,
u_int perm,
const void* src,
size_t len)
{
u_int child_id = *(u_int*)data;
try(syscall_mem_alloc(child_id, (void*)va, perm));
if (src != NULL)
{
int r = syscall_mem_map(child_id, (void*)va, 0, (void*)UTEMP, perm | PTE_D);
if (r)
{
syscall_mem_unmap(child_id, (void*)va);
return r;
}
memcpy((void*)(UTEMP + offset), src, len);
return syscall_mem_unmap(0, (void*)UTEMP);
}
return 0;
}

So this is spawn, sorry for few details. :(


3. Shell

Well, shell is actually a external program, that can access certain system APIs. The original MOS Shell is really shabby, so I don’t want to talk much about it.

But there is one thing, that inspired me. That is, we can parse pipe operator | by a recursive parsing.

I chose Lab 6 Challenge, so refer to that post for more detailed information on Shell. :)


Epilogue

Well, this is the last Lab of OS. (Or the one but last?) So much to say, but only at the tip of tongue. I had to admit, a nice experience, it is, to make a clumsy operating system from… not totally scratch. But… It is really torturing sometimes, since there are so many obscure bugs. 🤕

So, I guess, this is it. At least it contribute some good articles to my blog. 😳