A Survey of the Mach Operating System

Murphy Chen

Department of Electrical Engineering
National Taiwan University

Abstract

This report describes the history, goals, microkernel structure, process management, interprocess communication, and memory management of Mach. It also introduces the six basic abstractions of Mach: tasks, threads, ports, port sets, messages, and memory objects.


1 Introduction

The Mach operating system is designed to incorporate the many recent innovations in operating system research to produce a fully functional, technically advanced system. Mach incorporates multiprocessing support throughout. Its multiprocessing support is also exceedingly flexible, ranging from shared memory systems to systems with no memory shared between processors.

Mach is designed to run on computer systems ranging from one to thousands of processors. In addition, Mach is easily ported to many varied computer architectures because of its microkernel structure. A key goal of Mach is to be a distributed system capable of functioning on heterogeneous hardware.

As we shall see, Mach provides the ability to layer emulation of other operating systems as well, and they can even run concurrently.


2 History

Mach's earliest roots go back to a system called RIG (Rochester Intelligent Gateway), which began at the University of Rochester in 1975. The main research goal of RIG was to demonstrate that operating systems could be structured in a modular way, as a collection of processes that communicated by message passing, including over a network.

One of its designers, Richard Rashid, left the University of Rochester for Carnegie-Mellon University in 1979, where he wanted to continue developing message-passing operating systems. The new operating system was called Accent. It improved on RIG by adding protection, the ability to operate transparently over the network, 32-bit virtual memory, and other features.

By 1984, Rashid observed that although Accent pioneered a number of novel operating system concepts, its utility was limited by its inability to execute UNIX applications and its strong ties to a single hardware architecture that made it difficult to port. So, he decided to begin a third-generation operating-system project called Mach.

By making Mach compatible with UNIX, Rashid hoped to be able to use the large volume of UNIX software becoming available. In addition, Mach had many other improvements over Accent, including threads, a better interprocess communication mechanism, multiprocessor support, and a highly imaginative virtual memory system.

Around this time, DARPA, the U.S. Department of Defense's Advanced Research Projects Agency, was hunting around for an operating system that supported multiprocessors. CMU was selected, and with DARPA funding, Mach was developed further. It was decided to make the system compatible with 4.2BSD by combining Mach and 4.2BSD into a single kernel. Although this approach led to a large kernel, it did guarantee absolute compatibility with 4.2BSD.

By 1986, the virtual memory and communication subsystems were running on the DEC VAX computer family, including multiprocessor versions of the VAX. Versions for the IBM RT/PC and for SUN 3 workstations followed shortly. By 1987, Mach was also running on the Encore Multimax and Sequent Balance multiprocessors. The first official releases of the system, Release 0 and Release 1 were completed, too.

Mach was propelled into the forefront of industry attention when the Open Software Foundation (OSF) announced in 1989 that it would use Mach 2.5 as the basis for its new operating system, OSF/1. The initial release of OSF/1 occurred a year later, and it now competes with UNIX System V Release 4, the operating system of choice among UNIX International (UI) members. OSF members include key technology companies such as IBM, DEC, and HP. Mach 2.5 is also the basis for the operating system on the NeXT workstation, the brainchild of Steve Jobs, of Apple Computer fame.

As of 1988, the Mach 2.5 kernel was large and monolithic due to the presence of a large amount of Berkeley UNIX code in the kernel. In 1989, CMU removed all the Berkeley code from the kernel and put it in user space, leaving a much smaller microkernel consisting of pure Mach. This version, Mach 3.0, implements only basic Mach features in the kernel; all UNIX-specific code has been evicted to run in user-mode servers. Excluding UNIX-specific code from the kernel allows replacement of BSD with another operating system, or the simultaneous execution of multiple operating-system interfaces on top of the microkernel. In addition to BSD, user-mode implementations have been developed for DOS, the Macintosh operating system, and OSF/1. As of Release 3.0, Mach became available on a wide variety of systems, including single-processor SUN, Intel, IBM, and DEC machines, and multiprocessor DEC, Sequent, and Encore systems. OSF is evaluating Mach 3.0 as the basis for a future operating-system release, and research on Mach continues at CMU and OSF, and elsewhere.

3 Goals of Mach

Mach has evolved considerably since its first incarnation as RIG. The goals of the project have also changed as time has gone on. The current primary goals can be summarized as follows:

1. Providing a base for building other operating systems (e.g., UNIX).

2. Supporting large sparse address spaces.

3. Allowing transparent access to network resources.

4. Exploiting parallelism in both the system and the applications.

5. Making Mach portable to a larger collection of machines.

In detail, Mach also offers the following facilities:

Support for tightly coupled and loosely coupled general-purpose multiprocessors.

Support for large, sparse virtual address spaces, 'copy-on-write' virtual copy operations and memory-mapped files.

Provision for user-supplied memory objects and user-level page managers for them.

Multiple threads of control within a single address space.

A message-based IPC facility which is integrated with memory management and allows large amounts of data to be transferred using the 'copy-on-write' technique.

Transparent network IPC. This is achieved by a user-level network server, and also by port names which are independent of tasks and network addresses.

Protection of interprocess communication which extends across network boundaries. This is achieved through protected port names and send and receive rights for ports. Encryption is available for network communication.

An internal, symbolic kernel debugger, a new facility.


4 The Mach Microkernel

The Mach microkernel has been built as a base upon which UNIX and other operating systems can be emulated. This emulation is done by a software layer that runs outside the kernel, in user space, as shown in Fig. 1. It should be noted that multiple emulators can be running simultaneously, so it is possible to run 4.3BSD, System V, and MS-DOS programs on the same machine at the same time.

The Mach kernel, like other microkernels, provides process management, memory management, interprocess communication, and I/O services. Files, directories, and other traditional operating system functions are handled in user space. The idea behind the Mach kernel is to provide the necessary mechanisms for making the system work, but leaving the policy to user-level processes.


Figure 1. Mach 3.0 structure.

The Mach kernel supports six basic abstractions:

task An execution environment and the basic unit of resource allocation. This includes a paged virtual address space and protected access to system resources, such as processors, capabilities for using ports, and virtual memory.

thread The basic unit of execution. A thread executes in the address space of a single task. A thread shares the resources allocated to the task with other threads in the task. There is no notion of a "process" in Mach. Rather, a traditional process would be implemented as a task with a single thread of control.

port The basic object-reference mechanism in Mach, implemented as a kernel-protected communication channel. Communication is accomplished by sending messages to ports; messages are queued at the destination port if no thread is immediately ready to receive them. Ports are protected by kernel-managed capabilities, or port rights; a task must have a port right to send a message to a port. The programmer invokes an operation on an object by sending a message to a port associated with that object. The object being represented by a port receives the messages.

port set A group of ports sharing a common message queue. A thread can receive messages for a port set, and thus service multiple ports. Each received message identifies the individual port (within the set) that it was received from; the receiver can use this to identify the object referred to by the message.

message A typed collection of data objects used in communication between threads. Data may be contained by value or by reference. Port rights are passed in messages; doing so is the only way to move them among tasks.

memory object An object usually residing in secondary storage that is mapped into the address space of a task. The object may be managed by a user-mode external memory manager. One example is a file managed by a file server; however, a memory object can be any object for which memory-mapped access makes sense. A mapped buffer implementation of a UNIX pipe is one example.

Figure 2 illustrates these abstractions, which we shall elaborate in the remainder of this report.


Figure 2. Mach's basic abstractions.



5 Process Management

Process management in Mach deals with tasks, threads, and scheduling. We will look at each of them in turn.

5.1 Tasks

A task contains a virtual address space, a set of port rights, and accounting information. A task is a passive entity that does nothing unless it has one or more threads executing in it.

Figure 3 illustrates a Mach task.


Figure 3. A Mach task.

5.1.1 Ports

In addition to an address space and threads, a task has a number of ports and other properties. The ports shown in the figure all have special functions. The task port is used to communicate with the kernel. Many of the kernel services that a task can request are done by sending a message to the task port, rather than making a system call. This mechanism is used throughout Mach to reduce the actual system calls to a bare minimum.

In general, the programmer is not even aware of whether a service requires a system call or not. All services, including both those accessed by system calls and those accessed by message passing, have stub procedures in the library. It is these procedures that are described in the manuals and called by application programs. The procedures are generated from a service definition by the MIG (Mach Interface Generator) compiler.
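The flavor of such a generated stub can be sketched as follows. All names here (toy_message, toy_send, task_set_priority) and the message layout are hypothetical illustrations, not actual MIG output; the point is only that the client-side procedure marshals its arguments into a typed message and sends it to the task port, so no explicit system call appears in the application.

```c
#include <stdint.h>

/* Hypothetical sketch of what a MIG-generated stub does in spirit. */

typedef struct {
    int      msg_id;        /* identifies the requested operation */
    uint32_t task_port;     /* destination port name */
    int      inline_data;   /* a by-value argument */
} toy_message;

static toy_message last_sent; /* stands in for the kernel's message queue */

static int toy_send(const toy_message *m) {
    last_sent = *m;           /* "deliver" the message */
    return 0;
}

/* The stub the application actually calls; it looks like an ordinary
   library procedure but only packs arguments and sends a message. */
static int task_set_priority(uint32_t task_port, int priority) {
    toy_message m;
    m.msg_id = 27;            /* arbitrary operation code */
    m.task_port = task_port;
    m.inline_data = priority;
    return toy_send(&m);
}
```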

The bootstrap port is used for initialization when a task starts up. The very first task reads the bootstrap port to learn the names of kernel ports that provide essential services. UNIX processes also use it to communicate with the UNIX emulator.

The exception port is used by the system to report errors to the task. Typical exceptions are division by zero and illegal instruction executed. Debuggers also use the exception port.

The registered ports are normally used to provide a way for the process to communicate with standard system servers. For example, the name server makes it possible to present a string and get back the corresponding port for certain basic servers.

5.1.2 Task States

Tasks also have other properties. A task can be runnable or blocked, independent of the state of its threads. If a task is runnable, then those threads that are also runnable can be scheduled and run. If a task is blocked, its threads may not run, no matter what state they are in.

5.1.3 Scheduling Parameters

The per-task items also include scheduling parameters. These include the ability to specify which processors the task's threads can run on. This feature is most useful on a multiprocessor system. For example, the task can use this power to force each thread to run on a different processor, or to force them all to run on the same processor, or anything in between.

In addition, each task has a default priority that is settable. When a thread is created, the new thread is given this priority. It is also possible to change the priority of all the existing threads.

5.1.4 Emulation Address

An emulation address can be set to tell the kernel where in the task's address space the emulation package is located. The kernel needs to know this address to handle UNIX system calls that need to be emulated. It is set once when the UNIX emulator is started up, and inherited by all of the emulator's children (i.e., all the UNIX processes).

5.1.5 Statistics

Every task has statistics associated with it, including the amount of memory consumed, the run times of the threads, and so on. A task that is interested in this information can acquire it by sending a message to the task port.

5.1.6 Task Management Primitives

Mach provides a small number of primitives for managing tasks. Most of these are invoked by sending messages to the kernel via the task port, rather than by actual system calls. The most important of these calls are shown in Fig. 4. These, like all calls in Mach, have prefixes indicating the group they belong to, but we have omitted them here for the sake of brevity.

Call       Description
Create     Create a new task, inheriting certain properties
Terminate  Kill a specified task
Suspend    Increment suspend counter
Resume     Decrement suspend counter. If it is 0, unblock the task
Priority   Set the priority for current or future threads
Assign     Tell which processor new threads should run on
Info       Return information about execution time, memory usage, etc.
Threads    Return a list of task's threads

Figure 4. Selected task management calls in Mach.

The first two calls in Fig. 4 are for creating and destroying tasks, respectively. The task creation call specifies a prototype task, not necessarily the caller. The child is a copy of the prototype, except that the call has a parameter that tells whether or not the child is to inherit the parent's address space. If it does not inherit the parent's address space, objects (e.g., text, initialized data, and a stack) can be mapped in later. Initially the child has no threads. It does, however, automatically get a task port, a bootstrap port, and an exception port. Other ports are not automatically inherited since each port may have only one reader.

Tasks can be suspended and resumed under program control. Each task has a counter, incremented by the suspend call and decremented by the resume call, that can block or unblock it. When the counter is 0, the task is able to run. When it is positive, it is suspended. Having a counter is more general than just having a bit, and helps avoid race conditions.
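The suspend-count mechanism described above can be sketched in a few lines of C. This is a minimal model, not kernel code: it shows only why a counter composes better than a bit, since nested suspends from independent controllers each need a matching resume before the task runs again.

```c
/* Minimal model of the Mach suspend count: suspend increments,
   resume decrements, and the task is runnable only at zero. */

typedef struct { int suspend_count; } task_t;

static void task_suspend(task_t *t) { t->suspend_count++; }

static void task_resume(task_t *t) {
    if (t->suspend_count > 0)
        t->suspend_count--;
}

static int task_runnable(const task_t *t) { return t->suspend_count == 0; }
```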

The priority and assign calls give the programmer control over how and where its threads run on multiprocessor systems. CPU scheduling is done using priorities, so the programmer has fine-grain control over which threads are most important and which are least important. The assign call makes it possible to control which thread runs on which CPU or group of CPUs.

The last two calls of Fig. 4 return information about the task. The former gives statistical information and the latter returns a list of all the threads.

5.2 Threads

The active entities in Mach are the threads. They execute instructions and manipulate their registers and address spaces. Each thread belongs to exactly one task. A task cannot do anything unless it has one or more threads.

Mach threads are managed by the kernel, that is, they are what are sometimes called heavyweight threads rather than lightweight threads (pure user space threads). Thread creation and destruction are done by the kernel, and involve updating kernel data structures. They provide the basic mechanisms for handling multiple activities within a single address space.

On a single CPU system, threads are time-shared, first one running, then another. On a multiprocessor, several threads can be active at the same time. This parallelism makes mutual exclusion, synchronization, and scheduling more important than they normally are, because performance now becomes a major issue, along with correctness.

5.2.1 Per-thread Resources

All the threads in a task share the address space and all the task-wide resources shown in Fig 3. Nevertheless, threads also have private per-thread resources. One of these is the thread port, which is analogous to the task port. Each thread has its own thread port, which it uses to invoke thread-specific kernel services, such as exiting when the thread is finished. Since ports are task-wide resources, each thread has access to its siblings' ports, so each thread can control the others if need be.

5.2.2 Thread States

Like a task, a thread can be runnable or blocked. The mechanism is similar too: a counter per thread that can be incremented and decremented. When it is zero, the thread is runnable. When it is positive, the thread must wait until another thread lowers it to zero. This mechanism allows threads to control each other's behavior.

5.2.3 Thread Management Primitives

A variety of primitives is provided. The basic kernel interface provides about two dozen thread primitives, many of them concerned with controlling scheduling in detail. On top of these primitives one can build various thread packages.

The C Threads package provides multiple threads of control, shared variables, mutual exclusion for critical sections, and condition variables for synchronization. In fact, C Threads is one of the major influences on the POSIX P Threads standard, which many operating systems are being modified to support.

The C Threads package provides six calls for direct thread manipulation. They are listed in Fig. 5. The first one, fork, creates a new thread in the same address space as the calling thread. It runs the procedure specified by a parameter rather than the parent's code. After the call, the parent thread continues to run in parallel with the child. The thread is started with a priority and on a processor determined by the task's scheduling parameters.

Call    Description
Fork    Create a new thread running the same code as the parent thread
Exit    Terminate the calling thread
Join    Suspend the caller until a specified thread exits
Detach  Announce that the thread will never be joined (waited for)
Yield   Give up the CPU voluntarily
Self    Return the calling thread's identity to it

Figure 5. The C Threads calls for direct thread management.

When a thread has done its work, it calls exit. If the parent is interested in waiting for the thread to finish, it can call join to block itself until a specified child thread terminates. If the thread has already terminated, the parent continues immediately.

The fourth call, detach, provides a way to announce that a particular thread will never be waited for. If that thread ever exits, its stack and other state information will be deleted immediately. Normally this cleanup only happens after the parent has done a successful join. In a server, it might be desirable to start up a new thread to service each incoming request. When it has finished, the thread exits. Since there is no need for the initial thread to wait for it, the server thread should be detached.

The yield call is a hint to the scheduler that the thread has nothing useful to do at the moment and is waiting for some event to happen before it can continue. An intelligent scheduler will take the hint and run another thread. In Mach, which normally schedules its threads preemptively, yield is only an optimization. In systems that have nonpreemptive scheduling, it is essential that a thread that has no work to do release the CPU, to give other threads a chance to run.

Finally, self returns the caller's identity, analogous to GETPID in UNIX.


5.2.4 Thread Synchronization

Synchronization is done using mutexes and condition variables. The routines associated with mutual exclusion are these:

The routine mutex_alloc dynamically creates a mutex variable.

The routine mutex_free deallocates a dynamically created mutex variable.

The routine mutex_lock locks a mutex variable. The executing thread loops in a spinlock until the lock is attained. A deadlock results if a thread with a lock tries to lock the same mutex variable. Bounded waiting is not guaranteed by the C Threads package. Rather, it is dependent on the hardware instructions used to implement the mutex routines.

The routine mutex_unlock unlocks a mutex variable, much like the typical signal operation of a semaphore.

General synchronization without busy waiting can be achieved through the use of condition variables. A condition variable is associated with a mutex variable, and reflects a Boolean state of that variable. The routines associated with general synchronization are these:

The routine condition_alloc dynamically allocates a condition variable.

The routine condition_free deletes a dynamically created condition variable allocated as a result of condition_alloc.

The routine condition_wait unlocks the associated mutex variable, and blocks the thread until a condition_signal is executed on the condition variable, indicating that the event being waited for may have occurred. The mutex variable is then locked, and the thread continues. A condition_signal does not guarantee that the condition still holds when the unblocked thread finally returns from its condition_wait call, so the awakened thread must loop, executing the condition_wait routine until it is unblocked and the condition holds.
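The re-check loop required around condition_wait can be illustrated with a toy, single-threaded model. Everything here is contrived for the demonstration: cond_wait() below simulates a wakeup in which another thread may already have consumed the awaited item, which is exactly why the waiter must re-test the predicate; real C Threads code would hold a mutex and truly block.

```c
/* Toy model of the condition_wait re-check discipline. The first
   wakeup is "spurious" (the item was stolen before we ran); only
   the while loop around cond_wait makes the consumer correct. */

static int queue_len = 0;   /* shared state, mutex-protected in real code */
static int wakeups = 0;     /* how many times the waiter was awakened */

static void cond_wait(void) {
    /* In C Threads this would unlock the mutex, block, then relock. */
    wakeups++;
    if (wakeups == 1)
        queue_len = 0;      /* first wakeup: condition no longer holds */
    else
        queue_len = 1;      /* second wakeup: item really available */
}

static int consume(void) {
    while (queue_len == 0)  /* the mandatory re-check loop */
        cond_wait();
    queue_len--;
    return wakeups;
}
```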


5.3 Scheduling

Mach scheduling has been heavily influenced by its goal of running on multiprocessors.


5.3.1 Processor Sets

The CPUs in a multiprocessor can be assigned to processor sets by software. Each CPU belongs to exactly one processor set. Threads can also be assigned to processor sets by software.

Thus each processor set has a collection of CPUs at its disposal and a collection of threads that need computing power. The job of the scheduling algorithm is to assign threads to CPUs in a fair and efficient way.

This mechanism gives tasks a large amount of control over their threads. A task can assign an important thread to a processor set with one CPU and no other threads, thus ensuring that the thread runs all the time. It can also dynamically reassign threads to processor sets as the work proceeds, keeping the load balanced.


5.3.2 Scheduling Policy

The CPU scheduler for a thread-based multiprocessor operating system is more complex than its task-based relatives. Mach uses a simple policy to keep the scheduler manageable. Only threads are scheduled, so no knowledge of tasks is needed in the scheduler.

Thread scheduling in Mach is based on priorities. Priorities are integers from 0 to 31, with 0 being the highest priority and 31 being the lowest priority. Each thread has three priorities assigned to it. The first priority is a base priority, which the thread can set itself, within certain limits. The second priority is the lowest numerical value that the thread may set its base priority to. Since using a higher value gives worse service, a thread will normally set its value to the lowest value it is permitted, unless it is intentionally trying to defer to other threads. The third priority is the current priority, used for scheduling purposes. It is computed by the kernel by adding to the base priority a function based on the thread's recent CPU usage.
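The three per-thread priorities can be sketched as follows. The text only says the current priority is the base plus a function of recent CPU usage; the particular penalty shape used here (usage/8, clamped at 31) is an assumption for illustration.

```c
/* Sketch of Mach's three per-thread priorities: 0 is best, 31 worst. */

typedef struct {
    int max_pri;   /* lowest numerical value base_pri may take */
    int base_pri;  /* set by the thread itself, within limits */
    int cpu_usage; /* recent CPU usage, maintained by the kernel */
} thread_pri;

/* The thread may not set its base below its limit or above 31. */
static int set_base_priority(thread_pri *t, int want) {
    if (want < t->max_pri) want = t->max_pri;
    if (want > 31) want = 31;
    t->base_pri = want;
    return want;
}

/* Current priority = base + usage penalty (penalty shape assumed). */
static int current_priority(const thread_pri *t) {
    int cur = t->base_pri + t->cpu_usage / 8;
    return cur > 31 ? 31 : cur;
}
```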


5.3.3 Run Queues

Associated with each processor set is an array of run queues, as shown in Fig. 6. The array has 32 queues, corresponding to threads currently at priorities 0 through 31. When a thread at priority n becomes runnable, it is put at the end of queue n. A thread that is not runnable is not present on any run queue.


Figure 6. The global run queues for a system with two processor sets.

Each run queue has three variables attached to it. The first one is a mutex that is used to lock the data structure. It is used to make sure that only one CPU at a time is manipulating the queues. The second one is the count of the number of threads on all the queues combined. If this count becomes 0, there is no work to do. The third variable is a hint as to where to find the highest priority thread. It is guaranteed that no thread is at a higher priority, but the highest one may be at a lower priority. This hint allows the search for the highest priority thread to avoid the empty queues at the top.

In addition to the global run queues shown in Fig. 6, each CPU has its own local run queue. Each local run queue holds those threads that are permanently bound to that CPU, for example, because they are device drivers for I/O devices attached to that CPU. These threads can only run on one CPU, so putting them on the global run queue is incorrect (because the "wrong" CPU might choose them).


5.3.4 Basic Scheduling Algorithm

We can now describe the basic scheduling algorithm. When a thread blocks, exits, or uses up its quantum, the CPU it is running on first looks on its local run queue to see if there are any active threads. This check merely requires inspecting the count variable associated with the local run queue.

If it is nonzero, the CPU begins searching the queue for the highest priority thread, starting at the queue specified by the hint. If the local run queue is empty, the same algorithm is applied to the global run queue, the only difference being that the global run queue must be locked before it can be searched.

If there are no threads to run on either queue, a special idle thread is run until some thread becomes ready.

If a runnable thread is found, it is scheduled and run for one time quantum. At the end of the time quantum, both the local and global run queues are checked to see if any other threads at its priority or higher are runnable, with the understanding that all threads on the local run queue have higher priority than all threads on the global run queue. If a suitable candidate is found, a thread switch occurs. If not, the thread is run for another time quantum.
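The selection step described above can be sketched as a search over the 32 per-priority queues, consulting the count first and starting from the hint. Queues are modeled here as simple per-priority counters, and the locking of the global queue is elided; this is an illustration of the algorithm's shape, not kernel code.

```c
#define NQUEUES 32

/* A run-queue array with the three attached variables described in
   the text (the mutex is elided from this single-threaded sketch). */
typedef struct {
    int count;          /* threads on all 32 queues combined */
    int hint;           /* no runnable thread is at a better priority */
    int len[NQUEUES];   /* threads queued at each priority level */
} run_queue;

/* Returns the priority of the thread to run, or -1 if empty. */
static int pick(run_queue *q) {
    if (q->count == 0) return -1;        /* cheap emptiness check */
    for (int p = q->hint; p < NQUEUES; p++) {
        if (q->len[p] > 0) {
            q->len[p]--; q->count--;
            q->hint = p;                 /* remember where we found it */
            return p;
        }
    }
    return -1;
}

/* Bound threads on the local queue take precedence over global ones. */
static int schedule(run_queue *local, run_queue *global) {
    int p = pick(local);
    if (p < 0) p = pick(global);
    return p;                            /* -1 means run the idle thread */
}
```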

On multiprocessors, the length of the time quantum is variable, depending on the number of threads that are runnable. The more runnable threads and the fewer CPUs there are, the shorter the time quantum. This algorithm gives good response time to short requests, even on heavily loaded systems, but provides high efficiency (i.e., long time quantum) on lightly loaded systems.
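The inverse relationship between load and quantum length can be captured in one small function. The base quantum of 100 ms and the 10 ms floor are assumptions chosen for the sketch; the text states only that more runnable threads per CPU means a shorter quantum.

```c
/* Sketch of Mach's variable quantum (constants are assumed). */
static int quantum_ms(int runnable_threads, int ncpus) {
    if (runnable_threads <= ncpus)
        return 100;                        /* lightly loaded: long quantum */
    int q = 100 * ncpus / runnable_threads; /* shrinks with load */
    return q < 10 ? 10 : q;                /* never below the floor */
}
```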

On every clock tick, the CPU increments the priority counter of the currently running thread by a small amount. As the value goes up, the priority goes down, and the thread will eventually move to a higher-numbered (i.e., lower-priority) queue. The priority counters are lowered by the passage of time.
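The aging behavior can be sketched with three tiny functions; the per-tick charge of 4, the halving decay, and the usage/8 mapping into queues are all assumptions for illustration, since the text gives only the direction of each effect.

```c
/* Sketch of priority aging: ticks charge the running thread,
   time decays everyone's usage (all constants assumed). */
static int usage_after_tick(int usage)  { return usage + 4; }
static int usage_after_decay(int usage) { return usage / 2; }

/* Which run queue the thread lands on for a given base priority. */
static int queue_for(int base, int usage) {
    int p = base + usage / 8;
    return p > 31 ? 31 : p;
}
```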

6 Interprocess Communication

The goal of communication in Mach is to support a variety of styles of communication in a reliable and flexible way. It can handle asynchronous message passing, RPC, byte streams, and other forms as well. Mach's interprocess communication mechanism is optimized for the local case (one node) rather than the remote case (distributed system).


6.1 Ports

The basis of all communication in Mach is a kernel data structure called a port. A port is essentially a protected mailbox. When a thread in one task wants to communicate with a thread in another task, the sending thread writes the message to the port and the receiving thread takes it out. Each port is protected to ensure that only authorized processes can send to it and receive from it.

Ports support unidirectional communication. A port that can be used to send a request from a client to a server cannot also be used to send the reply back from the server to the client. A second port is needed for the reply.

Ports support reliable, sequenced, message streams. If a thread sends a message to a port, the system guarantees that it will be delivered. Messages sent by a single thread are also guaranteed to be delivered in the order sent.


6.1.1 Port Structure

A port is shown in Fig. 7. When a port is created, 64 bytes of kernel storage space are allocated and maintained until the port is destroyed. The port contains the fields shown in Fig. 7 and a few others.

Port

Message queue
Current message count: 5
Maximum messages: 7
Port set this port belongs to: (none)
Counts of outstanding capabilities
Capabilities to use for error reporting
Queue of threads blocked on this port
Pointer to the task holding the RECEIVE capability
Index of this port in the receiver's capability list
Pointer to the kernel object
Miscellaneous items

Figure 7. A Mach port.

Messages are not actually stored in the port itself, but in another kernel data structure, the message queue. The port contains a count of the number of messages currently present in the message queue and the maximum permitted. If the port belongs to a port set, a pointer to the port set data structure is present in the port. For various reasons, the kernel has to know how many capabilities of each type are outstanding, so the port stores the counts.

If errors occur when using the port, they are reported by sending messages to other ports whose capabilities are stored there. Threads can block when reading from a port, so a pointer to the list of blocked threads is included. It is also important to be able to find the capability for reading from the port (there can only be one), so that information is present too. If the port is a task port, then the next field holds a pointer to the task it belongs to. If it is a thread port, then the field holds a pointer to the kernel's data structure for the thread, and so on. A few miscellaneous fields not described here are also needed.

When a thread creates a port, it gets back an integer identifying the port. This integer is used in subsequent calls that send messages to the port or receive messages from it in order to identify which port is to be used. Ports are kept track of per task, not per thread, so if one thread creates a port and gets back the integer 3 to identify it, another thread in the same task will never get 3 to identify its new port.


6.1.2 Capabilities

For each task, the kernel maintains a table of all ports to which the task has access. This table is kept safely inside the kernel, where user tasks cannot get at it. Tasks refer to ports by their position in this table. We will refer to the table entries as capabilities, and to the table containing the capabilities as a capability list.

Each task has exactly one capability list. When a thread asks the kernel to create a port for it, the kernel does so and enters a capability for it in the capability list for the task to which the thread belongs. The integer returned to the thread to identify the capability is usually an index into the capability list. We will refer to this integer as a capability name.

Each capability consists of not only a pointer to a port, but also a rights field telling what access the holder of the capability has to the port. Three rights exist: RECEIVE, SEND, and SEND-ONCE.

Each capability list entry is one of the following four items:

1. A capability for a port.

2. A capability for a port set.

3. A null entry.

4. A code indicating that the port that was there is now dead.
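A capability list with these four entry kinds can be sketched as follows. The fixed table size, the rights bitmask representation, and the allocation policy (first free slot, index returned as the capability name) are illustrative assumptions consistent with the description above.

```c
/* Sketch of a per-task capability list (sizes and layout assumed). */

enum entry_kind { ENTRY_NULL, ENTRY_PORT, ENTRY_PORT_SET, ENTRY_DEAD };
enum { RIGHT_RECEIVE = 1, RIGHT_SEND = 2, RIGHT_SEND_ONCE = 4 };

typedef struct {
    enum entry_kind kind;
    int rights;     /* bitmask of the three port rights */
    int port_id;    /* which kernel port the entry points at */
} cap_entry;

#define NCAPS 64
typedef struct { cap_entry slot[NCAPS]; } cap_list;

/* Creating a port installs a RECEIVE capability in the first free
   slot; the slot index is the capability name returned to the task. */
static int cap_allocate(cap_list *cl, int port_id) {
    for (int i = 0; i < NCAPS; i++) {
        if (cl->slot[i].kind == ENTRY_NULL) {
            cl->slot[i].kind = ENTRY_PORT;
            cl->slot[i].rights = RIGHT_RECEIVE;
            cl->slot[i].port_id = port_id;
            return i;
        }
    }
    return -1;  /* capability list full */
}
```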


6.1.3 Primitives for Managing Ports

Mach provides about 20 calls for managing ports. All of these are invoked by sending a message to a task port. A sampling of the most important ones is given in Fig. 8.

Call           Description
Allocate       Create a port and insert its capability in the capability list
Destroy        Destroy a port and remove its capability from the list
Deallocate     Remove a capability from the capability list
Extract_right  Extract the n-th capability from another task
Insert_right   Insert a capability in another task's capability list
Move_member    Move a capability into a port set
Set_qlimit     Set the number of messages a port can hold

Figure 8. Selected port management calls in Mach.

The first one, allocate, creates a new port and enters its capability into the caller's capability list. The capability is for reading from the port. A capability name is returned so the port can be used.

The next two undo the work of the first one. Destroy removes a capability. If it is a RECEIVE capability, the port is destroyed and all other capabilities for it in all tasks are marked as dead. Deallocate decrements the reference count associated with a capability. If it is zero, the capability is removed, but the port remains intact. Deallocate can only be used to remove SEND or SEND-ONCE capabilities or dead capabilities.

Extract_right allows a thread to select out a capability from another task's capability list and insert the capability in its own list. Insert_right allows a task to take one of its own capabilities and add it to another task's capability list.

The move_member call is used for managing port sets. It can add a port to a port set or remove one. Finally, set_qlimit determines the number of messages a port can hold. When a port is created, the default is five messages, but with this call that number can be increased or decreased. The messages can be of any size since they are not physically stored in the port itself.
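To make the semantics of these calls concrete, the following is a minimal sketch of a capability list in C. All names and data structures here are invented for illustration; they are not the kernel's actual representation. The index of an entry serves as the capability name, destroying a RECEIVE capability kills the port and marks every other task's capability for it dead, and deallocating a SEND capability merely drops a reference.

```c
#include <assert.h>
#include <stddef.h>

enum cap_kind { CAP_NULL, CAP_PORT, CAP_PORT_SET, CAP_DEAD };
enum cap_rights { RIGHT_RECEIVE = 1, RIGHT_SEND = 2, RIGHT_SEND_ONCE = 4 };

struct capability {
    enum cap_kind kind;
    int port_id;        /* which port the entry refers to        */
    int rights;         /* bitmask of RECEIVE / SEND / SEND_ONCE */
    int refs;           /* reference count for SEND capabilities */
};

#define CAP_LIST_SIZE 16
struct cap_list { struct capability e[CAP_LIST_SIZE]; };

/* allocate: create a port and enter a RECEIVE capability for it;
 * the returned capability name is simply the table index. */
int cap_allocate(struct cap_list *cl, int port_id) {
    for (int i = 0; i < CAP_LIST_SIZE; i++) {
        if (cl->e[i].kind == CAP_NULL) {
            cl->e[i] = (struct capability){ CAP_PORT, port_id,
                                            RIGHT_RECEIVE, 1 };
            return i;
        }
    }
    return -1;   /* capability list full */
}

/* destroy: if the capability holds RECEIVE rights, the port dies and
 * every entry naming that port, in any task, is marked dead. */
void cap_destroy(struct cap_list *tasks, int ntasks, int t, int name) {
    struct capability *c = &tasks[t].e[name];
    if (c->rights & RIGHT_RECEIVE) {
        int dead_port = c->port_id;
        for (int i = 0; i < ntasks; i++)
            for (int j = 0; j < CAP_LIST_SIZE; j++)
                if (tasks[i].e[j].kind == CAP_PORT &&
                    tasks[i].e[j].port_id == dead_port)
                    tasks[i].e[j].kind = CAP_DEAD;
    }
    c->kind = CAP_NULL;
}

/* deallocate: drop one reference to a SEND, SEND-ONCE, or dead
 * capability; the entry disappears when the count reaches zero,
 * but the port itself survives. */
void cap_deallocate(struct cap_list *cl, int name) {
    struct capability *c = &cl->e[name];
    if (c->rights & RIGHT_RECEIVE) return;   /* not permitted */
    if (--c->refs == 0) c->kind = CAP_NULL;
}
```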


6.2 Messages

The purpose of having ports is to send messages to them. Mach has a single system call for sending and receiving messages. The call is wrapped in a library procedure called mach_msg. It has seven parameters and a large number of options. Below we will give a simplified sketch of some of its possibilities.

The mach_msg call is used for both sending and receiving. It can send a message to a port and then return control to the caller immediately, at which time the caller can modify the message buffer without affecting the data sent. It can also try to receive a message from a port, blocking if the port is empty, or giving up after a certain interval. Finally, it can combine these two operations, first sending a message and then blocking until a reply comes back. In the latter mode, mach_msg can be used for RPC.

A typical call to mach_msg looks like this:

mach_msg(&hdr,options,send_size,rcv_size,rcv_port,timeout,notify_port);
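The behavior of a port as a bounded message queue can be sketched as follows. This is a toy model, not the real mach_msg implementation; the function and field names are invented. A port holds at most qlimit queued messages; a send to a full port or a receive from an empty one would block the caller (or time out), which the model signals with an error code, and the combined send-then-receive corresponds to the RPC mode described above.

```c
#include <assert.h>
#include <string.h>

#define MAX_MSG 64

struct message { char body[MAX_MSG]; };

struct port {
    struct message q[32];
    int head, count, qlimit;   /* default qlimit is five messages */
};

enum { MSG_SUCCESS = 0, MSG_WOULD_BLOCK = 1 };

int port_send(struct port *p, const char *body) {
    if (p->count == p->qlimit) return MSG_WOULD_BLOCK;   /* queue full */
    struct message *m = &p->q[(p->head + p->count) % 32];
    strncpy(m->body, body, MAX_MSG - 1);
    m->body[MAX_MSG - 1] = '\0';
    p->count++;
    return MSG_SUCCESS;
}

int port_receive(struct port *p, struct message *out) {
    if (p->count == 0) return MSG_WOULD_BLOCK;           /* queue empty */
    *out = p->q[p->head];
    p->head = (p->head + 1) % 32;
    p->count--;
    return MSG_SUCCESS;
}

/* Combined send-then-receive, the RPC-style use of mach_msg:
 * send a request to one port, then wait for the reply on another. */
int port_rpc(struct port *req, const char *body,
             struct port *reply, struct message *out) {
    int r = port_send(req, body);
    if (r != MSG_SUCCESS) return r;
    return port_receive(reply, out);
}
```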


6.2.1 Message Structure

A message consists of a fixed-length header and a variable number of typed data objects.

The header contains the destination's port name, the name of the reply port to which return messages should be sent, and the length of the message.

The data in the message (in-line data) were limited to less than 8K in Mach 2.5 systems, but Mach 3.0 has no such limit. Any data exceeding that limit must be sent in multiple messages, or, more commonly, by reference through a pointer in the message (out-of-line data).

Each data section may be a simple type (numbers or characters), port rights, or pointers to out-of-line data. Each section has an associated type, so that the receiver can unpack the data correctly even if it uses a byte ordering different from that used by the sender.

The use of pointers in a message provides the means to transfer the entire address space of a task in a single message. The kernel must also process pointers to out-of-line data, as a pointer to data in the sender's address space would be invalid in the receiver's.
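The role of the type tag in cross-machine unpacking can be illustrated with a small sketch. The descriptor layout below is invented for illustration and is simpler than Mach's real type descriptors: a single byte needs no conversion, while a 32-bit integer is byte-swapped whenever the sender's and receiver's byte orders disagree.

```c
#include <assert.h>
#include <stdint.h>

enum data_type { TYPE_BYTE, TYPE_INT32 };

struct data_section {
    enum data_type type;
    int big_endian;        /* byte order used by the sender  */
    uint32_t value;        /* one in-line item, for brevity  */
};

static uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0xFF00) |
           ((v << 8) & 0xFF0000) | (v << 24);
}

/* Unpack a section on a receiver with the given byte order. */
uint32_t unpack(const struct data_section *s, int receiver_big_endian) {
    if (s->type == TYPE_INT32 && s->big_endian != receiver_big_endian)
        return swap32(s->value);
    return s->value;
}
```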


6.2.2 Message Passing

Generally, systems send messages by copying the data from the sender to the receiver. Because this technique can be inefficient, especially in the case of large messages, Mach optimizes this procedure.

The data referenced by a pointer in a message being sent to a port on the same system are not copied between the sender and the receiver. Instead, the address map of the receiving task is modified to include a copy-on-write copy of the pages of the message. This operation is much faster than a data copy, and makes message passing efficient. In essence, message passing is implemented via virtual-memory management.

In Mach 3.0, this operation was implemented as follows. The kernel creates a data structure that would be a copy of the region if it were part of an address map. On receipt, this data structure is added to the receiver's map and becomes a copy accessible to the receiver.

The newly allocated regions in a task do not need to be contiguous with previous allocations, so Mach virtual memory is said to be sparse, consisting of regions of data separated by unallocated addresses.


6.3 The Network Message Server

Every machine in a Mach distributed system runs a network message server. The network message servers work together to handle intermachine messages, trying to simulate intramachine messages as best they can.

A network message server is a multithreaded process that performs a variety of functions. These include interfacing with local threads, forwarding messages over the network, translating data types from one machine's representation to another's, managing capabilities in a secure way, doing remote notification, providing a simple network-wide name lookup service, and handling authentication of other network message servers. Network message servers can speak a variety of protocols, depending on the networks to which they are attached.

Although the idea of relaying messages from one machine to another via a user-level server offers some flexibility, a substantial price is paid in performance as compared to a pure kernel implementation, which most other distributed systems use.


7 Memory Management

Mach has a powerful, elaborate, and highly flexible memory management system based on paging, including features found in few other operating systems. In particular, it separates the machine independent parts of the memory management system from the machine dependent parts in an extremely clear and unusual way. This separation makes the memory management far more portable than in other systems. In addition, the memory management system interacts closely with the communication system, which we will discuss below.

Mach's memory management is split into three parts. The first part is the pmap module, which runs in the kernel, and is concerned with managing the MMU. It sets up the MMU registers and hardware page tables, and catches all page faults. This code depends on the MMU architecture, and must be rewritten for each new machine Mach is ported to.

The second part is the machine-independent kernel code, and is concerned with processing page faults, managing address maps, and replacing pages.

The third part runs as a user process called a memory manager or sometimes an external pager. It handles the logical part of the memory management system, primarily the management of the backing store (disk).


7.1 Virtual Memory

The conceptual model of memory that Mach user processes see is a large, linear virtual address space. For most 32-bit CPU chips, the address space runs from address 0 to address 2^32 - 1. The address space is supported by paging.

In order to determine which virtual addresses are in use and which are not, Mach provides a way to allocate and deallocate sections of virtual address space, called regions. The allocation call can specify a base address and a size, in which case the indicated region is allocated, or it can just specify a size, in which case the system finds a suitable address range and returns its base address.

A virtual address is only valid if it falls in an allocated region. An attempt to use an address between allocated regions results in a trap.
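A sparse address map of this kind can be sketched as a list of allocated regions separated by unallocated gaps. The structure and names below are illustrative, not Mach's actual address-map representation: allocation either honors a caller-supplied base or picks a free address, and a validity check simply asks whether an address falls inside some region.

```c
#include <assert.h>
#include <stdint.h>

struct region { uint32_t base, size; };

#define MAX_REGIONS 8
struct addr_map { struct region r[MAX_REGIONS]; int n; };

/* allocate: if base == 0 the system picks an address after the
 * highest existing region; otherwise the caller's base is used.
 * Returns the base address of the new region. */
uint32_t region_allocate(struct addr_map *m, uint32_t base, uint32_t size) {
    if (base == 0) {
        base = 0x1000;                      /* first candidate address */
        for (int i = 0; i < m->n; i++) {
            uint32_t end = m->r[i].base + m->r[i].size;
            if (end > base) base = end;
        }
    }
    m->r[m->n++] = (struct region){ base, size };
    return base;
}

/* valid: does this address fall inside any allocated region? */
int addr_valid(const struct addr_map *m, uint32_t addr) {
    for (int i = 0; i < m->n; i++)
        if (addr >= m->r[i].base && addr < m->r[i].base + m->r[i].size)
            return 1;
    return 0;   /* a reference here would generate a trap */
}
```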


7.1.1 Memory Object

A key concept relating to the use of virtual address space is the memory object. A memory object can be a page or a set of pages, but it can also be a file or other, more specialized data structure. A memory object can be mapped into an unused portion of the virtual address space, forming a new region.

When a file is mapped into the virtual address space, it can be read and written by normal machine instructions. Mapped files are paged in the usual way. When a task terminates, its mapped files automatically appear back in the file system, complete with all the changes that were made to them when they were mapped in. It is also possible to unmap files or other memory objects explicitly, freeing their virtual addresses and making them available for subsequent allocation or mapping.


7.1.2 Primitives for Virtual Memory Management

Mach supports a number of calls for manipulating virtual address spaces. The main ones are listed in Fig. 9. None are true system calls. Instead, they all write messages to the caller's task port.

Call         Description
----         -----------
Allocate     Make a region of virtual address space usable
Deallocate   Invalidate a region of virtual address space
Map          Map a memory object into the virtual address space
Copy         Make a copy of a region at another virtual address
Inherit      Set the inheritance attribute for a region
Read         Read data from another task's virtual address space
Write        Write data to another task's virtual address space

Figure 9. Selected Mach calls for managing virtual memory


7.2 Memory Sharing

To facilitate memory sharing, Mach allows a task to assign an inheritance attribute to each region in its address space. Different regions may have different attributes. Three values are provided:

1. The region is unused in the child task.

2. The region is shared between the prototype task and the child.

3. The region in the child task is a copy of the prototype.

If a region has the first value, the corresponding region in the child is unallocated. References to it are treated as references to any other unallocated memory - they generate traps. The child is free to allocate the region for its own purposes, or to map a memory object there.

The second option is true sharing. The pages of the region are present in both the prototype's address space and the child's. Changes made by either one are visible to the other one.

The third possibility is to copy all the pages in the region, and map the copies into the child's address space. Actually, Mach does not really copy the pages but uses a clever trick called copy-on-write instead. It places all the necessary pages in the child's virtual memory map, but marks them all read-only. As long as the child makes only read references to these pages, everything works fine.

However, if the child attempts to write on any page, a protection fault occurs. The operating system then makes a copy of the page, and maps the copy into the child's address space, replacing the read-only page that was there. The new page is marked read-write.
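The copy-on-write mechanism just described can be sketched in a few lines of C. The data structures here are invented for illustration; real Mach tracks this state in its address maps and copy objects. A shared page carries a reference count and is mapped read-only in both tasks; the first write faults, and the fault handler copies the page privately and remaps it read-write.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 16

struct page { char data[PAGE_SIZE]; int refs; };
struct mapping { struct page *pg; int writable; };

/* Map the prototype's page into the child copy-on-write:
 * both mappings share the frame, and both become read-only. */
void cow_map(struct mapping *proto, struct mapping *child) {
    proto->pg->refs++;
    *child = (struct mapping){ proto->pg, 0 };
    proto->writable = 0;
}

/* A write through a read-only mapping is a protection fault:
 * copy the page, drop one reference, and remap read-write. */
void cow_write(struct mapping *m, int offset, char byte) {
    if (!m->writable) {
        if (m->pg->refs > 1) {              /* still shared: copy it */
            struct page *copy = malloc(sizeof *copy);
            memcpy(copy->data, m->pg->data, PAGE_SIZE);
            copy->refs = 1;
            m->pg->refs--;
            m->pg = copy;
        }
        m->writable = 1;                    /* sole owner from now on */
    }
    m->pg->data[offset] = byte;
}
```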


7.3 External Memory Managers

Each memory object that is mapped into a task's address space must have an external memory manager that controls it. Different classes of memory objects are handled by different memory managers.


7.3.1 Associated Ports

To map an object into a task's address space, the task sends a message to a memory manager asking it to do the mapping. Three ports are needed to do the job. The first one, the object port, is created by the memory manager and will later be used by the kernel to inform the memory manager about page faults and other events relating to the object.

The second one, the control port, is created by the kernel itself so the memory manager can respond to these events. The use of distinct ports is due to the fact that ports are unidirectional. The object port is written by the kernel and read by the memory manager; the control port works the other way around.

The third port, the name port, is used as a kind of name to identify the object. For example, a thread can give the kernel a virtual address and ask which region it belongs to. The answer is a pointer to the name port. If two addresses belong to the same region, they will be identified by the same name port.


7.3.2 Memory Management

When an object is mapped in, the memory manager provides the capability for the object port as one of the parameters. The kernel then creates the other two ports and sends an initial message to the object port telling the memory manager about the control and name ports. The memory manager then sends back a reply telling the kernel what the object's attributes are, and also informing it whether or not to keep the object in its cache after it is unmapped. Initially, all the object's pages are marked as unreadable/unwritable, to force a trap on the first use.

At this point the memory manager does a read on the object port and blocks. The memory manager remains idle until the kernel requests it to do something by writing a message to the object port. The thread that mapped the object in is now unblocked and allowed to execute.

Sooner or later, the thread will attempt to read or write a page belonging to the memory object. This operation will cause a page fault and a trap to the kernel. The kernel will then send a message to the memory manager via the object port telling it which page has been referenced and asking it to provide the page. This message is asynchronous because the kernel does not dare to block any of its threads waiting for a user task that may not reply. While waiting for a reply, the kernel suspends the faulting thread and looks for another thread to run.

When the memory manager hears about the page fault, it checks to see if the reference is legal. If not, it sends the kernel an error message. If it is legal, the memory manager gets the page by whatever method is appropriate for the object in question. If the object is a file, the memory manager seeks to the correct address and reads the page into its own address space. It then sends a reply back to the kernel providing a pointer to the page.

The kernel maps the page into the faulting thread's address space. The thread can now be unblocked. This process is repeated as often as necessary to load all the pages needed.
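The request-reply exchange above can be sketched as follows. The function names are invented; in real Mach these are asynchronous messages on the object and control ports rather than direct calls. The manager validates the reference and supplies the page from its backing store, and the kernel maps the reply into the faulting task's address space.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 16
#define NPAGES 4

/* The manager's backing store, standing in for a file on disk. */
static char backing_store[NPAGES][PAGE_SIZE];

enum { PROVIDED = 0, BAD_REFERENCE = 1 };

/* "Message" from kernel to manager: please provide this page.
 * The manager checks the reference and supplies the data. */
int manager_data_request(int page, char *out) {
    if (page < 0 || page >= NPAGES) return BAD_REFERENCE;
    memcpy(out, backing_store[page], PAGE_SIZE);
    return PROVIDED;
}

/* Kernel side: on a fault, suspend the thread, ask the manager for
 * the page, and map the reply into the faulting task's address space. */
int kernel_handle_fault(int page, char frame[PAGE_SIZE]) {
    char buf[PAGE_SIZE];
    if (manager_data_request(page, buf) != PROVIDED)
        return -1;                    /* error message: illegal access */
    memcpy(frame, buf, PAGE_SIZE);    /* page mapped in, thread resumes */
    return 0;
}
```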


7.3.3 Paging Daemon

To make sure there is a steady supply of free page frames, a paging daemon thread in the kernel wakes up from time to time and checks the state of memory. If there are not enough free page frames, it picks an old dirty page and sends it to the memory manager in charge of the page's object. The memory manager is expected to write the page to disk and tell the kernel when it is done. If the page belongs to a file, the memory manager will first seek to the page's offset in the file, then write it there. The replacement algorithm used is second chance.

It is worth noting that the paging daemon is part of the kernel. Although the page replacement algorithm is completely machine independent, with a memory full of pages owned by different memory managers, there is no suitable way to let one of them decide which page to evict. The only method that might be possible is to statically partition the page frames among the various managers, and let each one do page replacement on its set. However, since global algorithms are generally more efficient than local ones, this approach was not taken.
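The second-chance policy itself is simple enough to show directly. This is a sketch with an invented frame-table layout, not the daemon's actual code: frames are scanned circularly, a frame whose reference bit is set has the bit cleared and is given a second chance, and the first frame found with the bit clear is evicted.

```c
#include <assert.h>

#define NFRAMES 4

struct frame_table {
    int ref[NFRAMES];   /* reference bit per page frame */
    int hand;           /* position of the clock hand   */
};

/* Returns the index of the frame to evict. */
int second_chance_evict(struct frame_table *ft) {
    for (;;) {
        int i = ft->hand;
        ft->hand = (ft->hand + 1) % NFRAMES;
        if (ft->ref[i] == 0)
            return i;        /* old page: evict it */
        ft->ref[i] = 0;      /* recently used: give it a second chance */
    }
}
```

Note that the loop always terminates: after at most one full sweep every reference bit has been cleared, so some frame must be selected.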


7.3.4 Default Memory Manager

In addition to the memory managers for mapped files and other specialized objects, there is also a default memory manager for "ordinary" paged memory. When a task allocates a region of virtual address space using the allocate call, it is in fact mapping an object managed by the default manager. This manager provides zero-filled pages as needed. It uses a temporary file for swap space, rather than a separate swap area.


8 Conclusion

Mach is a microkernel-based operating system. It has been designed to provide a base for building new operating systems and emulating existing ones. It also provides a flexible way to extend UNIX to multiprocessors and distributed systems.

Mach is based on the concepts of tasks, threads, ports, and messages. A Mach task is an address space and a collection of threads that run in it. The active entities are the threads. The task is merely a container for them. Each task and thread has a port to which it can write to have kernel calls carried out, eliminating the need for direct system calls.

Mach has an elaborate virtual memory system, featuring memory objects that can be mapped and unmapped into address spaces, and backed up by external, user-level memory managers. Files can be made directly readable and writable in this way, for example. Memory objects can be shared in various ways, including copy-on-write. Inheritance attributes determine which parts of a task's address space will be passed to its children.

Communication in Mach is based on ports, which are kernel objects that hold messages. All messages are directed to ports. Ports are accessed using capabilities, which are stored inside the kernel and referred to by 32-bit integers that are usually indices into capability lists. Ports can be passed from one task to another by including them in complex messages.

References

[1] Abraham Silberschatz, Peter B. Galvin. Operating System Concepts, 4th edition. Addison-Wesley, 1994.

[2] Andrew S. Tanenbaum. Modern Operating Systems. Prentice-Hall, 1992.

[3] Jean Bacon. Concurrent Systems: An Integrated Approach to Operating Systems, Database, and Distributed Systems. Addison-Wesley, 1992.

[4] Richard Rashid. Mach Distributed Operating Systems/Unix/AIX. IEEE Computer Society Press, 1991.

[5] Jochen Liedtke. On micro-Kernel Construction. German National Research Center for Information Technology.