A Detailed Description of Off++, a Distributed Adaptable tex2html_wrap_inline4297kernel

Francisco J. Ballesteros Fabio Kon Roy H. Campbell
nemo@gsyc.escet.urjc.es {f-kon,roy}@cs.uiuc.edu
Department of Computer Science
University of Illinois at Urbana-Champaign
Report No. UIUCDCS-R-97-2035, UILU-ENG-97-1748

August, 1997


The Off++ distributed adaptable tex2html_wrap_inline4297kernel is a minimal tex2html_wrap_inline4297kernel whose only task is to safely multiplex and export the distributed hardware present in the network. It is designed to be used as a basis for distributed user-level OS services. This technical report describes the design and implementation of Off++, the object-oriented redesign of the Off tex2html_wrap_inline4297kernel [2, 3]. Off++ extends the functionalities provided by Off in order to provide basic support for the 2K operating system [5] we are currently building. This document is meant to be the starting point for the system literate implementation.


List of Figures

1 Introduction

1.1 Motivation

The well known definition for operating system is ``the software that securely abstracts and multiplexes physical resources'' [19]. By no means is it known that those resources should be contained in a single node. So, Why are our distributed operating systems based on microkernels which essentially multiplex just local resources?

Obviously system services can be later distributed when using a (centralized) microkernel. Indeed, that can be done even when using a monolithic system [16]. But this will not solve the actual problem that the system is not being actually distributed and is not transparently multiplexing both local and remote resources.

Secondly, a major drawback of current distributed operating systems is their lack of adaptability. It is known that adaptability can be achieved using a minimal microkernel as a foundation for the operating system [8, 7, 4, 10]. If the microkernel is centralized, adaptation of system services for particular requirements may harm the distribution of those services because they are distributed on top of the microkernel and this distribution is not supported by the microkernel itself. The reason is that user extensions may fail to preserve properties found in the distributed hardware and there will be no lower layer supporting them. If the microkernel is itself distributed, simple system extensions may still benefit from system distribution because the lower layer is distributed by itself. This problem can be noticed by the fact that it is necessary to modify and/or re-implement existing system services to add new distributed services to a typical microkernel based distributed system [6, 20, 13].

The 2K Operating System is being built to explore the combination of distribution and adaptability. 2K is built from two main pieces:

We expect the combination of both elements to yield a simple easy-to-use distributed and flexible OS.

By adopting a strict architecture self-awareness philosophy, we will allow system and user modules to be conscious of its own physical and logical architecture. So, the system will be able to adapt itself optimizing its performance and reliability.

This document describes the object oriented redesign of the Off tex2html_wrap_inline4297kernel [2, 3, 1], named Off++, and is meant to be the starting point for its literate implementation (see section 1.3).

Off++ also adds new functionalities to the tex2html_wrap_inline4297kernel in order to provide basic support for 2K. These new functionalities include support for architecture self-awareness, system object browsing and inspection (see subsections 2.3.5 and 2.3.7), and flexible user-defined domains (see section 2.6).

Multiple protection domains, multi-user operation, and multitasking are optional features which can be avoided in those nodes where they are not useful.

1.2 The Off tex2html_wrap_inline4297kernel


The three basic abstractions provided by Off (shuttles, portals and DTLBs) will be described briefly in this section so that the reader will be able to better understand the following chapters.

1.2.1 Shuttles. Distributed Process Services

Process services are quite simple in Off. The only abstraction provided is the Shuttle. A shuttle is an extensible hardware context. Initially, it consists only of a program counter and a stack pointer, though they can be safely extended later on to include other pieces of context or properties (such like general purpose registers, address spaces, privilege levels, etc.)

1.2.2 Portals. Interprocess communication

The basic interprocess communication mechanism provided by Off is the Portal. A Portal can be seen like a ``distributed interrupt line'' or a ``network-wide gate''. Portals have the basic utility of interrupts, i.e. they can be invoked transparently to make a handler perform some task.

Portals do not implement buffering and do not specify whether synchronous messages, asynchronous messages or RPCs are to be used. The mechanisms and policies used to locate and to invoke portals over the network are left up to the user. User provided transport and location protocols are used to extend the portal mechanism transparently over the network. In this way traps, interrupts, exception and user messages can be accessed and adapted in the whole system.

D-TLBs. Distributed Memory Management

The only thing done by Off is to export the network hardware, and this also holds for memory management. Off implements a Distributed Software TLB. The user can establish translations from virtual to distributed physical memory addresses and the address translation hardware is safely multiplexed by Off among the competing applications.

1.3 Literate programming


As this document is hardcopy of a literate program [14] we think that it is worth saying something about this technique. We do it by citing the comp.programming.literate newsgroup FAQ:

``Literate programming is the combination of documentation and source together in a fashion suited for reading by human beings. In fact, literate programs should be enjoyable reading, even inviting! [...] In general, literate programs combine source and documentation in a single file. Literate programming tools then parse the file to produce either readable documentation or compilable source. The WEB style of literate programming was created by D.E. Knuth during the development of his TeX typsetting software.

All the original work revolves around a particular literate programming tool called WEB. Knuth says:

The philosophy behind WEB is that an experienced system programmer, who wants to provide the best possible documentation of his or her software products, needs two things simultaneously: a language like TeX for formatting, and a language like C for programming. Neither type of language can provide the best documentation by itself; but when both are appropriately combined, we obtain a system that is much more useful than either language separately.

The structure of a software program may be thought of as a web that is made up of many interconnected pieces. To document such a program we want to explain each individual part of the web and how it relates to its neighbours. The typographic tools provided by TeX give us an opportunity to explain the local structure of each part by making that structure visible, and the programming tools provided by languages such as C or Fortran make it possible for us to specify the algorithms formally and unambigously. By combining the two, we can develop a style of programming that maximizes our ability to perceive the structure of a complex piece of software, and at the same time the documented programs can be mechanically translated into a working software system that matches the documentation.''

What follows have been written in a mixture of LaTeX and \textsl{C++}. The \textsl{noweb} literate programming tool has been used to obtain a LaTeX formatted version of the document as well as the C++ code which can be later compiled.

1.4 How to read this document

In what follows, each description of a part of the system may be followed by a chunk of code like this one

<Example chunk of code. >= (U->)
class A_Class { ... };
Defines A_Class (links are to index).

In this example, the name of the chunk was ``Example chunk of code.'' The number on the right of the chunk name, also found on the left margin, identifies the chunk and will be used in cross-references. We will refer to such number as the ``chunk number''. Also, when an interesting entity has been defined by a chunk (A_Class in the example) it will be stated below the chunk using a small font.

The chunk number is made the page number (identifying in which page the chunk is defined) and, if necessary, a letter to distinguish between different chunks in the same page. Using this number one is able to quickly locate any chunk in the document.

Chunks of code can include other chunks, in this case, you can use the number enclosed with the chunk name between angles (i.e. ``tex2html_wrap_inline4313'' and ``tex2html_wrap_inline4315'') to locate quickly the code for the chunk being included. In this example, the chunk shown above will be included in chunk 3a. We then should use the chunk numbers found in the left margin to locate it.

<A chunk including another. >=
  // Some code...
  <Example chunk of code. >
  // And we also use A_Class
  A_Class an_object;

Finally, a chunk may be continued at a different part of the document. In this case it will be said explicitly below the chunk using a small font, as it can be seen below.

<Continuing chunk of code. >= [D->]
  //Mary had

<Continuing chunk of code. >+= [<-D]
  //a little lamb.

Note the use of the ``+'' to state that the chunk is a continuation.

Finally, we suggest reading this document as it is. Do not try to read it following the order understood by a C++ compiler. The references below each chunk will allow you to navigate through this program's web.

1.5 Tools

To implement Off++ several tools have been used. They are all you need to build a system image from the on-line source for this document.

All these tools have been already used in the construction of the original Off prototype.

2 System structure

2.1 Exporting system objects

System services are made of a bunch of system objects exported to users. The model is actually more generic: portals are used whenever a protection domain wants to export certain services in a secure way to alien users. For each object being exported, a set of methods are made available by means of portals.

2.1.1 Traps and the OO model

The Off++ tex2html_wrap_inline4297kernel is an object oriented system. However, the boundary between the user and the kernel is procedural: portals are used to perform system calls (implemented in turn by means of traps), as can be seen in figure 2.1.

Figure 2.1: System calls

The path for a system call proceeds as follows:

  1.   A user object calls a system object method.
  2. The system object is actually a wrapper that will translate the method invocation into a portal invocation.
  3.   The portal invocation will transfer the user control flow to the kernel.
  4. Inside the kernel, a system object method (that wrapped by the wrapper object in step 1) will be called as part of the portal delivering mechanism.

Only step 3 is procedural. Both user and kernel objects think that they are calling now and then to other objects. But whenever a protection domain has to be crossed, a portal is used.

In this way no remote method invocation has to be built inside the kernel so that it could be kept simple.

Thus, the only actual system calls are those needed to implement the portal delivering mechanism. Remaining system services are first provided by means of portals and then wrapped at user level by objects.

2.1.2 Tying portals and methods together

Whenever an object wants to export part of its interface to the outer world it must create one portal per method. To automate the task, each class with methods being exported should be labeled ENTRY too. In this way a compiler can generate the glue code between the portals and the methods being exported. All the method of task are considered to be exported to the user.

In particular, the compiler will generate automatically the implementation of this method:

<Off private methods for exported objects. >=
  // Creates portals to access class methods
  void export(void); 

The code generated for export will create one portal per method. The portal handler will be setup so that methods are invoked transparently: The program counter will be pointing to the method code and the maximum expected number of message words will be setup to the size in the stack of the method arguments. The task of extracting the arguments is done automatically by the portal invocation mechanism and the compiler because portal messages are always copied from (caller) stack to (callee) stack.

Those methods exported are expected to return a non-zero value with the error code in case the invocation fails, or zero otherwise. This can be used by user-level wrappers to raise exceptions. That error code will be returned by delivering a ``system call failed'' (SCERR) event to the shuttle issuing the system call. The handler for that event may set the error code inside the user application so that an exception could be raised. The benefit of using events for error notifying is that we save a register and permit ``optimistic'' system usage where most of the times the system does not need to return anything to the user.

It should be noted that the object constructor should call export or nothing will be actually exported.

Portal identifiers are not stored because the order in which exported system objects are created is always the same. Thus, the algorithm used to construct the portal identifiers ensures that given the node identifier, portal numbers will always be the same. This is to say that they are node-wide constants.

Although it is not part of the kernel, we will say that the compiler mentioned above can generate user wrappers too. These wrappers are objects with the same signature of the system object being exported but with just those methods being labeled ENTRY. Each wrapper method will simply invoke to the kernel portal and raise an exception on portal return whenever the return code is non-zero.

Note that many of these wrappers are to be built dynamically as users obtain access to system resources. They can cache the identifier of the resource being wrapped and use that identifier to to perform system calls.

To make it clear what system services are being exported to the user, and also to avoid in-kernel authentication (e.g. unnecessary access checks when operations are requested by the kernel itself) we will wrap every object being exported to the user. Those wrappers will be named with the prefix off_ugif and will be described right after the objects they are wrapping to make it clear which methods are exported to the user.

2.2 Resource building blocks

There are some basic interfaces which must be present on almost every system object. They were initially inherited by the Resourcegif class, which models system resources, as can be seen in figure 2.2, but we will be using aggregation instead (see figure 2.3) to avoid multiple inheritancegif.

Figure 2.2: Initial resource hierarchy (not used)

Figure 2.3: Basic resource hierarchy

2.2.1 Protection

To protect object access we define a generic set of access operations (read, write, execute, delete, and protect in the current implementationgif) and for each one we have a Protection object specifying the protection for such operation.

<Off access operations. >=
// Operations which can be protected. 
enum off_op_t { OFF_OP_R=0x1, OFF_OP_W=0x2, 
                  OFF_OP_X=0x4, OFF_OP_D=0x8, OFF_OP_P=0x10 };
const natural_t OFF_NOPS = 5;
Defines OFF_NOPS, OFF_OP_D, OFF_OP_P, OFF_OP_R, off_op_t, OFF_OP_W, OFF_OP_X (links are to index).

Individual operations can be combined together (usually by a bit-or operation) to specify an access mode (e.g. OFF_OP_R|OFF_OP_X for ``read and execute'' access mode).

Also, note that natural_t and other basic types which we will be using through the document have to be defined in a portable way (i.e. ensuring that they all have the same size across different platforms).

<Off access mode. >=
typedef off_op_t off_mode_t;
Defines off_mode_t (links are to index).

Three different objects are involved in protection:

2.2.2 Reference counting

Resources exported by the kernel will also be used by other kernel resources (e.g. page frames are used by address translations). To avoid resource deletion while its being used and to provide basic support for garbage collection we employ reference counters (RefCounter objects). When a RefCcounter comes down to zero an UNUSED exception may be raised to notify users of unreferenced (maybe unused) resources.

This feature is actually included as a builtin service in every Resource, so that we could save some typing and execution time using foo->reference() instead of foo->reference_counter->reference()..

The implementation is so simple that we have folded and inlined it.

<Off public methods for reference counting objects. >= (U-> U->)
  // References  (unreferences) an object
  void reference( void )   { _rc++; };
  void unreference( void ) { if (!--_rc) unused(); };
  natural_t get_num_refs(void) {  return _rc; };

The only subtle point is that we will use the virtual method unused to raise any exception and then destroy the object.

<Off protected methods for reference counting objects. >= (U-> U->)
  // Raises UNUSED exceptions and destroyes the object
  virtual void unused(void)  {;}

And of course\ldots

<Off private members for reference counting objects. >= (U-> U->)
  natural_t _rc;                 // # of references

2.2.3 Synchronizing system resources

To synchronize access to system objects, we will use spin locks and bureaucracygif most of the time.


The access pattern to system objects in system call is:

    \item Call make_available() to bring any needed remote object here.
  1. Having the required objects in the local node, lock them either for reading or for writing (usual multiple readers, single write semantics apply).
  2. Perform the operation (which should now be non-blocking) and return.

Thus, resources must include a read/write lock. The member variable _lock will be included in every ``lockable'' object.

<Off private members for lockable objects. >= (U->)
  rw_lock_t _lock;            // The lock

Again, we include this as builtin functionality to save some typing and time.

<Off public methods for lockable objects. >= (U->)
  // Spin (un)lock on this for reading 
  lock_state_t r_lock(void);
  void r_unlock(lock_state_t old_state);

  // Spin (un)lock on this for writing 
  lock_state_t w_lock(void);
  void w_unlock(lock_state_t old_state);

The value returned by lock routines (the lock_state) holds the state to be restored when the resource is released through unlock routines, e.g. the interrupt mask (are interruptions enabled or disabled?) and any other piece of ``global'' state changed by the lock.

Kernel bureaucracy

In order to wait for a resource to reach a given state, or to do long-term waiting due to resources being temporarily unavailable, we use Bureaucrats. Each resource that may cause long-term waiting (abstract resources) will provide one or more Bureaucrats so that shuttles block on them until some new state is reaached. Bureaucrats can be understood as condition variables but can be implemented by other means.

The use of condition variables is not an obstacle to system distribution because we will be using lists of identifiers (and not lists of objects) to represent the set of shuttles waiting on condition variables. Thus a single list may span several nodes.

<Off bureaucrat. >=
class off_Bureaucrat {
  // Puts 'who' to sleep on this bureaucrat.
  void wait(shtl_id_t who);
  // Signals this condition. Just one will be awaken. 
  void signal(void);
  // Signals this condition. Everyone will be awaken. 
  void signal_all(void);
Defines off_Bureaucrat (links are to index).

%% --------------------------------------------------------------

2.2.4 Sequencers

Some objects will need to atomically obtain unique sequence numbers to create unique identifiers (for other objects contained inside). Again, this functionality is incorporated into the aggregate.

<Off private members for sequencing objects. >= (U->)
  off_seq_t _seq;            // Next sequence nb.

<Off protected methods for sequencing objects. >=
   // Gets a new sequence number. Uses lock() to ensure atomicity
   off_seq_t get_seq(void);      

2.3 System resources


Each resource in the system supports a set of well-defined operations. Namely, those provided by building blocks that we have seen before and a few others like freeze, to get the state of the object and defer any further invocations, melt, to recreate a frozen object, and dump, which will return a printable representation of the resource state.

<Off resource. >=
class off_Resource  {
   <Off private members for protected objects. >
   <Off private members for reference counting objects. >
   <Off private members for lockable objects. >
   <Other private members of off_Resource. >
   <Off protected methods for reference counting objects. >
   <Off public methods for lockable objects. >
   <Off public methods for reference counting objects. >
   <Off public methods for protected objects. >

  // Every resource (be it simple or compound) is able to be
  //  dumped for inspection and frozen/melted. 
  virtual void *freeze(void *to, size_t size)=0;
  virtual melt(void *from, size_t size)=0;
  inline boolean_t is_frozen(void);
  virtual char *dump(char *buf, size_t size) const;

  <Other public methods of off_Resource. >
Defines off_Resource (links are to index).

Also, we should mention that some resource members are initialized at allocation time (with the information provided by system users). That information includes the protection for the resource and an identifier used to identify the user-level entity responsible for the object after its allocation; we name such entity the object domain.

Being the \emph{Portal} the IPC mechanism provided by the kernel, the domain identifier is indeed a portal identifier of type off_prtl_id_t. By doing that, we can have the kernel performing upcalls to either user-level or in-kernel domain servers which will be responsible for allocated system resources. Besides, the meaning of the domain abstraction is defined by the user and not by the kernel, as will be discussed in section 2.6.

<Other private members of off_Resource. >= (<-U)
   off_prtl_id_t r_domain;       // Domain for the resource

\subsection{Resources for plain users}

These are the common entry points for every system resource.

<Off resource for users. >=
ENTRY class off_uResource {
 off_Resource &u_resource;
 void *freeze(void *to, size_t size, const off_Rights &r)=0;
 void  melt(void *from, size_t size, const off_Rights &r)=0;
 boolean_t is_frozen(const off_Rights &r);
 char *dump(char *buf, size_t size, const off_Rights &r) const;

 <Other public methods of off_uResource. >

Defines off_uResource (links are to index).

2.3.2 Identifiers

Off uses two kinds of identifiers:

The size of each field is defined as shown.

<Off identifiers. >= [D->]
typedef unsigned16_t  off_node_t;  // Node descriptor
typedef unsigned32_t  off_seq_t;   // Sequence number
typedef unsigned16_t  off_slot_t;  // Slot number
typedef unsigned long off_offset_t;// Elementary unit offset number

Thus, identifiers are typed:

<Off identifiers. >+= [<-D]
// Relocatable top level identifier
struct off_id_t {
  off_node_t i_node : 16;     // Creation node for the object
  off_seq_t  i_seq  : 32;     // Sequence number
  off_slot_t i_slot : 16;     // Implementor's descriptor for the object

// Hardware resource elementary unit identifier
struct off_eu_id_t {
  off_id_t     e_container;        // id for its container
  off_offset_t e_offset;           // Object id relative to the container.
Defines off_eu_id_t, off_id_t (links are to index).

2.3.3 Containers and resource units

Resources are arranged as recursive containers, being the node the outermost one.

These containers delegate allocation to an Allocator object which can (and will) be further wrappedgif to delegate usage statistics to a bookkeeper. See figure 2.4.


Containers and allocators


<Off compound resource. >=
// Compound resource. Uses an allocator to allocate resource units. 
class off_CompResource : public off_Resource {
  // Allocator used for this container.
  off_Allocator *c_pool;    // Resource unit pool
  <Other private members of off_CompResource. > 
  // Named with `relocatable' identifiers
  off_id_t  get_id(void) const;
  // Gives a pointer to the container's allocator
  off_Allocator *get_allocator(void) const;
Defines off_CompResource (links are to index).

Specific resource units maintain a reference to their container. They also redefine the new operator so that it could be used to allocate resource units from a container's pool. This way we can avoid memory fragmentation.

The reference to the container and the redefinition of new will be declared in subclasses of ResUnit to avoid explicit type casts.

<Off elementary resource unit. >=
// Elementary resource unit.
class off_ResUnit : public off_Resource {
Defines off_ResUnit (links are to index).

\subsubsection{Compound resources and resource units for plain users}

Compound resources export to users the allocation information found in the CompResource allocator. To do so, they export the get_allocator method. They also export get_id.

<Off compound resource for users. >=
ENTRY class off_uCompResource : public off_uResource {
 off_id_t  get_id(const off_Rights &r) const;
 off_Allocator *get_allocator(const off_Rights &r) const;
Defines off_uCompResource (links are to index).

We do not export to users anything besides that exported by a Resource in ResUnit system objects.

<Off elementary resource unit for users. >=
ENTRY class off_uResUnit : public off_uResource {
Defines off_uResUnit (links are to index).

2.3.4 Resource availability


When users issue a system call, it is targeted to a particular system object. The system call is processed in the local node only when that object is local. In any other case we forward the system call to a remote node by raising a FWD exception to the user.

Assuming the target system object is local, so that the system call proceeds, it is not guaranteed that every resource needed by that system call be present in the local node.

To deal with remote resources, every resource container (CompResources) implements a make_available method so that whenever a resource is needed it could be fetched to the local node (if that is really required).

Besides, when the container being considered is an AbsCompResource (a container of abstract resources), we have to take into account that resources contained inside it (abstract resources) are able to move around. Thus, in this case we will employ a relocation table to cache known object locations.

In every CompResource, make_available proceeds as follows:

  1. Using the resource identifier, it tries to locate the resource in the allocator used by the AbsCompResource (using the slot field of the resource identifier). If the resource is found there, then there is nothing more to do.
  2. If the resource was not present in the allocator and we are considering abstract resources, the relocation table has to be consulted to find out if we know of any object relocation. If we find an entry, we now know where the object is and we are done.
  3. If we don't know where the object is at this point, we assume it is located in its creation node (known by looking in the resource identifier). Thus, we suspect where the object is and raise a MISSING exception providing the expected location to user so that he could fetch the remote resource and bring it to the local container.
    1. If the object was indeed at its expected location (it did not move around) the system call is able to proceed as soon as the MISSING exception handler completes successfully. Should the handler fail, the system call is aborted and an ILL (illegal instruction) exception is raised to the Shuttle domain.
    2. If the object was not at its expected location we will notice it as soon as the user is notified by the remote resource container (where we expected to find the resource). In this case, a location algorithm will be run by the user. The resulting location should be inserted in the container relocation table, if any.

We should note that resource ``relocation'' counts mostly for portals, shuttles and DTLBs. Elementary resource units are not able to move around, so they will always be at their creation nodes.

Besides, because the system can be heterogeneous, a remote resource brought to the local node may be meaningless to us or require some sort of marshaling (eg. when it corresponds to a different architecture). In this case, a XDT exception is raised in order to request to a trusted user translator an image of the remote object in the native format.

2.3.5 Architecture awareness support


Each resource will be wrapped with a Navigator and an Inspector. This way, the OS built on top of the tex2html_wrap_inline4297kernel will be able to transparently navigate through its components and later on ask about resource attributes (e.g. bandwidth, persistence, etc.),

The class hierarchies of Navigators and Inspectors mimic that of Resource, thus we omit it.

Every resource will include two methods, get_navigator and get_inspector, to provide a pointer to its specific navigator and inspector. Both of them will be friends of the object considered and will be able to access its state in order to support their services.

<Other public methods of off_Resource. >= (<-U)
   // Returns the navigator for the resource.
   virtual off_Navigator *get_navigator(void) const;
   // Returns the inspector for the resource.
   virtual off_Inspector *get_inspector(void) const;

By enquiring the navigator, references to other system resources may be obtained and get_navigator or get_inspector may be used again.

The scheme works as shown in the example of the figure 2.5:

  1. The user has a reference to a memory bank.
  2. Using get_navigator, a reference to its navigator is obtained. This navigator implements the Navigator interface.
  3. Browsing with the navigator, a page frame is located \item Its method get_inspector is used to obtain a reference to an object implementing the Inspector interface. A common protocol can be used now to find out about the page frame persistence, size, etc.

Of course, both get_navigator and get_inspector are exported to system users.

<Other public methods of off_uResource. >+= (<-U) [<-D]
 off_Navigator *get_navigator(const off_Rights &r ) const;
 off_Inspector *get_inspector(const off_Rights &r ) const;

A Navigator provides methods to select and lookup components.

Figure 2.5: Architecture awareness support: Navigators and Inspectors.

<Off resource navigator. >=
// A resource navigator
ENTRY class off_Navigator {
   // Returns a reference to the resource being navigated
   off_Resource *get_current(void);

   // Returns the first element.
   off_Resource *get_first(void);
   // Advances the iteration and returns the next element.
   off_Resource *get_next(void);

   // Selects a component by name.
   off_Resource *operator[](char *name);
   //   by identifiers
   off_Resource *operator[](const off_eu_id_t &id);
   off_Resource *operator[](off_id_t id);
Defines off_Navigator (links are to index).

An Inspector has two main methods to lookup attributes either by name or by index. Attribute indexes will vary from 0 to some arbitrary value. The first invalid attribute is guaranteed to have a NULL name and type.

To avoid usage of void *, users are requested to specify the expected type of an attribute. As it is unrealistic to force users to know in advance which attributes can be present, a method nameof is provided to obtain the name of a given attribute, and a method typeof is provided to obtain its type.

<Off resource inspector. >=
// A resource inspector
ENTRY class off_Inspector {
   // Returns  the name of an attribute at [idx]
   char *nameof( natural_t idx) const; 
   // Returns the type of an attribute
   off_attr_kind_t typeof(natural_t idx) const;

   // Look up a boolean attribute by name or by index
   virtual boolean_t get_bool_attr(char *name)     const { return FALSE; }
   virtual boolean_t get_bool_attr(natural_t idx)  const { return FALSE; }

   // Look up a natural attribute by name or by index
   virtual natural_t get_nat_attr(char *name)      const { return 0; }
   virtual natural_t get_nat_attr(natural_t idx)   const { return 0; }

   // Look up a string attribute by name or by index
   virtual char *get_str_attr(char *name)          const { return NULL; }
   virtual char *get_str_attr(natural_t idx)       const { return NULL; }

   // Look up an id attribute by name or by index
   virtual off_id_t get_id_attr(char *name)  const { return OFF_ID_NULL; }
   virtual off_id_t get_id_attr(natural_t idx) const {return OFF_ID_NULL;}

Defines off_Inspector (links are to index).

The default implementation is to ignore the arguments and return a silly value, so that such function could be ignored: not every resource has attributes of every type. Thus, only nameof, typeof and one of the for previous pairs of methods must be provided by any subclass.

The valid attribute types are

<Off attribute type ids. >=
  enum off_attr_kind_t { OFF_BOOL_ATTR,  OFF_INT_ATTR, 
                         OFF_STR_ATTR,   OFF_ID_ATTR };
Defines off_attr_kind_t, OFF_BOOL_ATTR, OFF_ID_ATTR, OFF_INT_ATTR, OFF_STR_ATTR (links are to index).

Finally, any resource should have at least these attributes defined

<Off attributes. >=
  enum { OFF_ATTR_NULL = 0,  // Used to mark the end of the attr list
         OFF_ATTR_NAME,      // Name of this resource
         OFF_ATTR_CLASS,     // Class for this resource
         OFF_ATTR_DOM,       // Resource domain 
         OFF_ATTR_ID,        // Resource id (or container id)
         OFF_ATTR_OFFSET,    // Resource offest in a container or 0
         OFF_ATTR_URL,       // URL for the resource literate source code

%% --------------------------------------------------------------

2.3.6 Resource Allocators

Every allocator will provide the same interface, so that we could replace it at will.

<Off allocator. >=
signature off_Allocator { 
  // Returns a chunk of raw memory.
  // Will try to allocate n chunks starting at the i-th item. 
  // i == 0 means allocate anywhere.
  void *allocate(natural_t n=1, natural_t i=0);

  // Deallocates the chunk pointed by p
  void  deallocate(void *p);
Defines off_Allocator (links are to index).

The signature keyword is a GNU C++ extension which permits the use of subtype polymorphism separately from inheritance. Some dynamic dispatching is saved this way. However, it can be considered to be an abstract base class or interface implemented by every allocator.

\subsection{Allocation statistics}  

Statistics are maintained by bookkeeping allocators (BKAllocators) which are composed by wrappinggif the basic Allocator with some bookkeeping methods. These operations are inlined.

<Off bookkeeping allocator. >=
class off_BKAllocator {
   natural_t b_nfree;           //  # of free nodes
   natural_t b_nalloc;          //  # of allocated nodes
   natural_t b_maxalloc;        //  Maximum # of ever allocated nodes
   natural_t b_nallocrq;        //  # of alloc requests
   natural_t b_nfreerq;         //  # of free requests
   // Get some statistics 
   natural_t get_nfree(void) const {return s_nfree;}
   natural_t get_nalloc(void) const {return s_nalloc;}
   natural_t get_maxalloc(void) const {return s_maxalloc;}
   natural_t get_num_allocrq(void) const {return s_nallocrq;}
   natural_t get_num_freerq(void) const {return s_nfreerq;}

   // Wrapped allocator bookkeeping methods  
   inline void *allocate(natural_t n=1, natural_t i=0);
   inline void  deallocate(void *p);
Defines off_BKAllocator (links are to index).

\subsection{Fixed and Block allocators}  

The FixedAllocator is an Allocator with a fixed allocation scheme. Its implementation may vary, but in general it will consist on a fixed array and a linked list of free nodes. This class will provide a generic implementation which can be later optimized by specialized allocators.

<Off fixed allocator. >=
class off_FixedAllocator: public off_BKAllocator {
   // Allocates (deallocates) units from a fixed store.
   void *allocate(natural_t n=1, natural_t i=0);
   void  deallocate(void *p);
Defines off_FixedAllocator (links are to index).

The BlockAllocator is an allocator which can grow dynamically.

<Off block allocator. >=
class off_BlockAllocator: public off_BKAllocator {
   // Allocates (deallocates) units from a growing store
   void *allocate(natural_t n=1, natural_t i=0);
   void  deallocate(void *p);
   // Resizes the store
   void  grow(natural_t n);
Defines off_BlockAllocator (links are to index).

Even though it is not required for the understanding of the rest of this document, we now say a few words about the implementation of BlockAllocator. In Off, it is used to allocate abstract system resources (shuttles, portals and DTLBs).

It is desirable to avoid static limits in the amount of resources an application can allocate. So, each application would allocate, for instance, as many shuttles as it wants and as many portals as it wants. To implement that, we chose to allocate abstract system resources in virtual memory.

However, adding this flexibility to the allocator could cause problems in the case when a single application allocates many resources (e.g. many portals) harming other applications' performance.

Our implementation of the BlockAllocator helps solving this problem by allocating elementary units (i.e. shuttles, portals and DTLBs) so that those of a particular application can be paged out without disturbing other applications. The kernel virtual memory may look like the one depicted in figure 2.6. In this situation the large number of VM pages used to store one of the applications' (xfig) portals does not avoid the other ( gcc) to have the opportunity to freely allocate its portals.

Figure 2.6: Storage of Portals for two different applications.

As the system does not know what a ``user application'' is, it is not possible to make the allocator use a different page for each application. However, the BlockAllocator can use the resource domain identifier as a hint. So, what we actually do is to use a different set of virtual memory pages to allocate resources for each application domain.

2.3.9 Resource revocation


Resources are not revoked by the Off++ tex2html_wrap_inline4297kernel. Instead, an UNAVAILABLE exception is raised for the resource exhausted; so that users could run whatever algorithm is needed to make more units of the resource being exhausted.

The exception trigger is the memory allocator being used. System allocators are provided with an Exhausted object at instantiation time for that purpose. The Exhausted object provides a single method notify to let allocators trigger the proper action (by default, raising an UNAVAILABLE exception).

<Off Exhausted resource revocation trigger. >=
signature off_Exhausted {
   // Notify that resource is exhausted
   void notify(void);
Defines off_Exhausted (links are to index).

2.4 Hardware resources

Hardware resources are not able to move and are always located at the same place. If they grow, they grow one container at a time (i.e. when new disks or pluggable memory banks are added to the system). Hardware resource containers may move by replacing a whole container by another. So, usually, hardware containers will use a fixed allocator.

The hardware resources exported by the kernel are arranged as recursive containers starting from Node. See figure 2.7.

Figure 2.7: Hardware resource containers and elementary units.

<Off hardware resource container. >=
// A hardware resource container. Units will be allocated
// within it. 
class off_HWCompResource : public off_CompResource {
  // We may wish to cache a resource unit 
  void make_available(off_eu_id_t u);
Defines off_HWCompResource (links are to index).

Hardware resource units employ a particular naming scheme. They do not use off_id_t identifiers as remaining system resources do. They use elementary unit identifiers (off_eu_id_t) instead. These ones are built of an off_id_t (naming the container) and an offset locating the object inside the container. The offset matches the resource physical name, i.e. that understood by the hardware.

<Off hardware resource unit. >=
// A hardware resource unit.
// Uses virtualized identifiers <container id><hw id> 
// and a fixed allocation scheme.
class off_HWResUnit : public off_ResUnit {
  // Gets the identifier for this unit.
  off_eu_id_t get_id(void) const;
Defines off_HWResUnit (links are to index).

\subsection{Hardware resources for plain users}

Users can ask for the identifier of a given resouce unit.

<Off hardware resource unit for users.>=
ENTRY class off_uHWResUnit : public off_uResUnit {
  off_eu_id_t get_id(const off_Rights &r) const;
Defines off_uHWResUnit (links are to index).

Compound hardware resources do not export anything.

<Off hardware resource container for users. >=
ENTRY class off_uHWCompResource : public off_uCompResource {
Defines off_uHWCompResource (links are to index).

2.5 Abstract resources


Shuttles, Portals, and DTLBs (see 1.2) are system abstractions, and are able to move around. Consequently, abstract resource containers (AbsCompResources) provide a method make_avaiable which uses a relocation table so that system code may request to one of the system containers to make a particular resource unit available at the local node.

They are arranged, as physical resources are, as recursive containers being the Node the top-level container (see figure 2.8). \begin{figure}[htb]

System abstract resource containers and elementary units.


<Off system server. >=
// A system resource container. 
class off_AbsCompResource : public off_CompResource {
  // System containers use a block allocator and  a relocation cache. 
  // We may wish to fetch a resource unit 
  void make_available(off_id_t u);
Defines off_AbsCompResource (links are to index).

<Off system resource. >=
// A system resource unit. It's relocatable by nature. 
class off_AbsResUnit : public off_ResUnit {
  off_id_t get_id(void) const;
Defines off_AbsResUnit (links are to index).

\subsection{Abstract resources for plain users}

The wrappers for abstract resources are so simple that we will not discuss them.

<Off system server for users. >=
ENTRY class off_uAbsCompResource : public off_uCompResource {
Defines off_uAbsCompResource (links are to index).

<Off system resource for users. >=
ENTRY class off_uAbsResUnit : public off_uResUnit {
  off_id_t get_id(void) const;
Defines off_uAbsResUnit (links are to index).

2.6 Domains and resource allocation


The kernel does not implements either ``processes'' nor any other ``resource-container'' abstractions. Users allocate system resources and release them back to the kernel when they are no longer needed. None of the system resources is responsible for containing the resources of a given application.

Besides, as we will see, the system resource that is the most similar to the traditional process abstraction (the ``shuttle'') does not contain the resources accessed during its execution. Thus, how are bookkeeping tasks accomplished? And, how are resources released upon application crashes?

Certainly, a user-level (i.e. a non-tex2html_wrap_inline4297kernel) process or ``application'' abstraction is needed. To help on this task, the kernel delegates resource bookkeeping to its users.

Users should implement resource domains containing a set of resources used by a user application. This concept is introduced in the system in order to help solving the three following problems:

Whenever a user issues a resource allocation request, it provides a target domain identifier tdi for that new resource. Then, the kernel entitles that resource to the domain identified by tdi (i.e. sets the r_domain of the Resource being allocated to the supplied domain identifier -- as defined in 2.3).

When the resource gets finally unused, the kernel notifies that using its domain identifier as a portal name, so that users could update their allocation tables. Thus, domain identifiers must be portal identifiers.

As we said, the meaning of a domain is defined by the user, not by the kernel (although the DM can be loaded into the kernel for efficiency).

The 2K operating system will utilize this basic mechanism for implementing the concept of nested domains. From a single abstraction, it will be possible to implement replacements for traditional OS abstractions like group of users, user, process group, process, and thread. This will be possible because nothing forbids domains to persist, so that a permanent user can be seen as a persistent domain.

2K will also be able to support new abstractions like, for example, temporary users. They would be implemented as non-persistent domains and could be created by regular users inheriting a subset of the creator permissions. This would add flexibility compared to existing systems in which usually only a ``super-user'' is allowed to create new users.

A resource could be also logically moved from one domain to another by changing its r_domain. There is nothing in the system architecture forbidding such feature.

3 The node

3.1 The node interface

A Node object is responsible for bringing the system into operation and also for shutting it down (i.e. to either halt, reboot, or suspend it). Node is also a singletongif used as a placeholder for node-wide system properties (e.g. the portal for the authentication server used to establish trust relationships among distributed kernels (auth), and the portal for the external data translator used to translate foreign kernel objects to the native architecture (xdt) are placed here.)

Being the outer-most starting point in the container hierarchy, its id (obtained from the method get_id) is all the user needs to know to start looking for available system resources.

<Off node. >=
// An Off Node. 
// Can be also handled as a resource: freeze, melt, etc. 
class off_Node : public off_Resource { 
   <Off private members for sequencing objects. >
   <Other off_Node private members. > 
   <Off protected members for sequencing objects. >
   <Other off_Node protected methods. > 
  // Returns the node identifier
  off_id_t   get_id(void) const;

  // Returns or sets the authorization server and
  // the external data translator portals.
  off_prtl_t get_auth(void) const;
  off_prtl_t get_xdt(void)  const;
  void       set_auth(off_prtl_t p);
  void       set_xdt(off_prtl_t  p);

  <Other off_Node public methods. > 

  // Halts, reboots, or suspends this node
  void    halt( char *msg="System On." );
  void    reboot( void );
  void    suspend( void );
Defines off_Node (links are to index).

The Node can be also considered to be a façadegif (see  [11]) for node-wide system operations.


To support efficient browsing of node components, the Node has a set of methods that provide access to its components. Most of them will be simply inlined.

For example, for memory banks we have:

<Other off_Node protected methods. >= (<-U) [D->]
   // Returns the number of specific containers found at this node
   inline natural_t get_num_mbanks(void) const;
   // Returns a pointer to a local memory bank. 
   inline off_MBank   *get_mbank( natural_t id = 0 ) const;

The same holds to gain access to remaining resource containers.

<Other off_Node protected methods. >+= (<-U) [<-D]
   // Get a pointer to a hardware resource container
   inline off_IOBank  *get_iobank( void )   const;
   inline off_DMA     *get_dma( void )      const;
   inline off_ProcTbl *get_proctbl( void )  const;
   inline off_ShtlSrv *get_shtlsrv( void )  const;
   inline off_PrtlSrv *get_prtlsrv( void )  const;
   inline off_DMM     *get_dmm( void )      const;

Besides, a generic navigator specialized for the Node will support the common interface for navigation. That is primarily for system users and not for the kernel itself.


There are some global operations that do not fit well anywhere else.

The following one, for example, redirects system output within the kernel to the serial line. Output can be redirected at any time.

<Other off_Node public methods. >= (<-U)
  // Use the serial console? 
  void use_serial_console( boolean_t doit = TRUE );

\section{Nodes for plain users}

The wrapper for node services export almost everything as can be seen.

<Off node for users. >=
ENTRY class off_uNode : public off_uResource{
  off_prtl_t get_auth(const off_Rights &r) const;
  off_prtl_t get_xdt(const off_Rights &r)  const;
  void       set_auth(off_prtl_t p, const off_Rights &r);
  void       set_xdt(off_prtl_t  p, const off_Rights &r);
  void    halt( char *msg="System On." );
  void    reboot( const off_Rights &r );
  void    suspend( const off_Rights &r );

  off_IOBank  *get_iobank( const off_Rights &r )   const;
  off_DMA     *get_dma( const off_Rights &r )      const;
  off_ProcTbl *get_proctbl( const off_Rights &r )  const;
  off_ShtlSrv *get_shtlsrv( const off_Rights &r )  const;
  off_PrtlSrv *get_prtlsrv( const off_Rights &r )  const;
  off_DMM     *get_dmm( const off_Rights &r )      const;
  // Use the serial console? 
  void use_serial_console( const off_Rights &r, boolean_t doit = TRUE );
Defines off_uNode (links are to index).

\section{System booting}

The system entry point, which is called by the OSKit after basic hardware initialization is named main. Its arguments proceed from the secondary boot loader.

<Off main entry point. >=
// main system entry point. 
int main(int argc, char *argv[]);
Defines main (links are to index).

4 Exporting the hardware

\section{Memory banks}

Physical memory is distributed among different memory containers named mbanks. Each mbank supports allocation and deallocation of physical memory inside a single memory bank. To do so, it provides the alloc and free methods. Also, it provides means to locate a page frame either by address or by page frame number (PFN).

<Off memory bank. >=
// A bank of memory
class off_MBank : public off_HWCompResource {
  // Allocates n contiguous page frames. 
  off_PFrame *alloc(off_Protection *prot, 
                    natural_t n=1, off_pg_id_t at=OFF_EU_ID_NULL);
  // Deallocates the n contiguous page frames starting at pf. 
  void free(off_PFrame *pf, natural_t n=1 );

  //Gets a page frame from its physical address
 off_PFrame *operator[](off_pg_id_t pfn) const;

  //Gets the size of pages inside the bank
  vm_size_t get_pgsize(void);
Defines off_MBank (links are to index).


<Off page frame identifier. >= [D->]
typedef off_eu_id_t off_pg_id_t;
Defines off_pg_id_t; (links are to index).

PFrame maintains the information needed about each page frame installed in the system. Their methods are just a means to access and update such state.

<Off page frame. >=

class off_PFrame : public off_HWResUnit {
  // Gets the page frame bits.
  pg_bits_t get_bits(void);

  // Sets the page frame bits and returns the old ones. 
  pg_bits_t set_bits(pg_bits_t b);
Defines off_PFrame (links are to index).

where pg_bits_t correspond to the bits maintained by the address translation machinery.

<Off page frame bits data type. >=
typedef pt_entry_t pg_bits_t;
Defines pg_bits_t; (links are to index).

\subsection{Memory banks and page frames for plain users}

These are the wrappers for memory banks and page frames:

<Off memory bank for users. >=
ENTRY class off_uMBank : public off_uHWCompResource {
 off_uPFrame *alloc(off_Protection *prot, 
                   natural_t n=1, off_pg_id_t at=OFF_EU_ID_NULL);
 void free(off_uPFrame *pf, const off_Rights &r, natural_t n=1 );
 off_uPFrame *fetch(off_pg_id_t pfn,const off_Rights &r) const;
 vm_size_t get_pgsize(const off_Rights &r);

Defines off_uMBank (links are to index).

<Off page frame for users. >=
ENTRY class off_uPFrame : off_uHWResUnit {
  pg_bits_t get_bits(const off_Rights &r);
  pg_bits_t set_bits(pg_bits_t b, const off_Rights &r);
Defines off_uPFrame (links are to index).


<Off page frame identifier. >+= [<-D]
typedef off_eu_id_t off_pg_id_t;
Defines off_pg_id_t (links are to index).

%% --------------------------------------------------------------

4.2 Input/Output

IO bank objects are merely used to allocate ports, as I/O should be able to operate (when permitted) from user level.

<Off IO bank. >=
//An IO bank
class off_IOBank : public off_HWCompResource {
  // Allocates n contiguous IO ports. 
  off_IOPort *alloc(off_Protection *prot,natural_t n=1,
                    off_io_id_t at=OFF_EU_ID_NULL);
  // Deallocates the n contiguous IO ports starting at iop. 
  void free( off_IOPort *iop, natural_t n=1 );
  //Gets an IO port from its port number.
  off_IOPort *operator[](off_io_id_t iop) const;
Defines off_IOBank (links are to index).


<Off IO identifier. >=
typedef off_eu_id_t off_io_id_t;
Defines off_io_id_t (links are to index).

<Off IO port. >=
// An IO port
class off_IOPort : public off_HWResUnit {
  // Input/Output 8, 16, 32 or 64 bit words through the port.
  unsigned8_t   in8(void);
  void          out8(unsigned8_t o);

  unsigned16_t  in16(void);
  void          out16(unsigned16_t o);

  unsigned32_t  in32(void);
  void          out32(unsigned32_t o);
  unsigned64_t  in64(void);
  void          out64(unsigned64_t o);
Defines off_IOPort (links are to index).

\subsection{Input/Output for plain users}

Users may access IO ports using these wrappers.

<Off IO bank for users. >=
ENTRY class off_uIOBank : public off_uHWCompResource {
  off_uIOPort *alloc(off_Protection *prot,natural_t n=1,
            off_io_id_t at=OFF_EU_ID_NULL);
  void free( off_uIOPort *iop, const off_Rights &r, natural_t n=1 );
  off_uIOPort *fetch(off_io_id_t iop,const off_Rights &r) const;
Defines off_uIOBank (links are to index).

<Off IO port for users. >=
ENTRY class off_uIOPort: public off_uHWResUnit {
  unsigned8_t   in8(const off_Rights &r,);
  void          out8(unsigned8_t o,const off_Rights &r);

  unsigned16_t  in16(const off_Rights &r);
  void          out16(unsigned16_t o,const off_Rights &r);

  unsigned32_t  in32(const off_Rights &r);
  void          out32(unsigned32_t o,const off_Rights &r);
  unsigned64_t  in64(const off_Rights &r);
  void          out64(unsigned64_t o,const off_Rights &r);
Defines off_uIOPort (links are to index).

%% --------------------------------------------------------------

4.3 Traps and interrupts

  Both traps and interrupts are allocated using per-processor event tables. Each processor will have an associated a trap table and an associated interrupt table. Traps may be caused by the hardware or by the tex2html_wrap_inline4297kernel itself. Software interrupts are also available.

<Off event table. >=
// An event table 
class off_EventTbl : public off_HWCompResource {
  // Allocates n contiguous events. 
  off_Event *alloc( off_Protection *prot,  natural_t n=1, 
                    off_ev_id_t at=OFF_EU_ID_NULL );
  // Deallocates the n contiguous events starting at t. 
  void free( off_Event *t, natural_t n=1 );
  //Gets an event from its  number.
  off_Event *fetch(off_ev_id_t n) const;
Defines off_EventTbl (links are to index).


<Off event identifier. >=
typedef off_eu_id_t off_ev_id_t;
Defines off_ev_id_t (links are to index).

<Off trap table. >=
// Definition of traps for the current processor
class off_TrapTbl: public off_EventTbl {
  // Allocates n contiguous Traps. 
  off_Trap *alloc( off_Protection *prot, natural_t n=1, 
                   off_ev_id_t at=OFF_EU_ID_NULL );
  // Deallocates the n contiguous traps starting at t. 
  void free( off_Trap *t, natural_t n=1 );
  //Gets a trap from its  number.
  off_Trap *operator [](off_ev_id_t n) const;
Defines off_TrapTbl (links are to index).

<Off interrupt table. >=
// Definition of interrupts for the current processor.
class off_IntTbl: public off_EventTbl {
  // Allocates n contiguous interrupts. 
  off_Irq *alloc( off_Protection *prot,  natural_t n=1, 
                  off_ev_id_t at=OFF_EU_ID_NULL );
  // Deallocates the n contiguous interrupts starting at i. 
  void free( off_Irq *i, natural_t n=1 );
  //Gets an Irq from its  number.
  off_Irq *operator [](off_ev_id_t n) const;
Defines off_IntTbl (links are to index).

Event tables are a set of Events. Each event is identified by an unique id and a reason.

<Off event. >=
// An event, it could be a trap, an interrupt, ...
class off_Event: public off_HWResUnit {

  // Sets h as the handler for this event
  void set_handler( off_EventHandler h);
  // Gets the handler.
  off_EventHandler get_handler(void);
Defines off_Event (links are to index).

Events are handled by EventHandlers which can rely either on portals or, when delivered inside the kernel, on function calls.

<Off event handler. >=
class off_EventHandler {
  // Calls the handler
  virtual void operator()( off_event_t ev, off_ev_reason_t reason, ... )=0;

class off_FnEventHandler : public off_EventHandler {
  void (*handler)(off_event_t ev, off_ev_reason_t reason, ...);
  // Calls the handler
  virtual void operator()( off_event_t ev, off_ev_reason_t reason, ... );

class off_PrtlEventHandler : public off_EventHandler {
  // Its handler is
  //    void (*handler)(off_event_t ev, off_ev_reason_t reason, ...);
  off_prtl_id_t handler;
  // Calls the handler
  virtual void operator()( off_event_t ev, off_ev_reason_t reason, ... );

Defines off_EventHandler, off_FnEventHandler, off_PrtlEventHandler (links are to index).


<Off events and reasons data types. >=
typedef natural_t off_event_t;
typedef natural_t off_reason_t;
Defines off_event_t, off_reason_t (links are to index).

In turn, events may be either Traps or Irqs (interrupts).

<Off Trap. >=
class off_Trap: public off_Event {
  off_mdepTrap &t_mdep;
Defines off_Trap (links are to index).

Although traps cannot be raised artificially, interrupts can (e.g. to notify users or interrupts ``caused by the kernel'', and not by the hardware).

<Off Irq. >=
class off_Irq: public off_Event {
  off_mdepIrq &i_mdep;
  void raise(void);
  <Other public methods of off_Irq. >

Defines off_Irq (links are to index).

Whenever an interrupt happens, we must choose either to evict the current shuttle to dispatch the interrupt to its handler or to defer interrupt delivering in favor of the current shuttle. To make a choice, both shuttles and interrupts have priority-levels. See section 5.1.3 for a discussion of interrupt priorities.

<Other public methods of off_Irq. >= (<-U)
   //Sets/Gets the priority level for this interrupt. 
   void set_prty(off_pl_t prty);
   off_pl_t get_prty(void);


<Off interrupt priority level data type. >=
typedef natural_t off_pl_t;
Defines off_pl_t (links are to index).

We have been seeing a few machine dependent objects like off_mdepEvState, off_mdepTrap, and off_mdepIrq. Those are defined for a particular architecture and their meaning should be clear by the context. Machine dependent objects are always prefixed by off_mdep and will be described along with the system implementation.

Finally, there are some ``traps'' defined by Off++ and not by the hardware, some of them have appeared before and some will appear later on.

<Off Virtual traps. >=
// Virtual (Off++ raised) exceptions
enum off_ex_t { OFF_EX_UNUSED,       // resource no longer referenced
                OFF_EX_MISSING,      // resource missing (remote?)
                OFF_EX_UNAVAILABLE,  // resource exhaust
                OFF_EX_RELOC,        // resource has moved
                OFF_EX_XDT,          // translation to native format needed
                OFF_EX_FWD,          // request should be processed remotely
        OFF_EX_SCERR         // system call failed

4.3.1 Traps and interrupts for plain users

On the one hand, users can interact with trap and interrupt tables through methods provided by Shuttles and Processors. Thus, Shuttle (see section 5.1) and and Processor (see section 4.5) behave as a facade because they provide simple entry points to handle both trap and interrupt tables.

On the other hand, individual events (trap and interrupt entries) are exported to users by means of a wrapper class.

<Off event for users. >=
ENTRY class off_uEvent : public off_uHWResUnit {
   void set_handler( off_prtl_id_t h, const off_Rights &r);
   off_prtl_id_t get_handler(const off_Rights &r) const;
Defines off_uEvent (links are to index).

There is nothing specific for user traps as of this day.

<Off Trap for users. >=
ENTRY class off_uTrap : public off_uEvent {
Defines off_uTrap (links are to index).

On the other hand, interrupts export a few methods.

<Off Irq for users. >=
ENTRY class off_uIrq: public off_uEvent {
   void raise(const off_Rights &r);
   void set_prty(off_pl_t prty, const off_Rights &r);
   off_pl_t get_prty(const off_Rights &r) const;
Defines off_uIrq (links are to index).

\section{DMA lines}

DMA objects are used to allocate DMALines. Note that DMA lines are kept in the kernel although they could have been considered to be a device (and kept in userland). The reason is efficiency, because we do not want protection domain crossings just to program a DMA line. Also, we believe that no flexibility is lost by placing this small piece of code into the kernel.

<Off DMA table. >=
// Allocation of DMA lines
class off_DMA: public off_HWCompResource {
  // Allocates n DMA lines. 
  off_DMALine *alloc(off_Protection *prot,natural_t n=1,
                     off_dma_id_t at=OFF_EU_ID_NULL );
  // Deallocates the n contiguous DMA lines starting at d. 
  void free( off_DMALine *d, natural_t n=1 );
  //Gets a DMA line from its  number.
  off_DMALine *operator [](natural_t n) const;
Defines off_DMA (links are to index).


<Off DMA line identifier. >=
typedef off_eu_id_t off_dma_id_t;
Defines off_dma_id_t (links are to index).

DMA lines are very not well modeled, instead we provided a model of DMA pretty close to that of Intel based machines. Future Off portings to different architectures should provide feedback on how should DMA lines be modeled.

<Off DMA line. >=
typedef off_mode_t dma_mode_t;
class off_DMAline: public off_HWResUnit {
   // Enables/disables this dma line.
   void enable(void);
   void disable(void);
   // Programs a DMA operation. 
   void program(off_pg_id_t addr, vm_offset_t length, dma_mode_t mode);
   // Gets # of copied or pending bytes. 
   vm_offset_t get_copied(void);
   vm_offset_t get_pending(void);
Defines off_DMALine (links are to index).


<Off DMA line mode data type. >=
typedef natural_t dma_mode_t;
Defines dma_mode_t (links are to index).

\subsection{DMA lines for plain users}

DMA related wrappers are straightforward.

<Off DMA table for users. >=
ENTRY class off_uDMA : public off_uHWCompResource {
 off_uDMALine *alloc(off_Protection *prot, 
                     natural_t n=1, off_dma_id_t at=OFF_EU_ID_NULL);
 void free(off_uDMALine *pf, const off_Rights &r, natural_t n=1 );
 off_uDMALine *fetch(off_dma_id_t d,const off_Rights &r) const;
Defines off_uDMA (links are to index).

<Off DMA line for users. >=
ENTRY class off_uDMALine : public off_uHWResUnit {
   // Enables/disables this dma line.
   void enable(const off_Rights &r);
   void disable(const off_Rights &r);
   // Programs a DMA operation. 
   void program(off_pg_id_t addr, vm_offset_t length, dma_mode_t mode,
        const off_Rights &r);
   // Gets # of copied or pending bytes. 
   vm_offset_t get_copied(void,const off_Rights &r);
   vm_offset_t get_pending(void,const off_Rights &r);
Defines off_uDMALine (links are to index).

4.5 Processors and processor pools

  The processor table can be used to allocate a whole processor to certain processes. Usually the node singleton will be the owner of every processor in the system and will delegate ownership to user level (or in-kernel) schedulers.

Processor pools provide Processors. Usually a processor will be allocated to a user-level (or dynamically loaded in-kernel) scheduler which will in turn implement a processor allocation policy.

<Off processor table. >=
class off_ProcTbl: public off_HWCompResource {
  // Allocates n processors. 
  off_Processor *alloc(off_Protection *prot,natural_t n=1,
                       off_proc_id_t at=OFF_EU_ID_NULL);
  // Deallocates the n contiguous processors starting at p. 
  void free( off_Processor *p, natural_t n=1 );
  //Gets a processor from its  number.
  off_Processor *operator [](off_proc_id_t n) const;
Defines off_ProcTbl (links are to index).


<Off processor identifier. >=
typedef off_eu_id_t off_proc_id_t;
Defines off_proc_id_t (links are to index).

To help schedulers, processors provide some methods which we now discuss.

First, a new shuttle can be explicitly installed in the processor using the switch_to method. This is to avoid using run queues when they are not needed. Thus, multiprocessing is an optional feature. Dedicated single-threaded nodes can avoid using run queues at all (no processor multiplexing code will run).

When they are, a run queue is an ordered collection of shuttle identifiers. Each identifier will be accompanied by the number of clock ticks it is expected to run.

The queue will be run in a round robin fashion until the occurrence of an event of interest to the scheduler. To let the processor know when to stop the run queue and notify to the scheduler notify should be used. It specifies which events are to cause the stop and also, who should be notified (either by portal or by direct function call when inside the kernel).

In any case, get_current will return a pointer to shuttle being run (or to the last shuttle run, when idle).

<Off processor. >=
class off_Processor : public off_HWResUnit {
   <Other private members of off_Processor. >
   // Installs a new shuttle
   void switch_to(off_Shtl &new);
   // Executes a run queue
   void run(off_shtl_id_t s[], off_offset_t ticks[], natural_t n);
   // Arranges for the processor to notify the owner on this events
   void notify( off_sevent_t events, off_EventHandler sched);
   // Returns the current shuttle
   off_shtl_id_t get_current(void);

  <Other public methods of off_Processor. >
Defines off_Processor (links are to index).

After the processor has invoked the scheduler to notify the event, the scheduler may either install a new run queue or just adjust the members of the current one. By now, the only events defined are these ones (NB. these are not hardware events, but scheduling events).

<Off scheduling events. >=
// Scheduling related events.
enum off_sevent_t { OFF_SEV_TRAP=0x01,// Trap occurred.
                    OFF_SEV_IRQ =0x02, // Interrupt occurred.
                    OFF_SEV_YLD =0x04, // Current shuttle blocked.
                    OFF_SEV_BLK =0x08, // Other than current shuttle blocked.
                    OFF_SEV_AWK =0x10, // A shuttle got awakened.
                    OFF_SEV_END =0x20 // Run queue exhausted.  
Defines off_sevent_t (links are to index).

Although it is an implementation issue, we will say that the run queue should be kept in the memory of the scheduler. The page the queue is in should be locked so that the processor could maintain a pointer to the run queue. In this way a user level scheduler may adjust the run queue just by writing its own memory without further system calls.

Finally, as we said when we discussed traps and interrupts, each Processor contains a couple of event tables, one for traps and another one for interrupts.

<Other private members of off_Processor. >= (<-U)
   off_TrapTbl p_traps;         // Processor trap table.
   off_IntTbl  p_irqs;          // Processor interrupt table.

The trap and interrupt table for a given processor can be obtained with a public method of the Processor class:

<Other public methods of off_Processor. >= (<-U)
    //Returns a reference to the trap/interrupt event table.
    off_TrapTbl *&get_traps_ref(void);
    off_IntTbl  *&get_irqs_ref(void);

\subsection{Processors and processor pools for plain users}

The processor pool wrapper is defined as follows

<Off processor table for users. >=
class off_uProcTbl : public off_uHWCompResource {
 off_uProcessor *alloc(off_Protection *prot, 
                       natural_t n=1, off_proc_id_t at=OFF_EU_ID_NULL);
 void free(off_uProcessor *pf, const off_Rights &r, natural_t n=1 );
 off_uProcessor *fetch(off_proc_id_t d,const off_Rights &r) const;
Defines off_uProcTbl (links are to index).

Processors for users also include a few methods coming from trap and interrupt tables.

<Off processor for users. >=
ENTRY class off_uProcessor : public off_uHWResUnit {
   // Installs a new shuttle
   void switch_to(off_shtl_id_t &new,
                  const off_Rights &proc_r, const off_Rights &shtl_r);
   // Executes a run queue
   void run(off_shtl_id_t s[], off_offset_t ticks[], natural_t n,
            const off_Rights &proc_r, const off_Rights &shtl_r[]);
   // Arranges for the processor to notify the owner on this events
   void notify( off_sevent_t events, off_prtl_id_t sched,
                const off_Rights &proc_t);
   // Returns the current shuttle
   off_shtl_id_t get_current(const off_Rights &proc_r);

   off_uTrap *alloc_trap(off_Protection *prot, 
                    natural_t n=1, off_ev_id_t at=OFF_EU_ID_NULL);
   void free_trap(off_uTrap *t, const off_Rights &r, natural_t n=1 );
   off_uTrap *fetch_trap(off_ev_id_t t,const off_Rights &r) const;
   off_uIrq *alloc_irq(off_Protection *prot, 
                    natural_t n=1, off_ev_id_t at=OFF_EU_ID_NULL);
   void free_irq(off_uIrq *t, const off_Rights &r, natural_t n=1 );
   off_uIrq *fetch_irq(off_ev_id_t t,const off_Rights &r) const;

Defines off_uProcessor (links are to index).

5 Implementing abstract resources

The overall picture is depicted in figure 5.1. The relationships there depicted will become clear as this chapter proceeds.

Figure 5.1: Relationship among Shuttles, Portals, DTLBs and other resources.


5.1 Shuttles


Shuttles provide flows of control which can be extended to support user-level or in-kernel process abstractions. The ShtlSrv (there is one per node) maintains a pool of shuttles. Thus, the ShtlSrv main task is to allocate shuttles.

<Off shuttle server. >=
// A shuttle server
class off_ShtlSrv: public off_AbsCompResource {
  // Allocates a shuttle.
  off_Shtl *alloc(const off_Protection &prot,
                  natural_t n=1, off_shtl_id_t at=OFF_SHTL_NULL);
  // Deallocates a shuttle.
  void free(off_shtl_id_t *p, natural_t n=1);

  // Locates a shuttle by its number. 
  off_Shtl *operator [](off_shtl_id_t id);

  <Other public methods of off_ShtlSrv. >
Defines off_ShtlSrv (links are to index).

A shuttle has a processor context and a set of properties (as described below) and it is the only schedulable entity provided by the kernel. Shuttles run when installed in the run queue of a given processor (as we saw in section 4.5).

As we will see in the next chapter, shuttles move from one protection domain to another, including the kernel protection domain, by means of portals. Besides, the use of upcalls from kernel to user space means that the set of kernel and user activationsgif (see figure 5.2) will be mixed. Thus, a user shuttle may call the kernel and the kernel my upcall back to the user during the system call; the upcall routine may in turn issue new system calls, and so on.

Figure 5.2: Shuttle activations: system calls and up-calls.

We may be interested in three different processor contexts when referring to a shuttle:

  1. The current register set, i.e. the last saved register set which holds register values at the time the shuttle was last preempted.

    This is the one used when we are interested in the ``current'' state of the shuttle, no matter what its privilege level is.

  2. The last user register set, i.e. the topmost user context saved in the stack of activations (system calls/upcalls).

    This is the one used when we are interested either in the user state at the last system call or in the state to be returned to the user.

  3. The last kernel register set, i.e. the topmost kernel context saved in the stack of activations (system calls/upcalls).

    This is the one used when we are interested in the state of a shuttle while proceeding inside the kernel.

The first register set will always match one of the two latter ones.

<Off off_Shtl register accessors. >= (U->)
  // Get a pointer to current, last user, or last kernel
  // registers in the stack.
  off_mdepPRegs *get_regref(void);
  off_mdepPRegs *get_uregref(void);
  off_mdepPRegs *get_kregref(void);

A reference to the current processor context can be obtained with get_regref. Last user and kernelgif registers are accessible by means of get_uregref and get_kregref respectively. As these methods return a reference to the register storage, no ``set_'' methods are provided.

Shuttles may also wait on abstract system resources due to some reason. When a shuttle goes to sleep it should call block. Later on, someone else may call ready to unblock it. These methods may cause scheduling events (SEV_BLK, SEV_YLD, and SEV_AWK -- see section 4.5) to be delivered to the processor scheduler so that it could reschedule if desired.

Another effect of calling block is that (even if no event is ever raised) the shuttle will be ignored by its processor until a subsequent call to ready is performed. The Processor where the shuttle is running checks the shuttle's blocked flag and ignores the shuttle if it is set. The blocked flag means ``unable to run'' and can be used not only to block a shuttle, but also during shuttle initialization, during shuttle dismantling, etc. It should not be confused with the more abstract ``blocked'' state used in traditional process abstractions.

<Off shuttle. >=
// A shuttle. 
class off_Shtl : public off_AbsResUnit {
  <Other off_Shtl private members. >
  <Off off_Shtl register accessors. >
  <Off off_Shtl property accessors. >

  // Blocks due to some reason related to the given resource.
  // [May notify the scheduler]
  void block(off_id_t culprit, off_object_t culprit_type); 
  // Awakes this shuttle.
  // [May notify the scheduler]
  void ready(void);

  // Is this shuttle blocked?
  inline boolean_t is_blocked(void) const;

  <Other public methods of off_Shtl. >
Defines off_Shtl (links are to index).

Shuttles must not run at the same time on more than one processor. To ensure it, running shuttles are flagged and such a flag is checked by every processor before switching to a new shuttle. Two new methods do the job:

<Other public methods of off_Shtl. >= (<-U)
   // Checks/Sets if this shuttle is running.
   inline boolean_t is_running(void);
   inline void run(void);
   inline void stop(void);

\subsection{Shuttles for plain users. }

<Off shuttle server for users. >=
ENTRY class off_uShtlSrv: public off_uAbsCompResource {
  // Allocates a shuttle.
  off_uShtl *alloc(const off_Protection &prot,
                   natural_t n=1, off_shtl_id_t at=OFF_SHTL_NULL);
  // Deallocates a shuttle.
  void free(off_uShtl *p, natural_t n=1, const off_Rights &r);

  // Locates a shuttle by its number. 
  off_uShtl *fetch(off_shtl_id_t id, const off_Rights &r);

  <Other public methods of off_uShtlSrv. >
Defines off_uShtlSrv (links are to index).

<Off shuttle for users. >=
ENTRY class off_uShtl : public off_uAbsResUnit {
  off_mdepPRegs get_reg(const off_Rights &r);
  off_mdepPRegs get_ureg(const off_Rights &r);
  off_mdepPRegs get_kreg(const off_Rights &r);
  void set_reg( off_mdepPRegs regs, const off_Rights &r);
  void set_ureg(off_mdepPRegs regs, const off_Rights &r);
  void set_kreg(off_mdepPRegs regs, const off_Rights &r);

  <Off off_uShtl property accessors. >

  // Blocks due to some reason related to the given resource.
  // [May notify the scheduler]
  void block(off_prtl_id_t culprit, const off_Rights &r); 
  // Awakes this shuttle
  // [May notify the scheduler]
  void ready(const off_Rights &r);

  // Is this shuttle blocked?
  boolean_t is_blocked(const off_Rights &r) const;

  <Other public methods of off_uShtl. >

Defines off_uShtl (links are to index).

\subsection{Shuttle properties}

Properties are used to specify which resources (e.g. DTLBs, IO maps, etc.) should be readily available for the shuttle to run. Specific resource instances are identified by a property value.

Properties are named by values of type off_prop_t and their values by system identifiers of type off_id_t.

<Off shuttle property identifer. >=
// Properties
typedef natural_t off_prop_t;
Defines off_prop_t (links are to index).

The context switch code is guaranteed to call a property switch function pswitch for every property changing its value (i.e. for every resource whose instance must be changed) in the context switch.

Two additional functions, pset and pclr, are provided for those cases when the hardware is switching the context automatically but some work should be done to install (or deinstall) a property value in a given shuttle.

Any resource provider implementing this interfacegif can be considered to be a property provider.

<Off shuttle property server. >=
// The interface of a property server. 
signature off_ShtlPropSrv {

  // Switches property values. 
  // Returns either 0 or an error code.
  int pswitch(off_id_t from, off_id_t to, off_Shtl &s);
  // Set or clear the property at a given shuttle. 
  int pset(off_id_t pval, off_Shtl &s, const off_Rights &r);
  int pclr(off_id_t pval, off_Shtl &s);

  // Should we call to pswitch?
  boolean_t needs_switch(void);
Defines off_ShtlPropSrv (links are to index).

needs_switch method is provided to save some calls to pswitch. Those properties which don't need to implement pswitch may provide a trivial (i.e. empty) implementation and arrange for needs_switch to return true. When a new property is being defined, the shuttle server accepting the definition will use needs_switch to determine whether it should arrange for pswitch to be called or not.

To define the property being implemented as a shuttle property, a property implementor must contact ask the shuttle server to define it. The ShtlSrv must include now this method to support property definitions. Indeed, we include two new methods. One to define properties implemented inside the tex2html_wrap_inline4297kernel (which may use a simple procedure call) and another one to define properties which might be implemented outside the tex2html_wrap_inline4297kernel (which must use portals).

<Other public methods of off_ShtlSrv. >= (<-U)
   // Defines a (new) shuttle property.
   void define_kprop(off_ShtlPropSrv &implementor, 
                     off_prop_t id, off_Protection &p);
   void define_uprop(off_prtl_id_t   implementor, 
                     off_prop_t id, off_Protection &p);
   void undefine_prop(off_prop_t id, const off_Rights &r);

If foreign shuttle servers ever find a shuttle with an unknown property being used, they will ask to other shuttle servers found in the net for the property definition.

Property values are stored in property sets, of type ShtlPSet. A ShtlPSet is an ordered (by property id) collection of property values so that context switch could be done quickly. To save some more time in context switches, ShtlPSets may be shared. Thus they must be reference counted.

<Off shuttle property set. >=
// A set of property values
class off_ShtlPSet {
  <Off private members for reference counting objects. >
  <Off protected methods for reference counting objects. >
  <Off public methods for reference counting objects. >
  // Adds a new property to this set.
  void add(off_prop_t p, off_id_t val);
  // Deletes a new property to this set.
  void del(off_prop_t p);
  // Gets the value of the given property or OFF_SHTL_NULL. 
  off_id_t operator[](off_prop_t p) const;

  // Gets the number of the last property in the set.
  off_prop_t get_maxprop(void);
Defines off_ShtlPSet (links are to index).

Each shuttle has its own (possibly shared) ShtlPSet which can be accessed by the accessor get_psetref.

<Off off_Shtl property accessors. >= (<-U) [D->]
  inline off_ShtlPSet *&get_psetref(void);

Some time can be saved in context switches avoiding the comparison of property arrays. Thus, we do not include the ShtlPSet in the shuttle but, instead, a reference to it. So, property sets can be shared and a single pointer comparison can check if we have the same property arrays.

<Other off_Shtl private members. >= (<-U)
  off_ShtlPSet *s_pset;         // Shuttle properties.

To provide more convenient entry points for users, some methods are included to get and set property values.

<Off off_Shtl property accessors. >+= (<-U) [<-D->]
  // Gets/Sets the value for a given property.
  off_id_t get_prop(off_prop_t p) const;
  void set_prop(off_prop_t p, off_id_t val);

Another one will arrange for every property to be shared between two shuttles.

<Off off_Shtl property accessors. >+= (<-U) [<-D]
  // Resets the properties so that they are share with other shuttle. 
  void dup_props(const off_Shtl &s);

Not every property needs to call its switch function on context switch. Those properties which do not need switch functions can be placed last in the array and the comparison stop as soon as the first property without a switch function is reached.

Finally, although we have not mentioned it, properties can be used to support user-level properties by arranging a user level function to be called at the beginning of every quantum. That function can be used to establish user-level context pieces if needed (e.g. to set a pointer to per-thread data or to establish a temporary memory map so that every thread could have a private (secure) store). When such a property is being used, its property value might identify the (user-space) argument to the function setting the user-level context.

\subsubsection{Allocation of ShtlPSets}

Property sets are stored into a ShtlPSetSrv used internally by the shuttle server. It is considered to be a composite resource but it is neither a hardware resource container nor an abstract resource container.

<Off shuttle property sets server. >=
class off_ShtlPSetSrv : public off_CompResource {
  // Allocates a PSet
  off_ShtlPSet *alloc(void);
  // Deallocates a PSet.
  void free(off_ShtlPSet *p);
Defines off_ShtlPSetSrv (links are to index).

\subsubsection{Shuttle properties for plain users}

For system users, properties are handled by shuttles. Thus there are few more methods in uShtl.

<Other public methods of off_uShtl. >= (<-U) [D->]
  off_id_t get_prop(off_prop_t p, const off_Rights &r) const;
  void set_prop(off_prop_t p, off_id_t val, 
                const off_Rights &shtl_r, const off_Rights &prop_r);
  void dup_props(const off_shtl_id_t &s, 
                 const off_Rights &this_r, const off_Rights &s_r);

Property definition is also available for system users. The uShtlSrv provides this service.

<Other public methods of off_uShtlSrv. >= (<-U)
    void define_prop(off_prtl_id_t implementor, off_prop_t id,
                     const off_Protection &r);
    void undefine_prop(off_prop_t id, const off_Rights &r);

5.1.3 Shuttles and hardware events

\subsubsection{Shuttles and Traps}

As we have seen in section 4.5, each Processor includes both a TrapTbl and an IntTbl. As in certain circumstances we may desire to dispatch traps on a per-shuttle basis, there is a predefined shuttle property (implemented by the shuttle trap table server ShtlTrapTblSrv) defining those traps in which the current shuttle has interest.

<Off Shuttle trap table server. >=
class off_ShtlTrapTblSrv : public  off_CompResource {
  // Allocates a trap table.
  off_TrapTbl *alloc(void);
  // Deallocates a trap table. 
  void free(off_TrapTbl *t);
  // To implement the trap table property:

  // Switches property values. 
  // Returns either 0 or an error code.
  int pswitch(off_trapt_id_t from, off_trapt_id_t to, off_Shtl &s);
  // Set or clear the property at a given shuttle. 
  int pset(off_trapt_id_t pval, off_Shtl &s, const off_Rights &r);
  int pclr(off_trapt_id_t pval, off_Shtl &s);
  // Should we call to pswitch?
  boolean_t needs_switch(void);

Defines off_ShtlTrapTblSrv (links are to index).

Thus, to make certain traps to be ``redefined'' for certain shuttles, a TrapTbl should be allocated and installed as a property in the those shuttles. Note that those traps already ``allocated'' in the Processor trap table will not be defined on a per-shuttle basis. Whenever a trap occurs, the processor trap table is has priority over the shuttle trap table (eg. Shuttles may not bypass the DMM handling of page fault traps even if they include an entry for such a trap in their trap tables).

When used, trap tables are considered to be part of the shuttle itself: The trap server is never seen outside the shuttle server.

Shuttles and interrupts


Whenever an interrupt happens, we must choose either to evict the current shuttle to dispatch the interrupt to its handler or to defer interrupt delivering in favor of the current shuttle. To make a choice, both shuttles and interrupts have priority-levels. On processor interrupt, the lowergif priority wins (considering the executing shuttle priority level and the interrupt priority level).

To arbitrate, no shuttle may set a priority level lower than the lowest existing interrupt priority level. No interrupt may be adjusted to a priority level lower than the lowest existing shuttle interrupt-priority level. In few words, when users request a priority level (be it for a shuttle or for an interrupt) the policy is FCFS; e.g. If interrupt already i has a priority level n, no shuttle may set its interrupt priority level to numbers below n+1.

To implement this feature we must include priority levels as a shuttle property predefined by the shuttle server.

<Off Shuttle interrupt priority level server. >=
class off_ShtlIntPrtySrv {
  // Switches property values. 
  // Returns either 0 or an error code.
  int pswitch(off_pl_id_t from, off_pl_id_t to, off_Shtl &s);
  // Set or clear the property at a given shuttle. 
  int pset(off_pl_id_t pval, off_Shtl &s, const off_Rights &r);
  int pclr(off_pl_id_t pval, off_Shtl &s);
  // Should we call to pswitch?
  boolean_t needs_switch(void);
Defines off_ShtlIntPrtySrv (links are to index).

The interrupt priority is considered to be inside the shuttle. The interrupt priority server is never seen outside the shuttle server.

Shuttle traps and shuttle interrupt priorities for plain users

The trap table and the interrupt priority are considered (and seen) as predefined properties by system users. Methods get_prop, set_prop and dup_props are well behaved with respect to them.

With respect to interrupt priorities, users may set the value of the interrupt priority level property without further operations (they don't need to allocate priority levels as they are predefined).

However, Trap tables are neither allocated nor deallocated by users. In fact, trap tables are not even seen by users. To handle per-shuttle trap definitions, the shuttle acts as a facade for its users mimicing the behavior of Processor in this respect.

<Other public methods of off_uShtl. >+= (<-U) [<-D]
   off_uTrap *alloc_trap(off_Protection *prot, 
                    natural_t n=1, off_ev_id_t at=OFF_EU_ID_NULL);
   void free_trap(off_uTrap *t, const off_Rights &r, natural_t n=1 );
   off_uTrap *fetch_trap(off_ev_id_t t,const off_Rights &r) const;

% sp replaced by sp pools + final notes about access modes and stacks -Nemo

5.2 Portals

A portal is a communication endpoint. It can be attached to a handler and also invoked like an interrupt. Portals are contained in the portal server, an instance of the PrtlSrv class whose main task is to allocate portals.

<Off portal server. >=
class off_PrtlSrv : public off_AbsCompResource {
  // Allocates a portal.
  off_Prtl *alloc(const off_Protection &p, 
                  natural_t n=1, off_prtl_id_t at=OFF_PRTL_NULL);
  // Deallocates a portal
  void free(off_Prtl *p, natural_t n=1);
  // Gets a portal by its number
  off_Prtl *operator[](off_prtl_id_t p) const;
Defines off_PrtlSrv (links are to index).

Portals support both protected control transfers (PCTs) and asynchronous message delivering.

When used for PCTs, the sender shuttle changes its properties on the fly and becomes itself a flow of control in the receiver protection domain. After the handler completes, the shuttle recovers its original state and continues its execution.

Callees may, at any time, delegate the execution of PCTs to other servers. In this case, one single shuttle can cross several protection domains and come back directly to its original site. To delegate a PCT reply to a different server, pct_pass can be used.

Finally, when no reply is ever expected, deliver should be used instead. It will simply save (in the callee context) what users are unable to save safely (like eflags on Intel architectures) and install a simple activation frame for the handler on the receiver's shuttle. In this case, the caller will resume execution as soon as the callee has been notified; the message is sent but may be not delivered yet.

The handler information present in the portal is basically a program counter (pc), a pointer to a pool of stack pointers (spp), an access mode mode, and an (optional) set of properties (props). Each property value in props will cause the sender's property to be set to that value during portal invocation. Remaining sender properties will maintain their original value. As it is most likely that the address space property (see section 5.3) will be present, that property is factored out of the property array to gain some performance. On nodes where the tex2html_wrap_inline4297kernel is compiled without support for multiple protection domains, a null value is used the such property is ignored. When the props set is not present in the portal, it will not support PCTs but just asynchronous message delivering.

That is enough to support PCTs, but for asynchronous message delivering it is necessary that the portal have a shuttle identifier. Thus, when deliver is used, the handler (known by the portal information) will run not on the sender shuttle but on the receiver. If the receiver shuttle identifier is not specified, the portal will not support asynchronous delivering.

<Off portal. >=
class off_Prtl : public AbsResUnit {
  <Other off_Prtl private members. >
  // (Re)sets the portal handler.
  void set_hndlr(vm_offset_t pc, vm_offset_t spp, off_mode_t mode,
                 off_dtlb_id_t vas, natural_t maxmsgsz=0,
                 const off_ShtlPSet  *props=NULL, 
                 off_shtl_id_t s=OFF_SHTL_NULL);

  // (Locally) delivers a message to this portal using a PCT. 
  // (May reset the stack)
  void pct(off_Shtl &sender, 
           void *msg, natural_t msgsz, 
           void *reply, natural_t replysz, natural_t tmout) const; 

  // Pass the PCT to a different portal.
  // Reply to the initial PCT will be issued from there. (May reset the stack)
  void pct_pass(off_Shtl &sender, off_prtl_id_t p) const;

  // Delivers an event to a portal (one-way PCT).
  // (May reset the stack)
  void deliver(off_Shtl &sender, 
               natural_t rq_nwords, void *rq_msg, natural_t tmout) const;

Defines off_Prtl (links are to index).

Portals may be used to enforce object access control. When the handler is attached to the portal, the permitted access mode through this portal is specified. The kernel does not enforce it. Instead, the access mode will be placed in a pre-specified register so that the user handler can check the validity of the access. For example, a file object may create a read-only access point using a portal specifying OFF_OP_R as the only permitted access mode. We could have omitted this access mode feature from portals, but it does not change the portal semantics and avoid those calls which will fail to proceed further.

The stack pool pointer (spp) portal member points to an array of initial stack pointers provided by the handler. Each incoming invocation will use, as its own user stack, one of the stacks found there. The pool of stacks can be of any length, but it is assumed that the last stack pointer will be invalid and its value will be 1.

Upon portal invocation, the kernel selects one of the non-zero pointers found in the pool and then resets it to 0. Since a stack pointer with value 0 will be ignored, this is a safe mechanism to avoid race conditions during stack selection. To allow the stack to be reused, the user-level handler must (atomically) write the original stack pointer value back to the stack pool.

When there are no more free stacks, any further call on the portal blocks. These blocked shuttles are queued (using shuttle identifiers instead of pointers, so that the list can span several nodes) using the p_wstack private member. When a new stack is made available, they are awakened.

<Other off_Prtl private members. >= (<-U)
  off_Bureaucrat p_wstack; // No stack available on a PCT portal

\subsection{Portals for plain users}

The portal server has a ``conventional'' wrapper.

<Off portal server for users. >=
ENTRY class off_uPrtlSrv : public off_uAbsCompResource {
  off_uPrtl *alloc(const off_Protection &prot,
                   natural_t n=1, off_prtl_id_t at=OFF_PRTL_NULL);
  void free(off_uPrtl *p, natural_t n=1, const off_Rights &r);

  off_uPrtl *operator [](off_prtl_id_t id, const off_Rights &r);
Defines off_uPrtlSrv (links are to index).

The portal wrapper is the most special one in the kernel. Only the non-delivering methods are exported to users as usually. Methods used for message delivering and PCT are implemented without wrapper services (because they are used to implement the portal delivering mechanism used by the wrappers).

<Off portal for users. >=
ENTRY class off_uPrtl : public uAbsResUnit {
  // (Re)sets the portal handler.
  void set_hndlr(vm_offset_t pc, vm_offset_t spp, off_mode_t mode,
                 off_dtlb_id_t vas, natural_t maxmsgsz=0,
                 const off_ShtlPSet  *props=NULL, 
                 off_shtl_id_t s=OFF_SHTL_NULL,
                 const off_Rights &prtl_r, const off_Rights &shtl_r,
                 const off_Rights &dtlb_r);

Defines off_uPrtl (links are to index).

5.3 Distributed Memory Managers


Off++ memory management is based on Distributed TLBs (DTLBs) where TLB stands for Translation Lookaside Buffer, a cache of virtual to physical memory address translations [12].

Each Distributed Memory Manager (DMM, for short) is actually a pool of DTLBs. However, DTLBs are an optional feature. Machines dedicated to a single application (like a dedicated file server, an embedded controller, etc.) can use a single protection domain for efficiency purposes.

A DMM multiplexes the address translation hardware among the existing DTLBs in cooperation with the shuttle server (as there is usually one processor register used to identify the protection domain).

<Off DMM. >=
class off_DMM : public off_AbsCompResource {
  // Allocates a DTLB.
  off_DTLB *alloc(const off_Protection &prot,
                  natural_t n=1, off_dtlb_id_t at=OFF_DTLB_NULL);
  // Deallocates a DTLB.
  void free(off_DTLB *DTLB, natural_t n=1);

  //Gets a DTLB from its number.
  off_DTLB *operator [](off_dtlb_id_t id);

  //Gets the size of pages being translated.
  vm_size_t get_pgsize(void);

  <Off off_DMM shuttle property methods. >

Defines off_DMM (links are to index).

As \dtlb{}s are valid shuttle property values, the \dmm{} also implements the SthlPropSrv interface.

<Off off_DMM shuttle property methods. >= (<-U)

  // Property interface routines (off_ShtlPropSrv signature).

  // Switches property values. 
  // Returns either 0 or an error code.
  int pswitch(off_dtlb_id_t from, off_dtlb_id_t to, off_Shtl &s);
  // Set or clear the property at a given shuttle. 
  int pset(off_dtlb_id_t pval, off_Shtl &s, const off_Rights &r);
  int pclr(off_dtlb_id_t pval, off_Shtl &s);

  // Should we call to pswitch?
  boolean_t needs_switch(void);

A DTLB is a set of translations. Not every existing translation must be present in it. The \dtlb{} should be considered to be a cache of the translations being used. However, on architectures with page tables, the cache may actually hold a copy of every existing translation to local memory.

Page faults and remaining events for the DTLB will be delivered to its owner (see off_Resource, for details).

<Off DTLB. >=
class off_DTLB : public AbsResUnit {
  // Installs a set of  (contiguous) address translations.
  void install(vm_offset_t va, off_pg_id_t pa, 
               off_mode_t access_mode, natural_t n=1);
  // Deinstalls a set of  (contiguous) address translations.
  void invalidate(vm_offset_t va, natural_t n=1);
  // Changes the access mode bits for the given translations
  void set_mode(vm_offset_t va, off_mode_t access_mode, natural_t n=1);
Defines off_DTLB (links are to index).

Note that these addresses can refer also to remote memory. Remote translations will make intensive use of the \dmm{}'s relocation table (one of such tables is present in every AbsCompResource object, as we saw in section 2.5).

5.3.1 DTLBs for plain users

Users can handle their DTLBs through the DTLB and DMM wrappers

<Off DMM for users. >=
ENTRY class off_uDMM : public off_uAbsCompResource {
  // Allocates a DTLB.
  off_uDTLB *alloc(const off_Protection &prot,
                   natural_t n=1, off_dtlb_id_t at=OFF_DTLB_NULL);
  // Deallocates a DTLB.
  void free(off_uDTLB *DTLB, natural_t n=1, const off_Rights &r);

  //Gets a DTLB from its number.
  off_uDTLB *operator [](off_dtlb_id_t id, const off_Rights &r);

  //Gets the size of pages being translated.
  vm_size_t get_pgsize(const off_Rights &r,);

Defines off_uDMM (links are to index).

<Off DTLB for users. >=
ENTRY class off_uDTLB : public off_uAbsResUnit {
  // Installs a set of  (contiguous and w/ the same access rights)
  // address translations.
  void install(vm_offset_t va, off_pg_id_t pa, 
               off_mode_t access_mode, natural_t n=1,
               const off_Rights &dtlb_r, const off_Rights &pa_r );
  // Deinstalls a set of  (contiguous) address translations.
  void invalidate(vm_offset_t va, natural_t n=1, const off_Rights &r);
  // Changes the access mode bits for the given translations
  void set_mode(vm_offset_t va, off_mode_t access_mode, natural_t n=1,
                const off_Rights &dtlb_r, const off_Rights &pa_r);
Defines off_uDTLB (links are to index).

A Index of Chunks

B Index of Identifiers


Francisco J. Ballesteros. Off web site. http://gsyc.escet.urjc.es/off , 1996.

Francisco J. Ballesteros. Off--Un Nuevo Enfoque en la Construccion de Sistemas Operativos Distribuidos. PhD thesis, Facultad de Informatica. Universidad Politecnica de Madrid, 1998. (being submitted, defense pending).

Francisco J. Ballesteros and Luis L. Fernández. The Network Hardware is the Operating System. In Proceedings of the 6th Hot Topics on Operating Systems (HotOS-VI)., Cape Cod, MA (USA), May 1997.

B.N. Bershad, S. Savage, P. Pardyak, E.G. Sirer, M. Fiuczynski, D. Becker, S. Eggers, and C. Chambers. Extensibility, safety and performance in the SPIN operating system. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles. ACM, December 1995.

Roy H. Campbell, Francisco J. Ballesteros, Fabio Kon, Ashish Singhai, Dulcineia Carvalho, and Robert Moore. 2k: A distributed adaptable operating system. http://choices.cs.uiuc.edu/2K , August 1997.

John B. Carter, Dilip Khandekar, and Linus Kamb. Distributed shared memory: Where we are and where we should be headed. In In proceedings of the 5th Workshop on Hot Topics in Operating Systems, 1995.

D. Cheriton and K. Duda. A caching model of operating system kernel functionality. In Proceedings of the First Symposium on Operating Systems Design and Implementation, pages 179-193, November 1994.

D. Engler, M. F. Kaashoek, and J. O'Toole. The Operating System Kernel as a Secure Programmable Machine. In Proc. of the 6th SIGOPS European Workshop, pages 62-67, Wadern, Germany, Sept 1994. ACM SIGOPS.

Bryan Ford, Godmar Back, Greg Benson, Jay Lepreau, Albert Lin, and Olin Shivers. The flux os toolkit: A substrate for kernel and language research. In Proceedings of the 16th SOSP, Saint-Malo, France, October 1997. ACM.

Bryan Ford, Mike Hibler, Jay Lepreau, Patrick Tullmann, Godmar Back, and Stephen Clawson. Microkernels Meet Recursive Virtual Machines. In Proc. OSDI, October 1996.

Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patters. Elements of Object-Oriented Software. Addison-Wesley, 1995.

John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Francisco, California, second edition edition, 1996.

Takuro Kitayama, T.Nakajima, and Hideyuki Tokuda. RT-IPC: An IPC Extension for Real-Time Mach. In Proceedings of the 2nd Microkernel and Other Kernel Architectures. USENIX, 1993.

Donald E. Knuth. Literate Programming. Center for the Study of Language and Information, Stanford University, 1992.

Leslie Lamport. A Document Preparation System: LaTeX. Addison-Wesley, ISBN 0-201-15790-X.

R. Pike, D. Presotto, K. Thompson, and H. Trickey. Plan 9 from Bell Labs. In NKUUG Proceedings of the Summer 1990 Conference, London (England), July 1990.

Norman Ramsey. Literate programming simplified. IEEE Software, 11(5):97-105, September 1994.

Bjarne Stroustrup. The C++ Programming Language. Addison Wesley, 1986.

Andrew S. Tanenbaum. Operating Systems: Design and Implementation. Prentice-Hall, 1987.

H. Tokuda, T.Nakajima, and P. Rao. Real-Time Mach: Towards a Predictable Real-Time System. In Proceedings of the 1st USENIX Mach Workshop. USENIX, Oct 1990.


Distributed Adaptable Micro-Nucleus
... off_u
The ``u'' stands for ``user''.
Although in the code we will be using the prefix off_ we will omit it in this text.
Fabio seems to hate it.
Although nothing prevents future extensions to consider per-resource operations.
We will prefix any data member of a class with the initial letter for the class and an underscore (``_''). Those members which may be in more than one class -like the reference counter- will be just prefixed by an underscore.
As it is done in the real life.
cf. the wrapper or decorator design pattern (see [11], pp 175.)
As the allocator was an interface or signature, no inheritance is needed.
see the singleton design patterm in  [11], pp 127.
Although right now, it's not a facade but it will centralize access to node-wide services like serial debugging, message logging, etc.
By activation we mean entry points or cross-domain calls.
As the kernel can be preempted, kernel registers will be saved in the kernel stack as user registers are.
An abstract class could do the job, but GNU C++ signatures provide subtype polymorphism independently from the inheritance hierarchy.
To follow the fine tradition of UNIX systems.

About this document ...

A Detailed Description of Off++, a Distributed Adaptable tex2html_wrap_inline4297kernel

This document was generated using the LaTeX2HTML translator Version 96.1-h (September 30, 1996) Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.

The command line arguments were:
latex2html -show_section_numbers -split 0 off-full.

The translation was initiated by Francisco J. Ballesteros on Sat Dec 27 22:04:31 MET 1997

Francisco J. Ballesteros
Sat Dec 27 22:04:31 MET 1997