A Detailed Description of Off++, a Distributed Adaptable tex2html_wrap_inline4297kernel

Francisco J. Ballesteros Fabio Kon Roy H. Campbell
nemo@gsyc.escet.urjc.es {f-kon,roy}@cs.uiuc.edu
Department of Computer Science
University of Illinois at Urbana-Champaign
Report No. UIUCDCS-R-97-2035, UILU-ENG-97-1748

August, 1997

Abstract:

The Off++ distributed adaptable tex2html_wrap_inline4297kernel is a minimal tex2html_wrap_inline4297kernel whose only task is to safely multiplex and export the distributed hardware present in the network. It is designed to be used as a basis for distributed user-level OS services. This technical report describes the design and implementation of Off++, the object-oriented redesign of the Off tex2html_wrap_inline4297kernel [2, 3]. Off++ extends the functionalities provided by Off in order to provide basic support for the 2K operating system [5] we are currently building. This document is meant to be the starting point for the system literate implementation.

Contents

List of Figures

1 Introduction

1.1 Motivation

The well known definition for operating system is ``the software that securely abstracts and multiplexes physical resources'' [19]. By no means is it known that those resources should be contained in a single node. So, Why are our distributed operating systems based on microkernels which essentially multiplex just local resources?

Obviously system services can be later distributed when using a (centralized) microkernel. Indeed, that can be done even when using a monolithic system [16]. But this will not solve the actual problem that the system is not being actually distributed and is not transparently multiplexing both local and remote resources.

Secondly, a major drawback of current distributed operating systems is their lack of adaptability. It is known that adaptability can be achieved using a minimal microkernel as a foundation for the operating system [8, 7, 4, 10]. If the microkernel is centralized, adaptation of system services for particular requirements may harm the distribution of those services because they are distributed on top of the microkernel and this distribution is not supported by the microkernel itself. The reason is that user extensions may fail to preserve properties found in the distributed hardware and there will be no lower layer supporting them. If the microkernel is itself distributed, simple system extensions may still benefit from system distribution because the lower layer is distributed by itself. This problem can be noticed by the fact that it is necessary to modify and/or re-implement existing system services to add new distributed services to a typical microkernel based distributed system [6, 20, 13].

The 2K Operating System is being built to explore the combination of distribution and adaptability. 2K is built from two main pieces:

We expect the combination of both elements to yield a simple easy-to-use distributed and flexible OS.

By adopting a strict architecture self-awareness philosophy, we will allow system and user modules to be conscious of its own physical and logical architecture. So, the system will be able to adapt itself optimizing its performance and reliability.

This document describes the object oriented redesign of the Off tex2html_wrap_inline4297kernel [2, 3, 1], named Off++, and is meant to be the starting point for its literate implementation (see section 1.3).

Off++ also adds new functionalities to the tex2html_wrap_inline4297kernel in order to provide basic support for 2K. These new functionalities include support for architecture self-awareness, system object browsing and inspection (see subsections 2.3.5 and 2.3.7), and flexible user-defined domains (see section 2.6).

Multiple protection domains, multi-user operation, and multitasking are optional features which can be avoided in those nodes where they are not useful.

1.2 The Off tex2html_wrap_inline4297kernel

 

The three basic abstractions provided by Off (shuttles, portals and DTLBs) will be described briefly in this section so that the reader will be able to better understand the following chapters.

1.2.1 Shuttles. Distributed Process Services

Process services are quite simple in Off. The only abstraction provided is the Shuttle. A shuttle is an extensible hardware context. Initially, it consists only of a program counter and a stack pointer, though they can be safely extended later on to include other pieces of context or properties (such like general purpose registers, address spaces, privilege levels, etc.)

1.2.2 Portals. Interprocess communication

The basic interprocess communication mechanism provided by Off is the Portal. A Portal can be seen like a ``distributed interrupt line'' or a ``network-wide gate''. Portals have the basic utility of interrupts, i.e. they can be invoked transparently to make a handler perform some task.

Portals do not implement buffering and do not specify whether synchronous messages, asynchronous messages or RPCs are to be used. The mechanisms and policies used to locate and to invoke portals over the network are left up to the user. User provided transport and location protocols are used to extend the portal mechanism transparently over the network. In this way traps, interrupts, exception and user messages can be accessed and adapted in the whole system.

D-TLBs. Distributed Memory Management

The only thing done by Off is to export the network hardware, and this also holds for memory management. Off implements a Distributed Software TLB. The user can establish translations from virtual to distributed physical memory addresses and the address translation hardware is safely multiplexed by Off among the competing applications.

1.3 Literate programming

 

As this document is hardcopy of a literate program [14] we think that it is worth saying something about this technique. We do it by citing the comp.programming.literate newsgroup FAQ:

``Literate programming is the combination of documentation and source together in a fashion suited for reading by human beings. In fact, literate programs should be enjoyable reading, even inviting! [...] In general, literate programs combine source and documentation in a single file. Literate programming tools then parse the file to produce either readable documentation or compilable source. The WEB style of literate programming was created by D.E. Knuth during the development of his TeX typsetting software.

All the original work revolves around a particular literate programming tool called WEB. Knuth says:

The philosophy behind WEB is that an experienced system programmer, who wants to provide the best possible documentation of his or her software products, needs two things simultaneously: a language like TeX for formatting, and a language like C for programming. Neither type of language can provide the best documentation by itself; but when both are appropriately combined, we obtain a system that is much more useful than either language separately.

The structure of a software program may be thought of as a web that is made up of many interconnected pieces. To document such a program we want to explain each individual part of the web and how it relates to its neighbours. The typographic tools provided by TeX give us an opportunity to explain the local structure of each part by making that structure visible, and the programming tools provided by languages such as C or Fortran make it possible for us to specify the algorithms formally and unambigously. By combining the two, we can develop a style of programming that maximizes our ability to perceive the structure of a complex piece of software, and at the same time the documented programs can be mechanically translated into a working software system that matches the documentation.''

What follows have been written in a mixture of LaTeX and \textsl{C++}. The \textsl{noweb} literate programming tool has been used to obtain a LaTeX formatted version of the document as well as the C++ code which can be later compiled.

1.4 How to read this document

In what follows, each description of a part of the system may be followed by a chunk of code like this one

<Example chunk of code. >= (U->)
class A_Class { ... };
Defines A_Class (links are to index).

In this example, the name of the chunk was ``Example chunk of code.'' The number on the right of the chunk name, also found on the left margin, identifies the chunk and will be used in cross-references. We will refer to such number as the ``chunk number''. Also, when an interesting entity has been defined by a chunk (A_Class in the example) it will be stated below the chunk using a small font.

The chunk number is made the page number (identifying in which page the chunk is defined) and, if necessary, a letter to distinguish between different chunks in the same page. Using this number one is able to quickly locate any chunk in the document.

Chunks of code can include other chunks, in this case, you can use the number enclosed with the chunk name between angles (i.e. ``tex2html_wrap_inline4313'' and ``tex2html_wrap_inline4315'') to locate quickly the code for the chunk being included. In this example, the chunk shown above will be included in chunk 3a. We then should use the chunk numbers found in the left margin to locate it.

<A chunk including another. >=
  // Some code...
  <Example chunk of code. >
  // And we also use A_Class
  A_Class an_object;

Finally, a chunk may be continued at a different part of the document. In this case it will be said explicitly below the chunk using a small font, as it can be seen below.

<Continuing chunk of code. >= [D->]
  //Mary had

<Continuing chunk of code. >+= [<-D]
  //a little lamb.

Note the use of the ``+'' to state that the chunk is a continuation.

Finally, we suggest reading this document as it is. Do not try to read it following the order understood by a C++ compiler. The references below each chunk will allow you to navigate through this program's web.

1.5 Tools

To implement Off++ several tools have been used. They are all you need to build a system image from the on-line source for this document.

All these tools have been already used in the construction of the original Off prototype.

2 System structure

2.1 Exporting system objects

System services are made of a bunch of system objects exported to users. The model is actually more generic: portals are used whenever a protection domain wants to export certain services in a secure way to alien users. For each object being exported, a set of methods are made available by means of portals.

2.1.1 Traps and the OO model

The Off++ tex2html_wrap_inline4297kernel is an object oriented system. However, the boundary between the user and the kernel is procedural: portals are used to perform system calls (implemented in turn by means of traps), as can be seen in figure 2.1.

  figure226
Figure 2.1: System calls

The path for a system call proceeds as follows:

  1.   A user object calls a system object method.
  2. The system object is actually a wrapper that will translate the method invocation into a portal invocation.
  3.   The portal invocation will transfer the user control flow to the kernel.
  4. Inside the kernel, a system object method (that wrapped by the wrapper object in step 1) will be called as part of the portal delivering mechanism.

Only step 3 is procedural. Both user and kernel objects think that they are calling now and then to other objects. But whenever a protection domain has to be crossed, a portal is used.

In this way no remote method invocation has to be built inside the kernel so that it could be kept simple.

Thus, the only actual system calls are those needed to implement the portal delivering mechanism. Remaining system services are first provided by means of portals and then wrapped at user level by objects.

2.1.2 Tying portals and methods together

Whenever an object wants to export part of its interface to the outer world it must create one portal per method. To automate the task, each class with methods being exported should be labeled ENTRY too. In this way a compiler can generate the glue code between the portals and the methods being exported. All the method of task are considered to be exported to the user.

In particular, the compiler will generate automatically the implementation of this method:

<Off private methods for exported objects. >=
  // Creates portals to access class methods
  void export(void); 

The code generated for export will create one portal per method. The portal handler will be setup so that methods are invoked transparently: The program counter will be pointing to the method code and the maximum expected number of message words will be setup to the size in the stack of the method arguments. The task of extracting the arguments is done automatically by the portal invocation mechanism and the compiler because portal messages are always copied from (caller) stack to (callee) stack.

Those methods exported are expected to return a non-zero value with the error code in case the invocation fails, or zero otherwise. This can be used by user-level wrappers to raise exceptions. That error code will be returned by delivering a ``system call failed'' (SCERR) event to the shuttle issuing the system call. The handler for that event may set the error code inside the user application so that an exception could be raised. The benefit of using events for error notifying is that we save a register and permit ``optimistic'' system usage where most of the times the system does not need to return anything to the user.

It should be noted that the object constructor should call export or nothing will be actually exported.

Portal identifiers are not stored because the order in which exported system objects are created is always the same. Thus, the algorithm used to construct the portal identifiers ensures that given the node identifier, portal numbers will always be the same. This is to say that they are node-wide constants.

Although it is not part of the kernel, we will say that the compiler mentioned above can generate user wrappers too. These wrappers are objects with the same signature of the system object being exported but with just those methods being labeled ENTRY. Each wrapper method will simply invoke to the kernel portal and raise an exception on portal return whenever the return code is non-zero.

Note that many of these wrappers are to be built dynamically as users obtain access to system resources. They can cache the identifier of the resource being wrapped and use that identifier to to perform system calls.

To make it clear what system services are being exported to the user, and also to avoid in-kernel authentication (e.g. unnecessary access checks when operations are requested by the kernel itself) we will wrap every object being exported to the user. Those wrappers will be named with the prefix off_ugif and will be described right after the objects they are wrapping to make it clear which methods are exported to the user.

2.2 Resource building blocks

There are some basic interfaces which must be present on almost every system object. They were initially inherited by the Resourcegif class, which models system resources, as can be seen in figure 2.2, but we will be using aggregation instead (see figure 2.3) to avoid multiple inheritancegif.

  figure246
Figure 2.2: Initial resource hierarchy (not used)

  figure253
Figure 2.3: Basic resource hierarchy

2.2.1 Protection

To protect object access we define a generic set of access operations (read, write, execute, delete, and protect in the current implementationgif) and for each one we have a Protection object specifying the protection for such operation.

<Off access operations. >=
// Operations which can be protected. 
enum off_op_t { OFF_OP_R=0x1, OFF_OP_W=0x2, 
                  OFF_OP_X=0x4, OFF_OP_D=0x8, OFF_OP_P=0x10 };
const natural_t OFF_NOPS = 5;
Defines OFF_NOPS, OFF_OP_D, OFF_OP_P, OFF_OP_R, off_op_t, OFF_OP_W, OFF_OP_X (links are to index).

Individual operations can be combined together (usually by a bit-or operation) to specify an access mode (e.g. OFF_OP_R|OFF_OP_X for ``read and execute'' access mode).

Also, note that natural_t and other basic types which we will be using through the document have to be defined in a portable way (i.e. ensuring that they all have the same size across different platforms).

<Off access mode. >=
typedef off_op_t off_mode_t;
Defines off_mode_t (links are to index).

Three different objects are involved in protection:

2.2.2 Reference counting

Resources exported by the kernel will also be used by other kernel resources (e.g. page frames are used by address translations). To avoid resource deletion while its being used and to provide basic support for garbage collection we employ reference counters (RefCounter objects). When a RefCcounter comes down to zero an UNUSED exception may be raised to notify users of unreferenced (maybe unused) resources.

This feature is actually included as a builtin service in every Resource, so that we could save some typing and execution time using foo->reference() instead of foo->reference_counter->reference()..

The implementation is so simple that we have folded and inlined it.

<Off public methods for reference counting objects. >= (U-> U->)
  // References  (unreferences) an object
  void reference( void )   { _rc++; };
  void unreference( void ) { if (!--_rc) unused(); };
  natural_t get_num_refs(void) {  return _rc; };

The only subtle point is that we will use the virtual method unused to raise any exception and then destroy the object.

<Off protected methods for reference counting objects. >= (U-> U->)
  // Raises UNUSED exceptions and destroyes the object
  virtual void unused(void)  {;}

And of course\ldots

<Off private members for reference counting objects. >= (U-> U->)
  natural_t _rc;                 // # of references

2.2.3 Synchronizing system resources

To synchronize access to system objects, we will use spin locks and bureaucracygif most of the time.

Locks

The access pattern to system objects in system call is:

    \item Call make_available() to bring any needed remote object here.
  1. Having the required objects in the local node, lock them either for reading or for writing (usual multiple readers, single write semantics apply).
  2. Perform the operation (which should now be non-blocking) and return.

Thus, resources must include a read/write lock. The member variable _lock will be included in every ``lockable'' object.

<Off private members for lockable objects. >= (U->)
  rw_lock_t _lock;            // The lock

Again, we include this as builtin functionality to save some typing and time.

<Off public methods for lockable objects. >= (U->)
  // Spin (un)lock on this for reading 
  lock_state_t r_lock(void);
  void r_unlock(lock_state_t old_state);

  // Spin (un)lock on this for writing 
  lock_state_t w_lock(void);
  void w_unlock(lock_state_t old_state);

The value returned by lock routines (the lock_state) holds the state to be restored when the resource is released through unlock routines, e.g. the interrupt mask (are interruptions enabled or disabled?) and any other piece of ``global'' state changed by the lock.

Kernel bureaucracy

In order to wait for a resource to reach a given state, or to do long-term waiting due to resources being temporarily unavailable, we use Bureaucrats. Each resource that may cause long-term waiting (abstract resources) will provide one or more Bureaucrats so that shuttles block on them until some new state is reaached. Bureaucrats can be understood as condition variables but can be implemented by other means.

The use of condition variables is not an obstacle to system distribution because we will be using lists of identifiers (and not lists of objects) to represent the set of shuttles waiting on condition variables. Thus a single list may span several nodes.

<Off bureaucrat. >=
class off_Bureaucrat {
public:
  // Puts 'who' to sleep on this bureaucrat.
  void wait(shtl_id_t who);
  // Signals this condition. Just one will be awaken. 
  void signal(void);
  // Signals this condition. Everyone will be awaken. 
  void signal_all(void);
}
Defines off_Bureaucrat (links are to index).

%% --------------------------------------------------------------

2.2.4 Sequencers

Some objects will need to atomically obtain unique sequence numbers to create unique identifiers (for other objects contained inside). Again, this functionality is incorporated into the aggregate.

<Off private members for sequencing objects. >= (U->)
  off_seq_t _seq;            // Next sequence nb.

<Off protected methods for sequencing objects. >=
   // Gets a new sequence number. Uses lock() to ensure atomicity
   off_seq_t get_seq(void);      

2.3 System resources

 

Each resource in the system supports a set of well-defined operations. Namely, those provided by building blocks that we have seen before and a few others like freeze, to get the state of the object and defer any further invocations, melt, to recreate a frozen object, and dump, which will return a printable representation of the resource state.

<Off resource. >=
class off_Resource  {
private:
   <Off private members for protected objects. >
   <Off private members for reference counting objects. >
   <Off private members for lockable objects. >
   <Other private members of off_Resource. >
protected:
   <Off protected methods for reference counting objects. >
public:
   <Off public methods for lockable objects. >
   <Off public methods for reference counting objects. >
   <Off public methods for protected objects. >
   

  // Every resource (be it simple or compound) is able to be
  //  dumped for inspection and frozen/melted. 
  virtual void *freeze(void *to, size_t size)=0;
  virtual melt(void *from, size_t size)=0;
  inline boolean_t is_frozen(void);
  virtual char *dump(char *buf, size_t size) const;

  <Other public methods of off_Resource. >
};
Defines off_Resource (links are to index).

Also, we should mention that some resource members are initialized at allocation time (with the information provided by system users). That information includes the protection for the resource and an identifier used to identify the user-level entity responsible for the object after its allocation; we name such entity the object domain.

Being the \emph{Portal} the IPC mechanism provided by the kernel, the domain identifier is indeed a portal identifier of type off_prtl_id_t. By doing that, we can have the kernel performing upcalls to either user-level or in-kernel domain servers which will be responsible for allocated system resources. Besides, the meaning of the domain abstraction is defined by the user and not by the kernel, as will be discussed in section 2.6.

<Other private members of off_Resource. >= (<-U)
   off_prtl_id_t r_domain;       // Domain for the resource

\subsection{Resources for plain users}

These are the common entry points for every system resource.

<Off resource for users. >=
ENTRY class off_uResource {
private:
 off_Resource &u_resource;
public:
 void *freeze(void *to, size_t size, const off_Rights &r)=0;
 void  melt(void *from, size_t size, const off_Rights &r)=0;
 boolean_t is_frozen(const off_Rights &r);
 char *dump(char *buf, size_t size, const off_Rights &r) const;

 <Other public methods of off_uResource. >

};
Defines off_uResource (links are to index).

2.3.2 Identifiers

Off uses two kinds of identifiers:

The size of each field is defined as shown.

<Off identifiers. >= [D->]
typedef unsigned16_t  off_node_t;  // Node descriptor
typedef unsigned32_t  off_seq_t;   // Sequence number
typedef unsigned16_t  off_slot_t;  // Slot number
typedef unsigned long off_offset_t;// Elementary unit offset number

Thus, identifiers are typed:

<Off identifiers. >+= [<-D]
// Relocatable top level identifier
struct off_id_t {
  off_node_t i_node : 16;     // Creation node for the object
  off_seq_t  i_seq  : 32;     // Sequence number
  off_slot_t i_slot : 16;     // Implementor's descriptor for the object
};

// Hardware resource elementary unit identifier
struct off_eu_id_t {
  off_id_t     e_container;        // id for its container
  off_offset_t e_offset;           // Object id relative to the container.
};
Defines off_eu_id_t, off_id_t (links are to index).

2.3.3 Containers and resource units

Resources are arranged as recursive containers, being the node the outermost one.

These containers delegate allocation to an Allocator object which can (and will) be further wrappedgif to delegate usage statistics to a bookkeeper. See figure 2.4.

\begin{figure}[htb]

tex2html_wrap4325
Containers and allocators

 

<Off compound resource. >=
// Compound resource. Uses an allocator to allocate resource units. 
class off_CompResource : public off_Resource {
private:
  // Allocator used for this container.
  off_Allocator *c_pool;    // Resource unit pool
  <Other private members of off_CompResource. > 
public:
  // Named with `relocatable' identifiers
  off_id_t  get_id(void) const;
  // Gives a pointer to the container's allocator
  off_Allocator *get_allocator(void) const;
};
Defines off_CompResource (links are to index).

Specific resource units maintain a reference to their container. They also redefine the new operator so that it could be used to allocate resource units from a container's pool. This way we can avoid memory fragmentation.

The reference to the container and the redefinition of new will be declared in subclasses of ResUnit to avoid explicit type casts.

<Off elementary resource unit. >=
// Elementary resource unit.
class off_ResUnit : public off_Resource {
};
Defines off_ResUnit (links are to index).

\subsubsection{Compound resources and resource units for plain users}

Compound resources export to users the allocation information found in the CompResource allocator. To do so, they export the get_allocator method. They also export get_id.

<Off compound resource for users. >=
ENTRY class off_uCompResource : public off_uResource {
public:
 off_id_t  get_id(const off_Rights &r) const;
 off_Allocator *get_allocator(const off_Rights &r) const;
};
Defines off_uCompResource (links are to index).

We do not export to users anything besides that exported by a Resource in ResUnit system objects.

<Off elementary resource unit for users. >=
ENTRY class off_uResUnit : public off_uResource {
};
Defines off_uResUnit (links are to index).

2.3.4 Resource availability

 

When users issue a system call, it is targeted to a particular system object. The system call is processed in the local node only when that object is local. In any other case we forward the system call to a remote node by raising a FWD exception to the user.

Assuming the target system object is local, so that the system call proceeds, it is not guaranteed that every resource needed by that system call be present in the local node.

To deal with remote resources, every resource container (CompResources) implements a make_available method so that whenever a resource is needed it could be fetched to the local node (if that is really required).

Besides, when the container being considered is an AbsCompResource (a container of abstract resources), we have to take into account that resources contained inside it (abstract resources) are able to move around. Thus, in this case we will employ a relocation table to cache known object locations.

In every CompResource, make_available proceeds as follows:

  1. Using the resource identifier, it tries to locate the resource in the allocator used by the AbsCompResource (using the slot field of the resource identifier). If the resource is found there, then there is nothing more to do.
  2. If the resource was not present in the allocator and we are considering abstract resources, the relocation table has to be consulted to find out if we know of any object relocation. If we find an entry, we now know where the object is and we are done.
  3. If we don't know where the object is at this point, we assume it is located in its creation node (known by looking in the resource identifier). Thus, we suspect where the object is and raise a MISSING exception providing the expected location to user so that he could fetch the remote resource and bring it to the local container.
    1. If the object was indeed at its expected location (it did not move around) the system call is able to proceed as soon as the MISSING exception handler completes successfully. Should the handler fail, the system call is aborted and an ILL (illegal instruction) exception is raised to the Shuttle domain.
    2. If the object was not at its expected location we will notice it as soon as the user is notified by the remote resource container (where we expected to find the resource). In this case, a location algorithm will be run by the user. The resulting location should be inserted in the container relocation table, if any.

We should note that resource ``relocation'' counts mostly for portals, shuttles and DTLBs. Elementary resource units are not able to move around, so they will always be at their creation nodes.

Besides, because the system can be heterogeneous, a remote resource brought to the local node may be meaningless to us or require some sort of marshaling (eg. when it corresponds to a different architecture). In this case, a XDT exception is raised in order to request to a trusted user translator an image of the remote object in the native format.

2.3.5 Architecture awareness support

 

Each resource will be wrapped with a Navigator and an Inspector. This way, the OS built on top of the tex2html_wrap_inline4297kernel will be able to transparently navigate through its components and later on ask about resource attributes (e.g. bandwidth, persistence, etc.),

The class hierarchies of Navigators and Inspectors mimic that of Resource, thus we omit it.

Every resource will include two methods, get_navigator and get_inspector, to provide a pointer to its specific navigator and inspector. Both of them will be friends of the object considered and will be able to access its state in order to support their services.

<Other public methods of off_Resource. >= (<-U)
   // Returns the navigator for the resource.
   virtual off_Navigator *get_navigator(void) const;
   // Returns the inspector for the resource.
   virtual off_Inspector *get_inspector(void) const;

By enquiring the navigator, references to other system resources may be obtained and get_navigator or get_inspector may be used again.

The scheme works as shown in the example of the figure 2.5:

  1. The user has a reference to a memory bank.
  2. Using get_navigator, a reference to its navigator is obtained. This navigator implements the Navigator interface.
  3. Browsing with the navigator, a page frame is located \item Its method get_inspector is used to obtain a reference to an object implementing the Inspector interface. A common protocol can be used now to find out about the page frame persistence, size, etc.

Of course, both get_navigator and get_inspector are exported to system users.

<Other public methods of off_uResource. >+= (<-U) [<-D]
 off_Navigator *get_navigator(const off_Rights &r ) const;
 off_Inspector *get_inspector(const off_Rights &r ) const;

A Navigator provides methods to select and lookup components.

  figure307
Figure 2.5: Architecture awareness support: Navigators and Inspectors.

<Off resource navigator. >=
// A resource navigator
ENTRY class off_Navigator {
public: 
   // Returns a reference to the resource being navigated
   off_Resource *get_current(void);

   // Returns the first element.
   off_Resource *get_first(void);
   // Advances the iteration and returns the next element.
   off_Resource *get_next(void);

   // Selects a component by name.
   off_Resource *operator[](char *name);
   //   by identifiers
   off_Resource *operator[](const off_eu_id_t &id);
   off_Resource *operator[](off_id_t id);
}; 
Defines off_Navigator (links are to index).

An Inspector has two main methods to lookup attributes either by name or by index. Attribute indexes will vary from 0 to some arbitrary value. The first invalid attribute is guaranteed to have a NULL name and type.

To avoid usage of void *, users are requested to specify the expected type of an attribute. As it is unrealistic to force users to know in advance which attributes can be present, a method nameof is provided to obtain the name of a given attribute, and a method typeof is provided to obtain its type.

<Off resource inspector. >=
// A resource inspector
ENTRY class off_Inspector {
public:
   // Returns  the name of an attribute at [idx]
   char *nameof( natural_t idx) const; 
   // Returns the type of an attribute
   off_attr_kind_t typeof(natural_t idx) const;

   // Look up a boolean attribute by name or by index
   virtual boolean_t get_bool_attr(char *name)     const { return FALSE; }
   virtual boolean_t get_bool_attr(natural_t idx)  const { return FALSE; }

   // Look up a natural attribute by name or by index
   virtual natural_t get_nat_attr(char *name)      const { return 0; }
   virtual natural_t get_nat_attr(natural_t idx)   const { return 0; }

   // Look up a string attribute by name or by index
   virtual char *get_str_attr(char *name)          const { return NULL; }
   virtual char *get_str_attr(natural_t idx)       const { return NULL; }

   // Look up an id attribute by name or by index
   virtual off_id_t get_id_attr(char *name)  const { return OFF_ID_NULL; }
   virtual off_id_t get_id_attr(natural_t idx) const {return OFF_ID_NULL;}

};
Defines off_Inspector (links are to index).

The default implementation is to ignore the arguments and return a silly value, so that such function could be ignored: not every resource has attributes of every type. Thus, only nameof, typeof and one of the for previous pairs of methods must be provided by any subclass.

The valid attribute types are

<Off attribute type ids. >=
  enum off_attr_kind_t { OFF_BOOL_ATTR,  OFF_INT_ATTR, 
                         OFF_STR_ATTR,   OFF_ID_ATTR };
Defines off_attr_kind_t, OFF_BOOL_ATTR, OFF_ID_ATTR, OFF_INT_ATTR, OFF_STR_ATTR (links are to index).

Finally, any resource should have at least these attributes defined

<Off attributes. >=
  enum { OFF_ATTR_NULL = 0,  // Used to mark the end of the attr list
         OFF_ATTR_NAME,      // Name of this resource
         OFF_ATTR_CLASS,     // Class for this resource
         OFF_ATTR_DOM,       // Resource domain 
         OFF_ATTR_ID,        // Resource id (or container id)
         OFF_ATTR_OFFSET,    // Resource offest in a container or 0
         OFF_ATTR_URL,       // URL for the resource literate source code
  };
Defines OFF_ATTR_CLASS, OFF_ATTR_DOM, OFF_ATTR_ID, OFF_ATTR_NAME, OFF_ATTR_NULL, OFF_ATTR_OFFSET, OFF_ATTR_URL (links are to index).

%% --------------------------------------------------------------

2.3.6 Resource Allocators

Every allocator will provide the same interface, so that we could replace it at will.

<Off allocator. >=
signature off_Allocator { 
  // Returns a chunk of raw memory.
  // Will try to allocate n chunks starting at the i-th item. 
  // i == 0 means allocate anywhere.
  void *allocate(natural_t n=1, natural_t i=0);

  // Deallocates the chunk pointed by p
  void  deallocate(void *p);
};
Defines off_Allocator (links are to index).

The signature keyword is a GNU C++ extension which permits the use of subtype polymorphism separately from inheritance. Some dynamic dispatching is saved this way. However, it can be considered to be an abstract base class or interface implemented by every allocator.

\subsection{Allocation statistics}  

Statistics are maintained by bookkeeping allocators (BKAllocators) which are composed by wrappinggif the basic Allocator with some bookkeeping methods. These operations are inlined.

<Off bookkeeping allocator. >=
class off_BKAllocator {
private:
   natural_t b_nfree;           //  # of free nodes
   natural_t b_nalloc;          //  # of allocated nodes
   natural_t b_maxalloc;        //  Maximum # of ever allocated nodes
   natural_t b_nallocrq;        //  # of alloc requests
   natural_t b_nfreerq;         //  # of free requests
public:
   // Get some statistics 
   natural_t get_nfree(void) const {return s_nfree;}
   natural_t get_nalloc(void) const {return s_nalloc;}
   natural_t get_maxalloc(void) const {return s_maxalloc;}
   natural_t get_num_allocrq(void) const {return s_nallocrq;}
   natural_t get_num_freerq(void) const {return s_nfreerq;}

   // Wrapped allocator bookkeeping methods  
   inline void *allocate(natural_t n=1, natural_t i=0);
   inline void  deallocate(void *p);
};
Defines off_BKAllocator (links are to index).

\subsection{Fixed and Block allocators}  

The FixedAllocator is an Allocator with a fixed allocation scheme. Its implementation may vary, but in general it will consist on a fixed array and a linked list of free nodes. This class will provide a generic implementation which can be later optimized by specialized allocators.

<Off fixed allocator. >=
class off_FixedAllocator: public off_BKAllocator {
public:
   // Allocates (deallocates) units from a fixed store.
   void *allocate(natural_t n=1, natural_t i=0);
   void  deallocate(void *p);
};
Defines off_FixedAllocator (links are to index).

The BlockAllocator is an allocator which can grow dynamically.

<Off block allocator. >=
class off_BlockAllocator: public off_BKAllocator {
public:
   // Allocates (deallocates) units from a growing store
   void *allocate(natural_t n=1, natural_t i=0);
   void  deallocate(void *p);
   // Resizes the store
   void  grow(natural_t n);
};
Defines off_BlockAllocator (links are to index).

Even though it is not required for the understanding of the rest of this document, we now say a few words about the implementation of BlockAllocator. In Off, it is used to allocate abstract system resources (shuttles, portals and DTLBs).

It is desirable to avoid static limits in the amount of resources an application can allocate. So, each application would allocate, for instance, as many shuttles as it wants and as many portals as it wants. To implement that, we chose to allocate abstract system resources in virtual memory.

However, adding this flexibility to the allocator could cause problems in the case when a single application allocates many resources (e.g. many portals) harming other applications' performance.

Our implementation of the BlockAllocator helps solving this problem by allocating elementary units (i.e. shuttles, portals and DTLBs) so that those of a particular application can be paged out without disturbing other applications. The kernel virtual memory may look like the one depicted in figure 2.6. In this situation the large number of VM pages used to store one of the applications' (xfig) portals does not avoid the other ( gcc) to have the opportunity to freely allocate its portals.

  
Figure 2.6: Storage of Portals for two different applications.

As the system does not know what a ``user application'' is, it is not possible to make the allocator use a different page for each application. However, the BlockAllocator can use the resource domain identifier as a hint. So, what we actually do is to use a different set of virtual memory pages to allocate resources for each application domain.

2.3.9 Resource revocation

 

Resources are not revoked by the Off++ tex2html_wrap_inline4297kernel. Instead, an UNAVAILABLE exception is raised for the resource exhausted; so that users could run whatever algorithm is needed to make more units of the resource being exhausted.

The exception trigger is the memory allocator being used. System allocators are provided with an Exhausted object at instantiation time for that purpose. The Exhausted object provides a single method notify to let allocators trigger the proper action (by default, raising an UNAVAILABLE exception).

<Off Exhausted resource revocation trigger. >=
signature off_Exhausted {
public:
   // Notify that resource is exhausted
   void notify(void);
};
Defines off_Exhausted (links are to index).

2.4 Hardware resources

Hardware resources are not able to move and are always located at the same place. If they grow, they grow one container at a time (i.e. when new disks or pluggable memory banks are added to the system). Hardware resource containers may move by replacing a whole container by another. So, usually, hardware containers will use a fixed allocator.

The hardware resources exported by the kernel are arranged as recursive containers starting from Node. See figure 2.7.

  
Figure 2.7: Hardware resource containers and elementary units.

<Off hardware resource container. >=
// A hardware resource container. Units will be allocated
// within it. 
class off_HWCompResource : public off_CompResource {
public:
  // We may wish to cache a resource unit 
  void make_available(off_eu_id_t u);
};
Defines off_HWCompResource (links are to index).

Hardware resource units employ a particular naming scheme. They do not use off_id_t identifiers as remaining system resources do. They use elementary unit identifiers (off_eu_id_t) instead. These ones are built of an off_id_t (naming the container) and an offset locating the object inside the container. The offset matches the resource physical name, i.e. that understood by the hardware.

<Off hardware resource unit. >=
// A hardware resource unit.
// Uses virtualized identifiers <container id><hw id> 
// and a fixed allocation scheme.
class off_HWResUnit : public off_ResUnit {
public:
  // Gets the identifier for this unit.
  off_eu_id_t get_id(void) const;
};
Defines off_HWResUnit (links are to index).

\subsection{Hardware resources for plain users}

Users can ask for the identifier of a given resouce unit.

<Off hardware resource unit for users.>=
ENTRY class off_uHWResUnit : public off_uResUnit {
public:
  off_eu_id_t get_id(const off_Rights &r) const;
};
Defines off_uHWResUnit (links are to index).

Compound hardware resources do not export anything.

<Off hardware resource container for users. >=
ENTRY class off_uHWCompResource : public off_uCompResource {
public:
};
Defines off_uHWCompResource (links are to index).

2.5 Abstract resources

 

Shuttles, Portals, and DTLBs (see 1.2) are system abstractions, and are able to move around. Consequently, abstract resource containers (AbsCompResources) provide a method make_avaiable which uses a relocation table so that system code may request to one of the system containers to make a particular resource unit available at the local node.

They are arranged, as physical resources are, as recursive containers being the Node the top-level container (see figure 2.8). \begin{figure}[htb]


System abstract resource containers and elementary units.

 

<Off system server. >=
// A system resource container. 
class off_AbsCompResource : public off_CompResource {
  // System containers use a block allocator and  a relocation cache. 
public:
  // We may wish to fetch a resource unit 
  void make_available(off_id_t u);
};
Defines off_AbsCompResource (links are to index).

<Off system resource. >=
// A system resource unit. It's relocatable by nature. 
class off_AbsResUnit : public off_ResUnit {
public: 
  off_id_t get_id(void) const;
};
Defines off_AbsResUnit (links are to index).

\subsection{Abstract resources for plain users}

The wrappers for abstract resources are so simple that we will not discuss them.

<Off system server for users. >=
ENTRY class off_uAbsCompResource : public off_uCompResource {
public: 
};
Defines off_uAbsCompResource (links are to index).

<Off system resource for users. >=
ENTRY class off_uAbsResUnit : public off_uResUnit {
public: 
  off_id_t get_id(void) const;
};
Defines off_uAbsResUnit (links are to index).

2.6 Domains and resource allocation

 

The kernel does not implements either ``processes'' nor any other ``resource-container'' abstractions. Users allocate system resources and release them back to the kernel when they are no longer needed. None of the system resources is responsible for containing the resources of a given application.

Besides, as we will see, the system resource that is the most similar to the traditional process abstraction (the ``shuttle'') does not contain the resources accessed during its execution. Thus, how are bookkeeping tasks accomplished? And, how are resources released upon application crashes?

Certainly, a user-level (i.e. a non-tex2html_wrap_inline4297kernel) process or ``application'' abstraction is needed. To help on this task, the kernel delegates resource bookkeeping to its users.

Users should implement resource domains containing a set of resources used by a user application. This concept is introduced in the system in order to help solving the three following problems:

Whenever a user issues a resource allocation request, it provides a target domain identifier tdi for that new resource. Then, the kernel entitles that resource to the domain identified by tdi (i.e. sets the r_domain of the Resource being allocated to the supplied domain identifier -- as defined in 2.3).

When the resource gets finally unused, the kernel notifies that using its domain identifier as a portal name, so that users could update their allocation tables. Thus, domain identifiers must be portal identifiers.

As we said, the meaning of a domain is defined by the user, not by the kernel (although the DM can be loaded into the kernel for efficiency).

The 2K operating system will utilize this basic mechanism for implementing the concept of nested domains. From a single abstraction, it will be possible to implement replacements for traditional OS abstractions like group of users, user, process group, process, and thread. This will be possible because nothing forbids domains to persist, so that a permanent user can be seen as a persistent domain.

2K will also be able to support new abstractions like, for example, temporary users. They would be implemented as non-persistent domains and could be created by regular users inheriting a subset of the creator permissions. This would add flexibility compared to existing systems in which usually only a ``super-user'' is allowed to create new users.

A resource could be also logically moved from one domain to another by changing its r_domain. There is nothing in the system architecture forbidding such feature.

3 The node

3.1 The node interface

A Node object is responsible for bringing the system into operation and also for shutting it down (i.e. to either halt, reboot, or suspend it). Node is also a singletongif used as a placeholder for node-wide system properties (e.g. the portal for the authentication server used to establish trust relationships among distributed kernels (auth), and the portal for the external data translator used to translate foreign kernel objects to the native architecture (xdt) are placed here.)

Being the outer-most starting point in the container hierarchy, its id (obtained from the method get_id) is all the user needs to know to start looking for available system resources.

<Off node. >=
// An Off Node. 
// Can be also handled as a resource: freeze, melt, etc. 
class off_Node : public off_Resource { 
private:
   <Off private members for sequencing objects. >
   <Other off_Node private members. > 
protected:
   <Off protected members for sequencing objects. >
   <Other off_Node protected methods. > 
public:
  // Returns the node identifier
  off_id_t   get_id(void) const;

  // Returns or sets the authorization server and
  // the external data translator portals.
  off_prtl_t get_auth(void) const;
  off_prtl_t get_xdt(void)  const;
  void       set_auth(off_prtl_t p);
  void       set_xdt(off_prtl_t  p);

  <Other off_Node public methods. > 

  // Halts, reboots, or suspends this node
  void    halt( char *msg="System On." );
  void    reboot( void );
  void    suspend( void );
};
Defines off_Node (links are to index).

The Node can be also considered to be a façadegif (see  [11]) for node-wide system operations.

\section{Navigation}

To support efficient browsing of node components, the Node has a set of methods that provide access to its components. Most of them will be simply inlined.

For example, for memory banks we have:

<Other off_Node protected methods. >= (<-U) [D->]
   // Returns the number of specific containers found at this node
   inline natural_t get_num_mbanks(void) const;
   // Returns a pointer to a local memory bank. 
   inline off_MBank   *get_mbank( natural_t id = 0 ) const;

The same holds to gain access to remaining resource containers.

<Other off_Node protected methods. >+= (<-U) [<-D]
   // Get a pointer to a hardware resource container
   inline off_IOBank  *get_iobank( void )   const;
   inline off_DMA     *get_dma( void )      const;
   inline off_ProcTbl *get_proctbl( void )  const;
   inline off_ShtlSrv *get_shtlsrv( void )  const;
   inline off_PrtlSrv *get_prtlsrv( void )  const;
   inline off_DMM     *get_dmm( void )      const;

Besides, a generic navigator specialized for the Node will support the common interface for navigation. That is primarily for system users and not for the kernel itself.

\section{Miscellaneous}

There are some global operations that do not fit well anywhere else.

The following one, for example, redirects system output within the kernel to the serial line. Output can be redirected at any time.

<Other off_Node public methods. >= (<-U)
  // Use the serial console? 
  void use_serial_console( boolean_t doit = TRUE );

\section{Nodes for plain users}

The wrapper for node services export almost everything as can be seen.

<Off node for users. >=
ENTRY class off_uNode : public off_uResource{
public:
  off_prtl_t get_auth(const off_Rights &r) const;
  off_prtl_t get_xdt(const off_Rights &r)  const;
  void       set_auth(off_prtl_t p, const off_Rights &r);
  void       set_xdt(off_prtl_t  p, const off_Rights &r);
  void    halt( char *msg="System On." );
  void    reboot( const off_Rights &r );
  void    suspend( const off_Rights &r );

  off_IOBank  *get_iobank( const off_Rights &r )   const;
  off_DMA     *get_dma( const off_Rights &r )      const;
  off_ProcTbl *get_proctbl( const off_Rights &r )  const;
  off_ShtlSrv *get_shtlsrv( const off_Rights &r )  const;
  off_PrtlSrv *get_prtlsrv( const off_Rights &r )  const;
  off_DMM     *get_dmm( const off_Rights &r )      const;
  // Use the serial console? 
  void use_serial_console( const off_Rights &r, boolean_t doit = TRUE );
  
};
Defines off_uNode (links are to index).

\section{System booting}

The system entry point, which is called by the OSKit after basic hardware initialization is named main. Its arguments proceed from the secondary boot loader.

<Off main entry point. >=
// main system entry point. 
int main(int argc, char *argv[]);
Defines main (links are to index).

4 Exporting the hardware

\section{Memory banks}

Physical memory is distributed among different memory containers named mbanks. Each mbank supports allocation and deallocation of physical memory inside a single memory bank. To do so, it provides the alloc and free methods. Also, it provides means to locate a page frame either by address or by page frame number (PFN).

<Off memory bank. >=
// A bank of memory
class off_MBank : public off_HWCompResource {
public:
  // Allocates n contiguous page frames. 
  off_PFrame *alloc(off_Protection *prot, 
                    natural_t n=1, off_pg_id_t at=OFF_EU_ID_NULL);
  // Deallocates the n contiguous page frames starting at pf. 
  void free(off_PFrame *pf, natural_t n=1 );

  //Gets a page frame from its physical address
 off_PFrame *operator[](off_pg_id_t pfn) const;

  //Gets the size of pages inside the bank
  vm_size_t get_pgsize(void);
};
Defines off_MBank (links are to index).

where

<Off page frame identifier. >= [D->]
typedef off_eu_id_t off_pg_id_t;
Defines off_pg_id_t; (links are to index).

PFrame maintains the information needed about each page frame installed in the system. Their methods are just a means to access and update such state.

<Off page frame. >=

class off_PFrame : public off_HWResUnit {
public: 
  // Gets the page frame bits.
  pg_bits_t get_bits(void);

  // Sets the page frame bits and returns the old ones. 
  pg_bits_t set_bits(pg_bits_t b);
};
Defines off_PFrame (links are to index).

where pg_bits_t correspond to the bits maintained by the address translation machinery.

<Off page frame bits data type. >=
typedef pt_entry_t pg_bits_t;
Defines pg_bits_t; (links are to index).

\subsection{Memory banks and page frames for plain users}

These are the wrappers for memory banks and page frames:

<Off memory bank for users. >=
ENTRY class off_uMBank : public off_uHWCompResource {
public:
 off_uPFrame *alloc(off_Protection *prot, 
                   natural_t n=1, off_pg_id_t at=OFF_EU_ID_NULL);
 void free(off_uPFrame *pf, const off_Rights &r, natural_t n=1 );
 off_uPFrame *fetch(off_pg_id_t pfn,const off_Rights &r) const;
 vm_size_t get_pgsize(const off_Rights &r);

};
Defines off_uMBank (links are to index).

<Off page frame for users. >=
ENTRY class off_uPFrame : off_uHWResUnit {
public:
  pg_bits_t get_bits(const off_Rights &r);
  pg_bits_t set_bits(pg_bits_t b, const off_Rights &r);
};
Defines off_uPFrame (links are to index).

where

<Off page frame identifier. >+= [<-D]
typedef off_eu_id_t off_pg_id_t;
Defines off_pg_id_t (links are to index).

%% --------------------------------------------------------------

4.2 Input/Output

IO bank objects are merely used to allocate ports, as I/O should be able to operate (when permitted) from user level.

<Off IO bank. >=
//An IO bank
class off_IOBank : public off_HWCompResource {
public:
  // Allocates n contiguous IO ports. 
  off_IOPort *alloc(off_Protection *prot,natural_t n=1,
                    off_io_id_t at=OFF_EU_ID_NULL);
  // Deallocates the n contiguous IO ports starting at iop. 
  void free( off_IOPort *iop, natural_t n=1 );
   
  //Gets an IO port from its port number.
  off_IOPort *operator[](off_io_id_t iop) const;
};
Defines off_IOBank (links are to index).

where

<Off IO identifier. >=
typedef off_eu_id_t off_io_id_t;
Defines off_io_id_t (links are to index).

<Off IO port. >=
// An IO port
class off_IOPort : public off_HWResUnit {
public: 
  // Input/Output 8, 16, 32 or 64 bit words through the port.
  unsigned8_t   in8(void);
  void          out8(unsigned8_t o);

  unsigned16_t  in16(void);
  void          out16(unsigned16_t o);

  unsigned32_t  in32(void);
  void          out32(unsigned32_t o);
  
  unsigned64_t  in64(void);
  void          out64(unsigned64_t o);
  
};
Defines off_IOPort (links are to index).

\subsection{Input/Output for plain users}

Users may access IO ports using these wrappers.

<Off IO bank for users. >=
ENTRY class off_uIOBank : public off_uHWCompResource {
public:
  off_uIOPort *alloc(off_Protection *prot,natural_t n=1,
            off_io_id_t at=OFF_EU_ID_NULL);
  void free( off_uIOPort *iop, const off_Rights &r, natural_t n=1 );
  off_uIOPort *fetch(off_io_id_t iop,const off_Rights &r) const;
};
Defines off_uIOBank (links are to index).

<Off IO port for users. >=
ENTRY class off_uIOPort: public off_uHWResUnit {
public:
  unsigned8_t   in8(const off_Rights &r,);
  void          out8(unsigned8_t o,const off_Rights &r);

  unsigned16_t  in16(const off_Rights &r);
  void          out16(unsigned16_t o,const off_Rights &r);

  unsigned32_t  in32(const off_Rights &r);
  void          out32(unsigned32_t o,const off_Rights &r);
  
  unsigned64_t  in64(const off_Rights &r);
  void          out64(unsigned64_t o,const off_Rights &r);
};
Defines off_uIOPort (links are to index).

%% --------------------------------------------------------------

4.3 Traps and interrupts

  Both traps and interrupts are allocated using per-processor event tables. Each processor will have an associated a trap table and an associated interrupt table. Traps may be caused by the hardware or by the tex2html_wrap_inline4297kernel itself. Software interrupts are also available.

<Off event table. >=
// An event table 
class off_EventTbl : public off_HWCompResource {
protected:
  // Allocates n contiguous events. 
  off_Event *alloc( off_Protection *prot,  natural_t n=1, 
                    off_ev_id_t at=OFF_EU_ID_NULL );
  // Deallocates the n contiguous events starting at t. 
  void free( off_Event *t, natural_t n=1 );
  //Gets an event from its  number.
  off_Event *fetch(off_ev_id_t n) const;
};
Defines off_EventTbl (links are to index).

where

<Off event identifier. >=
typedef off_eu_id_t off_ev_id_t;
Defines off_ev_id_t (links are to index).

<Off trap table. >=
// Definition of traps for the current processor
class off_TrapTbl: public off_EventTbl {
public:
  // Allocates n contiguous Traps. 
  off_Trap *alloc( off_Protection *prot, natural_t n=1, 
                   off_ev_id_t at=OFF_EU_ID_NULL );
  // Deallocates the n contiguous traps starting at t. 
  void free( off_Trap *t, natural_t n=1 );
  //Gets a trap from its  number.
  off_Trap *operator [](off_ev_id_t n) const;
};
Defines off_TrapTbl (links are to index).

<Off interrupt table. >=
// Definition of interrupts for the current processor.
class off_IntTbl: public off_EventTbl {
public:
  // Allocates n contiguous interrupts. 
  off_Irq *alloc( off_Protection *prot,  natural_t n=1, 
                  off_ev_id_t at=OFF_EU_ID_NULL );
  // Deallocates the n contiguous interrupts starting at i. 
  void free( off_Irq *i, natural_t n=1 );
  //Gets an Irq from its  number.
  off_Irq *operator [](off_ev_id_t n) const;
};
Defines off_IntTbl (links are to index).

Event tables are a set of Events. Each event is identified by an unique id and a reason.

<Off event. >=
// An event, it could be a trap, an interrupt, ...
class off_Event: public off_HWResUnit {
public:

  // Sets h as the handler for this event
  void set_handler( off_EventHandler h);
  // Gets the handler.
  off_EventHandler get_handler(void);
};
Defines off_Event (links are to index).

Events are handled by EventHandlers which can rely either on portals or, when delivered inside the kernel, on function calls.

<Off event handler. >=
class off_EventHandler {
public:
  // Calls the handler
  virtual void operator()( off_event_t ev, off_ev_reason_t reason, ... )=0;
};

class off_FnEventHandler : public off_EventHandler {
private:
  void (*handler)(off_event_t ev, off_ev_reason_t reason, ...);
public:
  // Calls the handler
  virtual void operator()( off_event_t ev, off_ev_reason_t reason, ... );
};

class off_PrtlEventHandler : public off_EventHandler {
private:
  // Its handler is
  //    void (*handler)(off_event_t ev, off_ev_reason_t reason, ...);
  off_prtl_id_t handler;
public:
  // Calls the handler
  virtual void operator()( off_event_t ev, off_ev_reason_t reason, ... );
};

Defines off_EventHandler, off_FnEventHandler, off_PrtlEventHandler (links are to index).

where

<Off events and reasons data types. >=
typedef natural_t off_event_t;
typedef natural_t off_reason_t;
Defines off_event_t, off_reason_t (links are to index).

In turn, events may be either Traps or Irqs (interrupts).

<Off Trap. >=
class off_Trap: public off_Event {
private:
  off_mdepTrap &t_mdep;
};
Defines off_Trap (links are to index).

Although traps cannot be raised artificially, interrupts can (e.g. to notify users or interrupts ``caused by the kernel'', and not by the hardware).

<Off Irq. >=
class off_Irq: public off_Event {
private:
  off_mdepIrq &i_mdep;
public:
  void raise(void);
  <Other public methods of off_Irq. >

};
Defines off_Irq (links are to index).

Whenever an interrupt happens, we must choose either to evict the current shuttle to dispatch the interrupt to its handler or to defer interrupt delivering in favor of the current shuttle. To make a choice, both shuttles and interrupts have priority-levels. See section 5.1.3 for a discussion of interrupt priorities.

<Other public methods of off_Irq. >= (<-U)
   //Sets/Gets the priority level for this interrupt. 
   void set_prty(off_pl_t prty);
   off_pl_t get_prty(void);

where

<Off interrupt priority level data type. >=
typedef natural_t off_pl_t;
Defines off_pl_t (links are to index).

We have been seeing a few machine dependent objects like off_mdepEvState, off_mdepTrap, and off_mdepIrq. Those are defined for a particular architecture and their meaning should be clear by the context. Machine dependent objects are always prefixed by off_mdep and will be described along with the system implementation.

Finally, there are some ``traps'' defined by Off++ and not by the hardware, some of them have appeared before and some will appear later on.

<Off Virtual traps. >=
// Virtual (Off++ raised) exceptions
enum off_ex_t { OFF_EX_UNUSED,       // resource no longer referenced
                OFF_EX_MISSING,      // resource missing (remote?)
                OFF_EX_UNAVAILABLE,  // resource exhaust
                OFF_EX_RELOC,        // resource has moved
                OFF_EX_XDT,          // translation to native format needed
                OFF_EX_FWD,          // request should be processed remotely
        OFF_EX_SCERR         // system call failed
                };
Defines OFF_EX_FWD, OFF_EX_MISSING, OFF_EX_RELOC, OFF_EX_SCERR, off_ex_t, OFF_EX_UNAVAILABLE, OFF_EX_UNUSED, OFF_EX_XDT (links are to index).

4.3.1 Traps and interrupts for plain users

On the one hand, users can interact with trap and interrupt tables through methods provided by Shuttles and Processors. Thus, Shuttle (see section 5.1) and and Processor (see section 4.5) behave as a facade because they provide simple entry points to handle both trap and interrupt tables.

On the other hand, individual events (trap and interrupt entries) are exported to users by means of a wrapper class.

<Off event for users. >=
ENTRY class off_uEvent : public off_uHWResUnit {
public:
   void set_handler( off_prtl_id_t h, const off_Rights &r);
   off_prtl_id_t get_handler(const off_Rights &r) const;
};
Defines off_uEvent (links are to index).

There is nothing specific for user traps as of this day.

<Off Trap for users. >=
ENTRY class off_uTrap : public off_uEvent {
public:
};
Defines off_uTrap (links are to index).

On the other hand, interrupts export a few methods.

<Off Irq for users. >=
ENTRY class off_uIrq: public off_uEvent {
public:
   void raise(const off_Rights &r);
   void set_prty(off_pl_t prty, const off_Rights &r);
   off_pl_t get_prty(const off_Rights &r) const;
};
Defines off_uIrq (links are to index).

\section{DMA lines}

DMA objects are used to allocate DMALines. Note that DMA lines are kept in the kernel although they could have been considered to be a device (and kept in userland). The reason is efficiency, because we do not want protection domain crossings just to program a DMA line. Also, we believe that no flexibility is lost by placing this small piece of code into the kernel.

<Off DMA table. >=
// Allocation of DMA lines
class off_DMA: public off_HWCompResource {
public:
  // Allocates n DMA lines. 
  off_DMALine *alloc(off_Protection *prot,natural_t n=1,
                     off_dma_id_t at=OFF_EU_ID_NULL );
  // Deallocates the n contiguous DMA lines starting at d. 
  void free( off_DMALine *d, natural_t n=1 );
  //Gets a DMA line from its  number.
  off_DMALine *operator [](natural_t n) const;
};
Defines off_DMA (links are to index).

where

<Off DMA line identifier. >=
typedef off_eu_id_t off_dma_id_t;
Defines off_dma_id_t (links are to index).

DMA lines are very not well modeled, instead we provided a model of DMA pretty close to that of Intel based machines. Future Off portings to different architectures should provide feedback on how should DMA lines be modeled.

<Off DMA line. >=
typedef off_mode_t dma_mode_t;
class off_DMAline: public off_HWResUnit {
public:
   // Enables/disables this dma line.
   void enable(void);
   void disable(void);
   // Programs a DMA operation. 
   void program(off_pg_id_t addr, vm_offset_t length, dma_mode_t mode);
   // Gets # of copied or pending bytes. 
   vm_offset_t get_copied(void);
   vm_offset_t get_pending(void);
};
Defines off_DMALine (links are to index).

where

<Off DMA line mode data type. >=
typedef natural_t dma_mode_t;
Defines dma_mode_t (links are to index).

\subsection{DMA lines for plain users}

DMA related wrappers are straightforward.

<Off DMA table for users. >=
ENTRY class off_uDMA : public off_uHWCompResource {
public:
 off_uDMALine *alloc(off_Protection *prot, 
                     natural_t n=1, off_dma_id_t at=OFF_EU_ID_NULL);
 void free(off_uDMALine *pf, const off_Rights &r, natural_t n=1 );
 off_uDMALine *fetch(off_dma_id_t d,const off_Rights &r) const;
};
Defines off_uDMA (links are to index).

<Off DMA line for users. >=
ENTRY class off_uDMALine : public off_uHWResUnit {
public:
   // Enables/disables this dma line.
   void enable(const off_Rights &r);
   void disable(const off_Rights &r);
   // Programs a DMA operation. 
   void program(off_pg_id_t addr, vm_offset_t length, dma_mode_t mode,
        const off_Rights &r);
   // Gets # of copied or pending bytes. 
   vm_offset_t get_copied(void,const off_Rights &r);
   vm_offset_t get_pending(void,const off_Rights &r);
};
Defines off_uDMALine (links are to index).

4.5 Processors and processor pools

  The processor table can be used to allocate a whole processor to certain processes. Usually the node singleton will be the owner of every processor in the system and will delegate ownership to user level (or in-kernel) schedulers.

Processor pools provide Processors. Usually a processor will be allocated to a user-level (or dynamically loaded in-kernel) scheduler which will in turn implement a processor allocation policy.

<Off processor table. >=
class off_ProcTbl: public off_HWCompResource {
public:
  // Allocates n processors. 
  off_Processor *alloc(off_Protection *prot,natural_t n=1,
                       off_proc_id_t at=OFF_EU_ID_NULL);
  // Deallocates the n contiguous processors starting at p. 
  void free( off_Processor *p, natural_t n=1 );
  //Gets a processor from its  number.
  off_Processor *operator [](off_proc_id_t n) const;
};
Defines off_ProcTbl (links are to index).

where

<Off processor identifier. >=
typedef off_eu_id_t off_proc_id_t;
Defines off_proc_id_t (links are to index).

To help schedulers, processors provide some methods which we now discuss.

First, a new shuttle can be explicitly installed in the processor using the switch_to method. This is to avoid using run queues when they are not needed. Thus, multiprocessing is an optional feature. Dedicated single-threaded nodes can avoid using run queues at all (no processor multiplexing code will run).

When they are, a run queue is an ordered collection of shuttle identifiers. Each identifier will be accompanied by the number of clock ticks it is expected to run.

The queue will be run in a round robin fashion until the occurrence of an event of interest to the scheduler. To let the processor know when to stop the run queue and notify to the scheduler notify should be used. It specifies which events are to cause the stop and also, who should be notified (either by portal or by direct function call when inside the kernel).

In any case, get_current will return a pointer to shuttle being run (or to the last shuttle run, when idle).

<Off processor. >=
class off_Processor : public off_HWResUnit {
private:
   <Other private members of off_Processor. >
public:
   // Installs a new shuttle
   void switch_to(off_Shtl &new);
   // Executes a run queue
   void run(off_shtl_id_t s[], off_offset_t ticks[], natural_t n);
   // Arranges for the processor to notify the owner on this events
   void notify( off_sevent_t events, off_EventHandler sched);
   // Returns the current shuttle
   off_shtl_id_t get_current(void);

  <Other public methods of off_Processor. >
};
Defines off_Processor (links are to index).

After the processor has invoked the scheduler to notify the event, the scheduler may either install a new run queue or just adjust the members of the current one. By now, the only events defined are these ones (NB. these are not hardware events, but scheduling events).

<Off scheduling events. >=
// Scheduling related events.
enum off_sevent_t { OFF_SEV_TRAP=0x01,// Trap occurred.
                    OFF_SEV_IRQ =0x02, // Interrupt occurred.
                    OFF_SEV_YLD =0x04, // Current shuttle blocked.
                    OFF_SEV_BLK =0x08, // Other than current shuttle blocked.
                    OFF_SEV_AWK =0x10, // A shuttle got awakened.
                    OFF_SEV_END =0x20 // Run queue exhausted.  
};
Defines off_sevent_t (links are to index).

Although it is an implementation issue, we will say that the run queue should be kept in the memory of the scheduler. The page the queue is in should be locked so that the processor could maintain a pointer to the run queue. In this way a user level scheduler may adjust the run queue just by writing its own memory without further system calls.

Finally, as we said when we discussed traps and interrupts, each Processor contains a couple of event tables, one for traps and another one for interrupts.

<Other private members of off_Processor. >= (<-U)
   off_TrapTbl p_traps;         // Processor trap table.
   off_IntTbl  p_irqs;          // Processor interrupt table.

The trap and interrupt table for a given processor can be obtained with a public method of the Processor class:

<Other public methods of off_Processor. >= (<-U)
    //Returns a reference to the trap/interrupt event table.
    off_TrapTbl *&get_traps_ref(void);
    off_IntTbl  *&get_irqs_ref(void);

\subsection{Processors and processor pools for plain users}

The processor pool wrapper is defined as follows

<Off processor table for users. >=
class off_uProcTbl : public off_uHWCompResource {
public:
 off_uProcessor *alloc(off_Protection *prot, 
                       natural_t n=1, off_proc_id_t at=OFF_EU_ID_NULL);
 void free(off_uProcessor *pf, const off_Rights &r, natural_t n=1 );
 off_uProcessor *fetch(off_proc_id_t d,const off_Rights &r) const;
};
Defines off_uProcTbl (links are to index).

Processors for users also include a few methods coming from trap and interrupt tables.

<Off processor for users. >=
ENTRY class off_uProcessor : public off_uHWResUnit {
   // Installs a new shuttle
   void switch_to(off_shtl_id_t &new,
                  const off_Rights &proc_r, const off_Rights &shtl_r);
   // Executes a run queue
   void run(off_shtl_id_t s[], off_offset_t ticks[], natural_t n,
            const off_Rights &proc_r, const off_Rights &shtl_r[]);
   // Arranges for the processor to notify the owner on this events
   void notify( off_sevent_t events, off_prtl_id_t sched,
                const off_Rights &proc_t);
   // Returns the current shuttle
   off_shtl_id_t get_current(const off_Rights &proc_r);

   off_uTrap *alloc_trap(off_Protection *prot, 
                    natural_t n=1, off_ev_id_t at=OFF_EU_ID_NULL);
   void free_trap(off_uTrap *t, const off_Rights &r, natural_t n=1 );
   off_uTrap *fetch_trap(off_ev_id_t t,const off_Rights &r) const;
   off_uIrq *alloc_irq(off_Protection *prot, 
                    natural_t n=1, off_ev_id_t at=OFF_EU_ID_NULL);
   void free_irq(off_uIrq *t, const off_Rights &r, natural_t n=1 );
   off_uIrq *fetch_irq(off_ev_id_t t,const off_Rights &r) const;

};
Defines off_uProcessor (links are to index).

5 Implementing abstract resources

The overall picture is depicted in figure 5.1. The relationships there depicted will become clear as this chapter proceeds.

  
Figure 5.1: Relationship among Shuttles, Portals, DTLBs and other resources.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

5.1 Shuttles

 

Shuttles provide flows of control which can be extended to support user-level or in-kernel process abstractions. The ShtlSrv (there is one per node) maintains a pool of shuttles. Thus, the ShtlSrv main task is to allocate shuttles.

<Off shuttle server. >=
// A shuttle server
class off_ShtlSrv: public off_AbsCompResource {
public:
  // Allocates a shuttle.
  off_Shtl *alloc(const off_Protection &prot,
                  natural_t n=1, off_shtl_id_t at=OFF_SHTL_NULL);
  // Deallocates a shuttle.
  void free(off_shtl_id_t *p, natural_t n=1);

  // Locates a shuttle by its number. 
  off_Shtl *operator [](off_shtl_id_t id);

  <Other public methods of off_ShtlSrv. >
};
Defines off_ShtlSrv (links are to index).

A shuttle has a processor context and a set of properties (as described below) and it is the only schedulable entity provided by the kernel. Shuttles run when installed in the run queue of a given processor (as we saw in section 4.5).

As we will see in the next chapter, shuttles move from one protection domain to another, including the kernel protection domain, by means of portals. Besides, the use of upcalls from kernel to user space means that the set of kernel and user activationsgif (see figure 5.2) will be mixed. Thus, a user shuttle may call the kernel and the kernel my upcall back to the user during the system call; the upcall routine may in turn issue new system calls, and so on.

  
Figure 5.2: Shuttle activations: system calls and up-calls.

We may be interested in three different processor contexts when referring to a shuttle:

  1. The current register set, i.e. the last saved register set which holds register values at the time the shuttle was last preempted.

    This is the one used when we are interested in the ``current'' state of the shuttle, no matter what its privilege level is.

  2. The last user register set, i.e. the topmost user context saved in the stack of activations (system calls/upcalls).

    This is the one used when we are interested either in the user state at the last system call or in the state to be returned to the user.

  3. The last kernel register set, i.e. the topmost kernel context saved in the stack of activations (system calls/upcalls).

    This is the one used when we are interested in the state of a shuttle while proceeding inside the kernel.

The first register set will always match one of the two latter ones.

<Off off_Shtl register accessors. >= (U->)
  // Get a pointer to current, last user, or last kernel
  // registers in the stack.
  off_mdepPRegs *get_regref(void);
  off_mdepPRegs *get_uregref(void);
  off_mdepPRegs *get_kregref(void);

A reference to the current processor context can be obtained with get_regref. Last user and kernelgif registers are accessible by means of get_uregref and get_kregref respectively. As these methods return a reference to the register storage, no ``set_'' methods are provided.

Shuttles may also wait on abstract system resources due to some reason. When a shuttle goes to sleep it should call block. Later on, someone else may call ready to unblock it. These methods may cause scheduling events (SEV_BLK, SEV_YLD, and SEV_AWK -- see section 4.5) to be delivered to the processor scheduler so that it could reschedule if desired.

Another effect of calling block is that (even if no event is ever raised) the shuttle will be ignored by its processor until a subsequent call to ready is performed. The Processor where the shuttle is running checks the shuttle's blocked flag and ignores the shuttle if it is set. The blocked flag means ``unable to run'' and can be used not only to block a shuttle, but also during shuttle initialization, during shuttle dismantling, etc. It should not be confused with the more abstract ``blocked'' state used in traditional process abstractions.

<Off shuttle. >=
// A shuttle. 
class off_Shtl : public off_AbsResUnit {
private:
  <Other off_Shtl private members. >
public:
  <Off off_Shtl register accessors. >
  <Off off_Shtl property accessors. >

  // Blocks due to some reason related to the given resource.
  // [May notify the scheduler]
  void block(off_id_t culprit, off_object_t culprit_type); 
  // Awakes this shuttle.
  // [May notify the scheduler]
  void ready(void);

  // Is this shuttle blocked?
  inline boolean_t is_blocked(void) const;

  <Other public methods of off_Shtl. >
};
Defines off_Shtl (links are to index).

Shuttles must not run at the same time on more than one processor. To ensure it, running shuttles are flagged and such a flag is checked by every processor before switching to a new shuttle. Two new methods do the job:

<Other public methods of off_Shtl. >= (<-U)
   // Checks/Sets if this shuttle is running.
   inline boolean_t is_running(void);
   inline void run(void);
   inline void stop(void);

\subsection{Shuttles for plain users. }

<Off shuttle server for users. >=
ENTRY class off_uShtlSrv: public off_uAbsCompResource {
public:
  // Allocates a shuttle.
  off_uShtl *alloc(const off_Protection &prot,
                   natural_t n=1, off_shtl_id_t at=OFF_SHTL_NULL);
  // Deallocates a shuttle.
  void free(off_uShtl *p, natural_t n=1, const off_Rights &r);

  // Locates a shuttle by its number. 
  off_uShtl *fetch(off_shtl_id_t id, const off_Rights &r);

  <Other public methods of off_uShtlSrv. >
};
Defines off_uShtlSrv (links are to index).

<Off shuttle for users. >=
ENTRY class off_uShtl : public off_uAbsResUnit {
public:
  off_mdepPRegs get_reg(const off_Rights &r);
  off_mdepPRegs get_ureg(const off_Rights &r);
  off_mdepPRegs get_kreg(const off_Rights &r);
  void set_reg( off_mdepPRegs regs, const off_Rights &r);
  void set_ureg(off_mdepPRegs regs, const off_Rights &r);
  void set_kreg(off_mdepPRegs regs, const off_Rights &r);

  <Off off_uShtl property accessors. >

  // Blocks due to some reason related to the given resource.
  // [May notify the scheduler]
  void block(off_prtl_id_t culprit, const off_Rights &r); 
  // Awakes this shuttle
  // [May notify the scheduler]
  void ready(const off_Rights &r);

  // Is this shuttle blocked?
  boolean_t is_blocked(const off_Rights &r) const;

  <Other public methods of off_uShtl. >

};
Defines off_uShtl (links are to index).

\subsection{Shuttle properties}

Properties are used to specify which resources (e.g. DTLBs, IO maps, etc.) should be readily available for the shuttle to run. Specific resource instances are identified by a property value.

Properties are named by values of type off_prop_t and their values by system identifiers of type off_id_t.

<Off shuttle property identifer. >=
// Properties
typedef natural_t off_prop_t;
Defines off_prop_t (links are to index).

The context switch code is guaranteed to call a property switch function pswitch for every property changing its value (i.e. for every resource whose instance must be changed) in the context switch.

Two additional functions, pset and pclr, are provided for those cases when the hardware is switching the context automatically but some work should be done to install (or deinstall) a property value in a given shuttle.

Any resource provider implementing this interfacegif can be considered to be a property provider.

<Off shuttle property server. >=
// The interface of a property server. 
signature off_ShtlPropSrv {

  // Switches property values. 
  // Returns either 0 or an error code.
  int pswitch(off_id_t from, off_id_t to, off_Shtl &s);
  
  // Set or clear the property at a given shuttle. 
  int pset(off_id_t pval, off_Shtl &s, const off_Rights &r);
  int pclr(off_id_t pval, off_Shtl &s);

  // Should we call to pswitch?
  boolean_t needs_switch(void);
};
Defines off_ShtlPropSrv (links are to index).

needs_switch method is provided to save some calls to pswitch. Those properties which don't need to implement pswitch may provide a trivial (i.e. empty) implementation and arrange for needs_switch to return true. When a new property is being defined, the shuttle server accepting the definition will use needs_switch to determine whether it should arrange for pswitch to be called or not.

To define the property being implemented as a shuttle property, a property implementor must contact ask the shuttle server to define it. The ShtlSrv must include now this method to support property definitions. Indeed, we include two new methods. One to define properties implemented inside the tex2html_wrap_inline4297kernel (which may use a simple procedure call) and another one to define properties which might be implemented outside the tex2html_wrap_inline4297kernel (which must use portals).

<Other public methods of off_ShtlSrv. >= (<-U)
   // Defines a (new) shuttle property.
   void define_kprop(off_ShtlPropSrv &implementor, 
                     off_prop_t id, off_Protection &p);
   void define_uprop(off_prtl_id_t   implementor, 
                     off_prop_t id, off_Protection &p);
   void undefine_prop(off_prop_t id, const off_Rights &r);

If foreign shuttle servers ever find a shuttle with an unknown property being used, they will ask to other shuttle servers found in the net for the property definition.

Property values are stored in property sets, of type ShtlPSet. A ShtlPSet is an ordered (by property id) collection of property values so that context switch could be done quickly. To save some more time in context switches, ShtlPSets may be shared. Thus they must be reference counted.

<Off shuttle property set. >=
// A set of property values
class off_ShtlPSet {
private:
  <Off private mem