HP-UX 10.0 Memory Management White Paper

HP 9000 Series 700/800 Computers

                   Printed in U.S.A. January 1995
                           First Edition
                               E0195

LEGAL NOTICES

The information in this document is subject to change without notice.

Hewlett-Packard makes no warranty of any kind with regard to this manual, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Hewlett-Packard shall not be held liable for errors contained herein or direct, indirect, special, incidental or consequential damages in connection with the furnishing, performance, or use of this material.

Warranty. A copy of the specific warranty terms applicable to your Hewlett-Packard product and replacement parts can be obtained from your local Sales and Service Office.

Restricted Rights Legend. Use, duplication, or disclosure by the U.S. Government Department is subject to restrictions as set forth in subparagraph (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 for DOD agencies, and subparagraphs (c) (1) and (c) (2) of the Commercial Computer Software Restricted Rights clause at FAR 52.227-19 for other agencies.

HEWLETT-PACKARD COMPANY
3000 Hanover Street
Palo Alto, California 94304 U.S.A.

Reproduction, adaptation, or translation of this document without prior written permission is prohibited, except as allowed under the copyright laws.

Trademark Notices. UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company Limited.

First Edition: January 1995 (HP-UX Release 10.0)

Memory Management

Memory is high-speed data storage, implemented using various hardware devices on the HP-UX system. Each device stores and retrieves data.

The data and instructions of any process (a program in execution) must be available to the CPU by residing in physical memory at the time of execution. RAM, the actual physical memory (also called "main memory"), is shared by all processes. To execute a process, the kernel executes through a per-process virtual address space that has been mapped into physical memory.

The term "memory management" refers to the rules that govern physical and virtual memory and allow for efficient sharing of the system's resources by user and system processes.

Memory management allows the total size of user processes to exceed physical memory by using an approach termed demand-paged virtual memory. Virtual memory enables you to execute a process by bringing into main memory parts of the process only as needed, that is, on demand, and pushing out parts of a process that have not been recently used.

The system uses a combination of paging and deactivation to manage virtual memory. Paging involves periodically writing pages that have not been recently referenced from main memory to disk. Pages are small, fixed-size units of memory.

Deactivation takes place if the system is unable to maintain a large enough free pool of memory. In such a case, entire processes are deactivated; the pages associated with these processes can be written out by the pager to secondary storage over a period of time. Deactivated processes cannot reference their data.

This paper provides an overview of HP-UX memory management, including:

Physical and virtual memory
Mapping files into virtual memory
Shared libraries
Paging
Deactivation

Physical Memory

The physical memory of most interest to system administrators is the random access memory (RAM). RAM usually consists of memory cards that plug into the computer's backplane. For the CPU to execute a process, the relevant parts of a process must exist in RAM.

The more main memory in the system, the more data the system can access and the more (or larger) processes it can execute without having to page or cause deactivation. This is because the system can retain more processes in main memory, thus requiring the kernel to page less frequently. Memory-resident resources (such as page tables) also take up space in main memory, reducing the space available to applications.

During system startup, the system displays on the system console the amount of physical memory installed:

real mem = no_of_bytes

Refer to the HP Configuration Guide for your system's minimum, maximum, and recommended RAM requirements.

Power and Data Permanence

A characteristic of memory is its volatility or nonvolatility -- whether or not a storage medium retains data when power is removed. Before the system boots, only data or microcode stored in nonvolatile memory (ROM, EPROM, magnetic tape, disk) is available.

Once the system is brought up, data is stored and used in RAM. Data held in RAM is volatile; that is, it is not retained in RAM when the system is brought down.

Transactions between RAM and Disk

At boot time, the system loads the operating system from disk. (This is essentially what booting means.) The operating system then resides in RAM until the system is shut down.

When the user runs programs and commands, they too are loaded from disk into RAM. When a program terminates, it is usually flushed out of RAM (that is, the operating system frees the memory used by the process).

Complex deactivation and paging algorithms determine when data and code for currently running programs will get returned from RAM to disk.

User and system programs write data to disk (for example, to update the password file or write a database record). This data-to-be-written is either written directly to disk (if raw data) or buffered in cache and written to disk in relatively big chunks. Programs also read files and database structures from disk into RAM. Buffering algorithms try to minimize disk access by going to disk as infrequently as possible; disk access is a bottleneck on all systems.

When you issue the sync command before shutting down a system, all modified buffers of the buffer cache are flushed (written) out to disk.

On the Series 800, if the system loses power for a short time and powerfail is enabled, the powerfail routines will put the system in a consistent state and bring it back up without the user having to reboot it.

There are two other characteristics of physical memory involved at system start-up: available memory and lockable memory.

Available Memory

Not all physical memory is available to user processes. The kernel (/stand/vmunix) always resides in main memory (that is, it is never swapped), occupying approximately 3 MB on a Series 700/800 system. (Note, however, that this is a static size only and does not include kernel tables, daemons, device drivers, processes, diagnostics, user interfaces, or other executing code on a working system.)

The amount of main memory not reserved for the kernel is termed available memory. Available memory is used by the system for executing user processes.

Instead of allocating all its data structures at system initialization, the HP-UX kernel dynamically allocates and releases some kernel structures as needed by the system during normal operation. This allocation comes from the available memory pool; thus, at any given time, part of the available memory is used by the kernel and the remainder is available for user programs.

During system startup, the system displays on the system console the amount of available memory:

avail mem = no_of_bytes

Lockable Memory

Physical memory that can be "locked" (that is, its pages kept in memory for the lifetime of a process) by the kernel or plock() is known as lockable memory.

User processes can lock memory using the plock or shmctl system call (see plock(2) or shmctl(2) in the HP-UX Reference). Locked memory cannot be paged and processes with locked memory cannot be deactivated. Typically, locked memory holds frequently accessed programs or data structures, such as critical sections of application code. Keeping them memory-resident improves system performance.

By default, lockable memory can be no more than the system paging threshold. (Refer to the following discussion on "HP-UX Demand-Paged Virtual Memory" for a description of how the paging threshold is determined.) Available memory is a portion of physical memory, minus the amount of space required for the kernel and its data structures. The size of the kernel varies depending on the number of interface cards, users, and values of the tunable parameters; thus, available memory varies from system to system.

Lockable memory is extensively used in real-time environments, like hospitals, where some processes require immediate response and must be constantly available.

If the default value of lockable memory is changed, care must be taken to allow sufficient space for paging and deactivation. As the amount of lockable memory increases, the amount of memory available for paging and deactivation decreases. The existing processes compete for a smaller and smaller pool of memory. The system parameter to change the amount of unlockable memory is unlockable_mem. (See the /usr/conf/master.d/core-hpux file and System Administration Manager (SAM) on-line help for information concerning tunable operating-system parameters.)

During system startup, the system displays on the system console the amount of its lockable memory (along with available memory and physical memory). You can display the values later by running /sbin/dmesg.

lockable_mem = no_of_bytes

lockable_mem is the total amount of physical memory all users can lock down at once. This value can be changed by modifying the operating-system parameter, unlockable_mem.

The system has boundary conditions requiring that unlockable_mem be more than zero and less than the amount of memory available after boot. If unlockable_mem is set to zero or a negative value, the kernel will compute an appropriate default value at boot time.

Available memory minus the memory locked by processes is the memory actually available for virtual memory demand paging.

You can determine the default value of unlockable_mem by subtracting the amount of lockable memory from the available memory during boot-up.

available memory - lockable memory = unlockable_mem

For example, the following boot information was displayed on a Model 835:

        physical page size = 2 KB
        avail mem = 9670656 bytes
        real mem = 16777216 bytes
        lockable mem = 659456 bytes

The system's unlockable memory would be

9670656 - 659456 = 9011200 bytes
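
The subtraction above can be checked mechanically. The following Python sketch parses boot messages in the format shown and derives the default unlockable_mem. The function names and the regular expression are illustrative assumptions and are not part of any HP-UX tool.

```python
import re

def parse_boot_memory(text):
    """Return a dict such as {'avail mem': 9670656} from 'name = N bytes' lines."""
    values = {}
    for name, number in re.findall(r"^\s*([a-z ]+?)\s*=\s*(\d+)\s+bytes", text, re.M):
        values[name] = int(number)
    return values

def default_unlockable_mem(values):
    # available memory - lockable memory = unlockable_mem
    return values["avail mem"] - values["lockable mem"]

boot = """
        avail mem = 9670656 bytes
        lockable mem = 659456 bytes
"""
print(default_unlockable_mem(parse_boot_memory(boot)))  # 9011200
```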

If pseudo-swap reservation (discussed at the end of this paper) is enabled, the system does not allow lockable memory to exceed 3/4 of available memory on a Series 800 and 7/8 of available memory on a Series 700 system. These limits can be altered using the unlockable_mem parameter.

Pseudo-swap affects the amount of lockable memory. As the amount of pseudo-swap used increases, the amount of lockable memory decreases.
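
As a rough illustration of the pseudo-swap limits quoted above, the following Python sketch computes the largest lockable amount for a given available-memory figure. The dictionary and function names are hypothetical, and the sketch ignores any adjustment made through unlockable_mem.

```python
from fractions import Fraction

# Fractions of available memory quoted in the text above.
PSEUDO_SWAP_LOCKABLE_LIMIT = {
    "Series 800": Fraction(3, 4),
    "Series 700": Fraction(7, 8),
}

def max_lockable_bytes(avail_mem, series):
    """Upper bound on lockable memory when pseudo-swap is enabled."""
    return int(avail_mem * PSEUDO_SWAP_LOCKABLE_LIMIT[series])

print(max_lockable_bytes(9670656, "Series 800"))  # 7252992
print(max_lockable_bytes(9670656, "Series 700"))  # 8461824
```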

Secondary Storage

Main memory (RAM) stores computer data required for program execution. During process execution, data resides in two faster implementations of memory found in the processor subsystem, registers and cache. Program files are kept in secondary storage, typically disks accessible either via system buses or network. When data is no longer needed in main memory, it is also stored in secondary storage to make room for active processes.

A transitory form of secondary data storage is termed swap, dating from early UNIX implementations that managed physical memory resources by moving entire processes between main memory and secondary storage. Most modern virtual memory systems today no longer swap entire processes because this method causes the system to spend most of its time processing I/O instead of doing real work. Swapping has been replaced by a deactivation scheme which allows pages to be pushed out over time by a paging mechanism. Paging is a more efficient memory resource management mechanism for virtual memory.

While executing a program, pages of data and instructions can be swapped (copied) to and from secondary storage if the system load warrants such behavior.

Device swap can take the form of an entire disk or LVM logical volume of a disk. (For backward compatibility, a section can be used.)

You can also enable a file system so that remaining space not used for files is used for swap; this is termed file-system swap. If more swap space is required, it can be added dynamically to a running system, as either device swap or file-system swap.

The swapon command is used to allocate disk space or a directory in a file system for swap.

Note: You cannot remove swap without rebooting the system.

The concepts of swap and swap-space management are further discussed in "Swap Space Management," later in this paper. For procedures on allocating swap space, see the HP-UX System Administration Tasks manual.

HP-UX Demand-Paged Virtual Memory

Early UNIX systems transferred only entire processes between the swap device and main memory while executing a program. Complex algorithms governed the priority by which systems moved processes to and from swap. To better accommodate larger programs, HP-UX memory-handling algorithms can also allow pages (individual units of memory) of a process to be read (paged) in and out of memory.

Hardware Perspective on Memory Transactions

Most of the kernel's memory-management code operates independently of hardware.

When a program is compiled, the compiler generates addresses, called virtual addresses, for the code. These virtual addresses must be mapped to physical addresses in memory for the compiled code to execute. User programs use virtual addresses only. The kernel and the hardware coordinate a mapping of these virtual and physical addresses for the CPU, called "address translation." The system uses the addresses to locate the process in memory.

When a process executes, it stores its code (text) and data in processor registers for referencing. If the data or code is not present in the registers, the processor goes out to the cache, a small high-speed memory between RAM and the processor, to fill a cache line. If the data is not in cache, the processor consults the translation tables in RAM that hold the mapping between virtual and physical addresses of the data. If not in RAM, the data and code might have to be paged into RAM from disk, in which case the disk-to-memory transaction must be performed.

A memory management unit (MMU) manages the interface between blocks of virtual address spaces and their physical memory locations. The MMU also ensures that processes do not illegally access each other's address space. Access to data in memory is thus facilitated and protected at a page-by-page level.

An important element of the memory management unit is the translation lookaside buffer (TLB) hardware in the processor. The TLB caches the most recently used virtual-to-physical address translations with their corresponding access information. Once an address is translated, it can be used to reference physical memory.
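
The caching role of the TLB can be sketched as a small cache in front of a page table. The Python model below is a toy under stated assumptions: the class name, the two-entry capacity, and the least-recently-used replacement policy are all hypothetical and do not describe the actual PA-RISC hardware.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KB pages

class TinyTLB:
    """Toy model: a small LRU cache in front of a page table (a dict)."""
    def __init__(self, page_table, capacity=2):
        self.page_table = page_table      # virtual page number -> physical page number
        self.capacity = capacity
        self.entries = OrderedDict()      # cached translations, oldest first
        self.hits = self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)             # mark as most recently used
        else:
            self.misses += 1
            self.entries[vpn] = self.page_table[vpn]  # a KeyError here would model a page fault
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)      # evict the least recently used entry
        return self.entries[vpn] * PAGE_SIZE + offset

tlb = TinyTLB({0: 7, 1: 3, 2: 9})
print(hex(tlb.translate(0x0010)), hex(tlb.translate(0x1004)), hex(tlb.translate(0x0020)))
print(tlb.hits, tlb.misses)  # 1 2
```

The third lookup reuses the cached translation for virtual page 0, so it counts as a hit and avoids consulting the page table.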

It is also interesting to compare the relative magnitude of these memory components. The virtual address space can potentially be a thousand times greater than the physical address space, while the physical address space might be a thousand times greater than the TLB. These relationships enable the CPU to execute programs much larger than the available physical memory. They also let you run many more programs at a time than you could without a virtual memory system.

Process-to-Page Mapping

When a process executes, various dynamic structures are used to locate the memory segments, translate them from virtual to physical addresses, and manipulate them to achieve the results desired. The following figure shows how these structures point from a process to its segment page mappings.

When a process executes, the kernel maintains a proc structure, a data structure for each active process. The proc structure points to a virtual address space, which forms the header of a doubly linked list of several per-process regions, each corresponding to some element of the process (such as code, data, bss, shared memory). In turn, each per-process region points to a region, which points to page mappings of the address space in physical memory.

Subsequent sections of this paper describe the virtual address space and its related structures.

[Figure: Process-to-page mapping. Process-oriented structures: the vas heads a doubly linked list of pregions, each holding p_type, p_count, p_off, and p_reg. System-oriented structures: p_reg points to a region, which holds r_off (offset in file), r_fstore (where to load from "front store"), r_bstore (where to load from "back store"), and the vfd/dbd entries. Each vfd refers to a page of physical memory tracked in pfdat (pages #10 and #20 in the example); each dbd (DBD_FSTORE or DBD_BSTORE) refers to the page's location on disk, in the file system or in swap.]

Vas:

        Points to a doubly linked list of pregions describing what objects 
        and what ranges this process can address.

Pregion:

        p_type = How the data is being used (Text, Data, Shared Memory)
        p_off  = Where in the "Region" this pregion starts at
        p_count= Size of mapping in pages
        p_vaddr= Where the mapping starts.
        p_reg  = Pointer to the region containing Page/Disk information.

Region:

        Contains information on where specific pages are in memory or,
        if a page is not in memory, where the page is on disk.

Note: Permissions are kept at the "Pregion" level so that multiple processes can share the same page with differing permissions.

Note: Multiple Pregions can point at the same region.
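
The relationships among these structures can be sketched in a few lines of Python. This is a simplified model, not the kernel's definitions: the field names follow the legend above, but the Python list standing in for the doubly linked vas list, the tuple encoding of page locations, and the lookup function are assumptions made for illustration.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096

@dataclass
class Region:                 # system-oriented: where each page lives
    # region page index -> ("mem", frame number) or ("disk", block number)
    pages: dict = field(default_factory=dict)

@dataclass
class Pregion:                # process-oriented view of part of a region
    p_type: str               # "text", "data", "shmem", ...
    p_vaddr: int              # where the mapping starts
    p_off: int                # first region page covered by this pregion
    p_count: int              # size of mapping in pages
    p_reg: Region             # region holding the page/disk information

def lookup(vas, vaddr):
    """Walk the pregion list (the vas) and report where the page for vaddr lives."""
    for p in vas:
        index = (vaddr - p.p_vaddr) // PAGE_SIZE
        if 0 <= index < p.p_count:
            return p.p_type, p.p_reg.pages.get(p.p_off + index)
    return None               # address not mapped by this process

shared_text = Region({0: ("mem", 10), 1: ("disk", 20)})
vas_a = [Pregion("text", 0x1000, 0, 2, shared_text)]
vas_b = [Pregion("text", 0x1000, 0, 2, shared_text)]   # second process, same region
print(lookup(vas_a, 0x1004), lookup(vas_b, 0x2000), lookup(vas_a, 0x9000))
```

Both pregion lists point at the same Region object, which mirrors the note above that multiple pregions can point at the same region.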

Pages

Pages are the smallest contiguous block of physical memory that can be allocated for storing data and processes. Pages are also the smallest unit of memory protection. As of the current release, the page size of all HP-UX systems is 4 KB.

A free page is a page of physical memory not mapped to a virtual address; that is, not currently being used by any process or by the system itself. A used page is mapped to a virtual address. The system keeps track of all free and used pages in physical memory.

Every dynamic page in the system is represented by an entry in the kernel's pfdat array data structure. (The system's static kernel text and data, determined during bootup, are not represented in pfdat.) A pfdat entry keeps information for the hardware-independent code on page conditions such as whether the page is free, the use count on the page, and how the page is currently being used. Implementation of the actual page mapping is hardware dependent.

You can lock critical pages of processes in memory to prevent their being paged out by using the system call plock(2) as described in the HP-UX Reference; see "Lockable Memory" earlier in this paper.

Virtual Address Space

Virtual memory uses a structure for mapping processes termed the virtual address space, described in the kernel header file /usr/include/sys/vas.h. The virtual address space contains information and pointers to the memory that the process can reference.

One virtual address space exists per process and serves several purposes:

It provides the overall description of each process.
It contains pointers to another element in the memory-management subsystem -- per-process regions, or pregions.
It keeps track of pregions most recently involved in page faults.

On HP-UX systems, the entire virtual-memory system can handle addresses of 32 bits, allowing processes a maximum virtual address space of 4 Gigabytes.

On the Series 700/800, space registers hold either 16 bits (for 48-bit addressing) or 32 bits (for 64-bit addressing) and are used to point to the virtual space to be accessed. The specific location within that space is specified by a 32-bit quantity called the byte offset.

The present Series 700/800 48-bit addressing implementation can be thought of as having 2 to the 16th 32-bit spaces.
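
The arithmetic behind the 48-bit scheme can be written out as follows. This Python sketch simply concatenates a 16-bit space identifier with a 32-bit byte offset; the function names are hypothetical, and the flat 48-bit integer is only a convenient way to visualize the combination.

```python
SPACE_BITS = 16    # space-register width for 48-bit addressing
OFFSET_BITS = 32   # byte offset within a space

def global_address(space, offset):
    """Combine a space identifier and a byte offset into one 48-bit number."""
    assert 0 <= space < 2**SPACE_BITS and 0 <= offset < 2**OFFSET_BITS
    return (space << OFFSET_BITS) | offset

def split_address(addr):
    """Recover (space, offset) from the combined 48-bit number."""
    return addr >> OFFSET_BITS, addr & (2**OFFSET_BITS - 1)

print(2**SPACE_BITS)                            # number of spaces: 65536
print(2**OFFSET_BITS // 2**30)                  # size of each space in GB: 4
print(hex(global_address(0x12, 0x40001000)))    # 0x1240001000
```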

Configuring Virtual Address Space

Operating-system parameters limit the size of the code, data, stack, and shared-memory segments for each individual process. These parameters have predefined defaults, but you can reconfigure them in the kernel using the procedures outlined in the System Administration Tasks manual. The following list shows configurable system parameters for code, data, stack, and shared memory:

maxtsiz         Limits the size of the code (text) segment.

maxdsiz         Limits the size of the data segment.

maxssiz         Limits the size of the stack segment.

shmseg          Limits the number of shared memory segments
                that can be attached to a process.

shmmni          Limits the number of shared-memory identifiers.

Per-Process Regions (pregions)

The virtual address space structure points to per-process regions, or pregions. Pregions are logical segments that point to specific segments of a process, including code (text, or process instructions), data, u_area and kernel stack, user stack, shared-memory segments, memory-mapped files, and shared-library code and data segments.

Pregions hold page protections and the number of pages mapped to each segment.

Each segment also corresponds to the segments of virtual address space. The text segment (or code segment) holds a process's executable object code and may be shared by multiple processes. The maximum size of the text (or code) segment can be changed by the configurable operating-system parameter maxtsiz.

The data segment contains a process's initialized and uninitialized data structures (bss). It can grow as needed by a program's run-time logic (using sbrk(2), malloc(3C), or malloc(3X), as described in the HP-UX Reference). The total allotment for initialized data, uninitialized data and dynamically allocated memory can be changed by the configurable operating-system parameter maxdsiz.

The u_area contains information about process characteristics. The kernel stack segment, which is in the u_area segment, contains a process's run-time stack during kernel mode and is accessible only by the kernel. Both the u_area and the kernel stack segment are fixed in size.

The user stack segment contains a process's run-time stack during user mode. The user stack expands during process execution; its default maximum size can be changed using the configurable operating-system parameter maxssiz.

Memory-mapped files allow applications to map file data into memory and perform I/O to files through direct loads and stores instead of read() and write() system calls.

Shared memory segments are typically used when multiple processes must share data (for example, in a windowing system, where all window processes must be able to update common data structures). They are created using shmget(2) and attached using shmat (see shmget(2) and shmop(2) in the HP-UX Reference).

Each shared library segment typically has three pregions -- code, initialized data, and bss.

I/O mappings are the ranges of addresses the operating system uses to deal with I/O devices.

When a process executes, the memory management subsystem traverses the pregions to find pointers to another data structure, called a region, which contains the physical page location, essential for retrieving or creating the page.

Per-Process Region Layouts

Per-process regions (pregions) represent the virtual address space in memory.

This section shows the mappings of Series 700/800 virtual address space.

Placement of Segments in Address Space (Series 700/800)

In the Series 700/800, the virtual address space is divided into one-GB quadrants and addressed by units of 32 bits. Each quadrant has several segments associated with it.

The basic segments (quadrants) of the Series 700/800 can be described as follows:

First one-GB quadrant always contains the process's code and sometimes some of the data.

Text is addressed from 0x00000000 to 0x3FFFFFFF.

Second quadrant contains data (static data, stack, and heap, and private memory-mapped files).

Data is addressed from 0x40000000 to 0x7FFFFFFF.

Third quadrant contains shared memory, shared mapped files, and shared library code.

Addressed from 0x80000000 to 0xBFFFFFFF.

Fourth quadrant contains shared-memory segments, shared memory-mapped files, and shared library code.

Addressed from 0xC0000000 to 0xFFFFFFFF.

On PA-RISC architecture (both Series 700 and 800), addresses from 0xF0000000 to 0xFFFFFFFF are used for I/O space.
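
Because each quadrant is exactly 1 GB, the top two bits of a 32-bit address select the quadrant. The following Python sketch classifies an address using the ranges listed above; the function name and the wording of the descriptions are illustrative assumptions.

```python
QUADRANTS = [
    "text (and sometimes data)",
    "data, stack, heap, private memory-mapped files",
    "shared memory, shared mapped files, shared library code",
    "shared memory, shared mapped files, shared library code",
]
IO_SPACE_START = 0xF0000000

def describe(addr):
    """Return (quadrant number 1-4, typical contents) for a 32-bit address."""
    assert 0 <= addr <= 0xFFFFFFFF
    quadrant = addr >> 30               # top two bits select the 1 GB quadrant
    contents = "I/O space" if addr >= IO_SPACE_START else QUADRANTS[quadrant]
    return quadrant + 1, contents

for a in (0x00001000, 0x7FFFFFFF, 0x80000000, 0xF0000010):
    print(hex(a), describe(a))
```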

An EXEC_MAGIC user executable (a.out) format allows data to start immediately after the code area in the first quadrant, instead of at the beginning of the second quadrant, and grow to the beginning of the user stack. Executables created with this format can handle more than 1 GB of data. However, EXEC_MAGIC executables cannot share text; the text thus requires swap space. Refer to ld(1) in the HP-UX Reference for information.

Regions

Regions are data structures unique to the file or chunk of memory being accessed. Regions represent ranges of addresses and inform the process about where the data exists in physical memory.

Each process segment's per-process region maps directly to a segment of memory in a region. Regions are associated with the system rather than a process.

A region might be private or shared. A private region (such as a process's data segment, u_area, or stack) has only one pregion pointing to it. A shared region (such as shared memory, code, graphics) can have more than one pregion point to it, because more than one process might be accessing the same segment of memory.

Virtual Nodes (vnodes)

A region usually references a virtual node (vnode), an interface for reading and writing pages of data between memory and disk. Vnodes consist of file-system-independent data and operations that the kernel uses to keep track of swapped or paged memory while it is out on disk. HP-UX performs the VOP_* operations defined in /usr/include/sys/vnode.h to extract information about the file.

Kernel routines read the stored file type when:

Bringing requested and additional pages in from disk.
Writing pages out to disk.
Giving page information to the kernel for later computation.

Memory-Mapped Files

Memory-mapped files allow applications to map file data into a process's virtual address space. Once mapped, the process can directly manipulate the file data as a portion of memory. File data can be shared between processes more efficiently, resulting in higher I/O throughput for processes. Memory-mapped files are always page aligned. I/O to the files can be performed through direct loads and stores of memory instead of using read(2) and write(2), avoiding copying of data between user and system buffers. Memory-mapped files provide identical representation of in-memory and on-disk data. The data can be selectively synchronized to flush out a portion of a file range.

Files can be mapped into memory as either private or shared. Modifications to a private mapping are not visible to other processes, regardless of whether the other processes have created a private or shared mapping, or perform a read or write of the same file. The modifications are never written back to the file. Modifications to a shared mapping are visible to all other processes with a shared mapping and the modifications are written back to the disk file.
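
The difference between private and shared mappings can be demonstrated with a short experiment. The sketch below uses Python's standard mmap module on a modern system rather than the HP-UX 10.0 interface itself; ACCESS_COPY is used as the private (copy-on-write) mapping and ACCESS_WRITE as the shared mapping. The helper function names are hypothetical.

```python
import mmap
import os
import tempfile

def write_through_mapping(path, access):
    """Map the file, overwrite its first byte with 'X' in memory, then unmap."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=access) as m:
            m[0:1] = b"X"

def first_byte(path):
    with open(path, "rb") as f:
        return f.read(1)

fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

write_through_mapping(path, mmap.ACCESS_COPY)    # private mapping: file is untouched
after_private = first_byte(path)
write_through_mapping(path, mmap.ACCESS_WRITE)   # shared mapping: file is updated
after_shared = first_byte(path)
os.remove(path)
print(after_private, after_shared)
```

The modification made through the private mapping never reaches the file, while the same modification made through the shared mapping does.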

Memory-mapped files are well suited to randomly dispersed I/O, such as in database applications with data that is most frequently read, and which require frequent, small, random updates. Likewise, multiple processes that read huge data files can use memory-mapping for repeatedly searching and scanning the same data.

If the access pattern for a memory-mapped file is known, madvise can be used to inform the system how the I/O should occur. It can be either

random I/Os to read single pages or
sequential-access I/Os to read ahead.

Several system calls manipulate memory mappings to a file. Also, user-level semaphores interface with the memory-mapped regions:

mmap            Create a mapping from a range in the process's
                address space to a range in a file.

mprotect        Modify the protections on memory pages mapped to a
                file.

munmap          Unmap a mapping created by mmap.

msync           Synchronize memory pages of a mapped file back to disk.

madvise         Advises the system of the expected access pattern for
                the range specified.

msem_init       Initializes a semaphore in a memory-mapped region.

msem_lock       Locks a semaphore in a memory-mapped region.

msem_unlock     Unlocks a semaphore in a memory-mapped region.

msem_remove     Removes a semaphore in a memory-mapped region.

For detailed information on these system calls, refer to section two of the HP-UX Reference.

Limitations to Memory Mapping

Memory-mapped files cannot be locked into memory. Memory-mapped files map file data into main physical memory, not a buffer cache. Because of this, as the total space consumed by memory-mapped files (mmap) increases, so does the amount of physical memory occupied in the system. This could result in memory pressure and some degradation of performance.

Access to the same file by read and write system calls, on the one hand, and memory mapping, on the other hand, might produce inconsistent data, since both approaches maintain separate in-memory caches of file data. Use one access method exclusively for any given application.

The mmap interface does not map character device files and device-specific addresses (such as graphics framebuffers). Use framebuf(7) for graphics devices on the Series 700/800.

The Series 700/800 memory-mapped file interface is defined to work in a 32-bit process address space and cannot share data with processes outside this address space. The following limits apply to the HP-UX implementation:

One GB is the maximum size of a single file that can be mapped entirely into memory.
1.75 GB is the maximum combined size of all files mapped shared and shared memory, by all processes on a system.
Only one fixed, contiguous mapping can exist of a shared mapped file in the shared virtual address space.

How the Kernel Executes Processes Using Demand Paging

A compiled program has a header containing information on the size of the data and code regions.

As a process is created from the compiled code (by fork and exec), the kernel sets up its data structures and the process starts executing its instructions from user mode.

A page fault occurs when the process tries to access an address that is not currently in main memory.

The kernel switches execution from user mode to kernel mode and tries to resolve the page fault by locating the pregion containing the sought-after virtual address. The kernel then uses the pregion's offset and region to locate information needed for reading in the page.

In main memory, the kernel also looks for a free page in which to load the requested page. If no free page is available, the system swaps or pages out selected used pages to make room for the requested page.

The kernel then retrieves (pages in) the required page from file space on disk. It also often pages in additional (adjacent) pages that the process might need.

Then the kernel sets up the page's permissions and protections, and exits back to user mode. The process executes the instruction again, this time finding the page and continuing to execute.

Pages are not loaded in memory until they are "demanded" by a process -- hence the term, demand paging.
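
The fault-handling sequence above can be modeled in a few lines of Python. This is only an illustrative sketch, not kernel code: the class name DemandPager, the dictionary standing in for file space on disk, and the first-in-first-out choice of the page to evict are all assumptions made for the example.

```python
from collections import OrderedDict


class DemandPager:
    """Toy model of demand paging; not the HP-UX kernel implementation."""

    def __init__(self, backing_store, num_frames):
        self.backing_store = backing_store   # page number -> contents "on disk"
        self.num_frames = num_frames         # physical pages available
        self.resident = OrderedDict()        # pages currently in main memory
        self.faults = 0

    def access(self, page):
        """Return the page contents, faulting the page in if necessary."""
        if page not in self.resident:
            self.faults += 1                          # page fault
            if len(self.resident) >= self.num_frames:
                self.resident.popitem(last=False)     # evict oldest (assumed FIFO)
            self.resident[page] = self.backing_store[page]   # page in
        return self.resident[page]
```

In this model, a page costs nothing until it is first touched; with two frames, touching pages 0, 1, 0, 2 in that order produces three faults and leaves pages 1 and 2 resident.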

copy-on-write

An HP-UX enhancement to earlier UNIX implementations is the technique of "copy-on-write." The system used to copy the entire data segment of a process every time the process forked, increasing fork time as the size of the data and code segments increased.

On the Series 700/800, HP-UX implements copy-on-write, which enables the system to create processes more quickly. Only one translation of a physical page is maintained; a parent process can point to and read a physical page, but copies it only when writing on the page. The child process does not have a page translation and must copy the page for either read or write access.

copy-on-access

By definition, copy-on-write has two virtual pointers to a common physical page, each pointer having read-only access. Restrictions in the current HP-UX model allow only one virtual address to a given physical page. Because of this, the HP-UX copy-on-write implementation cannot create two virtual pointers to one physical page. Instead, HP-UX implements copy-on-access for the child process. With copy-on-access, there is no translation for the virtual pointer, so the page is copied on any reference (read or write), rather than only on a write operation, as is the case with true copy-on-write.
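
The difference between the two policies can be made concrete by counting how many pages a child process ends up copying for a given access pattern. The sketch below is a simplified model under stated assumptions: the function name pages_copied, the policy strings, and the ("r" or "w", page) access tuples are hypothetical and exist only for this illustration.

```python
def pages_copied(accesses, policy):
    """Count pages a child copies under a given policy.

    accesses -- iterable of (op, page) with op "r" (read) or "w" (write)
    policy   -- "copy-on-write" or "copy-on-access" (names assumed here)
    """
    if policy not in ("copy-on-write", "copy-on-access"):
        raise ValueError("unknown policy: " + policy)
    copied = set()
    for op, page in accesses:
        # copy-on-access copies on any reference; copy-on-write only on a write
        if policy == "copy-on-access" or op == "w":
            copied.add(page)
    return len(copied)
```

For a child that reads page 0, writes page 1, and reads pages 1 and 2, the model copies one page under true copy-on-write and three pages under copy-on-access.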

Maintaining Page Availability -- vhand and swapper Daemons

Two daemons (background processes) -- vhand and swapper -- are involved in paging. The actual paging is performed by vhand, which is also known as the pageout daemon.

vhand daemon

vhand monitors free pages and tries to keep their number above a threshold to ensure sufficient memory for the most efficient operation of demand paging.

The pageout daemon utilizes a "two-handed" clock algorithm. The first hand clears reference bits on a group of pages in an active pregion. If the bits are still clear by the time the second hand reaches them, the pages are paged out.

The kernel automatically keeps an appropriate distance between the hands, based on the available paging bandwidth, the number of pages that need to be stolen, the number of pages already scheduled to be freed, and the frequency with which vhand runs.
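
A single sweep of a two-handed clock can be sketched as follows. This is a toy simulation, not the vhand source: the function name, the fixed gap between the hands, and the touches map (which stands in for process activity between the hands) are assumptions for illustration. The real kernel adjusts the distance dynamically, as described above.

```python
def two_handed_scan(num_pages, gap, touches=None):
    """Simulate one sweep of a two-handed clock over pages 0..num_pages-1.

    gap     -- distance (in pages) between the first and second hand
    touches -- hypothetical map: step -> pages referenced during that step
    Returns the pages the second hand would select for pageout.
    """
    touches = touches or {}
    referenced = [True] * num_pages       # assume every page starts referenced
    stolen = []
    for step in range(num_pages + gap):
        if step < num_pages:
            referenced[step] = False      # first hand clears the reference bit
        for page in touches.get(step, ()):
            referenced[page] = True       # process activity sets the bit again
        back = step - gap                 # position of the second hand
        if back >= 0 and not referenced[back]:
            stolen.append(back)           # bit still clear: page it out
    return stolen
```

A wider gap gives each page more time to be referenced again before the second hand arrives, so fewer pages are paged out per sweep.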

The vhand daemon decides when to start paging by determining how much free memory is available. Once free memory drops below the "paging threshold," paging occurs. The paging threshold, known as gpgslim, adjusts dynamically according to the needs of the system. It oscillates between an upper bound called lotsfree and a lower bound known as desfree. Both lotsfree and desfree are calculated at system boot time and are based on the size of system memory.

swapper daemon

The swapper daemon's threshold is minfree. If free memory falls below minfree, or if thrashing is detected, the swapper also becomes active. (See the next section for an explanation of thrashing.) The swapper deactivates processes and prevents them from running, thus reducing the rate at which new pages are accessed. Once swapper detects that available memory has risen above minfree and the system isn't thrashing, swapper reactivates the deactivated processes and continues monitoring memory availability.
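
The relationship between the thresholds can be summarized in a small decision function. This is a sketch only: the function names are hypothetical, the page counts in any call are made-up values, and the way gpgslim actually moves between its bounds is not modeled beyond clamping it to the range desfree..lotsfree.

```python
def clamp_gpgslim(proposed, desfree, lotsfree):
    """Keep the paging threshold between its lower and upper bounds."""
    return max(desfree, min(lotsfree, proposed))


def active_daemons(free_pages, gpgslim, minfree, thrashing=False):
    """Return which daemons this model considers active."""
    active = set()
    if free_pages < gpgslim:
        active.add("vhand")       # below the paging threshold: page out
    if free_pages < minfree or thrashing:
        active.add("swapper")     # deactivate processes
    return active
```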

Thrashing

On systems with very demanding memory needs (for example, systems that run many large processes), the paging daemons can become so busy swapping pages in and out that the system spends too much time paging and not enough time running processes.

When this happens, system performance degrades rapidly, sometimes to such a degree that nothing seems to be happening. At this point, the system is said to be thrashing, because it is doing more overhead than productive work.

Consider, for example, that your system might be thrashing because it is handling a lot of disk activity. Perhaps a database is in constant use, and swap space resides on the same disk. (On the Series 800, you can check system activity using the sar(1) command; see the manual page in the HP-UX Reference.) If so, you can minimize disk contention by moving a busy file system to a different disk, so that you are distributing busy disk activity among different spindles.

Another way to control thrashing is to use the serialize command. All processes marked via the serialize command will run serially with other processes marked with the same command. The serialize command addresses the problem caused when a group of large processes all try to make forward progress at once, which degrades throughput. In such a case, each process constantly faults on its working set, only to have the pages stolen when another process starts running. By using the serialize command to run large processes one at a time, the system can make more efficient use of the CPU as well as system memory.

In many cases, even this approach is insufficient. Thrashing is often eliminated by adding more main memory to the system, thus alleviating memory pressure. This reduces the amount of time the system spends paging and swapping.

How the Memory-Management System Handles Executable Code

The speed at which processes execute is related partly to how the operating system accesses virtual address space segments of the compiled and linked object code (a.out or other executables).

If you are doing applications or system programming, you may be aware that a program's magic number and internal attributes determine which types of executable code -- standard, shared, and demand-loaded -- are possible. (A later subsection, "Benefits and Shortcomings of Shared and Demand-Loaded Code," discusses this in more practical terms. Also see magic(4).)

The following HP-UX Reference manual pages describe magic numbers and internal attributes in detail:

The chatr(1) command is used to change a program's internal attributes to shared or demand-loaded.
The link editor ld(1) produces executable files from one or more object files or libraries.
magic(4) describes predefined file types and magic numbers for HP-UX implementations.
a.out(4) and its Series 700/800-specific manual pages describe the output file format from the assembler (as(1)), compilers, and link editor.

For Series 700 and Series 800 computers, executable code can be standard, shared, or demand-loaded.

Subsequent sections describe the types of executable code by several criteria:

Addressing in separate segments of code and data.
Capability of virtual address code segments being shared among multiple processes.
Alignment of pages to corresponding address boundaries both on disk and in main memory, for direct memory-to-memory copy. (Object code should align on 4 KB page boundaries for all HP-UX systems.)

Page size was 2 KB for some previous Series 800 releases. Executables generated during these releases will be handled properly by the current release, but with some performance penalty. You might want to recompile for better performance.

Characteristic Types of Executable Code

This section describes the three types of executable code, including page alignment and memory features.

EXEC_MAGIC (supported on Series 700/800)

                        - 4 KB-page aligned on disk
                        - combined segments
                        - code and data writable, not shared
                        - page aligned
                        - code loaded entirely at execution

SHARE_MAGIC

                        - 4 KB-page aligned on disk
                        - separate segments for code and data
                        - code is read-only
                        - code is shared
                        - page aligned in memory
                        - code and data faulted in as needed

                  Programs built on the Series 800 can be executed on
                  the Series 700. Note, too, that executables generated
                  on previous Series 800 operating systems featuring a
                  2 KB page size will run, but with some performance
                  penalty. You might want to recompile for better
                  performance.

DEMAND_MAGIC

                        - 4 KB-page aligned on disk
                        - separate segments for code and data
                        - code is read-only
                        - code is shared
                        - page aligned in memory
                        - code and data faulted in as needed

Standard Executable Code (EXEC_MAGIC)

In this traditional UNIX implementation (now rarely used in HP-UX), compiled object files consist of code (machine instructions) and data that occupy the same area of memory, with read, write, and execute permissions. Each time a common program like vi, for example, runs, a distinct copy of its code is read into main memory. The code is not shared, even when several vi processes are run simultaneously.

Because the vi code occupies segments with write permission, it is vulnerable to being overwritten. File header, code, data, and debugger code are not aligned in main memory with their corresponding page boundaries on disk, and so all code has to be read through a buffer cache before being copied to the user's address space. This slows the translation.

The EXEC_MAGIC user-executable (a.out) format allows creation of processes with more than 1 GB of data. This expanded data area is implemented with a new option to the linker; refer to ld(1) in the HP-UX Reference for information.

Several factors contribute to the efficiency of EXEC_MAGIC. Pages are aligned at 4 KB, improving address translation. Text for processes greater than 1 GB is relatively small, compared to data; thus, private code does not pose a problem. Also, when processes are greater than 1 GB, normally only one of them is run at a time.

Shared Code (SHARE_MAGIC)

Sharing of code segments reduces the amount of code kept in main memory. With shared code, the code and data segments are separated, so that code can be read-only and data can have write permission. This protects code from being overwritten.

Unlike EXEC_MAGIC, whose text is backed by swap, shared code requires no swap space because it cannot be modified.

When several processes run the same program simultaneously, the processes use the same copy of code in main memory via pointers to the code's virtual address space. Only one copy of the code exists in memory, regardless of how many processes run the program. The system keeps track of multiple processes sharing code by maintaining a use count. These mechanisms decrease dramatically the amount of memory required for each user's process space.

For example, when a shared program such as vi is first loaded into the user code area, the use count for the program is set to one. While the first process is executing vi, if another process invokes vi also, no additional memory is allocated because the code already resides in main memory. The new process merely executes the copy of vi's code residing in main memory; the shared-code segment use count is incremented from one to two. When one of these processes finishes executing the code, the use count is decremented.

As long as the use count is greater than zero, the shared code remains in memory. When the last process finishes editing and terminates the vi program, however, the system decrements the use count to zero and releases vi's shared code data structure and its associated physical memory.
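
The use-count bookkeeping described in the vi example can be sketched as a small class. The names SharedText, attach, and detach are hypothetical; the sketch models only the counting, not the kernel data structures that hold the shared code.

```python
class SharedText:
    """Toy model of a shared code segment tracked by a use count."""

    def __init__(self, program):
        self.program = program
        self.use_count = 0
        self.in_memory = False

    def attach(self):
        """A process starts running the program."""
        if self.use_count == 0:
            self.in_memory = True     # first user: code is brought into memory
        self.use_count += 1

    def detach(self):
        """A process finishes running the program."""
        if self.use_count == 0:
            raise RuntimeError("no process is using this code")
        self.use_count -= 1
        if self.use_count == 0:
            self.in_memory = False    # last user gone: memory is released
```

Two attach calls followed by one detach leave the use count at one and the code still in memory; the second detach releases it.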

Shared code was also designed to facilitate page alignment between main memory and disk, although pages are not guaranteed to be aligned.

Demand-Loaded Code (DEMAND_MAGIC)

For the Series 700/800, DEMAND_MAGIC code behaves identically to SHARE_MAGIC, despite a difference in the magic number.

Demand-loaded code encompasses some of the advantages of shared code and provides additional optimizations. (See comparison of shared and demand-loaded code in the next section.)

Like shared code, demand-loaded code addresses code and data separately. Demand-loaded code is also shared; only one copy of code need be in main memory for use by multiple processes.

When a user runs a demand-loadable program, pages of the program are read into memory only as required. The system also anticipates (from prior page usage) what subsequent pages might be required, and brings in additional pages.

Demand loading eliminates the need to allocate main memory to rarely accessed routines and code, such as error handling routines, which in some instances might account for a large percentage of a program's code.

Another demand-loaded code enhancement is guaranteed page alignment. Guaranteed page alignment is based on a loading algorithm simpler for the system to implement. One-to-one mapping between paging device and main-memory pages allows for direct disk-to-memory transfer without an intermediary file-system buffer cache. Individual pages can also be copied faster.

When an executing process faults on a page, the kernel looks in the page cache (a data structure that associates in-core pages with file-system data; for example, a text page). If used by a recent process, the sought-after page is likely to be present in physical memory. If the page is present, the kernel can use the in-core copy rather than reading the page from disk (a faster operation). If the page is not present, the kernel determines the location of the page and reads it in from disk.
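
The page-cache lookup can be pictured as a dictionary keyed by file and page. The sketch below is a simplified model: the class name PageCache, the (file_id, page_index) key, and the read_from_disk callback are assumptions chosen for the example, not the kernel's actual structures.

```python
class PageCache:
    """Toy page cache associating in-core pages with file data."""

    def __init__(self, read_from_disk):
        self.read_from_disk = read_from_disk   # callable(file_id, page_index)
        self.pages = {}                        # (file_id, page_index) -> contents
        self.disk_reads = 0

    def fault(self, file_id, page_index):
        """Resolve a fault, using the in-core copy when one exists."""
        key = (file_id, page_index)
        if key not in self.pages:
            self.disk_reads += 1               # miss: read the page from disk
            self.pages[key] = self.read_from_disk(file_id, page_index)
        return self.pages[key]                 # hit: no disk I/O needed
```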

Most executable code is now page aligned on 4KB page boundaries. Exceptions are executables that were compiled on Series 800 systems that were 2 KB-page aligned. When faulting in pages for executables whose code is not page aligned, pages must be copied in through the file-system buffer cache, a slower method than mapping through the page cache.

Benefits and Shortcomings of Shared and Demand-Loaded Code

If you work in an environment where applications are written in-house, you have some choice about how to link the object code for optimal performance. You might consider the following questions:

How is the program linked by default?
How do you expect the process's pages to be used?
- At random?
- Serially?
Are your programs running on a system with ample or limited memory?
Are there many or few users?

If an application program is running more slowly than expected, using chatr(1) or relinking to demand-loaded might improve its execution time.

Comparison of Shared and Demand-Loaded Code

The following points might help you determine which kind of executable to use:

        * Most programs are shipped as shared code by default. You can
          tell how the code is shipped by running the file command on
          the executable, for example:

                % file /usr/bin/cat
                /usr/bin/cat: s800 shared executable

        * Demand-loaded code gives flexibility; you can relink user
          code with ld(1) or mark programs demand-loaded with chatr(1).

        * Both shared code and demand-loaded code reduce the amount of
          memory required for user code space when multiple processes
          execute the same program.

        * For both shared code and demand-loaded code, less memory is
          required to run programs if only a subset of their pages are
          used, because pages are loaded only as needed.

Shared Libraries

A shared library is a collection of commonly used subroutines that resides in one shared location in memory and can be invoked dynamically at run time by programs that need it.

The libc, libm, libM (the POSIX-conformant math library), as well as some X and Starbase libraries are provided in the /usr/lib directories as shared libraries. The dynamic loader has two parts -- /usr/lib/dld.sl and /usr/lib/libdld.sl. For information on the dynamic loader, consult the manpage dld.sl(5).

Shared libraries are distinguished from archived libraries by suffix; for example, the archive form of the libc library is designated libc.a, whereas its shared-library counterpart is designated libc.sl.

A shared library reduces the amount of memory occupied by code during execution because only one copy of its routines exists in memory.

When you include a call to an archived library, on the other hand, the library code is copied to the executable file of the process at link time. Thus, each executable using a routine has a copy of that routine both on disk and in memory when running.

An example of shared versus archived libraries is the use of printf in libc. A program using archived libraries copies the printf code into its executable at link time. With shared libraries, libc.sl is loaded once, as a single entity. Only stubs to printf's offset within libc.sl are added to the address space of the user program. The calling program maps to printf at run time.

Processes using shared libraries no longer require their own copy of the library. All processes calling the same library use the same image of that library, which is dynamically loaded and linked into executing processes at run time. Even if several concurrent processes use the same shared library code, only one copy of the code exists in core memory. The memory image of a shared library is shared among all programs using that library.

How Shared Libraries Are Dynamically Loaded

Shared libraries represent a change in the behavior of the compiled object-file code that gets executed before main(). That code (crt0.o) invokes the dynamic loader (dld.sl), which consults a table stored at the beginning of the process's code segment to determine what libraries to load. The executing program attaches all shared libraries specified on the linker command line. The library source is compiled as position-independent code. The object modules of this position-independent code are combined by the linker to form an object file given the extension .sl, for shared library.

When loaded into a system, shared libraries reside in main memory. Like demand-loaded code, only one physical copy of a shared library exists in memory; code and data are faulted in as needed. Library code is shared among all processes using that library. A private copy of library data is allocated for each process.

On attaching a shared library, the start-up code maps the shared library code and data into the process' address space, relocates any pointers in the shared library data that depend on the actual virtual address, allocates BSS, and binds all references into and out of the shared library by filling in linkage table entries in the program and the shared library.

To accelerate the program startup process, binding is normally deferred until the program actually calls the shared library routine. Each linkage table entry is initialized with a pointer to the dynamic loader. The first reference to each routine causes the dynamic loader to intercept the call and bind the reference to that routine. This deferred binding distributes the cost of symbol table lookup across the execution time of the program, and is especially useful if the program contains many references not likely to be executed.

The user can specify at link time that immediate binding be used. (On the Series 700/800, the chatr command has also been enhanced to allow you to specify binding mode without relinking.) Immediate binding causes all references to be bound at start-up time; thus, the cost of symbol table lookup is taken at startup rather than spread across the execution of the program. Immediate binding detects unresolved symbols at load time, rather than during the execution of the program. The default is deferred binding.

As an example of how the dynamic loader (dld.sl) handles processes, let's look at what happens when a user invokes sh. When sh is invoked, a process is created with multiple stubs pointing to dld.sl. For example, there is a stub for printf(3S) pointing to dld.sl. (Linkage tables are used as well as stubs.) When sh requires printf, the process first uses its stub to dld.sl, which does the work of establishing the pointer to printf and replacing its own stub with a stub to printf. Thereafter, when the process needs printf, a stub is in place to go right to printf, without having to point back to dld.sl.
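
Deferred and immediate binding can be contrasted with a toy linkage table. This sketch is not how dld.sl is implemented: the class LinkageTable, the dictionary standing in for a shared library's symbol table, and the lookups counter are all hypothetical, introduced only to show when the symbol lookup cost is paid.

```python
class LinkageTable:
    """Toy linkage table; 'library' is a dict of name -> callable."""

    def __init__(self, library, names, immediate=False):
        self.library = library
        # None means the entry still points at the dynamic loader.
        self.entries = {name: None for name in names}
        self.lookups = 0
        if immediate:
            for name in names:
                self._bind(name)      # unresolved symbols fail here, at load time

    def _bind(self, name):
        self.lookups += 1             # cost of one symbol table lookup
        if name not in self.library:
            raise LookupError("unresolved symbol: " + name)
        self.entries[name] = self.library[name]

    def call(self, name, *args):
        if self.entries[name] is None:
            self._bind(name)          # first call: loader binds the reference
        return self.entries[name](*args)
```

With deferred binding, a symbol that is never called is never looked up, so a missing routine goes unnoticed until (and unless) it is used. With immediate=True, the same missing routine raises an error as soon as the table is built.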

Shared Libraries vs. Archived Libraries

Occasionally, you may prefer to use an archived library when you link your application, in order to insulate your application from future modifications to the library. You can prevent a library change from adversely affecting your application by linking in the archived version of the library.

Usually, however, an application program benefits from updates to a library. Bug fixes and enhancements to a shared library result in automatic bug fixes and enhancements to a program that uses the library, and new features in a library should not affect an existing application. You can also use version numbers described in Programming on HP-UX to ensure that you execute a particular version when you do use shared libraries.

Another reason for using an archived library would be that the program will execute on another system that does not have the shared library. A program that uses a shared library cannot stand alone; the shared library must be present at run time. (An environment variable, SHLIB_PATH, enables you to specify a search path for shared libraries; see Programming on HP-UX.) By copying library code from an archive library into your program, you can prepare a program that is independent of any other system file.

Another difference between shared and archive libraries is performance. For object code to be shared, it must be position-independent; that is, the library cannot know where its code or data will reside while it is running, since different programs may place the library code and data at different addresses. (Text might vary the first time the library is loaded and stay fixed until the last reference is gone; data will be at a different address for each program.)

In most cases, the performance improvement resulting from using shared code and smaller executable files will outweigh the penalties of position-independent code. However, if your library will be used by only one running application at a time, it might perform better using an archive library.

Administering Shared Libraries

Shared (not archived) libraries are the default implementation shipped. Using an archived library when a shared version exists requires a special linker option.

Consider carefully the path of shared libraries in the file-system tree. Unless the environment variable SHLIB_PATH is used, the shared library must be found at run time at the same location (that is, at the same pathname) as it is found at link time, or the program will not run.

Be very careful how you use /usr/lib, /usr/local/lib, and /usr/contrib/lib. Mistakes can be harrowing!

You can NFS mount /usr/local/lib, but consider that all programs that use shared libraries from that directory will execute a little bit more slowly. (However, once the shared libraries are used, they remain in memory for a while; the NFS mount will then degrade performance only minimally.)

For security, /usr/lib should never be writable by all. If /usr/local is writable by all, root should not be executing any programs that use shared libraries from /usr/local/lib. This precaution will prevent a potential security problem: a malicious user could replace a library in that directory, thereby compromising any program using it.

Missing or incorrect critical shared libraries can be detected during the boot procedure. If this occurs, the system administrator can recover the libraries using a backup tape.

For a more thorough discussion of shared libraries, refer to Programming on HP-UX.

HP-UX Memory-Management Features

Besides providing fundamental support for virtual memory, HP-UX provides these important features:

Shared memory for high-bandwidth interprocess communication (refer to shmget(2), shmat(2), and shmctl(2) in the HP-UX Reference).
On Series 700 systems, device mapping for mapping physical addresses into virtual address space. This allows direct access to I/O devices (refer to iomap(7) and graphics(7) in the HP-UX Reference).
Process locking for locking all or part of the user process space for real-time application needs (refer to plock(2) in the HP-UX Reference).
Memory-mapped files allow applications to map file data into memory and perform I/O to files through direct loads and stores of memory instead of read and write. (Refer to mmap(2) in the HP-UX Reference.)
Batch scheduling for large processes. (Refer to serialize(2) in the HP-UX Reference.)
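
As an illustration of the memory-mapped file feature in the list above, the following sketch updates a file by storing into a mapping rather than calling write. It uses Python's standard mmap module, which wraps the platform's mapping facility; the helper names and the temporary file are assumptions for the example, and nothing here is specific to HP-UX. The result is read back through a second mapping, in keeping with the advice to use one access method for file data.

```python
import mmap
import os
import tempfile


def store_through_mapping(path, offset, payload):
    """Overwrite bytes in a file by storing into a shared mapping of it."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mapping:    # length 0 maps the whole file
            mapping[offset:offset + len(payload)] = payload   # a store, not write()
            mapping.flush()                           # push the change to the file


def load_through_mapping(path):
    """Return the file contents by loading from a read-only mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            return mapping[:]


def demo():
    """Create a scratch file, modify it through a mapping, and read it back."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello world")   # give the file a size before mapping it
        os.close(fd)
        store_through_mapping(path, 0, b"HELLO")
        return load_through_mapping(path)
    finally:
        os.remove(path)
```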

Swap Space Management

Swap space is an area on a high-speed storage device (almost always a disk drive), reserved for use by the virtual memory system for process deactivation and paging.

There are two types of swap space, device swap space and file-system swap space, both of which can be configured using SAM or the swapon(1M) command.

        * Device swap space resides in its own reserved area -- an
          entire disk, a section, or a logical volume of an LVM disk.
          Device swap is faster than file-system swap.

          On a system implementing LVM, the swap logical volume can be
          increased as needed. (With traditional disk partitions,
          device swap space was fixed in size.)

          At least one swap device (primary swap) should be present on
          the system.

        * File-system swap space is located on a mounted file system.
          File-system swap is slower, but varies in size with the
          system's swapping activity.

HP-UX swap-space management allows you to allocate swap as needed (that is, dynamically) while the system is running, without having to regenerate the kernel.

Total available swap on a system consists of all swap space available on all devices and file systems enabled as swap. The swapping subsystem reserves swap space at process creation time, but does not allocate swap space from the disk until pages need to go out to disk. Reserving swap at process creation protects the swapper from running out of swap space.
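
The distinction between reserving swap and allocating it can be sketched as two counters. The class SwapPool and its method names are hypothetical, and sizes are in arbitrary page units; the sketch shows only why reservation at process creation lets a shortage be reported early, before any disk space is consumed.

```python
class SwapPool:
    """Toy model: swap is reserved at process creation, allocated at pageout."""

    def __init__(self, total_pages):
        self.total = total_pages
        self.reserved = 0      # promised to processes when they were created
        self.allocated = 0     # actually written out to disk so far

    def reserve(self, pages):
        """Called at process creation; fails early if swap would run out."""
        if self.reserved + pages > self.total:
            raise MemoryError("not enough swap to reserve")
        self.reserved += pages

    def page_out(self, pages):
        """Disk space is consumed only when pages really go out."""
        if self.allocated + pages > self.reserved:
            raise ValueError("cannot page out more than was reserved")
        self.allocated += pages
```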

A variation on traditional swapping, called pseudo-swap reservation, allows users to execute processes in memory without allocating physical swap. Pseudo-swap reservation is described at the end of this paper.

Swap Parameters

HP-UX deals with swapping in terms of several parameters:

swchunk         The number of DEV_BSIZE blocks in a unit of swap space;
                by default, 2 MB on all systems.

maxswapchunks   Maximum number of swap chunks allowed on a system.

swapmem_on      Parameter allowing creation of more processes than you
                have swap space for.  (See "Pseudo-Swap Reservation,"
                at the end of this discussion.)
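
The arithmetic relating these parameters can be written out as follows. This is a sketch under stated assumptions: DEV_BSIZE is taken to be 1024 bytes (which is consistent with a default chunk of 2048 blocks being 2 MB), and any maxswapchunks value passed in is purely illustrative rather than a documented default.

```python
DEV_BSIZE = 1024     # bytes per block; assumed, consistent with the 2 MB default
SWCHUNK = 2048       # blocks per swap chunk (2048 blocks of 1 KB = 2 MB)


def chunk_bytes(swchunk=SWCHUNK, dev_bsize=DEV_BSIZE):
    """Size of one swap chunk in bytes."""
    return swchunk * dev_bsize


def max_swap_bytes(maxswapchunks, swchunk=SWCHUNK, dev_bsize=DEV_BSIZE):
    """Upper bound on configurable swap implied by maxswapchunks."""
    return maxswapchunks * chunk_bytes(swchunk, dev_bsize)
```

For example, a hypothetical maxswapchunks of 256 with the default chunk size would cap total swap at 512 MB.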

During system startup, the location and size of each swap device is displayed in 512-byte blocks.

For example, the swap space's starting disk block number and size of swap space are displayed as:

        start = xxxxxx  
        size = xxxxxx

Device Swap Space

Device swap space resides in its own reserved area -- an entire disk, or a section or logical volume of a disk -- and is not taken from file system space.

At least one swap device (or pseudo-swap reservation as a minimum) must be present on a system. Device swap space is required during system startup.

You can also configure device swap space dynamically, without bringing the system into single-user mode. Unless you are using LVM, device swap is fixed in size and location.

Using LVM, you cannot reduce the size of active device swap, because virtual memory cannot dynamically reconfigure the size of the swap device; thus, reducing the size of active device swap might cause a system panic. You can, however, increase the size of device swap by extending the swap logical volume as needed, although the system will not recognize the change until you reboot.

One or more swap devices can be configured into the system by specifying the swap devices in the configuration file, /stand/system.

You can also configure one swap device in the configuration file and dynamically configure more swap devices later, with swapon.

The first swap device listed in the configuration file is termed the primary swap device; that is, the swap space used first by the operating system. Primary swap defaults to the disk that contains the root file system.

By listing swap devices in /etc/fstab, you ensure that the swap devices are automatically enabled when the system is rebooted. The primary swap device might be listed in /etc/fstab as, for example, the device file /dev/dsk/c0t5d0; on a system using LVM, the device might be listed as a logical volume, such as /dev/vg00/rlvol2.

When the disk is treated as a whole, the mkfs(1M) command is used to apportion disk space for file system and swap. If LVM is used, the LVM commands apportion disk space into logical volumes for file system and swap.

/sbin/rc sequences start-up operations, depending on the run state entered. For example, among the start-up scripts in /sbin/rc2.d (the directory for run-level 2) is a swap_start script that calls swapon -a. swapon enables a device or file system for paging and swapping, and the -a option instructs the system to read /etc/fstab for swap information and mount all swap sections. (See rc(1M) and swapon(1M) for further information.)

HP-UX allows you to configure swap on several disk drives, making it easy to expand the swap space. You can have multiple swap volumes per disk.

Having multiple swap devices on a system also increases throughput, by using the principle of "interleaving swap." Interleaving swap is achieved by assigning identical priority to multiple swap devices to minimize disk-head movement and enhance performance.

Swap space must be large enough to hold all segments of all existing processes (data, stack, shared memory). When it lacks sufficient swap space, the system either returns an error (such as ENOMEM or ENOSPC) for system calls, or in the case of a stack-growth failure, kills the user process.

If you need more swap space, do one of the following:

Enable an additional swap device or file-system swap to your running system.
Rebuild the file system on the existing device to reserve more swap space. To do this, you will also need to back up the existing file system, remake the file system using newfs(1M), then restore the saved file system to the new configuration. If you are administering a Series 700 as a whole disk (used only for root and swap and without LVM), run mkboot. This must be done from another system.
Rebuild the kernel using SAM or config to change primary swap space. (You would do this if your needs have changed, and rather than continually allocating swap as needed, you decide to reorganize your disk space and allocate a different primary swap.)
Create another logical volume or extend an existing swap logical volume. Note: to make use of the extended swap volume space, you must reboot the system.

File-System Swap

File-system swap, another form of secondary swap space, can also be configured dynamically. File-system swap space allows a process to use an existing file system if it needs more than the designated device swap space. The operating system swaps to space in the file system, in addition to device swap space. File-system swap is used only when device swap space is insufficient to meet demand-paging needs.

Whereas a swap device is limited in size to a specific section of a disk (or logical volume), file-system swap consumes a variable amount of space. This is an efficient use of disk space, because it enables the system to swap to unused portions of a file system as needed.

A process might need 20 MB of swap space for a short time. If only device swap is enabled, an entire 20 MB of swap space would have to be allocated permanently just to handle that brief work. With file-system swap, the paging system can fill these temporary needs from file-system space. The paging system and the file system share file system space. You can also limit file-system swap to a fixed size to prevent it from consuming too much space.

To optimize system performance, file-system swap space is allocated and de-allocated in swchunk-sized chunks. swchunk is a configurable operating system parameter; its default is 2048 KB (2 MB). Once a chunk of file system space is no longer being used by the paging system, it is released for file system use, unless it has been preallocated with swapon. (See swapon(1M) or swapon(2) in the HP-UX Reference.)
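Because allocation happens in whole chunks, a demand for swap translates into a chunk count by ceiling division. The following is a minimal sketch of that arithmetic only; the function name chunks_needed is hypothetical and the code does not query or change any system setting.

```shell
# Number of swchunk-sized chunks needed to cover a demand given in KB.
# Both arguments are in KB; the chunk size defaults to 2048 KB, the
# default value of swchunk quoted in this paper.
chunks_needed() {
    need_kb=$1
    swchunk_kb=${2:-2048}
    # Integer ceiling division: a partly used chunk still counts as a whole chunk.
    echo $(( (need_kb + swchunk_kb - 1) / swchunk_kb ))
}

chunks_needed 20480    # a 20 MB demand; prints 10
```

With the default chunk size, a 20 MB demand therefore occupies ten chunks, and one extra kilobyte of demand would require an eleventh.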

Guidelines for Adding Swap Space

Swap configured into the system is always available to a booted and running system. However, file-system and device swap can also be added to a running system for short-term use; the system then uses them like any other configured swap area.

To make swap-space allocation automatic at boot time, you must include it in the /etc/fstab file. This ensures that the swap area is enabled when the system is rebooted.

If a swap area is not added to /etc/fstab, the system will no longer swap to that area after a reboot. To disable swap, you must edit /etc/fstab before rebooting. (These procedures can be performed using SAM.)

If you need more swap space but do not have any spare devices, we recommend that you add file-system swap space. System performance is nearly as good when using file-system swap space as device swap. If the system is using so much file-system swap space that performance degrades badly, you might want to increase device swap space.

If you have additional devices that can be used for swapping, enable these before file-system swap, because the performance is better. If using both device and file-system swap, give devices a higher priority (that is, a lower number in the pri argument when executing the swapon command). File system swap, given equal priority, is always used after device swap.

For instructions on adding swap space, see swapon(1M) in the HP-UX Reference.

Sample /etc/fstab Entry for Device Swap

To enable your swap device each time the system is rebooted, be sure to include an entry in /etc/fstab, as shown:

/dev/dsk/c1t5d0 /default swap pri=1

or when using LVM:

/dev/vg00/lvol3 /default swap pri=1

Refer to the fstab(4) manual page of the HP-UX Reference for a thorough discussion of the fields used when adding device swap.

Sample /etc/fstab Entry for File-System Swap

To enable file-system swap when the system is rebooted, include an entry in /etc/fstab.

The following /etc/fstab entry enables swap on the file system containing the directory /swap. File-system swap is allocated in chunks whose size is set by swchunk. swchunk is currently set to 2048, in units of DEV_BSIZE, which is 1024 bytes. (2048 x 1024 bytes = 2 MB, which is the allocation size of file-system swap-space chunks.) To increase efficiency and performance, the minimum of 10 file-system blocks actually becomes a chunk allocation and is consumed immediately; a maximum of 4500 blocks can be used; 100 blocks are reserved for file system use; and the priority is 2.

default /swap swapfs min=10,lim=4500,res=100,pri=2

Refer to the fstab(4) manual page of the HP-UX Reference for a thorough discussion of the fields used when adding file-system swap.

Comparing Device and File-System Swap

Device swap is faster than file-system swap. This is because the system can write an entire request (256 KB on Series 700/800) to a device at once.

File-system swap makes more efficient use of file-system space, but it might degrade system performance somewhat, because its throughput is slower than device swap. This is because free file-system blocks may not always be contiguous; therefore, separate read/write requests must be made for each file-system block.

Swap Space Priorities

Priorities, ranging from zero to ten, can be set for all devices or file systems. The lower the number, the higher the priority. Thus, a device with a priority of zero is used for swapping before a device of priority one or higher.

Swapping rotates among both devices and file systems of equal priority. Given equal priority, however, devices are swapped to by the operating system before file systems, because devices make more efficient use of CPU time.

We recommend that you assign the same swapping priority to most swap devices, unless a device is significantly slower than the rest. Assigning equal priorities limits disk head movement, which improves swapping performance.
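The ordering rules above can be sketched with sort(1). This is an illustration only: the function name swap_order and the three-column input format are hypothetical, and the sketch does not model the rotation among areas of equal priority.

```shell
# Read lines of the form "<priority> <dev|fs> <name>" on standard input and
# print them in the order described above: lower priority number first and,
# within one priority, devices ("dev") before file systems ("fs").
swap_order() {
    # Key 1 is compared numerically, key 2 alphabetically ("dev" sorts before "fs").
    sort -k1,1n -k2,2
}

swap_order <<'EOF'
1 fs /swap
0 dev /dev/dsk/c0t0d0
1 dev /dev/dsk/c4t0d0
EOF
```

For this input the priority-0 device is listed first, then the priority-1 device, then the priority-1 file system.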

The swapon Command

The swapon(1M) command allows you to enable an additional device or file system for paging and swapping. The swap space can be added while the system is running.

Sample Swap Device Allocations

Here are some sample swap device allocations.

The following example enables swapping to a block device using the space after the end of the file system (-e option) for swap and letting the priority default to 1.

/usr/sbin/swapon -e /dev/dsk/c4t0d0

The next example enables swapping to a block device with instance number 0, CS/80 drive unit number 0 with the highest priority (zero).

/usr/sbin/swapon -p 0 /dev/dsk/c0t0d0

The following example enables swapping on logical volume lvol4 of volume group vg03 on a system using LVM. (See the section on Device Swap for limitations to changing the size of device swap. Also, see "Managing Logical Volumes" in the System Administration Tasks manual for use of LVM commands.)

/usr/sbin/swapon /dev/vg03/lvol4

The last example forces swapping (-f option) to be enabled on a block device /dev/dsk/c12t3d0, even if a file system exists on the device. (Note that the -f option makes this a potentially destructive command, to be used with caution.) The device is assigned a swapping priority of 2.

/usr/sbin/swapon -p 2 -f /dev/dsk/c12t3d0

Sample File System Swap Allocations

Here are some sample file-system swap allocations.

The following example enables file system swap to a directory named /swap. Initially, 256 file system blocks are consumed; no more than 1024 file system blocks may be consumed. Also, 0 blocks are reserved for file system use and this file system is assigned a swapping priority of 3.

/usr/sbin/swapon /swap -m 256 -l 1024 -r 0 -p 3

The next example enables file system swap to a directory named /disk2. Initially, no (zero) file system blocks are consumed; no more than 2048 file system blocks may be consumed. No file system blocks are reserved for file system use. The swapping priority is 10, meaning that /disk2 is the least likely file system to be used for swap space.

/usr/sbin/swapon /disk2 -m 0 -l 2048 -r 0 -p 10

The following example enables file system swap to a directory named /bigdisk. Initially, 1024 file system blocks are consumed; no more than 4096 file system blocks may be consumed. No file system blocks are reserved. /bigdisk is assigned a swapping priority of 0, meaning it will be used most often for swap space.

/usr/sbin/swapon /bigdisk -m 1024 -l 4096 -r 0 -p 0

The last example enables file system swap to a directory named /dyndisk. Although no file system space is allocated initially, no limit is set on the amount of space that can be used. 2048 blocks are reserved for file system use. /dyndisk has a swapping priority of 0, the highest.

/usr/sbin/swapon /dyndisk -m 0 -l 0 -r 2048 -p 0
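The swapon options in these examples appear to line up with the min, lim, res, and pri fields of the sample /etc/fstab entry for file-system swap. The sketch below assumes that correspondence (it is inferred from the examples in this paper; confirm it against swapon(1M) and fstab(4)). The function names fstab_opt and swapfs_to_swapon are hypothetical, and the command is only printed, never run.

```shell
# Print the value of one key from a comma-separated fstab option string.
fstab_opt() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F= -v key="$2" '$1 == key { print $2 }'
}

# Print (do not execute) a swapon command line for a swapfs entry.
# Assumed mapping: min -> -m, lim -> -l, res -> -r, pri -> -p.
swapfs_to_swapon() {
    dir=$1
    opts=$2
    printf '/usr/sbin/swapon %s -m %s -l %s -r %s -p %s\n' "$dir" \
        "$(fstab_opt "$opts" min)" "$(fstab_opt "$opts" lim)" \
        "$(fstab_opt "$opts" res)" "$(fstab_opt "$opts" pri)"
}

swapfs_to_swapon /swap 'min=10,lim=4500,res=100,pri=2'
```

Printing the command rather than running it keeps the sketch safe to try on any system and lets you review the line before using it.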

Evaluating Swap-Space Needs

As a system administrator, you need to monitor your system's swap space regularly, because swap space usage varies with system load. You want to understand your system's swap requirements when demand is heaviest.

Swap space must be large enough to hold the sum of all shared memory, shared libraries, stack, and data for the largest executable process.

The size command is a useful tool for acquiring information on process size:

        % size /usr/bin/vi
        243332 + 222992 + 147620 = 423944

size returns the code (text), data, and bss (data uninitialized at the beginning of process execution) for a program. Since no swap space is reserved for code (the first figure), you would combine only the second and third sizes to estimate the size a program occupies in the swap area. Multiply the figure by the number of users executing the program at the system's busiest time, since each user is allocated data and bss. Repeat this calculation for each program executed when the system is at its busiest.
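The estimate described above is simple arithmetic, sketched here under the assumption that every user runs one copy of the program. The function name swap_estimate and the figure of 10 users are hypothetical; the data and bss figures are taken from the sample size output.

```shell
# Estimate the swap (in bytes) reserved for one program at peak load.
# Arguments: data size, bss size, number of simultaneous users.
# Code (text) is left out because no swap space is reserved for it.
swap_estimate() {
    data=$1
    bss=$2
    users=$3
    echo $(( (data + bss) * users ))
}

swap_estimate 222992 147620 10    # prints 3706120
```

Repeating the call for each program that runs at the busiest time and adding the results gives the per-program part of the total.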

Shared memory must also be included in your swap-space needs assessment. To display the amount of shared memory your system is using at any given moment, use the ipcs command:

     %  ipcs -b
     IPC status from /dev/kmem as of Fri May 29 14:53:42 1994
     T     ID     KEY        MODE       OWNER    GROUP QBYTES
     Message Queues:
     q      0 0x3c341834 -Rrw--w--w-     root     root  16384
     q      1 0x3e341834 --rw-r--r--     root     root    264
     T     ID     KEY        MODE       OWNER    GROUP  SEGSZ
     Shared Memory:
     m      0 0x41341837 --rw-rw-rw-     root     root    512
     m      1 0x4134184f --rw-rw-rw-     root     root   7452
     m      2 0x411800ac --rw-rw-rw-     root     root   8192
     T     ID     KEY        MODE       OWNER    GROUP NSEMS
     Semaphores:
     s      0 0x4134184f --ra-ra-ra-     root     root     2

In this example, 512, 7452, and 8192 are shared memory segment sizes. You would add them to the data and bss sizes. To account conservatively for page boundaries, round up your estimate.
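Adding up the segment sizes can be sketched with awk. The sketch assumes the column layout of the sample listing above (type letter in the first field, SEGSZ in the last); the function names shm_total and round_up are hypothetical, and the page size is passed as a parameter rather than assumed.

```shell
# Sum the last field (SEGSZ) of every shared memory line ("m" in column 1)
# read from standard input, for example the output of ipcs -b.
shm_total() {
    awk '$1 == "m" { sum += $NF } END { print sum + 0 }'
}

# Round a byte count up to the next multiple of the given page size.
round_up() {
    bytes=$1
    page=$2
    echo $(( (bytes + page - 1) / page * page ))
}
```

Fed the three shared memory lines of the sample listing, shm_total prints 16156; round_up 16156 4096 then gives 16384, taking a 4096-byte page as an assumption for the illustration.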

The default value of swchunk is 2048. swchunk is expressed in units of DEV_BSIZE, which is 1024 bytes regardless of device. If the swap space you configure is not a multiple of swchunk, some swap space is wasted.

swchunk is then multiplied by the operating-system parameter, maxswapchunks, to derive the limit of swap space allowed on your system. maxswapchunks is by default 256 (2 to the 8th power), but can be set to any value up to 2 to the 14th power. The value specified does not have to be a power of 2.

Based on these default values, the default limit for the amount of swap enabled is 512 MB (that is, 2048*1024*256 bytes).
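The limit is the product of swchunk, DEV_BSIZE, and maxswapchunks. The sketch below works in megabytes to keep the numbers small; the function name swap_limit_mb is hypothetical, and DEV_BSIZE is taken as 1024 bytes, as stated above.

```shell
# Upper limit on configurable swap, in MB.
# Arguments: swchunk (in 1024-byte DEV_BSIZE units) and maxswapchunks;
# both default to the values quoted in this paper.
swap_limit_mb() {
    swchunk=${1:-2048}
    maxswapchunks=${2:-256}
    # swchunk * maxswapchunks is a size in KB; dividing by 1024 gives MB.
    echo $(( swchunk * maxswapchunks / 1024 ))
}

swap_limit_mb               # defaults; prints 512
swap_limit_mb 2048 16384    # maxswapchunks at 2 to the 14th power; prints 32768
```

With the default chunk size, raising maxswapchunks to its maximum lifts the limit from 512 MB to 32768 MB (32 GB).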

One more parameter should be considered when determining swap-space requirements. While overlaying the old process image with the new process image, the exec system call uses an area of swap space to temporarily hold arguments and environment variables. The size of this area is determined by the configurable system parameter argdevnblks, whose default size is 256 KB.

If you need to change the amount of swap space to accommodate an application, refer to the application's manual to see if it describes swap space requirements.

For details on setting and changing device and file system swap space, see the System Administration Tasks manual.

Pseudo-Swap Reservation

HP-UX has an operating-system parameter to provide the capability of using system memory for swap space. By default, swapmem_on is set to 1, enabling pseudo-swap reservation.

Typically, some physical storage (such as a disk) must be configured for swap space. When the system executes a process, swap space is usually reserved for the entire process, in case the entire process must be paged out. According to this model, to run one gigabyte of processes, the system would have to have one gigabyte of configured swap space. Although this protects the system from running out of swap space, disk space reserved for swap is under-utilized if minimal or no swapping occurs.

To avoid such waste of resources, HP-UX swap space is configured to access device swap space, file system swap space, plus up to three-quarters of system memory capacity. This means that system memory serves two functions: as process-execution space and as swap space.

System memory used for swap space is called pseudo-swap space. By using pseudo-swap space, a system with one gigabyte of memory and one gigabyte of swap can run up to 1.75 GB of processes. As before, if any process attempts to grow or be created beyond this extended threshold, the process will fail.
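The 1.75 GB figure follows from adding three-quarters of system memory to the configured swap. A minimal sketch of that calculation, with a hypothetical function name, is shown here; it gives the upper bound only, since the paper says "up to" three-quarters of memory is used.

```shell
# Upper bound on reservable swap, in MB, when pseudo-swap is enabled.
# Arguments: configured device plus file-system swap (MB), system memory (MB).
reservable_mb() {
    swap_mb=$1
    mem_mb=$2
    echo $(( swap_mb + mem_mb * 3 / 4 ))
}

reservable_mb 1024 1024    # 1 GB swap, 1 GB memory; prints 1792 (1.75 GB)
```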

Note, on systems using pseudo-swap, as the amount of pseudo-swap increases, the amount of lockable memory decreases.

Pseudo-swap space is set to a maximum of three-quarters of system memory because the system can begin paging once three-quarters of available system memory has been used (see "Maintaining Page Availability"). The unused quarter of memory allows a buffer between the system and the swapper to give the system more breathing room.

For factory-floor systems (such as controllers), which perform best when the entire application is resident in memory, pseudo-swap space can be used to enhance performance: you can either lock the application in memory or make sure the total size of the processes created does not exceed three-quarters of system memory.

Note, however, that when the number of processes created approaches capacity, the system might exhibit thrashing and a decrease in system response time. If necessary, you can disable pseudo-swap space by setting the tunable parameter swapmem_on in /usr/conf/master.d/core-hpux to zero. See master(4) in the HP-UX Reference for more information.
