Explore the Linux memory model

A first step to understanding the Linux design

Level: Introductory

Vikram Shukla (vikshukl@in.ibm.com), Software Engineer, IBM

24 Jan 2006

Learn the fundamentals of how memory is constructed and managed in this guided introduction to the Linux® memory model. This guide includes an examination of the segment control unit and the paging models as well as a detailed look at the physical memory zone.

Understanding the memory models used in Linux is the first step to grasping Linux design and implementation on a grander scale, so this article gives you an introductory-level tour of Linux memory models and management.

Linux uses the monolithic approach that defines a set of primitives or system calls to implement operating system services such as process management, concurrency, and memory management in several modules that run in supervisor mode. And although Linux maintains the segment control unit model as a symbolic representation for compatibility purposes, it uses this model at a minimal level.

The main issues that relate to memory management are:

  • Virtual memory management, a logical layer between application memory requests and physical memory.
  • Physical memory management.
  • Kernel virtual memory management/kernel memory allocator, a component that tries to satisfy the requests for memory. The request can be from within the kernel or from a user.
  • Virtual address space management.
  • Swapping and caching.

This article can help you understand the Linux internals from a memory-management perspective within the operating system by addressing the following:

  • The segment control unit model, in general, and specifically for Linux
  • The paging model, in general, and specifically for Linux
  • The physical details of the memory zone

This article does not detail how the memory is managed by the Linux kernel, but the information on the overall memory model and how it is addressed should give you a framework for learning more. This article focuses on the x86 architecture, but you can use the material in this article with other hardware implementations.

x86 memory architecture

In the x86 architecture, memory is referred to by three kinds of addresses:

  • A logical address is a storage location address that may or may not relate directly to a physical location. On x86 it consists of a segment selector and an offset. The logical address is usually used when requesting information from a controller.
  • A linear address (or a flat address space) is memory that is addressed starting with 0. Each subsequent byte is referenced by the next sequential number (0, 1, 2, 3, etc.) all the way to the end of memory. This is how most non-Intel CPUs address memory. Intel® architectures use a segmented address space; in the original 16-bit mode, memory is broken up into 64KB segments, and a segment register always points to the base of the segment that is currently being addressed. The 32-bit mode in this architecture is considered a flat address space, but it too uses segments.
  • A physical address is an address represented by bits on a physical address bus. The physical address may be different from the logical address, in which case the memory management unit translates the logical address into a physical address.

The CPU uses two units to transform a logical address into a physical address. The first is called the segmentation unit and the other is called the paging unit.


Figure 1. Two units convert address spaces

Let's examine the segment control unit model.





Segment control unit model in general

The basic idea behind the segmentation model is that memory is managed using a set of segments. Essentially, each segment is its own address space. A segment consists of two components:

  • A base address that contains the address of some physical memory location
  • A length value that specifies the length of the segment

A segmented address also consists of two components -- a segment selector and an offset into the segment. The segment selector specifies the segment to use (that is, the base address and length values) while the offset component specifies the offset from the base address for the actual memory access. The physical address of the actual memory location is the sum of the offset and the base address values. If the offset exceeds the length of the segment, the system generates a protection violation.
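The translation rule just described can be sketched in a few lines of Python. This is only an illustration of the general model, not of any real hardware interface: the `Segment` class, the `translate` function, and the sample segment table are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base: int    # address where the segment starts
    length: int  # size of the segment in bytes


class ProtectionViolation(Exception):
    """Raised when an offset falls outside its segment."""


def translate(segments, selector, offset):
    """Return base + offset for the selected segment, checking the length."""
    seg = segments[selector]
    # Valid offsets run from 0 to length - 1; anything else is rejected.
    if offset < 0 or offset >= seg.length:
        raise ProtectionViolation(
            f"offset {offset:#x} is outside a segment of length {seg.length:#x}"
        )
    return seg.base + offset


# Hypothetical segment table: selector -> segment
segments = {0: Segment(base=0x1000, length=0x200)}
print(hex(translate(segments, 0, 0x10)))  # 0x1010
```

Asking for offset 0x200 in the same segment raises ProtectionViolation, which mirrors the protection violation mentioned above.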

To summarize the representation:

Segmented Unit is represented as -> Segment: Offset model
can also be represented as -> Segment Identifier: Offset

Each segment is identified by a 16-bit field called a segment identifier or segment selector. x86 hardware has a few programmable registers, called segment registers, which hold these segment selectors. The most commonly used are cs (code segment), ds (data segment), and ss (stack segment); es, fs, and gs are available for general use. Each segment identifier identifies a segment that is represented by a 64-bit (8-byte) segment descriptor. These segment descriptors are stored in a GDT (global descriptor table) and can also be stored in an LDT (local descriptor table).


Figure 2. Interplay of segment descriptors and segment registers

Each time a segment selector is loaded into a segment register, the corresponding segment descriptor is loaded from memory into a matching non-programmable CPU register. Each segment descriptor is eight bytes long and represents a single segment in memory. These are stored in LDTs or GDTs. The segment descriptor entry contains both a pointer to the first byte in the associated segment (the Base field) and a 20-bit value (the Limit field) that represents the size of the segment in memory.

Several other fields contain special attributes such as a privilege level and the segment's type (code or data). The segment type is represented by a four-bit Type field.

Because the descriptor is cached in the non-programmable register, the GDT or LDT does not have to be consulted while a logical address is translated into a linear address. This speeds up address translation.

A segment selector contains the following:

  • A 13-bit index that identifies the corresponding segment descriptor entry contained in the GDT or LDT.
  • The TI (Table Indicator) flag, which specifies where the segment descriptor is stored: in the GDT if the value is 0, in the LDT if the value is 1.
  • The RPL (requested privilege level), which defines the current privilege level of the CPU when the corresponding segment selector is loaded in the segment register.

Since a segment descriptor is eight bytes long, its relative address inside the GDT or LDT is obtained by multiplying the most significant 13 bits of the segment selector by 8. For example, if the GDT is stored at address 0x00020000 and the Index specified by the segment selector is 2, then the address of the corresponding segment descriptor is (2*8) + 0x00020000 = 0x00020010. A 13-bit index selects one of 2^13 = 8192 slots; because the first GDT entry is always null, the number of segment descriptors that can be stored in a GDT is (2^13 - 1), or 8191.
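As a rough illustration, the selector fields and the descriptor address computation can be written out as below. The function names are made up for this sketch. The bit positions of TI (bit 2) and RPL (bits 0-1) are the standard x86 layout, which the text above implies but does not spell out.

```python
DESCRIPTOR_SIZE = 8  # bytes per segment descriptor


def decode_selector(selector):
    """Split a 16-bit segment selector into (index, ti, rpl)."""
    index = selector >> 3        # most significant 13 bits
    ti = (selector >> 2) & 0x1   # 0 = GDT, 1 = LDT
    rpl = selector & 0x3         # two least significant bits
    return index, ti, rpl


def descriptor_address(table_base, selector):
    """Address of the descriptor that the selector refers to."""
    index, _, _ = decode_selector(selector)
    return table_base + index * DESCRIPTOR_SIZE


selector = (2 << 3) | (0 << 2) | 0   # Index = 2, TI = 0 (GDT), RPL = 0
print(decode_selector(selector))                      # (2, 0, 0)
print(hex(descriptor_address(0x00020000, selector)))  # 0x20010
```

The second line of output reproduces the worked example from the text: (2*8) + 0x00020000.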

Figure 3 shows how a linear address is obtained from a logical address.


Figure 3. Obtaining a linear address from a logical address

Now how is this different with Linux?





Segment control unit in Linux

In Linux, this model employs a small modification. I've already noted that Linux uses the segmentation model in a limited way (mostly for compatibility purposes).

In Linux, all the segment registers point to the same range of segment addresses -- in other words, each uses the same set of linear addresses. This enables Linux to use a limited number of segment descriptors, so all descriptors can be kept in the GDT. This model has two advantages:

  • Memory management is simpler when all processes use the same segment register values (when they share the same set of linear addresses).
  • Portability with most architectures can be achieved. Several RISC processors also support segmentation in this limited way.

Figure 4 demonstrates this modification.


Figure 4. In Linux segment registers point to the same set of addresses

Segment descriptors

Linux uses the following segment descriptors:

  • The kernel code segment
  • The kernel data segment
  • The user code segment
  • The user data segment
  • The TSS segment
  • The default LDT segment

Let's look at each of these in detail.

The kernel code segment descriptor in the GDT has the following values:

  • Base = 0x00000000
  • Limit = 0xfffff (the Limit field is 20 bits wide; with G = 1 it counts 4 KB pages, so the segment spans 2^32 bytes = 4 GB)
  • G (granularity flag) = 1 for segment size expressed in pages
  • S = 1 for normal code or data segment
  • Type = 0xa for code segment that can be read and executed
  • DPL value = 0 for kernel mode

The linear address space spanned by this segment is 4 GB. S = 1 and Type = 0xa identify it as a code segment. The selector is in the cs register. In Linux, the corresponding segment selector is accessed through the __KERNEL_CS macro.
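To make the field values concrete, here is a small sketch that pulls these fields out of a raw 64-bit descriptor. The bit positions follow the standard x86 descriptor layout. The constant `kernel_code` is a value assembled by hand for this example to match the fields listed above (with the 20-bit Limit field holding 0xfffff), and the function names are hypothetical.

```python
def decode_descriptor(d):
    """Extract the fields discussed above from a 64-bit segment descriptor."""
    limit = (d & 0xFFFF) | (((d >> 48) & 0xF) << 16)            # 20-bit Limit
    base = ((d >> 16) & 0xFFFFFF) | (((d >> 56) & 0xFF) << 24)  # 32-bit Base
    return {
        "base": base,
        "limit": limit,
        "type": (d >> 40) & 0xF,   # four-bit Type field
        "s": (d >> 44) & 0x1,      # 1 = normal code or data segment
        "dpl": (d >> 45) & 0x3,    # 0 = kernel mode, 3 = user mode
        "g": (d >> 55) & 0x1,      # 1 = limit counted in 4 KB pages
    }


def segment_size(fields):
    """Size in bytes: the limit counts 4 KB pages when G = 1, bytes otherwise."""
    unit = 4096 if fields["g"] else 1
    return (fields["limit"] + 1) * unit


kernel_code = 0x00CF9A000000FFFF  # hand-built to match the listed values
fields = decode_descriptor(kernel_code)
print(fields)
print(segment_size(fields) == 2**32)  # True: the segment spans 4 GB
```

Changing the DPL bits to 3 (giving 0x00CFFA000000FFFF) yields a descriptor with the same fields but user-mode privilege.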

The kernel data segment descriptor has the same values as the kernel code segment descriptor except for the Type field, which is set to 2. This indicates that the segment is a data segment; the selector is stored in the ds register. In Linux, the corresponding segment selector is accessed through the __KERNEL_DS macro.

The user code segment is shared by all the processes in user mode. The corresponding segment descriptor stored in the GDT has the following values:

  • Base = 0x00000000
  • Limit = 0xfffff
  • G = 1
  • S = 1
  • Type = 0xa for code segment that can be read and executed
  • DPL = 3 for user mode

The macro used in Linux to access this segment selector is the __USER_CS macro.

In the user data segment descriptor, the only field that changes is Type, which is set to 2 and which defines a data segment that can be read and written. The macro used in Linux to access this segment selector is the __USER_DS macro.

In addition to these segment descriptors, the GDT contains two more segment descriptors for each process created -- the TSS and LDT segments.

Each TSS segment descriptor refers to a different process. The TSS holds hardware context information that the CPU uses during context switching. For example, during a U->K (user to kernel) mode switch, the x86 CPU gets the address of the kernel mode stack from the TSS.

Each process has its own TSS descriptor stored in the GDT. Following are the values of the descriptor:

  • Base = &tss (the address of the TSS field of the corresponding process descriptor; for example, &tss_struct) which is defined in the sched.h file of the Linux kernel
  • Limit = 0xeb (the TSS segment is 236 bytes long)
  • Type = 9 or 11
  • DPL = 0. User mode does not access the TSS. The G flag is cleared

All the processes share the default LDT segment. By default it contains a null segment descriptor. This default LDT segment descriptor is stored in the GDT. The LDT generated by Linux has a size of 24 bytes. By default, three entries are always present:

LDT[0] = null
LDT[1] = user code segment
LDT[2] = user data/stack segment descriptor

Calculating TASKS

Understanding NR_TASKS (a compile-time constant that determines the number of simultaneous processes that Linux supports -- the default value in the kernel source is 512, allowing a maximum of 256 simultaneous connections to a single instance) is necessary to calculate the maximum permissible entries in the GDT.

The total number of entries in the GDT can be determined by the following formula:

Number of entries in GDT = 12 + 2 * NR_TASKS.
As mentioned earlier, a 13-bit index limits the GDT to 2^13 = 8192 entries.

The 12 fixed entries consist of 4 segment descriptors for the kernel and user code and data segments, 4 that cover APM (advanced power management) features, and 4 entries that are left unused. Therefore, the number of entries left for processes is 8192 - 12, or 8180.

Because no more than 8180 entries are available at any point in time:

2 * NR_TASKS = 8180
And NR_TASKS = 8180/2 = 4090

(Why 2 * NR_TASKS? Because each process created needs not only a TSS descriptor (used for maintaining context-switch context) but also an LDT descriptor.)
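The arithmetic behind this limit can be checked with a few lines. This is just the formula above restated, assuming a 13-bit index (2^13 GDT slots) and the constant term 12 from the formula; the variable names are chosen for this example.

```python
GDT_SLOTS = 2 ** 13     # a 13-bit index can select 8192 slots
FIXED_ENTRIES = 12      # constant term of the formula above
PER_TASK_ENTRIES = 2    # one TSS descriptor and one LDT descriptor per process

# Solve 12 + 2 * NR_TASKS <= 8192 for the largest whole NR_TASKS.
max_tasks = (GDT_SLOTS - FIXED_ENTRIES) // PER_TASK_ENTRIES
print(max_tasks)  # 4090
```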

This restriction on the number of processes on the x86 architecture applied to Linux 2.2; since kernel 2.4 it has been eliminated, partly by doing away with hardware context switching (which made using a TSS for each process inevitable) and replacing it with software process switching.

Next, let's look at the paging model.





Paging model in general

The paging unit translates linear addresses into physical ones (see Figure 1). Contiguous linear addresses are grouped together to form pages, and the paging unit maps each page to a corresponding set of contiguous physical addresses called a page frame. Note that the paging unit treats RAM as partitioned into fixed-size page frames.

Because of this, paging has the following advantages:

  • Access rights defined for a page hold for the whole group of linear addresses forming that page
  • The length of a page equals the length of a page frame

The data structure that maps these pages to page frames is called a page table. Page tables are stored in main memory and are initialized by the kernel before the paging unit is enabled. Figure 5 shows a page table.


Figure 5. A page table matches pages to page frames

Note that the set of addresses contained within Page1 corresponds to the set of addresses contained within Page Frame1.

Linux uses the paging unit more than it does the segmentation unit. As we saw earlier in the section on Linux and segmentation, each segment descriptor uses same set of addresses for linear addressing, minimizing the need to use the segmentation unit to convert logical addresses to linear addresses. By using the paging unit more than the segmentation unit, Linux greatly facilitates memory management and portability across different hardware platforms.

Fields used in paging

Here's a description of the fields used to specify paging on x86 architectures, which Linux relies on for paging. The paging unit receives the linear address as the output of the segmentation unit and divides it into the following fields:

  • Directory is represented by the 10 MSBs (the Most Significant Bit is the bit position in a binary number having the greatest value -- the MSB is sometimes referred to as the left-most bit).
  • Table is represented by the intermediate 10 bits.
  • Offset is represented by the 12 LSBs. (A Least Significant Bit is the bit position in a binary integer giving the units value, that is, determining whether the number is even or odd. The LSB is sometimes referred to as the right-most bit. It is analogous to the least significant digit of a decimal integer which is the digit in the ones or right-most position.)
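The three fields can be extracted with shifts and masks. The sketch below assumes a 32-bit linear address split 10/10/12 as listed above; `split_linear` is a name made up for this example.

```python
def split_linear(addr):
    """Split a 32-bit linear address into (directory, table, offset)."""
    directory = (addr >> 22) & 0x3FF  # 10 most significant bits
    table = (addr >> 12) & 0x3FF      # 10 intermediate bits
    offset = addr & 0xFFF             # 12 least significant bits
    return directory, table, offset


print(split_linear(0xC0101234))  # (768, 257, 564)
```

In hexadecimal the result is directory 0x300, table 0x101, offset 0x234.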

The translation of linear addresses into their corresponding physical location is a two-step process. The first step uses a translation table called the Page Directory, whose entry leads to a Page Table; the second step uses that Page Table, whose entry plus the Offset leads to the required location in the page frame. You can see this in Figure 6.


Figure 6. Paging fields

To start with, the physical address of the Page Directory is loaded into the cr3 register. The Directory field within the linear address determines the entry in the Page Directory that points to the proper Page Table. The Table field determines the entry in the Page Table that contains the physical address of the page frame containing the page. The Offset field determines the relative position within the page frame. Since this offset is 12 bits long, each page contains 4 KB of data.

To summarize the physical address computation:

  1. cr3 + Page Directory (10 MSBs) = points to table_base
  2. table_base + Page Table (10 intermediate bits) = points to page_base
  3. page_base + Offset = physical address (gets the page frame)
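The three steps above can be simulated with a dictionary standing in for physical memory. This is a simplified sketch, not kernel code: it assumes 4-byte table entries whose upper 20 bits hold a page-aligned address and whose lowest bit is a Present flag (the usual x86 convention), and the memory layout at the bottom is entirely hypothetical.

```python
PRESENT = 0x1           # lowest bit of an entry: the mapping is valid
ADDR_MASK = 0xFFFFF000  # upper 20 bits of an entry: page-aligned address
ENTRY_SIZE = 4          # bytes per directory or table entry


def walk(memory, cr3, linear):
    """Translate a 32-bit linear address with a two-level table walk.

    `memory` maps the physical address of each entry to the entry's value.
    """
    directory = (linear >> 22) & 0x3FF
    table = (linear >> 12) & 0x3FF
    offset = linear & 0xFFF

    # Step 1: cr3 + Directory index -> table_base
    pde = memory.get(cr3 + directory * ENTRY_SIZE, 0)
    if not pde & PRESENT:
        raise LookupError("page directory entry not present")
    table_base = pde & ADDR_MASK

    # Step 2: table_base + Table index -> page_base
    pte = memory.get(table_base + table * ENTRY_SIZE, 0)
    if not pte & PRESENT:
        raise LookupError("page table entry not present")
    page_base = pte & ADDR_MASK

    # Step 3: page_base + Offset -> physical address
    return page_base + offset


# Hypothetical layout: Page Directory at 0x1000, one Page Table at 0x2000,
# mapping the linear page at 0x00400000 to the page frame at 0x00080000.
memory = {
    0x1000 + 1 * ENTRY_SIZE: 0x2000 | PRESENT,
    0x2000 + 0 * ENTRY_SIZE: 0x00080000 | PRESENT,
}
print(hex(walk(memory, 0x1000, 0x00400ABC)))  # 0x80abc
```

An address whose directory or table entry is missing raises LookupError, which plays the role of a page fault in this toy model.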

Since the Directory and Table fields are each 10 bits long, together they can select 1024*1024 pages, and the Offset can address 2^12 (4096) bytes within each page. Therefore, the total addressable limit is 1024*1024*4096 bytes (equal to 2^32 memory cells, which comes to 4 GB). So on x86 architectures, the total addressable limit is 4 GB.

Extended paging

Extended paging is obtained by removing the Page Table translation table; then the division of linear address is done in between the Page Directory (10 MSBs) and the Offset (22 LSBs).

The 22 LSBs form the 4 MB boundary for the page frame (2^22). Extended paging coexists with normal paging and is enabled to map large contiguous linear addresses into corresponding physical ones. The operating system removes the Page Table and thus provides the extended paging. This is enabled by setting the PSE (page size extension) flag.
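With extended paging the same linear address is split into only two fields. A minimal sketch of the 10/22 split follows; the function and constant names are chosen for this example.

```python
EXTENDED_PAGE_SIZE = 1 << 22  # 22 offset bits give 4 MB page frames


def split_extended(addr):
    """Split a 32-bit linear address into (directory, offset) for 4 MB pages."""
    directory = (addr >> 22) & 0x3FF          # 10 most significant bits
    offset = addr & (EXTENDED_PAGE_SIZE - 1)  # 22 least significant bits
    return directory, offset


directory, offset = split_extended(0xC0101234)
print(directory, hex(offset))  # 768 0x101234
```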

The 36-bit PSE (PSE-36) extends 36-bit physical address support to 4 MB pages while maintaining a 4-byte page-directory entry, thereby providing a simple mechanism to address physical memory above 4 GB without requiring major design changes to operating systems. This approach has practical limitations with respect to demand paging.





Paging model in Linux

Paging in Linux is similar to general paging, but Linux uses an architecture-independent three-level page table mechanism consisting of:

  • Page Global Directory (pgd), the abstracted top level of the multi-level page tables. Each level of page table deals with different sizes of memory -- this global directory may deal with areas 4 MB in size. Each entry will be a pointer to a lower table of a smaller-sized directory, so the pgd is a directory of page tables. When code traverses this structure (some drivers do this), it is said to "walk" the page tables.
  • Page Middle Directory (pmd), the middle level of page tables. On x86 architectures, the pmd is not present in hardware, but is folded in to the pgd in the kernel code.
  • Page Table Entry (pte), the bottom level which deals in pages directly (look for PAGE_SIZE), is a value containing the physical address of a page along with associated bits indicating, for example, that the entry is valid and the related page is present in real memory.

This three-level paging scheme was incorporated into Linux in order to support large memory areas. When large-memory-area support is not required, you can fall back to two-level paging by defining the pmd as "1" (a single entry).

The levels are optimized at compile time, enabling both two-level and three-level paging (using the same set of code) by just enabling or disabling the middle directory. 32-bit processors without PAE use two-level paging (the pmd is folded away) and 64-bit processors use three-level paging.


Figure 7. Three levels of paging

Just so you know, on 64-bit processors such as the Alpha:

  • 21 MSBs are unused
  • 13 LSBs are represented by page offset
  • The remaining 30 bits are divided into
    • 10 bits for Page Table
    • 10 bits for Page Global Directory
    • 10 bits for Page Middle Directory

As we see from this layout, only 43 bits are actually used for addressing. So effectively, on such a 64-bit processor, the virtual memory available for usage is 2^43 bytes (8 TB).
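The 64-bit split described above can be sketched the same way as the 32-bit one. The constants simply restate the bit counts from the list (13 offset bits and three 10-bit indexes); the names are invented for this example and are not the kernel's macros.

```python
PAGE_SHIFT = 13       # 13 offset bits
BITS_PER_LEVEL = 10   # 10 bits each for pgd, pmd, and page table
LEVEL_MASK = (1 << BITS_PER_LEVEL) - 1

USED_BITS = PAGE_SHIFT + 3 * BITS_PER_LEVEL  # 43
UNUSED_BITS = 64 - USED_BITS                 # 21


def split_three_level(addr):
    """Split an address into (pgd, pmd, pte, offset) indexes."""
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    pte = (addr >> PAGE_SHIFT) & LEVEL_MASK
    pmd = (addr >> (PAGE_SHIFT + BITS_PER_LEVEL)) & LEVEL_MASK
    pgd = (addr >> (PAGE_SHIFT + 2 * BITS_PER_LEVEL)) & LEVEL_MASK
    return pgd, pmd, pte, offset


# Build an address from known indexes, then take it apart again.
addr = (5 << 33) | (6 << 23) | (7 << 13) | 0x123
print(split_three_level(addr))  # (5, 6, 7, 291)
print(USED_BITS, UNUSED_BITS)   # 43 21
```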

Each process has its own set of page directories and page tables. In order to reference a page frame that contains actual user data, the operating system begins by loading the physical address of the pgd into the cr3 register (on x86 architectures). Whenever a new process is executed on the CPU, Linux saves the content of the cr3 register in the TSS segment and then loads another value from the TSS segment into the cr3 register. The result is that the paging unit refers to the correct set of page tables.

Each entry in the pgd table points to a page frame containing an array of pmd entries, each of which in turn points to a page frame containing pte entries, each of which finally points to a page frame containing the user data. If the page being looked for has been swapped out, a swap entry is stored in the pte table; it is used (when there is a page fault) to find the page to reload into memory.

Figure 8 shows that an offset is added at each page table level to reach the corresponding page frame entry. These offsets come from breaking up the linear address received as the output of the segmentation unit. To extract the part of the linear address that corresponds to each page table component, the kernel uses various macros. Without going into detail about these macros, let's look at a diagram of how the linear address is split.


Figure 8. Linear addresses have different address lengths

Reserved page frame

Linux reserves a few page frames exclusively for the kernel code and data structures. These pages are never swapped to disk. Linear addresses from 0x0 up to 0xc0000000 (PAGE_OFFSET) can be referenced by both user code and kernel code. Linear addresses from PAGE_OFFSET up to 0xffffffff can be addressed only by kernel code.

This means that out of 4 GB, only 3 GB are available for user applications.
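The 3 GB / 1 GB split follows directly from the value of PAGE_OFFSET. The sketch below restates it as arithmetic; `is_kernel_address` is a hypothetical helper for this example, not a kernel function.

```python
PAGE_OFFSET = 0xC0000000  # start of the kernel-only range
ADDRESS_SPACE = 2 ** 32   # 4 GB of 32-bit linear addresses
GB = 2 ** 30


def is_kernel_address(linear):
    """True if the linear address lies in the range reserved for kernel code."""
    return PAGE_OFFSET <= linear < ADDRESS_SPACE


user_gb = PAGE_OFFSET // GB                      # below PAGE_OFFSET
kernel_gb = (ADDRESS_SPACE - PAGE_OFFSET) // GB  # PAGE_OFFSET and above
print(user_gb, kernel_gb)                        # 3 1
print(is_kernel_address(0xC0100000))             # True
print(is_kernel_address(0x08048000))             # False
```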

How paging is enabled

The paging mechanism used by Linux processes is set up in two phases:

  • At bootstrapping, the system sets up the page table for 8 MB of physical memory.
  • Then, the second phase completes the mapping for rest of the physical memory.

In the bootstrap phase, the startup_32() call is responsible for initiating the paging. This is implemented within the arch/i386/kernel/head.S file. The mapping of this 8 MB happens at addresses above PAGE_OFFSET. The initialization begins with a statically defined compile-time array called swapper_pg_dir. This is placed at a particular address (0x00101000) at compile time.

This action establishes page table entries for two pages defined statically in the code -- pg0 and pg1. The sizes of these page frames are by default 4 KB; if the page size extension bit is set (see the Extended paging section for more on the PSE), the sizes are 4 MB each. The address of the global array is stored in the cr3 register, which I suppose is the first step in setting up the paging unit for Linux processes. The rest of the page entries are set up in the second phase.

The second phase is handled by the paging_init() function.

RAM is mapped between PAGE_OFFSET and the 4 GB limit (0xFFFFFFFF) on the x86 32-bit architecture. That means that approximately 1 GB of RAM can be mapped when Linux starts, and this happens by default. However, if CONFIG_HIGHMEM has been set, physical memory beyond 1 GB can also be mapped by the kernel -- keep in mind that this is a temporary arrangement. It is done by a kmap() call.





Physical memory zone

I've already shown you that the Linux kernel (on a 32-bit architecture) divides the virtual memory into a 3:1 ratio, 3 GB virtual memory for user space and 1 GB for kernel space. The kernel code and its data structures must reside in this 1 GB of address space, but an even bigger consumer of this address space is the virtual mappings for the physical memory.

This is done because the kernel cannot manipulate memory that is not mapped into its address space. Thus, the maximum amount of physical memory that could be handled by the kernel was the amount that could be mapped into the kernel's virtual address space minus the space needed to map the kernel code itself. As a result, an x86-based Linux system could work with a maximum of a little less than 1 GB of physical memory.

In order to cater to large numbers of users, to support more memory, to improve performance, and to establish an architecture-independent way to describe memory, the Linux memory model had to evolve. To achieve these goals, the newer model arranged memory into banks assigned to each CPU. Each bank is called a node; each node is divided into zones. Zones (which represent ranges within memory) are further categorized into the following types:

  • ZONE_DMA (0-16 MB): The memory range residing in the lower physical memory area which certain ISA/PCI devices require.
  • ZONE_NORMAL (16-896 MB): The memory range that is directly mapped by the kernel into the upper region of its linear address space. Many kernel operations can only take place using this memory zone, therefore it is the most performance-critical zone.
  • ZONE_HIGHMEM (896 MB and higher): The remaining available memory in the system, which is not permanently mapped by the kernel.
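As a quick illustration of these ranges, the sketch below classifies a physical address by zone. The boundaries are the ones listed above for 32-bit x86. The `ZONES` table and `zone_of` function are invented for this example and do not correspond to kernel data structures.

```python
MB = 1024 * 1024

# (name, first byte, one past the last byte); None means "no upper bound"
ZONES = [
    ("ZONE_DMA", 0, 16 * MB),
    ("ZONE_NORMAL", 16 * MB, 896 * MB),
    ("ZONE_HIGHMEM", 896 * MB, None),
]


def zone_of(phys):
    """Return the name of the zone that contains a physical address."""
    for name, start, end in ZONES:
        if phys >= start and (end is None or phys < end):
            return name
    raise ValueError("physical addresses cannot be negative")


print(zone_of(1 * MB))     # ZONE_DMA
print(zone_of(512 * MB))   # ZONE_NORMAL
print(zone_of(2048 * MB))  # ZONE_HIGHMEM
```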

The node concept is implemented in the kernel by using the structure struct pglist_data. A zone is described by the structure struct zone_struct. A physical page frame is represented by the structure struct page, and all these structs are kept in the global array mem_map, which is stored at the beginning of ZONE_NORMAL. The basic relationships between node, zone, and page frame are shown in Figure 9.


Figure 9. Relationships among the node, zone, and page frame

The high memory zone made its appearance in kernel memory management when support was implemented both for the Pentium II's virtual memory extension (to access up to 64 GB by means of PAE -- Physical Address Extension -- on 32-bit systems) and for 4 GB of physical memory (again, on 32-bit systems). It is a concept applied to x86 and SPARC platforms. Generally, this memory is made accessible by temporarily mapping pages of ZONE_HIGHMEM into the kernel's address space by means of kmap(). Please note that it is not advisable to have more than 16 GB of RAM on a 32-bit architecture, even when PAE is enabled.

(PAE is an Intel-provided memory address extension that enables processors to expand the number of bits that can be used to address physical memory from 32 bits to 36 bits, given support in the host operating system.)

The management of this physical memory zone is done by a zone allocator. It is responsible for dividing memory into a number of zones; it treats each zone as a unit for allocation purposes. Any particular allocation request utilizes a list of zones from which the allocation may be attempted, in a most-preferred-to-least-preferred order.

For example:

  • A request for a user page should be filled first from the "normal" zone (ZONE_NORMAL);
  • if that fails, from ZONE_HIGHMEM;
  • and if that fails, from ZONE_DMA.

The zone list for such allocations consists of the ZONE_NORMAL, ZONE_HIGHMEM, and ZONE_DMA zones, in that order. On the other hand, a request for a DMA page may only be fulfilled from the DMA zone, so the zone list for such requests contains only the DMA zone.
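The fallback behavior of a zone list can be modeled with a toy allocator. This sketch follows the ordering described in this article. The free-page counts, the `allocate` function, and the list names are all hypothetical and much simpler than the real zone allocator.

```python
def allocate(free_pages, zonelist):
    """Take one page from the first zone in `zonelist` that has a free page.

    Returns the name of the zone used, or None if every zone is exhausted.
    """
    for zone in zonelist:
        if free_pages.get(zone, 0) > 0:
            free_pages[zone] -= 1
            return zone
    return None


# Most-preferred to least-preferred, as described in the text.
USER_ZONELIST = ["ZONE_NORMAL", "ZONE_HIGHMEM", "ZONE_DMA"]
DMA_ZONELIST = ["ZONE_DMA"]

# Hypothetical state: the normal zone is already exhausted.
free_pages = {"ZONE_NORMAL": 0, "ZONE_HIGHMEM": 1, "ZONE_DMA": 1}

first = allocate(free_pages, USER_ZONELIST)   # falls back to ZONE_HIGHMEM
second = allocate(free_pages, USER_ZONELIST)  # falls back to ZONE_DMA
third = allocate(free_pages, DMA_ZONELIST)    # nothing left: None
print(first, second, third)  # ZONE_HIGHMEM ZONE_DMA None
```

The last call shows why a DMA request is more fragile: its zone list has no fallback once the DMA zone is empty.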





Conclusion

Memory management is a large, complex, and time-consuming set of tasks, one that is difficult to achieve because crafting a model of how systems behave in real-world, multi-programmed environments is a tough job. Components like scheduling, paging behavior, and multiple-process interactions present a considerable challenge. I hope this article will help you decipher the basic knowledge required to engage the challenge of Linux memory management, providing you with a start.




About the author

Vikram Shukla, with more than six years' experience in development and design using object-oriented languages, currently works as a staff software engineer in the Java Technology Center at IBM, Bangalore, India, supporting IBM JVM on Linux.






Linux is a trademark of Linus Torvalds in the United States, other countries, or both. DB2, Lotus, Rational, Tivoli, and WebSphere are trademarks of IBM Corporation in the United States, other countries, or both. Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries. Other company, product, or service names may be trademarks or service marks of others.