diff -urN linux.orig/CREDITS linux/CREDITS --- linux.orig/CREDITS Wed Apr 24 14:07:24 2002 +++ linux/CREDITS Wed Apr 24 14:08:36 2002 @@ -1883,6 +1883,10 @@ S: Halifax, Nova Scotia S: Canada B3J 3C8 +N: Toshiyuki Maeda +E: tosh@is.s.u-tokyo.ac.jp +D: Kernel Mode Linux + N: Kai Mäkisara E: Kai.Makisara@metla.fi D: SCSI Tape Driver diff -urN linux.orig/Documentation/00-INDEX linux/Documentation/00-INDEX --- linux.orig/Documentation/00-INDEX Wed Apr 24 14:08:04 2002 +++ linux/Documentation/00-INDEX Wed Apr 24 14:08:36 2002 @@ -108,6 +108,8 @@ - listing of various WWW + books that document kernel internals. kernel-parameters.txt - summary listing of command line / boot prompt args for the kernel. +kml.txt + - info on Kernel Mode Linux. kmod.txt - info on the kernel module loader/unloader (kerneld replacement). locks.txt diff -urN linux.orig/Documentation/Configure.help linux/Documentation/Configure.help --- linux.orig/Documentation/Configure.help Wed Apr 24 14:08:04 2002 +++ linux/Documentation/Configure.help Thu Apr 25 22:49:27 2002 @@ -140,6 +140,18 @@ If you don't know what to do here, say N. +Kernel Mode Linux support +CONFIG_KERNEL_MODE_LINUX + This enables Kernel Mode Linux. In Kernel Mode Linux, user programs + can be executed safely in kernel mode and can access the kernel address space + directly. Thus, for example, costly mode switching between user mode and kernel mode + can be eliminated. If you say Y here, the kernel enables Kernel Mode Linux. + + More information about Kernel Mode Linux can be found in the + + + If you don't know what to do here, say N. + Intel or compatible 80x86 processor CONFIG_X86 This is Linux's home port.
Linux was originally native to the Intel diff -urN linux.orig/Documentation/kml.txt linux/Documentation/kml.txt --- linux.orig/Documentation/kml.txt Thu Jan 1 09:00:00 1970 +++ linux/Documentation/kml.txt Sat Apr 27 16:43:56 2002 @@ -0,0 +1,88 @@ +Kernel Mode Linux (http://web.yl.is.s.u-tokyo.ac.jp/~tosh/kml) +Toshiyuki Maeda + + +Introduction: + +Kernel Mode Linux is a technology that enables user programs to be executed +in kernel mode. In Kernel Mode Linux, user programs can be executed as +user processes that have the privilege level of kernel mode. +The benefit of executing user programs in kernel mode +is that the user programs can access the kernel address space directly. +So, for example, user programs can invoke +system calls very quickly because it is unnecessary to switch between kernel +mode and user mode by using costly software interrupts or context switches. +Unlike kernel modules, user programs are executed +as ordinary processes (except for their privilege level), +so scheduling and paging are performed as usual. + +Although it seems dangerous to let user programs access the kernel directly, +the safety of the kernel can be ensured, for example, by static type checking, +software fault isolation, and so forth. +As a proof of concept, we are developing a system based on the combination +of Kernel Mode Linux and Typed Assembly Language (TAL). +(TAL can ensure the safety of programs through its type checking, and +the type checking can be done at the machine binary level. +For more information about TAL, see http://www.cs.cornell.edu/talc) + +Currently, only IA-32 is supported. + + +Instructions: + +To enable Kernel Mode Linux, say Y to the Kernel Mode Linux option in the +kernel configuration, build and install the kernel, and reboot your machine. +Then, all executables under the directory /trusted are executed in kernel mode +in the current Kernel Mode Linux implementation.
For example, to execute a program +named "cat" in kernel mode, copy the program to the directory /trusted +and execute it as follows: + +% /trusted/cat + + +Implementation for IA-32: + +To execute user programs in kernel mode, Kernel Mode Linux has +a special start_thread routine (start_kernel_thread), +which is called in execve(2) and sets the registers +of a user process to specified initial values. The original start_thread +routine sets the CS segment register to USER_CS. The start_kernel_thread routine +sets the CS register to KERNEL_CS (the same goes for DS, SS, and so on). +Thus, a user program is started as a user process executed in kernel mode. + +The biggest problem in implementing Kernel Mode Linux is +the stack starvation problem. Let's assume that a user program is executed +in kernel mode and it causes a page fault on its user stack. +To generate a page fault exception, an IA-32 CPU tries to push several +registers (EIP, CS, and so on) to the same user stack because the program +is executed in kernel mode and the IA-32 CPU doesn't switch its stack +to a kernel stack. Therefore, the IA-32 CPU cannot push the registers, +so it generates a double fault exception and fails again. +Finally, the IA-32 CPU gives up and resets itself. +This is the stack starvation problem. + +To solve the stack starvation problem, we use the IA-32 hardware task mechanism to +handle exceptions. When an IA-32 task is used, the IA-32 CPU doesn't push the registers +to its stack but switches the execution context to a special context. +Therefore, the stack starvation problem doesn't occur. +However, it is costly to handle all exceptions with IA-32 tasks. +So, in the current Kernel Mode Linux implementation, +only the double fault exception is handled by an IA-32 task. + +The other problem is the manual stack switching problem. +In the normal Linux kernel, the IA-32 CPU switches the stack from a user stack +to a kernel stack at exceptions or interrupts.
+However, in Kernel Mode Linux, a user program may be executed in kernel mode +and the IA-32 CPU may not switch the stack. Therefore, +in the current Kernel Mode Linux implementation, the kernel switches the stack +manually at exceptions and interrupts. To switch the stack, +the kernel must know the location of the kernel stack in the address space. +However, at exceptions and interrupts, the kernel cannot use +general registers (EAX, EBX, and so on). Therefore, it is very difficult +to get the location of the kernel stack. + +To solve the above problem, in the current Kernel Mode Linux implementation, +the task struct (and the kernel stack) of a user process is mapped at the bottom of +the address space of the user process at a fixed address. +Therefore, the kernel can get the location of the kernel stack +with one mov instruction from the fixed address. diff -urN linux.orig/MAINTAINERS linux/MAINTAINERS --- linux.orig/MAINTAINERS Wed Apr 24 14:07:35 2002 +++ linux/MAINTAINERS Wed Apr 24 14:08:36 2002 @@ -872,6 +872,12 @@ W: http://kbuild.sourceforge.net S: Maintained +KERNEL MODE LINUX +P: Toshiyuki Maeda +M: tosh@is.s.u-tokyo.ac.jp +W: http://www.yl.is.s.u-tokyo.ac.jp/~tosh/kml/ +S: Maintained + KERNEL NFSD P: Neil Brown M: neilb@cse.unsw.edu.au diff -urN linux.orig/Makefile linux/Makefile --- linux.orig/Makefile Wed Apr 24 14:07:22 2002 +++ linux/Makefile Wed Apr 24 14:08:36 2002 @@ -1,7 +1,7 @@ VERSION = 2 PATCHLEVEL = 4 SUBLEVEL = 18 -EXTRAVERSION = +EXTRAVERSION = -experimental KERNELRELEASE=$(VERSION).$(PATCHLEVEL).$(SUBLEVEL)$(EXTRAVERSION) diff -urN linux.orig/arch/i386/config.in linux/arch/i386/config.in --- linux.orig/arch/i386/config.in Wed Apr 24 14:07:53 2002 +++ linux/arch/i386/config.in Wed Apr 24 14:08:36 2002 @@ -412,6 +412,17 @@ fi mainmenu_option next_comment +comment 'Kernel Mode Linux' + +bool 'Kernel Mode Linux' CONFIG_KERNEL_MODE_LINUX +if [ "$CONFIG_KERNEL_MODE_LINUX" != "n" ]; then + comment ' Safety check has not been implemented' + define_bool
CONFIG_KML_CHECK_SAFETY n +fi + +endmenu + +mainmenu_option next_comment comment 'Kernel hacking' bool 'Kernel debugging' CONFIG_DEBUG_KERNEL diff -urN linux.orig/arch/i386/kernel/entry.S linux/arch/i386/kernel/entry.S --- linux.orig/arch/i386/kernel/entry.S Wed Apr 24 14:07:53 2002 +++ linux/arch/i386/kernel/entry.S Wed Apr 24 14:08:36 2002 @@ -58,6 +58,14 @@ ORIG_EAX = 0x24 EIP = 0x28 CS = 0x2C +#ifdef CONFIG_KERNEL_MODE_LINUX +/* + * CS_HW is used as a stack switch indicator. + * If CS_HW is non-zero, a stack switch occurred. + * That is, we were in Kernel-User mode before the interruption. + */ +CS_HW = 0x2E +#endif EFLAGS = 0x30 OLDESP = 0x34 OLDSS = 0x38 @@ -97,6 +105,82 @@ movl %edx,%ds; \ movl %edx,%es; +#ifdef CONFIG_KERNEL_MODE_LINUX +/* + * These variables (macro constants) are copied from + * include/asm-i386/page.h, include/asm-i386/processor.h + */ +#define PAGE_OFFSET 0xc0000000 +#define SIZEOF_TASK_UNION 8192 +#define OFFSET_OF_ESP0 0x270 +#define __SW_KERNEL_CS (0xffff0000 | __KERNEL_CS) + +#define TASK_SIZE (PAGE_OFFSET - SIZEOF_TASK_UNION) +#define FIX_TASK_START TASK_SIZE + +/* + * This is a pointer to a location where a pointer to a kernel stack is stored. + * In Kernel Mode Linux for IA-32, the task_union of a process is mapped + * at the bottom of the user address space of the process + * as non-pageable kernel memory. This is because we want to get + * a pointer to the per-process kernel stack without using any registers. + */ +#define FIX_KERNEL_STACK_POINTER (FIX_TASK_START+OFFSET_OF_ESP0) + +/* + * This is a macro for stack switching. + */ +#define SWITCH_STACK_TO_KK \ + /* Check whether we were in Kernel-User mode or not. */ \ + cmpl $TASK_SIZE, %esp; \ + /* For ancient processors, clear stack switch in XCS */ \ + /* because they don't clear the high 16 bits of XCS. */ \ + movw $0x0, 6(%esp); \ + ja 1f; \ + /* \ + * We were in Kernel-User mode, \ + * therefore, XCS == __KERNEL_CS.
\ + * Thus, we can safely overwrite XCS \ + */ \ + movl %eax, 4(%esp); /* save %eax to XCS */ \ + movl %esp, %eax; \ + addl $12, %eax; \ + movl (FIX_KERNEL_STACK_POINTER), %esp; \ + addl $-4, %esp; /* XSS */ \ + pushl %eax; /* ESP */ \ + pushl -4(%eax); /* EFLAGS */ \ + pushl $__SW_KERNEL_CS; /* XCS */ \ + pushl -12(%eax); /* EIP */ \ + movl -8(%eax), %eax; /* restore %eax from XCS */ \ + 1: + +/* + * This is as same as the SWITCH_STACK_TO_KK + * but handles an error code on a stack + */ +#define SWITCH_STACK_TO_KK_WITH_ERROR_CODE \ + cmpl $TASK_SIZE, %esp; \ + movw $0x0, 10(%esp); /* clear stack switch in XCS, sigh... */ \ + ja 1f; \ + /* \ + * We are in Kernel-User mode, \ + * therefore, XCS == __KERNEL_CS. \ + */ \ + movl %eax, 8(%esp); /* save %eax to XCS */ \ + movl %esp, %eax; \ + addl $16, %eax; \ + movl (FIX_KERNEL_STACK_POINTER), %esp; \ + addl $-4, %esp; /* XSS */ \ + pushl %eax; /* ESP */ \ + pushl -4(%eax); /* EFLAGS */ \ + pushl $__SW_KERNEL_CS; /* XCS */ \ + pushl -12(%eax); /* EIP */ \ + pushl -16(%eax); /* error_code */ \ + movl -8(%eax), %eax; /* restore %eax from XCS */ \ + 1: +#endif + +#ifndef CONFIG_KERNEL_MODE_LINUX #define RESTORE_ALL \ popl %ebx; \ popl %ecx; \ @@ -127,6 +211,54 @@ .long 2b,5b; \ .long 3b,6b; \ .previous +#else +#define RESTORE_ALL \ + popl %ebx; \ + popl %ecx; \ + popl %edx; \ + popl %esi; \ + popl %edi; \ + popl %ebp; \ + popl %eax; \ +1: popl %ds; \ +2: popl %es; \ + addl $4,%esp; \ +/* Switch stack KK -> KU. 
*/ \ + /* check whether a stack switch occurred or not */ \ + cmpw $0x0, 6(%esp); \ + je 4f; \ + /* clear stack switch record in XCS */ \ + movw $0x0, 6(%esp); \ + pushl %eax; \ + movl 16(%esp), %eax; \ + addl $-16, %eax; \ +3: popl (%eax); \ + popl 4(%eax); \ + popl 8(%eax); \ + popl 12(%eax); \ + movl %eax, %esp; \ + popl %eax; \ +4: iret; \ +.section .fixup,"ax"; \ +5: movl $0,(%esp); \ + jmp 1b; \ +6: movl $0,(%esp); \ + jmp 2b; \ +7: pushl %ss; \ + popl %ds; \ + pushl %ss; \ + popl %es; \ + pushl $11; \ + call do_exit; \ +.previous; \ +.section __ex_table,"a";\ + .align 4; \ + .long 1b,5b; \ + .long 2b,6b; \ + .long 3b,7b; \ + .long 4b,7b; \ +.previous +#endif #define GET_CURRENT(reg) \ movl $-8192, reg; \ @@ -192,6 +324,10 @@ */ ENTRY(system_call) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl %eax # save orig_eax SAVE_ALL GET_CURRENT(%ebx) @@ -252,6 +388,10 @@ movb CS(%esp),%al testl $(VM_MASK | 3),%eax # return to VM86 mode or non-supervisor? jne ret_from_sys_call +#ifdef CONFIG_KERNEL_MODE_LINUX + cmpw $0x0, CS_HW(%esp) # return to Kernel-User mode? + jne ret_from_sys_call +#endif jmp restore_all ALIGN @@ -260,6 +400,10 @@ jmp ret_from_sys_call ENTRY(divide_error) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl $0 # no error code pushl $ SYMBOL_NAME(do_divide_error) ALIGN @@ -292,16 +436,28 @@ jmp ret_from_exception ENTRY(coprocessor_error) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl $0 pushl $ SYMBOL_NAME(do_coprocessor_error) jmp error_code ENTRY(simd_coprocessor_error) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl $0 pushl $ SYMBOL_NAME(do_simd_coprocessor_error) jmp error_code ENTRY(device_not_available) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK.
*/ + SWITCH_STACK_TO_KK +#endif pushl $-1 # mark this as an int SAVE_ALL GET_CURRENT(%ebx) @@ -317,11 +473,19 @@ jmp ret_from_exception ENTRY(debug) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl $0 pushl $ SYMBOL_NAME(do_debug) jmp error_code ENTRY(nmi) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl %eax SAVE_ALL movl %esp,%edx @@ -332,64 +496,262 @@ RESTORE_ALL ENTRY(int3) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl $0 pushl $ SYMBOL_NAME(do_int3) jmp error_code ENTRY(overflow) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl $0 pushl $ SYMBOL_NAME(do_overflow) jmp error_code ENTRY(bounds) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl $0 pushl $ SYMBOL_NAME(do_bounds) jmp error_code ENTRY(invalid_op) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl $0 pushl $ SYMBOL_NAME(do_invalid_op) jmp error_code ENTRY(coprocessor_segment_overrun) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl $0 pushl $ SYMBOL_NAME(do_coprocessor_segment_overrun) jmp error_code ENTRY(double_fault) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK with error code. */ + SWITCH_STACK_TO_KK_WITH_ERROR_CODE +#endif + pushl $ SYMBOL_NAME(do_double_fault) + jmp error_code + +#ifdef CONFIG_KERNEL_MODE_LINUX +ENTRY(double_fault_no_stack_switch) pushl $ SYMBOL_NAME(do_double_fault) jmp error_code +#endif + +#ifdef CONFIG_KERNEL_MODE_LINUX + +PAGE_FAULT_ERROR_CODE = 0x2 +TSS_CR3 = 28 +TSS_EIP = 32 +TSS_EFLAGS = 36 +TSS_CS = 76 +TSS_ESP = 56 +TSS_SS = 80 + +/* + * This is a task-handler for double fault. + * In Kernel Mode Linux, user programs may be executed in ring 0 (kernel mode). 
+ * Therefore, the normal interrupt handling mechanism doesn't work. + * For example, if a page fault occurs on a stack, + * the CPU cannot generate a page fault exception because there is no stack + * on which to save the CPU context. We call this problem "stack starvation". + * To solve stack starvation, we handle the double fault with a task-handler. + */ +ENTRY(double_fault_task) + str %ax # get current Task register. + movzwl %ax, %eax + + movl $gdt_table, %edx + leal (%edx, %eax, 1), %edi # get current Task Gate. + movl 4(%edi), %eax + movzwl 2(%edi), %edi + movl %eax, %esi + andl $0xff000000, %esi + orl %esi, %edi + andl $0x000000ff, %eax + sall $16, %eax + orl %eax, %edi # get current TSS. +/* %edi = current_tss */ + + movzwl (%edi), %eax + leal (%edx, %eax, 1), %ebx # get previous Task Gate. + movl 4(%ebx), %eax + movzwl 2(%ebx), %ebx + movl %eax, %esi + andl $0xff000000, %esi + orl %esi, %ebx + andl $0x000000ff, %eax + sall $16, %eax + orl %eax, %ebx # get previous TSS. +/* %ebx = prev_tss */ + + # get kernel stack. + cmpw $__KERNEL_CS, TSS_CS(%ebx) + jne 1f + movl TSS_ESP(%ebx), %esi + cmpl $TASK_SIZE, %esi + ja 2f +1: + movl (FIX_KERNEL_STACK_POINTER), %esi +2: + movl %esi, %esp +/* From now on, we can use the stack. */ + + # recreate stack layout as if a normal interruption occurred. + cmpw $__KERNEL_CS, TSS_CS(%ebx) + jne 3f + movl TSS_ESP(%ebx), %esi + cmpl $TASK_SIZE, %esi + ja 4f +3: + pushl TSS_SS(%ebx) + pushl TSS_ESP(%ebx) + + movl TSS_ESP(%ebx), %esi +4: + pushl TSS_EFLAGS(%ebx) + pushl TSS_CS(%ebx) + pushl TSS_EIP(%ebx) + + movw $0x0, 6(%esp) + cmpw $__KERNEL_CS, TSS_CS(%ebx) + jne 5f + cmpl $TASK_SIZE, %esi + ja 5f + /* record stack switch in XCS */ + movw $0xffff, 6(%esp) +5: + + # check whether stack starvation occurred or not.
+/* %esi = prev_tss->esp */ + # calling address_exists + addl $-4, %esi /* %esi = prev_tss->esp - 4 */ + addl $-12, %esp + pushl %esi + call address_exists + addl $16, %esp + + testl %eax, %eax + jne 7f +6: + pushl $PAGE_FAULT_ERROR_CODE + movl $page_fault_no_stack_switch, TSS_EIP(%ebx) + andb $253, 37(%ebx) /* == andl $~IF_MASK, TSS_EFLAGS(%ebx) */ + movl %esi, %eax + movl %eax, %cr2 + jmp 9f +7: + addl $-12, %esi /* %esi = prev_tss->esp - 16 */ + addl $-12, %esp + pushl %esi + call address_exists + addl $16, %esp + + testl %eax, %eax + jne 8f + jmp 6b +8: + pushl $0 + movl $double_fault_no_stack_switch, TSS_EIP(%ebx) +9: + andb $254, 37(%ebx) /* == andl $~TF_MASK, TSS_EFLAGS(%ebx) */ + movw $__KERNEL_CS, TSS_CS(%ebx) + movl %esp, TSS_ESP(%ebx) + movw $__KERNEL_DS, TSS_SS(%ebx) + + movl TSS_CR3(%edi), %eax + movl %eax, TSS_CR3(%ebx) + + movl TSS_ESP(%edi), %esp + + iret + jmp double_fault_task +#endif ENTRY(invalid_TSS) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK with error code. */ + SWITCH_STACK_TO_KK_WITH_ERROR_CODE +#endif pushl $ SYMBOL_NAME(do_invalid_TSS) jmp error_code ENTRY(segment_not_present) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK with error code. */ + SWITCH_STACK_TO_KK_WITH_ERROR_CODE +#endif pushl $ SYMBOL_NAME(do_segment_not_present) jmp error_code ENTRY(stack_segment) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK with error code. */ + SWITCH_STACK_TO_KK_WITH_ERROR_CODE +#endif pushl $ SYMBOL_NAME(do_stack_segment) jmp error_code ENTRY(general_protection) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK with error code. */ + SWITCH_STACK_TO_KK_WITH_ERROR_CODE +#endif pushl $ SYMBOL_NAME(do_general_protection) jmp error_code ENTRY(alignment_check) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK with error code. 
*/ + SWITCH_STACK_TO_KK_WITH_ERROR_CODE +#endif pushl $ SYMBOL_NAME(do_alignment_check) jmp error_code ENTRY(page_fault) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK with error code. */ + SWITCH_STACK_TO_KK_WITH_ERROR_CODE +#endif + pushl $ SYMBOL_NAME(do_page_fault) + jmp error_code + +#ifdef CONFIG_KERNEL_MODE_LINUX +ENTRY(page_fault_no_stack_switch) pushl $ SYMBOL_NAME(do_page_fault) jmp error_code +#endif ENTRY(machine_check) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl $0 pushl $ SYMBOL_NAME(do_machine_check) jmp error_code ENTRY(spurious_interrupt_bug) +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Switch stack KU -> KK. */ + SWITCH_STACK_TO_KK +#endif pushl $0 pushl $ SYMBOL_NAME(do_spurious_interrupt_bug) jmp error_code diff -urN linux.orig/arch/i386/kernel/head.S linux/arch/i386/kernel/head.S --- linux.orig/arch/i386/kernel/head.S Wed Apr 24 14:07:53 2002 +++ linux/arch/i386/kernel/head.S Wed Apr 24 14:08:36 2002 @@ -242,7 +242,11 @@ call check_x87 incb ready lgdt gdt_descr +#ifndef CONFIG_KERNEL_MODE_LINUX lidt idt_descr +#else + lidt idt_descrs +#endif ljmp $(__KERNEL_CS),$1f 1: movl $(__KERNEL_DS),%eax # reload all the segment registers movl %eax,%ds # after changing gdt. @@ -303,6 +307,7 @@ * are enabled elsewhere, when we can be relatively * sure everything is ok. 
*/ +#ifndef CONFIG_KERNEL_MODE_LINUX setup_idt: lea ignore_int,%edx movl $(__KERNEL_CS << 16),%eax @@ -318,6 +323,35 @@ dec %ecx jne rp_sidt ret +#else +#define IDT_ENTRIES 256 +setup_idt: + lea SYMBOL_NAME(idt_tables), %edx + lea SYMBOL_NAME(idt_descrs), %edi + mov $(NR_CPUS), %ecx +rp_sidtdescr: + movw $(IDT_ENTRIES * 8 - 1), (%edi) + movl %edx, 2(%edi) + addl $8, %edi + addl $(IDT_ENTRIES * 8), %edx + dec %ecx + jne rp_sidtdescr + + lea ignore_int, %edx + movl $(__KERNEL_CS << 16), %eax + movw %dx, %ax + movw $0x8E00, %dx + + lea SYMBOL_NAME(idt_tables), %edi + movl $(IDT_ENTRIES * NR_CPUS), %ecx +rp_sidt: + movl %eax, (%edi) + movl %edx, 4(%edi) + addl $8, %edi + dec %ecx + jne rp_sidt + ret +#endif ENTRY(stack_start) .long SYMBOL_NAME(init_task_union)+8192 @@ -352,19 +386,25 @@ * the global descriptor table is dependent on the number * of tasks we can have.. */ +#ifndef CONFIG_KERNEL_MODE_LINUX #define IDT_ENTRIES 256 +#endif #define GDT_ENTRIES (__TSS(NR_CPUS)) +#ifndef CONFIG_KERNEL_MODE_LINUX .globl SYMBOL_NAME(idt) +#endif .globl SYMBOL_NAME(gdt) ALIGN +#ifndef CONFIG_KERNEL_MODE_LINUX .word 0 idt_descr: .word IDT_ENTRIES*8-1 # idt contains 256 entries SYMBOL_NAME(idt): .long SYMBOL_NAME(idt_table) +#endif .word 0 gdt_descr: diff -urN linux.orig/arch/i386/kernel/init_task.c linux/arch/i386/kernel/init_task.c --- linux.orig/arch/i386/kernel/init_task.c Wed Apr 24 14:07:53 2002 +++ linux/arch/i386/kernel/init_task.c Wed Apr 24 14:08:36 2002 @@ -31,3 +31,11 @@ */ struct tss_struct init_tss[NR_CPUS] __cacheline_aligned = { [0 ... NR_CPUS-1] = INIT_TSS }; +#ifdef CONFIG_KERNEL_MODE_LINUX +/* + * We need a per-CPU TSS for the double fault task-handler + * because a task-handler cannot be executed concurrently. + */ +struct tss_struct dfts[NR_CPUS] __cacheline_aligned = { [0 ...
NR_CPUS-1] = INIT_DFT }; +unsigned long null_for_dft; +#endif diff -urN linux.orig/arch/i386/kernel/setup.c linux/arch/i386/kernel/setup.c --- linux.orig/arch/i386/kernel/setup.c Wed Apr 24 14:07:53 2002 +++ linux/arch/i386/kernel/setup.c Wed Apr 24 14:08:36 2002 @@ -2879,6 +2879,9 @@ { int nr = smp_processor_id(); struct tss_struct * t = &init_tss[nr]; +#ifdef CONFIG_KERNEL_MODE_LINUX + struct tss_struct* dft = &dfts[nr]; +#endif if (test_and_set_bit(nr, &cpu_initialized)) { printk(KERN_WARNING "CPU#%d already initialized!\n", nr); @@ -2898,7 +2901,11 @@ #endif __asm__ __volatile__("lgdt %0": "=m" (gdt_descr)); +#ifndef CONFIG_KERNEL_MODE_LINUX __asm__ __volatile__("lidt %0": "=m" (idt_descr)); +#else + __asm__ __volatile__("lidt %0": "=m" (idt_descrs[nr])); +#endif /* * Delete NT @@ -2919,6 +2926,14 @@ gdt_table[__TSS(nr)].b &= 0xfffffdff; load_TR(nr); load_LDT(&init_mm); + +#ifdef CONFIG_KERNEL_MODE_LINUX + __asm__("pushl $0x00004002; popl %0\n\t" : "=m" (dft->eflags)); + set_dft_desc(nr, dft); + + t->ldt = __LDT(nr) << 3; + dft->ldt = __LDT(nr) << 3; +#endif /* * Clear all 6 debug registers: diff -urN linux.orig/arch/i386/kernel/signal.c linux/arch/i386/kernel/signal.c --- linux.orig/arch/i386/kernel/signal.c Wed Apr 24 14:07:53 2002 +++ linux/arch/i386/kernel/signal.c Wed Apr 24 14:08:36 2002 @@ -197,10 +197,22 @@ err |= __get_user(tmp, &sc->seg); \ regs->x##seg = tmp; } +#ifndef CONFIG_KERNEL_MODE_LINUX #define COPY_SEG_STRICT(seg) \ { unsigned short tmp; \ err |= __get_user(tmp, &sc->seg); \ regs->x##seg = tmp|3; } +#else +#define COPY_SEG_STRICT(seg) \ + { unsigned short tmp; \ + err |= __get_user(tmp, &sc->seg); \ + regs->x##seg = tmp|(regs->x##seg & 3); } + +#define COPY_CS_STRICT \ + { unsigned long tmp; \ + err |= __get_user(tmp, &sc->xcs); \ + regs->xcs = tmp|(regs->xcs & 3); } +#endif #define GET_SEG(seg) \ { unsigned short tmp; \ @@ -219,7 +231,11 @@ COPY(edx); COPY(ecx); COPY(eip); +#ifndef CONFIG_KERNEL_MODE_LINUX COPY_SEG_STRICT(cs); +#else + 
COPY_CS_STRICT; +#endif COPY_SEG_STRICT(ss); { @@ -340,7 +356,11 @@ err |= __put_user(current->thread.trap_no, &sc->trapno); err |= __put_user(current->thread.error_code, &sc->err); err |= __put_user(regs->eip, &sc->eip); +#ifndef CONFIG_KERNEL_MODE_LINUX err |= __put_user(regs->xcs, (unsigned int *)&sc->cs); +#else + err |= __put_user(regs->xcs, &sc->xcs); +#endif err |= __put_user(regs->eflags, &sc->eflags); err |= __put_user(regs->esp, &sc->esp_at_signal); err |= __put_user(regs->xss, (unsigned int *)&sc->ss); @@ -376,11 +396,20 @@ } /* This is the legacy signal stack switching. */ +#ifndef CONFIG_KERNEL_MODE_LINUX else if ((regs->xss & 0xffff) != __USER_DS && !(ka->sa.sa_flags & SA_RESTORER) && ka->sa.sa_restorer) { esp = (unsigned long) ka->sa.sa_restorer; } +#else + else if ((regs->xss & 0xffff) != __USER_DS && + (regs->esp > TASK_SIZE) && + !(ka->sa.sa_flags & SA_RESTORER) && + ka->sa.sa_restorer) { + esp = (unsigned long) ka->sa.sa_restorer; + } +#endif return (void *)((esp - frame_size) & -8ul); } @@ -435,11 +464,13 @@ regs->esp = (unsigned long) frame; regs->eip = (unsigned long) ka->sa.sa_handler; +#ifndef CONFIG_KERNEL_MODE_LINUX set_fs(USER_DS); regs->xds = __USER_DS; regs->xes = __USER_DS; regs->xss = __USER_DS; regs->xcs = __USER_CS; +#endif regs->eflags &= ~TF_MASK; #if DEBUG_SIG @@ -510,11 +541,13 @@ regs->esp = (unsigned long) frame; regs->eip = (unsigned long) ka->sa.sa_handler; +#ifndef CONFIG_KERNEL_MODE_LINUX set_fs(USER_DS); regs->xds = __USER_DS; regs->xes = __USER_DS; regs->xss = __USER_DS; regs->xcs = __USER_CS; +#endif regs->eflags &= ~TF_MASK; #if DEBUG_SIG @@ -592,8 +625,13 @@ * kernel mode. Just return without doing anything * if so. 
*/ +#ifndef CONFIG_KERNEL_MODE_LINUX if ((regs->xcs & 3) != 3) return 1; +#else + if ((regs->xcs & 3) != 3 && (regs->xcs & 0xffff0000) == 0) + return 1; +#endif if (!oldset) oldset = &current->blocked; diff -urN linux.orig/arch/i386/kernel/traps.c linux/arch/i386/kernel/traps.c --- linux.orig/arch/i386/kernel/traps.c Wed Apr 24 14:07:53 2002 +++ linux/arch/i386/kernel/traps.c Wed Apr 24 14:08:36 2002 @@ -62,7 +62,18 @@ * F0 0F bug workaround.. We have a special link segment * for this. */ +#ifndef CONFIG_KERNEL_MODE_LINUX struct desc_struct idt_table[256] __attribute__((__section__(".data.idt"))) = { {0, 0}, }; +#else +/* + * We need per CPU idt because we handle double fault as a task + * and the task cannot be executed concurrently. + */ +typedef struct Xgt_desc_struct idt_table_type[256]; +idt_table_type idt_tables[NR_CPUS] __attribute__((__section__(".data.idt"))); + +struct Xgt_desc_struct idt_descrs[NR_CPUS]; +#endif asmlinkage void divide_error(void); asmlinkage void debug(void); @@ -195,7 +206,11 @@ esp = (unsigned long) (&regs->esp); ss = __KERNEL_DS; +#ifndef CONFIG_KERNEL_MODE_LINUX if (regs->xcs & 3) { +#else + if ((regs->xcs & 3) || (regs->xcs & 0xffff0000)) { +#endif in_kernel = 0; esp = regs->esp; ss = regs->xss & 0xffff; @@ -253,8 +268,13 @@ static inline void die_if_kernel(const char * str, struct pt_regs * regs, long err) { +#ifndef CONFIG_KERNEL_MODE_LINUX if (!(regs->eflags & VM_MASK) && !(3 & regs->xcs)) die(str, regs, err); +#else + if (!(regs->eflags & VM_MASK) && !(3 & regs->xcs) && !(0xffff0000 & regs->xcs)) + die(str, regs, err); +#endif } static inline unsigned long get_cr2(void) @@ -271,8 +291,13 @@ { if (vm86 && regs->eflags & VM_MASK) goto vm86_trap; +#ifndef CONFIG_KERNEL_MODE_LINUX if (!(regs->xcs & 3)) goto kernel_trap; +#else + if (!(regs->xcs & 3) && !(regs->xcs & 0xffff0000)) + goto kernel_trap; +#endif trap_signal: { struct task_struct *tsk = current; @@ -353,8 +378,13 @@ if (regs->eflags & VM_MASK) goto gp_in_vm86; +#ifndef
CONFIG_KERNEL_MODE_LINUX if (!(regs->xcs & 3)) goto gp_in_kernel; +#else + if (!(regs->xcs & 3) && !(regs->xcs & 0xffff0000)) + goto gp_in_kernel; +#endif current->thread.error_code = error_code; current->thread.trap_no = 13; @@ -519,8 +549,14 @@ /* If this is a kernel mode trap, save the user PC on entry to * the kernel, that's what the debugger can make sense of. */ +#ifndef CONFIG_KERNEL_MODE_LINUX info.si_addr = ((regs->xcs & 3) == 0) ? (void *)tsk->thread.eip : (void *)regs->eip; +#else + info.si_addr = ((regs->xcs & 3) == 0 && (regs->xcs & 0xffff0000) == 0) ? + (void *)tsk->thread.eip : + (void *)regs->eip; +#endif force_sig_info(SIGTRAP, &info, tsk); /* Disable additional traps. They'll be re-enabled when @@ -720,6 +756,8 @@ #endif /* CONFIG_MATH_EMULATION */ #ifndef CONFIG_M686 + +#ifndef CONFIG_KERNEL_MODE_LINUX void __init trap_init_f00f_bug(void) { unsigned long page; @@ -750,6 +788,47 @@ idt = (struct desc_struct *)page; __asm__ __volatile__("lidt %0": "=m" (idt_descr)); } +#else +void __init trap_init_f00f_bug(void) +{ + unsigned long page; + unsigned long offset; + int i; + + /* + * Allocate a new page in virtual address space, + * move the IDT into it and write protect this page. + */ + + page = (unsigned long)vmalloc(sizeof(idt_tables)); + + for (offset = 0; offset < sizeof(idt_tables); offset += PAGE_SIZE) { + pgd_t* pgd; + pmd_t* pmd; + pte_t* pte; + + pgd = pgd_offset(&init_mm, (page + offset)); + pmd = pmd_offset(pgd, page); + pte = pte_offset(pmd, page); + __free_page(pte_page(*pte)); + *pte = mk_pte_phys(__pa(&idt_tables) + offset, PAGE_KERNEL_RO); + } + /* + * Not that any PGE-capable kernel should have the f00f bug ... + */ + __flush_tlb_all(); + + for (i = 0; i < NR_CPUS; i++) { + idt_descrs[i].address = (unsigned long)&((idt_table_type*)page)[i]; + } + + /* XXX : Is idt initialization required here ? 
*/ + /* + __asm__ __volatile__("lidt %0": "=m" (idt_descr)); + */ +} +#endif + #endif #define _set_gate(gate_addr,type,dpl,addr) \ @@ -774,18 +853,67 @@ */ void set_intr_gate(unsigned int n, void *addr) { +#ifndef CONFIG_KERNEL_MODE_LINUX _set_gate(idt_table+n,14,0,addr); +#else + int i; + + for (i = 0; i < NR_CPUS; i++) { + _set_gate(idt_tables[i] + n, 14, 0, addr); + } +#endif } static void __init set_trap_gate(unsigned int n, void *addr) { +#ifndef CONFIG_KERNEL_MODE_LINUX _set_gate(idt_table+n,15,0,addr); +#else + int i; + + for (i = 0; i < NR_CPUS; i++) { + _set_gate(idt_tables[i] + n, 15, 0, addr); + } +#endif } static void __init set_system_gate(unsigned int n, void *addr) { +#ifndef CONFIG_KERNEL_MODE_LINUX _set_gate(idt_table+n,15,3,addr); +#else + int i; + + for (i = 0; i < NR_CPUS; i++) { + _set_gate(idt_tables[i] + n, 15, 3, addr); + } +#endif +} + +#ifdef CONFIG_KERNEL_MODE_LINUX + +#define _set_task_gate(gate_addr,dpl,tss_sel) \ +do { \ + int __d0, __d1; \ + __asm__ __volatile__ ( \ + "movw %4,%%dx\n\t" \ + "movl %%eax,%0\n\t" \ + "movl %%edx,%1" \ + :"=m" (*((long *) (gate_addr))), \ + "=m" (*(1+(long *) (gate_addr))), "=&a" (__d0), "=&d" (__d1) \ + :"i" ((short) (0x8000+(dpl<<13)+(5<<8))), \ + "3" (0),"2" (tss_sel << 16)); \ +} while (0) + +static void __init set_double_fault_task(void) +{ + int i; + + for (i = 0; i < NR_CPUS; i++) { + _set_task_gate(idt_tables[i] + 8, 0, (__DFT(i) << 3)); + } } +#endif static void __init set_call_gate(void *a, void *addr) { @@ -823,6 +951,13 @@ _set_tssldt_desc(gdt_table+__LDT(n), (int)addr, ((size << 3)-1), 0x82); } +#ifdef CONFIG_KERNEL_MODE_LINUX +void set_dft_desc(unsigned int n, void* addr) +{ + _set_tssldt_desc(gdt_table+__DFT(n), (int)addr, 235, 0x89); +} +#endif + #ifdef CONFIG_X86_VISWS_APIC /* @@ -928,7 +1063,11 @@ set_system_gate(5,&bounds); set_trap_gate(6,&invalid_op); set_trap_gate(7,&device_not_available); +#ifndef CONFIG_KERNEL_MODE_LINUX set_trap_gate(8,&double_fault); +#else + 
set_double_fault_task(); +#endif set_trap_gate(9,&coprocessor_segment_overrun); set_trap_gate(10,&invalid_TSS); set_trap_gate(11,&segment_not_present); diff -urN linux.orig/arch/i386/mm/fault.c linux/arch/i386/mm/fault.c --- linux.orig/arch/i386/mm/fault.c Wed Apr 24 14:07:53 2002 +++ linux/arch/i386/mm/fault.c Wed Apr 24 14:08:36 2002 @@ -132,7 +132,11 @@ } asmlinkage void do_invalid_op(struct pt_regs *, unsigned long); +#ifndef CONFIG_KERNEL_MODE_LINUX extern unsigned long idt; +#else +#include +#endif /* * This routine handles page faults. It determines the address, @@ -177,7 +181,11 @@ * (error_code & 4) == 0, and that the fault was not a * protection error (error_code & 1) == 0. */ +#ifndef CONFIG_KERNEL_MODE_LINUX if (address >= TASK_SIZE && !(error_code & 5)) +#else + if (address >= TASK_SIZE && !(error_code & 5) && (regs->xcs & 0xffff0000) == 0) +#endif goto vmalloc_fault; mm = tsk->mm; @@ -199,7 +207,11 @@ goto good_area; if (!(vma->vm_flags & VM_GROWSDOWN)) goto bad_area; +#ifndef CONFIG_KERNEL_MODE_LINUX if (error_code & 4) { +#else + if (error_code & 4 || (regs->xcs & 0xffff0000) != 0) { +#endif /* * accessing the stack below %esp is always a bug. 
* The "+ 32" is there due to some instructions (like @@ -275,7 +287,11 @@ up_read(&mm->mmap_sem); /* User mode accesses just cause a SIGSEGV */ +#ifndef CONFIG_KERNEL_MODE_LINUX if (error_code & 4) { +#else + if (error_code & 4 || (regs->xcs & 0xffff0000) != 0) { +#endif tsk->thread.cr2 = address; tsk->thread.error_code = error_code; tsk->thread.trap_no = 14; @@ -293,7 +309,11 @@ if (boot_cpu_data.f00f_bug) { unsigned long nr; +#ifndef CONFIG_KERNEL_MODE_LINUX nr = (address - idt) >> 3; +#else + nr = (address - (unsigned long)idt_tables[smp_processor_id()]) >> 3; +#endif if (nr == 6) { do_invalid_op(regs, 0); @@ -348,7 +368,11 @@ goto survive; } printk("VM: killing process %s\n", tsk->comm); +#ifndef CONFIG_KERNEL_MODE_LINUX if (error_code & 4) +#else + if (error_code & 4 || (regs->xcs & 0xffff0000) != 0) +#endif do_exit(SIGKILL); goto no_context; @@ -369,7 +393,7 @@ force_sig_info(SIGBUS, &info, tsk); /* Kernel mode? Handle exceptions or die */ - if (!(error_code & 4)) + if (!(error_code & 4) && (regs->xcs & 0xffff0000) == 0) goto no_context; return; @@ -406,4 +430,18 @@ goto no_context; return; } +} + +asmlinkage int address_exists(unsigned long address) +{ + struct mm_struct* mm; + pgd_t* pgd; + + mm = current->mm; + if (mm == NULL) + return 0; + + pgd = pgd_offset(mm, address); + + return address_exists_in_pgd(pgd, address); } diff -urN linux.orig/fs/binfmt_elf.c linux/fs/binfmt_elf.c --- linux.orig/fs/binfmt_elf.c Wed Apr 24 14:07:23 2002 +++ linux/fs/binfmt_elf.c Wed May 29 02:47:28 2002 @@ -423,6 +423,42 @@ #define INTERPRETER_AOUT 1 #define INTERPRETER_ELF 2 +#ifdef CONFIG_KERNEL_MODE_LINUX +/* + * XXX : we haven't implemented safety check of user programs. 
+ */ +#define TRUSTED_DIR_STR "/trusted/" +#define TRUSTED_DIR_STR_LEN 9 + +static inline int is_safe(struct file* file) +{ + int ret; + char* path; + char* tmp; + struct fs_struct* cur_fs; + + tmp = (char*)__get_free_page(GFP_KERNEL); + + if (!tmp) { + return 0; + } + + path = d_path(file->f_dentry, file->f_vfsmnt, tmp, PAGE_SIZE); + ret = (0 == strncmp(TRUSTED_DIR_STR, path, TRUSTED_DIR_STR_LEN)); + if (ret) { + /* Check whether we are "chroot"ed */ + /* XXX : I don't know how to check whether a chroot occurred. Is this code correct? */ + cur_fs = current->fs; + read_lock(&cur_fs->lock); + spin_lock(&dcache_lock); + ret = IS_ROOT(cur_fs->root); + spin_unlock(&dcache_lock); + read_unlock(&cur_fs->lock); + } + free_page((unsigned long)tmp); + return ret; +} +#endif static int load_elf_binary(struct linux_binprm * bprm, struct pt_regs * regs) { @@ -773,7 +809,15 @@ ELF_PLAT_INIT(regs); #endif +#if !defined(CONFIG_KERNEL_MODE_LINUX) || defined(CONFIG_KML_CHECK_SAFETY) start_thread(regs, elf_entry, bprm->p); +#else + if (is_safe(bprm->file)) { + start_kernel_thread(regs, elf_entry, bprm->p); + } else { + start_thread(regs, elf_entry, bprm->p); + } +#endif if (current->ptrace & PT_PTRACED) send_sig(SIGTRAP, current, 0); retval = 0; diff -urN linux.orig/fs/exec.c linux/fs/exec.c --- linux.orig/fs/exec.c Wed Apr 24 14:07:22 2002 +++ linux/fs/exec.c Wed Apr 24 14:08:36 2002 @@ -397,7 +397,11 @@ old_mm = current->mm; if (old_mm && atomic_read(&old_mm->mm_users) == 1) { mm_release(); +#ifndef CONFIG_KERNEL_MODE_LINUX exit_mmap(old_mm); +#else + exit_user_mmap(old_mm); +#endif return 0; } diff -urN linux.orig/include/asm-i386/desc.h linux/include/asm-i386/desc.h --- linux.orig/include/asm-i386/desc.h Wed Apr 24 14:07:25 2002 +++ linux/include/asm-i386/desc.h Wed Apr 24 14:08:36 2002 @@ -24,11 +24,19 @@ * * 12 - CPU#0 TSS <-- new cacheline * 13 - CPU#0 LDT +#ifndef CONFIG_KERNEL_MODE_LINUX * 14 - not used +#else + * 14 - CPU#0 Double Fault Task +#endif * 15 - not used * 
16 - CPU#1 TSS <-- new cacheline * 17 - CPU#1 LDT +#ifndef CONFIG_KERNEL_MODE_LINUX * 18 - not used +#else + * 18 - CPU#1 Double Fault Task +#endif * 19 - not used * ... NR_CPUS per-CPU TSS+LDT's if on SMP * @@ -36,9 +44,15 @@ */ #define __FIRST_TSS_ENTRY 12 #define __FIRST_LDT_ENTRY (__FIRST_TSS_ENTRY+1) +#ifdef CONFIG_KERNEL_MODE_LINUX +#define __FIRST_DFT_ENTRY (__FIRST_TSS_ENTRY+2) +#endif #define __TSS(n) (((n)<<2) + __FIRST_TSS_ENTRY) #define __LDT(n) (((n)<<2) + __FIRST_LDT_ENTRY) +#ifdef CONFIG_KERNEL_MODE_LINUX +#define __DFT(n) (((n)<<2) + __FIRST_DFT_ENTRY) +#endif #ifndef __ASSEMBLY__ struct desc_struct { @@ -46,14 +60,27 @@ }; extern struct desc_struct gdt_table[]; +#ifndef CONFIG_KERNEL_MODE_LINUX extern struct desc_struct *idt, *gdt; +#else +extern struct desc_struct* gdt; +#endif struct Xgt_desc_struct { unsigned short size; unsigned long address __attribute__((packed)); +#ifdef CONFIG_KERNEL_MODE_LINUX + unsigned short __pad __attribute__((packed)); +#endif }; +#ifndef CONFIG_KERNEL_MODE_LINUX #define idt_descr (*(struct Xgt_desc_struct *)((char *)&idt - 2)) +#else +extern struct Xgt_desc_struct idt_descrs[NR_CPUS]; +extern struct Xgt_desc_struct idt_tables[NR_CPUS][256]; +#endif + #define gdt_descr (*(struct Xgt_desc_struct *)((char *)&gdt - 2)) #define load_TR(n) __asm__ __volatile__("ltr %%ax"::"a" (__TSS(n)<<3)) @@ -68,6 +95,9 @@ extern void set_intr_gate(unsigned int irq, void * addr); extern void set_ldt_desc(unsigned int n, void *addr, unsigned int size); extern void set_tss_desc(unsigned int n, void *addr); +#ifdef CONFIG_KERNEL_MODE_LINUX +extern void set_dft_desc(unsigned int n, void* addr); +#endif static inline void clear_LDT(void) { diff -urN linux.orig/include/asm-i386/hw_irq.h linux/include/asm-i386/hw_irq.h --- linux.orig/include/asm-i386/hw_irq.h Wed Apr 24 14:07:25 2002 +++ linux/include/asm-i386/hw_irq.h Wed Apr 24 14:08:36 2002 @@ -95,6 +95,7 @@ #define __STR(x) #x #define STR(x) __STR(x) +#ifndef CONFIG_KERNEL_MODE_LINUX 
#define SAVE_ALL \ "cld\n\t" \ "pushl %es\n\t" \ @@ -109,9 +110,51 @@ "movl $" STR(__KERNEL_DS) ",%edx\n\t" \ "movl %edx,%ds\n\t" \ "movl %edx,%es\n\t" +#else +#define SAVE_ALL \ + "cld\n\t" \ + "pushl %%es\n\t" \ + "pushl %%ds\n\t" \ + "pushl %%eax\n\t" \ + "pushl %%ebp\n\t" \ + "pushl %%edi\n\t" \ + "pushl %%esi\n\t" \ + "pushl %%edx\n\t" \ + "pushl %%ecx\n\t" \ + "pushl %%ebx\n\t" \ + "movl $" STR(__KERNEL_DS) ",%%edx\n\t" \ + "movl %%edx,%%ds\n\t" \ + "movl %%edx,%%es\n\t" +#endif + +#ifdef CONFIG_KERNEL_MODE_LINUX +/* Same as a macro in arch/i386/kernel/entry.S */ +#define SWITCH_STACK_TO_KK \ + "cmpl %0, %%esp\n\t" \ + "movw $0x0, 6(%%esp)\n\t" \ + "ja 1f\n\t" \ + "movl %%eax, 4(%%esp)\n\t" \ + "movl %%esp, %%eax\n\t" \ + "addl $12, %%eax\n\t" \ + "movl (%2), %%esp\n\t" \ + "addl $-4, %%esp\n\t" \ + "pushl %%eax\n\t" \ + "pushl -4(%%eax)\n\t" \ + "pushl %1\n\t" \ + "pushl -12(%%eax)\n\t" \ + "movl -8(%%eax), %%eax\n\t" \ + "1:\n\t" +#define SWITCH_STACK_TO_KK_CONSTRAINTS \ + : : "i" (TASK_SIZE), \ + "i" (__SW_KERNEL_CS), \ + "m" (*((unsigned long*)FIX_KERNEL_STACK_POINTER)) +#endif #define IRQ_NAME2(nr) nr##_interrupt(void) #define IRQ_NAME(nr) IRQ_NAME2(IRQ##nr) +#ifdef CONFIG_KERNEL_MODE_LINUX +#define DUMMY_IRQ_NAME(nr) IRQ_NAME(_dummy_##nr) +#endif #define GET_CURRENT \ "movl %esp, %ebx\n\t" \ @@ -123,6 +166,7 @@ /* there is a second layer of macro just to get the symbolic name for the vector evaluated. 
This change is for RTLinux */ +#ifndef CONFIG_KERNEL_MODE_LINUX #define BUILD_SMP_INTERRUPT(x,v) XBUILD_SMP_INTERRUPT(x,v) #define XBUILD_SMP_INTERRUPT(x,v)\ asmlinkage void x(void); \ @@ -135,7 +179,28 @@ SYMBOL_NAME_STR(call_##x)":\n\t" \ "call "SYMBOL_NAME_STR(smp_##x)"\n\t" \ "jmp ret_from_intr\n"); +#else +#define BUILD_SMP_INTERRUPT(x,v) XBUILD_SMP_INTERRUPT(x,v) +#define XBUILD_SMP_INTERRUPT(x,v)\ +asmlinkage void x(void); \ +asmlinkage void call_##x(void); \ +static void dummy_##x(void) __attribute__ ((unused)); \ +static void dummy_##x(void) { \ +__asm__( \ +"\n"__ALIGN_STR"\n" \ +SYMBOL_NAME_STR(x) ":\n\t" \ +/* XXX : Switch stack KU -> KK. */ \ + SWITCH_STACK_TO_KK \ + "pushl $"#v"-256\n\t" \ + SAVE_ALL \ + SYMBOL_NAME_STR(call_##x)":\n\t" \ + "call "SYMBOL_NAME_STR(smp_##x)"\n\t" \ + "jmp ret_from_intr\n" \ + SWITCH_STACK_TO_KK_CONSTRAINTS); \ +} +#endif +#ifndef CONFIG_KERNEL_MODE_LINUX #define BUILD_SMP_TIMER_INTERRUPT(x,v) XBUILD_SMP_TIMER_INTERRUPT(x,v) #define XBUILD_SMP_TIMER_INTERRUPT(x,v) \ asmlinkage void x(struct pt_regs * regs); \ @@ -151,16 +216,42 @@ "call "SYMBOL_NAME_STR(smp_##x)"\n\t" \ "addl $4,%esp\n\t" \ "jmp ret_from_intr\n"); +#else +#define BUILD_SMP_TIMER_INTERRUPT(x,v) XBUILD_SMP_TIMER_INTERRUPT(x,v) +#define XBUILD_SMP_TIMER_INTERRUPT(x,v) \ +asmlinkage void x(struct pt_regs * regs); \ +asmlinkage void call_##x(void); \ +static void dummy_##x(void) __attribute__ ((unused)); \ +static void dummy_##x(void) { \ +__asm__( \ +"\n"__ALIGN_STR"\n" \ +SYMBOL_NAME_STR(x) ":\n\t" \ +/* XXX : Switch stack KU -> KK. 
*/ \ + SWITCH_STACK_TO_KK \ + "pushl $"#v"-256\n\t" \ + SAVE_ALL \ + "movl %%esp,%%eax\n\t" \ + "pushl %%eax\n\t" \ + SYMBOL_NAME_STR(call_##x)":\n\t" \ + "call "SYMBOL_NAME_STR(smp_##x)"\n\t" \ + "addl $4,%%esp\n\t" \ + "jmp ret_from_intr\n" \ + SWITCH_STACK_TO_KK_CONSTRAINTS); \ +} +#endif #define BUILD_COMMON_IRQ() \ asmlinkage void call_do_IRQ(void); \ +static void dummy_call_do_IRQ(void) __attribute__ ((unused)); \ +static void dummy_call_do_IRQ(void) { \ __asm__( \ "\n" __ALIGN_STR"\n" \ "common_interrupt:\n\t" \ SAVE_ALL \ SYMBOL_NAME_STR(call_do_IRQ)":\n\t" \ "call " SYMBOL_NAME_STR(do_IRQ) "\n\t" \ - "jmp ret_from_intr\n"); + "jmp ret_from_intr\n" : :); \ +} /* * subtle. orig_eax is used by the signal code to distinct between @@ -171,7 +262,7 @@ * * Subtle as a pigs ear. VY */ - +#ifndef CONFIG_KERNEL_MODE_LINUX #define BUILD_IRQ(nr) \ asmlinkage void IRQ_NAME(nr); \ __asm__( \ @@ -179,6 +270,21 @@ SYMBOL_NAME_STR(IRQ) #nr "_interrupt:\n\t" \ "pushl $"#nr"-256\n\t" \ "jmp common_interrupt"); +#else +#define BUILD_IRQ(nr) \ +asmlinkage void IRQ_NAME(nr); \ +static void DUMMY_IRQ_NAME(nr) __attribute__ ((unused)); \ +static void DUMMY_IRQ_NAME(nr) { \ +__asm__( \ +"\n"__ALIGN_STR"\n" \ +SYMBOL_NAME_STR(IRQ) #nr "_interrupt:\n\t" \ +/* XXX : Switch stack KU -> KK. */ \ + SWITCH_STACK_TO_KK \ + "pushl $"#nr"-256\n\t" \ + "jmp common_interrupt" \ + SWITCH_STACK_TO_KK_CONSTRAINTS); \ +} +#endif extern unsigned long prof_cpu_mask; extern unsigned int * prof_buffer; diff -urN linux.orig/include/asm-i386/mmu_context.h linux/include/asm-i386/mmu_context.h --- linux.orig/include/asm-i386/mmu_context.h Wed Apr 24 14:07:25 2002 +++ linux/include/asm-i386/mmu_context.h Wed Apr 24 14:08:36 2002 @@ -10,7 +10,79 @@ * possibly do the LDT unload here? 
*/ #define destroy_context(mm) do { } while(0) +#ifndef CONFIG_KERNEL_MODE_LINUX #define init_new_context(tsk,mm) 0 +#else + +static inline int map_kernel_page_one(struct mm_struct* mm, unsigned long src_addr, unsigned long dst_addr) +{ + pgd_t* pgd; + pmd_t* pmd; + pte_t* pte; + pte_t pteval; + + pgd = pgd_offset(mm, dst_addr); + pmd = pmd_alloc(mm, pgd, dst_addr); + if (pmd == NULL) { + return -ENOMEM; + } + pte = pte_alloc(mm, pmd, dst_addr); + if (pte == NULL) { + return -ENOMEM; + } + pteval = mk_pte(virt_to_page(src_addr), __pgprot(__PAGE_KERNEL)); + set_pte(pte, pteval); + + set_pmd(pmd, pmd_mksticky(*pmd)); + set_pgd(pgd, pgd_mksticky(*pgd)); + + return 0; +} + +static inline void clear_sticky_pte(struct mm_struct* mm) +{ + unsigned long addr; + + for (addr = FIX_TASK_START; addr < FIX_TASK_END; addr += PAGE_SIZE) { + pgd_t* pgd; + pmd_t* pmd; + pte_t* pte; + + pgd = pgd_offset(mm, addr); + if (!pgd_present(*pgd) || pgd_bad(*pgd)) + continue; + pmd = pmd_offset(pgd, addr); + if (!pmd_present(*pmd) || pmd_bad(*pmd)) + continue; + pte = pte_offset(pmd, addr); + + pte_clear(pte); + } +} + +/* + * Map a task_union of a process to the bottom of the address space of + * the process. 
+ */ +static inline int init_new_context(struct task_struct* tsk, struct mm_struct* mm) +{ + int ret = 0; + unsigned long addr; + + spin_lock(&mm->page_table_lock); + for (addr = FIX_TASK_START; addr < FIX_TASK_END; addr += PAGE_SIZE) { + if (map_kernel_page_one(mm, ((unsigned long)tsk) + addr - FIX_TASK_START, addr)) { + ret = -ENOMEM; + break; + } + } + spin_unlock(&mm->page_table_lock); + if (ret) { + clear_page_tables(mm, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD); + } + return ret; +} +#endif #ifdef CONFIG_SMP @@ -41,6 +113,10 @@ #endif set_bit(cpu, &next->cpu_vm_mask); set_bit(cpu, &next->context.cpuvalid); + +#ifdef CONFIG_KERNEL_MODE_LINUX + dfts[smp_processor_id()].__cr3 = __pa(next->pgd); +#endif /* Re-load page tables */ asm volatile("movl %0,%%cr3": :"r" (__pa(next->pgd))); } diff -urN linux.orig/include/asm-i386/pgtable.h linux/include/asm-i386/pgtable.h --- linux.orig/include/asm-i386/pgtable.h Wed Apr 24 14:07:25 2002 +++ linux/include/asm-i386/pgtable.h Wed Apr 24 14:08:36 2002 @@ -129,7 +129,11 @@ #define PGDIR_SIZE (1UL << PGDIR_SHIFT) #define PGDIR_MASK (~(PGDIR_SIZE-1)) +#ifndef CONFIG_KERNEL_MODE_LINUX #define USER_PTRS_PER_PGD (TASK_SIZE/PGDIR_SIZE) +#else +#define USER_PTRS_PER_PGD (((TASK_SIZE+PGDIR_SIZE-1)&PGDIR_MASK)/PGDIR_SIZE) +#endif #define FIRST_USER_PGD_NR 0 #define USER_PGD_PTRS (PAGE_OFFSET >> PGDIR_SHIFT) @@ -174,6 +178,13 @@ #define _PAGE_BIT_DIRTY 6 #define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page, Pentium+, if present.. */ #define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */ +#ifdef CONFIG_KERNEL_MODE_LINUX +/* + * This bit in PTE indicates that a page is sticky page, + * that is, the page is not unmapped in normal fork & exec. + */ +#define _PAGE_BIT_STICKY 9 +#endif #define _PAGE_PRESENT 0x001 #define _PAGE_RW 0x002 @@ -184,6 +195,9 @@ #define _PAGE_DIRTY 0x040 #define _PAGE_PSE 0x080 /* 4 MB (or 2MB) page, Pentium+, if present.. 
*/ #define _PAGE_GLOBAL 0x100 /* Global TLB entry PPro+ */ +#ifdef CONFIG_KERNEL_MODE_LINUX +#define _PAGE_STICKY 0x200 +#endif #define _PAGE_PROTNONE 0x080 /* If not present */ @@ -261,7 +275,11 @@ #define pmd_none(x) (!pmd_val(x)) #define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT) #define pmd_clear(xp) do { set_pmd(xp, __pmd(0)); } while (0) +#ifndef CONFIG_KERNEL_MODE_LINUX #define pmd_bad(x) ((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE) +#else +#define pmd_bad(x) ((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER & ~_PAGE_STICKY)) != _KERNPG_TABLE) +#endif /* * Permanent address of a page. Obviously must never be @@ -279,6 +297,12 @@ static inline int pte_dirty(pte_t pte) { return (pte).pte_low & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return (pte).pte_low & _PAGE_ACCESSED; } static inline int pte_write(pte_t pte) { return (pte).pte_low & _PAGE_RW; } +#ifdef CONFIG_KERNEL_MODE_LINUX +static inline int pte_user(pte_t pte) { return (pte).pte_low & _PAGE_USER; } +static inline int pte_kernel(pte_t pte) { return !pte_user(pte); } +static inline int pmd_sticky(pmd_t pmd) { return (pmd).pmd & _PAGE_STICKY; } +static inline int pgd_sticky(pgd_t pgd) { return (pgd).pgd & _PAGE_STICKY; } +#endif static inline pte_t pte_rdprotect(pte_t pte) { (pte).pte_low &= ~_PAGE_USER; return pte; } static inline pte_t pte_exprotect(pte_t pte) { (pte).pte_low &= ~_PAGE_USER; return pte; } @@ -290,6 +314,10 @@ static inline pte_t pte_mkdirty(pte_t pte) { (pte).pte_low |= _PAGE_DIRTY; return pte; } static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; } static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; } +#ifdef CONFIG_KERNEL_MODE_LINUX +static inline pmd_t pmd_mksticky(pmd_t pmd) { (pmd).pmd |= _PAGE_STICKY; return pmd; } +static inline pgd_t pgd_mksticky(pgd_t pgd) { (pgd).pgd |= _PAGE_STICKY; return pgd; } +#endif static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return 
test_and_clear_bit(_PAGE_BIT_DIRTY, ptep); } static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep); } @@ -357,5 +385,29 @@ #define kern_addr_valid(addr) (1) #define io_remap_page_range remap_page_range + +#ifdef CONFIG_KERNEL_MODE_LINUX +#ifndef __ASSEMBLY__ +static inline int address_exists_in_pgd(pgd_t* pgd, unsigned long address) +{ + pmd_t* pmd; + pte_t* pte; + + if (pgd == NULL || !pgd_present(*pgd)) + return 0; + + pmd = pmd_offset(pgd, address); + if (!pmd_present(*pmd)) + return 0; + + if (pmd_val(*pmd) & (1 << 7)) + return 1; + + pte = pte_offset(pmd, address); + + return pte_present(*pte); +} +#endif /* !__ASSEMBLY__ */ +#endif #endif /* _I386_PGTABLE_H */ diff -urN linux.orig/include/asm-i386/processor.h linux/include/asm-i386/processor.h --- linux.orig/include/asm-i386/processor.h Wed Apr 24 14:07:25 2002 +++ linux/include/asm-i386/processor.h Wed Apr 24 14:08:36 2002 @@ -71,6 +71,9 @@ extern struct cpuinfo_x86 boot_cpu_data; extern struct tss_struct init_tss[NR_CPUS]; +#ifdef CONFIG_KERNEL_MODE_LINUX +extern struct tss_struct dfts[NR_CPUS]; +#endif #ifdef CONFIG_SMP extern struct cpuinfo_x86 cpu_data[]; @@ -265,12 +268,27 @@ /* * User space process size: 3GB (default). */ -#define TASK_SIZE (PAGE_OFFSET) +#ifndef CONFIG_KERNEL_MODE_LINUX +#define TASK_SIZE PAGE_OFFSET +#else +/* XXX : These constants are also defined in "arch/i386/kernel/entry.S" */ +#define TASK_SIZE (PAGE_OFFSET-sizeof(union task_union)) +#define FIX_TASK_START TASK_SIZE +#define FIX_TASK_END (FIX_TASK_START + sizeof(union task_union)) + +#define FIX_KERNEL_STACK_POINTER ((size_t)&(((struct task_struct*)FIX_TASK_START)->thread.esp0)) + +#define __SW_KERNEL_CS (0xffff0000 | __KERNEL_CS) +#endif /* This decides where the kernel will search for a free chunk of vm * space during mmap's. 
*/ +#ifndef CONFIG_KERNEL_MODE_LINUX #define TASK_UNMAPPED_BASE (TASK_SIZE / 3) +#else +#define TASK_UNMAPPED_BASE (PAGE_OFFSET / 3) +#endif /* * Size of io_bitmap in longwords: 32 is ports 0-0x3ff. @@ -409,6 +427,36 @@ {~0, } /* ioperm */ \ } +#ifdef CONFIG_KERNEL_MODE_LINUX + +extern void double_fault_task(void); +extern unsigned long null_for_dft; + +#define INIT_DFT { \ + 0,0, /* back_link, __blh */ \ + 0, /* esp0 */ \ + __KERNEL_DS, 0, /* ss0 */ \ + 0,0,0,0,0,0, /* stack1, stack2 */ \ + 0, /* cr3 */ \ + (unsigned long)double_fault_task, /* eip */ \ + 0, /* eflags */ \ + 0,0,0,0, /* eax,ecx,edx,ebx */ \ + (unsigned long)(&null_for_dft + 1), /* esp */ \ + 0,0,0, /* ebp,esi,edi */ \ + __KERNEL_DS,0, /* es */ \ + __KERNEL_CS,0, /* cs */ \ + __KERNEL_DS,0, /* ss */ \ + __KERNEL_DS,0, /* ds */ \ + __KERNEL_DS,0, /* fs */ \ + __KERNEL_DS,0, /* gs */ \ + __LDT(0),0, /* ldt */ \ + 0, INVALID_IO_BITMAP_OFFSET, /* trace, bitmap */ \ + {~0, } /* ioperm */ \ +} + +#endif + +#ifndef CONFIG_KERNEL_MODE_LINUX #define start_thread(regs, new_eip, new_esp) do { \ __asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0)); \ set_fs(USER_DS); \ @@ -419,6 +467,31 @@ regs->eip = new_eip; \ regs->esp = new_esp; \ } while (0) +#else +#define start_thread(regs, new_eip, new_esp) do { \ + __asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0)); \ + set_fs(USER_DS); \ + regs->xds = __USER_DS; \ + regs->xes = __USER_DS; \ + regs->xss = __USER_DS; \ + regs->xcs = __USER_CS; \ + regs->eip = new_eip; \ + regs->esp = new_esp; \ + regs->xcs &= 0x0000ffff; \ +} while (0) + +#define start_kernel_thread(regs, new_eip, new_esp) do { \ + __asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0)); \ + set_fs(KERNEL_DS); \ + regs->xds = __KERNEL_DS; \ + regs->xes = __KERNEL_DS; \ + regs->xss = __KERNEL_DS; \ + regs->xcs = __KERNEL_CS; \ + regs->eip = new_eip; \ + regs->esp = new_esp; \ + regs->xcs |= 0xffff0000; \ +} while (0) +#endif /* Forward declaration, a strange C thing */ struct task_struct; diff -urN 
linux.orig/include/asm-i386/ptrace.h linux/include/asm-i386/ptrace.h --- linux.orig/include/asm-i386/ptrace.h Wed Apr 24 14:07:25 2002 +++ linux/include/asm-i386/ptrace.h Wed Apr 24 14:08:36 2002 @@ -55,7 +55,11 @@ #define PTRACE_O_TRACESYSGOOD 0x00000001 #ifdef __KERNEL__ +#ifndef CONFIG_KERNEL_MODE_LINUX #define user_mode(regs) ((VM_MASK & (regs)->eflags) || (3 & (regs)->xcs)) +#else +#define user_mode(regs) ((VM_MASK & (regs)->eflags) || (3 & (regs)->xcs) || (0xffff0000 & (regs)->xcs)) +#endif #define instruction_pointer(regs) ((regs)->eip) extern void show_regs(struct pt_regs *); #endif diff -urN linux.orig/include/asm-i386/sigcontext.h linux/include/asm-i386/sigcontext.h --- linux.orig/include/asm-i386/sigcontext.h Wed Apr 24 14:07:25 2002 +++ linux/include/asm-i386/sigcontext.h Wed Apr 24 14:08:36 2002 @@ -70,7 +70,11 @@ unsigned long trapno; unsigned long err; unsigned long eip; +#ifndef CONFIG_KERNEL_MODE_LINUX unsigned short cs, __csh; +#else + unsigned long xcs; +#endif unsigned long eflags; unsigned long esp_at_signal; unsigned short ss, __ssh; diff -urN linux.orig/include/linux/mm.h linux/include/linux/mm.h --- linux.orig/include/linux/mm.h Wed Apr 24 14:07:24 2002 +++ linux/include/linux/mm.h Wed Apr 24 14:08:36 2002 @@ -397,6 +397,9 @@ extern void show_free_areas_node(pg_data_t *pgdat); extern void clear_page_tables(struct mm_struct *, unsigned long, int); +#ifdef CONFIG_KERNEL_MODE_LINUX +extern void clear_user_page_tables(struct mm_struct*, unsigned long, int); +#endif extern int fail_writepage(struct page *); struct page * shmem_nopage(struct vm_area_struct * vma, unsigned long address, int unused); @@ -469,6 +472,9 @@ extern void __insert_vm_struct(struct mm_struct *, struct vm_area_struct *); extern void build_mmap_rb(struct mm_struct *); extern void exit_mmap(struct mm_struct *); +#ifdef CONFIG_KERNEL_MODE_LINUX +extern void exit_user_mmap(struct mm_struct*); +#endif extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned 
long, unsigned long, unsigned long); diff -urN linux.orig/mm/memory.c linux/mm/memory.c --- linux.orig/mm/memory.c Wed Apr 24 14:07:24 2002 +++ linux/mm/memory.c Wed Apr 24 14:08:36 2002 @@ -83,7 +83,7 @@ free_page_and_swap_cache(page); } - +#if !defined(CONFIG_KERNEL_MODE_LINUX) || !defined(__i386__) /* * Note: this doesn't free the actual pages themselves. That * has been handled earlier when unmapping all the memory regions. @@ -103,7 +103,29 @@ pmd_clear(dir); pte_free(pte); } +#else /* CONFIG_KERNEL_MODE_LINUX && __i386__ */ +static inline void free_one_pmd(pmd_t * dir, int stick) +{ + pte_t * pte; + + if (pmd_none(*dir)) + return; + if (pmd_bad(*dir)) { + pmd_ERROR(*dir); + pmd_clear(dir); + return; + } + if (stick && pmd_sticky(*dir)) + return; + + pte = pte_offset(dir, 0); + pmd_clear(dir); + pte_free(pte); +} + +#endif +#if !defined(CONFIG_KERNEL_MODE_LINUX) || !defined(__i386__) static inline void free_one_pgd(pgd_t * dir) { int j; @@ -124,6 +146,33 @@ } pmd_free(pmd); } +#else /* CONFIG_KERNEL_MODE_LINUX && __i386__ */ +static inline void free_one_pgd(pgd_t * dir, int stick) +{ + int j; + pmd_t * pmd; + + if (pgd_none(*dir)) + return; + if (pgd_bad(*dir)) { + pgd_ERROR(*dir); + pgd_clear(dir); + return; + } + pmd = pmd_offset(dir, 0); + for (j = 0; j < PTRS_PER_PMD ; j++) { + prefetchw(pmd+j+(PREFETCH_STRIDE/16)); + free_one_pmd(pmd+j, stick); + } + + if (stick && pgd_sticky(*dir)) + return; + + pgd_clear(dir); + pmd_free(pmd); +} + +#endif /* Low and high watermarks for page table cache. The system should try to have pgt_water[0] <= cache elements <= pgt_water[1] @@ -136,7 +185,7 @@ return do_check_pgt_cache(pgt_cache_water[0], pgt_cache_water[1]); } - +#if !defined(CONFIG_KERNEL_MODE_LINUX) || !defined(__i386__) /* * This function clears all user-level page tables of a process - this * is needed by execve(), so that old pages aren't in the way. 
@@ -156,6 +205,43 @@ /* keep the page table cache within bounds */ check_pgt_cache(); } +#ifdef CONFIG_KERNEL_MODE_LINUX +#define clear_user_page_tables(mm, first, nr) clear_page_tables((mm),(first),(nr)) +#endif + +#else /* CONFIG_KERNEL_MODE_LINUX && __i386__ */ +#include + +static inline void clear_page_tables_common(struct mm_struct *mm, unsigned long first, int nr, int stick) +{ + pgd_t * page_dir = mm->pgd; + + spin_lock(&mm->page_table_lock); + + if (!stick) + clear_sticky_pte(mm); + page_dir += first; + do { + free_one_pgd(page_dir, stick); + page_dir++; + } while (--nr); + spin_unlock(&mm->page_table_lock); + + /* keep the page table cache within bounds */ + check_pgt_cache(); +} + +void clear_page_tables(struct mm_struct* mm, unsigned long first, int nr) +{ + clear_page_tables_common(mm, first, nr, 0); +} + +void clear_user_page_tables(struct mm_struct* mm, unsigned long first, int nr) +{ + clear_page_tables_common(mm, first, nr, 1); +} + +#endif #define PTE_TABLE_MASK ((PTRS_PER_PTE-1) * sizeof(pte_t)) #define PMD_TABLE_MASK ((PTRS_PER_PMD-1) * sizeof(pmd_t)) diff -urN linux.orig/mm/mmap.c linux/mm/mmap.c --- linux.orig/mm/mmap.c Wed Apr 24 14:07:24 2002 +++ linux/mm/mmap.c Wed Apr 24 14:08:36 2002 @@ -884,7 +884,11 @@ start_index = pgd_index(first); end_index = pgd_index(last); if (end_index > start_index) { +#ifndef CONFIG_KERNEL_MODE_LINUX clear_page_tables(mm, start_index, end_index - start_index); +#else + clear_user_page_tables(mm, start_index, end_index - start_index); +#endif flush_tlb_pgtables(mm, first & PGDIR_MASK, last & PGDIR_MASK); } } @@ -1099,6 +1103,7 @@ } } +#ifndef CONFIG_KERNEL_MODE_LINUX /* Release all mmaps. 
*/ void exit_mmap(struct mm_struct * mm) { @@ -1141,6 +1146,60 @@ clear_page_tables(mm, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD); } +#else +static inline void exit_mmap_common(struct mm_struct * mm) +{ + struct vm_area_struct * mpnt; + + release_segments(mm); + spin_lock(&mm->page_table_lock); + mpnt = mm->mmap; + mm->mmap = mm->mmap_cache = NULL; + mm->mm_rb = RB_ROOT; + mm->rss = 0; + spin_unlock(&mm->page_table_lock); + mm->total_vm = 0; + mm->locked_vm = 0; + + flush_cache_mm(mm); + while (mpnt) { + struct vm_area_struct * next = mpnt->vm_next; + unsigned long start = mpnt->vm_start; + unsigned long end = mpnt->vm_end; + unsigned long size = end - start; + + if (mpnt->vm_ops) { + if (mpnt->vm_ops->close) + mpnt->vm_ops->close(mpnt); + } + mm->map_count--; + remove_shared_vm_struct(mpnt); + zap_page_range(mm, start, size); + if (mpnt->vm_file) + fput(mpnt->vm_file); + kmem_cache_free(vm_area_cachep, mpnt); + mpnt = next; + } + flush_tlb_mm(mm); + + /* This is just debugging */ + if (mm->map_count) + BUG(); + +} + +void exit_mmap(struct mm_struct* mm) +{ + exit_mmap_common(mm); + clear_page_tables(mm, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD); +} + +void exit_user_mmap(struct mm_struct* mm) +{ + exit_mmap_common(mm); + clear_user_page_tables(mm, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD); +} +#endif /* Insert vm structure into process list sorted by address * and into the inode's i_mmap ring. If vm_file is non-NULL