MacOS and iOS Internals, Volume II : Kernel Mode

chapter 1

Welcome to the Machine: Hardware

Devices

Mac Model Numbers and Code Names

sysctl hw.model
ioreg -l -f | grep IOPlatformExpertDevice

Processors
“Rosetta”
Processor Code Names
“A”-series chips
“n+n” “p-cores” “e-cores”

Ports

Serial ports
Firewire -> IEEE1394 fwkdp(1)
Thunderbolt -> Intel’s Thunderbolt standard
-> Mini DisplayPort and PCIe
USB usbkdp(1)

ioreg -p IOUSB
system_profiler SPUSBDataType

USB Restricted Mode

iDevice Connectors

30-pin
Lightning

NVRAM

MacOS: GUID Namespaces
*OS: the nvrm namespace
MacOS: The System Management BIOS
-> Intel architectures
*OS: SysCfg

The Device Tree

Dedicated Processors

Common Code: RTKit
The *OS Side: RTBuddy
AOP/AGX

chapter 2

Use the source, Luke: The XNU Codebase

The XNU Source

Kernel Address SANitizer (KASAN)

Compiling the kernel

AvailabilityVersions
DTrace
libplatform
libdispatch (firehose)
Early during startup, this structure is saved by the Platform Expert, and PE* APIs -
specifically, PE_parse_boot_argn or PE_parse_boot_arg_str - can be used to query the
string and retrieve numeric or string arguments.
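
As a rough illustration of what such a lookup involves, here is a user-space sketch that finds a name=value pair in a boot-args style string and parses the value as a number. find_boot_arg_num is a hypothetical helper written for these notes, not the body of PE_parse_boot_argn, and the argument names in the usage note are only sample values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not XNU code): look up "name=value" in a
 * space-separated boot-args string and parse the value as a number.
 * strtoull() with base 0 accepts decimal, 0x-prefixed hex and octal.
 */
static bool
find_boot_arg_num(const char *args, const char *name, uint64_t *out)
{
    assert(args != NULL && name != NULL && out != NULL);
    size_t nlen = strlen(name);
    const char *p = args;

    while (*p != '\0') {
        while (*p == ' ') p++;                 /* skip separators */
        const char *tok = p;
        while (*p != '\0' && *p != ' ') p++;   /* find the end of the token */
        size_t tlen = (size_t)(p - tok);

        /* A match is exactly "<name>=" followed by at least one character. */
        if (tlen > nlen + 1 && strncmp(tok, name, nlen) == 0 && tok[nlen] == '=') {
            char *end = NULL;
            uint64_t v = strtoull(tok + nlen + 1, &end, 0);
            if (end == p) {                    /* the whole value was numeric */
                *out = v;
                return true;
            }
            return false;
        }
    }
    return false;
}
```

Given the string "debug=0x144 -v maxmem=4096", looking up debug yields 0x144 and maxmem yields 4096, while a bare flag such as -v is not matched.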

Kernel Debugging

Kernel Debug Protocol (kdp)

/*
* Well-known UDP port, debugger side.
* FIXME: This is what the 68K guys use, but beats me how they chose it...
*/
#define KDP_REMOTE_PORT 41139 /* pick one and register it */

Don’t panic

PESavePanicInfo

The Panic report

/Library/Logs/DiagnosticReports[/Retired]/Kernel_YYYY-MM-DD-HHMMSS_Hostname.panic

Kernel Core Dumps

kern_dump() kdumpd(8)

Coredump helpers

kern_register_coredump_helper

chapter 3

EXTEND: Kernel Extensions
XNU is no different in this regard: what Windows calls drivers and Linux calls kernel modules, Darwin calls kernel extensions. But the similarities end very quickly, as the
architectural support and design of the extensions are quite different.
kextload/kextutil/kext_logging/kextd/kextlibs

kextstat | grep com.apple.kpi

The Kernel Programming Interface

MACFramework

The kernelcache

/System/Library/PrelinkedKernels
Kernelcache structure
__PRELINK_INFO.__info

Kext Loading: The user mode perspective

Kext Security Requirements
/System/Library/Extensions
/Library/Extensions
Kext code signing

sqlite3 /var/db/SystemPolicyConfiguration/KextPolicy -header "select * from settings"
sqlite3 /var/db/SystemPolicyConfiguration/KextPolicy -header "select * from kext_policy"

Kextd HOST_KEXTD_PORT
MacOS 10.13: logkextloaded
MacOS 10.14: BridgeOS kext_audit
The OSKext* APIs
kext_request
Multikexts

Kext Loading: The kernel perspective

vm_map_copyout
OSKext::load
Darwin 13 -> mac_kext_check_load
kxld Kernel extension loader
-> kxld_link_file()
Unloading a kext

kext metadata management

MacOS 10.15: System Extensions and DriverKit

Darwin 19, however, provides an alternative, allowing developers to create what are,
in effect, user-mode extensions and drivers through two new frameworks:

  • SystemExtensions and DriverKit.
    The idea is not unlike that of Windows’ User Mode Driver Framework (UMDF), in which
    kernel code calls out to some user process in order to perform some operations.
    NECP (Network Extensions model)
    NKE (Network Kernel Extensions)
    Darwin’s port of FUSE (Filesystem in USEr mode)
    Apple classifies Driver Extensions as those extensions which seek to replace (now legacy)
    IOKit drivers, and System Extensions as those covering all other traditional in-kernel functionality,
    such as Network Extensions (for packet filtering, tunneling, etc.) and Endpoint Security Extensions.
    System Extensions -> sysextd
    Driver Extensions
    As with IOKit, developers can use C++, but unlike it, this is a full C++17 compatible runtime, rather than IOKit’s restricted C++.

chapter 4

Some Assembly Required: Kernel Primitives & Paradigms

Data Structures

Queues (Mach)* osfmk/kern/queue.h
struct queue_entry
Linked Lists & Queues (BSD) bsd/sys/queue.h
Tree data structures
Splay trees (self-adjusting binary search trees)
Red-Black trees
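
The BSD list macros are also exposed to user mode through <sys/queue.h>, so the embedded-linkage style can be tried outside the kernel. The following is a minimal sketch; struct job and tailq_demo are illustrative names chosen for these notes and do not appear in XNU.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Illustrative element type; the linkage is embedded in the element itself. */
struct job {
    int id;
    TAILQ_ENTRY(job) link;
};

TAILQ_HEAD(job_list, job);

/*
 * Append jobs 1..3, unlink job 2, and write the remaining ids to out[].
 * Returns the number of ids written.
 */
static size_t
tailq_demo(int *out, size_t cap)
{
    assert(out != NULL);
    struct job_list head = TAILQ_HEAD_INITIALIZER(head);
    struct job jobs[3] = { { .id = 1 }, { .id = 2 }, { .id = 3 } };
    struct job *j;
    size_t n = 0;

    for (size_t i = 0; i < 3; i++)
        TAILQ_INSERT_TAIL(&head, &jobs[i], link);

    /* Removal needs only the element: no walk over the list. */
    TAILQ_REMOVE(&head, &jobs[1], link);

    TAILQ_FOREACH(j, &head, link) {
        if (n < cap)
            out[n++] = j->id;
    }
    return n;
}
```

Because the linkage lives inside each element, no separate node allocation is needed, which is one reason this style suits kernel code.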

Concurrent resource access

Atomic operations
hwlocks
Spinlocks -> busy wait
Read-Write Locks
Mutex locks
Lock Groups
Lock Debugging/Tracing -> /dev/lockstat (legacy) or lockstat provider
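
To make the "busy wait" note concrete, here is a toy user-space spinlock built on C11 atomics. The toy_spin_* names are hypothetical, and XNU's hw_lock and lck_spin primitives are considerably more involved (preemption control, timeouts, debugging hooks); this only shows the spin-until-acquired idea.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space spinlock; not the XNU implementation. */
typedef struct {
    atomic_flag held;
} toy_spinlock_t;

static void
toy_spin_init(toy_spinlock_t *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->held);
}

/* Single attempt: test-and-set returns the previous value, so false means we won. */
static bool
toy_spin_try(toy_spinlock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Busy wait: keep retrying instead of blocking the thread. */
static void
toy_spin_lock(toy_spinlock_t *l)
{
    while (!toy_spin_try(l)) {
        /* spin */
    }
}

static void
toy_spin_unlock(toy_spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Spinning only makes sense when the lock is held for a very short time; otherwise a blocking mutex wastes fewer cycles.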

Per-CPU data

osfmk/machine/cpu_data.h

Processor execution modes

Intel Ring0/Ring3
ARM64 Exception Levels
EL0 is user mode, EL1 is kernel mode, EL2 is reserved for the hypervisor (if any),
and EL3 for secure monitor (if any).

Mode Traversal

Voluntary Traversals
Involuntary traversals
Intel: SYSENTER(Vol)
Intel: IDT(Invol)
ARM: Exception Vectors

Returning to user mode

thread_exception_return()

Context Switching

machine_switch_context()
osfmk/arch/cswitch.s
kernel_bootstrap_thread
osfmk/kern/startup.c

Accessing user mode memory

Unlike kernel memory, which is normally wired (resident), user-space memory
may be swapped out. If that is the case, access will trigger a page fault. [bcopy]
copyin* and copyout
vm_fault()

Memory Access Protections

Intel architectures define Supervisor Mode Access Prevention (SMAP), and ARM (v8.1 and later)
architectures similarly have Privileged Access Never (PAN).

Interrupt Handling

x86_64 uses the Advanced Programmable Interrupt Controller (APIC), which (as of Nehalem) is known as x2APIC.
ARM recognizes two types of interrupts: the regular interrupt requests (IRQ) and “fast
interrupt requests” (FIQ).

  • Enabling/disabling interrupts -> Asynchronous Software Traps
  • Machine Level handling of interrupts
    x86_64 hndl_allintrs -> osfmk/x86_64/idt64.s
    arm exception vectors -> [fleh/sleh]_[irq/fiq]
    sudo powermetrics --samplers interrupts
  • XNU’s Handling of Interrupts
    Intel -> interrupt() osfmk/i386/trap.c
    ARM -> fleh_[irq/fiq] osfmk/arm[64]/locore.s
    -> sleh_[irq/fiq] osfmk/arm64/sleh.c

System call personalities

  • The BSD Personality
    Auditing/KDebug/Arguments/Noreturn
  • The Mach Personality
  • Machine Specific Syscalls
    platform_syscall
  • Hypervisor support (MacOS)

chapter 5

Alone in the Dark: The Boot Process

MacOS: EFI

Basic Concepts
Unlike BIOS, EFI is in some respects a mini operating system.

  • The Boot Services
  • The Runtime Services
    nm /System/Library/Extensions/AppleEFIRuntime.kext/Contents/MacOS/AppleEFIRuntime -UCgj
    EFI Protocols
    -> Clover bootloader
    The EFI System Partition
    Software capsules
    EFI Binaries
    As Microsoft owned the dominant platform at the time, it made sense to choose
    Windows Portable Executable (PE) as the binary format.
    MacOS’s boot.efi
    MacOS’s boot.efi is a rare bird: a PE32+ binary among all the other Mach-Os.
    file /usr/standalone/i386/boot.efi
    Blessed Art Thou
    sudo bless --info --verbose # only Intel Arch, not for M1

*OS: iBoot

MacBook Pro (2018) and later: iBoot + EFI

Secure Boot

Kernel Boot Process

x86_64: _start -> _vstart
ARM64: _start -> start_first_cpu
i386_init() and arm_init()
kernel_early_bootstrap() -> osfmk/kern/startup.c
machine_startup/kernel_bootstrap
kernel_bootstrap_log

// osfmk/arm/start.s
LOAD_ADDR(lr, arm_init)
// osfmk/arm/arm_init.c
__startup_func
void arm_init(boot_args *args)
// osfmk/kern/startup.c
__startup_func
void kernel_startup_bootstrap(void)

kernel_bootstrap_thread()
-> idle_thread_create -> kernel_thread_create
-> idle_thread -> processor_idle -> thread_run
-> thread_invoke -> thread_dispatch(self, thread)/call_continuation

int thread_run(thread_t self, thread_continue_t continuation, void *parameter, thread_t new_thread);
thread_run(processor->idle_thread,
idle_thread, NULL, new_thread);
// LOAD_ADDR(lr, arm_init_cpu)
slave_main(NULL);
// ->
processor_start_thread
thread_block(idle_thread);
-> processor_start

SMP Considerations

man hostinfo
processor_start -> cpu_start -> slave_main
x86_64 i386_init_slave_fast
ARM64 arm_init_cpu()

Kernel Threads

kernel_task -> pid 0

Kernel Shutdown

reboot/halt/shutdown mac_system_check_reboot
reboot_kernel -> host_reboot -> halt_all_cpus/PEHaltRestart

chapter 6

BS’’D: The BSD Layer

A Tour of BSD

NeXT wanted to conform to POSIX as well, which required adding another layer,
on top of Mach, for the POSIX compatible APIs. Rather than implement something
from scratch, the choice was made to adopt the FreeBSD implementation.
FreeBSD 6.0
bsd_init()/bsd_init_kprintf()
throttle/kmeminit/dev_kmem_init
kauth_init/procinit/tty_init/mac_policy_initbsd
ulock_initialize
audit_init
aio_init/pipeinit/sys v ipc locks
pthread_init/select_waitq_init
Memorystatus
sysctl_mib_init
bsd_autoconf
dtrace_postinit
network inits
root filesystem mounted
siginit

Launching launchd(8)

bsd_utaskbootstrap
cloneproc()
bsdinit_task()

Processes

Mach defines tasks, but the BSD layer provides the higher level constructs
that are processes.

  • The struct proc
    Process Control Block(PCB)
  • The kernproc
  • Process lists
    extern struct proclist allproc;         /* List of all processes. */
    extern struct proclist zombproc; /* List of zombie processes. */
  • Process data in user mode
    sysctl kern.proc
    proc_info

(U)Threads

The struct uthread [bsd/sys/user.h]

  • Syscall information
  • Exception information
  • Continuation support select/kevent/wait
  • The wait channel
  • Pointers
  • Flags UT_* flags
  • Signal handling information
  • VFS context
  • Audit record
    #if CONFIG_AUDIT
    struct kaudit_record *uu_ar; /* audit record */
    #endif
  • Throttling info
  • DTrace information
  • Document tombstone information
  • Exit reason
  • Thread name
  • Thread list connectors
    Note that BSD level threads have no identifier which can be globally visible in user mode.
    There is the underlying Mach thread’s ID, but there is no BSD style API to retrieve it.

Pthread shims and callbacks (Darwin 13)

pthread_kext_register pthread.kext

Work Queue threads (the kernel-side portion of GCD)

in-kernel thread-pool
workq_open()/workq_kernreturn() bsd/pthread/pthread_workqueue.c

struct workqueue

Parked thread block on workq_unpark_continue(), a continuation which allows quick resumption.
workq_reqthreads -> workq_pop_idle_thread
workq_add_new_idle_thread -> workq_create_threadstack/thread_create_workq_waiting

BSD *sleep and wakeup*

extern int      msleep(void *chan, lck_mtx_t *mtx, int pri, const char *wmesg, struct timespec * ts );
extern int msleep0(void *chan, lck_mtx_t *mtx, int pri, const char *wmesg, int timo, int (*continuation)(int));
extern void wakeup(void *chan);
extern void wakeup_one(caddr_t chan);

All the sleep variants require a kernel address, referred to as a wait channel,
which is used as a token to wake up the sleepers.
The chan and wmesg arguments are stored in the uthread’s uu_wchan and uu_wmesg.

Process Lifecycle

fork/vfork/posix_spawn
[__mac_]execve/posix_spawn

  • Image Activation: exec_activate_image()
  • Mach-O Image Activator: exec_mach_imgact
  • Loading Mach-O: load_machfile()
    pmap_create/vm_map_create
  • Parsing Mach-O: parse_machfile()
  • Post Load: exec_mach_imgact()

Process Termination

Exit reasons -> exit_with_reason (Darwin 16)
os_reason bsd/sys/reason.h

Core dumps

kern.coredump

Crash Reports

EXC_CRASH Mach exception
task_exception_notify [osfmk/kern/exception.c]
-> exception_triage -> exception_triage_thread
-> exception_deliver

  • Corpses

File Descriptors

  • The struct filedesc [bsd/sys/filedesc.h]
  • The struct fileproc
    __options_decl(fileproc_flags_t, uint16_t, {
    FP_NONE = 0,
    FP_CLOEXEC = 0x01,
    FP_CLOFORK = 0x02,
    FP_INSELECT = 0x04,
    FP_AIOISSUED = 0x08,
    FP_SELCONFLICT = 0x10, /* select conflict on an individual fp */
    });
  • The struct fileglob

File Types

  • POSIX Shared Memory
    shm_open/mmap
  • KQueues
    XNU supports dynamic kqueues, which are maintained at the filedesc level in the fd_kqhash table.
    struct knote [bsd/sys/event.h]
  • Pipes
    struct pipe [bsd/sys/pipe.h]/[bsd/kern/sys_pipe.c]
    A pipe dies when its read end is closed, in which case the writer gets a SIGPIPE when attempting a write (unless suppressed).
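
The pipe behavior noted above can be observed from user mode with plain POSIX calls. In this sketch SIGPIPE is ignored, so the write to a pipe whose read end has been closed fails with an errno value instead of killing the process. write_to_dead_pipe is a name made up for these notes.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * Close the read end of a fresh pipe, then write to it with SIGPIPE ignored.
 * Returns the errno from write(), 0 if the write unexpectedly succeeded,
 * or -1 if the pipe could not be created.
 */
static int
write_to_dead_pipe(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    /* Suppress the signal so the failure is reported through errno. */
    void (*old)(int) = signal(SIGPIPE, SIG_IGN);

    int rc = close(fds[0]);          /* no readers remain: the pipe is dead */
    assert(rc == 0);

    ssize_t n = write(fds[1], "x", 1);
    int err = (n < 0) ? errno : 0;

    close(fds[1]);
    if (old != SIG_ERR)
        signal(SIGPIPE, old);        /* restore the previous disposition */
    return err;
}
```

On a POSIX system the returned value is expected to be EPIPE; without the SIG_IGN line the default SIGPIPE action would terminate the caller.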

File I/O

open/openat[_nocancel] -> openat_internal [bsd/vfs/vfs_syscalls.c]

int
openat(proc_t p, struct openat_args *uap, int32_t *retval)
{
__pthread_testcancel(1);
return openat_nocancel(p, (struct openat_nocancel_args *)uap, retval);
}

read[_nocancel] -> bsd/kern/sys_generic.c
The struct uio
User mode I/O requests are standardized into struct uio, which represents the metadata
detailing an I/O request.
uio_create/uio_createwithbuffer
Handling uios
readv/writev -> iovec
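
From user mode, the closest visible relative of this request description is the iovec array passed to readv/writev. The sketch below scatters one read across two buffers; scatter_read is a hypothetical helper, and it uses a pipe only to have something to read from.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write msg into a pipe, then scatter-read it with a single readv() call:
 * the first `split` bytes land in a[], the remainder in b[].
 * Returns the total number of bytes read, or -1 on error.
 */
static ssize_t
scatter_read(const char *msg, char *a, size_t split, char *b, size_t blen)
{
    int fds[2];
    size_t len = strlen(msg);

    /* Keep the message small so that writing to the empty pipe cannot block. */
    assert(len <= 512);

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], msg, len) != (ssize_t)len) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Each iovec names one destination buffer; they are filled in order. */
    struct iovec iov[2] = {
        { .iov_base = a, .iov_len = split },
        { .iov_base = b, .iov_len = blen },
    };
    ssize_t n = readv(fds[0], iov, 2);

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Reading "helloworld" with split set to 5 should place "hello" in the first buffer and "world" in the second, with a single system call.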

Asynchronous I/O

POSIX aio* interfaces -> bsd/kern/kern_aio.c
aio_read/write / aio_fsync
-> aio_queue_async_request
aio_max_requests_per_process

BSD Memory Zones

BSD provides the notion of memory zones: Zones are preallocated arrays of objects of an
identical size.
kmzones [bsd/kern/kern_malloc.c]
vm_allocation_site Darwin15
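
The "preallocated array of identically sized objects" idea can be sketched in a few lines. This is a toy pool with a free list threaded through the unused elements; the toy_* names, the element size and the element count are all assumptions made for illustration, and the layout is not that of the kernel's zones.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ZONE_ELEMS 4

/* Every element has the same size; a free element stores the free-list link. */
union toy_elem {
    union toy_elem *next_free;      /* meaningful only while the element is free */
    unsigned char   payload[32];
};

struct toy_zone {
    union toy_elem  elems[TOY_ZONE_ELEMS];   /* the preallocated array */
    union toy_elem *free_head;
};

static void
toy_zone_init(struct toy_zone *z)
{
    z->free_head = NULL;
    for (size_t i = 0; i < TOY_ZONE_ELEMS; i++) {
        z->elems[i].next_free = z->free_head;
        z->free_head = &z->elems[i];
    }
}

/* Pop an element off the free list; returns NULL when the zone is exhausted. */
static void *
toy_zalloc(struct toy_zone *z)
{
    union toy_elem *e = z->free_head;
    if (e != NULL)
        z->free_head = e->next_free;
    return e;
}

/* Push an element back; the most recently freed element is reused first. */
static void
toy_zfree(struct toy_zone *z, void *p)
{
    union toy_elem *e = p;
    assert(e >= &z->elems[0] && e < &z->elems[TOY_ZONE_ELEMS]);
    e->next_free = z->free_head;
    z->free_head = e;
}
```

Allocation and free are both constant time, and since all elements share one size there is no fragmentation inside the zone.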

sysctl

sysctl_register_oid

sysctl net
sysctl net.inet.tcp
sysctl net.inet.tcp.pcbcount
sysctl -X net.inet.tcp.pcblist
sysctl -X net.inet.tcp.pcblist_n
sysctl -X | wc -l

kern/vm/net/debug/hw/machdep/user
sysctlbyname -> name2oid
__DATA.__sysctl_set

DTrace

dtrace_init <- bsd_autoconf
dtrace_cpu_state_changed
Providers -> dtrace_register
dtrace/profile/syscall/mach_trap/lockstat/sdt/fbt
Probes -> dtrace_probe_create
Case Study: The fbt provider
fbt_provide_probe
The function inspects the instruction stream at the address, trying to find the familiar
PUSH RBP in Intel, and an STP FP, LR, .. (the frame pointer and link register) in ARM.

chapter 7

Fee, Fi-fo, File - the Virtual Filesystem Switch

VFS Concepts

  • Filesystems
    nfs/devfs/nullfs/mockfs/routefs [bsd/vfs/vfs_conf.c]
    man lsvfs
  • Mounts /System/Library/FileSystems
    The system maintains all its mounts in the mountlist.
    extern TAILQ_HEAD(mntlist, mount) mountlist;
    f_mntonname (name of mount point) and f_mntfromname (mounted filesystem)
  • vnodes
    A vnode is a representation of a file or special object, independent of the underlying
    filesystem. HFS+ and APFS use the number as a B-Tree node identifier.
  • The ubc_info (V_REG vnodes)
    The Unified Buffer Cache (UBC) is a concept first introduced into NetBSD.
    struct ubc_info
  • Buffers
    struct buf [bsd/sys/buf_internal.h]
  • File System Attributes
    [bsd/vfs/vfs_attrlist.c]

Apple Extensions

  • Resource Forks
    com.apple.ResourceFork
  • File compression
    com.apple.decmpfs
    decmpfs_file_is_compressed
  • Restricted (MacOS)
    com.apple.rootless
  • Data Vault (Darwin 17)
    com.apple.rootless.datavault.controller
  • Data Protection
    com.apple.system.cprotect
  • FSEvents
  • Document IDs
  • Object IDs
  • Disk Conditioning (Darwin 17)
  • Triggers (MacOS)
  • EVFILT_VNODE kevent(2) notifications
  • /dev/vn## (conditional)
  • File Providers
    nspace_resolver_init <- vfsinit
    man fileproviderctl

VFS KPIs

KPI -> Kernel Programming Interface
bsd/vfs/vfs_vnops.c

  • The vfs_context_t
  • Manipulating files in kernel mode
    namei [bsd/vfs/vfs_lookup.c]
    vnode_open [bsd/vfs/vfs_subr.c]
  • Direct File I/O
    kern_open_file_for_direct_io()
  • Vnode lifecycle
    File I/O, however, is very frequent. So sooner or later any limit will be hit,
    but vnodes never get freed - instead, they are recycled.

VFS SPIs

SPI -> Service Provider Interface

  • Registering Filesystems
    vfs_fsadd [bsd/vfs/kpi_vfs.c]
  • VFS operations
    struct vfsops [bsd/sys/mount.h]
  • Vnode operations

Case Studies

The flow of fo_read

  • /dev (devfs)
  • The [b|c]devsw entries
    Block/Char Device
  • specfs nodes
    v_type of VBLK or VCHR
  • The fdesc quasi-filesystem
    /dev/fd /dev/[stdin/stdout/stderr]
    [bsd/miscfs/devfs/devfs_fdesc_support.c]
  • NFS (MacOS)
    /sbin/nfsd
    /usr/libexec/automountd
    /sbin/nfsiod
  • NFS server operations
    nfssvc/getfh/fhopen
  • NFS client operations
    man nfsstat
  • Filesystems in USEr mode (FUSE)
    Because FUSE does require a kernel component, it is not applicable in the *OS variants,
    wherein Apple uses DMG mounts (by registering loop block devices) instead.

chapter 8

Space Oddity: APFS

A Bird’s Eye View of APFS

The APFS partition type is identified by a well-known GUID.
B-Tree [The RootFS Tree/The Extent Tree]

Filesystem Features

  • Full 64-bit filesystem
  • Volume Management
  • Encryption
    MacOS was one of the first operating systems to provide full disk encryption, when Apple
    introduced FileVault in MacOS 10.7.
    apfs_meta_crypto
  • Fast Directory Sizing
    du -> dir size
    APFS provides a significant speed up, by storing the directory usage statistics as
    additional metadata (an APFS_TYPE_DIR_STATS record) for the directory object.
  • Sparse File support
  • Atomic safe-save
    rename[at]x_np
  • File/Directory Cloning
    clonefileat (#462)
  • Copy-on-Write
    This also makes APFS a “flash friendly” filesystem.
    Surprisingly, however, Apple chose not to provide an undelete tool, instead offering
    a different model: snapshots.
  • Snapshots
    fs_snapshot (#518)
    man fs_snapshot_create # macOS 10.13
  • Defragmentation
    Darwin 18
  • Volume Groups and Firm Links (Darwin 19+)
  • Purgeable Files (Darwin 19+)

File System Internals

Unallocated/Used by a file object/Used by APFS itself

  • APFS Objects
  • APFS object structure
  • B-Trees
    The B-trees used by APFS are actually B+ trees, a refinement of classic B-trees that
    restricts values to leaf nodes only. Thus, non-leaf nodes (the root and deeper levels) hold only keys and identifiers of child nodes.
    APFS nodes further have no sibling pointers, which reduces the space needed but slows sequential reading of values: when the end of a node is reached, the next value in its sibling must be located by restarting the search at the root.
  • The B-Tree Node Format
    BTNODE_ROOT (0x1) BTNODE_LEAF (0x2)

Containers & Volumes

  • Volumes
    Each volume maintains three trees - filesystem, snapshot metadata and extent.
  • Filesystem Trees

The Space Manager

  • Chunk Info Blocks (CIBs)
  • CIB Address Blocks (CABs)
  • Reaping Objects

APFS.kext

com.apple.filesystem.apfs
/System/Library/Extensions/apfs.kext
closed source

  • fsctl(2) codes
  • UserClient Methods

chapter 9

Tempus Fugit: Mach Scheduling

The High Level View

Mach Tasks

struct task [osfmk/kern/task.h]

  • The task lock
  • Statistics
  • Priority, maximum priority and importance
  • The vm_map
  • Linkage
  • Threads
  • Task port space
  • Task special ports
  • Task registered ports
  • Task exception ports
  • The Machine task
  • Security and audit tokens
  • Counts
  • Resource usage
  • The corresponding struct proc
  • Corpse information
  • I/O statistics
  • Flags
  • Purgeable VM objects
  • Coalitions
  • Associated hypervisor Virtual Machine (MacOS)
  • Secluded memory
  • External Modification statistics
  • Effective and requested scheduling policies
  • IOUserClients
  • Task watching (*OS) task_watchers

The kernel_task

Mach Threads

For all their size, Mach tasks (like UNIX processes) are merely resource containers.
It is their threads which are the scheduleable entities.
struct thread [osfmk/kern/thread.h]

struct thread_ro {
struct thread *tro_owner;
#if MACH_BSD
struct ucred *tro_cred;
struct proc *tro_proc;
struct proc_ro *tro_proc_ro;
#endif
struct task *tro_task;
thread_ro_flags_t tro_flags;

struct ipc_port *tro_self_port;
struct ipc_port *tro_settable_self_port; /* send right */
struct ipc_port *tro_ports[THREAD_SELF_PORT_COUNT]; /* no right */

struct exception_action *tro_exc_actions;
};
struct thread{
// ......
/* Task membership */
#if __x86_64__ || __arm__
struct task *t_task;
#endif
struct thread_ro *t_tro;
// ......
};
  • Execution State
  • Linkage
  • Wait data
  • Ports
  • Priority
  • Scheduling information
  • Continuation
  • Affinity values
  • Page fault recovery handler
  • Thread call state
  • Guard exception codes
  • Turnstile
  • The BSD uthread object
  • DTrace data
  • Per-thread statistics
  • Ledger details
  • Associated voucher
  • Tag
  • The machine dependent thread object

Thread creation

thread_create[_running]
Threads are normally created suspended, but using the running variant allows the caller to set the initial register state of the thread and immediately schedule it for execution.
thread_create && thread_start
kernel_thread_create/kernel_thread_start[_priority]
machine_thread_create [osfmk/arm64/pcb.c]

Thread termination

thread_terminate [osfmk/kern/thread_act.c]
It then puts the thread into a block, to continue on thread_terminate_continue.
The continuation, however, will never be reached (if it were to be reached, the kernel
would panic).

Processor Management

processor_set_default

Mach Scheduling Enhancements

For IPC to be efficient, the scheduler must be highly effective - as Mach strives to be.

  • Handoff
    Mach supports handoff in addition to the standard yield.
    Note: is this a direct switch rather than a yield?
    thread_handoff_[internal/parameter]
    thread_switch (user mode)
  • Continuations
    A continuation is a function, along with an optional parameter, which is provided as
    an argument to kernel_thread_create(), or to thread_block[_reason].
    struct thread_snapshot -> uint64_t continuation;
    #define ith_continuation    saved.receive.continuation
    #define sth_continuation saved.sema.continuation
    struct uthread -> uu_continuation (BSD layer)
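
The continuation idea described above (a function plus an optional parameter, recorded when a thread blocks) can be mimicked in user space. Everything named toy_* below is hypothetical; in particular, XNU's real thread_continue_t also receives a wait result, which this sketch omits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a continuation: a function and its argument. */
typedef void (*toy_continue_t)(void *param);

struct toy_thread {
    toy_continue_t continuation;   /* where to resume, or NULL */
    void          *parameter;      /* argument handed to the continuation */
};

/* "Block": remember only the function and its argument, not a call stack. */
static void
toy_block(struct toy_thread *t, toy_continue_t fn, void *param)
{
    t->continuation = fn;
    t->parameter = param;
}

/*
 * "Resume": invoke the continuation from a fresh frame.
 * Returns 1 if a continuation ran, 0 if none was pending.
 */
static int
toy_resume(struct toy_thread *t)
{
    toy_continue_t fn = t->continuation;
    if (fn == NULL)
        return 0;
    t->continuation = NULL;
    fn(t->parameter);
    return 1;
}

/* Example continuation: count how many times we were woken. */
static void
toy_count_wakeup(void *param)
{
    assert(param != NULL);
    (*(int *)param)++;
}
```

The point of the design is that nothing from the blocked call chain has to be preserved: all state needed to carry on travels in the parameter.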

Asynchronous Software Traps (AST)

  • Handling ASTs
  • AST reasons [osfmk/kern/ast.h]
    #define AST_SCHEDULING  (AST_PREEMPTION | AST_YIELD | AST_HANDOFF)
    // processor_idle/thread_block_reason call ast_off(AST_SCHEDULING);

Mach Schedulers

[osfmk/kern/sched.h]
Darwin versions before Darwin 17 use multiq, but Darwin 18 shifts to dualq.
*OS 13 variants use a new scheduler called AMP, which takes into account the core type
(performance or efficiency) as well. The kern.sched sysctl(8) MIB will show the currently
active scheduler.
Not noted in this book:
amp -> clutch()

#if CONFIG_SCHED_EDGE
#include <kern/sched_amp_common.h>
#endif /* CONFIG_SCHED_EDGE */
.sched_name = "clutch",
.sched_name = "edge",

sched_clutch.c
clutch/edge

sysctl kern.sched

macOS 12.6
Mac mini (Late 2014)
kern.sched: dualq
Mac mini (M1, 2020)
kern.sched: edge

[osfmk/arm/proc_reg.h]
#if CONFIG_CLUTCH

#if __ARM_AMP__

/* Enable the Edge scheduler for all J129 platforms */
#if XNU_TARGET_OS_OSX
#define CONFIG_SCHED_CLUTCH 1
#define CONFIG_SCHED_EDGE 1
#endif /* XNU_TARGET_OS_OSX */

#else /* __ARM_AMP__ */
#define CONFIG_SCHED_CLUTCH 1
#endif /* __ARM_AMP__ */

#endif /* CONFIG_CLUTCH */

All Mach schedulers “plug in” to the scheduler primitives defined in osfmk/kern/sched_prim.c.

  • The Run Queue
    struct run_queue;
  • Priorities
    Threads are queued in one of the NRQS queues, in FIFO ordering.
    BASEPRI_DEFAULT -> nice(1)
  • Load Average/Mach Factor, and Priority Shifts
    A key metric in any UNIX system is its load average, which is reported by commands such
    as w(1) and uptime (1).
    sysctl vm.loadavg
    The Mach factor can be retrieved using hostinfo(1).
    XNU also calculates a more fine grained scheduler load, which it uses to implement priority shifts.
    update_priority/sched_usage
  • Scheduling buckets and the EWMA
    sched_bucket_t
    XNU’s averaging was further tweaked in Darwin 18 to use an Exponentially Weighted Moving Average
    (EWMA) algorithm.
    [osfmk/kern/sched_average.c]
  • Scheduler dispatch
    [osfmk/kern/sched_prim.c]
    struct sched_dispatch_table;
    • thread_select
    • thread_invoke
    • thread_dispatch
    • quantum_expire
    • update_priority (osfmk/kern/priority.c)
    • sched_maintenance_thread
  • Multicore considerations
    This calls for rebalancing, by moving queued threads from busy processor(s) to the less busy
    one(s), based on the respective run queue lengths.
    sysctl kern | grep kern.sched
    sysctl kern.sched_enable_smt
    sysctl kern.sched_allow_NO_SMT_threads
  • Darwin 17 additions
    • Real time threading support
    • Multi processor support
    • Thread yield checks
  • Darwin 19 additions
    • Run counts
    • Thread buckets
    • Multiple processor set support
  • Effectuating policy changes
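
For reference, a single step of an exponentially weighted moving average, as mentioned under scheduling buckets, can be written with integer arithmetic. This is a generic sketch: toy_ewma_step and its power-of-two weight are assumptions for illustration and are not the constants or code in osfmk/kern/sched_average.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One EWMA step with weight alpha = 1 / 2^shift:
 *     avg' = avg + (sample - avg) / 2^shift
 * Integer arithmetic is used, so the division truncates toward zero.
 * Inputs are assumed small enough that sample - avg does not overflow.
 */
static int32_t
toy_ewma_step(int32_t avg, int32_t sample, unsigned shift)
{
    assert(shift < 31);
    int32_t delta = sample - avg;
    return avg + delta / (int32_t)(1u << shift);
}
```

A larger shift gives older samples more weight, so the average reacts more slowly to a burst of load; a shift of 0 simply tracks the latest sample.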

Deferred Calls

[osfmk/kern/call_entry.h]

  • Timer calls
    [osfmk/kern/timer_call.h]
    struct timer_call;
  • Timer coalescing
    One of Darwin 13’s most important “under the hood” changes was the introduction of
    Timer Coalescing. When timers fire too frequently, the CPU enjoys fewer idle periods,
    and waking up the CPU can actually take more power than just leaving it on for a slightly
    longer period.
    timer_call_enter_with_leeway
    Note: Windows 8 and later have a similar mechanism in the EX_TIMERS and EX*Timer routines,
    with “No-wake timers”. Linux 2.6.22 and later timer_lists offer TIMER_DEFERRABLE.
    Ref: https://learn.microsoft.com/zh-cn/windows-hardware/drivers/ddi/wdm/ns-wdm-_ext_set_parameters_v0 (LONGLONG NoWakeTolerance;)
  • Scheduling timers
  • Thread calls
  • Servicing thread calls
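
The effect of giving timers a leeway can be illustrated with a small counting exercise: if each timer may fire anywhere in [deadline, deadline + leeway], timers whose windows overlap can share one CPU wakeup. The sketch below is a generic greedy grouping written for these notes; struct toy_timer and toy_coalesced_wakeups are hypothetical and do not describe how timer_call_enter_with_leeway works internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_timer {
    uint64_t deadline;   /* earliest acceptable fire time */
    uint64_t leeway;     /* how much later the caller tolerates */
};

/*
 * Count the wakeups needed for timers that are sorted by deadline.
 * Greedy rule: keep adding timers to the current wakeup for as long as
 * their windows still share at least one instant.
 */
static size_t
toy_coalesced_wakeups(const struct toy_timer *t, size_t n)
{
    size_t wakeups = 0;
    size_t i = 0;

    while (i < n) {
        /* Right edge of the window shared by the current group. */
        uint64_t latest = t[i].deadline + t[i].leeway;
        wakeups++;
        i++;

        while (i < n && t[i].deadline <= latest) {
            assert(t[i].deadline >= t[i - 1].deadline);   /* sorted input expected */
            uint64_t end = t[i].deadline + t[i].leeway;
            if (end < latest)
                latest = end;                             /* shrink to the overlap */
            i++;
        }
    }
    return wakeups;
}
```

Three timers at 100, 105 and 110 need three wakeups with zero leeway, but only one if each tolerates a delay of 10 time units.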

Scheduler assisted synchronization

  • Wait Queues
    Mach follows this pattern as well, with the waitq and waitq_set structures. The waitqs can be
    found embedded in Mach ports, semaphores and (as of Darwin 18) turnstiles, and they also
    support BSD’s select(2) and AIO implementations. The waitq_sets back select(2) as well,
    along with kqueues and Mach ipc_mqueues.
    struct waitq [osfmk/kern/waitq.h]
  • selection callbacks
    waitq_select_
    #pragma mark global wait queues

    static __startup_data struct waitq g_boot_waitq;
    static SECURITY_READ_ONLY_LATE(struct waitq *) global_waitqs = &g_boot_waitq;
    static SECURITY_READ_ONLY_LATE(uint32_t) g_num_waitqs = 1;
  • Ulocks (Darwin 16+)
    __ulock_wait/__ulock_wake
    As the double underscores imply, user mode is not intended to use these system calls
    directly, instead working with libplatform.dylib’s higher level os_unfair_lock_t.
    [bsd/kern/sys_ulock.c]
    sys_ulock_wait (#515) -> ulock_wait
    sys_ulock_wake (#516) -> ulock_wake
  • Turnstiles (Darwin 18+)
    The concept first appeared in Solaris, and was then adopted by FreeBSD, and well explained
    in the BSD bible.
    Theory
    Optimize short term locks and the scheduling of waiters when the locks become available.
    Darwin implementation of Turnstiles.
    turnstiles_init()
    [osfmk/kern/turnstile.h]
    Ref:https://book.douban.com/subject/3666232/
    https://greenteapress.com/wp/semaphores/
    https://blog.csdn.net/booksyhay/article/details/82692362
    [The Little Book of Semaphores, Chapter 3: Basic Synchronization Patterns]
    https://www.likecs.com/show-204583284.html#3.7.6%20%E9%A2%84%E8%A3%85%E6%97%8B%E8%BD%AC%E6%A0%85%E9%97%A8%EF%BC%88Preloaded%20turnstile%EF%BC%89
    typedef enum __attribute__((packed)) turnstile_type {
    TURNSTILE_NONE = 0,
    TURNSTILE_KERNEL_MUTEX = 1,
    TURNSTILE_ULOCK = 2,
    TURNSTILE_PTHREAD_MUTEX = 3,
    TURNSTILE_SYNC_IPC = 4,
    TURNSTILE_WORKLOOPS = 5,
    TURNSTILE_WORKQS = 6,
    TURNSTILE_KNOTE = 7,
    TURNSTILE_SLEEP_INHERITOR = 8,
    TURNSTILE_TOTAL_TYPES = 9,
    } turnstile_type_t;
  • Benefits of Turnstiles
    “thundering herd” problem
    priority inversion
  • KDebug codes
    DBG_TURNSTILE
  • Gates (Darwin 19)

Ledgers

  • ledger (#373)
  • Initialization
  • Maintenance

Selective Forced Idle (SFI)

Darwin 13
The main user-mode client of the SFI facility is thermald.

chapter 10

Mixed Messages: Mach IPC

The High Level View

Mach is, first and foremost, a kernel optimized for message passing.
ipc_space_t

Task ipc_space_t

struct ipc_space [osfmk/ipc/ipc_space.h]
ipc_space_create

The ipc_port

[osfmk/ipc/ipc_port.h]
ipc_port_make_send

  • Case Study: resolving a port name to the underlying object address
    [osfmk/ipc/ipc_object.c]
  • Port lifecycle
  • Port allocation
    [osfmk/ipc/mach_port.c]
  • Rights and Names
  • Reference counting
    mach_msg -> mach_msg_trap
    -> mach_msg_overwrite_trap
    -> ipc_kmsg_send/ipc_mqueue_receive
  • Port deallocation
  • Handling messages
    [osfmk/ipc/ipc_mqueue.h]

mach_msg revisited

  • Sending Mach messages
    ipc_kmsg_send
  • ipc_mqueue_send()
  • Receiving Mach messages
    ipc_mqueue_receive
  • Destroying messages
    [osfmk/ipc/ipc_kmsg.c]
  • Message Descriptors
  • Port right descriptors
    ipc_kmsg_copyin_port_descriptor
  • Port set (OOL ports) descriptors
  • OOL memory descriptors
  • Descriptors as a vehicle for malicious attacks

Vouchers

Darwin 14 [osfmk/ipc/ipc_voucher.h]

  • User-mode API
    host_create_mach_voucher_trap
    mach_voucher_extract_attr_recipe_trap
  • Implementation
    IKOT_VOUCHER

Multinode

  • Multinode requirements
    mach_host_other()
  • FLIPC
    Fast Local InterProcess Communication (FLIPC)
    Mach Node [osfmk/kern/mach_node.c] mach_node_register
    FLIPC [osfmk/ipc/flipc.c]

chapter 11

Mapped out: Mach Memory Management

A Bird’s Eye View

Mach’s Virtual Memory subsystem
vm_map -> virtual memory
pmap -> physical memory

The vm_map Layer

  • The struct _vm_map
    [osfmk/vm/vm_map.h]
    vm_map_create[_options]
  • vm_objects
    [osfmk/vm/vm_object.h]
  • vm_pages
    [osfmk/vm/vm_page.h]
    pmap_startup/pmap_free_pages
    pmap_steal_memory
    vm_page_lookup()
  • User mode interface
    host_virtual_physical_table_info
    vm_mapped_pages_info
    mach_vm_page_info
    vm_page_info_basic <mach/vm_region.h>
  • vm_map_enter and friends
    [mach_]vm_allocate()
    <mach/vm_statistics.h>
  • Allocating memory (highlights)
    VM_PROT_EXECUTE
    com.apple.security.cs.allow-jit
    kernel_memory_allocate
  • The vm_map_copy object
    struct vm_map_copy;
  • Copying memory
    [mach_]vm_copy()

Pagers

Pager types supported in Darwin 18
Vnode/Device/Apple Protect/swapfile
Compressor/4K(4K emulation on 16K)/shared

  • The Pager object
    struct memory_object
  • Pager Lifecycle callbacks
  • The vnode pager
  • The swapfile pager
  • The compressor pager
    [osfmk/vm/WKdm_new.h]
    Wilson & Kaplan
    vm_compressor_algorithms.h
  • Lifecycle
  • The Device Pager
    [osfmk/vm/device_vm.c]
  • The 4K Pager (*OS)
  • The shared region pager (Darwin 18)
  • The Apple Protect pager
    memory encryption
    [osfmk/vm/vm_apple_protect.c]
  • MacOS: Dont Steal Mac OS X.kext
    dsmos_page_transform_hook
    [osfmk/kern/page_decrypt.c]
  • *OS: Fairplay encryption
  • Page Lists (UPLs)
    struct upl [osfmk/vm/vm_pageout.h]
  • Creating UPLs
    ubc_create_upl()
  • Handling UPLs
    [osfmk/mach/upl.defs]

The pmap Layer

[osfmk/vm/pmap.h]
pmap_create

  • Page Tables
    In Intel, the special CR3 register holds the base of the page tables for a given process.
    In ARM architectures, the Translation Table Base Registers (TTBRs) are used instead. ARM64
    provides two TTBRs at EL1, so TTBR0_EL1 is employed for user mode addresses,
    and TTBR1_EL1 for the kernel.
    Page Table Entry (PTE)
    pmap_pte(pmap, va)
  • WIMG
    Write-through, Cache-Inhibition, Memory Coherence and Guarded writes
  • I/O Mappings
    ml_io_map()
  • Intel PTEs
  • The ARM PTEs

chapter 12

Ceci n’est pas une “heap”: Kernel Memory Management
We detail how the kernel manages its own vm_map - the kernel_map - through kmem_alloc*
and kalloc*.

Kernel Memory Allocation

  • The kernel_map
    [osfmk/vm/vm_kern.c]
    VM_MIN_KERNEL_AND_KEXT_ADDRESS
    VM_MAX_KERNEL_AND_KEXT_ADDRESS
  • kmem_alloc() and friends
  • kernel_memory_allocate
    vm_map_find_space
  • kmem_suballoc
  • kmem_realloc
  • kalloc
  • kalloc.### zones
    zalloc_canblock_tag/vm_allocation_site
  • The Kalloc DLUT
    Direct LookUp Table (DLUT)
    [osfmk/kern/kalloc.c]
  • The slow path
  • OSMalloc*
    [libkern/libkern/OSMalloc.h]
    The main advantage of using OSMalloc is its support of memory tags.

The Zone Allocator

like Linux’s Slabs
zalloc() [osfmk/kern/zalloc.c]

man zprint
  • Zone Management
  • The zone_metadata_region
  • The zone metadata
  • Element Free Lists
  • Garbage Collection
    consider_zone_gc()/zone_gc()
    vm_pageout_garbage_collect
  • GC and UAF
    mach_zone_force_gc
  • Battling zone corruption
  • The Guard Mode Zone Allocator (MacOS)
    like libgmalloc(3) (Guard Malloc) in user mode
  • The Zone Cache (Darwin 18+)

Memorystatus (MacOS) and Jetsam (*OS)

  • Purgeable memory
    task_purgable_info
    [mach_]vm_purgable_control
    memory_entry_purgable_control

Kernel Memory Layout

  • The kernel_map regions
  • The Kernel Slide
    Kernel Address Space Layout Randomization (KASLR) [Darwin 12]
    sysctl kern.slide
    vm_kernel_slide
    vm_kernel_addrhash_salt

chapter 13

All in the Family: IOKit
IORegistry IOCatalogue

A High level view of IOKit

  • The IOKit.framework
    IO Master Port
    device_service_create [osfmk/device/device_init.c]
  • IOKit error codes
    [iokit/IOKit/IOReturn.h]
    IOService::stringFromReturn

The IORegistry

IORegistryPlanes

ioreg -l -w 0 -f | grep IORegistryPlanes

IORegistryExplorer.app -> Xcode’s Additional Tools

  • User Mode APIs
  • Iterators
    IOIterator

The IOCatalog (ue)

  • Matching Dictionaries
  • Notifications
    kIO…Notification
    IONotificationPortCreate()

Interlude: Libkern Base Classes

  • OSObject
  • OSMetaClass(Base)
  • APIs
  • DefaultStructors
  • Members, methods and the Fragile Base Class problem
  • Object types
  • OSStrings and OSSymbols
  • OSCollections
  • Serialization
  • XML Serialization
  • Binary Serialization

The Class Menagerie

  • IOKit Built-in classes
  • IORegistryEntry
  • IOService
  • IO*MemoryDescriptor
  • IO*MemoryCursor
  • IOWorkLoop
  • IO*EventSource and IOCommand
  • IOCommandQueue
  • IOKit Families

Driver Life Cycle

  • Driver Matching
    IOKitPersonalities
  • Case Study: VMWare Fusion VMIOPlug
  • Driver activity and the IOWorkLoop
  • Messaging
  • Matching Notifications
  • Interrupt Handling

IOUserClients

  • IOUserClient lifecycle
  • Driver Properties
  • Notifications
    IOConnectSetNotificationPort
  • Mapped Memory
  • External Traps
  • External Methods
  • IOCFPlugInTypes

Darwin 19: DriverKit

  • IOUserServer
  • IORPC

chapter 14

Stacking Up: Kernel Networking

The High Level View

Layer V: Sockets

  • The struct socket
  • Socket Creation
    socreate_internal() [bsd/kern/uipc_socket.c]
  • sockbufs
  • mbufs
    struct mbuf;
    [bsd/sys/mbuf.h]
    XNU also supports an “mbuf watchdog”, toggled through kern.ipc.mb_watchdog.
  • Sockets in kernel mode
    [bsd/kern/kpi_socket.h]
    sock_connectwait
    sock_socket -> socreate

Layer IV: Domains & Protocols

  • Domains
  • Protocols
  • Case Study: PF_SYSTEM sockets
  • SYSPROTO_EVENT
  • SYSPROTO_CONTROL

Layer III: Network Protocols

  • Incoming packets
  • The struct ifnet
  • Interface lifecycle
  • Case Study: The UTUN interface

Network Data Processing

  • Sending Data
  • IPv4/IPv6 packet output
  • DLIL output
  • Receiving data
  • DLIL frame reception
  • IPv4/IPv6 packet input

Firewalling & Filtering mechanisms

  • Socket Filters
  • Content Filters (Darwin 14+)
  • IP filters
  • PF
    BSD PF
    pfctl(8)
  • Interface Filters
  • BPF

Network Extension Control Policies

  • NECP file descriptors
    necp_open() [bsd/net/necp_client.c]
  • NECP Session FDs
    necp_session_open()
  • Policy evaluation