MacOS and iOS Internals, Volume II: Notes
MacOS and iOS Internals, Volume II : Kernel Mode
chapter 1
Welcome to the Machine: Hardware
Devices
Mac Models Numbers and code names
sysctl hw.model
Processors
“Rosetta”
Processor Code Names
“A”-series chips
“n+n” “p-cores” “e-cores”
Ports
Serial ports
Firewire -> IEEE1394 fwkdp(1)
ThunderBolt -> Intel’s ThunderBolt standard
-> MiniDisplay Port and PCIe
USB usbkdp(1)
ioreg -p IOUSB
USB Restricted Mode
iDevice Connectors
30-pin
Lightning
NVRAM
MacOS: GUID Namespaces
*OS: the nvram namespace
MacOS: The System Management BIOS
-> Intel architectures
*OS: SysCfg
The Device Tree
Dedicated Processors
Common Code: RTKit
The *OS Side: RTBuddy
AOP/AGX
chapter 2
Use the source, Luke: The XNU Codebase
The XNU Source
Kernel Address SANitizer (KASAN)
Compiling the kernel
AvailabilityVersions
DTrace
libplatform
libdispatch (firehose)
Early during startup, this structure is saved by the Platform Expert, and PE* APIs -
specifically, PE_parse_boot_argn or PE_parse_boot_arg_str - can be used to query the
string and retrieve numeric or string arguments.
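To make the lookup concrete, here is a minimal user-space sketch of pulling a numeric argument out of a boot-args style string. The name `parse_boot_arg_num` and its exact matching rules are assumptions made for illustration; this is not XNU's PE_parse_boot_argn, only the general idea of scanning space-separated `name=value` tokens.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: scan a space-separated "name=value" string (in the
 * style of boot-args) for `name` and parse its value as a number.
 * Returns true and stores the value in *out when the argument is found.
 */
bool
parse_boot_arg_num(const char *args, const char *name, long *out)
{
    assert(args != NULL && name != NULL && out != NULL);
    size_t name_len = strlen(name);
    const char *p = args;

    while (*p != '\0') {
        /* Skip separators between arguments. */
        while (*p != '\0' && isspace((unsigned char)*p)) {
            p++;
        }
        /* Find the end of the current token. */
        const char *end = p;
        while (*end != '\0' && !isspace((unsigned char)*end)) {
            end++;
        }
        /* Match "name=" exactly at the start of the token. */
        if ((size_t)(end - p) > name_len &&
            strncmp(p, name, name_len) == 0 && p[name_len] == '=') {
            /* Base 0 accepts decimal, 0x hex and leading-0 octal. */
            *out = strtol(p + name_len + 1, NULL, 0);
            return true;
        }
        p = end;
    }
    return false;
}
```

Calling it with the string `"debug=0x144 -v"` and the name `"debug"` stores 0x144; a flag-style token such as `-v` has no `=` and is simply skipped.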
Kernel Debugging
Kernel Debug Protocol (kdp)
Don’t panic
PESavePanicInfo
The Panic report
/Library/Logs/DiagnosticReports[/Retired]/Kernel_YYYY-MM-DD-HHMMSS_Hostname.panic
Kernel Core Dumps
kern_dump() kdumpd(8)
Coredump helpers
kern_register_coredump_helper
chapter 3
EXTEND: Kernel Extensions
XNU is no different in this regard: what Windows calls drivers and Linux calls kernel modules, Darwin calls kernel extensions. But the similarities end very quickly, as the
architectural support and design of the extensions are quite different.
kextload/kextutil/kext_logging/kextd/kextlibs
kextstat | grep com.apple.kpi
The Kernel Programming Interface
MACFramework
The kernelcache
/System/Library/PrelinkedKernels
Kernelcache structure
__PRELINK_INFO.__info
Kext Loading: The user mode perspective
Kext Security Requirements
/System/Library/Extensions
/Library/Extensions
Kext code signing
sqlite3 /var/db/SystemPolicyConfiguration/KextPolicy -header "select * from settings"
Kextd HOST_KEXTD_PORT
MacOS 13: logkextloaded
MacOS 14: BridgeOS kext_audit
The OSKext* APIs
kext_request
Multikexts
Kext Loading: The kernel perspective
vm_map_copyout
OSKext::load
Darwin 13 -> mac_kext_check_load
kxld Kernel extension loader
-> kxld_link_file()
Unloading a kext
kext metadata management
MacOS 15: System Extensions and DriverKit
Darwin 19, however, provides an alternative - allowing developers to create,
what are in effect, user-mode extensions and drivers, through two new frameworks
- SystemExtensions and DriverKit.
The idea is not unlike that of Windows’ User Mode Driver Framework (UMDF), in which
kernel code calls out to some user process, in order to perform some operations.
NECP (Network Extensions model)
NKE (Network Kernel Extensions)
Darwin’s port of FUSE (Filesystem in USEr mode)
Apple classifies Driver Extensions as those extensions which seek to replace (now legacy)
IOKit drivers, and System Extensions for all other traditionally in-kernel functionality,
such as Network Extensions (for packet filtering, tunneling, etc), and Endpoint Security Extensions.
System Extensions -> sysextd
Driver Extensions
As with IOKit, developers can use C++, but unlike it, this is a full C++17 compatible runtime, rather than IOKit’s restricted C++.
chapter 4
Some Assembly Required: Kernel Primitives & Paradigms
Data Structures
Queues (Mach)* osfmk/kern/queue.h
struct queue_entry
Linked Lists & Queues (BSD) bsd/sys/queue.h
Tree data structures
splay trees (self-adjusting binary search trees)
Red-Black trees
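The BSD queue macros listed above are easiest to understand from a small user-space example. This sketch uses the `TAILQ_*` macros from `<sys/queue.h>`; the `struct item` type and the `item_list_sum` helper are made up for illustration and do not come from XNU.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Illustrative element type: the TAILQ_ENTRY embeds the list linkage. */
struct item {
    int value;
    TAILQ_ENTRY(item) link;
};

/* Declares `struct item_list`, the head of a tail queue of items. */
TAILQ_HEAD(item_list, item);

/* Sum the values of all queued items by walking from head to tail. */
int
item_list_sum(struct item_list *head)
{
    int sum = 0;
    struct item *it;

    assert(head != NULL);
    TAILQ_FOREACH(it, head, link) {
        sum += it->value;
    }
    return sum;
}
```

The linkage lives inside the element itself, so insertion and removal need no allocation. That is why the same pattern shows up throughout the kernel (allproc and mountlist later in these notes are both such lists).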
Concurrent resource access
Atomic operations
hwlocks
Spinlocks -> busy wait
Read-Write Locks
Mutex locks
Lock Groups
Lock Debugging/Tracing -> /dev/lockstat (legacy) or lockstat provider
Per-CPU data
osfmk/machine/cpu_data.h
Processor execution modes
Intel Ring0/Ring3
ARM64 Exception Levels
EL0 is user mode, EL1 is kernel mode, EL2 is reserved for the hypervisor (if any),
and EL3 for secure monitor (if any).
Mode Traversal
Voluntary Traversals
Involuntary traversals
Intel: SYSENTER(Vol)
Intel: IDT(Invol)
ARM: Exception Vectors
Returning to user mode
thread_exception_return()
Context Switching
machine_switch_context()
osfmk/arch/cswitch.s
kernel_bootstrap_thread
osfmk/kern/startup.c
Accessing user mode memory
Unlike kernel memory, which is normally wired (resident), user-space memory
may be swapped out. If that is the case, access will trigger a page fault. [bcopy]
copyin* and copyout
vm_fault()
Memory Access Protections
Intel architectures define Supervisor Mode Access Prevention (SMAP), and ARM (v8.1 and later)
architectures similarly have Privileged Access Never (PAN).
Interrupt Handling
x86_64 uses the Advanced Programmable Interrupt Controller (APIC), which (as of Nehalem) is known as x2APIC.
ARM recognizes two types of interrupts - the regular interrupt requests (IRQ) and “fast
interrupt requests” (FIQ).
- Enabling/disabling interrupts -> Asynchronous Software Traps
- Machine Level handling of interrupts
x86_64 hndl_allintrs -> osfmk/x86_64/idt64.s
arm exception vectors ->[fleh/sleh]_[irq/fiq]
sudo powermetrics --samplers interrupts
- XNU’s Handling of Interrupts
Intel -> interrupt() osfmk/i386/trap.c
ARM -> fleh_[irq/fiq] osfmk/arm[64]/locore.s
-> sleh_[irq/fiq] osfmk/arm64/sleh.c
System call personalities
- The BSD Personality
Auditing/KDebug/Arguments/Noreturn
- The Mach Personality
- Machine Specific Syscalls
platform_syscall
- Hypervisor support (MacOS)
chapter 5
Alone in the Dark: The Boot Process
MacOS: EFI
Basic Concepts
Unlike BIOS, EFI is in some respects a mini operating system.
- The Boot Services
- The Runtime Services
EFI Protocols
nm /System/Library/Extensions/AppleEFIRuntime.kext/Contents/MacOS/AppleEFIRuntime -UCgj
-> Clover bootloader
The EFI System Partition
Software capsules
EFI Binaries
As Microsoft owned the dominant platform at the time, it made sense to choose
Windows Portable Executable (PE) as the binary format.
MacOS’s boot.efi
MacOS’s boot.efi is a rare bird - a PE32+ binary among all the other Mach-Os.
Blessed Art Thou
file /usr/standalone/i386/boot.efi
sudo bless --info --verbose # only Intel Arch, not for M1
*OS: iBoot
MacBook Pro (2018) and later: iBoot + EFI
Secure Boot
Kernel Boot Process
x86_64: _start -> _vstart
ARM64: _start -> start_first_cpu
i386_init() and arm_init()
kernel_early_bootstrap() -> osfmk/kern/startup.c
machine_startup/kernel_bootstrap
kernel_bootstrap_log
// osfmk/arm/start.s
kernel_bootstrap_thread()
-> idle_thread_create -> kernel_thread_create
-> idle_thread -> processor_idle -> thread_run
-> thread_invoke -> thread_dispatch(self, thread)/call_continuation
int thread_run(thread_t self, thread_continue_t continuation, void *parameter, thread_t new_thread);
// LOAD_ADDR(lr, arm_init_cpu)
SMP Considerations
man hostinfo
processor_start -> cpu_start -> slave_main
x86_64 i386_init_slave_fast
ARM64 arm_init_cpu()
Kernel Threads
kernel_task -> pid 0
Kernel Shutdown
reboot/halt/shutdown mac_system_check_reboot
reboot_kernel -> host_reboot -> halt_all_cpus/PEHaltRestart
chapter 6
BS’’D: The BSD Layer
A Tour of BSD
NeXT wanted to conform to it as well, which required adding another layer,
on top of Mach, for the POSIX compatible APIs. Rather than implement something
from scratch, the choice was made to adopt the FreeBSD implementation.
FreeBSD 6.0
bsd_init()/bsd_init_kprintf()
throttle/kmeminit/dev_kmem_init
kauth_init/procinit/tty_init/mac_policy_initbsd
ulock_initialize
audit_init
aio_init/pipeinit/SysV IPC locks
pthread_init/select_waitq_init
Memorystatus
sysctl_mib_init
bsd_autoconf
dtrace_postinit
network inits
root filesystem mounted
siginit
Launching launchd(8)
bsd_utaskbootstrap
cloneproc()
bsdinit_task()
Processes
Mach defines tasks, but the BSD layer provides the higher level constructs
that are processes.
- The struct proc
Process Control Block (PCB)
- The kernproc
- Process lists
extern struct proclist allproc; /* List of all processes. */
extern struct proclist zombproc; /* List of zombie processes. */
- Process data in user mode
sysctl kern.proc
proc_info
(U)Threads
The struct uthread [bsd/sys/user.h]
- Syscall information
- Exception information
- Continuation support select/kevent/wait
- The wait channel
- Pointers
- Flags UT_* flags
- Signal handling information
- VFS context
- Audit record
struct kaudit_record *uu_ar; /* audit record */
- Throttling info
- DTrace information
- Document tombstone information
- Exit reason
- Thread name
- Thread list connectors
Note that BSD level threads have no identifier which can be globally visible in user mode.
There is the underlying Mach thread’s ID, but there is no BSD style API to retrieve it.
Pthread shims and callbacks (Darwin 13)
pthread_kext_register pthread.kext
Work Queue threads (the kernel-side portion of GCD)
in-kernel thread-pool
workq_open()/workq_kernreturn() bsd/pthread/pthread_workqueue.c
struct workqueue
Parked threads block on workq_unpark_continue(), a continuation which allows quick resumption.
workq_reqthreads -> workq_pop_idle_thread
workq_add_new_idle_thread -> workq_create_threadstack/thread_create_workq_waiting
BSD *sleep and wakeup*
extern int msleep(void *chan, lck_mtx_t *mtx, int pri, const char *wmesg, struct timespec *ts);
All the sleep variants require a kernel address, referred to as a wait channel,
which is used as a token to wake up the sleepers.
The chan and wmesg arguments are stored in the uthread’s uu_wchan and uu_wmesg.
Process Lifecycle
fork/vfork/posix_spawn[__mac_]execve/posix_spawn
- Image Activation: exec_activate_image()
- Mach-O Image Activator: exec_mach_imgact
- Loading Mach-O: load_machfile()
pmap_create/vm_map_create
- Parsing Mach-O: parse_machfile()
- Post Load: exec_mach_imgact()
Process Termination
Exit reasons -> exit_with_reason (Darwin 16)
os_reason bsd/sys/reason.h
Core dumps
kern.coredump
Crash Reports
EXC_CRASH Mach exception
task_exception_notify [osfmk/kern/exception.c]
-> exception_triage -> exception_triage_thread
-> exception_deliver
- Corpses
File Descriptors
- The struct filedesc [bsd/sys/filedesc.h]
- The struct fileproc
__options_decl(fileproc_flags_t, uint16_t, {
    FP_NONE = 0,
    FP_CLOEXEC = 0x01,
    FP_CLOFORK = 0x02,
    FP_INSELECT = 0x04,
    FP_AIOISSUED = 0x08,
    FP_SELCONFLICT = 0x10, /* select conflict on an individual fp */
});
- The struct fileglob
File Types
- POSIX Shared Memory
shm_open/mmap
- KQueues
XNU supports dynamic kqueues, which are maintained at the filedesc level in the fd_kqhash table.
struct knote [bsd/sys/event.h]
- Pipes
struct pipe [bsd/sys/pipe.h]/[bsd/kern/sys_pipe.c]
A pipe dies when its read end is closed, in which case the writer gets a SIGPIPE when attempting a write (unless suppressed).
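The SIGPIPE behavior can be observed from user mode with plain POSIX calls. In this sketch the function name `write_to_dead_pipe` is made up for illustration; it ignores SIGPIPE (the "unless suppressed" case), so the write fails with an error instead of the process being signaled.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * Create a pipe, close its read end, then write to it with SIGPIPE ignored.
 * Returns the errno of the failed write (expected: EPIPE), 0 if the write
 * unexpectedly succeeded, or -1 if the pipe could not be set up.
 */
int
write_to_dead_pipe(void)
{
    int fds[2];
    int err = 0;

    if (pipe(fds) != 0) {
        return -1;
    }
    assert(fds[0] >= 0 && fds[1] >= 0);

    /* Ignoring SIGPIPE suppresses the signal; write(2) fails instead. */
    void (*old)(int) = signal(SIGPIPE, SIG_IGN);

    close(fds[0]);                  /* the read end goes away */
    if (write(fds[1], "x", 1) < 0) {
        err = errno;                /* save before any other call */
    }
    close(fds[1]);
    if (old != SIG_ERR) {
        signal(SIGPIPE, old);       /* restore the previous disposition */
    }
    return err;
}
```

Without the `signal(SIGPIPE, SIG_IGN)` line the default disposition applies, and the writer would be terminated by the signal rather than seeing an error return.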
File I/O
open/openat[_nocancel] -> openat_internal [bsd/vfs/vfs_syscalls.c]
read[_nocancel] -> bsd/kern/sys_generic.c
The struct uio
User mode I/O requests are standardized into struct uio, which represents the metadata
detailing an I/O request.
uio_create/uio_createwithbuffer
Handling uios
readv/writev -> iovec
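The user-mode counterpart of a multi-segment uio is the iovec array passed to readv(2)/writev(2). This sketch gathers two buffers into a single writev(2) call and reads the result back; `gather_through_pipe` is an illustrative name, and the pipe is used only as a convenient file descriptor.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Gather two buffers into one writev(2) call on a pipe, then read the bytes
 * back contiguously. Returns the number of bytes read, or -1 on error.
 */
ssize_t
gather_through_pipe(char *out, size_t out_len)
{
    int fds[2];
    char part1[] = "hello, ";
    char part2[] = "uio";
    struct iovec iov[2] = {
        { .iov_base = part1, .iov_len = strlen(part1) },
        { .iov_base = part2, .iov_len = strlen(part2) },
    };

    assert(out != NULL);
    if (pipe(fds) != 0) {
        return -1;
    }
    /* One system call, two source buffers: the kernel walks the vector. */
    ssize_t written = writev(fds[1], iov, 2);
    close(fds[1]);
    ssize_t got = (written < 0) ? -1 : read(fds[0], out, out_len);
    close(fds[0]);
    return got;
}
```

Each iovec is a (base, length) pair. A request with several of them is the kind of scatter/gather metadata that the notes above describe struct uio as carrying.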
Asynchronous I/O
POSIX aio* interfaces -> bsd/kern/kern_aio.c
aio_read/write / aio_fsync
-> aio_queue_async_request
aio_max_requests_per_process
BSD Memory Zones
BSD provides the notion of memory zones: Zones are preallocated arrays of objects of an
identical size.
kmzones [bsd/kern/kern_malloc.c]
vm_allocation_site Darwin15
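The zone idea (a preallocated array of same-sized objects, handed out and returned through a free list) fits in a few lines. This is a toy sketch under simplifying assumptions: fixed capacity, no locking, no growth. All names (`toy_zone`, `toy_zone_alloc`, and so on) are hypothetical and do not reflect the kernel's actual structures.

```c
#include <assert.h>
#include <stddef.h>

#define ZONE_ELEM_SIZE  64   /* every element in the zone has this size */
#define ZONE_ELEM_COUNT 8    /* elements preallocated up front */

/* A free element stores the link to the next free element in itself. */
union zone_elem {
    union zone_elem *next_free;
    unsigned char    payload[ZONE_ELEM_SIZE];
};

struct toy_zone {
    union zone_elem  elems[ZONE_ELEM_COUNT]; /* the preallocated array */
    union zone_elem *free_list;              /* head of the free list */
};

/* Thread every element onto the free list. */
void
toy_zone_init(struct toy_zone *z)
{
    assert(z != NULL);
    z->free_list = NULL;
    for (size_t i = 0; i < ZONE_ELEM_COUNT; i++) {
        z->elems[i].next_free = z->free_list;
        z->free_list = &z->elems[i];
    }
}

/* Pop an element off the free list; NULL when the zone is exhausted. */
void *
toy_zone_alloc(struct toy_zone *z)
{
    union zone_elem *e = z->free_list;
    if (e != NULL) {
        z->free_list = e->next_free;
    }
    return e;
}

/* Push an element back; it becomes the next one handed out. */
void
toy_zone_free(struct toy_zone *z, void *p)
{
    union zone_elem *e = p;
    e->next_free = z->free_list;
    z->free_list = e;
}
```

Because a freed element is reused immediately by the next allocation, this layout also illustrates why use-after-free bugs in zone-style allocators are attractive to attackers.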
sysctl
sysctl_register_oid
sysctl net
kern/vm/net/debug/hw/machdep/user
sysctlbyname -> name2oid
__DATA.__sysctl_set
DTrace
dtrace_init <- bsd_autoconf
dtrace_cpu_state_changed
Providers -> dtrace_register
dtrace/profile/syscall/mach_trap/lockstat/sdt/fbt
Probes -> dtrace_probe_create
Case Study: The fbt provider
fbt_provide_probe
The function inspects the instruction stream at the address, trying to find the familiar
PUSH RBP in Intel, and an STP FP, LR, .. (the frame pointer and link register) in ARM.
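The prologue search can be approximated as a pattern match over the instruction bytes. This sketch is a simplification and not the fbt provider's code: it assumes the one-byte `push %rbp` encoding (0x55) on x86_64, and only the common pre-indexed `stp x29, x30, [sp, #-16]!` form (0xA9BF7BFD) on ARM64. The function names are made up. A plain byte scan on x86 can also match 0x55 inside another instruction, which a real implementation has to guard against.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define X86_PUSH_RBP        0x55u        /* push %rbp */
#define ARM64_STP_FP_LR_PRE 0xA9BF7BFDu  /* stp x29, x30, [sp, #-16]! */

/* Offset of the first PUSH RBP byte in `code`, or -1 if none is found. */
long
find_x86_prologue(const uint8_t *code, size_t len)
{
    assert(code != NULL);
    for (size_t i = 0; i < len; i++) {
        if (code[i] == X86_PUSH_RBP) {
            return (long)i;
        }
    }
    return -1;
}

/* Index of the first matching STP among 32-bit instruction words, or -1. */
long
find_arm64_prologue(const uint32_t *insns, size_t count)
{
    assert(insns != NULL);
    for (size_t i = 0; i < count; i++) {
        if (insns[i] == ARM64_STP_FP_LR_PRE) {
            return (long)i;
        }
    }
    return -1;
}
```

ARM64 instructions are fixed-width 32-bit words, so the ARM scan walks whole instructions, whereas the x86 scan has to work on a variable-length byte stream.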
chapter 7
Fee, Fi-fo, File - the Virtual Filesystem Switch
VFS Concepts
- Filesystems
nfs/devfs/nullfs/mockfs/routefs [bsd/vfs/vfs_conf.c]
man lsvfs
- Mounts /System/Library/FileSystems
The system maintains all its mounts in the mountlist.
f_mntonname (name of mount point) and f_mntfromname (mounted filesystem)
extern TAILQ_HEAD(mntlist, mount) mountlist;
- vnodes
A vnode is a representation of a file or special object, independent of the underlying
filesystem. HFS+ and APFS use the number as a B-Tree node identifier.
- The ubc_info (V_REG vnodes)
The Unified Buffer Cache (UBC) is a concept first introduced into NetBSD.
struct ubc_info
- Buffers
struct buf [bsd/sys/buf_internal.h]
- File System Attributes
[bsd/vfs/vfs_attrlist.c]
Apple Extensions
- Resource Forks
com.apple.ResourceFork
- File compression
com.apple.decmpfs
decmpfs_file_is_compressed
- Restricted (MacOS)
com.apple.rootless
- Data Vault (Darwin 17)
com.apple.rootless.datavault.controller
- Data Protection
com.apple.system.cprotect
- FSEvents
- Document IDs
- Object IDs
- Disk Conditioning (Darwin 17)
- Triggers (MacOS)
- EVFILT_VNODE kevent(2) notifications
- /dev/vn## (conditional)
- File Providers
nspace_resolver_init <- vfsinit
man fileproviderctl
VFS KPIs
KPI -> Kernel Programming Interface
bsd/vfs/vfs_vnops.c
- The vfs_context_t
- Manipulating files in kernel mode
namei [bsd/vfs/vfs_lookup.c]
vnode_open [bsd/vfs/vfs_subr.c]
- Direct File I/O
kern_open_file_for_direct_io()
- Vnode lifecycle
File I/O, however, is very frequent. So sooner or later any limit will be hit,
but vnodes never get freed - instead, they are recycled.
VFS SPIs
SPI -> Service Provider Interface
- Registering Filesystems
vfs_fsadd [bsd/vfs/kpi_vfs.c]
- VFS operations
struct vfsops [bsd/sys/mount.h]
- Vnode operations
Case Studies
The flow of fo_read
- /dev (devfs)
- The [b|c]devsw entries
Block/Char Device
- specfs nodes
v_type of VBLK or VCHR
- The fdesc quasi-filesystem
/dev/fd /dev/[stdin/stdout/stderr]
[bsd/miscfs/devfs/devfs_fdesc_support.c]
- NFS (MacOS)
/sbin/nfsd
/usr/libexec/automountd
/sbin/nfsiod
- NFS server operations
nfssvc/getfh/fhopen
- NFS client operations
man nfsstat
- Filesystems in USEr mode (FUSE)
Because FUSE does require a kernel component, it is not applicable in the *OS variants,
wherein Apple uses DMG mounts (by registering loop block devices) instead.
chapter 8
Space Oddity: APFS
A Bird’s Eye View of APFS
The APFS partition type is identified by a well-known GUID.
B-Tree [The RootFS Tree/The Extent Tree]
Filesystem Features
- Full 64-bit filesystem
- Volume Management
- Encryption
MacOS was one of the first operating systems to provide full disk encryption, when Apple
introduced FileVault in MacOS 10.7.
apfs_meta_crypto
- Fast Directory Sizing
du -> dir size
APFS provides a significant speed up, by storing the directory usage statistics as
additional metadata (an APFS_TYPE_DIR_STATS record) for the directory object.
- Sparse File support
- Atomic safe-save
rename[at]x_np
- File/Directory Cloning
clonefileat (#462)
- Copy-on-Write
This also makes APFS a “flash friendly” filesystem.
Surprisingly, however, Apple chose not to provide an undelete tool, instead offering
a different model: snapshots.
- Snapshots
fs_snapshot (#518)
man fs_snapshot_create # macOS 10.13
- Defragmentation
Darwin 18
- Volume Groups and Firm Links (Darwin 19+)
- Purgeable Files (Darwin 19+)
File System Internals
Unallocated/Used by a file object/Used by APFS itself
- APFS Objects
- APFS object structure
- B-Trees
The B-trees used by APFS are actually B+ trees - a refinement on classic B-trees, which
restricts values to leaf nodes only. Thus, non-leaf nodes (the root and deeper levels) hold only keys and identifiers of child nodes.
APFS nodes further have no sibling pointers, which further reduces the space needed, but impacts sequential reading of values: when the end of a node is reached, the next value in its sibling must be located by starting the search again at the root.
- The B-Tree Node Format
BTNODE_ROOT (0x1) BTNODE_LEAF (0x2)
Containers & Volumes
- Volumes
Each volume maintains three trees - filesystem, snapshot metadata and extent.
- Filesystem Trees
The Space Manager
- Chunk Info Blocks (CIBs)
- CIB Address Blocks (CABs)
- Reaping Objects
APFS.kext
com.apple.filesystem.apfs
/System/Library/Extensions/apfs.kext
closed source
- fsctl(2) codes
- UserClient Methods
chapter 9
Tempus Fugit: Mach Scheduling
The High Level View
Mach Tasks
struct task [osfmk/kern/task.h]
- The task lock
- Statistics
- Priority, maximum priority and importance
- The vm_map
- Linkage
- Threads
- Task port space
- Task special ports
- Task registered ports
- Task exception ports
- The Machine task
- Security and audit tokens
- Counts
- Resource usage
- The corresponding struct proc
- Corpse information
- I/O statistics
- Flags
- Purgeable VM objects
- Coalitions
- Associated hypervisor Virtual Machine (MacOS)
- Secluded memory
- External Modification statistics
- Effective and requested scheduling policies
- IOUserClients
- Task watching (*OS) task_watchers
The kernel_task
Mach Threads
For all their size, Mach tasks (like UNIX processes) are merely resource containers.
It is their threads which are the scheduleable entities.
struct thread [osfmk/kern/thread.h]
struct thread_ro
- Execution State
- Linkage
- Wait data
- Ports
- Priority
- Scheduling information
- Continuation
- Affinity values
- Page fault recovery handler
- Thread call state
- Guard exception codes
- Turnstile
- The BSD uthread object
- DTrace data
- Per-thread statistics
- Ledger details
- Associated voucher
- Tag
- The machine dependent thread object
Thread creation
thread_create[_running]
Threads are normally created suspended, but using the running variant allows the caller to set the initial register state of the thread and immediately schedule it for execution.
thread_create && thread_start
kernel_thread_create/kernel_thread_start[_priority]
machine_thread_create [osfmk/arm64/pcb.c]
Thread termination
thread_terminate [osfmk/kern/thread_act.c]
It then puts the thread into a block, to continue on thread_terminate_continue.
The continuation, however, will never be reached (if it were to be reached, the kernel
would panic).
Processor Management
processor_set_default
Mach Scheduling Enhancements
For IPC to be efficient, the scheduler must be highly effective - as Mach strives to be.
- Handoff
Mach supports handoff in addition to the standard yield.
Note: a direct switch, rather than a yield??
thread_handoff_[internal/parameter]
thread_switch (user mode)
- Continuations
A continuation is a function, along with an optional parameter, which is provided as
an argument to kernel_thread_create(), or to thread_block[_reason].
struct thread_snapshot -> uint64_t continuation;
struct uthread -> uu_continuation (BSD layer)
Asynchronous Software Traps (AST)
- Handling ASTs
- AST reasons [osfmk/kern/ast.h]
// processor_idle/thread_block_reason call ast_off(AST_SCHEDULING);
Mach Schedulers
[osfmk/kern/sched.h]
Darwin versions before Darwin 17 use multiq, but Darwin 18 shifts to dualq. *OS 13
variants use a new scheduler called AMP, which takes into account the core type
(performance or efficiency) as well. The kern.sched sysctl(8) MIB will show the currently
active scheduler.
Not covered in this book:
amp -> clutch()
sched_clutch.c
clutch/edge
sysctl kern.sched
macOS 12.6
Mac mini (Late 2014)
kern.sched: dualq
Mac mini (M1, 2020)
kern.sched: edge
All Mach schedulers “plug in” to the scheduler primitives defined in osfmk/kern/sched_prim.c.
- The Run Queue
struct run_queue;
- Priorities
Threads are queued in one of the NRQS queues, in FIFO ordering.
BASEPRI_DEFAULT -> nice(1)
- Load Average/Mach Factor, and Priority Shifts
A key metric in any UNIX system is its load average, which is reported by commands such
as w(1) and uptime(1). The Mach factor can be retrieved using hostinfo(1).
sysctl vm.loadavg
XNU also calculates a more fine grained scheduler load, which it uses to implement priority shifts.
update_priority/sched_usage
- Scheduling buckets and the EWMA
sched_bucket_t
XNU’s averaging was further tweaked in Darwin 18 to use an Exponentially Weighted Moving Average
(EWMA) algorithm.
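For reference, the textbook EWMA update is a one-liner. This sketch uses floating point and a caller-chosen weight `alpha`; both are assumptions made for readability, not a description of how sched_average.c computes its values.

```c
#include <assert.h>

/*
 * One EWMA step: blend the previous average with a new sample.
 * alpha in (0, 1] is the weight given to the new sample; the older
 * history decays geometrically by (1 - alpha) on every update.
 */
double
ewma_update(double avg, double sample, double alpha)
{
    assert(alpha > 0.0 && alpha <= 1.0);
    return alpha * sample + (1.0 - alpha) * avg;
}
```

With alpha = 0.5 and a constant load of 8, an average starting at 0 moves to 4, then 6, then 7: it converges on the load while smoothing out short spikes.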
[osfmk/kern/sched_average.c]
- Scheduler dispatch
[osfmk/kern/sched_prim.c]
struct sched_dispatch_table;
- thread_select
- thread_invoke
- thread_dispatch
- quantum_expire
- update_priority (osfmk/kern/priority.c)
- sched_maintenance_thread
- Multicore considerations
This calls for rebalancing, by moving queued threads from busy processor(s) to the less busy
one(s), based on the respective run queue lengths.
sysctl kern | grep kern.sched
sysctl kern.sched_enable_smt
sysctl kern.sched_allow_NO_SMT_threads
- Darwin 17 additions
- Real time threading support
- Multi processor support
- Thread yield checks
- Darwin 19 additions
- Run counts
- Thread buckets
- Multiple processor set support
- Effectuating policy changes
Deferred Calls
[osfmk/kern/call_entry.h]
- Timer calls
[osfmk/kern/timer_call.h]
typedef struct timer_call;
- Timer coalescing
One of Darwin 13’s most important “under the hood” changes was the introduction of
Timer Coalescing. When timers fire too frequently, the CPU enjoys fewer idle periods - and waking up the CPU can actually take more power than just leaving it on for a slightly
longer period.
timer_call_enter_with_leeway
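The effect of leeway can be shown with a toy model: if each timer tolerates firing anywhere in [deadline, deadline + leeway], timers whose windows overlap can share one wakeup. The `toy_timer` type and the `coalesced_wakeups` function are hypothetical. This is a greedy sketch over timers already sorted by deadline, not the kernel's timer_call logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_timer {
    uint64_t deadline; /* earliest time the timer may fire */
    uint64_t leeway;   /* how much later than the deadline is tolerable */
};

/*
 * Count the wakeups needed to service `n` timers sorted by deadline, when
 * each timer may fire anywhere in [deadline, deadline + leeway].
 */
size_t
coalesced_wakeups(const struct toy_timer *t, size_t n)
{
    size_t wakeups = 0;
    uint64_t window_end = 0; /* latest time that still serves the batch */

    assert(t != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t latest = t[i].deadline + t[i].leeway;
        if (wakeups == 0 || t[i].deadline > window_end) {
            /* Cannot share the pending wakeup: start a new batch. */
            wakeups++;
            window_end = latest;
        } else if (latest < window_end) {
            /* Joins the batch, but tightens how late it may fire. */
            window_end = latest;
        }
    }
    return wakeups;
}
```

Three timers at 100, 105 and 110 need three wakeups with zero leeway, but only one (at time 110) when each carries a leeway of 10.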
Note: Windows 8 and later have a similar mechanism in the EX_TIMER and Ex*Timer routines,
with “No-wake timers”. Linux 2.6.22 and later timer_lists offer TIMER_DEFERRABLE.
Ref: https://learn.microsoft.com/zh-cn/windows-hardware/drivers/ddi/wdm/ns-wdm-_ext_set_parameters_v0 (LONGLONG NoWakeTolerance;)
- Scheduling timers
- Thread calls
- Servicing thread calls
Scheduler assisted synchronization
- Wait Queues
Mach follows this pattern as well, with the waitq and waitq_set structures. The waitqs can be
found embedded in Mach ports, semaphores and (as of Darwin 18) turnstiles, and they also
support BSD’s select(2) and AIO implementations. The waitq_sets back select(2) as well,
along with kqueues and Mach ipc_mqueues.
struct waitq [osfmk/kern/waitq.h]
- selection callbacks
waitq_select_
static __startup_data struct waitq g_boot_waitq;
static SECURITY_READ_ONLY_LATE(struct waitq *) global_waitqs = &g_boot_waitq;
static SECURITY_READ_ONLY_LATE(uint32_t) g_num_waitqs = 1;
- Ulocks (Darwin 16+)
__ulock_wait/__ulock_wake
As the double underscores imply, user mode is not intended to use these system calls
directly, instead working with libplatform.dylib’s higher level os_unfair_lock_t.
[bsd/kern/sys_ulock.c]
sys_ulock_wait (#515) -> ulock_wait
sys_ulock_wake (#516) -> ulock_wake
- Turnstiles (Darwin 18+)
The concept first appeared in Solaris, and was then adopted by FreeBSD, and well explained
in the BSD bible.
Theory
Optimize short term locks and the scheduling of waiters when the locks become available.
Darwin implementation of Turnstiles.
turnstiles_init()
[osfmk/kern/turnstile.h]
Ref:https://book.douban.com/subject/3666232/
https://greenteapress.com/wp/semaphores/
https://blog.csdn.net/booksyhay/article/details/82692362
[The Little Book of Semaphores, Chapter 3: Basic synchronization patterns]
https://www.likecs.com/show-204583284.html#3.7.6%20%E9%A2%84%E8%A3%85%E6%97%8B%E8%BD%AC%E6%A0%85%E9%97%A8%EF%BC%88Preloaded%20turnstile%EF%BC%89
typedef enum __attribute__((packed)) turnstile_type {
TURNSTILE_NONE = 0,
TURNSTILE_KERNEL_MUTEX = 1,
TURNSTILE_ULOCK = 2,
TURNSTILE_PTHREAD_MUTEX = 3,
TURNSTILE_SYNC_IPC = 4,
TURNSTILE_WORKLOOPS = 5,
TURNSTILE_WORKQS = 6,
TURNSTILE_KNOTE = 7,
TURNSTILE_SLEEP_INHERITOR = 8,
TURNSTILE_TOTAL_TYPES = 9,
} turnstile_type_t;
- Benefits of Turnstiles
“thundering herd” problem
priority inversion
- KDebug codes
DBG_TURNSTILE
- Gates (Darwin 19)
Ledgers
- ledger (#373)
- Initialization
- Maintenance
Selective Forced Idle (SFI)
Darwin 13
The main user-mode client of the SFI facility is the thermald.
chapter 10
Mixed Messages: Mach IPC
The High Level View
Mach is, first and foremost, a kernel optimized for message passing.
ipc_space_t
Task ipc_space_t
struct ipc_space [osfmk/ipc/ipc_space.h]
ipc_space_create
The ipc_port
[osfmk/ipc/ipc_port.h]
ipc_port_make_send
- Case Study: resolving a port name to the underlying object address
[osfmk/ipc/ipc_object.c]
- Port lifecycle
- Port allocation
[osfmk/ipc/mach_port.c]
- Rights and Names
- Reference counting
mach_msg -> mach_msg_trap
-> mach_msg_trap
-> ipc_kmsg_send/ipc_mqueue_receive
- Port deallocation
- Handling messages
[osfmk/ipc/ipc_mqueue.h]
mach_msg revisited
- Sending Mach messages
ipc_kmsg_send
- ipc_mqueue_send()
- Receiving Mach messages
ipc_mqueue_receive
- Destroying messages
[osfmk/ipc/ipc_kmsg.c]
- Message Descriptors
- Port right descriptors
ipc_kmsg_copyin_port_descriptor
- Port set (OOL ports) descriptors
- OOL memory descriptors
- Descriptors as a vehicle for malicious attacks
Vouchers
Darwin 14 [osfmk/ipc/ipc_voucher.h]
- User-mode API
host_create_mach_voucher_trap
mach_voucher_extract_attr_recipe_trap
- Implementation
IKOT_VOUCHER
Multinode
- Multinode requirements
mach_host_other()
- FLIPC
Fast Local InterProcess Communication (FLIPC)
Mach Node [osfmk/kern/mach_node.c] mach_node_register
FLIPC [osfmk/ipc/flipc.c]
chapter 11
Mapped out: Mach Memory Management
A Bird’s Eye View
Mach’s Virtual Memory subsystem
vm_map -> virtual memory
pmap -> physical memory
The vm_map Layer
- The struct _vm_map
[osfmk/vm/vm_map.h]
vm_map_create[_options]
- vm_objects
[osfmk/vm/vm_object.h]
- vm_pages
[osfmk/vm/vm_page.h]
pmap_startup/pmap_free_pages
pmap_steal_memory
vm_page_lookup()
- User mode interface
host_virtual_physical_table_info
vm_mapped_pages_info
mach_vm_page_info
vm_page_info_basic <mach/vm_region.h>
- vm_map_enter and friends
[mach_]vm_allocate() <mach/vm_statistics.h>
- Allocating memory (highlights)
VM_PROT_EXECUTE
com.apple.security.cs.allow-jit
kernel_memory_allocate
- The vm_map_copy object
struct vm_map_copy;
- Copying memory
[mach_]vm_copy()
Pagers
Pager types supported in Darwin 18
Vnode/Device/Apple Protect/swapfile
Compressor/4K(4K emulation on 16K)/shared
- The Pager object
struct memory_object
- Pager Lifecycle callbacks
- The vnode pager
- The swapfile pager
- The compressor pager
[osfmk/vm/WKdm_new.h]
Wilson & Kaplan
vm_compressor_algorithms.h
- Lifecycle
- The Device Pager
[osfmk/vm/device_vm.c]
- The 4K Pager (*OS)
- The shared region pager (Darwin 18)
- The Apple Protect pager
memory encryption
[osfmk/vm/vm_apple_protect.c]
- MacOS: Dont Steal Mac OS X.kext
dsmos_page_transform_hook
[osfmk/kern/page_decrypt.c]
*OS: Fairplay encryption
- Page Lists (UPLs)
struct upl [osfmk/vm/vm_pageout.h]
- Creating UPLs
ubc_create_upl()
- Handling UPLs
[osfmk/mach/upl.defs]
The pmap Layer
[osfmk/vm/pmap.h]
pmap_create
- Page Tables
In Intel, the special CR3 register holds the base of the page tables for a given process.
In ARM architectures, the Translation Table Base Registers (TTBRs) are used instead. ARM64
provides TTBRs per exception level; at EL1, TTBR0_EL1 is employed for user mode addresses,
and TTBR1_EL1 for the kernel's.
Page Table Entry (PTE)
pmap_pte(pmap, va)
- WIMG
Write-through, Cache-Inhibition, Memory Coherence and Guarded writes
- I/O Mappings
ml_io_map()
- Intel PTEs
- The ARM PTEs
chapter 12
Ceci n’est pas une “heap”: Kernel Memory Management
We detail how the kernel manages its own vm_map - the kernel_map - through kmem_alloc*
and kalloc*.
Kernel Memory Allocation
- The kernel_map
[osfmk/vm/vm_kern.c]
VM_MIN_KERNEL_AND_KEXT_ADDRESS
VM_MAX_KERNEL_AND_KEXT_ADDRESS
- kmem_alloc() and friends
- kernel_memory_allocate
vm_map_find_space
- kmem_suballoc
- kmem_realloc
- kalloc
- kalloc.### zones
zalloc_canblock_tag/vm_allocation_site
- The Kalloc DLUT
Direct LookUp Table (DLUT)
[osfmk/kern/kalloc.c]
- The slow path
- OSMalloc*
[libkern/libkern/OSMalloc.h]
The main advantage of using OSMalloc is its support of memory tags.
The Zone Allocator
like Linux’s Slabs
zalloc() [osfmk/kern/zalloc.c]
man zprint
- Zone Management
- The zone_metadata_region
- The zone metadata
- Element Free Lists
- Garbage Collection
consider_zone_gc()/zone_gc()
vm_pageout_garbage_collect
- GC and UAF
mach_zone_force_gc
- Battling zone corruption
- The Guard Mode Zone Allocator (MacOS)
like libgmalloc(3) (Guard Malloc) in user mode
- The Zone Cache (Darwin 18+)
Memorystatus (MacOS) and Jetsam (*OS)
- Purgeable memory
task_purgable_info
[mach_]vm_purgable_control
memory_entry_purgable_control
Kernel Memory Layout
- The kernel_map regions
- The Kernel Slide
Kernel Address Space Layout Randomization (KASLR) [Darwin 12]
vm_kernel_slider
sysctl kern.slide
vm_kernel_addrhash_salt
chapter 13
All in the Family: IOKit
IORegistry IOCatalogue
A High level view of IOKit
- The IOKit.framework
IO Master Port
device_service_create [osfmk/device/device_init.c]
- IOKit error codes
[iokit/IOKit/IOReturn.h]
IOService::stringFromReturn
The IORegistry
IORegistryPlanes
ioreg -l -w 0 -f | grep IORegistryPlanes
IORegistryExplorer.app -> XCode’s Additional Tools
- User Mode APIs
- Iterators
IOIterator
The IOCatalog(ue)
- Matching Dictionaries
- Notifications
kIO…Notification
IONotificationPortCreate()
Interlude: Libkern Base Classes
- OSObject
- OSMetaClass(Base)
- APIs
- DefaultStructors
- Members, methods and the Fragile Base Class problem
- Object types
- OSStrings and OSSymbols
- OSCollections
- Serialization
- XML Serialization
- Binary Serialization
The Class Menagerie
- IOKit Built-in classes
- IORegistryEntry
- IOService
IO*MemoryDescriptor
IO*MemoryCursor
- IOWorkLoop
IO*EventSource and IOCommand
- IOCommandQueue
- IOKit Families
Driver Life Cycle
- Driver Matching
IOKitPersonalities
- Case Study: VMWare Fusion VMIOPlug
- Driver activity and the IOWorkLoop
- Messaging
- Matching Notifications
- Interrupt Handling
IOUserClients
- IOUserClient lifecycle
- Driver Properties
- Notifications
IOConnectSetNotificationPort
- Mapped Memory
- External Traps
- External Methods
- IOCFPlugInTypes
Darwin 19: DriverKit
- IOUserServer
- IORPC
chapter 14
Stacking Up: Kernel Networking
The High Level View
Layer V: Sockets
- The struct socket
- Socket Creation
socreate_internal() [bsd/kern/uipc_socket.c]
- sockbufs
- mbufs
struct mbuf;
[bsd/sys/mbuf.h]
XNU also supports an “mbuf watchdog”, toggled through kern.ipc.mb_watchdog.
- Sockets in kernel mode
[bsd/kern/kpi_socket.h]
sock_connectwait
sock_socket -> socreate
Layer IV: Domains & Protocols
- Domains
- Protocols
- Case Study: PF_SYSTEM sockets
- SYSPROTO_EVENT
- SYSPROTO_CONTROL
Layer III: Network Protocols
- Incoming packets
Layer II: Interfaces
- The Data Link Interface Layer
- The struct ifnet
- Interface lifecycle
- Case Study: The UTUN interface
Network Data Processing
- Sending Data
- IPv4/IPv6 packet output
- DLIL output
- Receiving data
- DLIL frame reception
- IPv4/IPv6 packet input
Firewalling & Filtering mechanisms
- Socket Filters
- Content Filters (Darwin 14+)
- IP filters
- PF
BSD PF
pfctl(8)
- Interface Filters
- BPF
Network Extension Control Policies
- NECP file descriptors
necp_open() [bsd/net/necp_client.c]
- NECP Session FDs
necp_session_open()
- Policy evaluation