fix(installer): harden disk selection and partitioning phase

The disk phase was the dominant source of incomplete installs. This
pass addresses six concrete failure modes:

1. Live-ISO USB excluded from the disk picker. select_disk previously
   filtered loop|ram|zram|sr but not the device the installer booted
   from; picking it would format the boot media mid-install. New
   detect_live_iso_devices walks /, /iso, /run/initramfs/live,
   /nix/.ro-store, /nix/store and resolves each backing device to its
   parent disk via lsblk -no PKNAME. Override with
   NOMARCHY_INSTALL_ALLOW_ISO_TARGET=1 for the developer case.

2. 10 GiB minimum-capacity preflight. Disko fails late and obscurely
   on undersized media; surface it while the picker is still open.

3. prewipe_target_drive rewritten:
   - Enumerates every active dm-crypt mapping via dmsetup ls and
     closes those whose backing device is on the target drive. The
     old version only knew about the hardcoded names "crypted" /
     "crypted_main" so an aborted multi-disk run or a non-Nomarchy
     install would leave a holder open and silently break the wipe.
   - Drops `|| true` from wipefs / sgdisk / dd. After the LUKS and
     swap teardown above, a real failure means something is still
     holding the device — surface that instead of papering over it.
   - udevadm settle bounded to 30s so a flapping USB can't hang.
   - Post-wipe sanity check: refuse to hand the disk to disko if
     anything is still mounted off it.

4. run_disko_with_retry wraps the disko call. On failure, shows the
   last 30 lines of output via gum style and offers Retry /
   View full log / Abort. set -e is suspended for the disko call so
   the exit code can be inspected. The previous bare `disko --mode
   disko` aborted the whole installer with output scrolled past.

5. Sed-templated disko-golden.nix + disko-btrfs-multi.nix pair
   replaced by a single disko-config.nix Nix function of
   { mainDrive, extraDrives ? [] } called via --argstr / --arg.
   Templating Nix via shell-escaped string substitution caused at
   least one production bug (3aadc36 fixed embedded-newline
   escaping); function arguments are the right shape and eliminate
   the entire class of escaping concerns. Single-disk path is
   `extraDrives = []`; multi-disk gets BTRFS `-d single -m raid1`
   plus the additional /dev/mapper/* devices. Hosts that shipped
   /etc/disko-golden.nix now ship /etc/disko-config.nix.

6. EXIT trap added so the tmpfs LUKS key file
   (/dev/shm/nomarchy-luks.key) is removed even if the script aborts
   between key-write and the explicit unset. Replaced the redundant
   `shred -u` with `rm -f`: the file lives on tmpfs, so it is already
   in RAM only.
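
The EXIT-trap behaviour in item 6 can be sketched in isolation. This is
a minimal illustration, not the installer's code: write_key_then_abort
and the placeholder passphrase are hypothetical, and the key path is
passed in as an argument rather than fixed to /dev/shm.

```shell
#!/usr/bin/env bash
# Hypothetical demo: a key file is created, a later step fails, and the
# EXIT trap removes the file although the explicit cleanup never runs.
write_key_then_abort() {
    local key_path="$1"
    (
        # Create the file with restrictive permissions before writing to it.
        install -m 600 /dev/null "$key_path"
        # Registered before the secret is written, so every exit path is covered.
        trap 'rm -f "$key_path"' EXIT
        printf '%s' "not-a-real-passphrase" > "$key_path"
        # Simulated failure of the step that consumes the key.
        false || exit 1
        # Explicit cleanup; not reached in this demo.
        rm -f "$key_path"
    )
}
```

Calling write_key_then_abort with a scratch path returns 1 and leaves
no file behind, because the trap runs when the subshell exits.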

Verification: bash -n on install.sh, nix-instantiate parse + strict
eval on disko-config.nix in both single and multi shapes, full
nix flake check --no-build evaluating all three NixOS configurations
(default, nomarchy-installer, nomarchy-live) plus the installerVm.
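
The checks above could be re-run with something like the following
sketch. verify_disk_phase is a hypothetical helper, the file paths are
passed in because the repository layout is assumed, and /dev/sda and
/dev/sdb are placeholder values that are only evaluated, never touched.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the verification steps in the commit message.
verify_disk_phase() {
    local install_sh="$1" disko_nix="$2"

    # Syntax-check the installer script without executing it.
    bash -n "$install_sh" || return 1

    # Parse-only check of the Nix file.
    nix-instantiate --parse "$disko_nix" >/dev/null || return 1

    # Strict evaluation, single-disk shape (extraDrives falls back to []).
    nix-instantiate --eval --strict "$disko_nix" \
        --argstr mainDrive /dev/sda >/dev/null || return 1

    # Strict evaluation, multi-disk shape.
    nix-instantiate --eval --strict "$disko_nix" \
        --argstr mainDrive /dev/sda \
        --arg extraDrives '[ "/dev/sdb" ]' >/dev/null || return 1

    # Evaluate all flake outputs without building them; run from the repo root.
    nix flake check --no-build
}
```

Example call, assuming the paths used in the diff:
verify_disk_phase install.sh installer/disko-config.nix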

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bernardo Magri
2026-04-30 19:42:00 +01:00
parent 386da51178
commit f318585dc4
9 changed files with 317 additions and 254 deletions


@@ -233,6 +233,38 @@ check_environment() {
# STEP 2: DISK SELECTION
# ============================================================================
# Resolve the block device(s) backing the running live ISO so the disk
# picker can hide them. Picking the live USB by mistake destroys the
# installer's own boot media mid-run — always the worst-case outcome.
# We walk the live-ISO mountpoints (NixOS live ISO uses /iso for the
# squashfs source plus an overlay at /), resolve each to its parent
# disk via `lsblk -no PKNAME`, and emit a deduped list of /dev/<disk>
# entries on stdout. Nothing emitted = no live-ISO devices detected
# (e.g. running the installer from a regular shell during development).
detect_live_iso_devices() {
local seen=" "
local mp src parent
for mp in / /iso /run/initramfs/live /nix/.ro-store /nix/store; do
src=$(findmnt -no SOURCE "$mp" 2>/dev/null) || continue
[[ "$src" == /dev/* ]] || continue
parent=$(lsblk -no PKNAME "$src" 2>/dev/null | head -n1)
if [[ -n "$parent" ]]; then
parent="/dev/$parent"
else
parent="$src"
fi
case "$seen" in
*" $parent "*) ;;
*) seen+="$parent "; printf '%s\n' "$parent" ;;
esac
done
}
# Minimum total capacity across all picked drives. 10 GiB is the smallest
# size where the install completes without immediate disk-pressure failures
# (1 GiB ESP + ~5 GiB nix closure + working set).
_MIN_INSTALL_BYTES=$((10 * 1024 * 1024 * 1024))
select_disk() {
section "Disk Selection"
@@ -245,8 +277,30 @@ select_disk() {
# Columns: NAME, SIZE, TYPE (NVMe/USB/SSD/HDD), VENDOR, MODEL, SERIAL.
# Empty fields render as "--" so column -t can still align them.
local raw rows=""
# Filter out pseudo-devices and the live-ISO boot media. The boot-media
# filter is the important one: without it the user can pick the USB
# they booted from and the installer will format its own boot device
# mid-run. NOMARCHY_INSTALL_ALLOW_ISO_TARGET=1 disables this guard
# for the rare case someone genuinely wants to install onto the same
# device (e.g. a developer testing in a VM without a second disk).
local exclude_re='^(/dev/(loop|ram|zram|sr))'
local live_devices=()
if [[ "${NOMARCHY_INSTALL_ALLOW_ISO_TARGET:-0}" != "1" ]]; then
mapfile -t live_devices < <(detect_live_iso_devices)
local d
for d in "${live_devices[@]}"; do
[[ -n "$d" ]] || continue
# Match the name followed by whitespace or end-of-line: lsblk prints more
# columns after NAME, and /dev/sda must not also match /dev/sdaa.
exclude_re+="|^${d}([[:space:]]|$)"
done
if (( ${#live_devices[@]} > 0 )); then
info "Excluding live-ISO device(s) from picker: ${live_devices[*]}"
fi
fi
raw=$(lsblk -d -n -p -o NAME,SIZE,ROTA,TRAN,VENDOR,MODEL,SERIAL 2>/dev/null \
| grep -vE '^(/dev/(loop|ram|zram|sr))')
| grep -vE "$exclude_re")
while IFS= read -r line; do
if [[ -z "$line" ]]; then continue; fi
@@ -312,6 +366,21 @@ select_disk() {
fi
if [[ "$DRY_RUN" != "true" ]]; then
# Total-capacity preflight. Disko fails late and obscurely on
# undersized media; surface it here while the picker is still open.
local total_bytes=0 sz d
for d in $TARGET_DRIVE; do
sz=$(lsblk -bdno SIZE "$d" 2>/dev/null) || sz=0
total_bytes=$((total_bytes + sz))
done
if (( total_bytes < _MIN_INSTALL_BYTES )); then
local human
human=$(numfmt --to=iec --suffix=B "$total_bytes" 2>/dev/null || echo "${total_bytes} B")
error "Total target capacity is $human; Nomarchy needs at least 10 GiB."
TARGET_DRIVE=""
return 130
fi
echo ""
nrun gum style --foreground 9 --bold "⚠ WARNING: All data on $TARGET_DRIVE will be DESTROYED!"
echo ""
@@ -951,32 +1020,114 @@ prewipe_target_drive() {
info "Pre-wiping $drive (clearing stale signatures)..."
# Tear down anything a prior aborted run left active.
# Tear down anything a prior aborted run left active. Order matters:
# mount holders -> swap -> LUKS mappings -> wipe.
umount -R /mnt 2>/dev/null || true
cryptsetup close crypted 2>/dev/null || true
swapoff -a 2>/dev/null || true
# Enumerate every active dm-crypt mapping and close those whose backing
# device is on this drive. The previous version only knew about the
# hardcoded names "crypted" and "crypted_main"; an aborted multi-disk
# run, a manual experiment, or a non-Nomarchy install would leave a
# mapping with a different name holding the device busy and silently
# break the wipe.
if command -v dmsetup >/dev/null 2>&1; then
local name backing
while read -r name _; do
[[ -n "$name" && "$name" != "No" ]] || continue # "No devices found"
backing=$(cryptsetup status "$name" 2>/dev/null \
| awk '/^[[:space:]]*device:/ { print $2; exit }') || continue
[[ -n "$backing" ]] || continue
if [[ "$backing" == "$drive" || "$backing" == "${drive}"* ]]; then
info " Closing stale LUKS mapping: $name (backed by $backing)"
cryptsetup close "$name"
fi
done < <(dmsetup ls --target crypt 2>/dev/null)
fi
# Wipe partition signatures. No `|| true` — the LUKS/swap teardown
# above should have released every holder; if wipefs still fails the
# device is genuinely busy and we want to surface that, not silently
# paper over it and let disko fail later with a confusing blkid error.
local part
if compgen -G "${drive}*" >/dev/null; then
if compgen -G "${drive}?*" >/dev/null; then
for part in "${drive}"?*; do
[[ -b "$part" ]] || continue
wipefs -af "$part" >/dev/null 2>&1 || true
wipefs -af "$part" >/dev/null
done
fi
wipefs -af "$drive" >/dev/null 2>&1 || true
sgdisk --zap-all "$drive" >/dev/null 2>&1 || true
wipefs -af "$drive" >/dev/null
sgdisk --zap-all "$drive" >/dev/null
# 16 MiB covers LUKS2 binary headers (up to 4 MiB) and the BTRFS first
# superblock (64 KiB) — wipefs alone misses damaged variants of these.
dd if=/dev/zero of="$drive" bs=1M count=16 conv=fsync status=none 2>/dev/null || true
dd if=/dev/zero of="$drive" bs=1M count=16 conv=fsync status=none
partprobe "$drive" 2>/dev/null || true
udevadm settle
# Bound the settle so a flapping USB device can't hang the installer.
udevadm settle --timeout=30 || info "udevadm settle timed out; continuing."
# Sanity check: nothing should still be mounted off this drive after
# the wipe. If something is, refuse to hand the disk to disko.
if lsblk -no MOUNTPOINTS "$drive" 2>/dev/null | grep -qE '\S'; then
error "Drive $drive still has active mountpoints after pre-wipe."
error "Investigate with: lsblk $drive ; mount | grep $drive"
return 1
fi
success "Pre-wipe complete"
}
_LUKS_KEY_PATH="/dev/shm/nomarchy-luks.key"
# Wrap the disko invocation so a failure surfaces the last few lines of
# output and offers Retry / View full log / Abort. set -e is suspended for
# the disko call so we can inspect its exit code; restored on every path.
run_disko_with_retry() {
local main_drive="$1"
local extras_nix="$2"
local disko_file="$NOMARCHY_REPO/installer/disko-config.nix"
local log
log=$(mktemp --suffix=.disko.log)
while true; do
local rc=0
set +e
disko --mode disko \
--argstr mainDrive "$main_drive" \
--arg extraDrives "$extras_nix" \
"$disko_file" 2>&1 | tee "$log"
rc=${PIPESTATUS[0]}
set -e
if [[ $rc -eq 0 ]]; then
rm -f "$log"
return 0
fi
error "disko failed (exit $rc). Last lines of output:"
tail -n 30 "$log" | nrun gum style --foreground 9 --border normal --padding "0 1"
local choice
choice=$(printf 'Retry\nView full log\nAbort\n' \
| nrun gum choose --header "Disk partitioning failed. What now?")
case "$choice" in
Retry)
info "Re-running pre-wipe and retrying disko..."
local d
for d in $TARGET_DRIVE; do prewipe_target_drive "$d"; done
;;
"View full log")
nrun gum pager < "$log" || less -RFX "$log" || cat "$log"
;;
*)
rm -f "$log"
return $rc
;;
esac
done
}
execute_installation() {
if [[ "$DRY_RUN" == "true" ]]; then
execute_dry_run
@@ -991,67 +1142,33 @@ execute_installation() {
prewipe_target_drive "$d"
done
local disko_file tmp_disko
tmp_disko=$(mktemp --suffix=.nix)
# Build the extraDrives Nix-list literal for disko-config.nix. Empty
# list = single-disk path. The list is well-formed by construction
# here (each element is a /dev/* path the user already picked) so
# there's no escaping concern — unlike the previous sed-templated Nix.
local drives=($TARGET_DRIVE)
if [[ ${#drives[@]} -gt 1 ]]; then
disko_file="$NOMARCHY_REPO/installer/disko-btrfs-multi.nix"
local main_drive="${drives[0]}"
local btrfs_devs=""
local additional_disks=""
local main_drive="${drives[0]}"
local extras_nix="[]"
if (( ${#drives[@]} > 1 )); then
extras_nix="["
local i
for (( i=1; i<${#drives[@]}; i++ )); do
local d="${drives[$i]}"
local name="extra_$i"
local luks_name="crypted_$name"
btrfs_devs+=", \"/dev/mapper/$luks_name\""
additional_disks+=" $name = {
type = \"disk\";
device = \"$d\";
content = {
type = \"gpt\";
partitions = {
luks = {
size = \"100%\";
content = {
type = \"luks\";
name = \"$luks_name\";
settings = {
allowDiscards = true;
passwordFile = \"/dev/shm/nomarchy-luks.key\";
};
content = {
type = \"btrfs\";
};
};
};
};
};
};
"
extras_nix+=" \"${drives[$i]}\""
done
# Escape newlines for sed
local escaped_disks
escaped_disks=$(printf '%s\n' "$additional_disks" | sed ':a;N;$!ba;s/\n/\\n/g')
sed "s|@MAIN_DRIVE@|${main_drive}|g; s|@BTRFS_DEVICES@|${btrfs_devs}|g; s|@ADDITIONAL_DISKS@|${escaped_disks}|g" "$disko_file" > "$tmp_disko"
else
disko_file="$NOMARCHY_REPO/installer/disko-golden.nix"
sed "s|@TARGET_DRIVE@|${TARGET_DRIVE}|g" "$disko_file" > "$tmp_disko"
extras_nix+=" ]"
fi
# Provide the LUKS passphrase via tmpfs so the secret never touches a
# spinning disk. /dev/shm is tmpfs on the live ISO. We restrict perms
# to root and shred the file (overwrite) on the way out, even though
# it's already in RAM — defense in depth.
local luks_key="/dev/shm/nomarchy-luks.key"
install -m 600 /dev/null "$luks_key"
printf '%s' "$LUKS_PASSWORD" > "$luks_key"
disko --mode disko "$tmp_disko"
shred -u "$luks_key" 2>/dev/null || rm -f "$luks_key"
# spinning disk. /dev/shm is tmpfs on the live ISO. The EXIT trap
# below guarantees the file is removed even if the script aborts
# between writing the key and the unset below.
install -m 600 /dev/null "$_LUKS_KEY_PATH"
trap 'rm -f "$_LUKS_KEY_PATH" 2>/dev/null || true' EXIT
printf '%s' "$LUKS_PASSWORD" > "$_LUKS_KEY_PATH"
run_disko_with_retry "$main_drive" "$extras_nix" || exit 1
rm -f "$_LUKS_KEY_PATH"
unset LUKS_PASSWORD
success "Disk partitioned"