A gentle introduction to making and mounting Linux filesystems

For some time in my life as a Linux user, I felt confused about the meaning of "mounting" filesystems. I copied and pasted the commands all right, but it always felt somewhat arbitrary. If you feel this way, or are curious to learn more about playing with devices, filesystems, and mounts, then this article may be of interest to you.

When using a Linux system, the main abstraction we interact with is a single unified hierarchy of files growing tree-like from the root, /. What this hides is that there can be multiple storage devices managed by multiple filesystems. The kernel associates "mount points" in the hierarchy with the filesystems mounted there, and this indirection decouples the contents of a filesystem on a device from its location in the hierarchy. This significantly increases the flexibility of how we can organize our filesystems. It becomes especially desirable once we factor in things like NFS (Network File System) or NBD (Network Block Device), which can be shared by multiple client systems. If the location a filesystem was accessible at were somehow baked into it, using such systems would be considerably more cumbersome.

Mounting also serves a second, simpler purpose. Fundamentally, filesystems serve to manage block devices: devices that are accessed in fixed-size blocks. A filesystem is an abstraction implemented in the bits of the block device, and as we write to it, it persists as a structure on the device. Whether or not the computer is even turned on, the bit pattern representing the filesystem is present. However, until we mount the filesystem, it is in a latent state. Mounting triggers the kernel to create all sorts of ephemeral state, such as hints about where free space might be, a list of pages to write back, background threads for deferred operations, and much more. Mounting transitions the filesystem to a living state where its files are available for use via filesystem APIs.

Basic Tour

With mounting's dual purposes in mind, let's consider a simple example system with a handful of devices and filesystems from a few different perspectives. This will hopefully shed light on the more practical implications of this model of filesystems. We will modify the state of the devices, mounts and filesystems on the system, while examining how these changes play out in the views we have of the system's state.

the block devices

Suppose that the block devices on a system (a qemu VM on my development machine) currently look like:

$ lsblk
NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
vda  253:0    0   10G  0 disk /

The only device is "vda" and it is mounted at the root of the filesystem. So (setting aside fake filesystems like /proc) every file we interact with is on this device. We can't create a new filesystem on top of this existing one, so to experiment, we will need to create some new devices. Just to show off the options for doing so, we can create a loop device (a fake device backed by a file on some filesystem), and carve it up into two logical volumes with LVM (Logical Volume Management, a userspace system taking advantage of Linux's "device mapper" to create virtual block devices on top of other block devices.)

# create a "slightly bigger than 2G" file to serve as storage for our loop device
$ truncate -s 2052M /tmp/backing-file
# create a loop device using the file
$ losetup -f /tmp/backing-file
# inspect the loop device we created
$ losetup
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE         DIO LOG-SEC
/dev/loop0         0      0         0  0 /tmp/backing-file   0     512
# observe that it appears as a new block device
$ lsblk
NAME  MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0   7:0    0    2G  0 loop
vda   253:0    0   10G  0 disk /
# create an LVM volume group to further cut up this new device
$ vgcreate vg0 /dev/loop0
# create two 1G LVM logical volumes in the volume group
$ lvcreate -L 1G vg0 -n lv0
$ lvcreate -L 1G vg0 -n lv1
$ lsblk
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0       7:0    0    2G  0 loop
├─vg0-lv0 252:0    0    1G  0 lvm
└─vg0-lv1 252:1    0    1G  0 lvm
vda       253:0    0   10G  0 disk /
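
A note on the odd 2052M size: the two 1G logical volumes need 2048 MiB worth of extents, and LVM also keeps a small metadata area at the start of the device. The arithmetic below is a rough sketch, assuming LVM's default 4 MiB extent size and roughly 1 MiB reserved for metadata. Both numbers are assumptions about defaults rather than something measured on this system, and usable_extents is a hypothetical helper.

```shell
# Assumed LVM defaults (not taken from the transcript above).
extent_mib=4      # size of one physical extent
metadata_mib=1    # space reserved at the start of the physical volume

# Print how many whole extents fit on a device of the given size in MiB.
usable_extents() {
    echo $(( ($1 - metadata_mib) / extent_mib ))
}

echo "needed for two 1G volumes: $(( 2 * 1024 / extent_mib ))"   # 512
echo "available on 2048M:        $(usable_extents 2048)"         # 511
echo "available on 2052M:        $(usable_extents 2052)"         # 512
```

Under those assumptions a backing file of exactly 2G would come up one extent short, which is why a little headroom helps.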

the "raw" filesystems

Now that we have some blank devices, we can create filesystems on them.

$ mkfs.btrfs -f /dev/vg0/lv0
btrfs-progs v5.14.1
See http://btrfs.wiki.kernel.org for more information.

Detected a SSD, turning off metadata duplication.  Mkfs with -m dup if you want to force metadata duplication.
Label:              (null)
UUID:               4eb2c574-fdd2-4399-b2ca-f773b489f251
Node size:          16384
Sector size:        4096
Filesystem size:    1.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         single            8.00MiB
  System:           single            4.00MiB
SSD detected:       yes
Zoned device:       no
Incompat features:  extref, skinny-metadata
Runtime features:
Checksum:           crc32c
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1     1.00GiB  /dev/vg0/lv0
$ mkfs.ext4 /dev/vg0/lv1
mke2fs 1.46.4 (18-Aug-2021)
Discarding device blocks: done
Creating filesystem with 262144 4k blocks and 65536 inodes
Filesystem UUID: b11223b2-14dc-4b8b-b9c9-10b2787cfbd0
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376

Allocating group tables: done
Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

We haven't yet mounted these filesystems, so they are currently pretty bare. They contain just the basic structures put on the disk by mkfs. However, they are valid filesystems and we can interact with them directly, just not with any files on them yet. For example, we can see some high level information about the btrfs filesystem on /dev/vg0/lv0:

$ btrfs filesystem show /dev/vg0/lv0
Label: none  uuid: 4eb2c574-fdd2-4399-b2ca-f773b489f251
        Total devices 1 FS bytes used 128.00KiB
        devid    1 size 1.00GiB used 20.00MiB path /dev/mapper/vg0-lv0
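
Because the filesystem is just bytes on the device, we can also recognise it without any filesystem-specific tool. As a rough sketch: ext4 keeps its superblock 1024 bytes into the device, with the two-byte magic number 0xEF53 stored little-endian at offset 56 within it (byte 1080 overall). The helper name ext4_magic is made up for this example, and the demo uses a hand-built stand-in file so that it runs without root; on the system above we would point it at /dev/vg0/lv1 instead.

```shell
# Print the two bytes where ext4 stores its magic number, as hex.
ext4_magic() {
    dd if="$1" bs=1 skip=1080 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n'
    echo
}

# Stand-in "device": 1080 zero bytes followed by the bytes 0x53 0xEF.
img=$(mktemp)
head -c 1080 /dev/zero > "$img"
printf '\123\357' >> "$img"

ext4_magic "$img"    # prints 53ef
rm -f "$img"
```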

mounts

With our filesystems instantiated on the two devices, we can start to play with mounting. First let's look at the currently mounted filesystems by reading the special file /proc/mounts. Another useful tool for visualizing them in a more human-friendly way is findmnt:

# see all mounts, mostly virtual file systems implemented by Linux.
$ cat /proc/mounts
/dev/root / ext4 rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=4062532k,nr_inodes=1015633,mode=755 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sys /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
run /run tmpfs rw,nosuid,nodev,relatime,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
tracefs /sys/kernel/debug/tracing tracefs rw,relatime 0 0
none /sys/fs/cgroup cgroup2 rw,relatime 0 0
$ findmnt
TARGET                          SOURCE   FSTYPE   OPTIONS
/                               /dev/vda ext4     rw,relatime
├─/dev                          devtmpfs devtmpfs rw,relatime,size=4062532k,nr_inodes=1015633,mode=755
│ ├─/dev/pts                    devpts   devpts   rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000
│ └─/dev/shm                    shm      tmpfs    rw,nosuid,nodev,relatime
├─/proc                         proc     proc     rw,nosuid,nodev,noexec,relatime
├─/sys                          sys      sysfs    rw,nosuid,nodev,noexec,relatime
│ ├─/sys/fs/cgroup              none     cgroup2  rw,relatime
│ └─/sys/kernel/debug           debugfs  debugfs  rw,relatime
│   └─/sys/kernel/debug/tracing tracefs  tracefs  rw,relatime
├─/run                          run      tmpfs    rw,nosuid,nodev,relatime,mode=755
└─/tmp                          tmpfs    tmpfs    rw,nosuid,nodev,relatime
$ findmnt --real
TARGET SOURCE   FSTYPE OPTIONS
/      /dev/vda ext4   rw,relatime
$ findmnt -t ext4
TARGET SOURCE   FSTYPE OPTIONS
/      /dev/vda ext4   rw,relatime
$ findmnt -t btrfs
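
Each line of /proc/mounts has six space-separated fields: the source device, the mount point, the filesystem type, the mount options, and two legacy fields that are both 0 here. That makes it easy to slice with standard tools. The sketch below does roughly what findmnt -t does, minus the tree drawing. mounts_of_type is a hypothetical helper, and it is fed a short sample copied from the output above so that the result is predictable.

```shell
# Print "mountpoint <- source" for every mount of the given filesystem type.
# Reads /proc/mounts unless another file in the same format is given.
mounts_of_type() {
    awk -v fstype="$1" '$3 == fstype { print $2 " <- " $1 }' "${2:-/proc/mounts}"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
/dev/root / ext4 rw,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,relatime 0 0
EOF

mounts_of_type ext4 "$sample"     # prints: / <- /dev/root
mounts_of_type btrfs "$sample"    # prints nothing
rm -f "$sample"
```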

Let's mount our two new filesystems and see what happens. A mount roots the files stored by the filesystem at a point in the file hierarchy, called the mount point. So if a filesystem has a directory d containing files f1 and f2, and is mounted at /mnt/m1, then we can see those files with ls /mnt/m1/d. To mount the filesystems, we need to pick a mount point for each, conventionally under /mnt for filesystems other than the root filesystem.

# make the mountpoints (in the ext4 root filesystem)
$ mkdir -p /mnt/m0
$ mkdir -p /mnt/m1
# mount the filesystems
$ sudo mount -o noatime /dev/vg0/lv0 /mnt/m0
$ sudo mount /dev/vg0/lv1 /mnt/m1

And now, the new mounts look like this:

$ cat /proc/mounts
/dev/root / ext4 rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=4062532k,nr_inodes=1015633,mode=755 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sys /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
run /run tmpfs rw,nosuid,nodev,relatime,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
tracefs /sys/kernel/debug/tracing tracefs rw,relatime 0 0
none /sys/fs/cgroup cgroup2 rw,relatime 0 0
/dev/mapper/vg0-lv0 /mnt/m0 btrfs rw,noatime,ssd,space_cache,subvolid=5,subvol=/ 0 0
/dev/mapper/vg0-lv1 /mnt/m1 ext4 rw,relatime 0 0
$ findmnt
TARGET                          SOURCE              FSTYPE   OPTIONS
/                               /dev/vda            ext4     rw,relatime
├─/dev                          devtmpfs            devtmpfs rw,relatime,size=4062532k,nr_inodes=1015633,mode=755
│ ├─/dev/pts                    devpts              devpts   rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000
│ └─/dev/shm                    shm                 tmpfs    rw,nosuid,nodev,relatime
├─/proc                         proc                proc     rw,nosuid,nodev,noexec,relatime
├─/sys                          sys                 sysfs    rw,nosuid,nodev,noexec,relatime
│ ├─/sys/fs/cgroup              none                cgroup2  rw,relatime
│ └─/sys/kernel/debug           debugfs             debugfs  rw,relatime
│   └─/sys/kernel/debug/tracing tracefs             tracefs  rw,relatime
├─/run                          run                 tmpfs    rw,nosuid,nodev,relatime,mode=755
├─/mnt/m1                       /dev/mapper/vg0-lv1 ext4     rw,relatime
├─/mnt/m0                       /dev/mapper/vg0-lv0 btrfs    rw,noatime,ssd,space_cache,subvolid=5,subvol=/
└─/tmp                          tmpfs               tmpfs    rw,nosuid,nodev,relatime
$ findmnt --real
TARGET    SOURCE              FSTYPE OPTIONS
/         /dev/vda            ext4   rw,relatime
├─/mnt/m1 /dev/mapper/vg0-lv1 ext4   rw,relatime
└─/mnt/m0 /dev/mapper/vg0-lv0 btrfs  rw,noatime,ssd,space_cache,subvolid=5,subvol=/
$ findmnt -t ext4
TARGET    SOURCE              FSTYPE OPTIONS
/         /dev/vda            ext4   rw,relatime
└─/mnt/m1 /dev/mapper/vg0-lv1 ext4   rw,relatime
$ findmnt -t btrfs
TARGET  SOURCE              FSTYPE OPTIONS
/mnt/m0 /dev/mapper/vg0-lv0 btrfs  rw,noatime,ssd,space_cache,subvolid=5,subvol=/

the file hierarchy

Finally, let's see how mounting and the file hierarchy interact by writing some actual files on the filesystems, then moving the mounts around.

# nothing in there yet
$ ls /mnt/m0
$ ls /mnt/m1
lost+found
# write 10 directories with 10 files each into both.
$ for fs in $(seq 0 1); do for i in $(seq 0 9); do for j in $(seq 0 9); do mkdir -p /mnt/m$fs/d$i; echo "$fs $i $j" > /mnt/m$fs/d$i/f$j; done; done; done
# read a couple files
$ cat /mnt/m0/d4/f3
0 4 3
$ cat /mnt/m1/d2/f8
1 2 8
# unmount the ext4 fs
$ umount /mnt/m1
# try to read the file, with the mount point gone, Linux will once again look
# in the root filesystem, where this path does not exist.
$ cat /mnt/m1/d2/f8
cat: /mnt/m1/d2/f8: No such file or directory
$ ls /mnt/m1
# mount the ext4 fs in a new spot
$ mkdir /mnt/m0/m1
$ mount /dev/vg0/lv1 /mnt/m0/m1
# the file is back, relative to the new mount point!
$ cat /mnt/m0/m1/d2/f8
1 2 8

Notice how the structure of the filesystem on /dev/vg0/lv1 persisted on the device while the filesystem was unmounted (and inaccessible to normal filesystem APIs), and reappeared at the new location when it was once again mounted.
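
One way to picture what happened: when resolving a path, the kernel ends up in whichever mounted filesystem has the longest mount point that is a prefix of that path. The toy model below captures only that idea (it ignores symlinks, stacked mounts and mount namespaces), and owning_mount is a name invented for this sketch.

```shell
# Read mount points (one per line) on stdin and print the one that "owns"
# the given path: the longest mount point that is a whole-component prefix.
owning_mount() {
    path=$1
    best=
    while IFS= read -r mp; do
        case "$path/" in
            "${mp%/}/"*)
                if [ "${#mp}" -gt "${#best}" ]; then best=$mp; fi ;;
        esac
    done
    printf '%s\n' "$best"
}

# ext4 mounted at /mnt/m1: the path belongs to that mount
printf '%s\n' / /mnt/m0 /mnt/m1 | owning_mount /mnt/m1/d2/f8        # /mnt/m1
# after umount: the same path falls through to the root filesystem
printf '%s\n' / /mnt/m0 | owning_mount /mnt/m1/d2/f8                # /
# after mounting it at /mnt/m0/m1: the nested mount wins over /mnt/m0
printf '%s\n' / /mnt/m0 /mnt/m0/m1 | owning_mount /mnt/m0/m1/d2/f8  # /mnt/m0/m1
```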

Finally, to hammer home the point that the filesystem exists when it isn't mounted, we will find where a file's data lives on the device, unmount the filesystem, and read the data directly off the device:

# (the ext4 filesystem is assumed to be mounted at /mnt/m1 again here)
# list the extents of the file; units are 512 bytes (historical block size)
$ xfs_io -c fiemap /mnt/m1/d0/f0
0: [0..7]: 1314816..1314823
# unmount the filesystem
$ umount /mnt/m1
# can't read it via the filesystem any more
$ cat /mnt/m1/d0/f0
cat: /mnt/m1/d0/f0: No such file or directory
# read the 6 bytes of the file from that offset in the device via the device file
$ dd if=/dev/vg0/lv1 bs=1 count=6 skip=$((1314816 * 512)) 2>/dev/null
1 0 0

A little deeper

With the basics under our belt, we can now consider some less straightforward aspects of the mounting abstraction, and hopefully they will make decent sense in context. The exceptions prove the rule, after all.

interesting failure cases

If we try to mount a device without making a filesystem on it first, the kernel doesn't know what to do with the device, and it fails.

mount: /mnt/m1: wrong fs type, bad option, bad superblock on /dev/mapper/vg0-lv1, missing codepage or helper program, or other error.

If we keep a file in the filesystem busy, perhaps by having it open for reading or writing by this silly application:

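#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
        /* hold the file open so that the filesystem stays busy */
        int fd = open("/mnt/m0/d0/f0", O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* sleep until killed; fd stays open the whole time */
        while (1) {
                sleep(60);
        }
        return 0;
}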

then we are unable to unmount the filesystem until the file is closed, since we can't safely tear down the live state of the filesystem while a userspace process is still using it.

$ ./keep-open &
[1] 4186
$ umount /mnt/m0
umount: /mnt/m0: target is busy.
# so kill the thing keeping it busy
$ kill 4186
# now we can unmount
$ umount /mnt/m0
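
If it is not obvious which process is keeping the filesystem busy, tools such as fuser and lsof can list the culprits. The idea behind them can be sketched by walking /proc ourselves: every open file descriptor of a process appears as a symlink under /proc/PID/fd. The helper name open_under is hypothetical, it only sees processes we have permission to inspect, and it only considers open files (a process whose working directory is on the filesystem also keeps it busy). The demo uses a temporary directory in place of /mnt/m0 so that it runs without root.

```shell
# Print "PID PATH" for every open file descriptor pointing below a directory.
open_under() {
    prefix=${1%/}
    for link in /proc/[0-9]*/fd/*; do
        target=$(readlink "$link" 2>/dev/null) || continue
        case "$target" in
            "$prefix"/*)
                pid=${link#/proc/}
                echo "${pid%%/*} $target" ;;
        esac
    done | sort -u
}

dir=$(cd "$(mktemp -d)" && pwd -P)
echo "0 0 0" > "$dir/f0"
exec 3< "$dir/f0"     # hold the file open on descriptor 3, like keep-open does
open_under "$dir"     # lists this shell, plus any helper processes that inherited descriptor 3
exec 3<&-             # close it again
rm -rf "$dir"
```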

the root filesystem

I intentionally crafted an example where we created some toy filesystems from scratch that we could mount and unmount at will. However, you might wisely be wondering how this works for the root filesystem. We can't exactly run /usr/bin/mount when / is not even mounted yet! This is resolved by the bootloader and kernel cooperating during boot to give the root filesystem special treatment. An extra wrinkle is that most systems now boot with an initrd, so in practice the process is:

1. Boot into the initrd filesystem on a RAM disk block device.
2. The initrd prepares the real root mount at some mount point like "/newroot".
3. The initrd starts the real system by using a special "switch root" operation to change the root to the system's "real" root.
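
Those three steps map fairly directly onto what a minimal initrd /init script does. The following is a heavily simplified sketch rather than any distribution's real script: the device name /dev/vda, the /newroot path and /sbin/init as the final program are assumptions for illustration, and the run wrapper is a made-up helper that only prints the commands unless DRY_RUN is set to 0.

```shell
# Print each command instead of running it unless DRY_RUN=0 (safe by default).
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 0 ]; then "$@"; else echo "$*"; fi
}

initrd_init() {
    # step 1: we are already running from the initrd; set up the kernel's virtual filesystems
    run mount -t proc proc /proc
    run mount -t sysfs sys /sys
    run mount -t devtmpfs dev /dev
    # step 2: mount the real root filesystem at a temporary mount point
    run mkdir -p /newroot
    run mount -o ro /dev/vda /newroot
    # step 3: make /newroot the new / and hand over to the real init
    run exec switch_root /newroot /sbin/init
}

initrd_init
```

Run as-is, this only prints the six commands; nothing is mounted.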

remount

As we saw above, you can't unmount a file system that's actively in use, but luckily, filesystems have a special codepath for "remounting". This is typically done to change the mount options on a live filesystem. Naturally, not every transition is necessarily supported, as it can be difficult to enable a feature on a mounted filesystem with live data structures.

$ < /proc/mounts grep /mnt/m0
/dev/mapper/vg0-lv0 /mnt/m0 btrfs rw,relatime,ssd,space_cache,subvolid=5,subvol=/ 0 0
$ mount -o remount,ro /mnt/m0
$ < /proc/mounts grep /mnt/m0
/dev/mapper/vg0-lv0 /mnt/m0 btrfs ro,relatime,ssd,space_cache,subvolid=5,subvol=/ 0 0
$ mount -o remount,rw /mnt/m0
$ < /proc/mounts grep /mnt/m0
/dev/mapper/vg0-lv0 /mnt/m0 btrfs rw,relatime,ssd,space_cache,subvolid=5,subvol=/ 0 0
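
A quick way to confirm what a remount did is to pick the ro/rw flag out of the fourth field of /proc/mounts. A small sketch, with a hypothetical helper name mount_mode and a sample line copied from the output above:

```shell
# Print "ro" or "rw" for a mount point; fails if the mount point is not listed.
# Reads /proc/mounts unless another file in the same format is given.
mount_mode() {
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opt, ",")
            for (i = 1; i <= n; i++)
                if (opt[i] == "ro" || opt[i] == "rw") mode = opt[i]
        }
        END { if (mode == "") exit 1; print mode }
    ' "${2:-/proc/mounts}"
}

sample=$(mktemp)
echo '/dev/mapper/vg0-lv0 /mnt/m0 btrfs ro,relatime,ssd,space_cache,subvolid=5,subvol=/ 0 0' > "$sample"
mount_mode /mnt/m0 "$sample"                          # prints ro
mount_mode /mnt/m1 "$sample" || echo "not mounted"    # prints not mounted
rm -f "$sample"
```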

bind mounts

We can create additional mounts of the same single live filesystem at various places in the hierarchy. This is called bind mounting.

$ mount /dev/vg0/lv0 /mnt/m0
$ cat /mnt/m0/d4/f3
0 4 3
$ mkdir /mnt/m0-2
$ mount /dev/vg0/lv0 /mnt/m0-2
$ cat /mnt/m0-2/d4/f3
0 4 3
$ touch /mnt/m0-2/foo
$ ls /mnt/m0/foo
/mnt/m0/foo

Additionally, we can do this directly with a file or directory that is already mounted, using mount --bind:

$ mkdir /mnt/d0-2
$ mount --bind /mnt/m0/d0 /mnt/d0-2
$ touch /mnt/m0/d0/foo
$ ls /mnt/d0-2/foo
/mnt/d0-2/foo
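
A bind mount does not copy anything: both paths lead to the same inode on the same filesystem. One hedged way to check this is to compare device and inode numbers, here with GNU stat (the -c format option is GNU-specific). same_file is a hypothetical helper, and since the demo has to run without root it uses a hard link as a stand-in for a second path to the same file; on the system above we would compare /mnt/m0/d0/foo with /mnt/d0-2/foo.

```shell
# Succeed if both paths refer to the same inode on the same device.
same_file() {
    [ "$(stat -c '%d:%i' "$1")" = "$(stat -c '%d:%i' "$2")" ]
}

dir=$(mktemp -d)
echo hello > "$dir/a"
ln "$dir/a" "$dir/b"         # second name for the same inode
cp "$dir/a" "$dir/c"         # independent copy with its own inode

same_file "$dir/a" "$dir/b" && echo "a and b: same file"
same_file "$dir/a" "$dir/c" || echo "a and c: different files"
rm -rf "$dir"
```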