In the face of rapidly rising data volumes, it is increasingly clear that Ext3, the current default Linux file system, is reaching its limits. A maximum file system, and thus volume size, of 16 TB can already be a tight squeeze for large RAID arrays; Ext3’s 32-bit block numbers and 4 KB data blocks mean, however, that there’s no way around this limit. A major refurbishment is therefore due.
Development of Ext4 started in 2006 with two changes to the Ext3 file system: block number size was increased to 48 bits and indirect block addressing – in which the data blocks making up a file are stored in a long list made up of individual block numbers – was replaced by extents, consisting of ranges of data blocks. Because this involved changing the structure of the data stored on the disk, the programmers decided that rather than introducing these patches into Ext3, it was time to create a new version of the file system – Ext4 – based on the Ext3 code.
Three years of Ext4 development have produced significant advances over Ext3, raising the volume limit to 1024 PB. This should be sufficient for many years to come. Extents, long implemented in other file systems such as XFS, should improve the efficiency of managing large files. There is also a whole range of under-the-bonnet changes intended to improve Ext4 performance compared to Ext3.
The kernel development team adopted the Ext4 code in version 2.6.19 to give it the opportunity to come to maturity in the kernel. Ext4 was marked as experimental in versions up to and including 2.6.27, but since Linux 2.6.28 the new file system has been considered stable. Not that this rules out the odd bug or other unpleasant surprise. The latest Ubuntu 9.04 can already be installed on Ext4 and the forthcoming Fedora 11 release will use Ext4 as its default file system.
Ext4 works with 48-bit block numbers, whilst the default block size remains 4 KB. This allows file system sizes composed of up to 2^48 4 KB blocks – equivalent to an exabyte (1024 PB) – compared to the 16 TB maximum in Ext3.
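The arithmetic behind these limits is easy to check. The short sketch below simply multiplies the number of addressable blocks by the default block size; the function and constant names are made up for illustration.

```python
BLOCK_SIZE = 4 * 1024  # default block size of 4 KB, in bytes

TIB = 2 ** 40  # the "TB" used in the text
PIB = 2 ** 50  # the "PB" used in the text
EIB = 2 ** 60  # one exabyte in the same binary sense


def max_fs_bytes(block_number_bits, block_size=BLOCK_SIZE):
    """Largest size addressable with block numbers of the given width."""
    return (2 ** block_number_bits) * block_size


ext3_limit = max_fs_bytes(32)  # Ext3: 32-bit block numbers
ext4_limit = max_fs_bytes(48)  # Ext4: 48-bit block numbers

print(ext3_limit // TIB, "TB")  # 16 TB
print(ext4_limit // PIB, "PB")  # 1024 PB
```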
The most important under-the-bonnet improvement in Ext4 is the use of extents in place of the indirect block addressing used in Ext3.
Figure: Data structure of an extent in Ext4
Ext4 uses 32 bits to record the number of blocks within a file, which limits the maximum file size in Ext4 file systems to 2^32 4 KB blocks, equivalent to 16 TB.
Ext4 uses the 60 bytes in the inode which Ext3 uses to store 15 32-bit block numbers to store one extent header and four extents, each 12 bytes in size. Since a single extent can cover up to 128 MB (2^15 blocks of 4 KB), this allows files of up to 512 MB to be managed directly from the inode. This also illustrates another, very practical advantage of 48-bit block numbers – if 64 bits were used for both the position of the extent within the file and the start block, the size of an extent would increase to 18 bytes (two 8-byte fields plus the 2-byte length). Since the extent header occupies 12 bytes, only 48 bytes remain for extents, which would allow just two, as opposed to four, extents to be stored within the inode.
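The byte counts above can be reproduced with Python's struct module. The following is a minimal sketch, not the kernel's code: the field order mirrors the kernel's ext4_extent_header and ext4_extent structures, the per-extent limit of 2^15 blocks is the value implied by the 512 MB figure, and the helper names and the "wide" 64-bit layout are hypothetical.

```python
import struct

# "<" means little-endian with no padding, as on disk.
HEADER = struct.Struct("<HHHHI")  # magic, entries, max, depth, generation
EXTENT = struct.Struct("<IHHI")   # logical block, length, start_hi, start_lo
WIDE_EXTENT = struct.Struct("<QQH")  # hypothetical 64-bit variant from the text

INODE_BLOCK_AREA = 60        # bytes Ext3 uses for 15 32-bit block numbers
BLOCK_SIZE = 4096            # default 4 KB blocks
MAX_EXTENT_BLOCKS = 2 ** 15  # assumed per-extent limit (128 MB)


def pack_extent(logical_block, length, physical_block):
    """Split a 48-bit physical block number into 16 high and 32 low bits."""
    if physical_block >= 2 ** 48:
        raise ValueError("physical block number does not fit into 48 bits")
    hi = physical_block >> 32
    lo = physical_block & 0xFFFFFFFF
    return EXTENT.pack(logical_block, length, hi, lo)


def unpack_extent(raw):
    """Return (logical block, length, physical block) from 12 raw bytes."""
    logical, length, hi, lo = EXTENT.unpack(raw)
    return logical, length, (hi << 32) | lo


extents_in_inode = (INODE_BLOCK_AREA - HEADER.size) // EXTENT.size
wide_extents_in_inode = (INODE_BLOCK_AREA - HEADER.size) // WIDE_EXTENT.size
max_inline_bytes = extents_in_inode * MAX_EXTENT_BLOCKS * BLOCK_SIZE

print(HEADER.size, EXTENT.size, WIDE_EXTENT.size)  # 12 12 18
print(extents_in_inode, wide_extents_in_inode)     # 4 2
print(max_inline_bytes // 2 ** 20, "MB")           # 512 MB
```

Splitting the start block into a 16-bit and a 32-bit half is what keeps each extent at 12 bytes, so that a header and four extents fill the 60 bytes exactly.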
If a file is larger than 512 MB, Ext4 builds an extent tree.
Further enhancements to the file system ensure that, irrespective of persistent preallocation, Ext4 files are wherever possible stored in one piece. Write access is initially buffered, so that the block allocator, which reserves data blocks for write operations in both Ext3 and Ext4, no longer needs to be called immediately for each 4 KB block of data (delayed allocation). Instead it allows multiple blocks to be allocated simultaneously. During large writes, this means that many blocks can be allocated in one go and ideally as a single extent (multi-block allocation).
This change reduces file system overhead – and with it both system load and I/O bottlenecks – when an application writes large volumes of data, and prevents file fragmentation. Temporary files created for short periods only can spend their whole brief lives in the cache, never getting written to the drive. When an Ext4 file system is mounted, the Ext4 code generates a list of free extents for each block group; this list remains in memory and is used by the block allocator to optimise distribution of files on the drive.
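The effect of delayed and multi-block allocation can be illustrated with a toy model. The sketch below is a deliberate simplification and has nothing to do with the real allocator's data structures: ToyAllocator hands out blocks from one contiguous free region, and two files are written alternately, one block at a time. All names are hypothetical.

```python
class ToyAllocator:
    """Hands out contiguous block ranges from a single free region."""

    def __init__(self):
        self.next_free = 0
        self.calls = 0

    def allocate(self, count):
        self.calls += 1
        start = self.next_free
        self.next_free += count
        return (start, count)  # (start block, number of blocks)


def write_immediately(allocator, files, blocks_each):
    """Call the allocator for every single block as it is written."""
    layout = {name: [] for name in files}
    for _ in range(blocks_each):
        for name in files:
            layout[name].append(allocator.allocate(1))
    return layout


def write_delayed(allocator, files, blocks_each):
    """Buffer the writes first, then allocate once per file."""
    pending = {name: 0 for name in files}
    for _ in range(blocks_each):
        for name in files:
            pending[name] += 1  # data only sits in the cache so far
    return {name: [allocator.allocate(count)] for name, count in pending.items()}


def count_extents(ranges):
    """Count contiguous runs in a list of (start, length) ranges."""
    extents = 0
    prev_end = None
    for start, length in ranges:
        if start != prev_end:
            extents += 1
        prev_end = start + length
    return extents


eager = ToyAllocator()
eager_layout = write_immediately(eager, ["a", "b"], 4)
lazy = ToyAllocator()
lazy_layout = write_delayed(lazy, ["a", "b"], 4)

print(eager.calls, count_extents(eager_layout["a"]))  # 8 calls, 4 fragments
print(lazy.calls, count_extents(lazy_layout["a"]))    # 2 calls, 1 extent
```

In this model the interleaved single-block requests scatter each file across the drive, whereas the buffered variant needs one allocator call per file and stores each file in one piece.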
Some of the new features are aimed at improving file system reliability. The journal now adds a checksum to each transaction. This both allows detection of data incorrectly written to the journal and simplifies commits for completed transactions within the journal. Checksums are also used in block group descriptors.
By default Ext4 uses the barrier mechanism offered by newer hard drives. Barriers affect the way write access is cached and sorted. The drive controller performs all writes prior to a barrier before starting on the writes behind the barrier. This makes it possible, for example, to ensure that all writes associated with a single transaction are performed on the file system before the commit is written to the journal. This mechanism can be disabled using the barrier=0 mount option.
Ext4’s extents allow more extensive consistency checking than Ext3’s block lists. For example, it is possible to check whether a file’s extents overlap. Extent headers in an extent tree also record the tree depth, which must be consistent across the entire tree and the extents recorded in an extent index must cover the same part of a file as the index. By contrast, the indirect block addressing used in Ext3 means that it is not possible to distinguish a block containing block numbers from random data, so that only very rudimentary consistency checking is possible.
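As an illustration of the overlap check mentioned above, the following sketch takes the extents of one file as (logical start block, length) pairs and reports neighbours that overlap. It is not the logic used by e2fsck; the function name and the tuple representation are assumptions made for this example.

```python
def find_overlaps(extents):
    """Return pairs of neighbouring extents whose block ranges overlap.

    extents: list of (logical_start, length) tuples for one file.
    After sorting by start block, comparing each extent with its successor
    is enough to detect whether any overlap exists at all.
    """
    ordered = sorted(extents)
    overlaps = []
    for (a_start, a_len), (b_start, b_len) in zip(ordered, ordered[1:]):
        if a_start + a_len > b_start:
            overlaps.append(((a_start, a_len), (b_start, b_len)))
    return overlaps


clean = [(0, 8), (8, 4), (12, 20)]
damaged = [(0, 10), (8, 4), (12, 20)]

print(find_overlaps(clean))    # []
print(find_overlaps(damaged))  # [((0, 10), (8, 4))]
```

No comparable test exists for an Ext3 indirect block, since nothing in a list of bare block numbers says what a valid entry ought to look like.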
Ext4 performs a complete fsck significantly faster than Ext3. If the uninit_bg option is set (as it is by default in Ubuntu 9.04), mkfs.ext4 does not initialise all block groups. This not only accelerates file system creation, it also ensures that e2fsck only needs to check initialised inodes. Fsck time is thus solely dependent on the number of files and not on the total number of inodes present (and thus file system size).
Limits and performance
Ext4 breaks a number of barriers. It allows unlimited numbers of sub-directories within a directory – in Ext3 this was limited to 32,000. Inodes now have a default size of 256 bytes, compared to 128 bytes in Ext3. Ext4 uses the extra space for, among other things, recording timestamps in nanoseconds rather than seconds and storing extended attributes directly in the inode.