35% Faster Than The Filesystem (2024)

35% Faster Than The Filesystem

►Table Of Contents

1. Summary

1.1. Caveats

1.2. Related Studies

2. How These Measurements Are Made

2.1. Read Performance Measurements

2.2. Write Performance Measurements

2.3. Variations

3. General Findings

4. Additional Notes

4.1. Compiling And Testing on Android

1.1. Caveats

¹The 35% figure above is approximate. Actual timings varydepending on hardware, operating system, and thedetails of the experiment, and due to random performance fluctuationson real-world hardware. See the text below for more detail.Try the experiments yourself. Report significant deviations onthe SQLite forum.

The 35% figure is based on running tests on every machinethat the author has easily at hand.Some reviewers of this article report that SQLite has higher latency than direct I/O on their systems. We do not yet understandthe difference. We also see indications that SQLite does notperform as well as direct I/O when experiments are run usinga cold filesystem cache.

So let your take-away be this: read/write latency forSQLite is competitive with read/write latency of individual files ondisk. Often SQLite is faster. Sometimes SQLite is almostas fast. Either way, this article disproves the commonassumption that a relational database must be slower than directfilesystem I/O.

A 2022 study(alternative link on GitHub) found thatSQLite is roughly twice as fast at real-world workloads compared to Btrfs and Ext4 on Linux.

Jim Grayand others studied the read performance of BLOBsversus file I/O for Microsoft SQL Server and found that reading BLOBs out of the database was faster for BLOB sizes less than between 250KiB and 1MiB.(Paper).In that study, the database still stores the filename of the content evenif the content is held in a separate file. So the database is consultedfor every BLOB, even if it is only to extract the filename. In thisarticle, the key for the BLOB is the filename, so no preliminary databaseaccess is required. Because the database is never used at all whenreading content from individual files in this article, the thresholdat which direct file I/O becomes faster is smaller than it is in Gray'spaper.

The Internal Versus External BLOBs article on this website is anearlier investigation (circa 2011) that uses the same approach as theJim Gray paper — storing the blob filenames as entries in thedatabase — but for SQLite instead of SQL Server.

I/O performance is measured using thekvtest.c programfrom the SQLite source tree.To compile this test program, first gather the kvtest.c source fileinto a directory with the SQLite amalgamation sourcefiles "sqlite3.c" and "sqlite3.h". Then on unix, run a command likethe following:

gcc -Os -I. -DSQLITE_DIRECT_OVERFLOW_READ \ kvtest.c sqlite3.c -o kvtest -ldl -lpthread

Or on Windows with MSVC:

cl -I. -DSQLITE_DIRECT_OVERFLOW_READ kvtest.c sqlite3.c

Instructions for compiling for Androidare shown below.

Use the resulting "kvtest" program togenerate a test database with 100,000 random uncompressibleblobs, each with a randomsize between 8,000 and 12,000 bytesusing a command like this:

./kvtest init test1.db --count 100k --size 10k --variance 2k

If desired, you can verify the new database by running this command:

./kvtest stat test1.db

Next, make copies of all the blobs into individual files in a directoryusing a command like this:

./kvtest export test1.db test1.dir

At this point, you can measure the amount of disk space used bythe test1.db database and the space used by the test1.dir directoryand all of its content. On a standard Ubuntu Linux desktop, thedatabase file will be 1,024,512,000 bytes in size and the test1.dirdirectory will use 1,228,800,000 bytes of space (according to "du -k"),about 20% more than the database.

The "test1.dir" directory created above puts all the blobs into a singlefolder. It was conjectured that some operating systems would perform poorly when a single directory contains 100,000 objects. To test this,the kvtest program can also store the blobs in a hierarchy of folders with nomore than 100 files and/or subdirectories per folder. The alternativeon-disk representation of the blobs can be created using the --treecommand-line option to the "export" command, like this:

./kvtest export test1.db test1.tree --tree

The test1.dir directory will contain 100,000 fileswith names like "000000", "000001", "000002" and so forth but thetest1.tree directory will contain the same files in subdirectories like"00/00/00", "00/00/01", and so on. The test1.dir and test1.testdirectories take up approximately the same amount of space, thoughtest1.test is very slightly larger due to the extra directory entries.

All of the experiments that follow operate the same with either "test1.dir" or "test1.tree". Very little performance difference ismeasured in either case, regardless of operating system.

Measure the performance for reading blobs from the database and fromindividual files using these commands:

./kvtest run test1.db --count 100k --blob-api./kvtest run test1.dir --count 100k --blob-api./kvtest run test1.tree --count 100k --blob-api

Depending on your hardware and operating system, you should see that reads from the test1.db database file are about 35% faster than reads from individual files in the test1.dir or test1.tree folders. Results can varysignificantly from one run to the next due to caching, so it is advisableto run tests multiple times and take an average or a worst case or a bestcase, depending on your requirements.

The --blob-api option on the database read test causes kvtest to usethe sqlite3_blob_read() feature of SQLite to load the content of theblobs, rather than running pure SQL statements. This helps SQLite to runa little faster on read tests. You can omit that option to compare theperformance of SQLite running SQL statements.In that case, the SQLite still out-performs direct reads, thoughby not as much as when using sqlite3_blob_read().The --blob-api option is ignored for tests that read from individual diskfiles.

Measure write performance by adding the --update option. This causesthe blobs are overwritten in place with another random blob ofexactly the same size.

./kvtest run test1.db --count 100k --update./kvtest run test1.dir --count 100k --update./kvtest run test1.tree --count 100k --update

The writing test above is not completely fair, since SQLite is doingpower-safe transactions whereas the direct-to-disk writing is not.To put the tests on a more equal footing, add either the --nosyncoption to the SQLite writes to disable calling fsync() orFlushFileBuffers() to force content to disk, or using the --fsync optionfor the direct-to-disk tests to force them to invoke fsync() orFlushFileBuffers() when updating disk files.

By default, kvtest runs the database I/O measurements all withina single transaction. Use the --multitrans option to run each blobread or write in a separate transaction. The --multitrans option makesSQLite much slower, and uncompetitive with direct disk I/O. Thisoption proves, yet again, that to get the most performance out ofSQLite, you should group as much database interaction as possible withina single transaction.

There are many other testing options, which can be seen by runningthe command:

./kvtest help

2.1. Read Performance Measurements

The chart below shows data collected using kvtest.c on five differentsystems:

Win7: A circa-2009 Dell Inspiron laptop, Pentium dual-core at 2.30GHz, 4GiB RAM, Windows7.
Win10: A 2016 Lenovo YOGA 910, Intel i7-7500 at 2.70GHz, 16GiB RAM, Windows10.
Mac: A 2015 MacBook Pro, 3.1GHz intel Core i7, 16GiB RAM, MacOS 10.12.5
Ubuntu: Desktop built from Intel i7-4770K at 3.50GHz, 32GiB RAM, Ubuntu 16.04.2 LTS
Android: Galaxy S3, ARMv7, 2GiB RAM

All machines use SSD except Win7 which has ahard-drive. The test database is 100K blobs with sizes uniformlydistributed between 8K and 12K, for a total of about 1 gigabyteof content. The database page sizeis 4KiB. The -DSQLITE_DIRECT_OVERFLOW_READ compile-time option wasused for all of these tests.Tests were run multiple times.The first run was used to warm up the cache and its timings were discarded.

The chart below shows average time to read a blob directly from thefilesystem versus the time needed to read the same blob from the SQLite database.The actual timings vary considerably from one system to another (the Ubuntu desktop is muchfaster than the Galaxy S3 phone, for example). This chart shows the ratio of thetimes needed to read blobs from a file divided by the time needed tofrom the database. The left-most column in the chart is the normalizedtime to read from the database, for reference.

In this chart, an SQL statement ("SELECT v FROM kv WHERE k=?1") is prepared once. Then for each blob, the blob key value is bound to the ?1 parameter and the statement is evaluated to extract theblob content.

The chart shows that on Windows10, content can be read from the SQLitedatabase about 5 times faster than it can be read directly from disk.On Android, SQLite is only about 35% faster than reading from disk.

Chart 1: SQLite read latency relative to direct filesystem reads.
100K blobs, avg 10KB each, random order using SQL

The performance can be improved slightly by bypassing the SQL layerand reading the blob content directly using thesqlite3_blob_read() interface, as shown in the next chart:

Chart 2: SQLite read latency relative to direct filesystem reads.
100K blobs, avg size 10KB, random order
using sqlite3_blob_read().

Further performance improves can be made by using thememory-mapped I/O feature of SQLite. In the next chart, theentire 1GB database file is memory mapped and blobs are read(in random order) using the sqlite3_blob_read() interface.With these optimizations, SQLite is twice as fast as Androidor MacOS-X and over 10 times faster than Windows.

Chart 3: SQLite read latency relative to direct filesystem reads.
100K blobs, avg size 10KB, random order
using sqlite3_blob_read() from a memory-mapped database.

The third chart shows that reading blob content out of SQLite can betwice as fast as reading from individual files on disk for Mac andAndroid, and an amazing ten times faster for Windows.

2.2. Write Performance Measurements

Writes are slower.On all systems, using both direct I/O and SQLite, write performance isbetween 5 and 15 times slower than reads.

Write performance measurements were made by replacing (overwriting)an entire blob with a different blob. All of the blobs in theseexperiment are random and incompressible. Because writes are so muchslower than reads, only 10,000 of the 100,000 blobs in the databaseare replaced. The blobs to be replaced are selected at random andare in no particular order.

The direct-to-disk writes are accomplished using fopen()/fwrite()/fclose().By default, and in all the results shown below, the OS filesystem buffers arenever flushed to persistent storage using fsync() orFlushFileBuffers(). In other words, there is no attempt to make thedirect-to-disk writes transactional or power-safe.We found that invoking fsync() or FlushFileBuffers() on each filewritten causes direct-to-disk storageto be about 10 times or more slower than writes to SQLite.

The next chart compares SQLite database updates in WAL modeagainst raw direct-to-disk overwrites of separate files on disk.The PRAGMA synchronous setting is NORMAL.All database writes are in a single transaction.The timer for the database writes is stopped after the transactioncommits, but before a checkpoint is run.Note that the SQLite writes, unlike the direct-to-disk writes,are transactional and power-safe, though because the synchronoussetting is NORMAL instead of FULL, the transactions are not durable.

Chart 4: SQLite write latency relative to direct filesystem writes.
10K blobs, avg size 10KB, random order,
WAL mode with synchronous NORMAL,
exclusive of checkpoint time

The android performance numbers for the write experiments are omittedbecause the performance tests on the Galaxy S3 are so random. Twoconsecutive runs of the exact same experiment would give wildly differenttimes. And, to be fair, the performance of SQLite on android is slightlyslower than writing directly to disk.

The next chart shows the performance of SQLite versus direct-to-diskwhen transactions are disabled (PRAGMA journal_mode=OFF)and PRAGMA synchronous is set to OFF. These settings put SQLite on anequal footing with direct-to-disk writes, which is to say they make thedata prone to corruption due to system crashes and power failures.

Chart 5: SQLite write latency relative to direct filesystem writes.
10K blobs, avg size 10KB, random order,
journaling disabled, synchronous OFF.

In all of the write tests, it is important to disable anti-virus softwareprior to running the direct-to-disk performance tests. We found thatanti-virus software slows down direct-to-disk by an order of magnitudewhereas it impacts SQLite writes very little. This is probably due to thefact that direct-to-disk changes thousands of separate files which all needto be checked by anti-virus, whereas SQLite writes only changes the singledatabase file.

2.3. Variations

The -DSQLITE_DIRECT_OVERFLOW_READ compile-time option causes SQLiteto bypass its page cache when reading content from overflow pages. Thishelps database reads of 10K blobs run a little faster, but not all that muchfaster. SQLite still holds a speed advantage over direct filesystem readswithout the SQLITE_DIRECT_OVERFLOW_READ compile-time option.

Other compile-time options such as using -O3 instead of -Os orusing -DSQLITE_THREADSAFE=0 and/or some of the otherrecommended compile-time options might help SQLite to run even fasterrelative to direct filesystem reads.

The size of the blobs in the test data affects performance.The filesystem will generally be faster for larger blobs, sincethe overhead of open() and close() is amortized over more bytes of I/O,whereas the database will be more efficient in both speed and spaceas the average blob size decreases.

SQLite is competitive with, and usually faster than, blobs stored inseparate files on disk, for both reading and writing.
SQLite is much faster than direct writes to disk on Windowswhen anti-virus protection is turned on. Since anti-virus softwareis and should be on by default in Windows, that means that SQLiteis generally much faster than direct disk writes on Windows.
Reading is about an order of magnitude faster than writing, for allsystems and for both SQLite and direct-to-disk I/O.
I/O performance varies widely depending on operating system and hardware.Make your own measurements before drawing conclusions.
Some other SQL database engines advise developers to store blobs in separatefiles and then store the filename in the database. In that case, wherethe database must first be consulted to find the filename before openingand reading the file, simply storing the entire blob in the databasegives much faster read and write performance with SQLite.See the Internal Versus External BLOBs article for more information.

4.1. Compiling And Testing on Android

The kvtest program is compiled and run on Android as follows.First install the Android SDK and NDK. Then prepare a scriptnamed "android-gcc" that looks approximately like this:

#!/bin/sh#NDK=/home/drh/Android/Sdk/ndk-bundleSYSROOT=$NDK/platforms/android-16/arch-armABIN=$NDK/toolchains/arm-linux-androideabi-4.9/prebuilt/linux-x86_64/binGCC=$ABIN/arm-linux-androideabi-gcc$GCC --sysroot=$SYSROOT -fPIC -pie $*

Make that script executable and put it on your $PATH. Thencompile the kvtest program as follows:

android-gcc -Os -I. kvtest.c sqlite3.c -o kvtest-android

Next, move the resulting kvtest-android executable to the Androiddevice:

adb push kvtest-android /data/local/tmp

Finally use "adb shell" to get a shell prompt on the Android device,cd into the /data/local/tmp directory, and begin running the testsas with any other unix host.

This page last modified on 2023-12-05 14:43:20 UTC