Martin Kleppmann’s book Designing Data-Intensive Applications covers a wide range of topics relevant to modern software development - from reliability to database storage engines, replication, and distribution strategies. Chapter 3 in particular starts by introducing an incredibly basic log-file database (key-value store) implementation:
#!/bin/bash
db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
For a database that pushes every write to disk, this has near best-case write performance, at O(1), as it’s just appending to a file. The read performance, at O(n), is fine for very small databases, but slows linearly with the number of writes, as it must always read the whole file to find the last occurrence of a key.
Database internals have always seemed like a bit of a black box to me, in terms of how they actually store the data. Of course it has to be stored somewhere, either in memory or on disk. In memory feels simpler - it’s just like a hash map, right? If it’s on disk then it needs to be in a file, but what does the file format look like? Do you read the whole file into memory? How do you deal with reading and writing at the same time - do you need to lock the file?
So of course the intuition is that databases are incredibly complicated systems, which they are! But when you look at the building blocks from first principles, each part is individually quite simple. In the above log database implementation, we see that writing to the database is just appending a line to the file - that’s it. And reading, while incredibly basic, is just finding the last time a key was mentioned.
Reading and Writing
In Go, we can model the above database functions as follows:
type DB interface {
    Write(string, string) error
    Read(string) (string, error)
}
This isn’t too far off the io.ReadWriter interface from the Go standard library, except that we’re working with strings - two of them, a key and a value - to keep the examples here simple.
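For comparison, the interfaces in the io package deal in byte slices rather than strings:
type Reader interface {
    Read(p []byte) (n int, err error)
}

type Writer interface {
    Write(p []byte) (n int, err error)
}

type ReadWriter interface {
    Reader
    Writer
}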
There’s quite a bit of extra boilerplate we need to write in a Go program, compared to the Bash example, but a basic version looks something like this:
type Log struct {
    LogPath string
}

type Entry struct {
    Key   string `json:"key"`
    Value string `json:"value"`
}

func (l *Log) Write(key string, value string) error {
    file, err := os.OpenFile(l.LogPath, os.O_WRONLY|os.O_APPEND, 0644)
    if err != nil {
        return fmt.Errorf("error opening file: %v", err)
    }
    defer file.Close()

    marshalled, err := json.Marshal(Entry{
        Key:   key,
        Value: value,
    })
    if err != nil {
        return fmt.Errorf("error marshalling json: %v", err)
    }

    marshalled = append(marshalled, '\n')
    _, err = file.Write(marshalled)
    if err != nil {
        return fmt.Errorf("error writing data: %v", err)
    }
    return nil
}

func (l *Log) Read(key string) (string, error) {
    file, err := os.Open(l.LogPath)
    if err != nil {
        return "", fmt.Errorf("error opening file: %v", err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    latest := ""
    for scanner.Scan() {
        var entry Entry
        if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
            return "", fmt.Errorf("error unmarshalling json: %v", err)
        }
        if entry.Key == key {
            latest = entry.Value
        }
    }
    return latest, nil
}

func NewLog(logPath string) *Log {
    db := &Log{
        LogPath: logPath,
    }
    _, err := os.Stat(logPath)
    if os.IsNotExist(err) {
        file, err := os.Create(logPath)
        if err != nil {
            log.Fatal(err)
        }
        defer file.Close()
        return db
    }
    return db
}
We’re writing JSON here, rather than comma-separated key/value pairs like in the bash example, both for ease of parsing and because it lets us accept somewhat more complex input. I’ve also included a simple setup function, so we can start using the database.
In main.go:
// Error handling omitted for brevity
func main() {
    var db db.DB = log.NewLog("test.log")

    db.Write("Hello", "world")
    db.Write("Goodbye", "world")

    value, _ := db.Read("Hello")
    println("Hello:", value)

    value, _ = db.Read("Goodbye")
    println("Goodbye:", value)
}
Running that gives us:
Hello: world
Goodbye: world
// File contents:
{"key":"Hello","value":"world"}
{"key":"Goodbye","value":"world"}
So now we’re basically on par with the two bash functions - if anything a little ahead! It goes to show how much legwork >> and grep are doing in that version. But now we come up against a couple of issues:
- As we saw earlier, writes are blazing fast, but reads grow linearly with the number of entries in the log. We’ll address this later by adding an index to the database.
- When we write a key multiple times, we leave behind all of the old writes, which cause the file to become much larger than it needs to be.
To address the second point, we’ll talk about the process of compaction.
Compaction
As we mentioned above, one of the tradeoffs of appending to the file, and thus maintaining O(1) write complexity, is bloat in the database file from repeated writes of the same key.
So our ‘laziness’ with the writes pays off in speed, but it leaves an awful lot of mess behind. To deal with that, we need to periodically run a compaction process, where we go through the database file and delete records that aren’t needed anymore - similar to the garbage collector in languages like Go or Java.
For example, if someone couldn’t decide on their favourite animal:
{"key":"favourite_animal","value":"cat"}
{"key":"favourite_animal","value":"dog"}
{"key":"favourite_animal","value":"rat"}
{"key":"favourite_animal","value":"gerbil"}
{"key":"favourite_animal","value":"hamster"}
{"key":"favourite_animal","value":"fish"}
We only have one key in the database, but we’ve got data for it six times! You can imagine how much of a problem this would cause in a more realistic example, if we’re regularly updating something like the number of views on a page or video.
To clean this up, we need to remove all but the most recent entry for each key. As we always write to the end of the log file, we can be sure that the most recent entry for any given key is the last occurrence we see while scanning the file from start to end. So the compacted database log file would be:
{"key":"favourite_animal","value":"fish"}
We can use a map in Go to do this, scanning through the file line by line, and updating the map with every entry that we see:
func (l *Log) compact() error {
    readFile, err := os.Open(l.LogPath)
    if err != nil {
        return fmt.Errorf("error opening file: %v", err)
    }
    defer readFile.Close()

    scanner := bufio.NewScanner(readFile)
    seen := make(map[string]string)
    for scanner.Scan() {
        var entry Entry
        if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
            return fmt.Errorf("error unmarshalling json: %v", err)
        }
        seen[entry.Key] = entry.Value
    }

    tmpFile, err := os.Create(l.LogPath + ".tmp")
    if err != nil {
        return err
    }
    defer os.Remove(tmpFile.Name())
    defer tmpFile.Close()

    for k, v := range seen {
        entry, err := json.Marshal(Entry{
            Key:   k,
            Value: v,
        })
        if err != nil {
            return fmt.Errorf("error marshalling json: %v", err)
        }
        entry = append(entry, '\n')
        _, err = tmpFile.Write(entry)
        if err != nil {
            return fmt.Errorf("error writing data: %v", err)
        }
    }

    err = os.Rename(tmpFile.Name(), l.LogPath)
    if err != nil {
        return fmt.Errorf("error renaming file: %v", err)
    }
    return nil
}
There’s quite a lot of boilerplate with the file handling here, but the key parts are:
- Open the file for reading.
- Scan through the whole file, line by line, and update a map with each record.
- Iterate the map, and write out the remaining keys to a temp file.
  - We won’t re-write the keys in the same order as they were, due to Go’s map implementation, but this doesn’t really matter as there will only be one entry for each key.
  - If we cared about keeping the ordering, then we could keep track of each key we see in a slice, and iterate that (indexing the map) instead.
- Rename the temp file to overwrite the original.
Writing to the temp file and atomically renaming it is key - without this we risk losing data if there’s an error or crash during compaction. If we naively overwrote the original file and something happened part way through, we’d lose whatever hadn’t been written yet, and might end up with partially written records too. This pattern is useful any time you want to replace the contents of a file safely, not just in databases.
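As a rough sketch of that pattern on its own - the helper name and error handling here are mine, not part of the toy database - a reusable version using os and path/filepath might look like:
// writeFileAtomic is a hypothetical helper that replaces the contents of path
// by writing to a temporary file and renaming it over the original, so readers
// never see a half-written file.
func writeFileAtomic(path string, data []byte) error {
    // Create the temp file in the same directory, so the rename doesn't have
    // to cross filesystems.
    tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name()) // harmless no-op once the rename has succeeded
    if _, err := tmp.Write(data); err != nil {
        tmp.Close()
        return err
    }
    // Flush to disk before renaming, if durability matters.
    if err := tmp.Sync(); err != nil {
        tmp.Close()
        return err
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    return os.Rename(tmp.Name(), path)
}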
You may have spotted that our compaction algorithm is O(n), as we scan every row - so we’re having to pay off the debt of our fast writes in the end. We don’t need to do it for every write, however - in our toy example we just compact the log file on open, but in a more real-world database this process would run in the background periodically.
func NewLog(logPath string) *Log {
    db := &Log{
        LogPath: logPath,
    }
    // ...snip
    // === New - compact on open ====
    if err := db.compact(); err != nil {
        log.Fatal(err)
    }
    // ==============================
    return db
}
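As an illustration of what that could look like - this isn’t part of the toy implementation, and the interval and shutdown mechanism are arbitrary choices of mine - a background compactor might be as simple as:
// startCompaction is a hypothetical helper that compacts the log on a fixed
// interval until the stop channel is closed. A real database would also need
// to coordinate with in-flight reads and writes (e.g. with a lock).
func (l *Log) startCompaction(interval time.Duration, stop <-chan struct{}) {
    go func() {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                if err := l.compact(); err != nil {
                    log.Printf("compaction failed: %v", err)
                }
            case <-stop:
                return
            }
        }
    }()
}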
In order to keep serving reads and writes while compaction is happening, and to prevent files from growing too large, the log files would be segmented once they reach a certain size. That is, we’d stop writing to the existing file, and start using a new one. Reads would then need to scan across all files, starting with the most recent one. Compaction of these older segments wouldn’t affect writes at all, and reads would only need to pause for the atomic file rename.
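To make that read path concrete, here’s a rough sketch - the segment list, its ordering, and treating an empty string as ‘not found’ are all assumptions layered on top of the toy code:
// readFromSegments is a hypothetical lookup across multiple segment files,
// assuming segmentPaths is ordered oldest to newest. We check the newest
// segment first, since it holds the most recent write for any given key.
func readFromSegments(segmentPaths []string, key string) (string, error) {
    for i := len(segmentPaths) - 1; i >= 0; i-- {
        seg := &Log{LogPath: segmentPaths[i]}
        value, err := seg.Read(key)
        if err != nil {
            return "", err
        }
        // Our toy Read returns "" when a key is missing; a real implementation
        // would distinguish 'not found' from an empty value.
        if value != "" {
            return value, nil
        }
    }
    return "", nil
}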
So we’ve dealt with the problem of the database file continuing to grow, but our reads are still pretty slow. We’ll solve that by adding an index to our database.
Indexes
The concept of an index is really simple - unlike the log file that we write to, which contains the key and the value, the index contains the key and information about where to look up the value. In the context of our log file, that could be the line number. However, that’s not very efficient - scanning to the nth line of a text file is almost as much work as fully reading the first n records, which is what we already do for our reads!
Instead, for each key we can store a byte offset of where the entry is in the file. The byte offset is like a ‘pointer’ to where the data is in the file, allowing us to look it up directly rather than searching for it. Because we only append to the file, this offset won’t change for old data (except when we compact the file), and we can update the index to point to the most recent entry whenever we overwrite a key.
As an aside, this would be a natural point to switch our log file to a binary format rather than line oriented JSON. We’ll be sticking with JSON in this blog post for ease, but a binary format makes more sense if we’re always accessing by byte offset anyway, and is more efficient in both space and parsing.
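As a sketch of what that could look like - we won’t use this anywhere else in the post - a simple length-prefixed record written with encoding/binary might be:
// appendBinaryEntry is a hypothetical length-prefixed record writer:
// [4-byte key length][4-byte value length][key bytes][value bytes].
// With fixed-size headers we can read a record straight from a byte offset,
// with no newline scanning or JSON parsing.
func appendBinaryEntry(w io.Writer, key, value string) error {
    var header [8]byte
    binary.LittleEndian.PutUint32(header[0:4], uint32(len(key)))
    binary.LittleEndian.PutUint32(header[4:8], uint32(len(value)))
    if _, err := w.Write(header[:]); err != nil {
        return err
    }
    if _, err := io.WriteString(w, key); err != nil {
        return err
    }
    _, err := io.WriteString(w, value)
    return err
}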
In Go, an easy way to implement an index is with a map[string]int64 that we keep on the database object. You can think of the whole thing a bit like a map that stores its values on disk rather than in memory, and, as we said before, the byte offset int64 acts a bit like a *string - a pointer into the file rather than into memory.
type Log struct {
    LogPath string
    Index   map[string]int64
}
We’ll then need to make sure we initialise the map when we open the database, and every time we compact. We’ll also update the map on every write, ideally writing to disk before we update the index - an orphaned record (one that’s not in the index) can be recovered by rebuilding the index, while an index entry with no underlying data can’t self-heal in the same way.
We can do this by scanning the whole database file, much like we do for compaction, while keeping track of the number of bytes we’ve read so far. If we were explicitly reading n bytes at a time from a binary format this would be a little easier, as we could just take the current byte offset of the file we’re reading. In this case, we count the length of each line from the bufio.Scanner, making sure to account for the newline that the scanner strips off.
func (l *Log) initIndex() error {
    // Always overwrite the old index
    l.Index = make(map[string]int64)

    file, err := os.Open(l.LogPath)
    if err != nil {
        return fmt.Errorf("error opening file: %v", err)
    }
    defer file.Close()

    var offset int64 = 0
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        var entry Entry
        if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
            return fmt.Errorf("error unmarshalling json: %v", err)
        }
        l.Index[entry.Key] = offset
        offset += int64(len(scanner.Bytes()) + 1) // +1 byte for \n consumed by scanner
    }
    return nil
}
func NewLog(logPath string) *Log {
    db := &Log{
        LogPath: logPath,
        Index:   make(map[string]int64),
    }
    // ...snip
    // ==== New - Init on open ===
    if err := db.initIndex(); err != nil {
        log.Fatal(err)
    }
    // ===========================
    return db
}
You might have noticed that we’re scanning through the file just like we do while compacting - so an optimisation would be to do both at once, and write out the index byte offsets as we write the compacted file to disk.
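A rough sketch of that combined pass might look like the following - this is my own variation on compact() rather than code we use later, and it only swaps the new index in once the rename has succeeded:
// compactAndReindex is a hypothetical variant of compact() that records the
// byte offset of each entry as it writes the compacted file, so we don't need
// a separate initIndex() scan afterwards.
func (l *Log) compactAndReindex() error {
    readFile, err := os.Open(l.LogPath)
    if err != nil {
        return err
    }
    defer readFile.Close()

    // Find the latest value for each key, exactly as compact() does.
    seen := make(map[string]string)
    scanner := bufio.NewScanner(readFile)
    for scanner.Scan() {
        var entry Entry
        if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
            return err
        }
        seen[entry.Key] = entry.Value
    }

    tmpFile, err := os.Create(l.LogPath + ".tmp")
    if err != nil {
        return err
    }
    defer os.Remove(tmpFile.Name())
    defer tmpFile.Close()

    // Write each surviving entry and note where it starts in the new file.
    newIndex := make(map[string]int64)
    var offset int64
    for k, v := range seen {
        line, err := json.Marshal(Entry{Key: k, Value: v})
        if err != nil {
            return err
        }
        line = append(line, '\n')
        if _, err := tmpFile.Write(line); err != nil {
            return err
        }
        newIndex[k] = offset
        offset += int64(len(line))
    }

    if err := os.Rename(tmpFile.Name(), l.LogPath); err != nil {
        return err
    }
    l.Index = newIndex
    return nil
}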
For writing, we need to open the file slightly differently, as we can’t reliably get byte offsets for files opened in os.O_APPEND mode. Instead we open with just os.O_WRONLY, manually seek to the end, and record the byte offset that we started writing at (the old end of the file).
func (l *Log) Write(key string, value string) error {
    file, err := os.OpenFile(l.LogPath, os.O_WRONLY, 0644)
    if err != nil {
        return fmt.Errorf("error opening file: %v", err)
    }
    defer file.Close()

    offset, err := file.Seek(0, io.SeekEnd)
    if err != nil {
        return err
    }

    marshalled, err := json.Marshal(Entry{
        Key:   key,
        Value: value,
    })
    if err != nil {
        return fmt.Errorf("error marshalling json: %v", err)
    }

    marshalled = append(marshalled, '\n')
    _, err = file.Write(marshalled)
    if err != nil {
        return fmt.Errorf("error writing data: %v", err)
    }

    l.Index[key] = offset
    return nil
}
Now that we have our index in place, we can reap the rewards! Instead of our Read function needing to scan the whole file every time to find the most recent occurrence of a key, we can make a single read at the byte offset stored in the index. This extra bookkeeping means that our reads are now O(1) as well.
Unlike the rest of the indexing code, the read code actually becomes simpler than before, with no loops. However, due to needing to handle a few more errors, it ends up slightly longer (thanks, Go):
func (l *Log) Read(key string) (string, error) {
    offset, ok := l.Index[key]
    if !ok {
        return "", nil
    }

    file, err := os.Open(l.LogPath)
    if err != nil {
        return "", fmt.Errorf("error opening file: %v", err)
    }
    defer file.Close()

    _, err = file.Seek(offset, io.SeekStart)
    if err != nil {
        return "", fmt.Errorf("error seeking: %v", err)
    }

    reader := bufio.NewReader(file)
    line, err := reader.ReadString('\n')
    if err != nil {
        return "", fmt.Errorf("error reading line: %v", err)
    }

    var entry Entry
    err = json.Unmarshal([]byte(line), &entry)
    if err != nil {
        return "", fmt.Errorf("error unmarshalling json: %v", err)
    }
    return entry.Value, nil
}
Adding the index has also brought in another limitation: the memory usage of the application now grows with the number of keys in the database, since (compaction aside) we need to keep every key in memory at all times to look values up efficiently. We won’t look at fixing this now, as it doesn’t really affect our toy example, but Kleppmann follows the hash index with other storage engines that trade away some of its incredible read performance to ease these memory limitations and provide other benefits.
Performance
I’ve been talking about O(n) and O(1) performance, but what does that actually mean in practice? Running a quick benchmark, we can see how much faster reading from the database gets because of the index as the file grows:
Reads
| Dataset Size | Implementation | Time (ns/op) | Allocations | Index Speedup |
|---|---|---|---|---|
| 100 entries | Index | 9,797 | 14 | - |
| 100 entries | Scan Log | 48,217 | 705 | 5x |
| 1,000 entries | Index | 10,086 | 14 | - |
| 1,000 entries | Scan Log | 391,750 | 7,005 | 39x |
| 10,000 entries | Index | 10,253 | 14 | - |
| 10,000 entries | Scan Log | 3,909,175 | 70,007 | 381x |
As you can see, the O(n) approach of scanning the whole log file gets linearly slower with the number of entries in the file, while the O(1) index lookups stay near constant. There is a little bit of slowdown from reading larger files, even though we’re only seeking within them. We also make far fewer allocations - though this is partly because scanning the log means JSON unmarshalling every entry in the database.
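For context, a read benchmark producing numbers of this shape could be written with Go’s testing package - this is a simplified sketch rather than the exact benchmark code, and seedEntries is a hypothetical helper:
// BenchmarkIndexedRead measures repeated reads of a single key from a log
// seeded with 10,000 entries, using the in-memory index.
func BenchmarkIndexedRead(b *testing.B) {
    db := NewLog(filepath.Join(b.TempDir(), "bench.log"))
    seedEntries(b, db, 10000)

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := db.Read("key-5000"); err != nil {
            b.Fatal(err)
        }
    }
}

// seedEntries writes n key/value pairs into the database before timing starts.
func seedEntries(b *testing.B, db *Log, n int) {
    b.Helper()
    for i := 0; i < n; i++ {
        if err := db.Write(fmt.Sprintf("key-%d", i), "value"); err != nil {
            b.Fatal(err)
        }
    }
}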
Writes
| Implementation | Time (ns/op) | Allocations | Overhead |
|---|---|---|---|
| Index | 20,075 ± 564 | 9 | +0.35% |
| Scan Log | 20,006 ± 576 | 8 | - |
There’s a tiny cost to write speed, as we need to seek to the end of the file when not using O_APPEND mode, and we update the map on each write. However, the difference is so small that it’s within the margin of error. Over a large number of runs I could see a measurable difference, but it’s small enough that it doesn’t matter.
Overall
Taking a small performance hit on writes is absolutely worth it for the kind of speedup we get on reads - up to 381x for a database with 10,000 entries - and it really shows how important the index is. Unless we store the data in memory, this is close to the best we can get here. Later in the same chapter of DDIA, Kleppmann discusses databases keeping a memtable, meaning that the most recent segment of data is also held in memory, allowing near-instant reads compared to any kind of file I/O.
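As a very rough sketch of that idea - not something we’ve built here, and the type and field names are made up - reads could check an in-memory table of recent writes before falling back to the indexed file lookup:
// MemLog is a hypothetical wrapper that keeps the most recent writes in
// memory, so hot reads can skip the file entirely.
type MemLog struct {
    *Log
    recent map[string]string // recent writes that haven't been flushed yet
}

func (m *MemLog) Read(key string) (string, error) {
    if value, ok := m.recent[key]; ok {
        return value, nil // served from memory, no file I/O
    }
    return m.Log.Read(key) // fall back to the on-disk log and index
}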
While I’m sure there are all sorts of optimisations and tricks we could use to make the file I/O faster in either case, it’s impressive to see the kind of speedup we get here. Even with 10,000 entries, individual lookups that scan the whole file take ‘only’ 4ms, so they still feel instant, but getting nearly 400 times faster is huge.
Wrapping Things Up
We took Kleppmann’s two bash functions, implemented them in Go, and then applied some basic but impactful optimisations to get somewhat closer to what a real-world key-value store might look like. The benchmarks show that they worked - a 381x speedup on reads is no joke, however obvious it may seem that a map index for looking up the data will be faster.
Implementing the code for this post was really helpful for me in understanding the simplicity of the raw building blocks of databases: the complexity comes from the fact that we need to layer so many of them together to get something useful, and there are tradeoffs at every turn between performance, durability, and simplicity of the system.
Of course, what we built here is just a toy: there’s no safety for concurrent access, no networking, and no segmentation, background compaction, or deletion operations - all of which would be interesting to implement another time.
We’re also only able to look up a single key, rather than scanning a range of keys efficiently, and the number of keys is limited by the available memory rather than by disk space. LSM-Trees and SSTables are a related type of storage engine that aim to provide a performant alternative to an in-memory index like this, and are used in databases like Cassandra and DynamoDB. Look out for a future post where we look into implementing these!