Voron and the FreeDB Dataset
[Editor's note: the follow-up post is available here.]

I got tired of doing arbitrary performance testing, so I decided to take the FreeDB dataset and start working with that. FreeDB is a dataset used to look up CD information based on a nearly unique disk ID. This is a good dataset, because it contains a lot of data (over three million albums and over 40 million songs), and it is production data. That means that it is dirty, which makes it perfect for running all sorts of interesting scenarios.

The purpose of this post (and maybe the next few) is to show off a few things. First, we want to see how Voron behaves with a realistic dataset. Second, we want to show off the way Voron works, its API, etc.

To start with, I ran my FreeDB parser, pointing it at /dev/null. The idea is to measure the cost of just going through the data. We are using freedb-complete-20130901.tar.bz2 from Sep 2013. After 1 minute, we had gone through 342,224 albums, and after 6 minutes we were at 2,066,871 albums. Reading all 3,328,488 albums took a bit over ten minutes. So just parsing and reading the FreeDB dataset is pretty expensive.

The end result is a list of objects that looks like this:

Now, let us see how we want to actually use this. We want to be able to:

- Look up an album by its disk IDs.
- Look up all the albums by an artist (*).
- Look up albums by album title (*).

(*) This gets interesting, because we need to deal with questions such as: “Given Pearl Jam, if I search for Pearl, do I get them? Do I get them for Jam?” For now, we are going with case-insensitive matching. We won’t be doing full text search, but we will allow prefix searches.

We are using the following abstraction for the destination:

```csharp
public abstract class Destination
{
    // Called once for every album parsed from the FreeDB file.
    public abstract void Accept(Disk d);

    // Called once after the last album, so pending work can be flushed.
    public abstract void Done();
}
```

Basically, we read data as fast as we can and shove it to the destination until we are done.
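To make the abstraction concrete, here is a minimal sketch of a destination that discards everything and only counts albums, roughly what pointing the parser at /dev/null amounts to. The `Disk` class here is a hypothetical stand-in with two assumed properties (the real record has more), and `CountingDestination` is an illustrative name, not something from the original code.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the parsed album record; the real class has more fields.
public class Disk
{
    public string Artist { get; set; }
    public string Title { get; set; }
}

public abstract class Destination
{
    public abstract void Accept(Disk d);
    public abstract void Done();
}

// Discards every album and only counts them.
public class CountingDestination : Destination
{
    private long _count;

    // Interlocked keeps the counter safe if Accept is called from another thread.
    public long Count => Interlocked.Read(ref _count);
    public bool IsDone { get; private set; }

    public override void Accept(Disk d) => Interlocked.Increment(ref _count);
    public override void Done() => IsDone = true;
}
```

A destination like this is handy as a baseline: whatever time it takes is pure read and parse cost, so any extra time in a real destination is the cost of storage.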
Here is the Voron implementation:

```csharp
using System.IO;
using Newtonsoft.Json;
// (Voron namespaces omitted)

public class VoronDestination : Destination
{
    private readonly StorageEnvironment _storageEnvironment;
    private WriteBatch _currentBatch;
    private readonly JsonSerializer _serializer = new JsonSerializer();
    private int counter = 1;

    public VoronDestination()
    {
        _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("freedb"));
        using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
        {
            _storageEnvironment.CreateTree(tx, "albums");
            _storageEnvironment.CreateTree(tx, "ix_artists");
            _storageEnvironment.CreateTree(tx, "ix_titles");
            tx.Commit();
        }
        _currentBatch = new WriteBatch();
    }

    public override void Accept(Disk d)
    {
        var ms = new MemoryStream();
        var writer = new StreamWriter(ms);
        _serializer.Serialize(new JsonTextWriter(writer), d);
        writer.Flush(); // push buffered JSON into the stream before rewinding it
        ms.Position = 0;

        // The album key is a simple incrementing integer.
        var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
        _currentBatch.Add(key, ms, "albums");

        // Index entries are lower-cased so searches are case insensitive.
        if (d.Artist != null)
            _currentBatch.MultiAdd(d.Artist.ToLower(), key, "ix_artists");
        if (d.Title != null)
            _currentBatch.MultiAdd(d.Title.ToLower(), key, "ix_titles");

        // Write the accumulated batch as a single transaction.
        if (counter % 1000 == 0)
        {
            _storageEnvironment.Writer.Write(_currentBatch);
            _currentBatch = new WriteBatch();
        }
    }

    public override void Done()
    {
        // Write whatever is left in the last, partial batch.
        _storageEnvironment.Writer.Write(_currentBatch);
    }
}
```

Let us go over this in detail, shall we? In the constructor, we create a new storage environment. In this case, we just want to import the data, so we can create the storage inline. Inside the transaction that follows, we create the relevant trees. You can think about Voron trees in much the same way you think about tables. They are a way to separate data into different parts of the storage. Note that all of this still resides in a single file, so there isn’t a physical separation.

Note that we created an albums tree, which will contain the actual data, and the ix_artists and ix_titles trees. Those are indexes into the albums tree. You can see them being used just a little lower.
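To illustrate what an index tree like ix_artists buys us, here is a small in-memory sketch of a multi-value index with case-insensitive prefix search. It is not Voron code: `PrefixIndex`, its `StartsWith` method and the sample band names in the usage note are assumptions made for illustration, and a sorted set stands in for the tree's ordered keys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-in for an index tree: each lower-cased key maps to the
// list of album keys that share it (the role MultiAdd plays in the import code).
public class PrefixIndex
{
    private readonly SortedSet<string> _keys = new SortedSet<string>(StringComparer.Ordinal);
    private readonly Dictionary<string, List<int>> _values =
        new Dictionary<string, List<int>>(StringComparer.Ordinal);

    public void MultiAdd(string key, int albumKey)
    {
        key = key.ToLowerInvariant();
        if (!_values.TryGetValue(key, out var list))
        {
            list = new List<int>();
            _values[key] = list;
            _keys.Add(key);
        }
        list.Add(albumKey);
    }

    // Returns the album keys of every index key that starts with the prefix.
    public List<int> StartsWith(string prefix)
    {
        prefix = prefix.ToLowerInvariant();
        // Under ordinal ordering, every string with this prefix sorts between
        // prefix and prefix + U+FFFF (assuming keys do not contain U+FFFF).
        var range = _keys.GetViewBetween(prefix, prefix + "\uffff");
        return range.SelectMany(k => _values[k]).ToList();
    }
}
```

With "Pearl Jam" stored under album key 1, `StartsWith("Pearl")` finds it while `StartsWith("Jam")` does not, which is the prefix-only behaviour described earlier: matching happens from the start of the key, not on individual words.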
In the Accept method, you can see that we use a WriteBatch, a native Voron notion that allows us to batch multiple operations into a single transaction. In this case, for every album, we are making 3 writes. First, we write all of the data, as a JSON string, into a stream and put it in the albums tree, using a simple incrementing integer as the actual album key. Then we add the artist and title entries (lower case, so we don’t have to worry about case sensitivity in searches) into the relevant indexes.

At 60 seconds, we had written 267,998 values to Voron. In fact, I explicitly designed it so we can see the relevant metrics. At 495 seconds we had read 1,995,385 entries from the FreeDB file, parsed 1,995,346 of them and written 1,610,998 to Voron. As you can imagine, each step runs in a dedicated thread, so we can see how they behave on an individual basis. The good thing about this is that I can physically see the various costs, which is actually pretty cool.

Here is the Voron directory at 60 seconds:

You can see that we have two journal files active (they haven’t been applied to the data file yet) and the db.voron file is at 512 MB. The compression buffer is at 32 MB (this is usually twice as big as the biggest transaction, uncompressed). The scratch buffer is used to hold in-flight transaction information (until we send it to the data file), and you can see it is sitting at 256 MB.

At 15 minutes, we have the following numbers: 3,035,452 entries read from the file, 3,035,426 parsed and 2,331,998 written to Voron. Note that we are reading the file and writing to Voron on the same disk, so that might impact the read performance.

At that time, we can see the following on the disk:

Note that we increase the size of most of our files by a factor of 2, so some of the space in the db.voron file is probably not used. Note also that we needed more scratch space to handle the in-flight information.

The entire process took 22 minutes, start to finish.
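The album key is an incrementing integer, and the import code encodes it with a big-endian converter. A plausible reason, assuming keys are ordered by byte-wise comparison (this post does not say so explicitly), is that big-endian bytes sort in the same order as the numbers themselves. The sketch below illustrates that property; `AlbumKeys` and its methods are hypothetical names, and `BinaryPrimitives` stands in for the converter used in the import code.

```csharp
using System;
using System.Buffers.Binary;

public static class AlbumKeys
{
    // Encodes the value as 4 bytes, most significant byte first.
    public static byte[] ToBigEndian(int value)
    {
        var buffer = new byte[4];
        BinaryPrimitives.WriteInt32BigEndian(buffer, value);
        return buffer;
    }

    // Encodes the value as 4 bytes, least significant byte first (for contrast).
    public static byte[] ToLittleEndian(int value)
    {
        var buffer = new byte[4];
        BinaryPrimitives.WriteInt32LittleEndian(buffer, value);
        return buffer;
    }

    // Byte-by-byte comparison, the way a memcmp-style key comparer orders keys.
    public static int CompareBytes(byte[] a, byte[] b)
    {
        int length = Math.Min(a.Length, b.Length);
        for (int i = 0; i < length; i++)
        {
            int diff = a[i].CompareTo(b[i]);
            if (diff != 0) return diff;
        }
        return a.Length.CompareTo(b.Length);
    }
}
```

For non-negative counters, comparing the big-endian encodings of 255 and 256 puts 255 first, while the little-endian encodings put 256 first. With big-endian keys, iterating the tree in key order would therefore also be insertion order.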
I have to note, though, that the import hasn’t been optimized at all, and I know we are doing a lot of stupid stuff throughout it. You might have noticed something else: we actually “crash” closed the Voron DB. This was done to see what would happen when we open a relatively large DB after an unclean shutdown.

We’ll actually get to play with the data in my next post. So far this has been pretty much just to see how things are behaving. And… I just realized something: I forgot to actually add an index on disk ID, which means that I have to import the data again. But before that, I also wrote the following:

```csharp
using System.IO;
using System.IO.Compression;
using Newtonsoft.Json;

public class JsonFileDestination : Destination
{
    private readonly GZipStream _stream;
    private readonly StreamWriter _writer;
    private readonly JsonSerializer _serializer = new JsonSerializer();

    public JsonFileDestination()
    {
        _stream = new GZipStream(
            new FileStream("freedb.json.gzip", FileMode.CreateNew, FileAccess.ReadWrite),
            CompressionLevel.Optimal);
        _writer = new StreamWriter(_stream);
    }

    public override void Accept(Disk d)
    {
        // One JSON document per line.
        _serializer.Serialize(new JsonTextWriter(_writer), d);
        _writer.WriteLine();
    }

    public override void Done()
    {
        // Flush buffered text, then close the gzip stream (and the file under it).
        _writer.Flush();
        _stream.Dispose();
    }
}
```

This completed in ten minutes for 3,328,488 entries, a rate of about 5,538 per second. The result is an 845 MB gzip file. I had two reasons for doing this. First, it gave me something to compare ourselves to. Second, and more to the point, I can reuse this gzip file for my next tests without having to go through the expensive parsing of the FreeDB file.
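Reusing that file means reading it back one JSON document per line. Here is a minimal sketch of the round trip, working against in-memory streams so it is self-contained; `GzipLines` and its method names are hypothetical, and the real reader would open the file written above instead of a `MemoryStream`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class GzipLines
{
    // Writes each entry on its own line into a gzip-compressed buffer.
    public static byte[] Write(IEnumerable<string> entries)
    {
        var ms = new MemoryStream();
        using (var gzip = new GZipStream(ms, CompressionLevel.Optimal))
        using (var writer = new StreamWriter(gzip))
        {
            foreach (var entry in entries)
                writer.WriteLine(entry);
        }
        // ToArray still works after the stream has been closed.
        return ms.ToArray();
    }

    // Lazily yields one line (one JSON document) at a time.
    public static IEnumerable<string> Read(Stream compressed)
    {
        using (var gzip = new GZipStream(compressed, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}
```

Because `Read` is an iterator, only one line is held in memory at a time, which matters when the uncompressed data is several gigabytes.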
I did just that and ended up with the following:

```csharp
using System.IO;
using Newtonsoft.Json.Linq;
// (Voron namespaces omitted)

public class VoronEntriesDestination : EntryDestination
{
    private readonly StorageEnvironment _storageEnvironment;
    private WriteBatch _currentBatch;
    private int counter = 1;

    public VoronEntriesDestination()
    {
        _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("freedb"));
        using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
        {
            _storageEnvironment.CreateTree(tx, "albums");
            _storageEnvironment.CreateTree(tx, "ix_diskids");
            _storageEnvironment.CreateTree(tx, "ix_artists");
            _storageEnvironment.CreateTree(tx, "ix_titles");
            tx.Commit();
        }
        _currentBatch = new WriteBatch();
    }

    // Returns the number of Voron entries written for this album.
    public override int Accept(string d)
    {
        var disk = JObject.Parse(d);

        var ms = new MemoryStream();
        var writer = new StreamWriter(ms);
        writer.Write(d);
        writer.Flush();
        ms.Position = 0;

        var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
        _currentBatch.Add(key, ms, "albums");
        int count = 1; // the album document itself

        foreach (var diskId in disk.Value<JArray>("DiskIds"))
        {
            count++;
            _currentBatch.MultiAdd(diskId.Value<string>(), key, "ix_diskids");
        }

        var artist = disk.Value<string>("Artist");
        if (artist != null)
        {
            count++;
            _currentBatch.MultiAdd(artist.ToLower(), key, "ix_artists");
        }

        var title = disk.Value<string>("Title");
        if (title != null)
        {
            count++;
            _currentBatch.MultiAdd(title.ToLower(), key, "ix_titles");
        }

        if (counter % 100 == 0)
        {
            _storageEnvironment.Writer.Write(_currentBatch);
            _currentBatch = new WriteBatch();
        }
        return count;
    }

    public override void Done()
    {
        _storageEnvironment.Writer.Write(_currentBatch);
        _storageEnvironment.Dispose();
    }
}
```

Now we are actually properly disposing of things, and I also decreased the size of the batch, to see how it would respond. Note that it is now being fed directly from the gzip file, at a greatly reduced cost.

I also added tracking not only for how many albums we write, but also for how many entries. By entries I mean Voron entries, which include the values we add to the indexes.
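Both destinations follow the same batching pattern: accumulate writes, flush once a threshold is reached, and flush whatever remains at the end. Here is a generic sketch of that pattern, independent of Voron; `Batcher<T>` and its members are hypothetical names, and the flush callback stands in for the call that writes the batch to storage.

```csharp
using System;
using System.Collections.Generic;

// Collects items and hands them to a flush callback once batchSize is reached.
public class Batcher<T>
{
    private readonly int _batchSize;
    private readonly Action<IReadOnlyList<T>> _flush;
    private List<T> _current = new List<T>();

    public Batcher(int batchSize, Action<IReadOnlyList<T>> flush)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        _batchSize = batchSize;
        _flush = flush ?? throw new ArgumentNullException(nameof(flush));
    }

    public void Add(T item)
    {
        _current.Add(item);
        if (_current.Count >= _batchSize)
            Flush();
    }

    // Call once at the end so a partial final batch is not lost.
    public void Done()
    {
        if (_current.Count > 0)
            Flush();
    }

    private void Flush()
    {
        _flush(_current);
        _current = new List<T>(); // start a fresh batch, like a new write batch
    }
}
```

The batch size is the knob being turned between the two imports above: larger batches mean fewer, bigger transactions, while smaller ones bound how much data is in flight at once.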
I did find a bug where we would just double the file size without due consideration to its size, so now we are doing smaller file size increases.

Word of warning: I didn’t realize until after I was done with all the benchmarks that I had actually run all of them in debug configuration, which basically means that they are utterly useless as a performance metric. That is especially true because we have a lot of verifier code that runs in debug mode. So please don’t take those numbers as actual performance metrics; they aren’t valid.

| Time | # of albums | # of entries |
|------|-------------|--------------|
| 4 minutes | 773,398 | 3,091,146 |
| 6 minutes | 1,126,998 | 4,504,550 |
| 8 minutes | 1,532,858 | 6,126,413 |
| 18 minutes | 2,781,698 | 11,122,799 |
| 24 minutes | 3,328,488 | 13,301,496 |

The status of the file system midway during the run:

You can see that we now increase the file in smaller increments, and that we are using more scratch space, probably because we are under a very heavy write load.

After the run:

Scratch and compression files are only used while the database is running, and are deleted on close. The database is 7 GB in size, which is quite respectable.

Now, on to working with it. But I’ll save that for my next post; this one is long enough already.
February 20, 2014
Comments
Sep 28, 2022 · Marcy Tillman
I don't really like that approach, because where the data is sitting is a critical aspect.
If you are putting the data behind an API call, that means that a lot of functionality is lost.
You can't do your own filtering or aggregation, can't join between your own data and published one, etc. It also ties your own availability to the API endpoint. And eventually, you create a true mesh, where a single node failing brings down the whole system.
Independent and isolated pieces are far healthier in my eyes.
Apr 29, 2019 · Jordan Baker
No app should have thousands of operations.
You need to split it into independent pieces a lot earlier than that.
Apr 15, 2019 · Jordan Baker
You can host them all in a single database server, as long as they are separated, no issue there.
And yes, that is practical, because you aren't going to have to deal with all of them all the time. That is the point of creating this level of isolation.
Nov 11, 2018 · Michael_Gates
In a web API, for example, you have the basic building blocks (the controllers, for example), but the architecture of the entire solution isn't set or known.
That is also a blank slate from my point of view.
Dec 29, 2017 · Duncan Brown
That puts a lot of the responsibility on the client.
Better to have something like a shared phone number where you can leave a voice message for the couple, and they will handle it internally.
In fact, voice mail is a really good analogy. You call the _house_, not the person, and if they pick up, they handle it as usual.
Otherwise, you leave a message and it is handled (according to internal policies) as needed.
Apr 12, 2017 · Sarah Davis
The issue is that while we want to increase our resource utilization, that is true only up to a point, because we want to let other things happen while we are doing this.
There are also some hard limits, because the disk can only accept the data at a certain rate anyway.
That means that increasing the compression speed will actually cause us to get into a traffic jam in front of the disk.
A better alternative for us was to reduce the compression ratio, which allows us to play with the amount of time that we'll spend compressing depending on how fast the disk is.
Apr 12, 2017 · Sarah Davis
Transaction merging is basically taking all the current operations and running them in a single batch. This gives us the ability to avoid hitting the disk with independent operations, and only hit the disk in a big batch.
This applies both to the execution of the transaction itself (which contains operations from multiple sources) and to the writing of the data to the journal.
Writing the data to the data file is handled by another part of the software, and is also amortized across transactions.