DBM-style flat files are great, but sometimes it’s hard to deal with binary formats. Having the data in a textual format can be very handy for transferring it between platforms, modifying it in your favorite editor, crunching it with standard tools like grep, and so on. There are a couple of text-friendly DBM export formats out there, each with advantages and disadvantages.
Suppose you have a DBM hash that contains two key/value pairs like the following:
one=Hello
two=Goodbye
Good ol’ Berkeley DB comes with a utility called db_dump, which lets you export a binary database to a text format; the companion db_load tool imports the data again. It’s easiest to see your data when you use the -p option. Here’s the dump of that simple two-record database:
format=print
type=hash
h_nelem=5
db_pagesize=512
HEADER=END
one
Hello
two
Goodbye
db_dump is pretty easy to use as it is, but it becomes a little cumbersome when you’ve got non-printable characters to display (control characters, newlines, and anything that isn’t 7-bit clean). Each such byte is written as a backslash followed by two hex digits, so you end up with a dump that looks like this:
Erev Pesach
\d7\a2\d6\b6\d7\a8\d6\b6\d7\91 \d7\a4\d6\bc\d6\b6\d7\a1\d6\b7\d7\97
Tu B'Shvat
\d7\98\d7\95\d6\bc \d7\91\d6\bc\d6\b4\d7\a9\d7\81\d6\b0\d7\91\d6\b8\d7\98
Bamidbar
\d7\91\d6\bc\d6\b0\d7\9e\d6\b4\d7\93\d6\b0\d7\91\d6\bc\d6\b7\d7\a8
You don’t lose any information, but it becomes impossible to work with when you’ve got UTF-8 data and you want to be able to edit it in your favorite Unicode-savvy editor.
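If you just want to read such a dump, the escapes are easy to undo. Here’s a rough Python sketch, assuming the stored bytes are UTF-8 and that a literal backslash is dumped as two backslashes. The function names are my own, not part of Berkeley DB, and the leading-space handling is a guess at coping with dumps that indent each field.

```python
import re

# db_dump -p writes a non-printable byte as a backslash plus two hex digits,
# and (assumed here) a literal backslash as two backslashes.
_ESCAPE = re.compile(rb"\\(\\|[0-9a-fA-F]{2})")

def decode_printable(field: str) -> str:
    """Turn one key or value from a `db_dump -p` dump back into readable text."""
    def unescape(match):
        token = match.group(1)
        return b"\\" if token == b"\\" else bytes([int(token, 16)])
    raw = _ESCAPE.sub(unescape, field.encode("latin-1"))
    return raw.decode("utf-8")  # assumption: the database holds UTF-8 text

def parse_dump(text: str) -> dict:
    """Read the alternating key/value lines that follow HEADER=END."""
    lines = text.splitlines()
    body = lines[lines.index("HEADER=END") + 1:]
    if body and body[-1] == "DATA=END":
        body = body[:-1]
    # Some dumps indent every field with one space; tolerate either form.
    fields = [decode_printable(l[1:] if l.startswith(" ") else l) for l in body]
    return dict(zip(fields[0::2], fields[1::2]))
```

Feeding the first Hebrew value above through decode_printable() gives back the five code points of the word it encodes. Going the other way (editing the decoded text and re-escaping it for db_load) is left out of this sketch.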
Perl hackers are probably familiar with Data::Dumper, which looks like this:
$VAR1 = {
'one' => 'Hello',
'two' => 'Goodbye'
};
Data::Dumper is easier than db_dump to use with your favorite text-centric tools, and it has the advantage that it keeps each key/value pair together on the same line (handy for grep). Unfortunately, it’s very Perl-centric; the intended way to load the data back is to call eval() on it. I suppose you could pretty easily write a parser in C that understood the format, and then use it in non-Perl programs.
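To give a feel for how small such a parser could be, here’s a sketch in Python rather than C. It only handles the simple case shown above: a flat hash whose keys and values are all single-quoted strings. Nested structures, unquoted numbers, and double-quoted strings are out of scope, and parse_flat_dumper is a made-up name.

```python
import re

# One 'key' => 'value' pair; inside single quotes a backslash may escape
# the next character, so a quoted quote does not end the string early.
_PAIR = re.compile(
    r"'((?:[^'\\]|\\.)*)'\s*=>\s*'((?:[^'\\]|\\.)*)'",
    re.DOTALL,
)

def _unquote(text: str) -> str:
    """Undo the two escapes Perl honors in single-quoted strings."""
    return re.sub(r"\\([\\'])", r"\1", text)

def parse_flat_dumper(text: str) -> dict:
    """Extract key/value pairs from a flat, single-quoted Data::Dumper hash."""
    return {_unquote(k): _unquote(v) for k, v in _PAIR.findall(text)}
```

Run against the two-record dump above, this should give back a dictionary mapping one to Hello and two to Goodbye.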
On one of the mailing lists at work today someone mentioned the cdb constant database format. I took a look at the page and was amused to see the cdbdump record format. It’s an interesting alternative to db_dump’s format and works nicely with UTF-8.
+3,5:one->Hello
+3,7:two->Goodbye
It’s a pretty concise format, and it’s totally 8-bit friendly. Each record starts with the byte lengths of the key and the data, so both may contain any characters, including colons, dashes, newlines, and nulls. That makes it very easy to write generators and parsers for this format, and they’re typically very efficient. Like Data::Dumper, it keeps key/value together on the same line (at least when the data itself contains no newlines).
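Here’s roughly what a generator and parser might look like in Python. Treat it as a sketch: the function names are mine, it works on byte strings, and it assumes the dump ends with a blank line after the last record.

```python
import io

def write_records(records, out):
    """Write (key, data) byte pairs in the +klen,dlen:key->data form."""
    for key, data in records:
        out.write(b"+%d,%d:%s->%s\n" % (len(key), len(data), key, data))
    out.write(b"\n")  # a blank line marks the end of the records

def _read_int(stream, terminator):
    """Read decimal digits up to (and including) the terminator byte."""
    digits = b""
    while (ch := stream.read(1)) != terminator:
        if not ch.isdigit():
            raise ValueError("malformed length field")
        digits += ch
    return int(digits)

def read_records(stream):
    """Yield (key, data) byte pairs from a binary stream in cdbdump format."""
    while stream.read(1) == b"+":
        klen = _read_int(stream, b",")
        dlen = _read_int(stream, b":")
        key = stream.read(klen)  # lengths, not delimiters, bound the fields
        if stream.read(2) != b"->":
            raise ValueError("expected '->' after key")
        data = stream.read(dlen)
        if stream.read(1) != b"\n":
            raise ValueError("expected newline after data")
        yield key, data
```

Because the parser never scans for delimiters inside the key or the data, a value containing "->", a newline, or a NUL byte round-trips without any escaping.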
One disadvantage of the cdbdump format is that it uses explicit integer lengths, so it’s not very friendly for editing data in a text editor (every change you make requires that you fix up the lengths at the beginning of the line).
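A small helper can take some of the sting out of that. The sketch below recomputes the lengths of one hand-edited line. It leans on assumptions the format itself doesn’t guarantee: the record fits on one line, the key contains no "->", and the file is UTF-8. fix_lengths is a hypothetical name, not a cdb tool.

```python
def fix_lengths(line: str) -> str:
    """Recompute the +klen,dlen prefix of one hand-edited cdbdump line.

    Assumes a single-line record, no "->" inside the key, and UTF-8 text.
    """
    # Drop the stale "+klen,dlen" prefix, which ends at the first colon.
    _, _, record = line.rstrip("\n").partition(":")
    key, arrow, data = record.partition("->")
    if not arrow:
        raise ValueError("no '->' separator found")
    klen = len(key.encode("utf-8"))   # lengths count bytes, not characters
    dlen = len(data.encode("utf-8"))
    return f"+{klen},{dlen}:{key}->{data}"
```

For example, after changing Hello to "Hello, world" in the first record, fix_lengths("+3,5:one->Hello, world") returns "+3,12:one->Hello, world".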