-rw-r--r-- 2091 cdb-20251021/doc/format.md raw
The text below is copied from the original 19960914 document "A structure for constant databases" by D. J. Bernstein. Other explanations of the format are online from [Yusuke Shinyama](https://www.unixuser.org/~euske/doc/cdbinternals/index.html), [Richard James Howe](https://github.com/howerj/cdb), and [Wikipedia](https://en.wikipedia.org/wiki/Cdb_(software)). The cdb64 format uses 64-bit quantities stored in 8 bytes; conceptually this can handle 16 exabytes, but the current code limits it to 1 exabyte. <hr> A cdb is an associative array: it maps strings (``keys'') to strings (``data''). A cdb contains 256 pointers to linearly probed open hash tables. The hash tables contain pointers to (key,data) pairs. A cdb is stored in a single file on disk: +----------------+---------+-------+-------+-----+---------+ | p0 p1 ... p255 | records | hash0 | hash1 | ... | hash255 | +----------------+---------+-------+-------+-----+---------+ Each of the 256 initial pointers states a position and a length. The position is the starting byte position of the hash table. The length is the number of slots in the hash table. Records are stored sequentially, without special alignment. A record states a key length, a data length, the key, and the data. Each hash table slot states a hash value and a byte position. If the byte position is 0, the slot is empty. Otherwise, the slot points to a record whose key has that hash value. Positions, lengths, and hash values are 32-bit quantities, stored in little-endian form in 4 bytes. Thus a cdb must fit into 4 gigabytes. A record is located as follows. Compute the hash value of the key in the record. The hash value modulo 256 is the number of a hash table. The hash value divided by 256, modulo the length of that table, is a slot number. Probe that slot, the next higher slot, and so on, until you find the record or run into an empty slot. The cdb hash function is ``h = ((h << 5) + h) ^ c'', with a starting hash of 5381.