Float keys are unsupported, partly because I have not needed them, and because their use can roughly be emulated with int(f * 1e9) or similar. But mainly it is to defer a choice: should floats order alongside integers? If not, then sorting won’t behave like SQL or Python, causing user surprise. If so, should integers also be treated as floats? Treating them as floats means keys always decode to float, causing surprise; keeping them distinct means a new encoding is needed, wasting ~2 bytes (terminator, discriminator).
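The emulation mentioned above might look like the following sketch; SCALE and the helper names are illustrative choices, not part of the library:

    SCALE = 10**9  # application-chosen precision: 9 decimal places

    def float_to_key(f):
        # Monotonic over the supported range, so sort order is
        # preserved; precision beyond 1/SCALE is silently lost.
        return int(round(f * SCALE))

    def key_to_float(i):
        return i / float(SCALE)

    assert float_to_key(1.5) == 1500000000
    assert key_to_float(1500000000) == 1.5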
Another option is always treating numbers as float, but converting to int during decode when the value can be represented exactly. This may be less surprising, since an int will coerce to float during arithmetic, but it invites once-per-decade bugs: depending on a database key, the expression 123 / db_val might perform integer or float division.
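Concretely, under Python 2 division semantics (a contrived illustration; db_val stands for a value decoded from a key):

    db_val = 2              # key decoded as an int
    result = 123 / db_val   # == 61 (floor division between two ints)

    db_val = 2.0            # the "same" key decoded as a float
    result = 123 / db_val   # == 61.5 (true division)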
A final option is adding a number_factory= parameter to unpacks(), which still requires picking a good default.
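A sketch of how such a parameter might behave; unpack_number() and the default shown here are illustrative, not the library’s actual decoder:

    def default_number_factory(f):
        # Decode every numeric key element as float, demoting to int
        # when the value is exactly integral.
        i = int(f)
        return i if i == f else f

    def decode_number(raw, number_factory=default_number_factory):
        f = unpack_number(raw)  # hypothetical wire-format decoder
        return number_factory(f)

A caller preferring SQL-like uniformity could then pass number_factory=float, making the surprise opt-in rather than the default.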
Keys composed of a single value have much the same trade-offs and problems as floats: either a heuristic is employed that always treats 1-tuples as single values, leading to user surprise in some cases and ugly logic when writing generic code, or a byte is wasted on every single-valued key.
In the non-heuristic case, further problems emerge: if the user calls get(1), should it return the same result as get((1,))? If yes, two lookups may be required.
If no, then another problem emerges in statically typed languages: given a Tuple type representing the key tuple, every interface dealing with keys must be duplicated for the single-valued case. Meanwhile the same lookup and comparison problems from the dynamic-language case still apply.
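To make the lookup question concrete, a compatibility shim in the non-heuristic case might look like this sketch; get() returning None for a missing key is an assumption for illustration:

    def get_compat(coll, key):
        # Accept both get(1) and get((1,)): try the bare encoding
        # first, then fall back to the 1-tuple encoding -- which
        # potentially costs two lookups.
        if not isinstance(key, tuple):
            value = coll.get(key)
            if value is not None:
                return value
            key = (key,)
        return coll.get(key)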
Another option is to make the key encoding configurable: this would allow non-tuple keys at some cost to convenience, but would also enable extra uses. For example, a pure-integer key encoding could efficiently represent a Collection as an SQL table by leveraging the OID type, or could exactly emulate the sort order of other databases (e.g. App Engine).
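One hypothetical shape for such a configurable encoding, here a pure-integer codec suited to mapping a Collection onto an SQL table keyed by OID; the codec interface is an assumption, not an existing API:

    import struct

    class IntKeyCodec(object):
        """Hypothetical codec permitting only single non-negative
        integer keys below 2**64."""
        def pack(self, key):
            # Fixed-width big-endian: byte order matches numeric order.
            return struct.pack('>Q', key)

        def unpack(self, data):
            return struct.unpack('>Q', data)[0]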
Several difficulties arise with parameterizing key encoding. Firstly, Index relies on acid.keylib.packs() to function. One solution might be to parameterize Index’s key construction, or to force key encodings to accept lists of keys as part of their interface. A second issue is that the ‘innocence’ of the key encoding might be needed to implement prefix= queries robustly.
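That ‘innocence’ amounts to a structural property: packing a prefix of a key tuple must yield a byte-string prefix of the full packed key, otherwise prefix= cannot be answered with a simple range scan over the encoded form. One way to state the property as a check, where pack is any candidate encoder:

    def is_innocent(pack):
        # For every tuple, the encoding of each of its prefixes must
        # be a byte prefix of the encoding of the whole tuple.
        samples = [(1,), (1, 2), (1, 2, 'x'), ('a', 'ab')]
        for tup in samples:
            for i in range(1, len(tup)):
                if not pack(tup).startswith(pack(tup[:i])):
                    return False
        return True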
Another option would be to allow alternative key encodings for Collection only, with the restriction that keys must still always be tuples in order to remain compatible with Index. Needs further consideration.
Storing data isn’t hard: it has effectively been solved since at least 1972, when the B-tree appeared, variants of which comprise the core of SQLite, MongoDB, and 90% of all DBMS wheel reinventions since. So this library is the product of frustration. While experimenting with compression for a hobby project in late 2012, I again partially implemented what this library should be: a small layer that implements indexing, and gently placates the use of Cold War-era technology.
The library is experimental, but eventually it should become a small, convenient way to store data for programs with medium-sized datasets. Already, with a suitable engine, it can offer better durability guarantees and vastly better performance than larger, more established systems such as MongoDB. Coupled with LmdbEngine it is even possible to make consistent online backups without resorting to platform tricks, unlike MongoDB.
With carefully selected primitives, more of the problem domain could be supported. For instance, explicit sharding and replication are interesting, and neither feature should require a 300kLOC codebase to implement, or even a 3kLOC codebase. By removing complexity from the task of persisting data, greater room is left to ponder legitimately hard problems, such as serving an application’s data after it outgrows a single computer or magically partitioned DBMS cluster.
By pushing the DBMS into the application, numerous layers of indirection are removed from the lifecycle of a typical request. For example:
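In the traditional arrangement a lookup crosses a connection pool, a socket, and the server’s parser and executor before any data is touched; in-process, it is a function call. The sketch below is illustrative only: pool is a psycopg2-style connection pool, people a Collection, and the schema an invention for the example.

    # Networked SQL: every step below is a layer of indirection.
    conn = pool.getconn()                 # borrow a pooled connection
    cur = conn.cursor()
    cur.execute('SELECT doc FROM people WHERE id = %s', (123,))
    row = cur.fetchone()                  # parse, plan, execute, serialize
    pool.putconn(conn)

    # In-process: with something like LmdbEngine, the same lookup
    # resolves against a memory-mapped file in the application's own
    # address space.
    doc = people.get(123)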
Compared to SQL there are, of course, numerous downsides to this approach; however, they become less relevant as an application’s complexity increases:
Ability to run ad-hoc reporting queries and off-the-shelf reporting tools. In a regular application, this capability is eventually lost as the application’s data set exceeds a single SQL database, and as performance constraints prevent running large batch queries against production databases.
Ability to communicate with a self-consistent database over the network for free. This flexibility is more useful in the design phase than in a production application. Once a data set exceeds a single SQL database, consistency checking must be moved into the application layer, at which point the only sane design is to query and modify the dataset through an application-layer RPC interface, to avoid the risk of causing data inconsistencies.
Ability to run complex queries without forethought. This is another early-phase benefit that degrades over time: once a data set exceeds a single database, JOIN capability is lost, and long before that, most complex query features must be abandoned as dataset size and request load increase (e.g. table scans must be avoided).
Ability to provision application and database load separately. Depending on the application, many data-independent aspects can be split off into separate services; for example, in a Flickr-style service, image thumbnailing and resizing can move to a task queue.
For any slow data-dependent requests that must be exposed to the user, an SQL database suffers the same inability to spread load: an application making a complicated set of queries will generate load on both the app server and the database server.
Ability to set up “free replication”. For the time being this is a real show-stopper; however, a future version of Acid may supply primitives to address it in a more scale-agnostic manner than is possible with a traditional SQL database.
Well-understood approaches to schema migration. We can potentially do better by avoiding SQL here, since adding or removing a field need not necessitate touching every record. Some migrations can be “free of charge” (as sketched below), while others, such as adding an index, may require more effort. See ticket 52.
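Because records are decoded by the application, the “free” case can be handled lazily at read time; the record shape and field names below are purely illustrative:

    def upgrade_record(rec):
        # Migrate old records as they are read: adding or dropping a
        # field costs a default at decode time, not a rewrite of
        # every stored record.
        rec.setdefault('email_verified', False)   # field added later
        rec.pop('legacy_flags', None)             # field since removed
        return rec

    user = upgrade_record(people.get(123))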
Various design aspects were discussed in a series of blog posts: