💡 This is an early preview version, subject to breaking changes.
Shoji is a tensor database, suitable for storing and working with very large-scale datasets organized as vectors, matrices and higher-dimensional tensors.
- Multi-petabyte scalable, distributed, high-performance database
- Data modelled as N-dimensional tensors with boolean, string or numeric elements
- Supports both regular and jagged tensors
- Automatic chunking and compression
- Relationships expressed through shared named dimensions
- Read and write data through views created by powerful filter expressions
- Automatic indexing for fast filtering
- Data safety through transactions and ACID properties (atomicity, consistency, isolation, durability)
- Concurrent read/write access
- Elegant, convenient Python API, aligned with NumPy
Oh, and it's pretty fast.
In Shoji, data is stored as tensors, and relationships are expressed using shared dimensions.
Dimensions can be named, and named dimensions express relationships and constraints between tensors. Tensors that share a named dimension must have the same length along that dimension (and this relationship is enforced when adding data).
You can think of rows as your data objects, dimensions as object types, and the tensors as object attributes. For example, a set of vectors (e.g. SampleID, Age, Tissue, Date) defined on a samples dimension could be seen as the attributes of samples, and an individual sample would correspond to an individual row across all tensors.
Tensors can also be related to multiple named dimensions. For example, omics data (e.g. gene expression) is often represented as matrices, which can be represented in Shoji as rank-2 tensors with two named dimensions, e.g. cells and genes. Metadata about cells and genes would be stored as rank-1 tensors (vectors) along the cells and genes dimensions, respectively. Similarly, multichannel timelapse image data can be represented as high-rank tensors with dimensions such as x, y, channel, and timepoint. This makes Shoji fundamentally different from tabular (relational) databases, which struggle to represent multidimensional data.
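The shared-dimension constraint described above can be illustrated with a toy in-memory model. This is a sketch of the idea only, not Shoji's API: the class ToyWorkspace and its methods are hypothetical names, plain Python lists stand in for tensors, and only the first dimension of each tensor is modelled.

```python
class ToyWorkspace:
    """Toy model: each tensor is a list of rows tied to one named dimension."""

    def __init__(self):
        self.dims = {}      # dimension name -> current length
        self.tensors = {}   # tensor name -> (dimension name, list of rows)

    def add_tensor(self, name, dim, rows):
        rows = list(rows)
        # Enforce the constraint: tensors sharing a named dimension
        # must have the same length along it.
        expected = self.dims.get(dim)
        if expected is not None and expected != len(rows):
            raise ValueError(
                f"{name!r} has {len(rows)} rows along {dim!r}, expected {expected}"
            )
        self.dims[dim] = len(rows)
        self.tensors[name] = (dim, rows)

    def row(self, dim, index):
        """Return one 'object': the same index taken across all tensors on dim."""
        return {
            name: rows[index]
            for name, (d, rows) in self.tensors.items()
            if d == dim
        }


ws = ToyWorkspace()
ws.add_tensor("SampleID", "samples", ["s1", "s2", "s3"])
ws.add_tensor("Age", "samples", [31, 45, 27])
ws.add_tensor("Tissue", "samples", ["brain", "liver", "brain"])

# One sample is one row across all tensors on the samples dimension.
second_sample = ws.row("samples", 1)
```

Here second_sample collects the SampleID, Age and Tissue values of the second sample, and adding a tensor of a different length along samples raises a ValueError.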
The fundamental operations in Shoji are creating a tensor, appending values, reading values, and updating values. Tensors can be deleted, but individual tensor rows cannot.
Shoji treats the slice as the atomic unit when writing data. This means that if your program crashes in the middle of an operation, you are guaranteed that there will be no half-created rows, or partially updated elements in the database.
When several tensors share their first dimension, the atomic unit for writing new data (i.e. for Dimension.append()) is a slice across all tensors that share that first dimension. In other words, if your program crashes in the middle of an append() operation, Shoji guarantees that some number of complete rows (or nothing at all) will have been written across all the relevant tensors, ensuring that they stay in sync.
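The all-or-nothing behaviour of an append across several tensors can be sketched in plain Python. This is an illustration of the validate-then-commit idea under simplifying assumptions, not Shoji's implementation: append_rows is a hypothetical helper, lists stand in for tensors, and real crash safety comes from database transactions rather than from in-process checks.

```python
def append_rows(tensors, new_values):
    """Append rows to every tensor on a dimension, or to none of them.

    tensors:    dict of name -> list of rows (all share their first dimension)
    new_values: dict of name -> list of new rows to append
    Returns the number of rows appended.
    """
    # Validate everything before touching any tensor.
    if set(new_values) != set(tensors):
        raise ValueError("values must be supplied for every tensor on the dimension")
    lengths = {len(rows) for rows in new_values.values()}
    if len(lengths) != 1:
        raise ValueError("all tensors must receive the same number of new rows")
    # Commit: only reached when every check has passed.
    for name, rows in new_values.items():
        tensors[name].extend(rows)
    return lengths.pop()


samples = {"SampleID": ["s1"], "Age": [31]}
n_added = append_rows(samples, {"SampleID": ["s2", "s3"], "Age": [45, 27]})
```

A call that supplies a different number of rows for each tensor is rejected before anything is written, so the tensors never fall out of sync.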
If you need stronger guarantees, you can wrap multiple database operations in a shoji.transaction.
Shoji is built on FoundationDB, a powerful open-source key-value store developed by Apple. FoundationDB gives Shoji a solid foundation of performance, scalability and ACID guarantees. These features come with a few limitations:
- Transactions cannot exceed 5 seconds. If a transaction takes longer, it is terminated and rolled back. This limits the feasible size of a slice (or of a set of rows for append operations), since Shoji reads and writes slices transactionally.
- Transactions exceeding 1 MB can cause performance issues, and transactions cannot exceed 10 MB. This also limits the feasible size of a tensor slice, since Shoji reads and writes slices transactionally to ensure consistency.
- FoundationDB is optimized to run on SSDs. Running on mechanical disks is discouraged.
For more details about these and some other limitations, see the FoundationDB docs.
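The transaction size limits suggest a simple back-of-the-envelope calculation when planning large appends: estimate the bytes per row and split the data into batches that stay near the 1 MB mark. The sketch below is an assumption-laden illustration, not part of Shoji: rows_per_batch and batches are hypothetical helpers, megabytes are taken as decimal (1,000,000 bytes), and compression and key overhead are ignored. Whether Shoji already splits writes internally is not covered here.

```python
MAX_TRANSACTION_BYTES = 10_000_000  # hard limit quoted above (10 MB)
TARGET_BYTES = 1_000_000            # above this, performance may suffer (1 MB)


def rows_per_batch(bytes_per_row, target_bytes=TARGET_BYTES):
    """Number of rows that fit in one batch of roughly target_bytes."""
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    if bytes_per_row > MAX_TRANSACTION_BYTES:
        raise ValueError("a single row already exceeds the transaction limit")
    # Always write at least one row, even if it is larger than the target.
    return max(1, target_bytes // bytes_per_row)


def batches(n_rows, batch_size):
    """Yield (start, end) index pairs covering n_rows in steps of batch_size."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


# Example: each row holds 20,000 float32 values (4 bytes each).
bytes_per_row = 20_000 * 4                   # 80,000 bytes
batch_size = rows_per_batch(bytes_per_row)   # 12 rows per batch
spans = list(batches(30, batch_size))        # [(0, 12), (12, 24), (24, 30)]
```

Each (start, end) pair could then be written as a separate append, keeping every write well under the hard limit.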
Shoji requires Python 3.7+ (we recommend Anaconda)
First, install the FoundationDB client:
- Double-click on FoundationDB-6.#.##.pkg and follow the instructions
💡 If you get a security error (“FoundationDB-6.2.27.pkg” cannot be opened because it is from an unidentified developer) go to Settings → Security & Privacy → General and click on Open Anyway and then on Open (in the dialog).
Next, in your terminal, install the foundationdb and shoji Python packages:
$ pip install foundationdb
$ git clone https://github.com/linnarsson-lab/shoji.git
$ pip install -e shoji
Check that you can now connect to the database:
import shoji
db = shoji.connect()
db
Evaluating db on its own, as in the last line above, should return a representation of the contents of the database (which might be empty at this point).
Clone the repository, and then go to the shoji/html folder to browse the docs.
Shoji is based on FoundationDB. You can easily set up a local Shoji database by following these instructions:
- Double-click on FoundationDB-6.#.##.pkg and follow the instructions
💡 If you get a security error (“FoundationDB-6.2.27.pkg” cannot be opened because it is from an unidentified developer) go to Settings → Security & Privacy → General and click on Open Anyway and then on Open (in the dialog).
- In your Terminal, type fdbcli and then status to confirm that the database is up and running
- Still in fdbcli, type configure ssd to change the storage engine to ssd-2
- After a few minutes, status should again show as Healthy
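The status check in the steps above can also be scripted. The following is a minimal sketch, assuming fdbcli is on your PATH and that a healthy cluster reports the word Healthy in its status output, as described above. The helper names is_healthy and fdb_status are hypothetical, and the exact layout of the status report is not assumed.

```python
import shutil
import subprocess


def is_healthy(status_text):
    """True if the status report mentions 'Healthy' (case-sensitive)."""
    return "Healthy" in status_text


def fdb_status(timeout=30):
    """Run `fdbcli --exec status` and return its output.

    Returns None if fdbcli cannot be found on PATH.
    """
    if shutil.which("fdbcli") is None:
        return None
    result = subprocess.run(
        ["fdbcli", "--exec", "status"],
        capture_output=True,   # requires Python 3.7+
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Calling fdb_status() and passing a non-None result to is_healthy() gives a quick yes/no answer, which can be handy while waiting for the storage engine change to complete.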
pip install -U foundationdb
git clone https://github.com/linnarsson-lab/shoji.git
cd shoji
pip install -e .
>>> import shoji
>>> db = shoji.connect()
>>> db
(root) (shoji.Workspace)
Documentation: shoji/html/shoji/index.html
FDB cluster file: /usr/local/etc/foundationdb/fdb.cluster
FDB config: /usr/local/etc/foundationdb/foundationdb.conf
Data: /usr/local/foundationdb/