2D data table source files

In the dataset source folder, a subfolder named 2D_datatables should be present. This is the root for a set of folders, each one describing an individual 2D data table, with the name of the folder serving as the table's identifier.

In each 2D data table folder, a zarr DirectoryStore named data.zarr should be present, containing the properties arrays (example file).

In addition, a YAML settings file should be present in the 2D data table folder (see 2D Datatable settings).

Zarr source file structure

The source DirectoryStore data.zarr may contain the following arrays:

Properties arrays
One or more arrays specifying properties of the 2D data table. These arrays can be 3D, but the first two dimensions should be row and column, in that order.
Column index 1D array
A 1D array listing the identifiers of all columns, in the order they are used in the properties arrays.
Row index 1D array
A 1D array listing the identifiers of all rows, in the order they are used in the properties arrays.
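
The relationship between the index arrays and the properties arrays can be checked before import. The following is a minimal sketch, not part of the import tooling: the helper name check_table_shapes is hypothetical, and it assumes only what is stated above, namely that every properties array has at least two dimensions, with rows first and columns second.

```python
def check_table_shapes(property_shapes, n_rows, n_cols):
    """Return a list of problems; an empty list means the layout is consistent.

    property_shapes maps array name -> shape tuple (e.g. some_zarr_array.shape).
    n_rows and n_cols are the lengths of the row and column index arrays.
    (Hypothetical helper, written for illustration only.)
    """
    problems = []
    for name, shape in property_shapes.items():
        if len(shape) < 2:
            problems.append(f"{name}: needs at least 2 dimensions, got {len(shape)}")
            continue
        # First dimension is the row axis, second is the column axis.
        if shape[0] != n_rows:
            problems.append(f"{name}: {shape[0]} rows, but row index has {n_rows}")
        if shape[1] != n_cols:
            problems.append(f"{name}: {shape[1]} columns, but column index has {n_cols}")
    return problems


shapes = {"call": (1000, 10, 2), "allele_depth": (1000, 10, 3), "quality": (1000, 10)}
print(check_table_shapes(shapes, n_rows=1000, n_cols=10))  # prints []
```

Passing shapes rather than the arrays themselves keeps the check independent of how the store was opened.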

Only scalar built-in dtypes (i.e. not structured with fields, and not user-defined) or strings are currently permitted for zarr arrays.
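
One way to screen arrays against this rule before writing them is sketched below. It is a conservative reading of the restriction, not the importer's own check: the helper name is_permitted_dtype is hypothetical, and the set of accepted NumPy kinds (booleans, integers, floats, byte strings and unicode strings) is an assumption.

```python
import numpy as np


def is_permitted_dtype(dtype):
    """Hypothetical helper: True for plain scalar or string dtypes."""
    dt = np.dtype(dtype)
    if dt.fields is not None:
        # Structured dtype with named fields: not permitted.
        return False
    # Assumed accepted kinds: b=bool, i/u=integers, f=float,
    # S=byte string, U=unicode string.
    return dt.kind in "biufSU"


for candidate in ["i1", "f8", "S10", "U5", [("a", "i4"), ("b", "f8")]]:
    print(candidate, is_permitted_dtype(candidate))
```

The last candidate is a structured dtype with two fields, which this check rejects.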

Example Python code for creating the zarr store:

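import zarr

# output_dir is a placeholder: the path to this table's data.zarr directory.
# The my_array_* names are placeholders for your own NumPy arrays.
# zarr.DirectoryStore is the zarr-python 2.x API.
store = zarr.DirectoryStore(output_dir)
root_grp = zarr.group(store, overwrite=True)

# Each array is shaped (rows, columns) or (rows, columns, extra).
call = root_grp.create_dataset("call", shape=(1000, 10, 2), dtype='i1')
call[:, :, :] = my_array_of_calls
allele_depth = root_grp.create_dataset("allele_depth", shape=(1000, 10, 3), dtype='i2')
allele_depth[:, :, :] = my_array_depth
quality = root_grp.create_dataset("quality", shape=(1000, 10), dtype='i4')
quality[:, :] = my_array_of_quality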

We recommend using VCFNP for converting from VCF. See the VCF example for details of how to do this.