This is the repository for Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA.
We have extracted the core implementation of the Learnt Tokenization Module and provide it here for anyone who is interested. To access the full implementation, please check out the full-model branch.
The implementation of the Learnt Tokenization Module in MxDNA is located in the `mxdna` directory, which contains the following files:

- `README.md`: This file provides an overview of the implementation of the Learnt Tokenization Module in MxDNA.
- `mxdna.py`: This file contains the implementation of the Learnt Tokenization Module in MxDNA.
- `BasicUnitNMS.cpp`: This file contains the implementation of the basic unit non-maximum suppression (NMS) algorithm used in MxDNA.
- `CMakeLists.txt`: This file contains the CMake configuration for building the MxDNA project. You also need to clone the pybind11 repository to compile `BasicUnitNMS.cpp` into a shared object file for use in `mxdna.py`.
- `pybind11`: This directory contains the pybind11 library used for Python bindings in MxDNA. You need to clone the pybind11 repository to use MxDNA.
- `BasicUnitNMS.cpython-{$PYTHONVERSION}-{$SYSTEMARCHITECTURE}-linux-gnu.so`: This file is the compiled shared object for the basic unit NMS algorithm. You need to compile it yourself using the provided CMake configuration.
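
The `{$PYTHONVERSION}` and `{$SYSTEMARCHITECTURE}` placeholders in the shared object's name depend on the interpreter that builds and imports it. As a quick check, the sketch below prints the file name the current interpreter would look for. The helper `expected_extension_filename` is hypothetical and not part of this repository; it only relies on the standard library.

```python
import importlib.machinery


def expected_extension_filename(module_name: str = "BasicUnitNMS") -> str:
    """Return the file name this interpreter expects for a compiled extension.

    Hypothetical helper for illustration only; it is not part of mxdna.py.
    """
    # EXTENSION_SUFFIXES lists the suffixes the import system accepts for
    # extension modules; the first entry is the most specific, ABI-tagged one.
    suffix = importlib.machinery.EXTENSION_SUFFIXES[0]
    return f"{module_name}{suffix}"


if __name__ == "__main__":
    print(expected_extension_filename())
```

If the file produced by your CMake build has a different name, it was probably built against another Python interpreter than the one you are running.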
The core of the Learnt Tokenization Module in MxDNA is the `MxDNALearntTokenizationLayer` class defined in `mxdna.py`. The non-maximum suppression (NMS) algorithm is implemented in `BasicUnitNMS.cpp`; it is compiled into a Python extension module using pybind11 and used in the `MxDNALearntTokenizationLayer` class. The sparse Mixture of Convolution Experts is the `MxDNAConvMoeBlock` class and the deformable convolution is the `MxDNADeforambleConvBlock` class, both defined in `mxdna.py`. The comments in the code provide detailed explanations of the implementation.
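
To give a feel for what non-maximum suppression over basic units does, here is a minimal pure-Python sketch of a greedy one-dimensional NMS. It is an illustration under assumptions, not a port of `BasicUnitNMS.cpp`: the function name `greedy_nms_1d`, the centred-window convention, and the choice to leave unplaceable positions unassigned are all hypothetical, and the actual implementation may differ.

```python
def greedy_nms_1d(scores, kernel_sizes):
    """Greedy 1D non-maximum suppression (illustrative sketch only).

    scores[i]       -- confidence that a basic unit is centred at position i
    kernel_sizes[i] -- width of the candidate unit centred at position i
    Returns a list where entry i holds the kernel size of the unit kept at
    centre i, or 0 if no unit is kept there.
    """
    n = len(scores)
    occupied = [False] * n  # positions already covered by a kept unit
    mask = [0] * n

    # Visit candidate centres from the highest score to the lowest.
    for center in sorted(range(n), key=lambda i: scores[i], reverse=True):
        k = kernel_sizes[center]
        start = center - (k - 1) // 2  # assumed: window centred on `center`
        end = start + k                # exclusive end of the window
        if start < 0 or end > n:
            continue                   # window would leave the sequence
        if any(occupied[start:end]):
            continue                   # overlaps a higher-scoring unit
        for pos in range(start, end):
            occupied[pos] = True
        mask[center] = k
    return mask


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
    kernel_sizes = [1, 3, 1, 3, 1, 1]
    print(greedy_nms_1d(scores, kernel_sizes))  # [0, 3, 0, 0, 1, 1]
```

In this example the unit of width 3 centred at index 1 wins first, which suppresses the overlapping candidate at index 3; the two width-1 units at indices 4 and 5 then fit without overlap.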

| Term | Description | Variable in Code |
|---|---|---|
| | Number of nucleotides | `seq_len` before tokenization |
| | Dimension of hidden states | `hidden_dim` |
| | Number of experts | `num_experts` |
| | Number of basic units | `seq_len` after tokenization |
| | Kernel size of deformable convolution | `deforamble_conv_kernel_size` |
| | Indices of nucleotides or tokens | not used |
| | Indices of experts | `expert_idx` |
| | Input nucleotide sequence | `hidden_states` before tokenization |
| | Confidence scores of basic unit existence | `router_logits` |
| | Kernel sizes of convolution experts | `expert_kernel_sizes` |
| | Mask of basic unit existence | `basic_unit_mask_center` |
| | Convolution experts | `MxDNAConvMoeBlock.experts` |
| | Basic units | `hidden_states` after sparse mixture of convolution experts |
| | Offsets of deformable convolution | `offset` |
| | Modulation factors of deformable convolution | `modulator` |
| | Final tokens | `hidden_states` after deformable convolution |