CLC Assembly Cell
A high-performance computing solution for mapping reads to a reference and de novo assembly of next-generation sequencing data.
With the command-line interface of CLC Assembly Cell, you can easily include these functionalities in scripts and other next-generation sequencing workflows. It is easy to install on a desktop computer or a larger compute cluster.
Benchmarking
The latest version of CLC Assembly Cell introduces tools for error correction and de novo assembly of raw PacBio reads. High-quality assemblies can be generated in a fraction of the time needed by leading alternatives. CLC Assembly Cell consumes less than 10 percent of the memory used by alternative solutions, while completing the assembly faster.
We compared the industry standard HGAP1, run on a high-performance computer, with a De Novo Assembly workflow in CLC Assembly Cell. Please note that our De Novo Assembly pipeline was run on a standard laptop for this comparison.
CLC Assembly Cell is accelerated through advanced algorithm implementations that use SIMD instructions to parallelize the compute-intensive parts of the algorithms, making the software one of the fastest and most accurate packages for NGS data analysis on the market.
Features in CLC Assembly Cell
Multiple CLC Assembly Cells can be run in parallel on a multi-node cluster.
In practice, almost every cluster is set up differently, and we therefore don’t provide an off-the-shelf solution that is guaranteed to work on your computer cluster. Instead we provide a free to download, free to use, and free to modify Perl script, as an example.
Job node distribution for CLC Assembly Cell
Multiple CLC Assembly Cells can be run in parallel on a multi-node cluster. Because almost every cluster is set up differently, we provide the Perl script below as an example; it is free to download, free to use, and free to modify. Please note that it is not an off-the-shelf solution guaranteed to work on your computer cluster, but you are welcome to adjust it to fit your needs.
The script cluster_schedule distributes jobs defined in the schedule file on a number of nodes. An example could be distribution of CLC Assembly Cell reference assembly jobs. This requires an installation of CLC Assembly Cell on each node, and the best performance is reached if the reference sequence is stored locally on each node.
Each job is a list of commands which cluster_schedule runs in order on one node. If one of the commands in a job fails (its error code is not zero), no more commands in the job are executed and the job is considered failed. If all commands in a job complete successfully (all error codes are zero), the job is a success.
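The job semantics described above can be illustrated with a short Python sketch. This is not the cluster_schedule Perl script itself; the function name `run_job` and the two sample commands are hypothetical, and the commands run locally rather than on a remote node to keep the example self-contained.

```python
import subprocess
import sys


def run_job(commands):
    """Run a job's commands in order, stopping at the first failure.

    `commands` is a list of argument lists. Returns a tuple
    (succeeded, executed), where `executed` counts the commands
    that were actually started.
    """
    executed = 0
    for argv in commands:
        executed += 1
        result = subprocess.run(argv)
        if result.returncode != 0:
            # A non-zero exit code fails the whole job; the remaining
            # commands in the job are not run.
            return False, executed
    return True, executed


# Hypothetical stand-ins for real commands: one exits with code 0,
# the other with code 3.
ok_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
bad_cmd = [sys.executable, "-c", "raise SystemExit(3)"]
```

Calling `run_job([ok_cmd, bad_cmd, ok_cmd])` starts only the first two commands and reports the job as failed, while `run_job([ok_cmd, ok_cmd])` reports success.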
The nodes on which the jobs are run can be defined on the command line or in the schedule_file. Nodes defined on the command line replace all nodes defined in the schedule_file.
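The precedence rule for nodes can be sketched as follows. The function name `select_nodes` and the node names are hypothetical; how the actual script parses its command line and schedule file is not shown here.

```python
def select_nodes(cli_nodes, schedule_nodes):
    """Return the nodes to run jobs on.

    Nodes given on the command line replace (not extend) the nodes
    listed in the schedule file; the schedule file's nodes are used
    only when no nodes are given on the command line.
    """
    if cli_nodes:
        return list(cli_nodes)
    return list(schedule_nodes)
```

For example, `select_nodes(["node7"], ["node1", "node2"])` yields only `node7`, whereas an empty command-line list falls back to `node1` and `node2`.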
Each job is run on one node, and each command is executed on that node using ssh.
To use cluster_schedule, therefore, make sure that all nodes are set up for automatic ssh authentication.
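As an illustration of why automatic authentication matters, the sketch below builds an ssh invocation for one command on one node. It assumes the OpenSSH client; the `BatchMode=yes` option makes ssh fail instead of prompting for a password, so a node without automatic authentication is detected rather than stalling the run. The function name `ssh_argv`, the node name, and the sample command are hypothetical, and the actual script may invoke ssh differently.

```python
def ssh_argv(node, command):
    """Build the argument list for running `command` on `node` via ssh.

    `command` is passed as a single string, which the remote shell
    interprets. BatchMode=yes disables interactive password prompts.
    """
    return ["ssh", "-o", "BatchMode=yes", node, command]


# Hypothetical node and command, shown for illustration only.
argv = ssh_argv("node1", "echo ready")
```

The resulting list can be handed to `subprocess.run(argv)`; the exit code of ssh is then the exit code of the remote command, or 255 if the connection itself failed.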