Crate bitnuc


§bitnuc

A library for efficient nucleotide sequence manipulation using 2-bit encoding.

§Features

2-bit nucleotide encoding (A=00, C=01, G=10, T=11)
Direct bit manipulation functions for custom implementations
Higher-level sequence type with additional analysis features

§Low-Level Packing Functions

For direct bit manipulation, use the as_2bit and from_2bit functions:

use bitnuc::{as_2bit, from_2bit, from_2bit_alloc};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Pack a sequence into a u64
    let packed = as_2bit(b"ACGT")?;
    assert_eq!(packed, 0b11100100);

    // Unpack back to a sequence using a reusable buffer
    let mut unpacked = Vec::new();
    from_2bit(packed, 4, &mut unpacked)?;
    assert_eq!(&unpacked, b"ACGT");
    unpacked.clear();

    // Unpack back to a sequence into a newly allocated Vec
    let unpacked = from_2bit_alloc(packed, 4)?;
    assert_eq!(&unpacked, b"ACGT");

    Ok(())
}
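To make the bit layout concrete, here is a minimal std-only sketch of 2-bit packing. It is not bitnuc's implementation: `pack_2bit` is a hypothetical helper, and it assumes the first base occupies the two least-significant bits, which is consistent with the `0b11100100` value asserted for `ACGT` above.

```rust
/// Hypothetical helper (not part of bitnuc): pack up to 32 bases into a u64,
/// placing the first base in the two least-significant bits.
/// Returns None for sequences longer than 32 bases or for non-ACGT bytes.
fn pack_2bit(seq: &[u8]) -> Option<u64> {
    if seq.len() > 32 {
        return None; // a u64 holds at most 32 two-bit bases
    }
    let mut packed = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        let bits: u64 = match base {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => return None, // anything else cannot be represented in 2 bits
        };
        // Base i lands in bits 2*i and 2*i + 1.
        packed |= bits << (2 * i);
    }
    Some(packed)
}
```

Reading `0b11100100` from right to left in pairs gives 00, 01, 10, 11, that is A, C, G, T.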

These functions are useful when you need to:

Implement custom sequence storage
Manipulate sequences at the bit level
Integrate with other bioinformatics tools
Copy sequences more efficiently
Hash sequences more efficiently

For example, packing multiple short sequences:

use bitnuc::{as_2bit, from_2bit};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Pack multiple 4-mers into u64s
    let kmers = [b"ACGT", b"TGCA", b"GGCC"];
    let packed: Vec<u64> = kmers
        .into_iter()
        .map(|kmer| as_2bit(kmer))
        .collect::<Result<_, _>>()?;

    // Unpack when needed
    let mut unpacked = Vec::new();
    from_2bit(packed[0], 4, &mut unpacked)?;
    assert_eq!(&unpacked, b"ACGT");
    Ok(())
}

§Mid-Level Encoding Functions

For more control over encoding and decoding, use the encode and decode functions. They handle sequences of any length, padding the last u64 with zeros if needed.

We’ll use the nucgen crate to generate random sequences for testing:

use bitnuc::{encode, decode};
use nucgen::Sequence;

let mut rng = rand::thread_rng();
let mut seq = Sequence::new();
let seq_len = 1000;

// Generate a random sequence
seq.fill_buffer(&mut rng, seq_len);

// Encode the sequence
let mut ebuf = Vec::new(); // Buffer for encoded sequence
encode(seq.bytes(), &mut ebuf);

// Decode the sequence
let mut dbuf = Vec::new(); // Buffer for decoded sequence
decode(&ebuf, seq_len, &mut dbuf);

// Check that the decoded sequence matches the original
assert_eq!(seq.bytes(), &dbuf);

Note that the encode function always writes whole u64 words. If the sequence length is not a multiple of 32 bases, the final u64 is packed with the remaining bases and its unused bits are set to zero.

Decoding uses the sequence length you pass in to ignore these zero bits and return the original sequence.
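The padding behavior can be illustrated with a std-only sketch of multi-word packing. `encode_words` and `decode_words` are hypothetical helpers written for this illustration, not bitnuc's encode and decode; they assume the first base of each word sits in the lowest bits. Because A is encoded as 00, zero padding is indistinguishable from a run of A bases, so the decoder needs the original length.

```rust
/// Hypothetical sketch (not bitnuc's code): pack a sequence of any length
/// into u64 words, 32 bases per word. Returns None on a non-ACGT byte.
fn encode_words(seq: &[u8]) -> Option<Vec<u64>> {
    let mut words = Vec::with_capacity((seq.len() + 31) / 32);
    for chunk in seq.chunks(32) {
        let mut word = 0u64;
        for (i, &base) in chunk.iter().enumerate() {
            let bits: u64 = match base {
                b'A' => 0b00,
                b'C' => 0b01,
                b'G' => 0b10,
                b'T' => 0b11,
                _ => return None,
            };
            word |= bits << (2 * i);
        }
        // A short final chunk leaves the high bits of `word` at zero.
        words.push(word);
    }
    Some(words)
}

/// Hypothetical sketch: unpack the first `n_bases` bases.
/// Panics if `n_bases` exceeds 32 * words.len().
fn decode_words(words: &[u64], n_bases: usize) -> Vec<u8> {
    const BASES: [u8; 4] = *b"ACGT";
    (0..n_bases)
        .map(|i| BASES[((words[i / 32] >> (2 * (i % 32))) & 0b11) as usize])
        .collect()
}
```

With a 35-base input this produces two words, and asking the decoder for 36 bases instead of 35 yields a spurious trailing `A` read from the padding.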

§High-Level Sequence Type

For more complex sequence manipulation, use the PackedSequence type:

use bitnuc::{PackedSequence, GCContent, BaseCount};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let seq = PackedSequence::new(b"ACGTACGT")?;

    // Sequence analysis
    println!("GC Content: {}%", seq.gc_content());
    let [a_count, c_count, g_count, t_count] = seq.base_counts();

    // Slicing
    let subseq = seq.slice(1..5)?;
    assert_eq!(&subseq, b"CGTA");
    Ok(())
}
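One property of the A=00, C=01, G=10, T=11 encoding is that G and C are exactly the codes whose two bits differ, so GC bases can be counted on packed data with a few bit operations. The sketch below shows this for a single u64. `gc_count` is a hypothetical helper; it is not claimed to be how PackedSequence computes gc_content, and it assumes the first base sits in the lowest bits.

```rust
/// Hypothetical helper: count G and C bases among the first `len` bases
/// of a 2-bit packed word (A=00, C=01, G=10, T=11, first base in low bits).
fn gc_count(packed: u64, len: usize) -> u32 {
    assert!(len <= 32, "a u64 holds at most 32 bases");
    // Selects the low bit of every 2-bit pair.
    const LOW_BITS: u64 = 0x5555_5555_5555_5555;
    // High bit XOR low bit is 1 for C (01) and G (10), 0 for A (00) and T (11).
    let differs = (packed ^ (packed >> 1)) & LOW_BITS;
    // Ignore any pairs beyond the first `len` bases.
    let mask = if len == 32 {
        u64::MAX
    } else {
        (1u64 << (2 * len)) - 1
    };
    (differs & mask).count_ones()
}
```

Dividing the count by the sequence length gives the GC fraction without unpacking the sequence.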

§Memory Usage

The 2-bit encoding provides significant memory savings:

Standard encoding: 1 byte per base
ACGT = 4 bytes = 32 bits

2-bit encoding: 2 bits per base
ACGT = 8 bits

This means you can store four times as many bases in the same amount of memory.
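The saving can be worked out for any sequence length. The sketch below compares one byte per base with 2-bit packing into whole u64 words. Both functions are hypothetical helpers, and the rounding up to whole words is an assumption based on the encode behavior described earlier.

```rust
/// Bytes used at one byte per base.
fn ascii_bytes(n_bases: usize) -> usize {
    n_bases
}

/// Hypothetical helper: bytes used when packing 32 bases per u64 word,
/// rounding up to a whole number of words.
fn packed_bytes(n_bases: usize) -> usize {
    let words = (n_bases + 31) / 32;
    words * std::mem::size_of::<u64>()
}
```

For 1000 bases this gives 256 bytes instead of 1000. The ratio approaches 4 as sequences get longer, while a lone 4-mer held in a u64 still occupies 8 bytes.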

§Error Handling

All operations that could fail return a Result with NucleotideError:

use bitnuc::{as_2bit, NucleotideError};

// Invalid nucleotide
let err = as_2bit(b"ACGN").unwrap_err();
assert!(matches!(err, NucleotideError::InvalidBase(b'N')));

// Sequence too long
let long_seq = vec![b'A'; 33];
let err = as_2bit(&long_seq).unwrap_err();
assert!(matches!(err, NucleotideError::SequenceTooLong(33)));

§Performance Considerations

When working with many short sequences (like k-mers), using as_2bit and from_2bit directly can be more efficient than creating PackedSequence instances:

use bitnuc::as_2bit;
use std::collections::HashMap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Efficient k-mer counting
    let mut kmer_counts = HashMap::new();

    // Pack k-mers directly into u64s
    let sequence = b"ACGTACGT";
    for window in sequence.windows(4) {
        let packed = as_2bit(window)?;
        *kmer_counts.entry(packed).or_insert(0) += 1;
    }

    // Count of "ACGT"
    let acgt_packed = as_2bit(b"ACGT")?;
    assert_eq!(kmer_counts.get(&acgt_packed), Some(&2));
    Ok(())
}
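When every window of a longer sequence is needed, each k-mer can also be derived from the previous one instead of being packed from scratch. The following std-only sketch shows that rolling update. It is an alternative technique, not part of bitnuc's API: `count_kmers_rolling` is a hypothetical helper, and it assumes the first base of each k-mer sits in the lowest bits so that the keys match the layout asserted for as_2bit earlier.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of bitnuc): count k-mers using a rolling
/// 2-bit encoding. Returns None on a non-ACGT byte or if k is not in 1..=32.
fn count_kmers_rolling(seq: &[u8], k: usize) -> Option<HashMap<u64, usize>> {
    if k == 0 || k > 32 {
        return None;
    }
    let mut counts = HashMap::new();
    let mut kmer = 0u64;
    for (i, &base) in seq.iter().enumerate() {
        let bits: u64 = match base {
            b'A' => 0b00,
            b'C' => 0b01,
            b'G' => 0b10,
            b'T' => 0b11,
            _ => return None,
        };
        // Shift the oldest base out of the low end and place the new base
        // in the highest of the k two-bit slots.
        kmer = (kmer >> 2) | (bits << (2 * (k - 1)));
        // A full window exists once k bases have been consumed.
        if i + 1 >= k {
            *counts.entry(kmer).or_insert(0) += 1;
        }
    }
    Some(counts)
}
```

Each step costs one shift and one OR regardless of k, whereas re-packing every window costs work proportional to k.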

If you are unpacking many sequences, consider reusing a buffer to avoid repeated allocations:

use bitnuc::{as_2bit, from_2bit};

fn main() -> Result<(), Box<dyn std::error::Error>> {

    // Pack a sequence
    let packed = as_2bit(b"ACGT")?;

    // Reusable buffer for unpacking
    let mut unpacked = Vec::new();
    from_2bit(packed, 4, &mut unpacked)?;
    assert_eq!(&unpacked, b"ACGT");
    unpacked.clear();

    // Pack another sequence
    let packed = as_2bit(b"TGCA")?;
    from_2bit(packed, 4, &mut unpacked)?;
    assert_eq!(&unpacked, b"TGCA");
    Ok(())
}

See the documentation for as_2bit and from_2bit for more details on working with packed sequences directly.

Structs§

PackedSequence

Enums§

NucleotideError

Traits§

BaseCount
GCContent

Functions§

as_2bit: Converts a nucleotide sequence into a 2-bit packed representation.
decode: Unpacks a 2-bit packed sequence into a nucleotide sequence.
encode: Encode a sequence into a buffer of 2-bit encoded nucleotides.
encode_alloc: Encode a sequence into a buffer of 2-bit encoded nucleotides.
from_2bit: Converts a 2-bit packed representation back into a nucleotide sequence.
from_2bit_alloc: This calls from_2bit but allocates a new Vec to store the result.
hdist: Calculate Hamming distance between two 2-bit encoded sequences. Each u64 contains up to 32 bases (2 bits per base).
hdist_scalar: Calculate Hamming distance between two 2-bit encoded u64 values. Each u64 can contain up to 32 bases (2 bits per base); len must be <= 32.
split_packed: Splits a packed nucleotide sequence into two subsequences at the given index.
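
The Hamming distance between two packed words can be computed without unpacking them. The sketch below is one common bit-twiddling approach, shown for illustration only. It is not claimed to be how hdist or hdist_scalar are implemented, and `hamming_2bit` is a hypothetical name. It assumes the first base sits in the lowest bits.

```rust
/// Hypothetical helper: number of positions at which two 2-bit packed
/// words differ, considering only the first `len` bases (len <= 32).
fn hamming_2bit(a: u64, b: u64, len: usize) -> u32 {
    assert!(len <= 32, "a u64 holds at most 32 bases");
    // Selects the low bit of every 2-bit pair.
    const LOW_BITS: u64 = 0x5555_5555_5555_5555;
    let diff = a ^ b;
    // Fold each pair's high bit onto its low bit: the low bit ends up set
    // exactly when the two bases at that position differ.
    let per_base = (diff | (diff >> 1)) & LOW_BITS;
    // Ignore any pairs beyond the first `len` bases.
    let mask = if len == 32 {
        u64::MAX
    } else {
        (1u64 << (2 * len)) - 1
    };
    (per_base & mask).count_ones()
}
```

Folding before counting matters because a single mismatch can flip one or both bits of a pair, so counting the raw XOR bits would overcount.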
