BAITS.VDJ.tl.cluster_bcrs
- BAITS.VDJ.tl.cluster_bcrs(igh_df, threshold=0.85, sample_col=None, Vgene_col='Vgene', Jgene_col='Jgene', cdr3nt_col='cdr3nt', n_cpu=None)
Cluster BCR sequences across an entire dataset and compute neighbor degrees.
This function groups sequences by Vgene, Jgene, and CDR3 length, applies
process_group_with_neighbor_count in parallel, and returns the original dataframe with added columns for BCR family ID and neighbor degree.
- Parameters:
igh_df (pandas.DataFrame) – DataFrame containing BCR sequences.
threshold (float, default=0.85) – Sequence identity threshold for clustering.
sample_col (str or None, optional) – Column for sample/library ID. Currently reserved for future use.
Vgene_col (str, default="Vgene") – Column containing V gene names.
Jgene_col (str, default="Jgene") – Column containing J gene names.
cdr3nt_col (str, default="cdr3nt") – Column containing CDR3 nucleotide sequences.
n_cpu (int or None, optional) – Number of CPUs to use for parallel processing. Defaults to all available.
- Returns:
Original dataframe augmented with:
“BCR_familyID” – cluster ID assigned to each sequence.
“Degree” – number of neighbors differing by one nucleotide within the cluster.
- Return type:
pandas.DataFrame
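The grouping step and the meaning of the “Degree” column can be illustrated with a small standalone sketch. It is not the BAITS implementation: the helper names `hamming` and `neighbor_degrees` and the toy records are hypothetical, plain dictionaries stand in for DataFrame rows, and family assignment by `threshold` is not reproduced. Because no families are assigned here, the sketch counts neighbors within the whole (Vgene, Jgene, CDR3 length) group rather than within a cluster.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def neighbor_degrees(records):
    """For each record, count same-group sequences at Hamming distance 1.

    Hypothetical helper: groups are keyed by (Vgene, Jgene, CDR3 length),
    mirroring the grouping described above.
    """
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        key = (rec["Vgene"], rec["Jgene"], len(rec["cdr3nt"]))
        groups[key].append(idx)

    degrees = [0] * len(records)
    for members in groups.values():
        # Sequences in one group share a length, so Hamming distance is defined.
        for i, j in combinations(members, 2):
            if hamming(records[i]["cdr3nt"], records[j]["cdr3nt"]) == 1:
                degrees[i] += 1
                degrees[j] += 1
    return degrees


# Toy data (made up for illustration only).
toy_records = [
    {"Vgene": "IGHV1-2", "Jgene": "IGHJ4", "cdr3nt": "TGTGCGAGA"},
    {"Vgene": "IGHV1-2", "Jgene": "IGHJ4", "cdr3nt": "TGTGCGAGG"},
    {"Vgene": "IGHV1-2", "Jgene": "IGHJ4", "cdr3nt": "TGTGCGAGT"},
    {"Vgene": "IGHV3-23", "Jgene": "IGHJ4", "cdr3nt": "TGTGCGAGA"},     # different V gene
    {"Vgene": "IGHV1-2", "Jgene": "IGHJ4", "cdr3nt": "TGTGCGAGATGG"},  # different CDR3 length
]
toy_degrees = neighbor_degrees(toy_records)
print(toy_degrees)  # [2, 2, 2, 0, 0]
```

The first three records differ from one another only at the last nucleotide, so each has two neighbors. The last two records fall into their own groups and have none.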