BAITS.VDJ.tl.cluster_bcrs


BAITS.VDJ.tl.cluster_bcrs(igh_df, threshold=0.85, sample_col=None, Vgene_col='Vgene', Jgene_col='Jgene', cdr3nt_col='cdr3nt', n_cpu=None)

Cluster BCR sequences across an entire dataset and compute neighbor degrees.

This function groups sequences by V gene, J gene, and CDR3 length, applies process_group_with_neighbor_count to each group in parallel, and returns the original DataFrame with two added columns: the BCR family ID and the neighbor degree.
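The grouping step described above can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the toy data and the helper column name `cdr3_len` are assumptions, and only the default column names ("Vgene", "Jgene", "cdr3nt") come from the signature.

```python
import pandas as pd

# Toy input; column names follow the function's documented defaults.
igh_df = pd.DataFrame({
    "Vgene": ["IGHV1-2", "IGHV1-2", "IGHV1-2", "IGHV3-23"],
    "Jgene": ["IGHJ4", "IGHJ4", "IGHJ4", "IGHJ6"],
    "cdr3nt": ["TGTGCGAGA", "TGTGCGAGG", "TGTGCGAGATTT", "TGTGCGAAA"],
})

# Derive the CDR3 length from the nucleotide string.
# "cdr3_len" is a hypothetical helper column used only in this sketch.
with_len = igh_df.assign(cdr3_len=igh_df["cdr3nt"].str.len())

# Each (V gene, J gene, CDR3 length) combination forms one candidate group;
# sequences in different groups are never compared with each other.
groups = with_len.groupby(["Vgene", "Jgene", "cdr3_len"], sort=False)

# Convert to plain Python types so the result is easy to inspect.
group_sizes = {
    (v, j, int(n)): int(count)
    for (v, j, n), count in groups.size().items()
}
print(group_sizes)
```

Grouping first keeps the pairwise comparisons small, since sequences only need to be compared against others that share the same V gene, J gene, and CDR3 length.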

Parameters:
  • igh_df (pandas.DataFrame) – DataFrame containing BCR sequences.

  • threshold (float, default=0.85) – Sequence identity threshold for clustering.

  • sample_col (str or None, optional) – Column for sample/library ID. Currently reserved for future use.

  • Vgene_col (str, default="Vgene") – Column containing V gene names.

  • Jgene_col (str, default="Jgene") – Column containing J gene names.

  • cdr3nt_col (str, default="cdr3nt") – Column containing CDR3 nucleotide sequences.

  • n_cpu (int or None, optional) – Number of CPUs to use for parallel processing. If None, all available CPUs are used.

Returns:

Original DataFrame augmented with:
  • "BCR_familyID": cluster ID assigned to each sequence.
  • "Degree": number of neighbors differing by one nucleotide within the cluster.

Return type:

pandas.DataFrame
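To illustrate what the "Degree" column represents, the following sketch counts, for each sequence in a single cluster, how many other sequences differ from it by exactly one nucleotide. The helper names `hamming` and `neighbor_degrees` are hypothetical and the cluster is made-up data. The sketch also assumes that identical sequences (zero differences) are not counted as neighbors, which the documentation does not state.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def neighbor_degrees(seqs: list[str]) -> list[int]:
    """For each sequence, count the others that differ by exactly one nucleotide."""
    degrees = [0] * len(seqs)
    for i, j in combinations(range(len(seqs)), 2):
        if hamming(seqs[i], seqs[j]) == 1:
            # A one-nucleotide difference is symmetric, so both ends gain a neighbor.
            degrees[i] += 1
            degrees[j] += 1
    return degrees


# Hypothetical cluster of equal-length CDR3 nucleotide sequences.
cluster = ["TGTGCGAGA", "TGTGCGAGG", "TGTGCGAGT", "TGTGCCAGT"]
degrees = neighbor_degrees(cluster)
print(degrees)  # [2, 2, 3, 1]
```

The pairwise loop is quadratic in cluster size, which is acceptable for a sketch; because sequences are grouped by CDR3 length beforehand, every comparison within a group is between equal-length strings.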