BAITS.IR module#

Import packages#

import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

import BAITS

Import files#

Users could read the example data from the github, also could access the data through the link

bcr_loc_df = pd.read_csv('../data/bcr_sampled.tsv', sep='\t', index_col=0)
bcr_loc_df.head(n=2) 
sample bcUmi clone aaClone Vgene Jgene Cgene cdr3nt
0 P0516-LM E14D5FEFEBE3D5C47 TGTCAACAGAGTCACAGTATACCTACTTTT_CQQSHSIPTF_IGKV... CQQSHSIPTF_IGKV1-39_IGKJ2_IGK IGKV1-39 IGKJ2 IGK TGTCAACAGAGTCACAGTATACCTACTTTT
1 P0516-LM 8E153B79054792CA2 TGTCAACAATATAGCACTGTCCCCCTCACTTTC_CQQYSTVPLTF_... CQQYSTVPLTF_IGKV1-27_IGKJ4_IGK IGKV1-27 IGKJ4 IGK TGTCAACAATATAGCACTGTCCCCCTCACTTTC

Define parameters#

sample_col = 'sample'
Umi_col = 'bcUmi'
clone_col = 'clone'
cdr3nt_col = 'cdr3nt'
Vgene_col = 'Vgene'
Jgene_col = 'Jgene'
Cgene_col = 'Cgene'

Quality Control#

Calculate clonal CDR3 length#

Here, BAITS provides functions to calculate and visualize the length distribution of CDR3 region. The cauculated results will be returned.

bcr_loc_df = BAITS.VDJ.tl.calculate_cdr3_length(bcr_loc_df, sample_col, Cgene_col, cdr3nt_col, cdr3_type='nt', plot=True)
../_images/1ce82cca5e57ba8689986f53094c1c384b394841d55a1d7789f7252a1708c17b.png
bcr_loc_df.head(n=2) 
sample bcUmi clone aaClone Vgene Jgene Cgene cdr3nt cdr3_length
0 P0516-LM E14D5FEFEBE3D5C47 TGTCAACAGAGTCACAGTATACCTACTTTT_CQQSHSIPTF_IGKV... CQQSHSIPTF_IGKV1-39_IGKJ2_IGK IGKV1-39 IGKJ2 IGK TGTCAACAGAGTCACAGTATACCTACTTTT 10
1 P0516-LM 8E153B79054792CA2 TGTCAACAATATAGCACTGTCCCCCTCACTTTC_CQQYSTVPLTF_... CQQYSTVPLTF_IGKV1-27_IGKJ4_IGK IGKV1-27 IGKJ4 IGK TGTCAACAATATAGCACTGTCCCCCTCACTTTC 11

Calculate clonal richness#

BAITS provides function to calculate and visualize the clonal number per sample

bcr_loc_df = BAITS.VDJ.tl.stat_clone(bcr_loc_df, 'sample', Cgene_col, clone_col) 
../_images/7ea12bfdae5ee1543e4a14890394bba38e0879071e50d29e95bded9622aba0ea.png
bcr_loc_df.head(n=2) 
sample bcUmi clone aaClone Vgene Jgene Cgene cdr3nt cdr3_length clone_by_sample
0 P0516-LM E14D5FEFEBE3D5C47 TGTCAACAGAGTCACAGTATACCTACTTTT_CQQSHSIPTF_IGKV... CQQSHSIPTF_IGKV1-39_IGKJ2_IGK IGKV1-39 IGKJ2 IGK TGTCAACAGAGTCACAGTATACCTACTTTT 10 5216
1 P0516-LM 8E153B79054792CA2 TGTCAACAATATAGCACTGTCCCCCTCACTTTC_CQQYSTVPLTF_... CQQYSTVPLTF_IGKV1-27_IGKJ4_IGK IGKV1-27 IGKJ4 IGK TGTCAACAATATAGCACTGTCCCCCTCACTTTC 11 5216

Calculate immune repertoire features#

Calculates the Shannon entropy of the CDR3 repertoire, leveraging UMI information to correct for sequencing errors and amplification bias.

groups = [sample_col, Cgene_col] # user-customizable based on experimental design
shannon_entropy_df = BAITS.VDJ.tl.compute_grouped_index(bcr_loc_df, sample_col, Cgene_col, clone_col, groups, count_basis = 'UMI', Umi_col='bcUmi', index='shannon_entropy')

Here, the Shannon entropy of the CDR3 repertoire is computed, stratified by both sample and C gene (chain type) groups.

shannon_entropy_df.head()
sample Cgene shannon_entropy
0 P0516-LM IGH 8.898825
1 P0516-LM IGK 9.235398
2 P0516-LM IGL 8.314985
3 P1128-LM IGH 8.548061
4 P1128-LM IGK 8.711763
fig, ax = plt.subplots(1,1,figsize=(4, 3.5))
BAITS.VDJ.pl._boxplot(shannon_entropy_df, 'Cgene', 'shannon_entropy', groupby='Cgene', palette='Pastel1', xlabel=None, ylabel='shannon_entropy', log=False, ax=ax )
plt.show()
../_images/c513bd38e6d72271aa1769e5ae1be09a3a281ae6f5be0b4ffce19d040d3617e5.png

Calculates the Clonality of the CDR3 repertoire, leveraging UMI information to correct for sequencing errors and amplification bias.

Clonality quantifies the unevenness of clonal frequencies in the immune repertoire. Higher clonality suggests antigen-driven expansion and potential immune dominance

groups = [sample_col, Cgene_col]
shannon_entropy_df = BAITS.VDJ.tl.compute_grouped_index(bcr_loc_df, sample_col, Cgene_col, clone_col, groups, count_basis = 'UMI', Umi_col='bcUmi', index='Clonality')
fig, ax = plt.subplots(1,1,figsize=(4, 3.5))
BAITS.VDJ.pl._boxplot(shannon_entropy_df, 'Cgene', 'Clonality', groupby='Cgene', palette='Pastel1', xlabel=None, ylabel='Clonality', log=False, ax=ax )
plt.show()
../_images/a8c40483ed958b9425a94003adbc2d45de79c693988ca4dc52b092fb11d57b29.png

Gini_index (based on UMIs)

groups = [sample_col, Cgene_col]
gini_index_df = BAITS.VDJ.tl.compute_grouped_index(bcr_loc_df, sample_col, Cgene_col, clone_col, groups, count_basis = 'UMI', Umi_col='bcUmi', index='gini_index')
gini_index_df['tissue'] = gini_index_df['sample'].str.split('-').str[1]
fig, ax = plt.subplots(1,1,figsize=(4, 3.5))
BAITS.VDJ.pl._boxplot(gini_index_df, 'Cgene', 'gini_index', groupby='Cgene', palette='Pastel1', xlabel=None, ylabel='gini_index', log=False, ax=ax )
plt.show()
../_images/af846fd12cd6ee6d82c75dd2967dcbaf1777d96f153d046b472eed538ece3114.png

Renyi_entropy (based on UMIs)

Renyi entropy is a generalized diversity measure that captures different aspects of repertoire structure through an adjustable parameter alpha.

groups = [sample_col, Cgene_col]
renyi_entropy_df = BAITS.VDJ.tl.compute_grouped_index(bcr_loc_df, sample_col, Cgene_col, clone_col, groups, count_basis = 'UMI', Umi_col='bcUmi', index='renyi_entropy')
fig, ax = plt.subplots(1,1,figsize=(4, 3.5))
sns.lineplot(data=renyi_entropy_df, x="alpha", y="renyi_entropy", hue='Cgene', ax=ax )
plt.show()
../_images/f81da412dfaa2872bc2cee95489ca2cf9f2e1b1eaf2d63c305b0528261fd8347.png

Cluster BCR sequences to get BCR family#

We first focus on IGH BCRs (Immunoglobulin Heavy Chains) because they are the most important for analyzing BCR clusters and clonal families. By selecting only rows where the Cgene column equals ‘IGH’, we create a subset of the data that contains only IGH sequences, which will be used for downstream clustering and diversity analysis.

igh_bcr_loc_df = bcr_loc_df[bcr_loc_df['Cgene']=='IGH'].copy()

BCR sequences are first grouped by V gene, J gene, and CDR3 length. Within each group, sequences sharing ≥85% nucleotide identity (user-defined) are seen from the same BCR family.

igh_bcr_loc_df = BAITS.VDJ.tl.cluster_bcrs(igh_bcr_loc_df, threshold=0.85, sample_col='sample',  Vgene_col='Vgene', Jgene_col='Jgene', cdr3nt_col='cdr3nt')
igh_bcr_loc_df.iloc[:,-5:].head() 
cdr3_length clone_by_sample CDR3_nt_length BCR_familyID Degree
0 13 1023 39 IGHV1-45_IGHJ4_39_0 0
1 20 1023 60 IGHV4-4_IGHJ5_60_0 0
2 14 1023 42 IGHV4-39_IGHJ4_42_2 0
3 21 1023 63 IGHV3-72_IGHJ4_63_0 1
4 18 1023 54 IGHV1-69_IGHJ6_54_1 0

Degree centrality

Clonal degree centrality: Measure of the number of similar clones (with a hamming distance ≤ 1) connected to each unique V(D)J sequence within a clonal family. A high degree centrality value suggests that a particular V(D)J sequence serves as a progenitor clone, from which multiple related sequences have diverged via somatic hypermutation. This pattern is indicative of active antigen-driven diversification and affinity maturation. Conversely, a low degree centrality value indicates that the sequence is peripheral or isolated within the clonal network, with few connected variants. This may suggest limited subsequent diversification, potentially representing either a recently emerged clone or one that has not undergone significant antigen-driven expansion.

plt.subplots(1,1,figsize=(3, 3))
sns.boxplot(igh_bcr_loc_df, x='sample', y='Degree', palette='Pastel1')
plt.ylabel('Clonal degree centrrality')
plt.show()
../_images/0e2b6492dd366abd7d1e4f63fa2ebb4da79cc5d783a57b55bfbd579f124359e6.png

Calculate clonal family diversification index

Clonal family diversification: Measures the unevenness in the distribution of unique V(D)J sequences across clonal families using the Gini index. A high value indicates that certain clonal families contain numerous distinct V(D)J sequences – a pattern often associated with antigen-driven expansion and diversification.

Clonal_family_diver_idx = BAITS.VDJ.tl.Clonal_family_diversification(igh_bcr_loc_df, sample_col='sample', cluster_col='BCR_familyID', cdr3_col='cdr3nt') 
    
Clonal_family_diver_idx.head()
sample Clonal_family_diver_index
0 P0516-LM 0.285553
1 P1128-LM 0.611375
plt.subplots(1,1,figsize=(3.5, 3.5))
sns.barplot(Clonal_family_diver_idx,x='sample', y='Clonal_family_diver_index', color='#4C72B0')
plt.ylabel('clonal family diversification index')
plt.show()
../_images/fdaf72571e1332bce5b214c0b740ca6226460a42f31b5562016d6ba36e6de1a3.png