BAITS.IR module#

Import packages#

import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

import BAITS

Import files#

Users could read the example data from the github, also could access the data through the link

bcr_loc_df = pd.read_csv('../data/bcr_sampled.tsv', sep='\t', index_col=0)
bcr_loc_df.head(n=2) 

	sample	bcUmi	clone	aaClone	Vgene	Jgene	Cgene	cdr3nt
0	P0516-LM	E14D5FEFEBE3D5C47	TGTCAACAGAGTCACAGTATACCTACTTTT_CQQSHSIPTF_IGKV...	CQQSHSIPTF_IGKV1-39_IGKJ2_IGK	IGKV1-39	IGKJ2	IGK	TGTCAACAGAGTCACAGTATACCTACTTTT
1	P0516-LM	8E153B79054792CA2	TGTCAACAATATAGCACTGTCCCCCTCACTTTC_CQQYSTVPLTF_...	CQQYSTVPLTF_IGKV1-27_IGKJ4_IGK	IGKV1-27	IGKJ4	IGK	TGTCAACAATATAGCACTGTCCCCCTCACTTTC

Define parameters#

sample_col = 'sample'
Umi_col = 'bcUmi'
clone_col = 'clone'
cdr3nt_col = 'cdr3nt'
Vgene_col = 'Vgene'
Jgene_col = 'Jgene'
Cgene_col = 'Cgene'

Quality Control#

Calculate clonal CDR3 length#

Here, BAITS provides functions to calculate and visualize the length distribution of CDR3 region. The cauculated results will be returned.

bcr_loc_df = BAITS.VDJ.tl.calculate_cdr3_length(bcr_loc_df, sample_col, Cgene_col, cdr3nt_col, cdr3_type='nt', plot=True)

../_images/1ce82cca5e57ba8689986f53094c1c384b394841d55a1d7789f7252a1708c17b.png

bcr_loc_df.head(n=2) 

	sample	bcUmi	clone	aaClone	Vgene	Jgene	Cgene	cdr3nt	cdr3_length
0	P0516-LM	E14D5FEFEBE3D5C47	TGTCAACAGAGTCACAGTATACCTACTTTT_CQQSHSIPTF_IGKV...	CQQSHSIPTF_IGKV1-39_IGKJ2_IGK	IGKV1-39	IGKJ2	IGK	TGTCAACAGAGTCACAGTATACCTACTTTT	10
1	P0516-LM	8E153B79054792CA2	TGTCAACAATATAGCACTGTCCCCCTCACTTTC_CQQYSTVPLTF_...	CQQYSTVPLTF_IGKV1-27_IGKJ4_IGK	IGKV1-27	IGKJ4	IGK	TGTCAACAATATAGCACTGTCCCCCTCACTTTC	11

Calculate clonal richness#

BAITS provides function to calculate and visualize the clonal number per sample

bcr_loc_df = BAITS.VDJ.tl.stat_clone(bcr_loc_df, 'sample', Cgene_col, clone_col) 

../_images/7ea12bfdae5ee1543e4a14890394bba38e0879071e50d29e95bded9622aba0ea.png

bcr_loc_df.head(n=2) 

	sample	bcUmi	clone	aaClone	Vgene	Jgene	Cgene	cdr3nt	cdr3_length	clone_by_sample
0	P0516-LM	E14D5FEFEBE3D5C47	TGTCAACAGAGTCACAGTATACCTACTTTT_CQQSHSIPTF_IGKV...	CQQSHSIPTF_IGKV1-39_IGKJ2_IGK	IGKV1-39	IGKJ2	IGK	TGTCAACAGAGTCACAGTATACCTACTTTT	10	5216
1	P0516-LM	8E153B79054792CA2	TGTCAACAATATAGCACTGTCCCCCTCACTTTC_CQQYSTVPLTF_...	CQQYSTVPLTF_IGKV1-27_IGKJ4_IGK	IGKV1-27	IGKJ4	IGK	TGTCAACAATATAGCACTGTCCCCCTCACTTTC	11	5216

Calculate immune repertoire features#

Calculates the Shannon entropy of the CDR3 repertoire, leveraging UMI information to correct for sequencing errors and amplification bias.

groups = [sample_col, Cgene_col] # user-customizable based on experimental design

shannon_entropy_df = BAITS.VDJ.tl.compute_grouped_index(bcr_loc_df, sample_col, Cgene_col, clone_col, groups, count_basis = 'UMI', Umi_col='bcUmi', index='shannon_entropy')

Here, the Shannon entropy of the CDR3 repertoire is computed, stratified by both sample and C gene (chain type) groups.

shannon_entropy_df.head()

	sample	Cgene	shannon_entropy
0	P0516-LM	IGH	8.898825
1	P0516-LM	IGK	9.235398
2	P0516-LM	IGL	8.314985
3	P1128-LM	IGH	8.548061
4	P1128-LM	IGK	8.711763

fig, ax = plt.subplots(1,1,figsize=(4, 3.5))
BAITS.VDJ.pl._boxplot(shannon_entropy_df, 'Cgene', 'shannon_entropy', groupby='Cgene', palette='Pastel1', xlabel=None, ylabel='shannon_entropy', log=False, ax=ax )
plt.show()

../_images/c513bd38e6d72271aa1769e5ae1be09a3a281ae6f5be0b4ffce19d040d3617e5.png

Calculates the Clonality of the CDR3 repertoire, leveraging UMI information to correct for sequencing errors and amplification bias.

Clonality quantifies the unevenness of clonal frequencies in the immune repertoire. Higher clonality suggests antigen-driven expansion and potential immune dominance

groups = [sample_col, Cgene_col]
shannon_entropy_df = BAITS.VDJ.tl.compute_grouped_index(bcr_loc_df, sample_col, Cgene_col, clone_col, groups, count_basis = 'UMI', Umi_col='bcUmi', index='Clonality')

fig, ax = plt.subplots(1,1,figsize=(4, 3.5))
BAITS.VDJ.pl._boxplot(shannon_entropy_df, 'Cgene', 'Clonality', groupby='Cgene', palette='Pastel1', xlabel=None, ylabel='Clonality', log=False, ax=ax )
plt.show()

../_images/a8c40483ed958b9425a94003adbc2d45de79c693988ca4dc52b092fb11d57b29.png

Gini_index (based on UMIs)

groups = [sample_col, Cgene_col]
gini_index_df = BAITS.VDJ.tl.compute_grouped_index(bcr_loc_df, sample_col, Cgene_col, clone_col, groups, count_basis = 'UMI', Umi_col='bcUmi', index='gini_index')
gini_index_df['tissue'] = gini_index_df['sample'].str.split('-').str[1]

fig, ax = plt.subplots(1,1,figsize=(4, 3.5))
BAITS.VDJ.pl._boxplot(gini_index_df, 'Cgene', 'gini_index', groupby='Cgene', palette='Pastel1', xlabel=None, ylabel='gini_index', log=False, ax=ax )
plt.show()

../_images/af846fd12cd6ee6d82c75dd2967dcbaf1777d96f153d046b472eed538ece3114.png

Renyi_entropy (based on UMIs)

Renyi entropy is a generalized diversity measure that captures different aspects of repertoire structure through an adjustable parameter alpha.

groups = [sample_col, Cgene_col]
renyi_entropy_df = BAITS.VDJ.tl.compute_grouped_index(bcr_loc_df, sample_col, Cgene_col, clone_col, groups, count_basis = 'UMI', Umi_col='bcUmi', index='renyi_entropy')

fig, ax = plt.subplots(1,1,figsize=(4, 3.5))
sns.lineplot(data=renyi_entropy_df, x="alpha", y="renyi_entropy", hue='Cgene', ax=ax )
plt.show()

../_images/f81da412dfaa2872bc2cee95489ca2cf9f2e1b1eaf2d63c305b0528261fd8347.png

Cluster BCR sequences to get BCR family#

We first focus on IGH BCRs (Immunoglobulin Heavy Chains) because they are the most important for analyzing BCR clusters and clonal families. By selecting only rows where the Cgene column equals ‘IGH’, we create a subset of the data that contains only IGH sequences, which will be used for downstream clustering and diversity analysis.

igh_bcr_loc_df = bcr_loc_df[bcr_loc_df['Cgene']=='IGH'].copy()

BCR sequences are first grouped by V gene, J gene, and CDR3 length. Within each group, sequences sharing ≥85% nucleotide identity (user-defined) are seen from the same BCR family.

igh_bcr_loc_df = BAITS.VDJ.tl.cluster_bcrs(igh_bcr_loc_df, threshold=0.85, sample_col='sample',  Vgene_col='Vgene', Jgene_col='Jgene', cdr3nt_col='cdr3nt')

igh_bcr_loc_df.iloc[:,-5:].head() 

	cdr3_length	clone_by_sample	CDR3_nt_length	BCR_familyID	Degree
0	13	1023	39	IGHV1-45_IGHJ4_39_0	0
1	20	1023	60	IGHV4-4_IGHJ5_60_0	0
2	14	1023	42	IGHV4-39_IGHJ4_42_2	0
3	21	1023	63	IGHV3-72_IGHJ4_63_0	1
4	18	1023	54	IGHV1-69_IGHJ6_54_1	0

Degree centrality

Clonal degree centrality: Measure of the number of similar clones (with a hamming distance ≤ 1) connected to each unique V(D)J sequence within a clonal family. A high degree centrality value suggests that a particular V(D)J sequence serves as a progenitor clone, from which multiple related sequences have diverged via somatic hypermutation. This pattern is indicative of active antigen-driven diversification and affinity maturation. Conversely, a low degree centrality value indicates that the sequence is peripheral or isolated within the clonal network, with few connected variants. This may suggest limited subsequent diversification, potentially representing either a recently emerged clone or one that has not undergone significant antigen-driven expansion.

plt.subplots(1,1,figsize=(3, 3))
sns.boxplot(igh_bcr_loc_df, x='sample', y='Degree', palette='Pastel1')
plt.ylabel('Clonal degree centrrality')
plt.show()

../_images/0e2b6492dd366abd7d1e4f63fa2ebb4da79cc5d783a57b55bfbd579f124359e6.png

Calculate clonal family diversification index

Clonal family diversification: Measures the unevenness in the distribution of unique V(D)J sequences across clonal families using the Gini index. A high value indicates that certain clonal families contain numerous distinct V(D)J sequences – a pattern often associated with antigen-driven expansion and diversification.

Clonal_family_diver_idx = BAITS.VDJ.tl.Clonal_family_diversification(igh_bcr_loc_df, sample_col='sample', cluster_col='BCR_familyID', cdr3_col='cdr3nt') 
    
Clonal_family_diver_idx.head()

	sample	Clonal_family_diver_index
0	P0516-LM	0.285553
1	P1128-LM	0.611375

plt.subplots(1,1,figsize=(3.5, 3.5))
sns.barplot(Clonal_family_diver_idx,x='sample', y='Clonal_family_diver_index', color='#4C72B0')
plt.ylabel('clonal family diversification index')
plt.show()

../_images/fdaf72571e1332bce5b214c0b740ca6226460a42f31b5562016d6ba36e6de1a3.png

BAITS.IR module

Contents