Usage#
Functional API (Recommended)#
The functional API is accessible from the root module, and is the easiest way to use SCEPTR. When using the functional API, you will be using the default SCEPTR model (see the model variants section below). To begin analysing TCR data with sceptr, you must first load the TCR data into memory in the prescribed format using pandas.
Tip
SCEPTR only recognises TCR V gene symbols that are IMGT-compliant, and also known to be functional (i.e. known pseudogenes or ORFs are not allowed). For easy standardisation of TCR gene nomenclature in your data, as well as filtering your data for functional V/J genes, check out tidytcells.
>>> from pandas import DataFrame
>>> tcrs = DataFrame(
... data = {
... "TRAV": ["TRAV38-1*01", "TRAV3*01", "TRAV13-2*01", "TRAV38-2/DV8*01"],
... "CDR3A": ["CAHRSAGGGTSYGKLTF", "CAVDNARLMF", "CAERIRKGQVLTGGGNKLTF", "CAYRSAGGGTSYGKLTF"],
... "TRBV": ["TRBV2*01", "TRBV25-1*01", "TRBV9*01", "TRBV2*01"],
... "CDR3B": ["CASSEFQGDNEQFF", "CASSDGSFNEQFF", "CASSVGDLLTGELFF", "CASSPGTGGNEQYF"],
... },
... index = [0,1,2,3]
... )
>>> print(tcrs)
TRAV CDR3A TRBV CDR3B
0 TRAV38-1*01 CAHRSAGGGTSYGKLTF TRBV2*01 CASSEFQGDNEQFF
1 TRAV3*01 CAVDNARLMF TRBV25-1*01 CASSDGSFNEQFF
2 TRAV13-2*01 CAERIRKGQVLTGGGNKLTF TRBV9*01 CASSVGDLLTGELFF
3 TRAV38-2/DV8*01 CAYRSAGGGTSYGKLTF TRBV2*01 CASSPGTGGNEQYF
calc_cdist_matrix#
As the name suggests, calc_cdist_matrix() gives you an easy
way to calculate a cross-distance matrix between two sets of TCRs.
>>> import sceptr
>>> cdist_matrix = sceptr.calc_cdist_matrix(tcrs.iloc[:2], tcrs.iloc[2:])
>>> print(cdist_matrix)
[[1.2849895 0.7521934]
[1.4653425 1.4646544]]
calc_pdist_vector#
If you’re only interested in calculating distances within a set,
calc_pdist_vector() gives you a one-dimensional array of
within-set distances.
>>> pdist_vector = sceptr.calc_pdist_vector(tcrs)
>>> print(pdist_vector)
[1.4135991 1.2849894 0.75219345 1.4653426 1.4646543 1.287208 ]
Tip
The end result of using the calc_cdist_matrix() and
calc_pdist_vector() functions are equivalent to generating
sceptr’s TCR representations first with
calc_vector_representations(), then using scipy’s cdist
or pdist
functions to get the corresponding matrix or vector, respectively. But on
machines with CUDA-enabled GPUs,
directly using sceptr’s calc_cdist_matrix() and
calc_pdist_vector() functions will run faster, as it
internally runs all computations on the GPU.
calc_vector_representations#
If you want to directly operate on sceptr’s TCR representations, you can use
calc_vector_representations().
>>> reps = sceptr.calc_vector_representations(tcrs)
>>> print(reps.shape)
(4, 64)
calc_residue_representations#
The package also provides the user with an easy way to get access to SCEPTR’s
internal representations of each individual amino acid residue in the tokenised
form of its input TCRs, as outputted by the penultimate layer of its
self-attention stack. Interested users can use
calc_residue_representations(). Please refer to the
documentation for the ResidueRepresentations class
for details on how to interpret the output.
>>> res_reps = sceptr.calc_residue_representations(tcrs)
>>> print(res_reps)
ResidueRepresentations[num_tcrs: 4, rep_dim: 64]
Model variants#
The sceptr.variant submodule allows users access a variety of
non-default SCEPTR model variants, and use them for TCR analysis. The submodule
exposes functions which return Sceptr objects with
the model state of the chosen variant loaded. These model instances expose the
same functions as those used in the functional API, so you can just plug and
play. For example:
>>> from sceptr import variant
>>> sceptr_tiny = variant.tiny()
>>> tiny_reps = sceptr_tiny.calc_vector_representations(tcrs)
>>> print(tiny_reps.shape)
(4, 16)
Hardware acceleration / device selection#
By default, SCEPTR will detect any available devices with harware-acceleration
capabilities and automatically load models onto those devices. Currently,
CUDA-enabled devices are supported,
with MPS pending on better upstream Pytorch support. In cases where you would
like to explicitly limit SCEPTR to using the CPU, you can call
sceptr.disable_hardware_acceleration().
Mus musculus support (Experimental)#
Caution
This is an experimental feature!
You can now use SCEPTR to generate representations of Mus musculus TCRs. See
sceptr.setup() for more details.
Prescribed data format#
SCEPTR expects to receive TCR data in the form of pyrepseq standard format-compliant
pandas DataFrames. All TCR data should be represented as a DataFrame with the following
structure and data types. The column order is irrelevant. Each row should
represent one TCR. Incomplete rows are allowed (e.g. only beta chain data
available) as long as the SCEPTR variant that is being used
has at least some partial information to go on. Extra columns are also allowed.
Column name |
Column datatype |
Column contents |
|---|---|---|
TRAV |
str |
IMGT symbol for the alpha chain V gene |
CDR3A |
str |
Amino acid sequence of the alpha chain CDR3, including the first C and last W/F residues, in all caps |
TRBV |
str |
IMGT symbol for the beta chain V gene |
CDR3B |
str |
Amino acid sequence of the beta chain CDR3, including the first C and last W/F residues, in all caps |