sceptr.model

class sceptr.model.Sceptr

Loads a trained state of a SCEPTR model and provides an easy interface for generating TCR representations and making inferences from them. Instances can be obtained through the sceptr.variant submodule.

name

The name of the model variant.

Type:

str

calc_cdist_matrix(anchors: DataFrame, comparisons: DataFrame) → ndarray[Any, dtype[float32]]

Generate a cdist matrix between two collections of TCRs.

Parameters:
  • anchors (DataFrame) – DataFrame specifying the first (anchor) collection of input TCRs. It must be in the prescribed format.

  • comparisons (DataFrame) – DataFrame specifying the second (comparison) collection of input TCRs. It must be in the prescribed format.

Returns:

A 2D numpy ndarray representing a cdist matrix between TCRs from anchors and comparisons. The returned array will have shape \((X, Y)\) where \(X\) is the number of TCRs in anchors and \(Y\) is the number of TCRs in comparisons.

Return type:

NDArray[numpy.float32]
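
As a rough illustration of how a matrix of shape \((X, Y)\) can be read, the sketch below builds a stand-in cdist matrix from random vectors and looks up the closest comparison for each anchor. The vectors are made up rather than real SCEPTR output, and the use of Euclidean distance is an assumption made for this example only; the names anchor_vecs, comparison_vecs and nearest_comparison are hypothetical.

```python
import numpy as np

# Stand-in representation vectors (random numbers, NOT real SCEPTR output).
rng = np.random.default_rng(0)
anchor_vecs = rng.normal(size=(3, 64)).astype(np.float32)      # X = 3 anchors
comparison_vecs = rng.normal(size=(5, 64)).astype(np.float32)  # Y = 5 comparisons

# Assumption for illustration: Euclidean distance between every anchor/comparison pair.
diffs = anchor_vecs[:, None, :] - comparison_vecs[None, :, :]
cdist_matrix = np.sqrt((diffs ** 2).sum(axis=-1))

# Entry [i, j] is the distance between anchor i and comparison j,
# so an argmin along axis 1 gives the closest comparison for each anchor.
nearest_comparison = cdist_matrix.argmin(axis=1)

print(cdist_matrix.shape)
```

With real data, the matrix returned by calc_cdist_matrix() would take the place of cdist_matrix here.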

calc_pdist_vector(instances: DataFrame) → ndarray[Any, dtype[float32]]

Generate a pdist vector of distances between each pair of TCRs in the input data.

Parameters:

instances (DataFrame) – DataFrame specifying the input TCRs. It must be in the prescribed format.

Returns:

A 1D numpy ndarray representing a pdist vector of distances between each pair of TCRs in instances. The returned array will have shape \((\frac{1}{2}N(N-1),)\), where \(N\) is the number of TCRs in instances.

Return type:

NDArray[numpy.float32]
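
To look up the distance between two particular TCRs in a vector of length \(\frac{1}{2}N(N-1)\), one needs the position of that pair in the vector. The sketch below assumes the pairs are ordered as in scipy.spatial.distance.pdist, that is (0, 1), (0, 2), ..., (0, N-1), (1, 2), and so on; this ordering is an assumption and is not stated above. The helper names condensed_index and pdist_length are hypothetical.

```python
def pdist_length(n: int) -> int:
    """Number of unordered pairs among n items, i.e. N(N-1)/2."""
    return n * (n - 1) // 2


def condensed_index(i: int, j: int, n: int) -> int:
    """Position of the pair (i, j) in a pdist vector over n items.

    Assumes scipy-style ordering of pairs: (0, 1), (0, 2), ..., (1, 2), ...
    """
    if i == j:
        raise ValueError("a pdist vector holds no entry for an item paired with itself")
    if i > j:
        i, j = j, i  # the pair is unordered, so normalise to i < j
    return n * i - i * (i + 1) // 2 + (j - i - 1)


# With N = 4 TCRs there are 6 pairs in total.
print(pdist_length(4))
print(condensed_index(1, 3, 4))
```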

calc_residue_representations(instances: DataFrame) → ResidueRepresentations

Map each TCR to a set of amino acid residue-level representations. The residue-level representations are the output of the penultimate self-attention layer, which is also what the average_pooling() variant uses when generating receptor-level representations.

Note

This method is currently only supported on SCEPTR model variants such as the default one that 1) use both the alpha and beta chains, and 2) take into account all three CDR loops from each chain.

Parameters:

instances (DataFrame) – DataFrame specifying the input TCRs. It must be in the prescribed format.

Returns:

An array of representation vectors for each amino acid residue in the tokenised forms of the input TCRs. For details on how to interpret/use this output, please refer to the documentation for ResidueRepresentations.

Return type:

ResidueRepresentations

calc_vector_representations(instances: DataFrame) → ndarray[Any, dtype[float32]]

Map TCRs to their corresponding vector representations.

Parameters:

instances (DataFrame) – DataFrame specifying the input TCRs. It must be in the prescribed format.

Returns:

A 2D numpy ndarray object where every row vector corresponds to a row in instances. The returned array will have shape \((N, D)\) where \(N\) is the number of TCRs in instances and \(D\) is the dimensionality of the current model variant.

Return type:

NDArray[numpy.float32]

disable_hardware_acceleration() → None

Move this Sceptr instance and its computations to the CPU. For toggling the package-level setting, see sceptr.disable_hardware_acceleration().

enable_hardware_acceleration() → None

Move this Sceptr instance and its computations to a hardware-accelerated device, if available (e.g. CUDA-enabled GPU). For toggling the package-level setting, see sceptr.enable_hardware_acceleration().

set_batch_size(batch_size: int) → None

Set the batch size used when generating TCR vector representations. That is, how many representations are computed at a time on the CPU / GPU. By default, the batch size is set to 512.
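
The general idea of batching can be sketched as follows. This is not SCEPTR's internal code, only a minimal illustration of splitting a collection into chunks of at most batch_size items; the helper iter_batches is hypothetical.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int = 512) -> Iterator[Sequence]:
    """Yield consecutive slices of items, each holding at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 1200 inputs with a batch size of 512 give two full batches and one partial batch.
batch_lengths = [len(batch) for batch in iter_batches(list(range(1200)), 512)]
print(batch_lengths)
```

A smaller batch size lowers peak memory use at the cost of more passes through the model, while a larger one does the opposite.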

class sceptr.model.ResidueRepresentations

An object containing the information necessary to interpret and operate on residue-level representations from the SCEPTR family of models. Instances of this class can be obtained via the sceptr.calc_residue_representations() function or the method of the same name on the Sceptr class.

This feature gives power users easy access to model internals, so that they can examine what kind of information SCEPTR focuses on at the level of individual amino acid residues. The “Examples” section below illustrates how to use instances of this class to examine SCEPTR’s residue-level embeddings.

representation_array

A numpy float array containing the residue-level representation data. The array is of shape \((N, M, D)\) where \(N\) is the number of TCRs in the original input, \(M\) is the maximum number of residues among the input TCRs in their tokenised forms, and \(D\) is the dimensionality of the model variant that produced the result.

Type:

NDArray[numpy.float32]

compartment_mask

A numpy integer array mapping residue indices in the representation_array to the corresponding CDR loops of the input TCRs. The array is of shape \((N, M)\) where \(N\) is the number of TCRs in the original input, and \(M\) is the maximum number of residues among the input TCRs in their tokenised forms. Entries in compartment_mask have the following values:

If residue at index is from:    Entry has value:
None (padding token)            0
CDR1A                           1
CDR2A                           2
CDR3A                           3
CDR1B                           4
CDR2B                           5
CDR3B                           6

Within each CDR loop compartment, residues are ordered from N- to C-terminal from left to right.

Type:

NDArray[numpy.int64]
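
The mask values listed above can be used as boolean selectors on the representation array. The sketch below uses a small hand-written mask and a zero-filled array as stand-ins; the token layout shown is made up for illustration and does not reflect real SCEPTR output. The names COMPARTMENTS, mask, reps, counts and non_padding are hypothetical.

```python
import numpy as np

# Mask values and their meaning, as listed in the table above.
COMPARTMENTS = {
    0: "padding", 1: "CDR1A", 2: "CDR2A", 3: "CDR3A",
    4: "CDR1B", 5: "CDR2B", 6: "CDR3B",
}

# Hypothetical mask for N = 2 TCRs padded to M = 8 tokens (not real output).
mask = np.array([
    [1, 1, 2, 3, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6, 0, 0],
])
# Stand-in representation array of shape (N, M, D) with D = 64.
reps = np.zeros((2, 8, 64), dtype=np.float32)

# Count how many tokens of the first TCR fall into each compartment.
counts = {name: int((mask[0] == value).sum()) for value, name in COMPARTMENTS.items()}

# Drop the padding positions of the second TCR.
non_padding = reps[1][mask[1] != 0]

print(counts)
print(non_padding.shape)
```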

Examples

In the following we show how to extract the residue-level representations for the beta-chain CDR3 amino acid sequences of all input TCR sequences. To start with, we define a DataFrame tcrs that contains the sequence data for four TCRs.

>>> from pandas import DataFrame
>>> tcrs = DataFrame(
...         data = {
...                 "TRAV": ["TRAV38-1*01", "TRAV3*01", "TRAV13-2*01", "TRAV38-2/DV8*01"],
...                 "CDR3A": ["CAHRSAGGGTSYGKLTF", "CAVDNARLMF", "CAERIRKGQVLTGGGNKLTF", "CAYRSAGGGTSYGKLTF"],
...                 "TRBV": ["TRBV2*01", "TRBV25-1*01", "TRBV9*01", "TRBV2*01"],
...                 "CDR3B": ["CASSEFQGDNEQFF", "CASSDGSFNEQFF", "CASSVGDLLTGELFF", "CASSPGTGGNEQYF"],
...         },
...         index = [0,1,2,3]
... )
>>> print(tcrs)
              TRAV                 CDR3A         TRBV            CDR3B
0      TRAV38-1*01     CAHRSAGGGTSYGKLTF     TRBV2*01   CASSEFQGDNEQFF
1         TRAV3*01            CAVDNARLMF  TRBV25-1*01    CASSDGSFNEQFF
2      TRAV13-2*01  CAERIRKGQVLTGGGNKLTF     TRBV9*01  CASSVGDLLTGELFF
3  TRAV38-2/DV8*01     CAYRSAGGGTSYGKLTF     TRBV2*01   CASSPGTGGNEQYF

We can get the residue-level representations for those TCRs like so:

>>> import sceptr
>>> res_reps = sceptr.calc_residue_representations(tcrs)
>>> print(res_reps)
ResidueRepresentations[num_tcrs: 4, rep_dim: 64]

Now we can iterate through the residue-level representation subarray corresponding to each TCR and use the compartment mask to select only the representations for the beta chain CDR3 sequence.

>>> cdr3b_reps = []
>>> for reps, mask in zip(res_reps.representation_array, res_reps.compartment_mask):
...     cdr3b_rep = reps[mask == 6] # collect only the residue representations for the beta CDR3 sequence
...     cdr3b_reps.append(cdr3b_rep)

Now we have a list containing four numpy ndarrays, each of which is a matrix whose row vectors are representations of individual CDR3B amino acid residues.

>>> type(cdr3b_reps[0])
<class 'numpy.ndarray'>
>>> cdr3b_reps[0].shape
(14, 64)

Note that the zeroth element of the shape tuple above is 14 because the CDR3B sequence of the first TCR in tcrs is 14 residues long, and the first element of the shape tuple is 64 because the model dimensionality of the default SCEPTR variant is 64.
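
Because the per-TCR matrices have different numbers of rows, they cannot be stacked directly. One possible way to compare them is to average over the residue axis so that each TCR is summarised by a single vector. The sketch below does this on random stand-in arrays whose row counts match the CDR3B lengths of the four TCRs above; it is only an illustration of working with such a list, and is not a description of how SCEPTR computes its own receptor-level representations.

```python
import numpy as np

# Stand-ins for cdr3b_reps: one (length, 64) matrix per TCR, filled with
# random numbers. The lengths match the four CDR3B sequences in the example.
rng = np.random.default_rng(0)
cdr3b_reps = [
    rng.normal(size=(length, 64)).astype(np.float32)
    for length in (14, 13, 15, 14)
]

# Average over the residue axis, then stack into one array with a row per TCR.
pooled = np.stack([rep.mean(axis=0) for rep in cdr3b_reps])

print(pooled.shape)
```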