DeepRVAT 2.0

DeepRVAT gene scores can be approximated as a sum of variant-level effects.

DeepRVAT integrates variant annotations using deep set networks to boost rare variant association testing. The pre-trained DeepRVAT model can be used to compute gene impairment scores for new genes and samples.

My work consisted of revisiting DeepRVAT along four dimensions:

Robustness and Performance

The Max Permutation Invariant Function (PIF) cannot model interactions or compound effects. For example, one variant might suppress or mask another, while two moderate-impact variants together could produce a stronger effect than either alone.
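The limitation can be illustrated with a toy comparison of max and sum pooling. This is a minimal sketch on scalar scores with made-up values; the function names are hypothetical, and a deep set model pools learned per-variant embeddings rather than raw scalars.

```python
def max_pool(variant_scores):
    """Max pooling: only the single largest variant score survives."""
    return max(variant_scores, default=0.0)


def sum_pool(variant_scores):
    """Sum pooling: every variant contributes additively."""
    return sum(variant_scores)


# Illustrative values only (not taken from any real annotation).
one_moderate = [0.4]
two_moderate = [0.4, 0.4]

# Max pooling cannot tell one moderate variant from two.
max_is_blind = max_pool(one_moderate) == max_pool(two_moderate)

# Sum pooling separates the two cases.
sum_separates = sum_pool(two_moderate) > sum_pool(one_moderate)
```

Sum pooling captures additive compound effects only; suppression or masking between variants would need an explicit interaction term, which neither pooling function provides.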

Removing the final sigmoid layer alleviates issues with burden distribution and produces scores that better align with biological expectations.

Missing values should be treated as unknowns, not as defaults. This was addressed through the introduction of a dummy variable.
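One common way to realise such a dummy variable is a missing-indicator encoding, sketched below under stated assumptions: the helper name and the fill value of 0.0 are illustrative choices, not necessarily those used in DeepRVAT.

```python
import math


def encode_with_missing_indicator(values):
    """Return (filled_value, is_missing) pairs for one annotation column.

    The fill value 0.0 is an arbitrary placeholder (assumption); the
    indicator lets a model distinguish it from a genuine score of 0.0.
    """
    encoded = []
    for v in values:
        missing = v is None or (isinstance(v, float) and math.isnan(v))
        encoded.append((0.0 if missing else float(v), 1.0 if missing else 0.0))
    return encoded


# Illustrative annotation column: a true zero and two missing entries.
annotation = [0.8, None, 0.0, float("nan")]
encoded = encode_with_missing_indicator(annotation)
```

With this encoding, the true score of 0.0 and the two missing entries share the same filled value but differ in the indicator column.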

Insertions and deletions (indels) can alter the reading frame (frameshift and in-frame changes) and can disrupt splicing or regulatory regions, but annotation scores for them are often unavailable. In non-coding regions, their effects are even harder to predict, though the impact remains significant.

Some indels can be imputed by approximating them as single-nucleotide variants at the affected loci.
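A rough sketch of this idea follows. The aggregation rule (taking the maximum SNV score over the affected reference positions), the function name, and the lookup table are all assumptions made for illustration; the text above does not specify how the approximation is done.

```python
def impute_indel_score(pos, ref, snv_scores):
    """Approximate an indel score from SNV scores at the affected loci.

    pos:        1-based position of the first REF base (VCF-style, assumed)
    ref:        reference allele string
    snv_scores: hypothetical lookup, position -> SNV score at that position

    Simplification: the affected span is every base listed in REF,
    including the anchor base. Returns None when no SNV score is known,
    so the value stays missing instead of receiving a default.
    """
    affected = range(pos, pos + max(len(ref), 1))
    known = [snv_scores[p] for p in affected if p in snv_scores]
    return max(known) if known else None


# Illustrative scores only.
snv_scores = {100: 0.2, 101: 0.7, 102: 0.4}

deletion_score = impute_indel_score(100, "ACG", snv_scores)   # spans 100..102
insertion_score = impute_indel_score(102, "G", snv_scores)    # anchored at 102
unknown_score = impute_indel_score(500, "T", snv_scores)      # no SNV score nearby
```

Returning None for unscored regions keeps the imputation compatible with treating missing values as unknowns.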

Interpretability

Genes with LoF variants can be used as markers for burden rescaling.

We compute rescaling factors:

X0: median of gene burdens of individuals with no variants

X1: median of gene burdens of individuals with exactly one loftee_hc=1 variant

Rescaling factors (for each model):

Rescaling burdens improves interpretability and downstream compatibility. Technically, unscaled burdens are difficult to compare with other tools because they lack a clear point of reference.
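The two medians can be computed and applied as in the sketch below. Two points are assumptions: X1 is read as the median over individuals carrying exactly one loftee_hc=1 variant, and the rescaling itself is taken to be the linear map that sends X0 to 0 and X1 to 1. Function and argument names are hypothetical, and the numbers are made up.

```python
from statistics import median


def rescaling_factors(burdens, n_variants, n_lof_hc):
    """Compute (X0, X1) for one gene and one model.

    burdens:    gene burden per individual
    n_variants: number of variants per individual in the gene
    n_lof_hc:   number of loftee_hc=1 variants per individual
    """
    x0 = median([b for b, n in zip(burdens, n_variants) if n == 0])
    x1 = median([b for b, k in zip(burdens, n_lof_hc) if k == 1])
    return x0, x1


def rescale(burden, x0, x1):
    """Assumed linear rescaling: X0 maps to 0 and X1 maps to 1."""
    if x1 == x0:
        raise ValueError("X0 and X1 coincide; rescaling is undefined")
    return (burden - x0) / (x1 - x0)


# Illustrative data for six individuals.
burdens = [0.10, 0.12, 0.11, 0.50, 0.54, 0.30]
n_variants = [0, 0, 0, 1, 1, 1]
n_lof_hc = [0, 0, 0, 1, 1, 0]

x0, x1 = rescaling_factors(burdens, n_variants, n_lof_hc)
rescaled = [rescale(b, x0, x1) for b in burdens]
```

Under this assumed map, a rescaled burden near 0 reads as "no impairment" and a value near 1 as "comparable to a typical high-confidence LoF carrier", which gives the reference point that unscaled burdens lack.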

Usability

In most samples, DeepRVAT sees a single variant in a gene, not a set. For those samples, the variant-level score is therefore already the gene impairment score.

For cases with multiple variants, summing variant-level scores instead of re-running the full end-to-end model yields comparable performance. This allows precomputation of variant-level scores, paving the way for ensemble methods that complement existing predictors such as CADD.
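The additive approximation reduces gene scoring to a table lookup followed by a sum, as in the minimal sketch below. The variant identifiers, scores, and function name are hypothetical; in practice the table would hold precomputed variant-level scores.

```python
def gene_score_from_variants(sample_variants, variant_scores):
    """Approximate a gene impairment score as the sum of variant-level scores."""
    return sum(variant_scores[v] for v in sample_variants)


# Hypothetical precomputed variant-level scores for one gene.
variant_scores = {
    "chr1:100:A:T": 0.30,
    "chr1:250:G:C": 0.05,
    "chr1:900:C:T": 0.60,
}

# Variants carried by each sample in that gene.
samples = {
    "s1": [],                                # no variants: score 0
    "s2": ["chr1:100:A:T"],                  # single variant: score equals variant score
    "s3": ["chr1:100:A:T", "chr1:900:C:T"],  # multiple variants: scores are summed
}

gene_scores = {
    sample: gene_score_from_variants(variants, variant_scores)
    for sample, variants in samples.items()
}
```

Because the lookup table is independent of the cohort, it can be computed once and reused for new samples without running the model again.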