Overview

This repository is for reviewing the data files from the MUIC case study, part of Research for the People.

R version 4.5.1 (2025-06-13)

IDE used: Positron

Data

The path to the data on Google Drive is stored in an environment file. All data from the experiments are nested in that Google Drive folder.
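As a minimal sketch of how the path could be read from the environment in R: the variable name `MUIC_DATA_DIR` and the helper `get_data_dir()` are hypothetical (the actual variable name is not given here), and the sketch assumes the environment file is an `.Renviron` that R loads at startup.

```r
# Hypothetical helper: read the data folder path from an environment variable.
# "MUIC_DATA_DIR" is an assumed name, not the one used in this repository.
get_data_dir <- function(var = "MUIC_DATA_DIR") {
  path <- Sys.getenv(var, unset = NA_character_)
  # Fail early with a clear message if the variable is missing or empty
  if (is.na(path) || !nzchar(path)) {
    stop(
      "Environment variable '", var, "' is not set; add it to your environment file.",
      call. = FALSE
    )
  }
  path
}
```

With the variable set, `file.path(get_data_dir(), "some_file.parquet")` would build the path to a file inside that folder without hard-coding the location in scripts.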

Neoantigen

Neoantigen: a protein on a cancer cell’s surface
Results from a non-synonymous mutation in the tumor
Presented on HLA alleles
Elicits immune responses (immunogenicity)

Questions:

Why do the class I and class II report files have different column headers?
We want to make sure that peptides are presented on HLA alleles, or else they’re not able to be targeted, right?
We want to make sure the gene is expressed, correct?
Does the SHERPA presentation rank take into account the immunogenicity and binding ability of the peptides?
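One way to start on the first question is to list exactly which headers differ between the two reports. The sketch below is illustrative only: `compare_headers()` is a hypothetical helper, and the column vectors are toy values rather than the real headers of either file.

```r
# Hypothetical helper: split two sets of column names into shared and unique ones.
compare_headers <- function(class_i_cols, class_ii_cols) {
  list(
    shared        = intersect(class_i_cols, class_ii_cols),
    only_class_i  = setdiff(class_i_cols, class_ii_cols),
    only_class_ii = setdiff(class_ii_cols, class_i_cols)
  )
}

# Toy headers for illustration only. With the real files, pass
# names(arrow::read_parquet(path)) for each report instead.
class_i_cols  <- c("Gene Symbol", "Peptide", "SHERPA Presentation Rank")
class_ii_cols <- c("Gene Symbol", "Peptide", "Binding Rank")

header_diff <- compare_headers(class_i_cols, class_ii_cols)
header_diff$only_class_i
header_diff$only_class_ii
```

Seeing the unique columns side by side may make it easier to tell whether the difference is a naming inconsistency or reflects different scoring for each class.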

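# Read the class I neoantigen report
arrow::read_parquet(
  "data/DNA_SR24-58221_C1_neoantigen_class_I_report.parquet"
) |>
  # Keep peptides that are presented on at least one HLA allele,
  # come from an expressed gene, and have a positive immunogenicity score
  dplyr::filter(
    .data[['# of Presenting HLA alleles (per peptide)']] > 0,
    .data[['Expressed']] == "Y",
    .data[['Immunogenicity Score']] > 0
  ) |>
  # Per gene: count distinct peptides and average the SHERPA presentation rank
  dplyr::summarise(
    Npeptides = dplyr::n_distinct(.data[['Peptide']]),
    Avg_SHERPA = mean(.data[['SHERPA Presentation Rank']]),
    .by = dplyr::all_of(c("Gene Symbol"))
  ) |>
  # Sort by average SHERPA rank (highest first), then by peptide count
  dplyr::arrange(
    dplyr::desc(Avg_SHERPA),
    dplyr::desc(Npeptides)
  )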

# A tibble: 64 × 3
   `Gene Symbol` Npeptides Avg_SHERPA
   <chr>             <int>      <dbl>
 1 KIAA1279              1       82.5
 2 CSF2RB                1       79.2
 3 TBCK                  1       71.7
 4 ZNF787                1       69.2
 5 EFS                   1       63.4
 6 LYPLA1                1       61.7
 7 MTHFD2L               1       60.0
 8 MYLK                  1       58.8
 9 GNPAT                 1       56  
10 ZNF500                1       55.9
# ℹ 54 more rows