Overview

This repository is for reviewing the data files from the MUIC case study, part of Research for the People.

R version 4.5.1 (2025-06-13)

IDE used: Positron

Data

The path to the data on Google Drive is stored in an environment file. All data from the experiments are nested in that Google Drive folder.
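As a minimal sketch of how that path might be read in R, assuming the environment file is loaded into the session (for example an `.Renviron` file, or a file loaded with `readRenviron()`): the variable name `MUIC_DATA_DIR` and the helper `read_data_dir()` are hypothetical, since this README does not name the variable.

```r
# Hypothetical helper: look up the data folder from an environment variable.
# "MUIC_DATA_DIR" is an assumed name; replace it with the one in your env file.
read_data_dir <- function(var = "MUIC_DATA_DIR") {
  path <- Sys.getenv(var, unset = NA_character_)
  # Fail early with a clear message if the variable is missing or empty
  if (is.na(path) || !nzchar(path)) {
    stop("Environment variable '", var, "' is not set; check the environment file.")
  }
  path
}
```

Failing early keeps a missing or misnamed variable from surfacing later as a confusing "file not found" error when a report is read.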

Neoantigen

  • Neoantigen: a protein on a cancer cell’s surface
  • Results from a non-synonymous mutation in the tumor
  • Presented on HLA alleles
  • Elicits an immune response (immunogenicity)

Questions:

  • Why do the class I and class II report files have different column headers?

  • We want to make sure that peptides are presented on HLA alleles, or else they’re not able to be targeted, right?

  • We want to make sure the gene is expressed, correct?

  • Does the SHERPA presentation rank take into account immunogenicity and binding ability of the peptides?
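For the first question, one way to see exactly which headers differ is to compare the two sets of column names directly. The sketch below uses only base R; `compare_headers()` is a hypothetical helper, and the example vectors are illustrative rather than the real report headers (the `Class II Only Column` entry is made up).

```r
# Hypothetical helper: split two sets of column names into shared and unique
compare_headers <- function(class_i, class_ii) {
  list(
    shared        = intersect(class_i, class_ii),
    only_class_i  = setdiff(class_i, class_ii),
    only_class_ii = setdiff(class_ii, class_i)
  )
}

# Illustrative input only; these are not the full headers of either report
class_i_cols  <- c("Peptide", "Gene Symbol", "Expressed", "SHERPA Presentation Rank")
class_ii_cols <- c("Peptide", "Gene Symbol", "Expressed", "Class II Only Column")

header_diff <- compare_headers(class_i_cols, class_ii_cols)
print(header_diff$only_class_i)
print(header_diff$only_class_ii)
```

With the real files, the inputs would be the `names()` of each data frame returned by `arrow::read_parquet()`.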

# Class I report: keep peptides that are presented, expressed, and immunogenic
arrow::read_parquet(
  "data/DNA_SR24-58221_C1_neoantigen_class_I_report.parquet"
) |>
  dplyr::filter(
    .data[['# of Presenting HLA alleles (per peptide)']] > 0,
    .data[['Expressed']] == "Y",
    .data[['Immunogenicity Score']] > 0
  ) |>
  # Per gene: count distinct peptides and average the SHERPA presentation rank
  dplyr::summarise(
    Npeptides = dplyr::n_distinct(.data[['Peptide']]),
    Avg_SHERPA = mean(.data[['SHERPA Presentation Rank']]),
    .by = dplyr::all_of(c("Gene Symbol"))
  ) |>
  # Sort by average rank, then by peptide count, both descending
  dplyr::arrange(
    dplyr::desc(Avg_SHERPA),
    dplyr::desc(Npeptides)
  )
# A tibble: 64 × 3
   `Gene Symbol` Npeptides Avg_SHERPA
   <chr>             <int>      <dbl>
 1 KIAA1279              1       82.5
 2 CSF2RB                1       79.2
 3 TBCK                  1       71.7
 4 ZNF787                1       69.2
 5 EFS                   1       63.4
 6 LYPLA1                1       61.7
 7 MTHFD2L               1       60.0
 8 MYLK                  1       58.8
 9 GNPAT                 1       56  
10 ZNF500                1       55.9
# ℹ 54 more rows