Overview
This repository is for reviewing the data files from the MUIC case study, part of Research for the People.
R version 4.5.1 (2025-06-13)
IDE used: Positron
Data
The path to the data on Google Drive is stored in an environment file. All data from the experiments are nested in that Google Drive folder.
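As a minimal sketch of that setup, the path could be read from an environment variable at the start of a session. This assumes the environment file is something like a project-level `.Renviron` and that the variable is called `MUIC_DATA_DIR`; both are hypothetical, since the actual file and variable names are not given here.

```r
# Hypothetical helper: look up the data folder from an environment variable.
# `MUIC_DATA_DIR` is an assumed name, not one confirmed by this repository.
get_data_dir <- function(var = "MUIC_DATA_DIR") {
  path <- Sys.getenv(var, unset = NA_character_)
  if (is.na(path) || !nzchar(path)) {
    stop("Environment variable '", var, "' is not set.", call. = FALSE)
  }
  path
}

# Build a path to a file inside the data folder.
data_file <- function(..., var = "MUIC_DATA_DIR") {
  file.path(get_data_dir(var), ...)
}
```

Failing early with a clear message when the variable is missing tends to be easier to debug than a later "file not found" error from a reader function.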
Neoantigen
- Neoantigen: a protein on a cancer cell’s surface
- The protein results from a non-synonymous mutation in the tumor
- Presented on HLA alleles
- Elicits immune responses (immunogenicity)
Questions:
Why do the class I and class II report files have different column headers?
We want to make sure that peptides are presented on HLA alleles, or else they’re not able to be targeted, right?
We want to make sure the gene is expressed, correct?
Does the SHERPA presentation rank take into account the immunogenicity and binding ability of the peptides?
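# Read the class I neoantigen report
arrow::read_parquet(
  "data/DNA_SR24-58221_C1_neoantigen_class_I_report.parquet"
) |>
  # Keep peptides that are presented on at least one HLA allele,
  # come from an expressed gene, and have a positive immunogenicity score
  dplyr::filter(
    .data[["# of Presenting HLA alleles (per peptide)"]] > 0,
    .data[["Expressed"]] == "Y",
    .data[["Immunogenicity Score"]] > 0
  ) |>
  # Per gene: number of distinct peptides and mean SHERPA presentation rank
  dplyr::summarise(
    Npeptides = dplyr::n_distinct(.data[["Peptide"]]),
    Avg_SHERPA = mean(.data[["SHERPA Presentation Rank"]]),
    .by = dplyr::all_of(c("Gene Symbol"))
  ) |>
  # Sort by mean rank, then by peptide count, both descending
  dplyr::arrange(
    dplyr::desc(Avg_SHERPA),
    dplyr::desc(Npeptides)
  )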
# A tibble: 64 × 3
   `Gene Symbol` Npeptides Avg_SHERPA
   <chr>             <int>      <dbl>
 1 KIAA1279              1       82.5
 2 CSF2RB                1       79.2
 3 TBCK                  1       71.7
 4 ZNF787                1       69.2
 5 EFS                   1       63.4
 6 LYPLA1                1       61.7
 7 MTHFD2L               1       60.0
 8 MYLK                  1       58.8
 9 GNPAT                 1       56
10 ZNF500                1       55.9
# ℹ 54 more rows
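For the first question above, about the class I and class II reports having different column headers, one way to see exactly where they differ is to compare the two sets of column names directly. The helper below is a hypothetical sketch, and the headers in the example call are made up for illustration rather than taken from the real reports.

```r
# Hypothetical helper: compare two sets of column names and report
# which are shared and which appear in only one of the reports.
compare_headers <- function(class_i, class_ii) {
  list(
    shared        = intersect(class_i, class_ii),
    only_class_i  = setdiff(class_i, class_ii),
    only_class_ii = setdiff(class_ii, class_i)
  )
}

# Example with made-up headers, not the real report columns.
compare_headers(
  c("Peptide", "Gene Symbol", "Expressed"),
  c("Peptide", "Gene Symbol", "Binding Core")
)
```

With the real files, the inputs would be the result of `names()` applied to each report after reading it with `arrow::read_parquet()`.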