Isoformic: Isoform-level biological interpretation of transcriptomic data

The Isoformic method

Welcome to isoformic, a workflow for isoform-level biological interpretation of transcriptomic data.

Information

All data used for the examples here were extracted from the paper: “Landscape of Dysregulated Placental RNA Editing Associated With Preeclampsia” in which were analyzed generated RNA-Seq datasets from preeclamptic and control placentas. This dataset was chosen due for having many biological replicates with high sequencing depth.

# devtools::load_all()
library(isoformic)

Load dependency packages

library(fs)
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

Setting up the data

Initial Considerations

We highly recommend the use of Salmon for transcript-level abundance estimation and the swish method implemented in the fishpond R package for isoform-level differential expression.

The GENCODE project offers a good annotation of isoforms for the human and mouse genomes, including isoforms of non-coding genes. Using other sources of annotation can render completely different results for the following analysis.

Part 1: Data input

In this version you would need two essential input and two optional inputs.

Essential input 1: A transcript differential expression table. This table can be outputted from any kind of differential expression software you use but it needs to contain 1) information on Transcript name OR Ensembl Transcript ID per line 2) log2FoldChange information in a column named “log2FoldChange”. 3) p-value information in a column named “pvalue”. Any other columns on the table will not be used on the main analysis. This DET table should be the UNFILTERED version of your table.

path_package("isoformic")

## /private/var/folders/2q/937_bkg10svdwx1x00prs9nm0000gn/T/Rtmp5mEsdw/temp_libpath42b91e0a8903/isoformic

Example:

PE1_DETs <- read_csv(path_package("isoformic", "extdata", "DETs_fixed.csv"))
head(PE1_DETs)

## # A tibble: 6 × 4
##   transcript_id      log2FC  pvalue qvalue
##   <chr>               <dbl>   <dbl>  <dbl>
## 1 ENST00000456328.2 -0.692  0.0199  0.0940
## 2 ENST00000488147.1  0.0731 0.652   0.812 
## 3 ENST00000466430.5 -0.146  0.364   0.590 
## 4 ENST00000442987.3 -0.183  0.0841  0.242 
## 5 ENST00000494149.2 -0.728  0.00599 0.0420
## 6 ENST00000595919.1 -0.0789 0.951   0.977

Essential input 2: a FASTA file from GENCODE which corresponds to the same fasta you used for the transcriptome alignment. These need to be on the same version since here you will use names from that GENCODE version to do most of the mergings. The annotation used also need to be GENCODE since it posses the transcript_type column that will be used as information as well. If you wish you CAN provide that information through and external source not being GENCODE, and from that you will need a table with at least three columns: 1)a transcript_name column that needs to match those gene names on your DET table and on your TPM table, 2) a gene-name column to tell which gene those transcripts belong to and 3) a transcript_type column.

This table will also have to use transcript names from GENCODE and have a column stating the transcript biotype that is also got from GENCODE annotation. The statistics can be ‘pvalue’, ‘svalue’ or ‘qvalue’ but a ‘log2FoldChange’ between your case and control conditions is also needed for most of the plots.

Optional inputs: 1) A Transcript per million (TPM) table matching the transcripts in the differential expression table 2) a GFF3 file of the transcriptome version which corresponds to your FASTA and 3) a table of differentially expressed genes of that data to also use as comparison.

Any differential expression table can be used here but the pipeline authors, after multiple testing, reached the conclusion that the swish implementation for a differential transcript expression analysis using inferential replicate counts from Salmon is the one that performs the best for medium to high depth transcriptome libraries when looking at number of transcripts and significant values.

Not well annotated transcriptomes will not output results as these and all the tests here mentioned were done using human data.

Isoformic makes available example files that can be used as model for formatting the necessary files for running the workflow. The files can be found in the following path.

path_package("isoformic", "extdata")

## /private/var/folders/2q/937_bkg10svdwx1x00prs9nm0000gn/T/Rtmp5mEsdw/temp_libpath42b91e0a8903/isoformic/extdata

And the files available can be seen with:

path_package("isoformic", "extdata") |>
  dir_ls()

## /private/var/folders/2q/937_bkg10svdwx1x00prs9nm0000gn/T/Rtmp5mEsdw/temp_libpath42b91e0a8903/isoformic/extdata/DEGs_PE_fixed2.csv
## /private/var/folders/2q/937_bkg10svdwx1x00prs9nm0000gn/T/Rtmp5mEsdw/temp_libpath42b91e0a8903/isoformic/extdata/DETs_fixed.csv
## /private/var/folders/2q/937_bkg10svdwx1x00prs9nm0000gn/T/Rtmp5mEsdw/temp_libpath42b91e0a8903/isoformic/extdata/PE_1_DETs.csv
## /private/var/folders/2q/937_bkg10svdwx1x00prs9nm0000gn/T/Rtmp5mEsdw/temp_libpath42b91e0a8903/isoformic/extdata/PE_1_counts.csv
## /private/var/folders/2q/937_bkg10svdwx1x00prs9nm0000gn/T/Rtmp5mEsdw/temp_libpath42b91e0a8903/isoformic/extdata/c2.cp.reactome.v2023.1.Hs.symbols.gmt.txt
## /private/var/folders/2q/937_bkg10svdwx1x00prs9nm0000gn/T/Rtmp5mEsdw/temp_libpath42b91e0a8903/isoformic/extdata/hsapiens_REAC_subset.gmt
## /private/var/folders/2q/937_bkg10svdwx1x00prs9nm0000gn/T/Rtmp5mEsdw/temp_libpath42b91e0a8903/isoformic/extdata/isoformic-logo.png

PE1_DETs <- path_package("isoformic", "extdata", "DETs_fixed.csv") |>
  read_csv()
PE1_DEGs <- path_package("isoformic", "extdata", "DEGs_PE_fixed2.csv") |>
  read_csv()
PE1_counts <- path_package("isoformic", "extdata", "PE_1_counts.csv") |>
  read_csv() |>
  dplyr::rename(transcript_id = `...1`)

Here we load the table which points for the libraries that represent our cases (treatment) and our controls. In this library, cases are the pregnant woman with Preeclampsia and controls matched pregnant without Preeclampsia.

sample_table <- data.frame(
  samples = colnames(PE1_counts)[2:ncol(PE1_counts)],
  condition = c(rep("treatment", 8), rep("control", ncol(PE1_counts) - 9))
)
head(sample_table)

##       samples condition
## 1 SRR11498039 treatment
## 2 SRR11498040 treatment
## 3 SRR11498041 treatment
## 4 SRR11498042 treatment
## 5 SRR11498043 treatment
## 6 SRR11498044 treatment

Download reference files

The references used for this project were obtained from the GENCODE Project version 34 for the Human genome annotation. The annotation file in GFF3 format was obtained from https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.chr_patch_hapl_scaff.annotation.gff3.gz.

This step may take a while depending on the speed of your internet connection.

download_reference(version = "34", file_type = "fasta")
# download_reference(version = "34", file_type = "gtf")
download_reference(version = "34", file_type = "gff")

To download mouse references, it is necessary to include the letter ‘M’ in the version string (e.g., “M37”).

Part 2: Transcript to Gene and Gene to transcript reference tables

Using the FASTA file from GENCODE we will construct a transcript per gene dictionary table and add that information to the main DEG, DET and TPM table. This step will depend a lot on the names of the columns on your tables so in the Vignette we decided to change names to keep consistency.

The input used here is a FASTA file containing the transcript sequences and their annotation information downloaded from the GENCODE website with the specific version you used for the alignment. In the case here GENCODE v34.

fasta_path <- "data-raw/gencode.v34.transcripts.fa.gz"
read_lines(fasta_path, n_max = 5)

## [1] ">ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|"
## [2] "GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTC"                                                                 
## [3] "TCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGA"                                                                 
## [4] "TGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTG"                                                                 
## [5] "CAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGG"

As that header shows the imported table is still very weird and not tidy, so we pass it through the make_tx_to_gene function that will make it tidy and ready for further use.

tx_to_gene <- make_tx_to_gene(
  file_path = fasta_path,
  file_type = "fasta"
)
head(tx_to_gene)

## # A tibble: 6 × 8
##   transcript_id    gene_id havanna_gene_id havanna_transcript_id transcript_name
##   <chr>            <chr>   <chr>           <chr>                 <chr>          
## 1 ENST00000456328… ENSG00… OTTHUMG0000000… OTTHUMT00000362751.1  DDX11L1-202    
## 2 ENST00000450305… ENSG00… OTTHUMG0000000… OTTHUMT00000002844.2  DDX11L1-201    
## 3 ENST00000488147… ENSG00… OTTHUMG0000000… OTTHUMT00000002839.1  WASH7P-201     
## 4 ENST00000619216… ENSG00… -               -                     MIR6859-1-201  
## 5 ENST00000473358… ENSG00… OTTHUMG0000000… OTTHUMT00000002840.1  MIR1302-2HG-202
## 6 ENST00000469289… ENSG00… OTTHUMG0000000… OTTHUMT00000002841.2  MIR1302-2HG-201
## # ℹ 3 more variables: gene_name <chr>, tx_length <chr>, transcript_type <chr>

Now our tx_to_gene table has 6 columns that are in order: Ensembl transcript id, Ensembl gene id, Havanna gene id, Havanna transcript id, transcript name, gene name, transcript length and transcript type. For the DEG, DET and TPM table we will need the Ensembl gene id, the Gene name and the transcript type information so we can convert our tables for transcript_name and add the type information and if the gene is a DE to the DET table.

Select the columns with the gene id and the gene name info

tx_to_gene <- tx_to_gene |>
  dplyr::select(
    transcript_id, gene_id,
    transcript_name, gene_name,
    transcript_type
  )

First we add the gene_name information to the DEG table

gene_join <- tx_to_gene |>
  dplyr::select(gene_id, gene_name) |>
  distinct()
PE1_DEGs <- PE1_DEGs |>
  left_join(gene_join, by = "gene_id")

Now the transcript name for the TPM table

tpm_join <- tx_to_gene |>
  dplyr::select(transcript_id, transcript_name) |>
  distinct()
PE1_counts <- PE1_counts |>
  left_join(tpm_join, by = c("transcript_id"))

The DET table will be our main table for analysis.

Part 3: Constructing the main table

The Gene-level information will input for us categorical values to be added on the DET table. In more detail: we need to now if that transcript’s gene (1) pass on the gene-level expression cutoff values and (2) which type does that transcript belongs to.

There are multiple types on the Ensembl library and some of their definitions superpose to one another, the ones further analyzed here can be seen on this figure

knitr::include_graphics("https://i.imgur.com/UWoAr0k.png")

First we add the transcript name and type information to the DET table

transcript_joined <- tx_to_gene |>
  dplyr::select(transcript_id, transcript_name, transcript_type) |>
  distinct()
PE1_DETs <- PE1_DETs |>
  left_join(transcript_joined, by = "transcript_id")

For the gene expression level we have to convert the DEG table and do some cutting to get the genes which present as DE and exclude possible noise. Here we used the cutoffs of absolute log2FC higher than one and pvalue lower than 0.05

So we first filter the DEG table for the significant ones and the add it as a column on our main DET table using the isDEGsig function.

PE1_DEGs_new_names_sig <- PE1_DEGs |>
  filter(abs(log2FC) >= 1) |>
  filter(pvalue <= 0.05) |>
  dplyr::select(gene_name) |>
  drop_na()
DEGs_sig_joined <- PE1_DEGs_new_names_sig |>
  left_join(tx_to_gene, by = "gene_name")
transcript_gene_join <- tx_to_gene |>
  dplyr::select(transcript_name, gene_name) |>
  distinct()

PE1_DETs_final <- is_deg_sig(DEGs_sig_joined$transcript_name, PE1_DETs)
PE1_DETs_final <- PE1_DETs_final |>
  left_join(transcript_gene_join, by = "transcript_name")

And now we have all the tables we will need for all graphs and analyses.

One detail is that the DET final table now allows us to see genes whose transcripts are differentially expressed but their genes are not with a simple dplyr filter.

DETs_not_DEGs <- PE1_DETs_final |>
  filter(pvalue < 0.05, abs(log2FC) > 1, DEG_sig == "NO")

This table will represent cases which could be characterized as isoform switches, when two transcripts of the same gene are expressed in opposite directions what makes the total expression of that gene not be significant either up or down-regulated.

Colors

Before we start plotting we will define a general set of colors to be used through the entire pipeline associated with a certain type of transcript. Here we colored all the most abundant types separately and the less abundant on the same grey tone and name that vector accordingly

# TODO: @luciorq Check implementation of function `tx_type_palette()`
fixed_tx_biotypes <- c(
  "gene", "protein_coding", "retained_intron",
  "protein_coding_CDS_not_defined", "nonsense_mediated_decay",
  "lncRNA", "processed_pseudogene",
  "transcribed_unprocessed_pseudogene",
  "unprocessed_pseudogene", "non_stop_decay",
  "transcribed_unitary_pseudogene",
  "pseudogene", "unitary_pseudogene", "processed_transcript"
)

tx_type_color_names <- c(
  "#fb8072", "#a6d854", "#8da0cb", "#fc8d62",
  "#66c2a5", "#e78ac3", "#ffd92f", "#e5c494",
  "#d9d9d9", "#d9d9d9", "#d9d9d9", "#ffffb3",
  "#d9d9d9", "#d9d9d9"
)

names(tx_type_color_names) <- fixed_tx_biotypes

tx_type_color_names

Log2FC Plot

The simplest and first plot on this tutorial will be a log2FC plot, this plot will compare the foldchange of case vs control from the gene, to that of its transcripts adding to that the significance information.

For that we will make a combined version of the DEG table with the DET table using the function join_DEG_DET.

DEG_DET_table <- join_DEG_DET(PE1_DEGs, PE1_DETs_final, 1, 0.05)
head(DEG_DET_table)

## # A tibble: 6 × 9
##   id             log2FC pvalue qvalue name  transcript_type gene_name abs_log2FC
##   <chr>           <dbl>  <dbl>  <dbl> <chr> <chr>           <chr>          <dbl>
## 1 ENSG00000000… -0.320  0.0485  0.125 TSPA… gene            TSPAN6        0.320 
## 2 ENSG00000000…  0.381  0.107   0.222 TNMD  gene            TNMD          0.381 
## 3 ENSG00000000… -0.0693 0.260   0.417 DPM1  gene            DPM1          0.0693
## 4 ENSG00000000…  0.107  0.386   0.547 SCYL3 gene            SCYL3         0.107 
## 5 ENSG00000000…  0.162  0.229   0.380 C1or… gene            C1orf112      0.162 
## 6 ENSG00000000… -0.162  0.160   0.296 FGR   gene            FGR           0.162 
## # ℹ 1 more variable: significance <chr>

Now you just use the plotLog2FC for any gene you would like. The function also works well with a small vector of gene_names.

# selected_gene = "RBPJ"
plot_log2FC(DEG_DET_table, "RBPJ")

# Work here to look better? or just remove for now
# the best would be a loop that goes over each one of a list and
# plots them in a folder the default could be the DET not deg table
plot_obj <- plot_log2FC(DEG_DET_table, c("RBPJ", "EGFR", "PNCK"))

plot_obj +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Profile Plot

Another good more quantifiable way to visualize this switch is using the Transcript per Million Count of each transcript, compared to those of the gene between the case and control conditions. For this we use a profile plot that in one size plots the values of TPM for the case conditions and in the other the value of the TPM for the control conditions.

DEG_DET_table |>
  filter(!transcript_type %in% "gene")

## # A tibble: 70,870 × 9
##    id          log2FC  pvalue  qvalue name  transcript_type gene_name abs_log2FC
##    <chr>        <dbl>   <dbl>   <dbl> <chr> <chr>           <chr>          <dbl>
##  1 ENST000006…  0.640 3.11e-1 5.40e-1 ISG1… protein_coding  ISG15          0.640
##  2 ENST000003…  0.534 1.29e-2 7.07e-2 AGRN… protein_coding  AGRN           0.534
##  3 ENST000004… -0.530 5.81e-2 1.91e-1 AGRN… retained_intron AGRN           0.530
##  4 ENST000004…  1.83  1.66e-5 6.81e-4 RNF2… protein_coding  RNF223         1.83 
##  5 ENST000004…  0.415 2.20e-1 4.40e-1 TNFR… retained_intron TNFRSF4        0.415
##  6 ENST000003…  0.374 2.27e-1 4.48e-1 TNFR… protein_coding  TNFRSF4        0.374
##  7 ENST000003…  0.479 6.70e-2 2.09e-1 TAS1… protein_coding  TAS1R3         0.479
##  8 ENST000003…  0.217 3.82e-1 6.06e-1 DVL1… protein_coding  DVL1           0.217
##  9 ENST000003…  0.570 8.12e-2 2.37e-1 DVL1… protein_coding  DVL1           0.570
## 10 ENST000006…  1.91  4.34e-1 6.52e-1 DVL1… protein_coding  DVL1           1.91 
## # ℹ 70,860 more rows
## # ℹ 1 more variable: significance <chr>

unique(DEG_DET_table$transcript_type)

##  [1] "gene"                               "protein_coding"                    
##  [3] "retained_intron"                    "processed_transcript"              
##  [5] "lncRNA"                             "nonsense_mediated_decay"           
##  [7] "processed_pseudogene"               "unprocessed_pseudogene"            
##  [9] "transcribed_unprocessed_pseudogene" "non_stop_decay"                    
## [11] "transcribed_unitary_pseudogene"     "pseudogene"                        
## [13] "unitary_pseudogene"

profile_data_df <- prepare_profile_data(
  txi_transcript = PE1_counts,
  tx_to_gene = tx_to_gene,
  sample_metadata = sample_table,
  de_result_gene = PE1_DEGs,
  de_result_transcript = PE1_DETs,
  var = "condition",
  var_levels = c("control", "treatment")
)
#
# # Create output directory if don't exist yet
# path_to_save <- "results/profile_plot"
# fs::dir_create(path_to_save)

profile_plot <- plot_tx_expr(
  genes_to_plot = "RBPJ",
  profile_data = profile_data_df
)
profile_plot

profile_data_df |>
  filter(genename %in% "IL2RA")

## # A tibble: 2 × 7
##   genename condition mean_TPM    SD transcript_type DE    parent_gene
##   <chr>    <fct>        <dbl> <dbl> <chr>           <fct> <chr>      
## 1 IL2RA    control       32.5 14.5  gene            Yes   IL2RA      
## 2 IL2RA    treatment     11.8  9.04 gene            Yes   IL2RA

profile_plot <- plot_tx_expr(
  genes_to_plot = "IL2RA",
  profile_data = profile_data_df
)
profile_plot

Functional Transcript Enrichment

One of the biggest caveats for transcript level analysis is that in many times is hard to extract biologically relevant information from so much data. Instead of having a final table with 900 genes you get a table with over 3.000 transcripts after the differential expression cut. The next step for gene-level DE would be functional enrichment or assigning the genes to metabolic pathways those may be regulating. Unfortunately are no comprehensive datasets for pathways transcripts may be regulating and the gene level analyses normally loses the difference between those transcripts which can produce proteins (protein_coding) from canonical translation pathways and those which cannot. To solve this problem we developed a method of expanding the known .gmts for transcript information and then separately enrich each selected category of transcript. Between the alternative spliced isoforms of that do not code for canonical proteins the most abundant are those classified as Nonsense-mediated decay, that have a premature stop codon which is subject to targeted degradation and the Protein coding CDS not defined (formerly identified as “processed transcript”), which, for any reason, do not possess a complete Open Reading Frame. Inside the processed transcript category the one with the highest count are the Retained introns, sequences which retain an intronic portion after their processing.

These three categories are the most abundant in those transcripts which arise from the alternative splicing of a protein coding gene and these three will be the main focus for our enrichment and further graphs.

So first we choose a .gmt to be used for the enrichment, in this case we loaded a human reactome gene list from MSigDB called c2. But any gene list in GMT format works here. The gmt is loaded on the fgsea format with lists for each biological process.

fs::path_package("isoformic", "extdata", "c2.cp.reactome.v2023.1.Hs.symbols.gmt.txt")

## /private/var/folders/2q/937_bkg10svdwx1x00prs9nm0000gn/T/Rtmp5mEsdw/temp_libpath42b91e0a8903/isoformic/extdata/c2.cp.reactome.v2023.1.Hs.symbols.gmt.txt

genesets_list <- fgsea::gmtPathways(
  gmt.file = fs::path_package("isoformic", "extdata", "c2.cp.reactome.v2023.1.Hs.symbols.gmt.txt")
)

head(str(genesets_list[1:5]))

## List of 5
##  $ REACTOME_INTERLEUKIN_6_SIGNALING        : chr [1:11] "JAK2" "TYK2" "CBL" "STAT1" ...
##  $ REACTOME_APOPTOSIS                      : chr [1:179] "BAD" "CFLAR" "PSMB1" "PSMC4" ...
##  $ REACTOME_HEMOSTASIS                     : chr [1:678] "FGR" "CD99" "TFPI" "KDM1A" ...
##  $ REACTOME_INTRINSIC_PATHWAY_FOR_APOPTOSIS: chr [1:55] "BAD" "BID" "BAK1" "CASP8" ...
##  $ REACTOME_MAPK3_ERK1_ACTIVATION          : chr [1:10] "JAK2" "MAPK3" "TYK2" "IL6ST" ...

## NULL

head(genesets_list[[1]])

## [1] "JAK2"  "TYK2"  "CBL"   "STAT1" "IL6ST" "IL6"

Visualize how is our table before running

head(PE1_DETs_final)

## # A tibble: 6 × 8
##   transcript_id   log2FC  pvalue  qvalue transcript_name transcript_type DEG_sig
##   <chr>            <dbl>   <dbl>   <dbl> <chr>           <chr>           <chr>  
## 1 ENST0000064952…  0.640 3.11e-1 5.40e-1 ISG15-203       protein_coding  YES    
## 2 ENST0000037937…  0.534 1.29e-2 7.07e-2 AGRN-201        protein_coding  YES    
## 3 ENST0000046111… -0.530 5.81e-2 1.91e-1 AGRN-202        retained_intron YES    
## 4 ENST0000045346…  1.83  1.66e-5 6.81e-4 RNF223-201      protein_coding  YES    
## 5 ENST0000049786…  0.415 2.20e-1 4.40e-1 TNFRSF4-203     retained_intron YES    
## 6 ENST0000037923…  0.374 2.27e-1 4.48e-1 TNFRSF4-201     protein_coding  YES    
## # ℹ 1 more variable: gene_name <chr>

Then you run the run_enrichment() function it needs your DETs final table, the gene set list and a p-value cutoff to be used. It will generate a table of enrichment but with an extra column “experiment”.

enrichment_df <- run_enrichment(
  det_df = PE1_DETs_final,
  genesets_list = genesets_list,
  tx_to_gene = tx_to_gene,
  pval_cutoff = 0.05
)
head(enrichment_df)

##                                                         pathway         pval
##                                                          <char>        <num>
## 1:              REACTOME_ABC_FAMILY_PROTEINS_MEDIATED_TRANSPORT 0.0025381836
## 2:                           REACTOME_ABC_TRANSPORTER_DISORDERS 0.0182832492
## 3: REACTOME_ACROSOME_REACTION_AND_SPERM_OOCYTE_MEMBRANE_BINDING 0.0224136073
## 4:       REACTOME_ACTIVATED_NTRK2_SIGNALS_THROUGH_FRS2_AND_FRS3 0.0437500000
## 5:                REACTOME_ACTIVATED_NTRK2_SIGNALS_THROUGH_PI3K 0.0006616736
## 6:                 REACTOME_ACTIVATED_NTRK2_SIGNALS_THROUGH_RAS 0.0041758237
##          padj   log2err        ES      NES  size
##         <num>     <num>     <num>    <num> <int>
## 1: 0.04818654 0.4317077 0.3725212 1.465466   263
## 2: 0.14031331 0.3524879 0.3614713 1.381439   196
## 3: 0.16008590 0.3524879 0.8512491 1.565424     5
## 4: 0.24806701 0.2712886 0.5351451 1.488537    26
## 5: 0.02426137 0.4772708 0.7292574 1.880501    19
## 6: 0.06500103 0.4070179 0.6460168 1.777428    24
##                                                        leadingEdge
##                                                             <list>
## 1:     OS9-209,PSMD7-204,PSMA5-206,UBC-205,ABCB1-201,UBA52-201,...
## 2:    OS9-209,PSMD7-204,PSMA5-206,UBC-205,UBA52-201,ERLIN2-206,...
## 3:                                 CD9-208,CD9-216,CD9-212,CD9-204
## 4:     NTRK2-208,NTRK2-204,NTF4-203,NTF4-201,FRS2-206,FRS3-201,...
## 5: NTRK2-208,PIK3R1-202,PIK3R1-216,NTRK2-204,NTF4-203,GAB1-201,...
## 6:                  NTRK2-208,NTRK2-204,NTF4-203,NTF4-201,SHC1-208
##        experiment
##            <char>
## 1: protein_coding
## 2: protein_coding
## 3: protein_coding
## 4: protein_coding
## 5: protein_coding
## 6: protein_coding

head(names(genesets_list))

## [1] "REACTOME_INTERLEUKIN_6_SIGNALING"                                                           
## [2] "REACTOME_APOPTOSIS"                                                                         
## [3] "REACTOME_HEMOSTASIS"                                                                        
## [4] "REACTOME_INTRINSIC_PATHWAY_FOR_APOPTOSIS"                                                   
## [5] "REACTOME_MAPK3_ERK1_ACTIVATION"                                                             
## [6] "REACTOME_TRANSLESION_SYNTHESIS_BY_Y_FAMILY_DNA_POLYMERASES_BYPASSES_LESIONS_ON_DNA_TEMPLATE"

unique(enrichment_df$experiment)

## [1] "protein_coding"          "unproductive"           
## [3] "retained_intron"         "processed_transcript"   
## [5] "nonsense_mediated_decay"

This experiment column has five possible values: Protein-coding: which is the enrichment associated with the transcripts categorized as protein coding. Unproductive: This is a term that will be used moving forward to combine those three categories of alternative spliced isoforms transcribed by coding genes. The authors are aware that biologically this term is deprecated since those kind of transcripts can produce peptides from alternative translation pathways. So here unproductive should be read as virtually incapable of producing the protein that is associated with that gene. As interpretation, we made this category to find pathways which are not being regulated on our coding data, but by the unproductive transcripts.

We also added three categories which are the individual alternative spliced types and the pathways regulated by those for specific analysis. In a very deep transcriptome the individual enrichment from those categories can also lead to promising insights.

Plotting the enrichment

We used a LollipopPlot to plot all the enrichments side by side with the size of each pathway as the radius of the circles and the transparency is if that pathway passes on the desired cutoff. First we plot for only Protein_coding versus Unproductive with a very extringent NES cutoff.

enrichment_df |>
  dplyr::filter(
    experiment %in% c("protein_coding", "unproductive") & abs(NES) >= 2
  ) |>
  dplyr::arrange(padj) |>
  dplyr::slice_head(n = 30) |>
  ggplot2::ggplot(ggplot2::aes(pathway, experiment)) +
  ggplot2::geom_point(ggplot2::aes(col = NES, size = size)) +
  ggplot2::coord_flip() +
  ggplot2::theme_minimal() +
  viridis::scale_color_viridis(option = "mako", direction = -1)

And now the specific unproductive subtypes

enrichment_df |>
  dplyr::filter(!experiment %in% c("protein_coding", "unproductive") & abs(NES) > 1.5) |>
  dplyr::arrange(padj) |>
  dplyr::slice_head(n = 20) |>
  ggplot2::ggplot(ggplot2::aes(pathway, experiment)) +
  ggplot2::geom_point(ggplot2::aes(col = NES, size = size)) +
  ggplot2::coord_flip() +
  ggplot2::theme_minimal() +
  viridis::scale_color_viridis(option = "magma", direction = -1)

Genomic Context Plot

One of the main issues we arrived at the start of the isoform level analysis, was that there was no easy direct way to visualize transcript-types if compared one to another, and using the transcript-type and the transcript per million information.

Most of the alignment plots today use the outputs from .bam/.sam files that align directly to the genome making us lose the transcript-type information and increasing considerably the processing time for any analysis for the size of the files and the time it takes to re-align.

To solve this problem we used a more direct approach which allows us to visualize the difference of introns and exons between the transcript, the types of those transcripts and how much they were counted according to the pseudo-alignment; but in turn it loses the read alignment count proportion. This alignment count unfortunately requires running alignment softwares and dealing with .sam and .bam files which will not be covered on this tutorial. We called this plot the genomic context plot and it takes inspiration from the way Ensembl shows it transcripts on their genome browser.

This specific plot requires a GFF file that can also be downloaded from GENCODE to be included in the file path on next function. This GFF file needs to be downloaded on the accurate version for your transcriptome, in this case v34.

exon_df <- prepare_exon_annotation(
  # gene_name = "RBPJ",
  gene_name = "FLT1",
  file_path = "data-raw/gencode.v34.annotation.gff3.gz",
  file_type = "gff"
)

This will be the table used for plotting.

head(exon_df)

## # A tibble: 6 × 5
##   exon_left exon_right strand tx_id             tx_name
##   <chr>     <chr>      <chr>  <chr>             <chr>  
## 1 28494780  28495128   -      ENST00000282397.9 FLT1   
## 2 28467521  28467617   -      ENST00000282397.9 FLT1   
## 3 28466903  28467129   -      ENST00000282397.9 FLT1   
## 4 28438221  28438345   -      ENST00000282397.9 FLT1   
## 5 28434058  28434220   -      ENST00000282397.9 FLT1   
## 6 28433819  28433955   -      ENST00000282397.9 FLT1

Plotting

exon_df |>
  plot_tx_context()

Protein coding example

exon_df <- prepare_exon_annotation(
  gene_name = "EGFR",
  file_path = "data-raw/gencode.v34.annotation.gff3.gz",
  file_type = "gff"
)

# exon_table = exon_df
exon_df |>
  plot_tx_context()

Testing with XIST and EGFR for genes in dual context.

dual_exon_df <- prepare_exon_annotation(
  gene_name = c("XIST", "EGFR"),
  file_path = "data-raw/gencode.v34.annotation.gff3.gz",
  file_type = "gff"
)

dual_exon_df |>
  plot_tx_context()

dual_exon_df |>
  dplyr::filter(tx_id %in% c("ENST00000602495.1", "ENST00000602863.2")) |>
  plot_tx_context()

Session Information

sessionInfo()

## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.5.2        stringr_1.5.1        tidyr_1.3.1         
## [4] dplyr_1.1.4          readr_2.1.5          fs_1.6.6            
## [7] isoformic_0.1.1.9003
## 
## loaded via a namespace (and not attached):
##  [1] fastmatch_1.1-6     gtable_0.3.6        xfun_0.52          
##  [4] bslib_0.9.0         htmlwidgets_1.6.4   ggrepel_0.9.6      
##  [7] rstatix_0.7.2       lattice_0.22-6      tzdb_0.5.0         
## [10] vctrs_0.6.5         tools_4.5.0         generics_0.1.4     
## [13] curl_6.2.3          parallel_4.5.0      tibble_3.2.1       
## [16] pkgconfig_2.0.3     Matrix_1.7-3        data.table_1.17.4  
## [19] RColorBrewer_1.1-3  desc_1.4.3          lifecycle_1.0.4    
## [22] compiler_4.5.0      farver_2.1.2        textshaping_1.0.1  
## [25] fgsea_1.34.0        codetools_0.2-20    carData_3.0-5      
## [28] htmltools_0.5.8.1   sass_0.4.10         yaml_2.3.10        
## [31] Formula_1.2-5       pillar_1.10.2       pkgdown_2.1.3      
## [34] car_3.1-3           ggpubr_0.6.0        crayon_1.5.3       
## [37] jquerylib_0.1.4     BiocParallel_1.42.0 cachem_1.1.0       
## [40] viridis_0.6.5       abind_1.4-8         tidyselect_1.2.1   
## [43] digest_0.6.37       stringi_1.8.7       purrr_1.0.4        
## [46] labeling_0.4.3      forcats_1.0.0       cowplot_1.1.3      
## [49] fastmap_1.2.0       grid_4.5.0          cli_3.6.5          
## [52] magrittr_2.0.3      utf8_1.2.5          broom_1.0.8        
## [55] withr_3.0.2         scales_1.4.0        backports_1.5.0    
## [58] bit64_4.6.0-1       rmarkdown_2.29      bit_4.6.0          
## [61] gridExtra_2.3       ggsignif_0.6.4      ragg_1.4.0         
## [64] hms_1.1.3           evaluate_1.0.3      knitr_1.50         
## [67] viridisLite_0.4.2   rlang_1.1.6         Rcpp_1.0.14        
## [70] glue_1.8.0          vroom_1.6.5         jsonlite_2.0.0     
## [73] R6_2.6.1            systemfonts_1.2.3

Izabela Mamede Conceição

Lucio Rezende Queiroz

2023-04-07