Extracts a transcript-to-gene mapping table from GENCODE annotation files, such as the transcriptome FASTA file. Currently, only FASTA files are supported.

Usage

make_tx_to_gene(file_path, file_type = c("fasta", "gff"))

Arguments

file_path

A character string specifying the path to the reference file (e.g., GENCODE FASTA file).

file_type

A character string specifying the type of the reference file. Currently, only "fasta" is supported. Default is "fasta".

Value

A tibble containing the transcript-to-gene mapping information. For a GENCODE FASTA file it has eight columns: transcript_id, gene_id, havanna_gene_id, havanna_transcript_id, transcript_name, gene_name, entrez_id, and transcript_type.

Details

The function reads the header lines of the FASTA file and extracts the relevant fields to build the mapping table. GTF and GFF3 files are not yet supported.
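
To illustrate the idea of header parsing, here is a minimal base R sketch. It is not the package's implementation: parse_gencode_headers() is a hypothetical helper, the example headers are made up, and the mapping of pipe-delimited fields to column names is an assumption based on the column order shown in the example output.

```r
# Hypothetical helper: split pipe-delimited FASTA headers into a mapping table.
# The field order is assumed to match the columns of make_tx_to_gene()'s output.
parse_gencode_headers <- function(headers) {
  col_names <- c(
    "transcript_id", "gene_id", "havanna_gene_id", "havanna_transcript_id",
    "transcript_name", "gene_name", "entrez_id", "transcript_type"
  )
  # Drop the leading ">" and split each header on the literal "|" delimiter
  fields <- strsplit(sub("^>", "", headers), "|", fixed = TRUE)
  # Keep the first eight fields; headers with fewer fields are padded with NA
  mat <- t(vapply(fields, function(x) x[1:8], character(8)))
  colnames(mat) <- col_names
  as.data.frame(mat, stringsAsFactors = FALSE)
}

# Made-up headers in the assumed format (not real GENCODE records)
example_headers <- c(
  ">ENST00000000001.1|ENSG00000000001.1|-|-|GENE1-201|GENE1|1000|lncRNA|",
  ">ENST00000000002.1|ENSG00000000002.1|-|-|GENE2-201|GENE2|2000|protein_coding|"
)
tx_map <- parse_gencode_headers(example_headers)
print(tx_map[, c("transcript_id", "gene_id", "gene_name", "transcript_type")])
```

With a real file, the header lines could be collected first, for example with grep("^>", readLines(file_path), value = TRUE), and then passed to a helper like this one.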

Examples

# Download the GENCODE transcriptome FASTA file, or reuse an existing copy:
fasta_file <- download_reference(
  version = "43",
  organism = "human",
  file_type = "fasta",
  output_path = "data-raw"
)
#>  data-raw/gencode.v43.transcripts.fa.gz already exists.

# Create the transcript-to-gene mapping table
tx_to_gene <- make_tx_to_gene(file_path = fasta_file, file_type = "fasta")

# View the first few rows
head(tx_to_gene)
#> # A tibble: 6 × 8
#>   transcript_id    gene_id havanna_gene_id havanna_transcript_id transcript_name
#>   <chr>            <chr>   <chr>           <chr>                 <chr>          
#> 1 ENST00000456328… ENSG00… -               OTTHUMT00000362751.1  DDX11L2-202    
#> 2 ENST00000450305… ENSG00… OTTHUMG0000000… OTTHUMT00000002844.2  DDX11L1-201    
#> 3 ENST00000488147… ENSG00… OTTHUMG0000000… OTTHUMT00000002839.1  WASH7P-201     
#> 4 ENST00000619216… ENSG00… -               -                     MIR6859-1-201  
#> 5 ENST00000473358… ENSG00… OTTHUMG0000000… OTTHUMT00000002840.1  MIR1302-2HG-202
#> 6 ENST00000469289… ENSG00… OTTHUMG0000000… OTTHUMT00000002841.2  MIR1302-2HG-201
#> # ℹ 3 more variables: gene_name <chr>, entrez_id <chr>, transcript_type <chr>