TPM vs. Raw count

Question from the 1st informational meeting (10/6/23):

What is the difference between TPM and raw count? Should these be linear?

We provide PBMC bulk RNA-seq data as both TPM (Transcripts Per Million) values and raw counts. The two matrices describe gene expression differently. TPM accounts for gene length and sequencing depth. To compute TPM, you first normalize each gene's read count by its length in kilobases (yielding RPK, Reads Per Kilobase) and then scale the RPK values so that they sum to one million within each sample. Because every sample has the same total, TPM is useful for comparing transcript levels between genes within a sample and, with some care, between samples. Raw counts are the number of reads that map to a particular gene, with no normalization applied. Raw counts are the usual input for differential expression analysis with tools like DESeq2 and edgeR, which expect un-normalized counts because they model library size differences themselves.
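The two normalization steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline used to produce the provided matrices: the function name `counts_to_tpm`, the toy counts, and the gene lengths are all made up for the example, and it assumes a genes x samples layout with every sample having at least one non-zero count.

```python
import numpy as np

def counts_to_tpm(counts, lengths_bp):
    """Convert a genes x samples raw count matrix to TPM (illustrative helper)."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1000.0
    # Step 1: normalize for gene length -> reads per kilobase (RPK)
    rpk = counts / lengths_kb[:, None]
    # Step 2: normalize for sequencing depth so each sample sums to 1e6
    per_million = rpk.sum(axis=0) / 1e6
    return rpk / per_million

# Toy data (hypothetical): 3 genes x 2 samples
counts = np.array([[100, 200],
                   [500, 400],
                   [50, 300]])
lengths_bp = np.array([1000, 5000, 500])

tpm = counts_to_tpm(counts, lengths_bp)
print(tpm.sum(axis=0))  # each column sums to one million
print(tpm[:, 0])        # very different raw counts, identical TPM after length correction
```

In the first toy sample the three genes have raw counts of 100, 500, and 50, but the same number of reads per kilobase, so they end up with the same TPM. That is the gene-length effect that raw counts alone do not show.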

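Raw counts and TPM are expected to be roughly linear on a log-log scale, because both values increase as the abundance of a transcript increases. Within a single sample, TPM is proportional to the raw count divided by gene length, so log(TPM) = log(count) - log(length) + a sample-specific constant. The relationship is therefore not perfectly linear: genes with the same count but different lengths get different TPM values, which shows up as scatter around the line. When comparing across samples, differences in library size and composition shift that constant and add further deviation.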
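To make the log-log relationship concrete, here is a small simulated check for a single sample. The counts and gene lengths are random numbers generated for the illustration, not the PBMC data, and the variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated single sample (hypothetical values): 200 genes
lengths_bp = rng.integers(500, 10000, size=200).astype(float)
counts = rng.integers(1, 5000, size=200).astype(float)

# TPM for this one sample: length-normalize, then scale to one million
rpk = counts / (lengths_bp / 1000.0)
tpm = rpk / rpk.sum() * 1e6

# log10(TPM) - log10(count) + log10(length) is the same for every gene
offset = np.log10(tpm) - np.log10(counts) + np.log10(lengths_bp)
print(offset.min(), offset.max())

# Slope of log(TPM) vs log(count): affected by gene-length scatter
slope_raw = np.polyfit(np.log10(counts), np.log10(tpm), 1)[0]
# Slope of log(TPM) vs log(count / length): exactly linear
slope_adj = np.polyfit(np.log10(counts / lengths_bp), np.log10(tpm), 1)[0]
print(slope_raw, slope_adj)
```

Plotting log(TPM) against log(count) for one sample should give a cloud around a straight line, and the cloud collapses onto the line once counts are divided by gene length.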

Sources:

  1. machine learning - Should I use Raw Counts, TPMs, or RPKM gene expression values for training ML models? - Cross Validated
  2. What the FPKM? A review of RNA-Seq expression units | The farrago