Step 3: Improved TCR Matrix for NMF

Filter noise clonotypes, test normalizations, export improved matrices for MEGPath

Background

The TCR matrix problem

Ethan’s tcr_expansion.csv is a 2,541 × 4 matrix of clonotype frequencies across timepoints. Script 01 revealed a major structural issue: 96% of clonotypes appear in only 1 timepoint. These rows are mostly zeros with a single nonzero entry — NMF can’t learn any temporal pattern from them because there’s no trajectory to decompose.

This is different from the scRNA problem. With scRNA, we had too many genes (noise from housekeeping/undetected genes). With TCR, we have too many clonotypes that don’t have enough data to contribute meaningful signal. NMF is being asked to find temporal expansion patterns, but most clonotypes only exist at one timepoint, so there’s no expansion to find.

What Table S8 contains

The supplementary Table S8 from Abdelfatah et al. (2023) is the source for the TCR matrix. It has richer information than the CSV:

  • CDR3 amino acid sequence — unique clonotype identifier (e.g., CASSSLGSFDEQFF)
  • Counts per timepoint — absolute number of cells with this clonotype at Pre, Wk3, Wk6, Wk9
  • Frequencies per timepoint — percentage of total cells at each timepoint
  • Expanded — whether the clonotype was statistically significantly expanded
  • TIL.observed — whether the clonotype was also found in tumor-infiltrating lymphocytes
  • TIL.counts / TIL.freq — abundance in the tumor

We’ll rebuild from Table S8 directly so we have labels and metadata for interpretation.

What Zihan did (weekly reports 2/12–2/26)

Zihan built a 2,541 × 7 matrix: the 4 frequency columns from Table S8, plus 3 -log10(Fisher p-value) columns from Tables S9A/B/C (comparing each timepoint to pre-treatment), capped at 10. Her rationale: -log10 transforms spread out significance signal (p=0.001 → 3, p=10⁻⁹ → 9).

She then ran a proper rank selection (k=2 through k=6) with:

  • Cophenetic correlation — clustering stability across 10 runs
  • Dispersion — sharpness of consensus matrix block boundaries
  • RSS — reconstruction error
  • Selected k=5 based on a composite score (dispersion peaked at k=5)

She also tried k-means clustering on the full 4-timepoint frequency vector (6 clusters) with standardization across timepoints.

This was rigorous work — proper rank selection with consensus clustering is exactly the right methodology. However, we have a concern about the p-value columns.

Fisher exact test p-values are deterministic functions of the count data that’s already captured in the frequency columns. Here’s why they’re almost certainly linear combinations:

The Fisher exact test for “is clonotype X expanded at week 6 vs pre-treatment?” takes as input: (count at pre, count at wk6, total cells at pre, total cells at wk6). The total cells per timepoint are constants (same for every clonotype: Pre=899, Wk3=166, Wk6=1556, Wk9=430). So the p-value is a function of only two variables: count_pre and count_wk6. But freq_pre and freq_wk6 are also functions of only those same two variables (freq = count / total). This means p_value = f(count_pre, count_wk6) = g(freq_pre, freq_wk6).

The -log10 transformation is monotonic, so -log10(p) is still a function of the same two frequencies. NMF is a linear factorization — it can only capture linear relationships between columns. If the p-values are nonlinear functions of the frequencies, NMF might treat them as partially independent, but the “new information” is just a nonlinear distortion of existing data, not genuinely new biological signal.

The risk: the 3 extra columns give NMF more room to fit noise (7 columns allows up to k=6 factors instead of k=3). If those columns are just nonlinear echoes of the frequency data, higher ranks will overfit rather than discover real structure.

We test this empirically below: run NMF on both 4-column and 7-column versions at the same ranks, plus higher ranks on the 7-column matrix, and compare NMAE.

Approach

Same as Script 02 — change one thing at a time, measure the impact:

  1. Load Table S8, understand the data
  2. Reproduce Ethan’s CSV, verify match
  3. Quantify the sparsity problem
  4. Reproduce Zihan’s 7-column matrix and test if p-values help
  5. Test filtering strategies (multi-timepoint, minimum count, expanded-only)
  6. Test normalization approaches (frequencies, counts, sqrt, log)
  7. Export best variants for MEGPath

Setup

Code
library(NMF)
library(readxl)
library(ggplot2)
library(dplyr)
library(reshape2)

source("theme_donq.R")
Code
# Same function from Script 02
compute_errors <- function(mat, label, ranks = 2:3, nrun = 10) {
  mat <- mat[rowSums(mat) > 0, , drop = FALSE]
  mat <- pmax(mat, 0)
  mat <- mat[rowSums(mat) > 0, , drop = FALSE]

  results <- list()
  for (k in ranks) {
    fit <- nmf(mat, rank = k, method = "lee", seed = "random",
               nrun = nrun, .opt = "vp2")
    residual <- mat - fitted(fit)
    frob <- sqrt(sum(residual^2))
    lad <- sum(abs(residual))
    n_elem <- length(residual)

    mat_norm <- mat / max(mat)
    fitted_norm <- fitted(fit) / max(mat)
    lad_norm <- sum(abs(mat_norm - fitted_norm))

    cat(sprintf("[%s] k=%d: LAD=%.2f, NMAE=%.6f, Frobenius=%.2f, Rows=%d\n",
                label, k, lad, lad_norm / n_elem, frob, nrow(mat)))

    results[[length(results) + 1]] <- data.frame(
      label = label, rank = k,
      frobenius = frob, lad = lad,
      lad_norm = lad_norm,
      nmae = lad_norm / n_elem,
      n_rows = nrow(mat), n_elements = n_elem,
      stringsAsFactors = FALSE
    )
  }
  bind_rows(results)
}

Step 1: Load Table S8

Code
# Excel supplementary tables have 4 header rows (title, description, 2 blank)
# before the actual column names. skip = 4 confirmed via console inspection.
s8 <- read_excel("NMF Research Data/crc-22-0383-s01.xlsx", sheet = "Table S8", skip = 4)

cat("Table S8 dimensions:", nrow(s8), "x", ncol(s8), "\n")
Table S8 dimensions: 2541 x 13 
Code
cat("Column names:\n")
Column names:
Code
cat(paste(colnames(s8), collapse = "\n"), "\n")
TCRB.clonotype
Pre-treatment
3 weeks
6 weeks
9 weeks
pre-treatment.freq
3 weeks.freq
6 weeks.freq
9 weeks.freq
Expanded
TIL.observed
TIL.counts
TIL.freq 
Code
# Display first few rows and structure
cat("\n=== First 6 rows ===\n")

=== First 6 rows ===
Code
print(head(s8))
# A tibble: 6 × 13
  TCRB.clonotype  `Pre-treatment` `3 weeks` `6 weeks` `9 weeks`
  <chr>                     <dbl>     <dbl>     <dbl>     <dbl>
1 CASSSLGSFDEQFF               13        20        81        28
2 CASSYRGRSTGELFF               1         2        29        14
3 CSVEERWTLTEAFF                0         1        16         5
4 CASSYTGFEQYF                  0         1        12         4
5 CASSGRYEQYF                   0         1         2         3
6 CASSIGTSTDTQYF                0         2         3         3
# ℹ 8 more variables: `pre-treatment.freq` <dbl>, `3 weeks.freq` <dbl>,
#   `6 weeks.freq` <dbl>, `9 weeks.freq` <dbl>, Expanded <chr>,
#   TIL.observed <chr>, TIL.counts <chr>, TIL.freq <chr>
Code
cat("\n=== Column types ===\n")

=== Column types ===
Code
str(s8)
tibble [2,541 × 13] (S3: tbl_df/tbl/data.frame)
 $ TCRB.clonotype    : chr [1:2541] "CASSSLGSFDEQFF" "CASSYRGRSTGELFF" "CSVEERWTLTEAFF" "CASSYTGFEQYF" ...
 $ Pre-treatment     : num [1:2541] 13 1 0 0 0 0 0 0 0 0 ...
 $ 3 weeks           : num [1:2541] 20 2 1 1 1 2 0 3 0 0 ...
 $ 6 weeks           : num [1:2541] 81 29 16 12 2 3 2 2 5 1 ...
 $ 9 weeks           : num [1:2541] 28 14 5 4 3 3 3 3 3 2 ...
 $ pre-treatment.freq: num [1:2541] 1.446 0.111 0 0 0 ...
 $ 3 weeks.freq      : num [1:2541] 12.048 1.205 0.602 0.602 0.602 ...
 $ 6 weeks.freq      : num [1:2541] 5.209 1.865 1.029 0.772 0.129 ...
 $ 9 weeks.freq      : num [1:2541] 6.512 3.256 1.163 0.93 0.698 ...
 $ Expanded          : chr [1:2541] "Expanded" "Expanded" "Expanded" "Expanded" ...
 $ TIL.observed      : chr [1:2541] "TIL.observed" "TIL.observed" "TIL.observed" "TIL.observed" ...
 $ TIL.counts        : chr [1:2541] "18.0" "23.0" "28.0" "5.0" ...
 $ TIL.freq          : chr [1:2541] "0.047498417" "0.060692421" "0.073886426" "0.013194005" ...
Code
cat("\n=== Summary Statistics ===\n")

=== Summary Statistics ===
Code
# Confirmed columns: TCRB.clonotype, Pre-treatment, 3 weeks, 6 weeks, 9 weeks,
# pre-treatment.freq, 3 weeks.freq, 6 weeks.freq, 9 weeks.freq,
# Expanded, TIL.observed, TIL.counts, TIL.freq
cols <- colnames(s8)
cat("All columns:", paste(cols, collapse = " | "), "\n\n")
All columns: TCRB.clonotype | Pre-treatment | 3 weeks | 6 weeks | 9 weeks | pre-treatment.freq | 3 weeks.freq | 6 weeks.freq | 9 weeks.freq | Expanded | TIL.observed | TIL.counts | TIL.freq 
Code
# Show ranges of numeric columns
for (col in cols) {
  if (is.numeric(s8[[col]])) {
    cat(sprintf("  %s: range [%.4f, %.4f], zeros: %d, NAs: %d\n",
                col, min(s8[[col]], na.rm = TRUE), max(s8[[col]], na.rm = TRUE),
                sum(s8[[col]] == 0, na.rm = TRUE), sum(is.na(s8[[col]]))))
  } else {
    cat(sprintf("  %s: %d unique values (character/factor)\n",
                col, length(unique(s8[[col]]))))
  }
}
  TCRB.clonotype: 2541 unique values (character/factor)
  Pre-treatment: range [0.0000, 13.0000], zeros: 1714, NAs: 0
  3 weeks: range [0.0000, 20.0000], zeros: 2403, NAs: 0
  6 weeks: range [0.0000, 81.0000], zeros: 1205, NAs: 0
  9 weeks: range [0.0000, 28.0000], zeros: 2179, NAs: 0
  pre-treatment.freq: range [0.0000, 1.4461], zeros: 1714, NAs: 0
  3 weeks.freq: range [0.0000, 12.0482], zeros: 2403, NAs: 0
  6 weeks.freq: range [0.0000, 5.2090], zeros: 1205, NAs: 0
  9 weeks.freq: range [0.0000, 6.5116], zeros: 2179, NAs: 0
  Expanded: 2 unique values (character/factor)
  TIL.observed: 2 unique values (character/factor)
  TIL.counts: 54 unique values (character/factor)
  TIL.freq: 54 unique values (character/factor)

Step 2: Reproduce Ethan’s TCR CSV

Verify we can rebuild tcr_expansion.csv from Table S8, just like we did for the scRNA matrix in Script 01. This confirms we understand the data provenance.

Code
# Load Ethan's CSV for comparison
ethan_tcr <- as.matrix(read.csv("NMF Research Data/Data (scRNA, TCR)/tcr_expansion.csv",
                                 header = FALSE))
colnames(ethan_tcr) <- c("Pre", "Wk3", "Wk6", "Wk9")

cat("Ethan's tcr_expansion.csv:", nrow(ethan_tcr), "x", ncol(ethan_tcr), "\n")
Ethan's tcr_expansion.csv: 2541 x 4 
Code
cat("Range:", range(ethan_tcr), "\n")
Range: 0 12.04819 
Code
cat("Column sums:", colSums(ethan_tcr), "\n")
Column sums: 100 100 100 100 
Code
cat("(Should be ~100 per column if frequencies)\n")
(Should be ~100 per column if frequencies)
Code
# Build frequency and count matrices using exact column names from Table S8

# Frequency matrix (what Ethan used in tcr_expansion.csv)
freq_cols <- c("pre-treatment.freq", "3 weeks.freq", "6 weeks.freq", "9 weeks.freq")
our_freq <- as.matrix(s8[, freq_cols])
our_freq[is.na(our_freq)] <- 0
colnames(our_freq) <- c("Pre", "Wk3", "Wk6", "Wk9")

# Count matrix (absolute cell counts per clonotype per timepoint)
count_cols <- c("Pre-treatment", "3 weeks", "6 weeks", "9 weeks")
count_mat <- as.matrix(s8[, count_cols])
count_mat[is.na(count_mat)] <- 0
colnames(count_mat) <- c("Pre", "Wk3", "Wk6", "Wk9")

cat(sprintf("Frequency matrix: %d x %d, range [%.4f, %.4f]\n",
            nrow(our_freq), ncol(our_freq), min(our_freq), max(our_freq)))
Frequency matrix: 2541 x 4, range [0.0000, 12.0482]
Code
cat(sprintf("Column sums: %s (should be ~100 each)\n",
            paste(round(colSums(our_freq), 2), collapse = ", ")))
Column sums: 100, 100, 100, 100 (should be ~100 each)
Code
cat(sprintf("\nCount matrix: %d x %d, range [%.0f, %.0f]\n",
            nrow(count_mat), ncol(count_mat), min(count_mat), max(count_mat)))

Count matrix: 2541 x 4, range [0, 81]
Code
# Compare our frequency matrix from Table S8 to Ethan's CSV
cat("Our frequency matrix:", nrow(our_freq), "x", ncol(our_freq), "\n")
Our frequency matrix: 2541 x 4 
Code
cat("Range:", range(our_freq), "\n")
Range: 0 12.04819 
Code
cat("Column sums:", colSums(our_freq), "\n\n")
Column sums: 100 100 100 100 
Code
max_diff <- max(abs(our_freq - ethan_tcr))
cat(sprintf("Max difference vs Ethan's CSV: %.2e\n", max_diff))
Max difference vs Ethan's CSV: 0.00e+00
Code
if (max_diff < 1e-6) {
  cat("MATCH CONFIRMED — our Table S8 extraction matches Ethan's CSV exactly.\n")
} else {
  cat("MISMATCH — investigating...\n")
  cat("First row ours:", our_freq[1, ], "\n")
  cat("First row Ethan:", ethan_tcr[1, ], "\n")
}
MATCH CONFIRMED — our Table S8 extraction matches Ethan's CSV exactly.

Step 3: Quantify the Sparsity Problem

This is the core issue with the TCR matrix. Most clonotypes appear at only one timepoint, giving NMF rows that are mostly zeros. NMF can’t learn a temporal trajectory from a single data point.

Code
# Use our frequency matrix (or Ethan's if match failed)
freq_mat <- our_freq

# How many timepoints does each clonotype appear in?
n_timepoints <- rowSums(freq_mat > 0)
tp_table <- table(n_timepoints)

cat("=== Clonotype Presence Across Timepoints ===\n\n")
=== Clonotype Presence Across Timepoints ===
Code
for (i in 1:4) {
  n <- sum(n_timepoints == i)
  pct <- 100 * n / length(n_timepoints)
  cat(sprintf("  Present in %d timepoint(s): %d clonotypes (%.1f%%)\n", i, n, pct))
}
  Present in 1 timepoint(s): 2443 clonotypes (96.1%)
  Present in 2 timepoint(s): 78 clonotypes (3.1%)
  Present in 3 timepoint(s): 16 clonotypes (0.6%)
  Present in 4 timepoint(s): 4 clonotypes (0.2%)
Code
cat(sprintf("\nTotal: %d clonotypes\n", length(n_timepoints)))

Total: 2541 clonotypes
Code
cat(sprintf("Single-timepoint (uninformative for NMF): %d (%.1f%%)\n",
            sum(n_timepoints == 1), 100 * sum(n_timepoints == 1) / length(n_timepoints)))
Single-timepoint (uninformative for NMF): 2443 (96.1%)
Code
cat(sprintf("Multi-timepoint (informative): %d (%.1f%%)\n",
            sum(n_timepoints >= 2), 100 * sum(n_timepoints >= 2) / length(n_timepoints)))
Multi-timepoint (informative): 98 (3.9%)
Code
tp_df <- data.frame(
  timepoints = factor(1:4),
  count = as.numeric(tp_table)
)

p_sparsity <- ggplot(tp_df, aes(x = timepoints, y = count)) +
  geom_col(fill = "#b0b0b0", width = 0.6) +
  geom_text(aes(label = sprintf("%d\n(%.0f%%)", count, 100 * count / sum(count))),
            vjust = -0.3, color = "#e0e0e0", size = 4) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  theme_donq() +
  labs(
    title = "TCR Sparsity: Most Clonotypes Appear at Only 1 Timepoint",
    subtitle = "NMF needs multi-timepoint clonotypes to learn temporal patterns",
    x = "Number of Timepoints Present",
    y = "Number of Clonotypes",
    caption = "Source: Table S8, Abdelfatah et al. (2023) | eigen.io"
  )

p_sparsity

Code
save_donq(p_sparsity, "03_tcr_sparsity.png", width = 10, height = 5)
Code
# Which timepoints do single-timepoint clonotypes belong to?
single_tp <- freq_mat[n_timepoints == 1, ]
single_which <- apply(single_tp, 1, which.max)
single_table <- table(colnames(freq_mat)[single_which])

cat("=== Single-Timepoint Clonotypes by Timepoint ===\n")
=== Single-Timepoint Clonotypes by Timepoint ===
Code
print(single_table)

 Pre  Wk3  Wk6  Wk9 
 765  112 1252  314 
Code
cat("\nThis reflects the sampling depth at each timepoint.\n")

This reflects the sampling depth at each timepoint.
Code
cat("Pre has the most cells (1,861) so the most unique clonotypes.\n")
Pre has the most cells (1,861) so the most unique clonotypes.

Step 4: Filtering Strategies

We test three filtering approaches, from most aggressive to least:

Filter A: Multi-timepoint only (2+ timepoints)

Only keep clonotypes observed at 2 or more timepoints. These are the only clonotypes where NMF can learn a temporal pattern. Aggressive — loses most rows.

Filter B: Minimum total count (≥3 cells across all timepoints)

Keep clonotypes with at least 3 total cells. Less aggressive than Filter A — includes some single-timepoint clonotypes if they have enough cells to be reliable.

Filter C: Expanded + TIL-observed only

Keep only clonotypes marked as “Expanded” or “TIL.observed” in Table S8. These are the biologically interesting ones — the rest are background repertoire noise.

Code
# Filter A: 2+ timepoints
filter_a <- freq_mat[n_timepoints >= 2, ]
cat("=== Filter A: Multi-timepoint (2+ timepoints) ===\n")
=== Filter A: Multi-timepoint (2+ timepoints) ===
Code
cat(sprintf("Rows: %d (%.1f%% of original)\n", nrow(filter_a),
            100 * nrow(filter_a) / nrow(freq_mat)))
Rows: 98 (3.9% of original)
Code
cat("Timepoint distribution:", table(rowSums(filter_a > 0)), "\n")
Timepoint distribution: 78 16 4 
Code
# Filter B: Total count >= 3
total_counts <- rowSums(count_mat)
filter_b <- freq_mat[total_counts >= 3, ]
count_b <- count_mat[total_counts >= 3, ]

cat("=== Filter B: Total count >= 3 ===\n")
=== Filter B: Total count >= 3 ===
Code
cat(sprintf("Rows: %d (%.1f%% of original)\n", nrow(filter_b),
            100 * nrow(filter_b) / nrow(freq_mat)))
Rows: 51 (2.0% of original)
Code
cat("Total count distribution of kept clonotypes:\n")
Total count distribution of kept clonotypes:
Code
print(summary(total_counts[total_counts >= 3]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   3.000   3.000   8.706   7.000 142.000 
Code
# Filter C: Expanded or TIL-observed
# Column values are "Expanded" / "TIL.observed" or NA
is_expanded <- !is.na(s8[["Expanded"]]) & s8[["Expanded"]] == "Expanded"
is_til <- !is.na(s8[["TIL.observed"]]) & s8[["TIL.observed"]] == "TIL.observed"

cat(sprintf("Expanded clonotypes: %d\n", sum(is_expanded)))
Expanded clonotypes: 9
Code
cat(sprintf("TIL-observed clonotypes: %d\n", sum(is_til)))
TIL-observed clonotypes: 415
Code
cat(sprintf("Either (union): %d\n", sum(is_expanded | is_til)))
Either (union): 416
Code
cat(sprintf("Both (intersection): %d\n", sum(is_expanded & is_til)))
Both (intersection): 8
Code
filter_c <- freq_mat[is_expanded | is_til, ]
cat(sprintf("\n=== Filter C: Expanded or TIL-observed ===\n"))

=== Filter C: Expanded or TIL-observed ===
Code
cat(sprintf("Rows: %d (%.1f%% of original)\n", nrow(filter_c),
            100 * nrow(filter_c) / nrow(freq_mat)))
Rows: 416 (16.4% of original)
Code
# How many of these are multi-timepoint?
fc_tp <- rowSums(filter_c > 0)
cat(sprintf("Of these, multi-timepoint: %d (%.1f%%)\n",
            sum(fc_tp >= 2), 100 * sum(fc_tp >= 2) / nrow(filter_c)))
Of these, multi-timepoint: 62 (14.9%)
Code
filter_sizes <- data.frame(
  filter = c("No filter", "A: 2+ timepoints", "B: count >= 3", "C: Expanded/TIL"),
  rows = c(nrow(freq_mat), nrow(filter_a), nrow(filter_b), nrow(filter_c))
)
filter_sizes$filter <- factor(filter_sizes$filter, levels = filter_sizes$filter)

p_filters <- ggplot(filter_sizes, aes(x = filter, y = rows)) +
  geom_col(fill = "#b0b0b0", width = 0.6) +
  geom_text(aes(label = rows), vjust = -0.3, color = "#e0e0e0", size = 4.5) +
  theme_donq() +
  theme(axis.text.x = element_text(angle = 15, hjust = 1)) +
  labs(
    title = "TCR Matrix Size After Each Filter",
    subtitle = "Aggressive filtering removes noise but reduces matrix size",
    x = NULL,
    y = "Number of Clonotypes (rows)",
    caption = "eigen.io"
  )

p_filters

Code
save_donq(p_filters, "03_tcr_filter_sizes.png", width = 10, height = 5)

Step 5: Normalization Approaches

For each filtered matrix, we test different normalizations. The question: should we give MEGPath frequencies (sum to 100 per timepoint), raw counts, or some transformation?

Frequencies (what Ethan used): Each column sums to ~100. Good because it accounts for different sampling depths across timepoints (Pre has 1,861 cells, Wk3 only 223). But the scale is very small — most values are <0.1%.

Raw counts: Integer number of cells per clonotype per timepoint. Doesn’t account for sampling depth — a clonotype with 5 cells at Wk3 (out of 223) is more impressive than 5 cells at Pre (out of 1,861). But counts may be more natural for NMF.

Sqrt-transformed frequencies: sqrt(freq) compresses the range while preserving relative differences. Common in ecology for species abundance data.

Log-transformed frequencies: log1p(freq) compresses even more aggressively. Risk: same double-log issue as scRNA if not careful about what we’re logging.

Code
# Use Filter A (multi-timepoint) as the test case — it's the cleanest for NMF
test_mat <- filter_a
cat(sprintf("Testing normalizations on Filter A: %d clonotypes x 4 timepoints\n\n",
            nrow(test_mat)))
Testing normalizations on Filter A: 98 clonotypes x 4 timepoints
Code
# 1. Frequencies as-is (Ethan's approach for this subset)
norm_freq <- test_mat
cat("Frequencies: range", range(norm_freq), "\n")
Frequencies: range 0 12.04819 
Code
# 2. Sqrt-transformed frequencies
norm_sqrt <- sqrt(test_mat)
cat("Sqrt(freq): range", range(norm_sqrt), "\n")
Sqrt(freq): range 0 3.471051 
Code
# 3. Log-transformed frequencies
norm_log <- log1p(test_mat)
cat("Log1p(freq): range", range(norm_log), "\n")
Log1p(freq): range 0 2.56865 
Code
# 4. Raw counts
count_a <- count_mat[n_timepoints >= 2, ]
cat("Raw counts: range", range(count_a), "\n")
Raw counts: range 0 81 
Code
cat("Count row sums: range", range(rowSums(count_a)), "\n")
Count row sums: range 2 142 

Step 6: Reproduce Zihan’s 7-Column Matrix

Zihan added 3 columns to the 4 frequency columns: -log10(Fisher p-value) from Tables S9A (Pre vs 3wk), S9B (Pre vs 6wk), and S9C (Pre vs 9wk), capped at 10.

With 7 columns instead of 4, the max meaningful NMF rank increases from k=3 to k=6. Zihan selected k=5 using cophenetic correlation, dispersion, and RSS — a proper rank selection methodology.

The question: do the p-value columns add real information, or are they redundant with the frequency columns? We test this by running NMF on both the 4-column and 7-column versions and comparing errors. If the p-value columns help, the 7-column matrix should have lower NMAE at the same rank, and higher ranks (k=4,5) should improve further.

Code
# Load Tables S9A, S9B, S9C — same skip = 4 as S8
# Columns: Sequence, Pre-treatment (n), X weeks (n), Pre-treatment fraction,
#           X weeks fraction, Pre-treatment_neg, X weeks_neg, fisher, Diff
s9a <- read_excel("NMF Research Data/crc-22-0383-s01.xlsx", sheet = "Table S9A", skip = 4)
s9b <- read_excel("NMF Research Data/crc-22-0383-s01.xlsx", sheet = "Table S9B", skip = 4)
s9c <- read_excel("NMF Research Data/crc-22-0383-s01.xlsx", sheet = "Table S9C", skip = 4)

cat("S9A (Pre vs 3wk):", nrow(s9a), "x", ncol(s9a), "\n")
S9A (Pre vs 3wk): 955 x 9 
Code
cat("S9B (Pre vs 6wk):", nrow(s9b), "x", ncol(s9b), "\n")
S9B (Pre vs 6wk): 2115 x 9 
Code
cat("S9C (Pre vs 9wk):", nrow(s9c), "x", ncol(s9c), "\n")
S9C (Pre vs 9wk): 1167 x 9 
Code
# S8 CDR3 column: "TCRB.clonotype"
# S9 CDR3 column: "Sequence"
# S9 p-value column: "fisher"
cdr3_vec <- s8[["TCRB.clonotype"]]
cat("First 3 CDR3 sequences:", paste(head(cdr3_vec, 3), collapse = ", "), "\n")
First 3 CDR3 sequences: CASSSLGSFDEQFF, CASSYRGRSTGELFF, CSVEERWTLTEAFF 
Code
# Build CDR3 → Fisher p-value lookups from each S9 table
p_3wk <- setNames(as.numeric(s9a[["fisher"]]), s9a[["Sequence"]])
p_6wk <- setNames(as.numeric(s9b[["fisher"]]), s9b[["Sequence"]])
p_9wk <- setNames(as.numeric(s9c[["fisher"]]), s9c[["Sequence"]])

cat(sprintf("S9A: %d clonotypes with Fisher p-values\n", length(p_3wk)))
S9A: 955 clonotypes with Fisher p-values
Code
cat(sprintf("S9B: %d clonotypes with Fisher p-values\n", length(p_6wk)))
S9B: 2115 clonotypes with Fisher p-values
Code
cat(sprintf("S9C: %d clonotypes with Fisher p-values\n", length(p_9wk)))
S9C: 1167 clonotypes with Fisher p-values
Code
# Build -log10(p) columns, matched to S8 clonotype order
neglog_3wk <- -log10(pmax(p_3wk[cdr3_vec], 1e-10))
neglog_6wk <- -log10(pmax(p_6wk[cdr3_vec], 1e-10))
neglog_9wk <- -log10(pmax(p_9wk[cdr3_vec], 1e-10))

# Clonotypes not in S9 tables (not tested) → 0
neglog_3wk[is.na(neglog_3wk)] <- 0
neglog_6wk[is.na(neglog_6wk)] <- 0
neglog_9wk[is.na(neglog_9wk)] <- 0

# Cap at 10 (as Zihan did)
neglog_3wk <- pmin(neglog_3wk, 10)
neglog_6wk <- pmin(neglog_6wk, 10)
neglog_9wk <- pmin(neglog_9wk, 10)

# Build 7-column matrix
zihan_mat <- cbind(freq_mat,
                   neglog_p_3wk = neglog_3wk,
                   neglog_p_6wk = neglog_6wk,
                   neglog_p_9wk = neglog_9wk)

cat(sprintf("\nZihan's 7-column matrix: %d x %d\n", nrow(zihan_mat), ncol(zihan_mat)))

Zihan's 7-column matrix: 2541 x 7
Code
cat("Column ranges:\n")
Column ranges:
Code
for (j in 1:ncol(zihan_mat)) {
  cat(sprintf("  %s: [%.4f, %.4f]\n", colnames(zihan_mat)[j],
              min(zihan_mat[,j]), max(zihan_mat[,j])))
}
  Pre: [0.0000, 1.4461]
  Wk3: [0.0000, 12.0482]
  Wk6: [0.0000, 5.2090]
  Wk9: [0.0000, 6.5116]
  neglog_p_3wk: [0.0000, 8.6353]
  neglog_p_6wk: [-0.0000, 6.0251]
  neglog_p_9wk: [-0.0000, 5.8997]
Code
# Filtered version (multi-timepoint only)
zihan_filtered <- zihan_mat[n_timepoints >= 2, ]
cat(sprintf("\nFiltered (multi-timepoint): %d x %d\n",
            nrow(zihan_filtered), ncol(zihan_filtered)))

Filtered (multi-timepoint): 98 x 7

Test 1: Are the p-value columns actually independent?

Before running NMF, we can directly measure how much new information the p-value columns carry. If -log10(p_6wk) can be predicted from freq_pre and freq_6wk with high accuracy, it’s redundant. We test this with R² from linear regression: if R² ≈ 1, the p-value column is a (near-)linear function of existing columns and adds no new information for a linear factorization like NMF.

We also compute the correlation matrix between all 7 columns. High correlations between p-value columns and frequency columns confirm redundancy.

Code
cat("=== Test: How predictable are p-value columns from frequency columns? ===\n\n")
=== Test: How predictable are p-value columns from frequency columns? ===
Code
# For each p-value column, regress it on the frequency columns
freq_only <- zihan_mat[, 1:4]
p_cols <- zihan_mat[, 5:7]

for (j in 1:3) {
  p_name <- colnames(p_cols)[j]
  nonzero <- p_cols[, j] > 0
  if (sum(nonzero) > 10) {
    fit <- lm(p_cols[nonzero, j] ~ freq_only[nonzero, ])
    r2 <- summary(fit)$r.squared
    cat(sprintf("  %s ~ freq columns: R² = %.4f (n=%d)\n", p_name, r2, sum(nonzero)))
  } else {
    cat(sprintf("  %s: too few nonzero values (%d) to test\n", p_name, sum(nonzero)))
  }
}
  neglog_p_3wk ~ freq columns: R² = 0.9817 (n=139)
  neglog_p_6wk ~ freq columns: R² = 0.7873 (n=839)
  neglog_p_9wk ~ freq columns: R² = 0.9311 (n=367)
Code
cat("\nR² close to 1.0 = the p-value column is almost entirely explained by the\n")

R² close to 1.0 = the p-value column is almost entirely explained by the
Code
cat("frequency columns. NMF gains no new information from it.\n")
frequency columns. NMF gains no new information from it.
Code
cat("\n=== Correlation Matrix (all 7 columns) ===\n\n")

=== Correlation Matrix (all 7 columns) ===
Code
has_pval <- rowSums(p_cols) > 0
if (sum(has_pval) > 20) {
  cor_mat <- cor(zihan_mat[has_pval, ], use = "complete.obs")
} else {
  cat("Note: few rows with p-values, correlations may be noisy.\n")
  cor_mat <- cor(zihan_mat, use = "complete.obs")
}
print(round(cor_mat, 3))
                Pre   Wk3   Wk6   Wk9 neglog_p_3wk neglog_p_6wk neglog_p_9wk
Pre           1.000 0.284 0.421 0.143        0.124        0.814       -0.111
Wk3           0.284 1.000 0.807 0.670        0.955        0.337        0.408
Wk6           0.421 0.807 1.000 0.855        0.653        0.618        0.630
Wk9           0.143 0.670 0.855 1.000        0.524        0.329        0.913
neglog_p_3wk  0.124 0.955 0.653 0.524        1.000        0.200        0.318
neglog_p_6wk  0.814 0.337 0.618 0.329        0.200        1.000        0.163
neglog_p_9wk -0.111 0.408 0.630 0.913        0.318        0.163        1.000
Code
# Scatter: freq_6wk vs neglog_p_6wk to visualize the functional relationship
# Use 6wk since it has the most expanded clonotypes (S9B: 2118 rows)
has_6wk_p <- zihan_mat[, "neglog_p_6wk"] > 0
scatter_df <- data.frame(
  freq_6wk = zihan_mat[has_6wk_p, "Wk6"],
  neglog_p = zihan_mat[has_6wk_p, "neglog_p_6wk"]
)

p_scatter <- ggplot(scatter_df, aes(x = freq_6wk, y = neglog_p)) +
  geom_point(color = "#b0b0b0", alpha = 0.6, size = 2) +
  theme_donq() +
  labs(
    title = "Frequency vs -log10(p-value) at Week 6",
    subtitle = "If this shows a tight functional curve, the p-value column is redundant",
    x = "Frequency at Week 6 (%)",
    y = "-log10(Fisher p-value)",
    caption = "Only clonotypes with Fisher test results | eigen.io"
  )

p_scatter

Code
save_donq(p_scatter, "03_freq_vs_pvalue.png", width = 10, height = 5)

Test 2: NMF with and without p-value columns

Now the empirical test. We run NMF on the same clonotypes with 4 columns (frequencies only) vs 7 columns (frequencies + p-values) at the same ranks. If theory is correct:

  • At k=2,3: 7 columns should have similar or worse NMAE — the extra columns are noise that NMF wastes factors fitting.
  • At k=4,5: 7 columns will have lower total error, but this is overfitting — the extra factors are memorizing the nonlinear freq→p-value mapping, not finding biology.
Code
cat("=== NMF: 4-column vs 7-column comparison ===\n\n")
=== NMF: 4-column vs 7-column comparison ===
Code
cat("--- Unfiltered (all 2541 clonotypes) ---\n")
--- Unfiltered (all 2541 clonotypes) ---
Code
zihan_all_err <- compute_errors(zihan_mat, "Zihan 7col: all", ranks = 2:3)

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.23    0.02    3.14 
[Zihan 7col: all] k=2: LAD=840.35, NMAE=0.003921, Frobenius=13.23, Rows=2541

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.37    0.00    3.39 
[Zihan 7col: all] k=3: LAD=315.94, NMAE=0.001474, Frobenius=5.58, Rows=2541
Code
cat("\n--- Filtered (multi-timepoint) ---\n")

--- Filtered (multi-timepoint) ---
Code
zihan_filt_err <- compute_errors(zihan_filtered, "Zihan 7col: filtered", ranks = 2:3)

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.20    0.03    2.86 
[Zihan 7col: filtered] k=2: LAD=68.64, NMAE=0.008305, Frobenius=4.25, Rows=98

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.27    0.01    3.01 
[Zihan 7col: filtered] k=3: LAD=43.17, NMAE=0.005223, Frobenius=2.91, Rows=98
Code
cat("\n--- Higher ranks on 7-column (k=4,5) ---\n")

--- Higher ranks on 7-column (k=4,5) ---
Code
zihan_highk_err <- compute_errors(zihan_mat, "Zihan 7col: all k4-5", ranks = 4:5)

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.21    0.02    3.48 
[Zihan 7col: all k4-5] k=4: LAD=182.44, NMAE=0.000851, Frobenius=3.07, Rows=2541

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.24    0.00    3.73 
[Zihan 7col: all k4-5] k=5: LAD=59.34, NMAE=0.000277, Frobenius=1.94, Rows=2541
Code
zihan_filt_highk_err <- compute_errors(zihan_filtered, "Zihan 7col: filt k4-5", ranks = 4:5)

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.25    0.00    3.00 
[Zihan 7col: filt k4-5] k=4: LAD=25.56, NMAE=0.003093, Frobenius=1.71, Rows=98

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.25    0.00    3.03 
[Zihan 7col: filt k4-5] k=5: LAD=22.71, NMAE=0.002747, Frobenius=1.35, Rows=98

Step 7: NMF Error Comparison

Run NMF on each filter × normalization combination and compare errors.

Code
cat("=== Baseline: Ethan's full TCR matrix (unfiltered frequencies) ===\n\n")
=== Baseline: Ethan's full TCR matrix (unfiltered frequencies) ===
Code
baseline_tcr <- compute_errors(freq_mat, "Baseline: all freq")

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.20    0.02    3.06 
[Baseline: all freq] k=2: LAD=298.40, NMAE=0.002437, Frobenius=4.86, Rows=2541

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.24    0.00    3.34 
[Baseline: all freq] k=3: LAD=186.99, NMAE=0.001527, Frobenius=3.18, Rows=2541
Code
cat("\n=== Filter A (multi-timepoint) × Normalizations ===\n\n")

=== Filter A (multi-timepoint) × Normalizations ===
Code
fa_freq_err  <- compute_errors(norm_freq, "A: freq")

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.23    0.00    2.91 
[A: freq] k=2: LAD=22.25, NMAE=0.004712, Frobenius=1.74, Rows=98

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.28    0.03    3.02 
[A: freq] k=3: LAD=11.10, NMAE=0.002351, Frobenius=1.05, Rows=98
Code
fa_sqrt_err  <- compute_errors(norm_sqrt, "A: sqrt(freq)")

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.25    0.00    2.82 
[A: sqrt(freq)] k=2: LAD=46.39, NMAE=0.034091, Frobenius=3.17, Rows=98

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.27    0.00    2.83 
[A: sqrt(freq)] k=3: LAD=19.68, NMAE=0.014462, Frobenius=1.60, Rows=98
Code
fa_log_err   <- compute_errors(norm_log, "A: log1p(freq)")

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.32    0.00    3.44 
[A: log1p(freq)] k=2: LAD=19.94, NMAE=0.019808, Frobenius=1.49, Rows=98

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.30    0.00    3.29 
[A: log1p(freq)] k=3: LAD=9.29, NMAE=0.009222, Frobenius=0.84, Rows=98
Code
fa_count_err <- compute_errors(count_a, "A: counts")

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.28    0.01    3.42 
[A: counts] k=2: LAD=143.00, NMAE=0.004504, Frobenius=11.41, Rows=98

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.43    0.00    3.43 
[A: counts] k=3: LAD=83.16, NMAE=0.002619, Frobenius=8.12, Rows=98
Code
cat("\n=== Filter B (count >= 3) × Frequencies ===\n\n")

=== Filter B (count >= 3) × Frequencies ===
Code
fb_freq_err <- compute_errors(filter_b, "B: freq (count>=3)")

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.25    0.00    3.18 
[B: freq (count>=3)] k=2: LAD=17.53, NMAE=0.007134, Frobenius=1.97, Rows=51

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.28    0.00    3.37 
[B: freq (count>=3)] k=3: LAD=9.37, NMAE=0.003811, Frobenius=1.17, Rows=51
Code
cat("\n=== Filter C (Expanded/TIL) × Frequencies ===\n\n")

=== Filter C (Expanded/TIL) × Frequencies ===
Code
fc_freq_err <- compute_errors(filter_c, "C: freq (bio)")

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.31    0.05    3.42 
[C: freq (bio)] k=2: LAD=61.34, NMAE=0.003060, Frobenius=2.70, Rows=416

Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   1.36    0.01    3.56 
[C: freq (bio)] k=3: LAD=27.31, NMAE=0.001362, Frobenius=1.47, Rows=416

Step 8: Compare All TCR Errors

Code
all_tcr_errors <- bind_rows(
  baseline_tcr,
  fa_freq_err,
  fa_sqrt_err,
  fa_log_err,
  fa_count_err,
  fb_freq_err,
  fc_freq_err,
  zihan_all_err,
  zihan_filt_err,
  zihan_highk_err,
  zihan_filt_highk_err
)

cat("=== All TCR Error Comparisons ===\n\n")
=== All TCR Error Comparisons ===
Code
cat("Sorted by NMAE (normalized, scale-independent):\n\n")
Sorted by NMAE (normalized, scale-independent):
Code
all_tcr_errors |>
  arrange(rank, nmae) |>
  select(label, rank, n_rows, lad, nmae) |>
  as.data.frame() |>
  print()
                   label rank n_rows        lad         nmae
1     Baseline: all freq    2   2541 298.401088 0.0024367661
2          C: freq (bio)    2    416  61.337593 0.0030595073
3        Zihan 7col: all    2   2541 840.354665 0.0039213716
4              A: counts    2     98 143.002621 0.0045037358
5                A: freq    2     98  22.253107 0.0047117548
6     B: freq (count>=3)    2     51  17.533011 0.0071335290
7   Zihan 7col: filtered    2     98  68.637289 0.0083045116
8         A: log1p(freq)    2     98  19.944772 0.0198078862
9          A: sqrt(freq)    2     98  46.386133 0.0340911101
10         C: freq (bio)    3    416  27.308655 0.0013621505
11       Zihan 7col: all    3   2541 315.935275 0.0014742580
12    Baseline: all freq    3   2541 186.991081 0.0015269834
13               A: freq    3     98  11.101863 0.0023506496
14             A: counts    3     98  83.158264 0.0026189930
15    B: freq (count>=3)    3     51   9.367917 0.0038114564
16  Zihan 7col: filtered    3     98  43.168659 0.0052230302
17        A: log1p(freq)    3     98   9.285943 0.0092222117
18         A: sqrt(freq)    3     98  19.677777 0.0144620216
19  Zihan 7col: all k4-5    4   2541 182.444357 0.0008513455
20 Zihan 7col: filt k4-5    4     98  25.560935 0.0030926496
21  Zihan 7col: all k4-5    5   2541  59.337945 0.0002768904
22 Zihan 7col: filt k4-5    5     98  22.706497 0.0027472875
Code
cat("\n=== Improvement vs Ethan's Baseline ===\n\n")

=== Improvement vs Ethan's Baseline ===
Code
for (k in 2:3) {
  bl <- baseline_tcr |> filter(rank == k)

  improved <- all_tcr_errors |>
    filter(label != "Baseline: all freq", rank == k) |>
    arrange(nmae)

  cat(sprintf("--- Rank k=%d (baseline NMAE: %.6f) ---\n", k, bl$nmae[1]))
  for (i in 1:nrow(improved)) {
    pct <- 100 * (bl$nmae[1] - improved$nmae[i]) / bl$nmae[1]
    cat(sprintf("  %s: NMAE=%.6f (%+.1f%%, %d rows)\n",
                improved$label[i], improved$nmae[i], -pct, improved$n_rows[i]))
  }
  cat("\n")
}
--- Rank k=2 (baseline NMAE: 0.002437) ---
  C: freq (bio): NMAE=0.003060 (+25.6%, 416 rows)
  Zihan 7col: all: NMAE=0.003921 (+60.9%, 2541 rows)
  A: counts: NMAE=0.004504 (+84.8%, 98 rows)
  A: freq: NMAE=0.004712 (+93.4%, 98 rows)
  B: freq (count>=3): NMAE=0.007134 (+192.7%, 51 rows)
  Zihan 7col: filtered: NMAE=0.008305 (+240.8%, 98 rows)
  A: log1p(freq): NMAE=0.019808 (+712.9%, 98 rows)
  A: sqrt(freq): NMAE=0.034091 (+1299.0%, 98 rows)

--- Rank k=3 (baseline NMAE: 0.001527) ---
  C: freq (bio): NMAE=0.001362 (-10.8%, 416 rows)
  Zihan 7col: all: NMAE=0.001474 (-3.5%, 2541 rows)
  A: freq: NMAE=0.002351 (+53.9%, 98 rows)
  A: counts: NMAE=0.002619 (+71.5%, 98 rows)
  B: freq (count>=3): NMAE=0.003811 (+149.6%, 51 rows)
  Zihan 7col: filtered: NMAE=0.005223 (+242.0%, 98 rows)
  A: log1p(freq): NMAE=0.009222 (+503.9%, 98 rows)
  A: sqrt(freq): NMAE=0.014462 (+847.1%, 98 rows)

Visualization

Code
plot_tcr <- all_tcr_errors |>
  mutate(label = factor(label, levels = unique(label)))

p_tcr <- ggplot(plot_tcr, aes(x = label, y = nmae, fill = factor(rank))) +
  geom_col(position = "dodge", width = 0.7) +
  scale_fill_donq() +
  theme_donq() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, size = 10)) +
  labs(
    title = "TCR NMF Error: Filtering & Normalization Comparison",
    subtitle = "NMAE (normalized to [0,1] scale) — lower is better",
    x = NULL,
    y = "Normalized MAE",
    fill = "Rank (k)",
    caption = "Lee algorithm, 10 random starts | eigen.io"
  )

p_tcr

Code
save_donq(p_tcr, "03_tcr_error_comparison.png", width = 12, height = 7)

Clonotype Trajectory Visualization

Let’s visualize the top expanded clonotypes to see what the temporal patterns actually look like. These are the clonotypes NMF should be learning factors from.

Code
# Get multi-timepoint clonotypes with their CDR3 sequences
multi_idx <- which(n_timepoints >= 2)
multi_df <- data.frame(
  cdr3 = s8[["TCRB.clonotype"]][multi_idx],
  freq_mat[multi_idx, ],
  total = rowSums(freq_mat[multi_idx, ]),
  n_tp = n_timepoints[multi_idx]
)

# Top 8 by total frequency
top10 <- multi_df |> arrange(desc(total)) |> head(8)
cat("=== Top 8 Multi-Timepoint Clonotypes (by total frequency) ===\n")
=== Top 8 Multi-Timepoint Clonotypes (by total frequency) ===
Code
print(as.data.frame(top10))
              cdr3       Pre        Wk3       Wk6       Wk9     total n_tp
1   CASSSLGSFDEQFF 1.4460512 12.0481928 5.2090032 6.5116279 25.214875    4
2  CASSYRGRSTGELFF 0.1112347  1.2048193 1.8649518 3.2558140  6.436820    4
3   CSVEERWTLTEAFF 0.0000000  0.6024096 1.0289389 1.1627907  2.794139    3
4  CASSQVRRGNYGYTF 0.0000000  1.8072289 0.1286174 0.6976744  2.633521    3
5 CASSEGSGRLENIQYF 0.3337041  1.8072289 0.0000000 0.2325581  2.373491    3
6     CASSYTGFEQYF 0.0000000  0.6024096 0.7717042 0.9302326  2.304346    3
7   CASSIGTSTDTQYF 0.0000000  1.2048193 0.1929260 0.6976744  2.095420    3
8   CASSFMGGPYEQYF 0.2224694  0.6024096 0.3858521 0.2325581  1.443289    4
Code
# Reshape for plotting
top_long <- top10 |>
  select(cdr3, Pre, Wk3, Wk6, Wk9) |>
  reshape2::melt(id.vars = "cdr3", variable.name = "timepoint", value.name = "frequency")

top_long$timepoint <- factor(top_long$timepoint, levels = c("Pre", "Wk3", "Wk6", "Wk9"))
top_long$cdr3_short <- substr(top_long$cdr3, 1, 15)

p_traj <- ggplot(top_long, aes(x = timepoint, y = frequency,
                                group = cdr3_short, color = cdr3_short)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2.5) +
  scale_color_donq() +
  theme_donq() +
  labs(
    title = "Top 8 Expanded Clonotype Trajectories",
    subtitle = "Frequency (%) across treatment timepoints",
    x = "Timepoint",
    y = "Frequency (%)",
    color = "CDR3 sequence",
    caption = "Multi-timepoint clonotypes only | eigen.io"
  )

p_traj

Code
save_donq(p_traj, "03_top_clonotype_trajectories.png", width = 10, height = 6)

Export Improved CSVs for MEGPath

Code
dir.create("megpath_inputs", showWarnings = FALSE)

# Filter A variants
write.table(norm_freq, file = "megpath_inputs/tcr_multi_tp_freq.csv",
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(norm_sqrt, file = "megpath_inputs/tcr_multi_tp_sqrt.csv",
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(norm_log, file = "megpath_inputs/tcr_multi_tp_log.csv",
            sep = ",", row.names = FALSE, col.names = FALSE)

write.table(count_a, file = "megpath_inputs/tcr_multi_tp_counts.csv",
            sep = ",", row.names = FALSE, col.names = FALSE)

# Filter B
write.table(filter_b, file = "megpath_inputs/tcr_count3_freq.csv",
            sep = ",", row.names = FALSE, col.names = FALSE)

# Filter C
write.table(filter_c, file = "megpath_inputs/tcr_bio_freq.csv",
            sep = ",", row.names = FALSE, col.names = FALSE)

# Labeled version for interpretation
write.csv(data.frame(cdr3 = multi_df$cdr3, norm_freq),
          file = "megpath_inputs/tcr_multi_tp_freq_labeled.csv", row.names = FALSE)

# Zihan's 7-column matrix (for comparison)
write.table(zihan_mat, file = "megpath_inputs/tcr_7col_all.csv",
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(zihan_filtered, file = "megpath_inputs/tcr_7col_filtered.csv",
            sep = ",", row.names = FALSE, col.names = FALSE)

cat("=== Exported TCR CSVs ===\n")
=== Exported TCR CSVs ===
Code
cat("megpath_inputs/tcr_multi_tp_freq.csv       — Filter A, frequencies (4 col)\n")
megpath_inputs/tcr_multi_tp_freq.csv       — Filter A, frequencies (4 col)
Code
cat("megpath_inputs/tcr_multi_tp_sqrt.csv       — Filter A, sqrt(freq) (4 col)\n")
megpath_inputs/tcr_multi_tp_sqrt.csv       — Filter A, sqrt(freq) (4 col)
Code
cat("megpath_inputs/tcr_multi_tp_log.csv        — Filter A, log1p(freq) (4 col)\n")
megpath_inputs/tcr_multi_tp_log.csv        — Filter A, log1p(freq) (4 col)
Code
cat("megpath_inputs/tcr_multi_tp_counts.csv     — Filter A, raw counts (4 col)\n")
megpath_inputs/tcr_multi_tp_counts.csv     — Filter A, raw counts (4 col)
Code
cat("megpath_inputs/tcr_count3_freq.csv         — Filter B, frequencies (4 col)\n")
megpath_inputs/tcr_count3_freq.csv         — Filter B, frequencies (4 col)
Code
cat("megpath_inputs/tcr_bio_freq.csv            — Filter C, frequencies (4 col)\n")
megpath_inputs/tcr_bio_freq.csv            — Filter C, frequencies (4 col)
Code
cat("megpath_inputs/tcr_7col_all.csv            — Zihan's approach, all (7 col)\n")
megpath_inputs/tcr_7col_all.csv            — Zihan's approach, all (7 col)
Code
cat("megpath_inputs/tcr_7col_filtered.csv       — Zihan's approach, filtered (7 col)\n")
megpath_inputs/tcr_7col_filtered.csv       — Zihan's approach, filtered (7 col)
Code
cat("megpath_inputs/tcr_multi_tp_freq_labeled.csv — Filter A with CDR3 labels\n")
megpath_inputs/tcr_multi_tp_freq_labeled.csv — Filter A with CDR3 labels

Save Results

Code
saveRDS(list(
  all_tcr_errors = all_tcr_errors,
  table_s8 = s8,
  freq_mat = freq_mat,
  filter_a = filter_a,
  filter_b = filter_b,
  filter_c = filter_c,
  n_timepoints = n_timepoints,
  top_clonotypes = top10
), "03_tcr_outputs.rds")

cat("Saved to 03_tcr_outputs.rds\n")
Saved to 03_tcr_outputs.rds

Summary

Code
cat("=== Script 03 Summary ===\n\n")
=== Script 03 Summary ===
Code
cat("--- The Problem ---\n")
--- The Problem ---
Code
cat(sprintf("Ethan's TCR matrix: %d clonotypes, but %d (%.0f%%) appear at only 1 timepoint.\n",
            nrow(freq_mat), sum(n_timepoints == 1),
            100 * sum(n_timepoints == 1) / nrow(freq_mat)))
Ethan's TCR matrix: 2541 clonotypes, but 2443 (96%) appear at only 1 timepoint.
Code
cat("NMF can't learn temporal patterns from single-timepoint data.\n\n")
NMF can't learn temporal patterns from single-timepoint data.
Code
cat("--- Filtering ---\n")
--- Filtering ---
Code
cat(sprintf("Filter A (multi-timepoint): %d clonotypes\n", nrow(filter_a)))
Filter A (multi-timepoint): 98 clonotypes
Code
cat(sprintf("Filter B (count >= 3): %d clonotypes\n", nrow(filter_b)))
Filter B (count >= 3): 51 clonotypes
Code
cat(sprintf("Filter C (bio-relevant): %d clonotypes\n", nrow(filter_c)))
Filter C (bio-relevant): 416 clonotypes
Code
cat("\n--- Best Result ---\n")

--- Best Result ---
Code
best_tcr <- all_tcr_errors |>
  filter(rank == 3) |>
  slice_min(nmae, n = 1)
bl_tcr <- baseline_tcr |> filter(rank == 3)
pct_tcr <- 100 * (bl_tcr$nmae[1] - best_tcr$nmae[1]) / bl_tcr$nmae[1]

cat(sprintf("Baseline: NMAE=%.6f (%d rows)\n", bl_tcr$nmae[1], bl_tcr$n_rows[1]))
Baseline: NMAE=0.001527 (2541 rows)
Code
cat(sprintf("Best: %s, NMAE=%.6f (%d rows, %.1f%% improvement)\n",
            best_tcr$label[1], best_tcr$nmae[1], best_tcr$n_rows[1], pct_tcr))
Best: C: freq (bio), NMAE=0.001362 (416 rows, 10.8% improvement)
Code
cat("\n--- Next Steps ---\n")

--- Next Steps ---
Code
cat("1. Hand TCR CSVs to Ethan for MEGPath testing\n")
1. Hand TCR CSVs to Ethan for MEGPath testing
Code
cat("2. Script 04: Biological interpretation of NMF factors\n")
2. Script 04: Biological interpretation of NMF factors
Code
cat("3. Explore MEGPath constraint feature with known clonotype patterns\n")
3. Explore MEGPath constraint feature with known clonotype patterns