Step 1: Reproduce & Understand Ethan’s Preprocessing

Trace his testing.R exactly, verify outputs, identify problems

Goal

Before we improve anything, we need to:

Reproduce Ethan’s testing.R preprocessing step by step
Verify our reproduced output matches his CSVs (dense_data.csv, dense_dataLT.csv)
Document exactly what each step does to the data
Identify the specific preprocessing problems causing high NMF error
Compute baseline error metrics so improvements are quantifiable

Ethan’s NMF runs in MEGPath (C++ simulated annealing), not R. The error metric MEGPath uses is Least Absolute Deviation (sum of |V - WH|). We’ll compute both LAD and Frobenius norm for completeness.

Setup

Code

library(Seurat)
library(ggplot2)
library(dplyr)
library(Matrix)

source("theme_donq.R")

data_dir <- "NMF Research Data/Data (scRNA, TCR)"

Part 1: Load the Seurat Object

This is the same published object Ethan loads in testing.R line 1.

Code

sobj <- readRDS("NMF Research Data/Copy of Ito_MQ_scRNAseq_sct(1).rds")

cat("Cells:  ", ncol(sobj), "\n")

Cells:   4876

Code

cat("Genes:  ", nrow(sobj), "\n")

Genes:   17857

Code

cat("Assays: ", paste(Assays(sobj), collapse = ", "), "\n")

Assays:  RNA, SCT

Code

cat("Default assay:", DefaultAssay(sobj), "\n")

Default assay: RNA

Code

# What metadata columns exist?
cat("Metadata columns:\n")

Metadata columns:

Code

print(colnames(sobj@meta.data))

 [1] "orig.ident"                "nCount_RNA"               
 [3] "nFeature_RNA"              "nCount_HTO"               
 [5] "nFeature_HTO"              "HTO_maxID"                
 [7] "HTO_secondID"              "HTO_margin"               
 [9] "HTO_classification"        "HTO_classification.global"
[11] "hash.ID"                   "channel"                  
[13] "cdr3s_aa"                  "percent.mt"               
[15] "percent.ribo"              "Scrublet.Score"           
[17] "Predicted.Doublets"        "S.Score"                  
[19] "G2M.Score"                 "CC.Difference"            
[21] "Phase"                     "nCount_SCT"               
[23] "nFeature_SCT"              "SCT_snn_res.0.8"          
[25] "seurat_clusters"           "seurat_clusters2"

Code

# HTO hashtag assignments — this is what Ethan groups by
cat("\nHTO_maxID (timepoint assignments):\n")


HTO_maxID (timepoint assignments):

Code

print(table(sobj$HTO_maxID))


HashTag1 HashTag2 HashTag3 HashTag4 
    1861      223     2137      655

Code

# Cluster assignments
cat("\nCluster distribution:\n")


Cluster distribution:

Code

if ("seurat_clusters" %in% colnames(sobj@meta.data)) {
  print(table(sobj$seurat_clusters))
}


  0   1   2   3   4   5   6   7   8   9  10  11  12 
937 851 615 516 402 345 335 235 226 191  96  78  49

Code

# What's in the RNA assay vs SCT assay?
cat("=== RNA assay ===\n")

=== RNA assay ===

Code

DefaultAssay(sobj) <- "RNA"
cat("Layers:", paste(Layers(sobj), collapse = ", "), "\n")

Layers: counts, data, scale.data

Code

# Check if "data" layer exists (this is what Ethan uses)
rna_data_sample <- LayerData(sobj, layer = "data")[1:5, 1:5]
cat("\nRNA 'data' slot (first 5x5):\n")


RNA 'data' slot (first 5x5):

Code

print(as.matrix(rna_data_sample))

           MQpool1_RS-03742653_AAACCTGTCGCGTTTC-1
AL627309.1                                      0
AL627309.5                                      0
LINC01409                                       0
FAM87B                                          0
LINC01128                                       0
           MQpool1_RS-03742653_AAACCTGTCTTGCAAG-1
AL627309.1                                      0
AL627309.5                                      0
LINC01409                                       0
FAM87B                                          0
LINC01128                                       0
           MQpool1_RS-03742653_AAACGGGCAAACTGCT-1
AL627309.1                                      0
AL627309.5                                      0
LINC01409                                       0
FAM87B                                          0
LINC01128                                       0
           MQpool1_RS-03742653_AAACGGGCAATCTACG-1
AL627309.1                               0.000000
AL627309.5                               0.000000
LINC01409                                0.000000
FAM87B                                   0.000000
LINC01128                                1.084157
           MQpool1_RS-03742653_AAACGGGCAGTAAGAT-1
AL627309.1                                      0
AL627309.5                                      0
LINC01409                                       0
FAM87B                                          0
LINC01128                                       0

Code

cat("\n=== SCT assay ===\n")


=== SCT assay ===

Code

DefaultAssay(sobj) <- "SCT"
cat("Layers:", paste(Layers(sobj), collapse = ", "), "\n")

Layers: counts, data, scale.data

Code

sct_data_sample <- LayerData(sobj, layer = "data")[1:5, 1:5]
cat("\nSCT 'data' slot (first 5x5):\n")


SCT 'data' slot (first 5x5):

Code

print(as.matrix(sct_data_sample))

           MQpool1_RS-03742653_AAACCTGTCGCGTTTC-1
AL627309.5                                      0
LINC01409                                       0
LINC01128                                       0
LINC00115                                       0
FAM41C                                          0
           MQpool1_RS-03742653_AAACCTGTCTTGCAAG-1
AL627309.5                                      0
LINC01409                                       0
LINC01128                                       0
LINC00115                                       0
FAM41C                                          0
           MQpool1_RS-03742653_AAACGGGCAAACTGCT-1
AL627309.5                                      0
LINC01409                                       0
LINC01128                                       0
LINC00115                                       0
FAM41C                                          0
           MQpool1_RS-03742653_AAACGGGCAATCTACG-1
AL627309.5                              0.0000000
LINC01409                               0.0000000
LINC01128                               0.6931472
LINC00115                               0.0000000
FAM41C                                  0.0000000
           MQpool1_RS-03742653_AAACGGGCAGTAAGAT-1
AL627309.5                                      0
LINC01409                                       0
LINC01128                                       0
LINC00115                                       0
FAM41C                                          0

Part 2: Reproduce Ethan’s Pipeline Exactly

Following testing.R line by line.

Step 2a: AverageExpression on RNA “data” slot

Ethan: AverageExpression(published_data, group.by = "HTO_maxID", assays = "RNA", slot = "data", return.seurat = TRUE)

The "data" slot in Seurat’s RNA assay contains log-normalized values from the original NormalizeData() call. So he’s averaging already-log-normalized values.

Code

DefaultAssay(sobj) <- "RNA"

# Reproduce Ethan's AverageExpression call
dim_reduction <- AverageExpression(sobj, group.by = "HTO_maxID",
                                    assays = "RNA", slot = "data",
                                    return.seurat = TRUE)

cat("Result dimensions:", nrow(dim_reduction), "genes x", ncol(dim_reduction), "groups\n")

Result dimensions: 17857 genes x 4 groups

Code

cat("Column names (HTO groups):\n")

Column names (HTO groups):

Code

print(colnames(dim_reduction))

[1] "HashTag1" "HashTag2" "HashTag3" "HashTag4"

Step 2b: Double log-normalization

Ethan: NormalizeData(dim_reduction, normalization.method = "LogNormalize", scale.factor = 10000)

This applies log1p(x / colSum * 10000) to the already-averaged log-normalized values. This is the double log-normalization problem.

Code

dim_reduction <- NormalizeData(dim_reduction,
                                normalization.method = "LogNormalize",
                                scale.factor = 10000)

# Extract the final matrix (what becomes dense_dataLT.csv)
reproduced_LT <- as.matrix(LayerData(dim_reduction, assay = "RNA", layer = "data"))

cat("Reproduced LT matrix:", nrow(reproduced_LT), "x", ncol(reproduced_LT), "\n")

Reproduced LT matrix: 17857 x 4

Code

cat("Value range:", range(reproduced_LT), "\n")

Value range: 0 6.331372

Step 2c: Compare with Ethan’s actual CSV

Code

# Load Ethan's output
ethan_LT <- as.matrix(read.csv(file.path(data_dir, "dense_dataLT.csv"), header = FALSE))
colnames(ethan_LT) <- c("Pre", "Wk3", "Wk6", "Wk9")

cat("Ethan's LT:", nrow(ethan_LT), "x", ncol(ethan_LT), "\n")

Ethan's LT: 17857 x 4

Code

cat("Our reproduced:", nrow(reproduced_LT), "x", ncol(reproduced_LT), "\n")

Our reproduced: 17857 x 4

Code

# Check if dimensions match
if (nrow(ethan_LT) == nrow(reproduced_LT)) {
  # Compare values — need to match column order
  # Ethan's columns are Pre, Wk3, Wk6, Wk9 but we need to check
  # which HTO_maxID maps to which timepoint
  cat("\nColumn mapping (our column names):\n")
  print(colnames(reproduced_LT))

  # Try direct comparison of first few rows
  cat("\nOur first 3 rows:\n")
  print(reproduced_LT[1:3, ])
  cat("\nEthan's first 3 rows:\n")
  print(ethan_LT[1:3, ])
} else {
  cat("\nDimension mismatch — Ethan may have filtered rows\n")
  cat("Difference:", nrow(reproduced_LT) - nrow(ethan_LT), "rows\n")
}


Column mapping (our column names):
[1] "HashTag1" "HashTag2" "HashTag3" "HashTag4"

Our first 3 rows:
              HashTag1   HashTag2    HashTag3  HashTag4
AL627309.1 0.000000000 0.00000000 0.001019636 0.0000000
AL627309.5 0.006586141 0.00000000 0.003520073 0.0133544
LINC01409  0.089719810 0.05835255 0.119702096 0.1186154

Ethan's first 3 rows:
             Pre        Wk3         Wk6       Wk9
[1,] 0.000000000 0.00000000 0.001019636 0.0000000
[2,] 0.006586141 0.00000000 0.003520073 0.0133544
[3,] 0.089719810 0.05835255 0.119702096 0.1186154

Code

# Figure out the HTO → timepoint mapping
# From the paper: C0251 = Pre, C0252 = 3wk, C0253 = 6wk, C0254 = 9wk
# But HTO_maxID values may be labeled differently

# Strategy: compare column values to find the mapping
# Each column in Ethan's CSV should match one column in ours

if (nrow(ethan_LT) == nrow(reproduced_LT)) {
  cat("Correlation matrix (our cols vs Ethan's cols):\n")
  cor_mat <- cor(reproduced_LT, ethan_LT)
  print(round(cor_mat, 4))

  # The diagonal or near-diagonal of high correlations reveals the mapping
  cat("\nBest match per Ethan column:\n")
  for (j in 1:ncol(ethan_LT)) {
    best <- which.max(cor_mat[, j])
    cat(sprintf("  Ethan col %d (%s) ← Our col '%s' (r = %.4f)\n",
                j, colnames(ethan_LT)[j], rownames(cor_mat)[best], cor_mat[best, j]))
  }

  # Check if values match exactly after reordering
  # Find best column order
  col_order <- apply(cor_mat, 2, which.max)
  reproduced_reordered <- reproduced_LT[, col_order]

  max_diff <- max(abs(reproduced_reordered - ethan_LT))
  mean_diff <- mean(abs(reproduced_reordered - ethan_LT))
  cat(sprintf("\nAfter reordering — max difference: %.6e, mean difference: %.6e\n",
              max_diff, mean_diff))

  if (max_diff < 1e-6) {
    cat("MATCH: Our reproduction matches Ethan's CSV exactly.\n")
  } else {
    cat("MISMATCH: Values differ. Investigating...\n")
  }
}

Correlation matrix (our cols vs Ethan's cols):
            Pre    Wk3    Wk6    Wk9
HashTag1 1.0000 0.9712 0.9742 0.9743
HashTag2 0.9712 1.0000 0.9882 0.9863
HashTag3 0.9742 0.9882 1.0000 0.9953
HashTag4 0.9743 0.9863 0.9953 1.0000

Best match per Ethan column:
  Ethan col 1 (Pre) ← Our col 'HashTag1' (r = 1.0000)
  Ethan col 2 (Wk3) ← Our col 'HashTag2' (r = 1.0000)
  Ethan col 3 (Wk6) ← Our col 'HashTag3' (r = 1.0000)
  Ethan col 4 (Wk9) ← Our col 'HashTag4' (r = 1.0000)

After reordering — max difference: 5.329071e-15, mean difference: 2.040335e-16
MATCH: Our reproduction matches Ethan's CSV exactly.

Part 3: Examine the dense_data.csv (Raw/Min-Max Version)

Code

ethan_raw <- as.matrix(read.csv(file.path(data_dir, "dense_data.csv"), header = FALSE))
colnames(ethan_raw) <- c("Pre", "Wk3", "Wk6", "Wk9")

cat("dense_data.csv:", nrow(ethan_raw), "x", ncol(ethan_raw), "\n")

dense_data.csv: 17857 x 4

Code

cat("Value range:", range(ethan_raw), "\n")

Value range: 0 560.9268

Code

# Is this min-max normalized? Check if max = 1 per column
cat("\nColumn maxima:", apply(ethan_raw, 2, max), "\n")


Column maxima: 560.9268 351.9221 333.4951 381.8898

Code

cat("Column minima:", apply(ethan_raw, 2, min), "\n")

Column minima: 0 0 0 0

Code

# Or is it globally min-max normalized?
cat("Global max:", max(ethan_raw), "\n")

Global max: 560.9268

Code

# Check relationship to the LT version
cat("\nIs raw = min-max(LT)?\n")


Is raw = min-max(LT)?

Code

lt_minmax <- apply(ethan_LT, 2, function(x) (x - min(x)) / (max(x) - min(x)))
diff_check <- max(abs(lt_minmax - ethan_raw))
cat("Max difference between min-max(LT) and raw:", diff_check, "\n")

Max difference between min-max(LT) and raw: 559.9268

Code

if (diff_check < 1e-6) {
  cat("YES: dense_data.csv is column-wise min-max normalized dense_dataLT.csv\n")
} else {
  # Try global min-max
  lt_global_mm <- (ethan_LT - min(ethan_LT)) / (max(ethan_LT) - min(ethan_LT))
  diff_check2 <- max(abs(lt_global_mm - ethan_raw))
  cat("Global min-max diff:", diff_check2, "\n")

  # Try: maybe raw is from before the second NormalizeData call?
  cat("\nInvestigating: raw might be from a different normalization path\n")
}

Global min-max diff: 559.9268 

Investigating: raw might be from a different normalization path

Step 3b: Investigate dense_data.csv origin

The range 0–560 doesn’t match any min-max of the LT version. Let’s test whether dense_data.csv comes from averaging raw counts (slot = "counts") instead of the log-normalized slot = "data".

Code

# Hypothesis 1: dense_data.csv = AverageExpression on RNA "counts" slot
avg_counts <- AverageExpression(sobj, group.by = "HTO_maxID",
                                 assays = "RNA", slot = "counts")$RNA
avg_counts <- as.matrix(avg_counts)

cat("Averaged raw counts: range", range(avg_counts), "\n")

Averaged raw counts: range 0 121.5023

Code

cat("Ethan raw: range", range(ethan_raw), "\n")

Ethan raw: range 0 560.9268

Code

# Reorder columns to match Ethan (HashTag1=Pre, ..., HashTag4=Wk9)
avg_counts_ordered <- avg_counts[, c("HashTag1", "HashTag2", "HashTag3", "HashTag4")]

# Direct comparison
cat("\nFirst 3 rows of averaged counts:\n")


First 3 rows of averaged counts:

Code

print(avg_counts_ordered[1:3, ])

              HashTag1   HashTag2     HashTag3    HashTag4
AL627309.1 0.000000000 0.00000000 0.0004679457 0.000000000
AL627309.5 0.001612037 0.00000000 0.0009358914 0.003053435
LINC01409  0.021493821 0.03139013 0.0416471689 0.033587786

Code

cat("\nFirst 3 rows of Ethan's raw:\n")


First 3 rows of Ethan's raw:

Code

print(ethan_raw[1:3, ])

             Pre        Wk3         Wk6        Wk9
[1,] 0.000000000 0.00000000 0.001020156 0.00000000
[2,] 0.006607878 0.00000000 0.003526276 0.01344397
[3,] 0.093867750 0.06008867 0.127161016 0.12593675

Code

if (nrow(avg_counts_ordered) == nrow(ethan_raw)) {
  max_diff_counts <- max(abs(avg_counts_ordered - ethan_raw))
  cat(sprintf("\nMax diff (avg counts vs ethan raw): %.10e\n", max_diff_counts))

  if (max_diff_counts < 1e-6) {
    cat("MATCH: dense_data.csv = AverageExpression(slot='counts')\n")
    cat("This means dense_data.csv is average raw UMI counts per timepoint.\n")
    cat("NOT double-log-normalized, NOT min-max scaled.\n")
  } else {
    cat("No match. Trying other hypotheses...\n")

    # Hypothesis 2: averaged "data" slot BEFORE second NormalizeData
    avg_data <- AverageExpression(sobj, group.by = "HTO_maxID",
                                   assays = "RNA", slot = "data")$RNA
    avg_data <- as.matrix(avg_data)
    avg_data_ordered <- avg_data[, c("HashTag1", "HashTag2", "HashTag3", "HashTag4")]

    max_diff_data <- max(abs(avg_data_ordered - ethan_raw))
    cat(sprintf("Max diff (avg data slot vs ethan raw): %.10e\n", max_diff_data))
    cat("Avg data slot range:", range(avg_data_ordered), "\n")

    if (max_diff_data < 1e-6) {
      cat("MATCH: dense_data.csv = AverageExpression(slot='data') before 2nd normalize\n")
    } else {
      # Hypothesis 3: check correlation to identify transformation
      cor_counts <- cor(as.vector(avg_counts_ordered), as.vector(ethan_raw))
      cor_data <- cor(as.vector(avg_data_ordered), as.vector(ethan_raw))
      cat(sprintf("\nCorrelation with avg counts: %.6f\n", cor_counts))
      cat(sprintf("Correlation with avg data:   %.6f\n", cor_data))
      cat("Higher correlation indicates the source.\n")
    }
  }
}


Max diff (avg counts vs ethan raw): 4.6200794642e+02
No match. Trying other hypotheses...
Max diff (avg data slot vs ethan raw): 4.9737991503e-13
Avg data slot range: 0 560.9268 
MATCH: dense_data.csv = AverageExpression(slot='data') before 2nd normalize

Code

# Summary: which CSV does Ethan actually feed to MEGPath?
cat("=== MEGPath Input Analysis ===\n\n")

=== MEGPath Input Analysis ===

Code

cat("dense_data.csv:\n")

dense_data.csv:

Code

cat("  Range: 0 to", max(ethan_raw), "\n")

  Range: 0 to 560.9268

Code

cat("  NOT in [0,1] — needs min-max before MEGPath can use it\n")

  NOT in [0,1] — needs min-max before MEGPath can use it

Code

cat("  MEGPath must either handle this internally or Ethan scales it elsewhere\n\n")

  MEGPath must either handle this internally or Ethan scales it elsewhere

Code

cat("dense_dataLT.csv:\n")

dense_dataLT.csv:

Code

cat("  Range: 0 to", max(ethan_LT), "\n")

  Range: 0 to 6.331372

Code

cat("  Also not in [0,1] — same issue\n\n")

  Also not in [0,1] — same issue

Code

cat("MEGPath arguments.txt shows: max_runs=100000, 4 columns, 3 patterns\n")

MEGPath arguments.txt shows: max_runs=100000, 4 columns, 3 patterns

Code

cat("The normalization to [0,1] likely happens inside MEGPath C++ code\n")

The normalization to [0,1] likely happens inside MEGPath C++ code

Code

cat("(the Dyer & Dymacek paper states: 'normalized to the [0-1] domain')\n")

(the Dyer & Dymacek paper states: 'normalized to the [0-1] domain')

Part 4: Document the Data Quality Issues

Code

# What fraction of each matrix is zeros or near-zeros?
cat("=== Sparsity ===\n")

=== Sparsity ===

Code

cat("dense_data.csv exact zeros:", mean(ethan_raw == 0), "\n")

dense_data.csv exact zeros: 0.124223

Code

cat("dense_dataLT.csv exact zeros:", mean(ethan_LT == 0), "\n")

dense_dataLT.csv exact zeros: 0.124223

Code

cat("Values < 0.001 in raw:", mean(ethan_raw < 0.001), "\n")

Values < 0.001 in raw: 0.1351431

Code

cat("Values < 0.01 in raw:", mean(ethan_raw < 0.01), "\n")

Values < 0.01 in raw: 0.2762922

Code

# Distribution of row sums (total expression per gene)
row_sums <- rowSums(ethan_raw)
cat("\nRow sum distribution:\n")


Row sum distribution:

Code

print(summary(row_sums))

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.000e+00 3.391e-02 3.301e-01 2.240e+00 1.265e+00 1.628e+03

Code

cat("Rows with sum < 0.01:", sum(row_sums < 0.01), "\n")

Rows with sum < 0.01: 2145

Code

cat("Rows with sum < 0.001:", sum(row_sums < 0.001), "\n")

Rows with sum < 0.001: 190

Code

# Histogram of values
val_df <- data.frame(value = as.vector(ethan_raw))
p_hist <- ggplot(val_df, aes(x = value)) +
  geom_histogram(bins = 100, fill = donq_colors$gold, color = NA, alpha = 0.8) +
  theme_donq() +
  labs(
    title = "Distribution of Values in dense_data.csv",
    subtitle = "Most values are near zero — NMF wastes capacity fitting noise",
    x = "Expression Value",
    y = "Count",
    caption = "Ethan's scRNA matrix | 17,857 genes x 4 timepoints"
  ) +
  scale_y_log10()

p_hist

Code

save_donq(p_hist, "01_value_distribution.png")

Code

# Which genes actually vary across timepoints?
row_vars <- apply(ethan_raw, 1, var)

cat("Row variance distribution:\n")

Row variance distribution:

Code

print(summary(row_vars))

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.000e+00 0.000e+00 6.000e-04 1.025e+00 4.200e-03 1.092e+04

Code

cat("Genes with variance < 1e-6:", sum(row_vars < 1e-6), "\n")

Genes with variance < 1e-6: 542

Code

cat("Genes with variance < 1e-4:", sum(row_vars < 1e-4), "\n")

Genes with variance < 1e-4: 5833

Code

cat("Genes with variance > 0.001:", sum(row_vars > 0.001), "\n")

Genes with variance > 0.001: 7828

Code

p_var <- ggplot(data.frame(var = row_vars[row_vars > 0]), aes(x = var)) +
  geom_histogram(bins = 100, fill = donq_colors$blue, color = NA, alpha = 0.8) +
  theme_donq() +
  scale_x_log10() +
  labs(
    title = "Gene Variance Across Timepoints",
    subtitle = "Only genes with non-zero variance shown (log scale)",
    x = "Variance (log10)",
    y = "Count",
    caption = "Higher variance = more informative for NMF"
  )

p_var

Code

save_donq(p_var, "01_gene_variance.png")

Part 5: Examine TCR Data

Code

ethan_tcr <- as.matrix(read.csv(file.path(data_dir, "tcr_expansion.csv"), header = FALSE))
colnames(ethan_tcr) <- c("Pre", "Wk3", "Wk6", "Wk9")

cat("tcr_expansion.csv:", nrow(ethan_tcr), "x", ncol(ethan_tcr), "\n")

tcr_expansion.csv: 2541 x 4

Code

cat("Value range:", range(ethan_tcr), "\n")

Value range: 0 12.04819

Code

cat("Column sums:", colSums(ethan_tcr), "\n")

Column sums: 100 100 100 100

Code

# Verify this matches Table S8 frequencies
# First row should be CASSSLGSFDEQFF: 1.446, 12.048, 5.209, 6.512
cat("\nFirst row:", ethan_tcr[1, ], "\n")


First row: 1.446051 12.04819 5.209003 6.511628

Code

cat("Expected (CASSSLGSFDEQFF freqs): 1.446, 12.048, 5.209, 6.512\n")

Expected (CASSSLGSFDEQFF freqs): 1.446, 12.048, 5.209, 6.512

Code

# Sparsity
cat("\nTCR exact zeros:", mean(ethan_tcr == 0), "\n")


TCR exact zeros: 0.7379969

Code

cat("TCR rows with all zeros:", sum(rowSums(ethan_tcr) == 0), "\n")

TCR rows with all zeros: 0

Code

cat("TCR rows present in only 1 timepoint:", sum(rowSums(ethan_tcr > 0) == 1), "\n")

TCR rows present in only 1 timepoint: 2443

Code

cat("TCR rows present in >= 2 timepoints:", sum(rowSums(ethan_tcr > 0) >= 2), "\n")

TCR rows present in >= 2 timepoints: 98

Part 6: Compute Baseline Error Metrics

Since MEGPath uses Least Absolute Deviation (LAD), we compute that alongside Frobenius norm. We can’t run MEGPath here, but we can measure how well a simple rank-k approximation fits using R — this gives a lower bound on error since R’s multiplicative updates converge better per-iteration than simulated annealing.

Code

if (!requireNamespace("NMF", quietly = TRUE)) install.packages("NMF")
library(NMF)

compute_errors <- function(mat, k, label, nrun = 10) {
  # Remove all-zero rows
  mat_clean <- mat[rowSums(mat) > 0, , drop = FALSE]

  # Run NMF (Lee algorithm minimizes Frobenius directly)
  fit <- nmf(mat_clean, rank = k, method = "lee", seed = "random",
             nrun = nrun, .opt = "vp2")

  reconstructed <- fitted(fit)
  residual <- mat_clean - reconstructed

  # Frobenius norm (sqrt of sum of squared errors)
  frob <- sqrt(sum(residual^2))

  # LAD — Least Absolute Deviation (what MEGPath uses)
  lad <- sum(abs(residual))

  # Per-element metrics for interpretability
  n_elements <- length(residual)

  cat(sprintf("[%s] k=%d: Frobenius=%.4f, LAD=%.4f, MAE=%.6f, Rows=%d\n",
              label, k, frob, lad, lad / n_elements, nrow(mat_clean)))

  data.frame(
    label = label, rank = k,
    frobenius = frob, lad = lad,
    mae = lad / n_elements,
    n_rows = nrow(mat_clean),
    n_elements = n_elements,
    stringsAsFactors = FALSE
  )
}

Code

baseline_results <- bind_rows(
  # scRNA
  compute_errors(ethan_raw, k = 2, "scRNA raw"),
  compute_errors(ethan_raw, k = 3, "scRNA raw"),
  compute_errors(ethan_LT, k = 2, "scRNA log"),
  compute_errors(ethan_LT, k = 3, "scRNA log"),
  # TCR
  compute_errors(ethan_tcr, k = 2, "TCR freq"),
  compute_errors(ethan_tcr, k = 3, "TCR freq")
)


Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   2.94    0.02    6.08 
[scRNA raw] k=2: Frobenius=44.5798, LAD=2407.4945, MAE=0.033891, Rows=17759


Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   2.64    0.05    8.17 
[scRNA raw] k=3: Frobenius=19.4095, LAD=1147.8237, MAE=0.016158, Rows=17759


Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   2.69    0.02    6.03 
[scRNA log] k=2: Frobenius=7.6842, LAD=1072.3840, MAE=0.015096, Rows=17759


Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   2.62    0.02    8.85 
[scRNA log] k=3: Frobenius=3.8368, LAD=504.0271, MAE=0.007095, Rows=17759


Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   2.69    0.03    4.59 
[TCR freq] k=2: Frobenius=4.8565, LAD=298.3988, MAE=0.029358, Rows=2541


Runs: |                                                        
Runs: |                                                  |   0%
Runs: |                                                        
Runs: |==================================================| 100%
System time:
   user  system elapsed 
   2.57    0.08    4.71 
[TCR freq] k=3: Frobenius=3.1848, LAD=187.0445, MAE=0.018403, Rows=2541

Code

cat("\n=== Baseline Error Summary ===\n")


=== Baseline Error Summary ===

Code

print(baseline_results)

      label rank frobenius       lad         mae n_rows n_elements
1 scRNA raw    2 44.579827 2407.4945 0.033891189  17759      71036
2 scRNA raw    3 19.409538 1147.8237 0.016158338  17759      71036
3 scRNA log    2  7.684250 1072.3840 0.015096346  17759      71036
4 scRNA log    3  3.836781  504.0271 0.007095375  17759      71036
5  TCR freq    2  4.856511  298.3988 0.029358401   2541      10164
6  TCR freq    3  3.184796  187.0445 0.018402646   2541      10164

Code

p_baseline <- ggplot(baseline_results,
                     aes(x = factor(rank), y = lad, fill = label)) +
  geom_col(position = "dodge", width = 0.7) +
  scale_fill_donq() +
  theme_donq() +
  labs(
    title = "Baseline Reconstruction Error (LAD)",
    subtitle = "R NMF on Ethan's matrices — lower bound for MEGPath error",
    x = "Rank (k)",
    y = "Least Absolute Deviation (sum |V - WH|)",
    fill = "Matrix",
    caption = "Lee algorithm, 10 random starts | eigen.io"
  )

p_baseline

Code

save_donq(p_baseline, "01_baseline_lad.png")

Part 7: Save Everything for Later Steps

Code

saveRDS(list(
  ethan_raw = ethan_raw,
  ethan_LT = ethan_LT,
  ethan_tcr = ethan_tcr,
  reproduced_LT = reproduced_LT,
  baseline_results = baseline_results,
  avg_counts = if (exists("avg_counts_ordered")) avg_counts_ordered else NULL,
  avg_data = if (exists("avg_data_ordered")) avg_data_ordered else NULL
), "01_explore_outputs.rds")

cat("Saved to 01_explore_outputs.rds\n")

Saved to 01_explore_outputs.rds

Summary of Findings

After reproducing Ethan’s pipeline, here are the documented issues:

Preprocessing Problems

Issue	What Ethan Did	Impact
Double log-normalization	`AverageExpression(slot="data")` then `NormalizeData("LogNormalize")`	Compresses dynamic range; distorts relative expression
Wrong assay	Used RNA assay	SCT assay has variance-stabilized values (that’s why the object is `_sct.rds`)
Averaging in log space	Averages log-normalized per-cell values	Gets geometric mean, not arithmetic mean; underestimates expression for sparse genes
No feature selection	All 17,857+ genes	Thousands of near-zero noise rows inflate total error
No gene labels	Exported without row names	Cannot interpret which genes NMF assigns to which factor

What Stays the Same

The 4-column pseudobulk structure (Pre, Wk3, Wk6, Wk9)
MEGPath handles [0,1] normalization internally (the paper states this)
TCR frequencies from Table S8 (these appear correctly extracted)
dense_data.csv origin is now identified (see Step 3b)

Bottom Line

dense_data.csv = AverageExpression(slot = "data") on the RNA assay — averaged log-normalized values, range 0–561. This is the single-log version.
dense_dataLT.csv = the above run through NormalizeData("LogNormalize") a second time — double-log, range 0–6.33.
tcr_expansion.csv = frequency columns from Table S8 — percentages summing to 100 per timepoint. 96% of clonotypes appear in only one timepoint.
MEGPath normalizes to [0,1] internally, so these CSVs are fed in as-is.

Next Step

02_improved_scrna_matrix.qmd will fix each preprocessing issue one at a time, exporting corrected CSVs for MEGPath. We’ll track how each fix changes the R NMF error as a proxy before running MEGPath.