Goal: Compare multiple ML approaches on a single tabular prediction task — predicting whether a 7-card opening hand in MTG Limited will lead to a win — and present results in an interactive Streamlit UI.
Data: Card-level and game-level statistics from 17lands.com, the largest community dataset for MTG Arena draft data.
Code: github.com/donq-t/mtg-hand-evaluator
## The Problem
In Magic: The Gathering Limited (Draft), you start each game by drawing 7 cards. The quality of this opening hand — land count, card quality, mana curve, color consistency — has a measurable effect on win rate. Can we quantify that effect and predict outcomes from the hand alone?
The input is a 7-card hand. The output is $P(\text{win})$.
## Feature Engineering
Each hand is transformed into a 16-dimensional feature vector:
| Feature | Description |
|---|---|
| `num_lands` | Count of lands in hand (0–7) |
| `num_creatures`, `num_spells` | Card type breakdown |
| `avg_gih_wr`, `max_gih_wr`, `min_gih_wr` | Card quality stats (Games-in-Hand Win Rate from 17lands) |
| `avg_cmc` | Average mana cost of non-land cards |
| `curve_1` through `curve_5plus` | Mana curve distribution |
| `num_colors` | Color diversity |
| `color_consistency` | Fraction of spells castable with the lands in hand |
| `on_play` | Play/draw indicator |
The key insight: 17lands publishes per-card win rates (GIH WR), so each card in a hand has a known empirical strength. Aggregating these into hand-level statistics gives the models strong signal.
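The feature builder can be sketched as follows. This is a hypothetical reconstruction, assuming a card table with `is_land`, `cmc`, and `gih_wr` columns (the repo's actual column names may differ); the color features are omitted for brevity.

```python
import pandas as pd

# Turn a 7-card hand (one row per card) into hand-level features.
# Column names `is_land`, `cmc`, `gih_wr` are assumptions.
def hand_features(hand: pd.DataFrame, on_play: bool) -> dict:
    spells = hand[~hand["is_land"]]
    feats = {
        "num_lands": int(hand["is_land"].sum()),
        "num_spells": len(spells),
        "avg_gih_wr": float(spells["gih_wr"].mean()),
        "max_gih_wr": float(spells["gih_wr"].max()),
        "min_gih_wr": float(spells["gih_wr"].min()),
        "avg_cmc": float(spells["cmc"].mean()),
        "on_play": int(on_play),
    }
    for c in range(1, 5):  # curve_1 .. curve_4
        feats[f"curve_{c}"] = int((spells["cmc"] == c).sum())
    feats["curve_5plus"] = int((spells["cmc"] >= 5).sum())
    return feats
```

Stacking these dicts across hands yields the 16-column training matrix.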
## Models
Five models, ordered by complexity:
### 1. Logistic Regression (baseline)
$$P(\text{win} \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$
A linear model — the weighted sum of features passed through a sigmoid. Fast and interpretable. The coefficients directly tell you which features matter and in which direction.
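A minimal sketch of this baseline with scikit-learn, on synthetic stand-ins for the 16-dimensional feature vectors (the real pipeline trains on features built from 17lands data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
# Simulate an outcome driven mostly by one "card quality" feature.
y = (X[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
p_win = clf.predict_proba(X)[:, 1]   # P(win) for each hand
weights = clf.coef_[0]               # one signed weight per feature
```

The sign and magnitude of each entry in `weights` is what makes the model interpretable: a large positive coefficient on a feature means higher values push the predicted win probability up.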
### 2. Random Forest
An ensemble of 200 decision trees that vote on the outcome. Handles non-linear relationships and feature interactions naturally. Provides feature importance via impurity reduction.
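The non-linearity claim can be illustrated with a synthetic interaction a linear model cannot express: a label that depends on the *product* of two features. The data here is made up for the demonstration; only the `n_estimators=200` setting comes from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # pure interaction signal

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = rf.feature_importances_    # impurity-based, sums to 1
```

Despite neither feature being individually predictive, the forest's impurity-based importances concentrate on the two interacting features.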
### 3. XGBoost (Gradient Boosting)
Builds trees sequentially, where each new tree corrects the residual errors of the ensemble so far. State of the art for tabular data. Typically the strongest model for structured prediction tasks.
### 4. Neural Network (MLP)
A 3-layer multi-layer perceptron ($128 \to 64 \to 32$) with ReLU activations, batch normalization, and dropout. Trained with Adam and early stopping. The deep learning entry point — learns non-linear feature representations through backpropagation.
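The described architecture can be sketched in PyTorch as below. The $128 \to 64 \to 32$ widths, ReLU, batch norm, and dropout come from the text; the dropout rate and exact layer ordering are assumptions where the text is silent.

```python
import torch
import torch.nn as nn

class HandMLP(nn.Module):
    def __init__(self, in_dim: int = 16, p_drop: float = 0.3):
        super().__init__()
        layers, dims = [], [in_dim, 128, 64, 32]
        for d_in, d_out in zip(dims, dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                       nn.ReLU(), nn.Dropout(p_drop)]
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(32, 1)  # logit -> sigmoid -> P(win)

    def forward(self, x):
        return torch.sigmoid(self.head(self.body(x))).squeeze(-1)

model = HandMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

Training would use binary cross-entropy against the win labels, stopping early when validation loss stops improving; that loop is omitted here.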
### 5. Card Embedding Model (stretch goal)
Instead of hand-crafted features, this model learns a dense vector representation (embedding) for each card — similar to word embeddings in NLP. The hand representation is the mean of its 7 card embeddings, passed through a prediction head.
$$\mathbf{h} = \frac{1}{7}\sum_{i=1}^{7} \text{Embed}(c_i), \quad P(\text{win}) = \sigma(f(\mathbf{h}))$$
This can capture card synergies that aggregate statistics miss.
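The equation above maps directly onto a small PyTorch module. Vocabulary size, embedding dimension, and the shape of the prediction head $f$ are illustrative assumptions here.

```python
import torch
import torch.nn as nn

class CardEmbeddingModel(nn.Module):
    def __init__(self, num_cards: int = 300, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(num_cards, dim)   # one vector per card
        self.head = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1))  # prediction head f

    def forward(self, card_ids):              # (batch, 7) card indices
        h = self.embed(card_ids).mean(dim=1)  # mean of the 7 embeddings
        return torch.sigmoid(self.head(h)).squeeze(-1)

model = CardEmbeddingModel()
hands = torch.randint(0, 300, (4, 7))   # 4 hands of 7 card ids
p_win = model(hands)                    # shape (4,), each in (0, 1)
```

Because the embeddings are trained end-to-end on game outcomes, cards that win together can end up close in embedding space, which is how synergy information enters the model.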
## Results
Trained on 50,000 synthetic games (MKM — Murders at Karlov Manor), split 70/15/15 into train/validation/test sets:
| Model | AUC | Accuracy | MAE | RMSE | Log Loss |
|---|---|---|---|---|---|
| Logistic Regression | 0.9844 | 0.9908 | 0.0137 | 0.0827 | 0.0243 |
| Random Forest | 0.9891 | 0.9909 | 0.0129 | 0.0785 | 0.0215 |
| XGBoost | 0.9890 | 0.9908 | 0.0113 | 0.0813 | 0.0218 |
| Neural Network | 0.9871 | 0.9908 | 0.0126 | 0.0795 | 0.0219 |
| Card Embeddings | 0.9933 | 0.9903 | 0.0101 | 0.0757 | 0.0170 |
All models comfortably beat the random baseline (0.50 AUC). The Card Embedding model achieved the highest AUC, suggesting that learned card representations capture signal beyond hand-crafted features. Among the feature-engineered models, XGBoost had the lowest MAE.
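For reference, each column of the table can be computed from held-out predictions with standard scikit-learn metrics. `y_true` and `p_pred` below are tiny illustrative placeholders, not the project's actual results.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, accuracy_score,
                             log_loss, mean_absolute_error)

y_true = np.array([1, 0, 1, 1, 0, 0])           # win/loss labels
p_pred = np.array([0.9, 0.2, 0.8, 0.6, 0.3, 0.1])  # predicted P(win)

auc  = roc_auc_score(y_true, p_pred)            # ranking quality
acc  = accuracy_score(y_true, p_pred >= 0.5)    # thresholded at 0.5
mae  = mean_absolute_error(y_true, p_pred)
rmse = float(np.sqrt(np.mean((y_true - p_pred) ** 2)))
ll   = log_loss(y_true, p_pred)                 # calibration-sensitive
```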
## Key Findings
Card quality dominates — `avg_gih_wr` (average card win rate) is the most important feature across all models. Drafting good cards matters more than anything else.
Land count sweet spot: 2–3 lands — Hands with 2–3 lands win most often; 0–1 lands (mana screw) and 5+ lands (mana flood) significantly hurt win rate.
Mana curve matters — Having plays at 2 and 3 mana is important. Hands full of expensive cards struggle even with enough lands.
Non-linearity helps — Random Forest and XGBoost outperform Logistic Regression, confirming that feature interactions (e.g., land count $\times$ curve shape) carry real signal.
Embeddings capture more — The card embedding model outperforms all feature-engineered models, suggesting card-level structure (synergies, archetypes) that aggregate statistics miss.
## Interactive Dashboard
The Streamlit app includes four pages:
- Data Explorer — Dataset stats, card ratings table with filtering, win rate distributions
- Model Comparison — Side-by-side metrics, bar charts, radar plots, training curves
- Hand Evaluator — Pick 7 cards, see each model’s prediction with card art from Scryfall
- Insights — Feature importance heatmaps, partial dependence plots, correlation matrix
Run locally:
```bash
git clone https://github.com/donq-t/mtg-hand-evaluator.git
cd mtg-hand-evaluator
pip install -r requirements.txt
python src/train.py
streamlit run app/Home.py
```
## Stack
Python, pandas, scikit-learn, XGBoost, PyTorch, Streamlit, Plotly, 17lands API, Scryfall API