Mechanistic interpretability pipeline for an MLP + Sparse Autoencoder (SAE) model. Ranks SAE latent features by activation impact, then performs ablation and activation steering across all output groups to understand how individual features influence predictions.
Input (10)
└─ Linear(10 → 256) + BatchNorm + ReLU
└─ Linear(256 → 512) + BatchNorm + ReLU
└─ Linear(512 → 256) ← hook point: hidden2
└─ ReLU
└─ Linear(256 → 1) output
A Sparse Autoencoder is attached to the hidden2 activation (256-dim):
hidden2 (256) → Encoder → latents (2048, Top-K=128 sparsity)
└─ Decoder → reconstructed (256)
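The encoder/decoder shapes above can be illustrated with a small NumPy sketch of a Top-K sparse autoencoder forward pass. This is not the repository's `SparseAutoencoder` class: the weight layouts, the bias terms, and the ReLU applied to the surviving latents are assumptions made for illustration, and the weights are random.

```python
import numpy as np

def topk_encode(hidden, W_enc, b_enc, k):
    """Keep the k largest pre-activations per sample and zero the rest."""
    pre = hidden @ W_enc.T + b_enc                      # (N, n_latents)
    top = np.argpartition(pre, -k, axis=1)[:, -k:]      # indices of the k largest per row
    rows = np.arange(pre.shape[0])[:, None]
    latents = np.zeros_like(pre)
    # Assumption: survivors also pass through a ReLU, so latents are non-negative.
    latents[rows, top] = np.maximum(pre[rows, top], 0.0)
    return latents

rng = np.random.default_rng(0)
hidden = rng.normal(size=(3, 256))                      # stand-in for hidden2 activations
W_enc = rng.normal(size=(2048, 256)) * 0.05             # assumed encoder weight layout
b_enc = np.zeros(2048)
W_dec = rng.normal(size=(256, 2048)) * 0.05             # assumed decoder weight layout
b_dec = np.zeros(256)

latents = topk_encode(hidden, W_enc, b_enc, k=128)      # (3, 2048), at most 128 non-zero per row
reconstructed = latents @ W_dec.T + b_dec               # (3, 256)
```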
The decoder direction trick avoids a full matrix multiply per feature during throttle testing:
contrib_i = latents[:, i:i+1] * W_dec[:, i] # (N, 1) * (256,) → (N, 256)
ablated = decoded - contrib_i # zero out feature i
steered_α = decoded + (α - 1) * contrib_i # scale feature i by α
output = ReLU(modified) @ W_out.T + b_out # modified = ablated or steered_α
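To see why the shortcut is valid, the sketch below compares it with the slow path (rescale one latent, then redo the full decode) on random data. It uses NumPy rather than the repository's PyTorch code, and `W_dec`, `b_dec`, `W_out`, `b_out` are random stand-ins with assumed shapes, so treat it as a check of the algebra rather than of the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_HID, D_LAT = 4, 256, 2048                 # toy batch; widths taken from the text

latents = np.maximum(rng.normal(size=(N, D_LAT)), 0.0)   # stand-in SAE latents
W_dec = rng.normal(size=(D_HID, D_LAT)) * 0.02           # assumed decoder layout (256, 2048)
b_dec = np.zeros(D_HID)                                  # assumed decoder bias
W_out = rng.normal(size=(1, D_HID))                      # final Linear(256 -> 1)
b_out = np.zeros(1)

def head(hidden):
    """ReLU followed by the output layer."""
    return np.maximum(hidden, 0.0) @ W_out.T + b_out

decoded = latents @ W_dec.T + b_dec            # one full decode, shared by every feature

def throttle(i, alpha):
    """Scale feature i by alpha using only its decoder column."""
    contrib_i = latents[:, i:i + 1] * W_dec[:, i]        # (N, 1) * (256,) -> (N, 256)
    return head(decoded + (alpha - 1.0) * contrib_i)

def throttle_slow(i, alpha):
    """Reference path: rescale the latent, then repeat the full matrix multiply."""
    scaled = latents.copy()
    scaled[:, i] *= alpha
    return head(scaled @ W_dec.T + b_dec)
```

Setting `alpha=0` reproduces ablation and `alpha=1` leaves the prediction unchanged, so a single function covers both interventions.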
Integer-valued regression: 10-integer input lists → output value in {-10…-1, +1…+10}.
The dataset is pre-split across 20 output groups (no zero class).
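A minimal sketch of how samples could be bucketed into the 20 groups is shown below. The function name `group_by_output` is hypothetical, and dropping values that round to zero or fall outside ±10 is an assumption; the repository may handle such samples differently.

```python
from collections import defaultdict

def group_by_output(outputs):
    """Bucket sample indices by rounded output value, skipping the absent zero class."""
    groups = defaultdict(list)
    for idx, value in enumerate(outputs):
        key = round(value)
        if key == 0 or abs(key) > 10:
            continue  # assumption: samples outside the 20 groups are dropped
        groups[key].append(idx)
    return dict(sorted(groups.items()))

# Toy values, not taken from the dataset.
example = group_by_output([-9.8, 0.2, 1.03, 6.6, -9.6])
```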
| File | Purpose |
|---|---|
| dataset/mlp_train.xlsx | MLP training set |
| dataset/mlp_val.xlsx | MLP validation set |
| dataset/mlp_test.xlsx | Primary test set used by this pipeline |
| dataset/extrap_test.xlsx | Out-of-distribution: extrapolation |
| dataset/interp_test.xlsx | Out-of-distribution: interpolation |
| dataset/precision_test.xlsx | Precision stress test |
| dataset/scaling_test.xlsx | Scaling stress test |
Group sample counts from a typical run (1000 total samples):
Group -10: n=16 | Group -9: n=31 | Group -8: n=54 | Group -7: n=67
Group -6: n=82 | Group -5: n=45 | Group -4: n=36 | Group -3: n=49
Group -2: n=64 | Group -1: n=56 | Group +1: n=59 | Group +2: n=61
Group +3: n=52 | Group +4: n=48 | Group +5: n=30 | Group +6: n=97
Group +7: n=66 | Group +8: n=45 | Group +9: n=24 | Group +10: n=18
Conda environment (act-abl):
Set up a Conda environment with Python 3.11.6:
git clone https://github.com/Palani-SN/Feature-Ranking-et-Throttle-Testing.git
cd Feature-Ranking-et-Throttle-Testing
conda create -n act-abl python=3.11.6
conda activate act-abl
python -m pip install torch==2.10.0 torchvision==0.25.0 --extra-index-url https://download.pytorch.org/whl/cu126
python -m pip install -r reqs.txt
Key packages from reqs.txt: numpy, pandas, openpyxl, matplotlib, plotly, scikit-learn, tqdm (and their dependencies).
Torch and torchvision are installed separately as shown above to pull the CUDA 12.6 build.
CUDA 12.6 is required for GPU acceleration; without it the pipeline falls back to CPU automatically.
Pre-trained weights must be present:
- mlp/perfect_mlp.pth
- sae/universal_sae.pth

conda activate act-abl
cd Feature-Ranking-et-Throttle-Testing
workflow.bat
The batch file activates the environment, runs all pipeline scripts in sequence, and prints a timing summary.
Typical runtime: ~14 minutes (CUDA, 1000 samples, 2048 features).
[1/4] conda activate act-abl
[2/4] python feature_ranking.py → results/feature_ranks.pt
[3/4] python throttle_testing.py → results/throttle_results.pt
[4/4] python generate_reports.py → results/reports.html
python network_graph.py → results/index.html
Once complete, open results/index.html in any browser.
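For platforms without a batch interpreter, the same sequence could be driven from Python. The sketch below is a hypothetical stand-in for workflow.bat, not a copy of it: it assumes the environment is already activated and that the four scripts take no arguments. The `run` parameter exists only so the loop can be exercised without launching the real scripts.

```python
import subprocess
import sys
import time

# Script order taken from the step listing above.
STEPS = [
    "feature_ranking.py",
    "throttle_testing.py",
    "generate_reports.py",
    "network_graph.py",
]

def run_pipeline(steps=STEPS, run=subprocess.run):
    """Run each script in order, stop on the first failure, print a timing summary."""
    timings = []
    for script in steps:
        start = time.perf_counter()
        run([sys.executable, script], check=True)   # check=True raises on a non-zero exit
        timings.append((script, time.perf_counter() - start))
    for script, seconds in timings:
        print(f"{script:<24} {seconds:8.1f} s")
    return timings

# Usage from the repository root: run_pipeline()
```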
feature_ranking.py
Groups the test dataset by rounded output value and computes per-group MLP predictions through the SAE. Ranks all 2048 SAE features by mean absolute activation, least impactful first.
Output — results/feature_ranks.pt:
baseline_stats : {group → {mu, sigma, predictions, n}}
feature_ranks : {group → [2048 feature indices, ascending impact]}
global_rank : [2048 feature indices, ascending impact, averaged across groups]
groups : [-10, -9, ..., -1, +1, ..., +10]
group_mean_acts : {group → [2048 mean absolute activations]}
dataset_path : path used
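The ranking step can be sketched in plain Python as follows. The helper names are hypothetical, and the global rank here averages each feature's mean absolute activation over groups before sorting; the script might instead average per-group rank positions, which the description above does not settle.

```python
def rank_ascending(mean_abs_acts):
    """Feature indices sorted from least to most impactful."""
    return sorted(range(len(mean_abs_acts)), key=lambda i: mean_abs_acts[i])

def global_rank(group_mean_acts):
    """Average each feature's mean |activation| over groups, then rank ascending."""
    groups = list(group_mean_acts)
    n_features = len(group_mean_acts[groups[0]])
    averaged = [
        sum(group_mean_acts[g][i] for g in groups) / len(groups)
        for i in range(n_features)
    ]
    return rank_ascending(averaged)

# Toy data: 2 groups, 3 features (the real pipeline has 20 groups, 2048 features).
toy_acts = {-1: [0.5, 0.0, 2.0], 1: [0.1, 0.0, 3.0]}
toy_feature_ranks = {g: rank_ascending(acts) for g, acts in toy_acts.items()}
toy_global_rank = global_rank(toy_acts)
```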
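throttle_testing.py
Iterates over all 2048 features in global rank order. For each feature and each output group, applies two interventions using the decoder-direction trick:
- Ablation: subtract the feature's decoder contribution, which zeroes the feature out.
- Steering: scale the feature's decoder contribution by each multiplier α.

Progress is printed every 256 features.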
Output — results/throttle_results.pt:
results : {feature_idx → {group_val → {ablation: [...], steering: {α: [...]}}}}
global_rank : [2048 feature indices]
groups : list of group values
multipliers : [-4, -2, -1, 0, 1, 2, 4]
baseline_stats: {group → {mu, sigma, predictions, n}}
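The nested layout of `results` may be easier to follow as a loop. In the sketch below, `predict(feature, group, alpha)` is a hypothetical callable that returns the list of predictions for one group with the feature scaled by α; the real script's internals are not shown in this document and may differ.

```python
MULTIPLIERS = [-4, -2, -1, 0, 1, 2, 4]   # values from the output description above

def run_throttle(global_rank, groups, predict, report_every=256):
    """Build {feature -> {group -> {"ablation": [...], "steering": {alpha: [...]}}}}."""
    results = {}
    for done, feature in enumerate(global_rank, start=1):
        results[feature] = {
            group: {
                "ablation": predict(feature, group, 0),
                "steering": {a: predict(feature, group, a) for a in MULTIPLIERS},
            }
            for group in groups
        }
        if done % report_every == 0:
            print(f"{done}/{len(global_rank)} features done")
    return results

# Toy stand-in for the model: one prediction per group, linear in alpha.
toy_predict = lambda feature, group, alpha: [group + 0.1 * alpha * feature]
toy_results = run_throttle([2, 0, 1], [-1, 1], toy_predict, report_every=2)
```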
generate_reports.py
Reads both .pt result files and produces results/reports.html, a self-contained single-file viewer. All data is embedded as a JavaScript constant (const REPORTS_DATA = {...}); no server or fetch is required, so the file opens directly from disk. At 2048 features the file is ~11.6 MB.
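Embedding data as a JavaScript constant can be sketched as below. Only the constant name `REPORTS_DATA` comes from the text; the function name and the surrounding HTML skeleton are assumptions, and the real viewer contains far more markup and script.

```python
import json

def build_report_html(data):
    """Return a single-file HTML page with `data` embedded as a JS constant."""
    payload = json.dumps(data, separators=(",", ":"))
    # "</" inside a string could end the <script> element early; "<\/" is an
    # equivalent escape in both JSON and JavaScript string literals.
    payload = payload.replace("</", "<\\/")
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>reports</title></head>\n'
        "<body>\n<script>\n"
        f"const REPORTS_DATA = {payload};\n"
        "</script>\n</body></html>\n"
    )

sample = {"groups": [-1, 1], "note": "</script>"}
html = build_report_html(sample)
```

Because the payload is plain JSON, the page needs no network request to load its data, which is what lets it open directly from disk.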
The viewer displays ridgeline plots of prediction distributions and intervention deltas. URL parameters:
- reports.html?rank=N: open the feature at local rank N (1 = most impactful stored)
- reports.html?feature=N: open the feature by SAE feature index

network_graph.py
Generates results/index.html, an interactive D3.js bipartite force-directed graph.
Config (top of file):
| Variable | Default | Meaning |
|---|---|---|
| TOP_K | 254 | Top features per group included in graph |
| MIN_GROUPS | 2 | Minimum groups a feature must appear in to be shown |
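How the two settings interact can be sketched as a filter over the per-group rankings. The function `build_edges` is hypothetical, and normalising each edge weight by the group's largest mean activation is an assumption about what "normalised" means in the script.

```python
from collections import Counter

TOP_K = 254        # defaults from the config table
MIN_GROUPS = 2

def build_edges(feature_ranks, group_mean_acts, top_k=TOP_K, min_groups=MIN_GROUPS):
    """Return (feature, group, weight) edges for the bipartite graph."""
    # Rankings are ascending, so the most impactful features sit at the end.
    top = {g: ranks[-top_k:] for g, ranks in feature_ranks.items()}
    seen = Counter(f for feats in top.values() for f in feats)
    edges = []
    for g, feats in top.items():
        peak = max(group_mean_acts[g]) or 1.0      # avoid dividing by zero
        for f in feats:
            if seen[f] >= min_groups:              # drop features seen in too few groups
                edges.append((f, g, group_mean_acts[g][f] / peak))
    return edges

# Toy data: 2 groups, 3 features.
toy_acts = {-1: [0.2, 1.0, 0.0], 1: [0.0, 4.0, 1.0]}
toy_ranks = {-1: [2, 0, 1], 1: [0, 2, 1]}
toy_edges = build_edges(toy_ranks, toy_acts, top_k=2, min_groups=2)
```

In the toy data only feature 1 is in the top two of both groups, so it is the only feature node that survives the `MIN_GROUPS` filter.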
Node encoding:
Edge encoding: colour inherits from the connected group; thickness ∝ normalised mean activation.
Interactions:
reports.html?feature=N in a new tab.

A typical run produces ~288 visible feature nodes and ~5073 edges across 20 groups.
verify_feature.py: Manual Verification Tool
Standalone script for inspecting a single feature without running the full pipeline. Recomputes ablation and steering from scratch and prints ASCII bar-chart output.
python verify_feature.py <feature_idx>
# Example — most impactful feature in a sample run:
python verify_feature.py 1347
Ablation section — one row per group, shift bar relative to maximum shift seen:
Group -10 (n= 16) : -9.799 (baseline) [ |███> ] -9.601 (ablated) (+0.198)
Group +7 (n= 66) : +7.039 (baseline) [<████████| ] +6.574 (ablated) (-0.465)
> = prediction rose when the feature was removed; < = prediction fell; █ blocks = relative magnitude.
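The shift bar can be approximated with a few lines of Python. The half-width of 8 blocks and the rounding rule are inferred from the sample rows above rather than taken from the script, so the exact spacing of the real output may differ.

```python
HALF = 8  # assumed number of blocks on each side of the centre mark

def shift_bar(shift, max_shift):
    """Render one ablation shift relative to the largest shift seen."""
    n = round(abs(shift) / max_shift * HALF) if max_shift > 0 else 0
    if shift < 0:
        left = ("<" + "█" * n).rjust(HALF + 1)    # arrow points left: prediction fell
        right = " " * (HALF + 1)
    else:
        left = " " * (HALF + 1)
        right = ("█" * n + ">").ljust(HALF + 1)   # arrow points right: prediction rose
    return f"[{left}|{right}]"

print(shift_bar(0.198, 0.465))    # three blocks to the right of the centre mark
print(shift_bar(-0.465, 0.465))   # full-width bar to the left
```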
Activation steering section: per group, ● tracks the mean prediction on a shared number line; | marks 0:
Group -10 (n= 16) | baseline mu = -9.799 sigma = 0.1399
---------------------------------------------------------------------------------
Original (α= 1) : -9.799 [ ● | ]
Ablation (α= 0) : -9.601 [ ● | ] (Shift: +0.198 >)
Steer (α=-4) : -6.330 [ ● | ] (Shift: +3.468 >)
Steer (α=+4) : -10.039 [ ● | ] (Shift: -0.240 <)
Features with zero shift across all groups are inactive (sparse SAE did not fire them for any sample in those groups).
Feature #1347 — Global / Universal (global rank 2048 / 2048, most impactful)
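Ablation shifts every one of the 20 groups, and the sign of the shift flips at the sign boundary for 16 of them: most negative-output groups move up toward zero and most positive-output groups move down toward zero. The exceptions are groups −4, −3, +1, and +5, which move slightly away from zero. When the feature is active it therefore tends to amplify prediction magnitude away from zero across the task. Steering shows a much larger effect: at α=−4, group −10 jumps from −9.8 to −6.3 (a +3.5 shift), and every group responds. The response is not monotonic in α, however; in group −8, for example, both α=−4 (+2.043) and α=+4 (+0.566) push the prediction upward.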
(act-abl) C:\Workspace\Git_Repos\MLP-Experiments\Feature-Ranking-et-Throttle-Testing>python verify_feature.py 1347
[verify_feature] Feature #1347 | Device: cuda
Global rank : 2048 / 2048 (1 = least impactful, 2048 = most)
Groups : [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
=====================================================================================
ABLATION — Feature #1347 (feature activation zeroed out)
=====================================================================================
Group -10 (n= 16) : -9.799 (baseline) [ |███> ] -9.601 (ablated) (+0.198)
Group -9 (n= 31) : -8.849 (baseline) [ |███> ] -8.690 (ablated) (+0.159)
Group -8 (n= 54) : -7.940 (baseline) [ |█> ] -7.899 (ablated) (+0.041)
Group -7 (n= 67) : -6.935 (baseline) [ |███> ] -6.743 (ablated) (+0.192)
Group -6 (n= 82) : -5.968 (baseline) [ |██> ] -5.859 (ablated) (+0.110)
Group -5 (n= 45) : -5.030 (baseline) [ |████> ] -4.819 (ablated) (+0.211)
Group -4 (n= 36) : -3.972 (baseline) [ <█| ] -4.022 (ablated) (-0.051)
Group -3 (n= 49) : -3.019 (baseline) [ <████| ] -3.235 (ablated) (-0.216)
Group -2 (n= 64) : -1.963 (baseline) [ |> ] -1.953 (ablated) (+0.010)
Group -1 (n= 56) : -0.879 (baseline) [ |█> ] -0.805 (ablated) (+0.074)
Group +1 (n= 59) : +1.033 (baseline) [ |> ] +1.055 (ablated) (+0.023)
Group +2 (n= 61) : +2.020 (baseline) [ <███| ] +1.862 (ablated) (-0.158)
Group +3 (n= 52) : +3.091 (baseline) [ <█████| ] +2.792 (ablated) (-0.299)
Group +4 (n= 48) : +4.097 (baseline) [ <██████| ] +3.742 (ablated) (-0.355)
Group +5 (n= 30) : +5.005 (baseline) [ |> ] +5.025 (ablated) (+0.021)
Group +6 (n= 97) : +6.065 (baseline) [<████████| ] +5.603 (ablated) (-0.462)
Group +7 (n= 66) : +7.039 (baseline) [<████████| ] +6.574 (ablated) (-0.465)
Group +8 (n= 45) : +8.038 (baseline) [ <█████| ] +7.770 (ablated) (-0.268)
Group +9 (n= 24) : +8.965 (baseline) [ <██| ] +8.838 (ablated) (-0.127)
Group +10 (n= 18) : +9.814 (baseline) [ <██████| ] +9.477 (ablated) (-0.336)
=====================================================================================
=====================================================================================
ACTIVATION STEERING — Feature #1347 (α scales the feature activation)
=====================================================================================
Group -10 (n= 16) | baseline mu = -9.799 sigma = 0.1399
---------------------------------------------------------------------------------
Original (α= 1) : -9.799 [ ● | ]
Ablation (α= 0) : -9.601 [ ● | ] (Shift: +0.198 >)
Steer (α=-4) : -6.330 [ ● | ] (Shift: +3.468 >)
Steer (α=-2) : -8.248 [ ● | ] (Shift: +1.551 >)
Steer (α=-1) : -9.099 [ ● | ] (Shift: +0.700 >)
Steer (α=+2) : -10.006 [ ● | ] (Shift: -0.207 <)
Steer (α=+4) : -10.039 [ ● | ] (Shift: -0.240 <)
Group -9 (n= 31) | baseline mu = -8.849 sigma = 0.1538
---------------------------------------------------------------------------------
Original (α= 1) : -8.849 [ ● | ]
Ablation (α= 0) : -8.690 [ ● | ] (Shift: +0.159 >)
Steer (α=-4) : -5.058 [ ● | ] (Shift: +3.791 >)
Steer (α=-2) : -7.194 [ ● | ] (Shift: +1.655 >)
Steer (α=-1) : -8.070 [ ● | ] (Shift: +0.779 >)
Steer (α=+2) : -8.958 [ ● | ] (Shift: -0.109 <)
Steer (α=+4) : -8.845 [ ● | ] (Shift: +0.004 >)
Group -8 (n= 54) | baseline mu = -7.940 sigma = 0.1630
---------------------------------------------------------------------------------
Original (α= 1) : -7.940 [ ● | ]
Ablation (α= 0) : -7.899 [ ● | ] (Shift: +0.041 >)
Steer (α=-4) : -5.897 [ ● | ] (Shift: +2.043 >)
Steer (α=-2) : -7.101 [ ● | ] (Shift: +0.839 >)
Steer (α=-1) : -7.616 [ ● | ] (Shift: +0.324 >)
Steer (α=+2) : -7.894 [ ● | ] (Shift: +0.046 >)
Steer (α=+4) : -7.374 [ ● | ] (Shift: +0.566 >)
Group -7 (n= 67) | baseline mu = -6.935 sigma = 0.1585
---------------------------------------------------------------------------------
Original (α= 1) : -6.935 [ ● | ]
Ablation (α= 0) : -6.743 [ ● | ] (Shift: +0.192 >)
Steer (α=-4) : -4.490 [ ● | ] (Shift: +2.446 >)
Steer (α=-2) : -5.660 [ ● | ] (Shift: +1.275 >)
Steer (α=-1) : -6.256 [ ● | ] (Shift: +0.679 >)
Steer (α=+2) : -6.960 [ ● | ] (Shift: -0.025 <)
Steer (α=+4) : -6.496 [ ● | ] (Shift: +0.440 >)
Group -6 (n= 82) | baseline mu = -5.968 sigma = 0.2195
---------------------------------------------------------------------------------
Original (α= 1) : -5.968 [ ● | ]
Ablation (α= 0) : -5.859 [ ● | ] (Shift: +0.110 >)
Steer (α=-4) : -3.993 [ ● | ] (Shift: +1.975 >)
Steer (α=-2) : -5.106 [ ● | ] (Shift: +0.863 >)
Steer (α=-1) : -5.532 [ ● | ] (Shift: +0.437 >)
Steer (α=+2) : -5.840 [ ● | ] (Shift: +0.128 >)
Steer (α=+4) : -5.160 [ ● | ] (Shift: +0.809 >)
Group -5 (n= 45) | baseline mu = -5.030 sigma = 0.1520
---------------------------------------------------------------------------------
Original (α= 1) : -5.030 [ ● | ]
Ablation (α= 0) : -4.819 [ ● | ] (Shift: +0.211 >)
Steer (α=-4) : -2.409 [ ● | ] (Shift: +2.621 >)
Steer (α=-2) : -3.763 [ ● | ] (Shift: +1.267 >)
Steer (α=-1) : -4.342 [ ● | ] (Shift: +0.688 >)
Steer (α=+2) : -4.945 [ ● | ] (Shift: +0.085 >)
Steer (α=+4) : -4.292 [ ● | ] (Shift: +0.738 >)
Group -4 (n= 36) | baseline mu = -3.972 sigma = 0.1603
---------------------------------------------------------------------------------
Original (α= 1) : -3.972 [ ● | ]
Ablation (α= 0) : -4.022 [ ● | ] (Shift: -0.051 <)
Steer (α=-4) : -3.061 [ ● | ] (Shift: +0.910 >)
Steer (α=-2) : -3.603 [ ● | ] (Shift: +0.368 >)
Steer (α=-1) : -3.865 [ ● | ] (Shift: +0.107 >)
Steer (α=+2) : -3.779 [ ● | ] (Shift: +0.193 >)
Steer (α=+4) : -3.087 [ ● | ] (Shift: +0.885 >)
Group -3 (n= 49) | baseline mu = -3.019 sigma = 0.1516
---------------------------------------------------------------------------------
Original (α= 1) : -3.019 [ ● | ]
Ablation (α= 0) : -3.235 [ ● | ] (Shift: -0.216 <)
Steer (α=-4) : -2.798 [ ● | ] (Shift: +0.221 >)
Steer (α=-2) : -3.122 [ ● | ] (Shift: -0.103 <)
Steer (α=-1) : -3.225 [ ● | ] (Shift: -0.206 <)
Steer (α=+2) : -2.640 [ ● | ] (Shift: +0.379 >)
Steer (α=+4) : -1.594 [ ● | ] (Shift: +1.425 >)
Group -2 (n= 64) | baseline mu = -1.963 sigma = 0.1718
---------------------------------------------------------------------------------
Original (α= 1) : -1.963 [ ● | ]
Ablation (α= 0) : -1.953 [ ● | ] (Shift: +0.010 >)
Steer (α=-4) : -0.305 [ ● ] (Shift: +1.659 >)
Steer (α=-2) : -1.285 [ ●| ] (Shift: +0.679 >)
Steer (α=-1) : -1.701 [ ● | ] (Shift: +0.263 >)
Steer (α=+2) : -1.768 [ ● | ] (Shift: +0.195 >)
Steer (α=+4) : -1.007 [ ●| ] (Shift: +0.956 >)
Group -1 (n= 56) | baseline mu = -0.879 sigma = 0.1642
---------------------------------------------------------------------------------
Original (α= 1) : -0.879 [ ●| ]
Ablation (α= 0) : -0.805 [ ●| ] (Shift: +0.074 >)
Steer (α=-4) : +0.501 [ |● ] (Shift: +1.380 >)
Steer (α=-2) : -0.369 [ ● ] (Shift: +0.510 >)
Steer (α=-1) : -0.642 [ ● ] (Shift: +0.237 >)
Steer (α=+2) : -0.822 [ ●| ] (Shift: +0.057 >)
Steer (α=+4) : -0.093 [ ● ] (Shift: +0.786 >)
Group +1 (n= 59) | baseline mu = +1.033 sigma = 0.1740
---------------------------------------------------------------------------------
Original (α= 1) : +1.033 [ | ● ]
Ablation (α= 0) : +1.055 [ | ● ] (Shift: +0.023 >)
Steer (α=-4) : +2.099 [ | ● ] (Shift: +1.066 >)
Steer (α=-2) : +1.446 [ | ● ] (Shift: +0.413 >)
Steer (α=-1) : +1.215 [ | ● ] (Shift: +0.182 >)
Steer (α=+2) : +1.184 [ | ● ] (Shift: +0.151 >)
Steer (α=+4) : +2.032 [ | ● ] (Shift: +0.999 >)
Group +2 (n= 61) | baseline mu = +2.020 sigma = 0.2088
---------------------------------------------------------------------------------
Original (α= 1) : +2.020 [ | ● ]
Ablation (α= 0) : +1.862 [ | ● ] (Shift: -0.158 <)
Steer (α=-4) : +2.503 [ | ● ] (Shift: +0.483 >)
Steer (α=-2) : +1.883 [ | ● ] (Shift: -0.137 <)
Steer (α=-1) : +1.821 [ | ● ] (Shift: -0.199 <)
Steer (α=+2) : +2.341 [ | ● ] (Shift: +0.321 >)
Steer (α=+4) : +3.302 [ | ● ] (Shift: +1.281 >)
Group +3 (n= 52) | baseline mu = +3.091 sigma = 0.1657
---------------------------------------------------------------------------------
Original (α= 1) : +3.091 [ | ● ]
Ablation (α= 0) : +2.792 [ | ● ] (Shift: -0.299 <)
Steer (α=-4) : +2.424 [ | ● ] (Shift: -0.667 <)
Steer (α=-2) : +2.349 [ | ● ] (Shift: -0.742 <)
Steer (α=-1) : +2.529 [ | ● ] (Shift: -0.562 <)
Steer (α=+2) : +3.477 [ | ● ] (Shift: +0.386 >)
Steer (α=+4) : +4.413 [ | ● ] (Shift: +1.322 >)
Group +4 (n= 48) | baseline mu = +4.097 sigma = 0.1863
---------------------------------------------------------------------------------
Original (α= 1) : +4.097 [ | ● ]
Ablation (α= 0) : +3.742 [ | ● ] (Shift: -0.355 <)
Steer (α=-4) : +2.876 [ | ● ] (Shift: -1.221 <)
Steer (α=-2) : +3.028 [ | ● ] (Shift: -1.068 <)
Steer (α=-1) : +3.398 [ | ● ] (Shift: -0.699 <)
Steer (α=+2) : +4.468 [ | ● ] (Shift: +0.371 >)
Steer (α=+4) : +5.359 [ | ● ] (Shift: +1.262 >)
Group +5 (n= 30) | baseline mu = +5.005 sigma = 0.2205
---------------------------------------------------------------------------------
Original (α= 1) : +5.005 [ | ● ]
Ablation (α= 0) : +5.025 [ | ● ] (Shift: +0.021 >)
Steer (α=-4) : +5.814 [ | ● ] (Shift: +0.810 >)
Steer (α=-2) : +5.184 [ | ● ] (Shift: +0.179 >)
Steer (α=-1) : +5.101 [ | ● ] (Shift: +0.096 >)
Steer (α=+2) : +5.074 [ | ● ] (Shift: +0.069 >)
Steer (α=+4) : +5.523 [ | ● ] (Shift: +0.519 >)
Group +6 (n= 97) | baseline mu = +6.065 sigma = 0.1866
---------------------------------------------------------------------------------
Original (α= 1) : +6.065 [ | ● ]
Ablation (α= 0) : +5.603 [ | ● ] (Shift: -0.462 <)
Steer (α=-4) : +4.931 [ | ● ] (Shift: -1.134 <)
Steer (α=-2) : +4.792 [ | ● ] (Shift: -1.273 <)
Steer (α=-1) : +5.121 [ | ● ] (Shift: -0.945 <)
Steer (α=+2) : +6.416 [ | ● ] (Shift: +0.351 >)
Steer (α=+4) : +7.336 [ | ● ] (Shift: +1.271 >)
Group +7 (n= 66) | baseline mu = +7.039 sigma = 0.1674
---------------------------------------------------------------------------------
Original (α= 1) : +7.039 [ | ● ]
Ablation (α= 0) : +6.574 [ | ● ] (Shift: -0.465 <)
Steer (α=-4) : +5.862 [ | ● ] (Shift: -1.177 <)
Steer (α=-2) : +5.576 [ | ● ] (Shift: -1.463 <)
Steer (α=-1) : +6.036 [ | ● ] (Shift: -1.003 <)
Steer (α=+2) : +7.517 [ | ● ] (Shift: +0.478 >)
Steer (α=+4) : +8.697 [ | ● ] (Shift: +1.657 >)
Group +8 (n= 45) | baseline mu = +8.038 sigma = 0.1778
---------------------------------------------------------------------------------
Original (α= 1) : +8.038 [ | ● ]
Ablation (α= 0) : +7.770 [ | ● ] (Shift: -0.268 <)
Steer (α=-4) : +7.806 [ | ● ] (Shift: -0.232 <)
Steer (α=-2) : +7.260 [ | ● ] (Shift: -0.778 <)
Steer (α=-1) : +7.453 [ | ● ] (Shift: -0.585 <)
Steer (α=+2) : +8.297 [ | ● ] (Shift: +0.259 >)
Steer (α=+4) : +8.927 [ | ● ] (Shift: +0.889 >)
Group +9 (n= 24) | baseline mu = +8.965 sigma = 0.1090
---------------------------------------------------------------------------------
Original (α= 1) : +8.965 [ | ● ]
Ablation (α= 0) : +8.838 [ | ● ] (Shift: -0.127 <)
Steer (α=-4) : +8.501 [ | ● ] (Shift: -0.464 <)
Steer (α=-2) : +8.402 [ | ● ] (Shift: -0.563 <)
Steer (α=-1) : +8.669 [ | ● ] (Shift: -0.296 <)
Steer (α=+2) : +9.123 [ | ● ] (Shift: +0.158 >)
Steer (α=+4) : +9.694 [ | ● ] (Shift: +0.729 >)
Group +10 (n= 18) | baseline mu = +9.814 sigma = 0.1806
---------------------------------------------------------------------------------
Original (α= 1) : +9.814 [ | ● ]
Ablation (α= 0) : +9.477 [ | ● ] (Shift: -0.336 <)
Steer (α=-4) : +8.466 [ | ● ] (Shift: -1.348 <)
Steer (α=-2) : +8.501 [ | ● ] (Shift: -1.313 <)
Steer (α=-1) : +8.993 [ | ● ] (Shift: -0.820 <)
Steer (α=+2) : +10.160 [ | ● ] (Shift: +0.346 >)
Steer (α=+4) : +11.023 [ | ● ] (Shift: +1.209 >)
=====================================================================================
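The direction of the ablation shifts can be checked directly against the numbers printed above. The sketch below copies the baseline and ablated means from the ablation table and lists the groups whose magnitude grows after ablation; the helper name `away_from_zero` is made up for illustration.

```python
# (baseline mean, ablated mean) per group, copied from the ablation table for feature #1347.
ABLATION_1347 = {
    -10: (-9.799, -9.601), -9: (-8.849, -8.690), -8: (-7.940, -7.899),
    -7: (-6.935, -6.743), -6: (-5.968, -5.859), -5: (-5.030, -4.819),
    -4: (-3.972, -4.022), -3: (-3.019, -3.235), -2: (-1.963, -1.953),
    -1: (-0.879, -0.805), 1: (1.033, 1.055), 2: (2.020, 1.862),
    3: (3.091, 2.792), 4: (4.097, 3.742), 5: (5.005, 5.025),
    6: (6.065, 5.603), 7: (7.039, 6.574), 8: (8.038, 7.770),
    9: (8.965, 8.838), 10: (9.814, 9.477),
}

def away_from_zero(table):
    """Groups whose ablated mean has a larger magnitude than the baseline mean."""
    return [g for g, (base, ablated) in table.items() if abs(ablated) > abs(base)]

exceptions = away_from_zero(ABLATION_1347)
```

On this run the list contains groups −4, −3, +1 and +5; the other 16 groups move toward zero when the feature is removed.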
Feature #576 — Hyperlocal (global rank 1753 / 2048)
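Fires almost exclusively for samples whose output falls in the −9 to −6 band, with faint traces in groups −5 and −4. Ablation shifts no group by more than 0.004, and the displayed shift is +0.000 for 14 of 20 groups. In the steering section every α returns the identical prediction for 13 of those groups (−10, −2, −1, and all positive groups); group −3 moves by at most 0.001. Scaling something that is already zero does nothing.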
(act-abl) C:\Workspace\Git_Repos\MLP-Experiments\Feature-Ranking-et-Throttle-Testing>python verify_feature.py 576
[verify_feature] Feature #576 | Device: cuda
Global rank : 1753 / 2048 (1 = least impactful, 2048 = most)
Groups : [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
=====================================================================================
ABLATION — Feature #576 (feature activation zeroed out)
=====================================================================================
Group -10 (n= 16) : -9.799 (baseline) [ |> ] -9.799 (ablated) (+0.000)
Group -9 (n= 31) : -8.849 (baseline) [ |████> ] -8.847 (ablated) (+0.002)
Group -8 (n= 54) : -7.940 (baseline) [ |█> ] -7.939 (ablated) (+0.001)
Group -7 (n= 67) : -6.935 (baseline) [ |█████> ] -6.933 (ablated) (+0.003)
Group -6 (n= 82) : -5.968 (baseline) [ |████████>] -5.964 (ablated) (+0.004)
Group -5 (n= 45) : -5.030 (baseline) [ |█> ] -5.029 (ablated) (+0.001)
Group -4 (n= 36) : -3.972 (baseline) [ |█> ] -3.971 (ablated) (+0.001)
Group -3 (n= 49) : -3.019 (baseline) [ |> ] -3.019 (ablated) (+0.000)
Group -2 (n= 64) : -1.963 (baseline) [ |> ] -1.963 (ablated) (+0.000)
Group -1 (n= 56) : -0.879 (baseline) [ |> ] -0.879 (ablated) (+0.000)
Group +1 (n= 59) : +1.033 (baseline) [ |> ] +1.033 (ablated) (+0.000)
Group +2 (n= 61) : +2.020 (baseline) [ |> ] +2.020 (ablated) (+0.000)
Group +3 (n= 52) : +3.091 (baseline) [ |> ] +3.091 (ablated) (+0.000)
Group +4 (n= 48) : +4.097 (baseline) [ |> ] +4.097 (ablated) (+0.000)
Group +5 (n= 30) : +5.005 (baseline) [ |> ] +5.005 (ablated) (+0.000)
Group +6 (n= 97) : +6.065 (baseline) [ |> ] +6.065 (ablated) (+0.000)
Group +7 (n= 66) : +7.039 (baseline) [ |> ] +7.039 (ablated) (+0.000)
Group +8 (n= 45) : +8.038 (baseline) [ |> ] +8.038 (ablated) (+0.000)
Group +9 (n= 24) : +8.965 (baseline) [ |> ] +8.965 (ablated) (+0.000)
Group +10 (n= 18) : +9.814 (baseline) [ |> ] +9.814 (ablated) (+0.000)
=====================================================================================
=====================================================================================
ACTIVATION STEERING — Feature #576 (α scales the feature activation)
=====================================================================================
Group -10 (n= 16) | baseline mu = -9.799 sigma = 0.1399
---------------------------------------------------------------------------------
Original (α= 1) : -9.799 [● | ]
Ablation (α= 0) : -9.799 [● | ] (Shift: +0.000 >)
Steer (α=-4) : -9.799 [● | ] (Shift: +0.000 >)
Steer (α=-2) : -9.799 [● | ] (Shift: +0.000 >)
Steer (α=-1) : -9.799 [● | ] (Shift: +0.000 >)
Steer (α=+2) : -9.799 [● | ] (Shift: +0.000 >)
Steer (α=+4) : -9.799 [● | ] (Shift: +0.000 >)
Group -9 (n= 31) | baseline mu = -8.849 sigma = 0.1538
---------------------------------------------------------------------------------
Original (α= 1) : -8.849 [ ● | ]
Ablation (α= 0) : -8.847 [ ● | ] (Shift: +0.002 >)
Steer (α=-4) : -8.838 [ ● | ] (Shift: +0.011 >)
Steer (α=-2) : -8.842 [ ● | ] (Shift: +0.006 >)
Steer (α=-1) : -8.844 [ ● | ] (Shift: +0.004 >)
Steer (α=+2) : -8.851 [ ● | ] (Shift: -0.002 <)
Steer (α=+4) : -8.855 [ ● | ] (Shift: -0.006 <)
Group -8 (n= 54) | baseline mu = -7.940 sigma = 0.1630
---------------------------------------------------------------------------------
Original (α= 1) : -7.940 [ ● | ]
Ablation (α= 0) : -7.939 [ ● | ] (Shift: +0.001 >)
Steer (α=-4) : -7.936 [ ● | ] (Shift: +0.004 >)
Steer (α=-2) : -7.938 [ ● | ] (Shift: +0.002 >)
Steer (α=-1) : -7.939 [ ● | ] (Shift: +0.001 >)
Steer (α=+2) : -7.941 [ ● | ] (Shift: -0.001 <)
Steer (α=+4) : -7.942 [ ● | ] (Shift: -0.002 <)
Group -7 (n= 67) | baseline mu = -6.935 sigma = 0.1585
---------------------------------------------------------------------------------
Original (α= 1) : -6.935 [ ● | ]
Ablation (α= 0) : -6.933 [ ● | ] (Shift: +0.003 >)
Steer (α=-4) : -6.922 [ ● | ] (Shift: +0.013 >)
Steer (α=-2) : -6.927 [ ● | ] (Shift: +0.008 >)
Steer (α=-1) : -6.930 [ ● | ] (Shift: +0.005 >)
Steer (α=+2) : -6.938 [ ● | ] (Shift: -0.003 <)
Steer (α=+4) : -6.943 [ ● | ] (Shift: -0.008 <)
Group -6 (n= 82) | baseline mu = -5.968 sigma = 0.2195
---------------------------------------------------------------------------------
Original (α= 1) : -5.968 [ ● | ]
Ablation (α= 0) : -5.964 [ ● | ] (Shift: +0.004 >)
Steer (α=-4) : -5.947 [ ● | ] (Shift: +0.021 >)
Steer (α=-2) : -5.956 [ ● | ] (Shift: +0.013 >)
Steer (α=-1) : -5.960 [ ● | ] (Shift: +0.008 >)
Steer (α=+2) : -5.973 [ ● | ] (Shift: -0.004 <)
Steer (α=+4) : -5.981 [ ● | ] (Shift: -0.013 <)
Group -5 (n= 45) | baseline mu = -5.030 sigma = 0.1520
---------------------------------------------------------------------------------
Original (α= 1) : -5.030 [ ● | ]
Ablation (α= 0) : -5.029 [ ● | ] (Shift: +0.001 >)
Steer (α=-4) : -5.027 [ ● | ] (Shift: +0.003 >)
Steer (α=-2) : -5.028 [ ● | ] (Shift: +0.002 >)
Steer (α=-1) : -5.029 [ ● | ] (Shift: +0.001 >)
Steer (α=+2) : -5.031 [ ● | ] (Shift: -0.001 <)
Steer (α=+4) : -5.032 [ ● | ] (Shift: -0.002 <)
Group -4 (n= 36) | baseline mu = -3.972 sigma = 0.1603
---------------------------------------------------------------------------------
Original (α= 1) : -3.972 [ ● | ]
Ablation (α= 0) : -3.971 [ ● | ] (Shift: +0.001 >)
Steer (α=-4) : -3.968 [ ● | ] (Shift: +0.003 >)
Steer (α=-2) : -3.970 [ ● | ] (Shift: +0.002 >)
Steer (α=-1) : -3.970 [ ● | ] (Shift: +0.001 >)
Steer (α=+2) : -3.972 [ ● | ] (Shift: -0.001 <)
Steer (α=+4) : -3.974 [ ● | ] (Shift: -0.002 <)
Group -3 (n= 49) | baseline mu = -3.019 sigma = 0.1516
---------------------------------------------------------------------------------
Original (α= 1) : -3.019 [ ● | ]
Ablation (α= 0) : -3.019 [ ● | ] (Shift: +0.000 >)
Steer (α=-4) : -3.018 [ ● | ] (Shift: +0.001 >)
Steer (α=-2) : -3.018 [ ● | ] (Shift: +0.001 >)
Steer (α=-1) : -3.019 [ ● | ] (Shift: +0.000 >)
Steer (α=+2) : -3.019 [ ● | ] (Shift: -0.000 <)
Steer (α=+4) : -3.020 [ ● | ] (Shift: -0.001 <)
Group -2 (n= 64) | baseline mu = -1.963 sigma = 0.1718
---------------------------------------------------------------------------------
Original (α= 1) : -1.963 [ ● | ]
Ablation (α= 0) : -1.963 [ ● | ] (Shift: +0.000 >)
Steer (α=-4) : -1.963 [ ● | ] (Shift: +0.000 >)
Steer (α=-2) : -1.963 [ ● | ] (Shift: +0.000 >)
Steer (α=-1) : -1.963 [ ● | ] (Shift: +0.000 >)
Steer (α=+2) : -1.963 [ ● | ] (Shift: +0.000 >)
Steer (α=+4) : -1.963 [ ● | ] (Shift: +0.000 >)
Group -1 (n= 56) | baseline mu = -0.879 sigma = 0.1642
---------------------------------------------------------------------------------
Original (α= 1) : -0.879 [ ●| ]
Ablation (α= 0) : -0.879 [ ●| ] (Shift: +0.000 >)
Steer (α=-4) : -0.879 [ ●| ] (Shift: +0.000 >)
Steer (α=-2) : -0.879 [ ●| ] (Shift: +0.000 >)
Steer (α=-1) : -0.879 [ ●| ] (Shift: +0.000 >)
Steer (α=+2) : -0.879 [ ●| ] (Shift: +0.000 >)
Steer (α=+4) : -0.879 [ ●| ] (Shift: +0.000 >)
Group +1 (n= 59) | baseline mu = +1.033 sigma = 0.1740
---------------------------------------------------------------------------------
Original (α= 1) : +1.033 [ | ● ]
Ablation (α= 0) : +1.033 [ | ● ] (Shift: +0.000 >)
Steer (α=-4) : +1.033 [ | ● ] (Shift: +0.000 >)
Steer (α=-2) : +1.033 [ | ● ] (Shift: +0.000 >)
Steer (α=-1) : +1.033 [ | ● ] (Shift: +0.000 >)
Steer (α=+2) : +1.033 [ | ● ] (Shift: +0.000 >)
Steer (α=+4) : +1.033 [ | ● ] (Shift: +0.000 >)
Group +2 (n= 61) | baseline mu = +2.020 sigma = 0.2088
---------------------------------------------------------------------------------
Original (α= 1) : +2.020 [ | ● ]
Ablation (α= 0) : +2.020 [ | ● ] (Shift: +0.000 >)
Steer (α=-4) : +2.020 [ | ● ] (Shift: +0.000 >)
Steer (α=-2) : +2.020 [ | ● ] (Shift: +0.000 >)
Steer (α=-1) : +2.020 [ | ● ] (Shift: +0.000 >)
Steer (α=+2) : +2.020 [ | ● ] (Shift: +0.000 >)
Steer (α=+4) : +2.020 [ | ● ] (Shift: +0.000 >)
Group +3 (n= 52) | baseline mu = +3.091 sigma = 0.1657
---------------------------------------------------------------------------------
Original (α= 1) : +3.091 [ | ● ]
Ablation (α= 0) : +3.091 [ | ● ] (Shift: +0.000 >)
Steer (α=-4) : +3.091 [ | ● ] (Shift: +0.000 >)
Steer (α=-2) : +3.091 [ | ● ] (Shift: +0.000 >)
Steer (α=-1) : +3.091 [ | ● ] (Shift: +0.000 >)
Steer (α=+2) : +3.091 [ | ● ] (Shift: +0.000 >)
Steer (α=+4) : +3.091 [ | ● ] (Shift: +0.000 >)
Group +4 (n= 48) | baseline mu = +4.097 sigma = 0.1863
---------------------------------------------------------------------------------
Original (α= 1) : +4.097 [ | ● ]
Ablation (α= 0) : +4.097 [ | ● ] (Shift: +0.000 >)
Steer (α=-4) : +4.097 [ | ● ] (Shift: +0.000 >)
Steer (α=-2) : +4.097 [ | ● ] (Shift: +0.000 >)
Steer (α=-1) : +4.097 [ | ● ] (Shift: +0.000 >)
Steer (α=+2) : +4.097 [ | ● ] (Shift: +0.000 >)
Steer (α=+4) : +4.097 [ | ● ] (Shift: +0.000 >)
Group +5 (n= 30) | baseline mu = +5.005 sigma = 0.2205
---------------------------------------------------------------------------------
Original (α= 1) : +5.005 [ | ● ]
Ablation (α= 0) : +5.005 [ | ● ] (Shift: +0.000 >)
Steer (α=-4) : +5.005 [ | ● ] (Shift: +0.000 >)
Steer (α=-2) : +5.005 [ | ● ] (Shift: +0.000 >)
Steer (α=-1) : +5.005 [ | ● ] (Shift: +0.000 >)
Steer (α=+2) : +5.005 [ | ● ] (Shift: +0.000 >)
Steer (α=+4) : +5.005 [ | ● ] (Shift: +0.000 >)
Group +6 (n= 97) | baseline mu = +6.065 sigma = 0.1866
---------------------------------------------------------------------------------
Original (α= 1) : +6.065 [ | ● ]
Ablation (α= 0) : +6.065 [ | ● ] (Shift: +0.000 >)
Steer (α=-4) : +6.065 [ | ● ] (Shift: +0.000 >)
Steer (α=-2) : +6.065 [ | ● ] (Shift: +0.000 >)
Steer (α=-1) : +6.065 [ | ● ] (Shift: +0.000 >)
Steer (α=+2) : +6.065 [ | ● ] (Shift: +0.000 >)
Steer (α=+4) : +6.065 [ | ● ] (Shift: +0.000 >)
Group +7 (n= 66) | baseline mu = +7.039 sigma = 0.1674
---------------------------------------------------------------------------------
Original (α= 1) : +7.039 [ | ● ]
Ablation (α= 0) : +7.039 [ | ● ] (Shift: +0.000 >)
Steer (α=-4) : +7.039 [ | ● ] (Shift: +0.000 >)
Steer (α=-2) : +7.039 [ | ● ] (Shift: +0.000 >)
Steer (α=-1) : +7.039 [ | ● ] (Shift: +0.000 >)
Steer (α=+2) : +7.039 [ | ● ] (Shift: +0.000 >)
Steer (α=+4) : +7.039 [ | ● ] (Shift: +0.000 >)
Group +8 (n= 45) | baseline mu = +8.038 sigma = 0.1778
---------------------------------------------------------------------------------
Original (α= 1) : +8.038 [ | ● ]
Ablation (α= 0) : +8.038 [ | ● ] (Shift: +0.000 >)
Steer (α=-4) : +8.038 [ | ● ] (Shift: +0.000 >)
Steer (α=-2) : +8.038 [ | ● ] (Shift: +0.000 >)
Steer (α=-1) : +8.038 [ | ● ] (Shift: +0.000 >)
Steer (α=+2) : +8.038 [ | ● ] (Shift: +0.000 >)
Steer (α=+4) : +8.038 [ | ● ] (Shift: +0.000 >)
Group +9 (n= 24) | baseline mu = +8.965 sigma = 0.1090
---------------------------------------------------------------------------------
Original (α= 1) : +8.965 [ | ● ]
Ablation (α= 0) : +8.965 [ | ● ] (Shift: +0.000 >)
Steer (α=-4) : +8.965 [ | ● ] (Shift: +0.000 >)
Steer (α=-2) : +8.965 [ | ● ] (Shift: +0.000 >)
Steer (α=-1) : +8.965 [ | ● ] (Shift: +0.000 >)
Steer (α=+2) : +8.965 [ | ● ] (Shift: +0.000 >)
Steer (α=+4) : +8.965 [ | ● ] (Shift: +0.000 >)
Group +10 (n= 18) | baseline mu = +9.814 sigma = 0.1806
---------------------------------------------------------------------------------
Original (α= 1) : +9.814 [ | ● ]
Ablation (α= 0) : +9.814 [ | ● ] (Shift: +0.000 >)
Steer (α=-4) : +9.814 [ | ● ] (Shift: +0.000 >)
Steer (α=-2) : +9.814 [ | ● ] (Shift: +0.000 >)
Steer (α=-1) : +9.814 [ | ● ] (Shift: +0.000 >)
Steer (α=+2) : +9.814 [ | ● ] (Shift: +0.000 >)
Steer (α=+4) : +9.814 [ | ● ] (Shift: +0.000 >)
=====================================================================================
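Counting the groups a feature actually touches gives a quick locality measure. The sketch below copies the ablation shifts for feature #576 from the output above; the function name and the tolerance of half the printed precision are assumptions, and a stricter tolerance would shrink the list.

```python
# Ablation shift per group for feature #576, as printed to three decimals.
ABLATION_SHIFT_576 = {
    -10: 0.000, -9: 0.002, -8: 0.001, -7: 0.003, -6: 0.004,
    -5: 0.001, -4: 0.001, -3: 0.000, -2: 0.000, -1: 0.000,
    **{g: 0.000 for g in range(1, 11)},   # every positive group shows +0.000
}

def affected_groups(shifts, tol=0.0005):
    """Groups whose absolute ablation shift exceeds the tolerance."""
    return [g for g, s in shifts.items() if abs(s) > tol]

touched = affected_groups(ABLATION_SHIFT_576)
```

With the default tolerance this lists groups −9 through −4, and the largest shift (0.004) is in group −6.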
| | Feature #1347 | Feature #576 |
|---|---|---|
| Global rank | 2048 / 2048 | 1753 / 2048 |
| Groups affected | All 20 | 6 with a visible ablation shift (−9 to −4), strongest in −9 to −6 |
| Ablation direction | Flips across sign boundary in 16 of 20 groups | Always toward zero |
| Steering sensitivity | High: α=−4 shifts negative groups by up to +3.8 | Negligible at any α (at most 0.021) |
| Role | Encodes output magnitude across all groups | Narrow detector for mid-negative outputs |
Feature #1347 appears causally load-bearing across the whole output range. Feature #576 illustrates one reason the SAE uses 2048 dimensions: fine-grained detectors handle specific output regions that the broad features miss.
| File | Size (typical) | Description |
|---|---|---|
| results/feature_ranks.pt | ~0.5 MB | Per-group feature rankings and mean activations |
| results/throttle_results.pt | ~145 MB | All ablation + steering prediction lists |
| results/reports.html | ~11.6 MB | Self-contained ridgeline report viewer |
| results/index.html | ~0.5 MB | Self-contained D3 network graph |
Feature-Ranking-et-Throttle-Testing/
├── dataset/
│ ├── mlp_test.xlsx ← used by pipeline
│ ├── mlp_train.xlsx
│ ├── mlp_val.xlsx
│ ├── extrap_test.xlsx
│ ├── interp_test.xlsx
│ ├── precision_test.xlsx
│ └── scaling_test.xlsx
├── mlp/
│ ├── mlp_definition.py ← InterpretabilityMLP class
│ └── perfect_mlp.pth ← pre-trained weights
├── sae/
│ ├── sae_definition.py ← SparseAutoencoder class
│ └── universal_sae.pth ← pre-trained weights
├── results/ ← generated by pipeline
│ ├── feature_ranks.pt
│ ├── throttle_results.pt
│ ├── reports.html
│ └── index.html
├── feature_ranking.py ← step 2
├── throttle_testing.py ← step 3
├── generate_reports.py ← step 4a
├── network_graph.py ← step 4b
├── verify_feature.py ← manual verification (standalone)
├── workflow.bat ← full pipeline runner
├── workflow.log ← sample pipeline run log
└── reqs.txt ← pip dependencies