Feature Ranking & Throttle Testing

Mechanistic interpretability pipeline for an MLP + Sparse Autoencoder (SAE) model. Ranks SAE latent features by activation impact, then performs ablation and activation steering across all output groups to understand how individual features influence predictions.


Model Architecture

    Input (10)
      └─ Linear(10 → 256) + BatchNorm + ReLU
      └─ Linear(256 → 512) + BatchNorm + ReLU
      └─ Linear(512 → 256)    ← hook point: hidden2
      └─ ReLU
      └─ Linear(256 → 1)      output

A Sparse Autoencoder is attached to the hidden2 activation (256-dim):

    hidden2 (256) → Encoder → latents (2048, Top-K=128 sparsity)
                                └─ Decoder → reconstructed (256)
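The encode, Top-K, decode path can be sketched in NumPy as follows. This is a minimal illustration, not the repository's `SparseAutoencoder` class: the function name, the random toy weights, and the choice to apply ReLU to the surviving latents are all assumptions made for the example.

```python
import numpy as np

def topk_sae_forward(hidden, W_enc, b_enc, W_dec, b_dec, k):
    """Encode, keep the k largest latents per sample, then decode."""
    pre = hidden @ W_enc.T + b_enc                      # (N, n_latents)
    # Column indices of the k largest pre-activations in each row.
    top_idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    latents = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    # Assumption of this sketch: survivors pass through a ReLU.
    latents[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    recon = latents @ W_dec.T + b_dec                   # (N, d_hidden)
    return latents, recon

# Toy weights with the dimensions quoted above (256 → 2048 → 256, K = 128).
rng = np.random.default_rng(0)
N, D, M, K = 4, 256, 2048, 128
hidden = rng.normal(size=(N, D))
W_enc = rng.normal(size=(M, D)) * 0.05
W_dec = rng.normal(size=(D, M)) * 0.05
latents, recon = topk_sae_forward(hidden, W_enc, np.zeros(M), W_dec, np.zeros(D), K)
active_per_sample = (latents != 0).sum(axis=1)          # at most K per row
```

Because at most 128 of the 2048 latents are nonzero for any sample, most features contribute nothing to a given reconstruction.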

The decoder direction trick avoids a full matrix multiply per feature during throttle testing:

    contrib_i = latents[:, i, None] * W_dec[:, i]    # (N, 1) * (256,) → (N, 256)
    ablated   = decoded - contrib_i                  # zero out feature i
    steered_α = decoded + (α - 1) * contrib_i        # scale feature i by α
    output    = ReLU(modified) @ W_out.T + b_out     # modified = ablated or steered_α
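A quick numerical check of the trick, as a sketch on toy data: the shortcut should match what you get by editing the latent and repeating the full decode. The array names follow the pseudocode above; `steer`, `steer_full`, the random weights and the decoder bias `b_dec` are assumptions for illustration. The latent column is indexed with an explicit trailing axis so that broadcasting yields an (N, 256) contribution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 5, 256, 2048
latents = np.maximum(rng.normal(size=(N, M)), 0.0)    # toy non-negative latents
W_dec = rng.normal(size=(D, M)) * 0.05                # decoder weight, (256, 2048)
b_dec = rng.normal(size=D) * 0.01
decoded = latents @ W_dec.T + b_dec                   # computed once, (N, 256)

def steer(decoded, latents, W_dec, i, alpha):
    """Rescale feature i by alpha using only its decoder column."""
    contrib_i = latents[:, i, None] * W_dec[:, i]     # (N, 256)
    return decoded + (alpha - 1.0) * contrib_i

def steer_full(latents, W_dec, b_dec, i, alpha):
    """Reference path: edit the latent, then redo the whole decode."""
    edited = latents.copy()
    edited[:, i] *= alpha
    return edited @ W_dec.T + b_dec

i = 7
max_err = max(
    np.abs(steer(decoded, latents, W_dec, i, a)
           - steer_full(latents, W_dec, b_dec, i, a)).max()
    for a in (-4, -2, -1, 0, 1, 2, 4)
)
```

Ablation is the α = 0 case. The shortcut costs one (N, 256) outer product per feature instead of an (N, 2048) × (2048, 256) matrix multiply.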

Task & Dataset

Integer-valued regression: 10-integer input lists → output value in {-10…-1, +1…+10}.
The dataset is pre-split across 20 output groups (no zero class).

    File                          Purpose
    dataset/mlp_train.xlsx        MLP training set
    dataset/mlp_val.xlsx          MLP validation set
    dataset/mlp_test.xlsx         Primary test set used by this pipeline
    dataset/extrap_test.xlsx      Out-of-distribution: extrapolation
    dataset/interp_test.xlsx      Out-of-distribution: interpolation
    dataset/precision_test.xlsx   Precision stress test
    dataset/scaling_test.xlsx     Scaling stress test

Group sample counts from a typical run (1000 total samples):

    Group -10: n=16 | Group  -9: n=31 | Group  -8: n=54 | Group  -7: n=67
    Group  -6: n=82 | Group  -5: n=45 | Group  -4: n=36 | Group  -3: n=49
    Group  -2: n=64 | Group  -1: n=56 | Group  +1: n=59 | Group  +2: n=61
    Group  +3: n=52 | Group  +4: n=48 | Group  +5: n=30 | Group  +6: n=97
    Group  +7: n=66 | Group  +8: n=45 | Group  +9: n=24 | Group +10: n=18

Prerequisites

Set up a Conda environment (act-abl) with Python 3.11.6:

    git clone https://github.com/Palani-SN/Feature-Ranking-et-Throttle-Testing.git
    cd Feature-Ranking-et-Throttle-Testing
    conda create -n act-abl python=3.11.6
    conda activate act-abl
    python -m pip install torch==2.10.0 torchvision==0.25.0 --extra-index-url https://download.pytorch.org/whl/cu126
    python -m pip install -r reqs.txt

Key packages from reqs.txt: numpy, pandas, openpyxl, matplotlib, plotly, scikit-learn, tqdm (and their dependencies).
Torch and torchvision are installed separately as shown above to pull the CUDA 12.6 build.
CUDA 12.6 is required for GPU acceleration; without it, the pipeline falls back to CPU automatically.

Pre-trained weights must be present:

    mlp/perfect_mlp.pth      (InterpretabilityMLP)
    sae/universal_sae.pth    (SparseAutoencoder)


Running the Full Pipeline

    conda activate act-abl
    cd Feature-Ranking-et-Throttle-Testing
    workflow.bat

The batch file activates the environment, runs all pipeline scripts in sequence, and prints a timing summary.
Typical runtime: ~14 minutes (CUDA, 1000 samples, 2048 features).

    [1/4] conda activate act-abl
    [2/4] python feature_ranking.py    → results/feature_ranks.pt
    [3/4] python throttle_testing.py   → results/throttle_results.pt
    [4/4] python generate_reports.py   → results/reports.html
          python network_graph.py      → results/index.html

Once complete, open results/index.html in any browser.

[Screenshots: main page; network graph edges; amber shared nodes; purple common nodes]


Scripts

feature_ranking.py

Groups the test dataset by rounded output value and computes per-group MLP predictions through the SAE. Ranks all 2048 SAE features by mean absolute activation, least impactful first.
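The grouping and ranking step can be sketched as below. This is an illustrative NumPy version on toy data, not the script itself. It assumes that the global score is the mean of the per-group mean activations (the text only says the ranking is averaged across groups), and `rank_features` is a hypothetical helper name.

```python
import numpy as np

def rank_features(latents, outputs):
    """Return per-group and global rankings, least impactful first."""
    groups = np.rint(outputs).astype(int)               # group = rounded output
    group_mean_acts = {
        int(g): np.abs(latents[groups == g]).mean(axis=0)
        for g in np.unique(groups)
    }
    feature_ranks = {
        g: np.argsort(m, kind="stable") for g, m in group_mean_acts.items()
    }
    # Assumption: global score = mean of the per-group means.
    global_score = np.mean(list(group_mean_acts.values()), axis=0)
    global_rank = np.argsort(global_score, kind="stable")
    return group_mean_acts, feature_ranks, global_rank

# Toy data: 4 features, three samples falling into groups -1 and +2.
latents = np.array([[0.0, 0.2, 1.0, 0.0],
                    [0.0, 0.4, 3.0, 0.0],
                    [0.0, 0.1, 2.0, 0.5]])
outputs = np.array([-1.1, -0.9, 2.2])
group_mean_acts, feature_ranks, global_rank = rank_features(latents, outputs)
```

With ascending order, the last entry of each ranking is the most impactful feature (feature 2 in this toy case), which matches the "1 = least impactful, 2048 = most" convention printed by verify_feature.py.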

Output: results/feature_ranks.pt

    baseline_stats  : {group → {mu, sigma, predictions, n}}
    feature_ranks   : {group → [2048 feature indices, ascending impact]}
    global_rank     : [2048 feature indices, ascending impact, averaged across groups]
    groups          : [-10, -9, ..., -1, +1, ..., +10]
    group_mean_acts : {group → [2048 mean absolute activations]}
    dataset_path    : path used

throttle_testing.py

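Iterates over all 2048 features in global rank order. For each feature and each output group, applies two interventions using the decoder-direction trick:

- Ablation: the feature's contribution is removed from the decoded activation (equivalent to α = 0).
- Activation steering: the feature's contribution is rescaled by each multiplier α in [-4, -2, -1, 0, 1, 2, 4].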

Progress is printed every 256 features.

Output: results/throttle_results.pt

    results        : {feature_idx → {group_val → {ablation: [...], steering: {α: [...]}}}}
    global_rank    : [2048 feature indices]
    groups         : list of group values
    multipliers    : [-4, -2, -1, 0, 1, 2, 4]
    baseline_stats : {group → {mu, sigma, predictions, n}}
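To show how the nested layout is meant to be read, here is a small sketch that turns the stored prediction lists into mean shifts against the baseline. It runs on a hand-made toy dictionary rather than the real file. The helper name `mean_shifts` and the exact key types (integer feature indices, groups and multipliers) are assumptions; the real file may differ, for example by using float keys.

```python
from statistics import fmean

def mean_shifts(results, baseline_stats, feature_idx, group):
    """Mean prediction shift vs. baseline for ablation and each multiplier."""
    entry = results[feature_idx][group]
    mu = baseline_stats[group]["mu"]
    shifts = {"ablation": fmean(entry["ablation"]) - mu}
    for alpha, preds in entry["steering"].items():
        shifts[alpha] = fmean(preds) - mu
    return shifts

# Toy stand-in with one feature, one group and two samples (values invented).
toy = {
    "results": {
        1347: {-10: {"ablation": [-9.5, -9.7],
                     "steering": {-4: [-6.0, -6.6], 2: [-10.0, -10.0]}}},
    },
    "baseline_stats": {
        -10: {"mu": -9.8, "sigma": 0.14, "predictions": [-9.7, -9.9], "n": 2},
    },
}
shifts = mean_shifts(toy["results"], toy["baseline_stats"], 1347, -10)
```

A positive shift means the mean prediction rose relative to the baseline `mu`; a negative one means it fell.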

generate_reports.py

Reads both .pt result files and produces results/reports.html — a self-contained single-file viewer. All data is embedded as a JavaScript constant (const REPORTS_DATA = {...}); no server or fetch required. Open directly from disk. At 2048 features the file is ~11.6 MB.
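The embed-as-constant approach can be sketched in a few lines. This is an illustration of the general technique, not the code in generate_reports.py. `build_report_html` is a hypothetical helper, and escaping `</` in the payload is a precaution added here so that no string in the data can close the script tag early.

```python
import json

def build_report_html(data):
    """Return a single-file HTML page with the data inlined as a JS constant."""
    payload = json.dumps(data, separators=(",", ":"))
    # "<\/" is a valid JavaScript escape for "</", so the data stays intact.
    payload = payload.replace("</", "<\\/")
    return (
        "<!DOCTYPE html>\n"
        "<html><head><meta charset=\"utf-8\"></head><body>\n"
        "<script>\n"
        f"const REPORTS_DATA = {payload};\n"
        "</script>\n"
        "</body></html>\n"
    )

html = build_report_html({"groups": [-10, 10], "note": "</script>"})
```

Since everything lives in one file, the page works from a `file://` URL with no server, at the cost of a file size that grows with the number of features.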

The viewer displays ridgeline plots of prediction distributions and intervention deltas. URL parameters:


network_graph.py

Generates results/index.html — an interactive D3.js bipartite force-directed graph.

Config (top of file):

    Variable     Default   Meaning
    TOP_K        254       Top features per group included in graph
    MIN_GROUPS   2         Minimum groups a feature must appear in to be shown
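One plausible reading of these two settings is sketched below: take the TOP_K strongest features of every group, then keep only features that appear for at least MIN_GROUPS groups. The helper `select_graph_edges` and the toy rankings are hypothetical, and the real script may apply the thresholds differently.

```python
from collections import defaultdict

def select_graph_edges(feature_ranks, top_k, min_groups):
    """Keep features that sit in the top_k of at least min_groups groups."""
    groups_of = defaultdict(list)
    for group, ranked in feature_ranks.items():
        # Rankings are ascending, so the strongest features are at the end.
        for feat in ranked[-top_k:]:
            groups_of[feat].append(group)
    return {
        feat: sorted(groups)
        for feat, groups in groups_of.items()
        if len(groups) >= min_groups
    }

# Toy rankings: 5 features, 3 groups, least impactful first.
toy_ranks = {-1: [0, 1, 2, 3, 4], 1: [4, 0, 1, 3, 2], 2: [1, 2, 0, 4, 3]}
edges = select_graph_edges(toy_ranks, top_k=2, min_groups=2)
```

Each surviving feature maps to the groups it connects to, which is the edge list of a bipartite feature-to-group graph.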

Node encoding:

Edge encoding: colour inherits from the connected group; thickness ∝ normalised mean activation.

Interactions:

A typical run produces ~288 visible feature nodes and ~5073 edges across 20 groups.


verify_feature.py — Manual Verification Tool

Standalone script for inspecting a single feature without running the full pipeline. Recomputes ablation and steering from scratch and prints ASCII bar-chart output.

    python verify_feature.py <feature_idx>

    # Example: most impactful feature in a sample run
    python verify_feature.py 1347

Ablation section — one row per group, shift bar relative to maximum shift seen:

    Group -10 (n= 16) : -9.799 (baseline) [ |███> ] -9.601 (ablated) (+0.198)
    Group +7 (n= 66) : +7.039 (baseline) [<████████| ] +6.574 (ablated) (-0.465)

> = prediction rose when feature removed; < = prediction fell; blocks = relative magnitude.
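The bar format can be reproduced with a short helper. This is a sketch under two assumptions: a half-width of 8 characters per side, and a block count rounded from the shift relative to the largest shift seen. Both are consistent with the sample rows but are not confirmed by the script, and `ablation_bar` is a hypothetical name.

```python
def ablation_bar(shift, max_abs_shift, half_width=8):
    """Draw a shift as blocks left ('<') or right ('>') of a centre '|'."""
    if max_abs_shift > 0:
        blocks = round(abs(shift) / max_abs_shift * half_width)
    else:
        blocks = 0
    if shift < 0:
        # Prediction fell: arrow and blocks grow leftwards from the centre.
        left = ("<" + "█" * blocks).rjust(half_width + 1)
        right = " " * (half_width + 1)
    else:
        # Prediction rose (or did not move): blocks grow rightwards.
        left = " " * (half_width + 1)
        right = ("█" * blocks + ">").ljust(half_width + 1)
    return f"[{left}|{right}]"

bar_up = ablation_bar(+0.198, 0.465)     # group -10 in the sample run
bar_down = ablation_bar(-0.465, 0.465)   # group +7, the largest shift
```

Scaling against the maximum shift means bars are comparable between groups within one run, but not between features.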

Activation steering section — per group, tracks mean prediction on a shared number-line; | marks 0:

    Group -10 (n= 16) | baseline mu = -9.799 sigma = 0.1399
    ---------------------------------------------------------------------------------
    Original (α= 1) : -9.799 [ ● | ]
    Ablation (α= 0) : -9.601 [ ● | ] (Shift: +0.198 >)
    Steer (α=-4) : -6.330 [ ● | ] (Shift: +3.468 >)
    Steer (α=+4) : -10.039 [ ● | ] (Shift: -0.240 <)

Features with zero shift across all groups are inactive (sparse SAE did not fire them for any sample in those groups).

Reading the output — two contrasting examples

Feature #1347 — Global / Universal (global rank 2048 / 2048, most impactful)

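Ablation shifts every one of the 20 groups, and for most of them the direction flips at the sign boundary: negative-output groups mostly shift upward and positive-output groups mostly shift downward, so both move toward zero (groups −4, −3, +1 and +5 are exceptions). When the feature is active it therefore tends to amplify prediction magnitude away from zero across the whole task. Steering supports this: at α=−4, group −10 jumps from −9.8 to −6.3 (+3.5 shift), and every group responds. The response is not strictly proportional to α, because the ReLU after the hook point makes the output nonlinear in the modified activation.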

    (act-abl) C:\Workspace\Git_Repos\MLP-Experiments\Feature-Ranking-et-Throttle-Testing>python verify_feature.py 1347
    [verify_feature] Feature #1347 | Device: cuda
    Global rank : 2048 / 2048 (1 = least impactful, 2048 = most)
    Groups : [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    =====================================================================================
    ABLATION — Feature #1347 (feature activation zeroed out)
    =====================================================================================
    Group -10 (n= 16) : -9.799 (baseline) [ |███> ] -9.601 (ablated) (+0.198)
    Group -9 (n= 31) : -8.849 (baseline) [ |███> ] -8.690 (ablated) (+0.159)
    Group -8 (n= 54) : -7.940 (baseline) [ |█> ] -7.899 (ablated) (+0.041)
    Group -7 (n= 67) : -6.935 (baseline) [ |███> ] -6.743 (ablated) (+0.192)
    Group -6 (n= 82) : -5.968 (baseline) [ |██> ] -5.859 (ablated) (+0.110)
    Group -5 (n= 45) : -5.030 (baseline) [ |████> ] -4.819 (ablated) (+0.211)
    Group -4 (n= 36) : -3.972 (baseline) [ <█| ] -4.022 (ablated) (-0.051)
    Group -3 (n= 49) : -3.019 (baseline) [ <████| ] -3.235 (ablated) (-0.216)
    Group -2 (n= 64) : -1.963 (baseline) [ |> ] -1.953 (ablated) (+0.010)
    Group -1 (n= 56) : -0.879 (baseline) [ |█> ] -0.805 (ablated) (+0.074)
    Group +1 (n= 59) : +1.033 (baseline) [ |> ] +1.055 (ablated) (+0.023)
    Group +2 (n= 61) : +2.020 (baseline) [ <███| ] +1.862 (ablated) (-0.158)
    Group +3 (n= 52) : +3.091 (baseline) [ <█████| ] +2.792 (ablated) (-0.299)
    Group +4 (n= 48) : +4.097 (baseline) [ <██████| ] +3.742 (ablated) (-0.355)
    Group +5 (n= 30) : +5.005 (baseline) [ |> ] +5.025 (ablated) (+0.021)
    Group +6 (n= 97) : +6.065 (baseline) [<████████| ] +5.603 (ablated) (-0.462)
    Group +7 (n= 66) : +7.039 (baseline) [<████████| ] +6.574 (ablated) (-0.465)
    Group +8 (n= 45) : +8.038 (baseline) [ <█████| ] +7.770 (ablated) (-0.268)
    Group +9 (n= 24) : +8.965 (baseline) [ <██| ] +8.838 (ablated) (-0.127)
    Group +10 (n= 18) : +9.814 (baseline) [ <██████| ] +9.477 (ablated) (-0.336)
    =====================================================================================
    =====================================================================================
    ACTIVATION STEERING — Feature #1347 (α scales the feature activation)
    =====================================================================================
    Group -10 (n= 16) | baseline mu = -9.799 sigma = 0.1399
    ---------------------------------------------------------------------------------
    Original (α= 1) : -9.799 [ ● | ]
    Ablation (α= 0) : -9.601 [ ● | ] (Shift: +0.198 >)
    Steer (α=-4) : -6.330 [ ● | ] (Shift: +3.468 >)
    Steer (α=-2) : -8.248 [ ● | ] (Shift: +1.551 >)
    Steer (α=-1) : -9.099 [ ● | ] (Shift: +0.700 >)
    Steer (α=+2) : -10.006 [ ● | ] (Shift: -0.207 <)
    Steer (α=+4) : -10.039 [ ● | ] (Shift: -0.240 <)
    Group -9 (n= 31) | baseline mu = -8.849 sigma = 0.1538
    ---------------------------------------------------------------------------------
    Original (α= 1) : -8.849 [ ● | ]
    Ablation (α= 0) : -8.690 [ ● | ] (Shift: +0.159 >)
    Steer (α=-4) : -5.058 [ ● | ] (Shift: +3.791 >)
    Steer (α=-2) : -7.194 [ ● | ] (Shift: +1.655 >)
    Steer (α=-1) : -8.070 [ ● | ] (Shift: +0.779 >)
    Steer (α=+2) : -8.958 [ ● | ] (Shift: -0.109 <)
    Steer (α=+4) : -8.845 [ ● | ] (Shift: +0.004 >)
    Group -8 (n= 54) | baseline mu = -7.940 sigma = 0.1630
    ---------------------------------------------------------------------------------
    Original (α= 1) : -7.940 [ ● | ]
    Ablation (α= 0) : -7.899 [ ● | ] (Shift: +0.041 >)
    Steer (α=-4) : -5.897 [ ● | ] (Shift: +2.043 >)
    Steer (α=-2) : -7.101 [ ● | ] (Shift: +0.839 >)
    Steer (α=-1) : -7.616 [ ● | ] (Shift: +0.324 >)
    Steer (α=+2) : -7.894 [ ● | ] (Shift: +0.046 >)
    Steer (α=+4) : -7.374 [ ● | ] (Shift: +0.566 >)
    Group -7 (n= 67) | baseline mu = -6.935 sigma = 0.1585
    ---------------------------------------------------------------------------------
    Original (α= 1) : -6.935 [ ● | ]
    Ablation (α= 0) : -6.743 [ ● | ] (Shift: +0.192 >)
    Steer (α=-4) : -4.490 [ ● | ] (Shift: +2.446 >)
    Steer (α=-2) : -5.660 [ ● | ] (Shift: +1.275 >)
    Steer (α=-1) : -6.256 [ ● | ] (Shift: +0.679 >)
    Steer (α=+2) : -6.960 [ ● | ] (Shift: -0.025 <)
    Steer (α=+4) : -6.496 [ ● | ] (Shift: +0.440 >)
    Group -6 (n= 82) | baseline mu = -5.968 sigma = 0.2195
    ---------------------------------------------------------------------------------
    Original (α= 1) : -5.968 [ ● | ]
    Ablation (α= 0) : -5.859 [ ● | ] (Shift: +0.110 >)
    Steer (α=-4) : -3.993 [ ● | ] (Shift: +1.975 >)
    Steer (α=-2) : -5.106 [ ● | ] (Shift: +0.863 >)
    Steer (α=-1) : -5.532 [ ● | ] (Shift: +0.437 >)
    Steer (α=+2) : -5.840 [ ● | ] (Shift: +0.128 >)
    Steer (α=+4) : -5.160 [ ● | ] (Shift: +0.809 >)
    Group -5 (n= 45) | baseline mu = -5.030 sigma = 0.1520
    ---------------------------------------------------------------------------------
    Original (α= 1) : -5.030 [ ● | ]
    Ablation (α= 0) : -4.819 [ ● | ] (Shift: +0.211 >)
    Steer (α=-4) : -2.409 [ ● | ] (Shift: +2.621 >)
    Steer (α=-2) : -3.763 [ ● | ] (Shift: +1.267 >)
    Steer (α=-1) : -4.342 [ ● | ] (Shift: +0.688 >)
    Steer (α=+2) : -4.945 [ ● | ] (Shift: +0.085 >)
    Steer (α=+4) : -4.292 [ ● | ] (Shift: +0.738 >)
    Group -4 (n= 36) | baseline mu = -3.972 sigma = 0.1603
    ---------------------------------------------------------------------------------
    Original (α= 1) : -3.972 [ ● | ]
    Ablation (α= 0) : -4.022 [ ● | ] (Shift: -0.051 <)
    Steer (α=-4) : -3.061 [ ● | ] (Shift: +0.910 >)
    Steer (α=-2) : -3.603 [ ● | ] (Shift: +0.368 >)
    Steer (α=-1) : -3.865 [ ● | ] (Shift: +0.107 >)
    Steer (α=+2) : -3.779 [ ● | ] (Shift: +0.193 >)
    Steer (α=+4) : -3.087 [ ● | ] (Shift: +0.885 >)
    Group -3 (n= 49) | baseline mu = -3.019 sigma = 0.1516
    ---------------------------------------------------------------------------------
    Original (α= 1) : -3.019 [ ● | ]
    Ablation (α= 0) : -3.235 [ ● | ] (Shift: -0.216 <)
    Steer (α=-4) : -2.798 [ ● | ] (Shift: +0.221 >)
    Steer (α=-2) : -3.122 [ ● | ] (Shift: -0.103 <)
    Steer (α=-1) : -3.225 [ ● | ] (Shift: -0.206 <)
    Steer (α=+2) : -2.640 [ ● | ] (Shift: +0.379 >)
    Steer (α=+4) : -1.594 [ ● | ] (Shift: +1.425 >)
    Group -2 (n= 64) | baseline mu = -1.963 sigma = 0.1718
    ---------------------------------------------------------------------------------
    Original (α= 1) : -1.963 [ ● | ]
    Ablation (α= 0) : -1.953 [ ● | ] (Shift: +0.010 >)
    Steer (α=-4) : -0.305 [ ● ] (Shift: +1.659 >)
    Steer (α=-2) : -1.285 [ ●| ] (Shift: +0.679 >)
    Steer (α=-1) : -1.701 [ ● | ] (Shift: +0.263 >)
    Steer (α=+2) : -1.768 [ ● | ] (Shift: +0.195 >)
    Steer (α=+4) : -1.007 [ ●| ] (Shift: +0.956 >)
    Group -1 (n= 56) | baseline mu = -0.879 sigma = 0.1642
    ---------------------------------------------------------------------------------
    Original (α= 1) : -0.879 [ ●| ]
    Ablation (α= 0) : -0.805 [ ●| ] (Shift: +0.074 >)
    Steer (α=-4) : +0.501 [ |● ] (Shift: +1.380 >)
    Steer (α=-2) : -0.369 [ ● ] (Shift: +0.510 >)
    Steer (α=-1) : -0.642 [ ● ] (Shift: +0.237 >)
    Steer (α=+2) : -0.822 [ ●| ] (Shift: +0.057 >)
    Steer (α=+4) : -0.093 [ ● ] (Shift: +0.786 >)
    Group +1 (n= 59) | baseline mu = +1.033 sigma = 0.1740
    ---------------------------------------------------------------------------------
    Original (α= 1) : +1.033 [ | ● ]
    Ablation (α= 0) : +1.055 [ | ● ] (Shift: +0.023 >)
    Steer (α=-4) : +2.099 [ | ● ] (Shift: +1.066 >)
    Steer (α=-2) : +1.446 [ | ● ] (Shift: +0.413 >)
    Steer (α=-1) : +1.215 [ | ● ] (Shift: +0.182 >)
    Steer (α=+2) : +1.184 [ | ● ] (Shift: +0.151 >)
    Steer (α=+4) : +2.032 [ | ● ] (Shift: +0.999 >)
    Group +2 (n= 61) | baseline mu = +2.020 sigma = 0.2088
    ---------------------------------------------------------------------------------
    Original (α= 1) : +2.020 [ | ● ]
    Ablation (α= 0) : +1.862 [ | ● ] (Shift: -0.158 <)
    Steer (α=-4) : +2.503 [ | ● ] (Shift: +0.483 >)
    Steer (α=-2) : +1.883 [ | ● ] (Shift: -0.137 <)
    Steer (α=-1) : +1.821 [ | ● ] (Shift: -0.199 <)
    Steer (α=+2) : +2.341 [ | ● ] (Shift: +0.321 >)
    Steer (α=+4) : +3.302 [ | ● ] (Shift: +1.281 >)
    Group +3 (n= 52) | baseline mu = +3.091 sigma = 0.1657
    ---------------------------------------------------------------------------------
    Original (α= 1) : +3.091 [ | ● ]
    Ablation (α= 0) : +2.792 [ | ● ] (Shift: -0.299 <)
    Steer (α=-4) : +2.424 [ | ● ] (Shift: -0.667 <)
    Steer (α=-2) : +2.349 [ | ● ] (Shift: -0.742 <)
    Steer (α=-1) : +2.529 [ | ● ] (Shift: -0.562 <)
    Steer (α=+2) : +3.477 [ | ● ] (Shift: +0.386 >)
    Steer (α=+4) : +4.413 [ | ● ] (Shift: +1.322 >)
    Group +4 (n= 48) | baseline mu = +4.097 sigma = 0.1863
    ---------------------------------------------------------------------------------
    Original (α= 1) : +4.097 [ | ● ]
    Ablation (α= 0) : +3.742 [ | ● ] (Shift: -0.355 <)
    Steer (α=-4) : +2.876 [ | ● ] (Shift: -1.221 <)
    Steer (α=-2) : +3.028 [ | ● ] (Shift: -1.068 <)
    Steer (α=-1) : +3.398 [ | ● ] (Shift: -0.699 <)
    Steer (α=+2) : +4.468 [ | ● ] (Shift: +0.371 >)
    Steer (α=+4) : +5.359 [ | ● ] (Shift: +1.262 >)
    Group +5 (n= 30) | baseline mu = +5.005 sigma = 0.2205
    ---------------------------------------------------------------------------------
    Original (α= 1) : +5.005 [ | ● ]
    Ablation (α= 0) : +5.025 [ | ● ] (Shift: +0.021 >)
    Steer (α=-4) : +5.814 [ | ● ] (Shift: +0.810 >)
    Steer (α=-2) : +5.184 [ | ● ] (Shift: +0.179 >)
    Steer (α=-1) : +5.101 [ | ● ] (Shift: +0.096 >)
    Steer (α=+2) : +5.074 [ | ● ] (Shift: +0.069 >)
    Steer (α=+4) : +5.523 [ | ● ] (Shift: +0.519 >)
    Group +6 (n= 97) | baseline mu = +6.065 sigma = 0.1866
    ---------------------------------------------------------------------------------
    Original (α= 1) : +6.065 [ | ● ]
    Ablation (α= 0) : +5.603 [ | ● ] (Shift: -0.462 <)
    Steer (α=-4) : +4.931 [ | ● ] (Shift: -1.134 <)
    Steer (α=-2) : +4.792 [ | ● ] (Shift: -1.273 <)
    Steer (α=-1) : +5.121 [ | ● ] (Shift: -0.945 <)
    Steer (α=+2) : +6.416 [ | ● ] (Shift: +0.351 >)
    Steer (α=+4) : +7.336 [ | ● ] (Shift: +1.271 >)
    Group +7 (n= 66) | baseline mu = +7.039 sigma = 0.1674
    ---------------------------------------------------------------------------------
    Original (α= 1) : +7.039 [ | ● ]
    Ablation (α= 0) : +6.574 [ | ● ] (Shift: -0.465 <)
    Steer (α=-4) : +5.862 [ | ● ] (Shift: -1.177 <)
    Steer (α=-2) : +5.576 [ | ● ] (Shift: -1.463 <)
    Steer (α=-1) : +6.036 [ | ● ] (Shift: -1.003 <)
    Steer (α=+2) : +7.517 [ | ● ] (Shift: +0.478 >)
    Steer (α=+4) : +8.697 [ | ● ] (Shift: +1.657 >)
    Group +8 (n= 45) | baseline mu = +8.038 sigma = 0.1778
    ---------------------------------------------------------------------------------
    Original (α= 1) : +8.038 [ | ● ]
    Ablation (α= 0) : +7.770 [ | ● ] (Shift: -0.268 <)
    Steer (α=-4) : +7.806 [ | ● ] (Shift: -0.232 <)
    Steer (α=-2) : +7.260 [ | ● ] (Shift: -0.778 <)
    Steer (α=-1) : +7.453 [ | ● ] (Shift: -0.585 <)
    Steer (α=+2) : +8.297 [ | ● ] (Shift: +0.259 >)
    Steer (α=+4) : +8.927 [ | ● ] (Shift: +0.889 >)
    Group +9 (n= 24) | baseline mu = +8.965 sigma = 0.1090
    ---------------------------------------------------------------------------------
    Original (α= 1) : +8.965 [ | ● ]
    Ablation (α= 0) : +8.838 [ | ● ] (Shift: -0.127 <)
    Steer (α=-4) : +8.501 [ | ● ] (Shift: -0.464 <)
    Steer (α=-2) : +8.402 [ | ● ] (Shift: -0.563 <)
    Steer (α=-1) : +8.669 [ | ● ] (Shift: -0.296 <)
    Steer (α=+2) : +9.123 [ | ● ] (Shift: +0.158 >)
    Steer (α=+4) : +9.694 [ | ● ] (Shift: +0.729 >)
    Group +10 (n= 18) | baseline mu = +9.814 sigma = 0.1806
    ---------------------------------------------------------------------------------
    Original (α= 1) : +9.814 [ | ● ]
    Ablation (α= 0) : +9.477 [ | ● ] (Shift: -0.336 <)
    Steer (α=-4) : +8.466 [ | ● ] (Shift: -1.348 <)
    Steer (α=-2) : +8.501 [ | ● ] (Shift: -1.313 <)
    Steer (α=-1) : +8.993 [ | ● ] (Shift: -0.820 <)
    Steer (α=+2) : +10.160 [ | ● ] (Shift: +0.346 >)
    Steer (α=+4) : +11.023 [ | ● ] (Shift: +1.209 >)
    =====================================================================================

[Figure: Feature #1347]

Feature #576 — Hyperlocal (global rank 1753 / 2048)

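Fires mainly for samples whose output falls in the −9 to −6 band, with fainter traces in groups −5 to −3. The ablation shift rounds to +0.000 for 14 of the 20 groups and never exceeds 0.004. In the steering section, every α returns the identical prediction for the 13 groups where the feature never fires (−10, −2, −1 and all positive groups): scaling an activation that is already zero does nothing.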

    (act-abl) C:\Workspace\Git_Repos\MLP-Experiments\Feature-Ranking-et-Throttle-Testing>python verify_feature.py 576
    [verify_feature] Feature #576 | Device: cuda
    Global rank : 1753 / 2048 (1 = least impactful, 2048 = most)
    Groups : [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    =====================================================================================
    ABLATION — Feature #576 (feature activation zeroed out)
    =====================================================================================
    Group -10 (n= 16) : -9.799 (baseline) [ |> ] -9.799 (ablated) (+0.000)
    Group -9 (n= 31) : -8.849 (baseline) [ |████> ] -8.847 (ablated) (+0.002)
    Group -8 (n= 54) : -7.940 (baseline) [ |█> ] -7.939 (ablated) (+0.001)
    Group -7 (n= 67) : -6.935 (baseline) [ |█████> ] -6.933 (ablated) (+0.003)
    Group -6 (n= 82) : -5.968 (baseline) [ |████████>] -5.964 (ablated) (+0.004)
    Group -5 (n= 45) : -5.030 (baseline) [ |█> ] -5.029 (ablated) (+0.001)
    Group -4 (n= 36) : -3.972 (baseline) [ |█> ] -3.971 (ablated) (+0.001)
    Group -3 (n= 49) : -3.019 (baseline) [ |> ] -3.019 (ablated) (+0.000)
    Group -2 (n= 64) : -1.963 (baseline) [ |> ] -1.963 (ablated) (+0.000)
    Group -1 (n= 56) : -0.879 (baseline) [ |> ] -0.879 (ablated) (+0.000)
    Group +1 (n= 59) : +1.033 (baseline) [ |> ] +1.033 (ablated) (+0.000)
    Group +2 (n= 61) : +2.020 (baseline) [ |> ] +2.020 (ablated) (+0.000)
    Group +3 (n= 52) : +3.091 (baseline) [ |> ] +3.091 (ablated) (+0.000)
    Group +4 (n= 48) : +4.097 (baseline) [ |> ] +4.097 (ablated) (+0.000)
    Group +5 (n= 30) : +5.005 (baseline) [ |> ] +5.005 (ablated) (+0.000)
    Group +6 (n= 97) : +6.065 (baseline) [ |> ] +6.065 (ablated) (+0.000)
    Group +7 (n= 66) : +7.039 (baseline) [ |> ] +7.039 (ablated) (+0.000)
    Group +8 (n= 45) : +8.038 (baseline) [ |> ] +8.038 (ablated) (+0.000)
    Group +9 (n= 24) : +8.965 (baseline) [ |> ] +8.965 (ablated) (+0.000)
    Group +10 (n= 18) : +9.814 (baseline) [ |> ] +9.814 (ablated) (+0.000)
    =====================================================================================
    =====================================================================================
    ACTIVATION STEERING — Feature #576 (α scales the feature activation)
    =====================================================================================
    Group -10 (n= 16) | baseline mu = -9.799 sigma = 0.1399
    ---------------------------------------------------------------------------------
    Original (α= 1) : -9.799 [● | ]
    Ablation (α= 0) : -9.799 [● | ] (Shift: +0.000 >)
    Steer (α=-4) : -9.799 [● | ] (Shift: +0.000 >)
    Steer (α=-2) : -9.799 [● | ] (Shift: +0.000 >)
    Steer (α=-1) : -9.799 [● | ] (Shift: +0.000 >)
    Steer (α=+2) : -9.799 [● | ] (Shift: +0.000 >)
    Steer (α=+4) : -9.799 [● | ] (Shift: +0.000 >)
    Group -9 (n= 31) | baseline mu = -8.849 sigma = 0.1538
    ---------------------------------------------------------------------------------
    Original (α= 1) : -8.849 [ ● | ]
    Ablation (α= 0) : -8.847 [ ● | ] (Shift: +0.002 >)
    Steer (α=-4) : -8.838 [ ● | ] (Shift: +0.011 >)
    Steer (α=-2) : -8.842 [ ● | ] (Shift: +0.006 >)
    Steer (α=-1) : -8.844 [ ● | ] (Shift: +0.004 >)
    Steer (α=+2) : -8.851 [ ● | ] (Shift: -0.002 <)
    Steer (α=+4) : -8.855 [ ● | ] (Shift: -0.006 <)
    Group -8 (n= 54) | baseline mu = -7.940 sigma = 0.1630
    ---------------------------------------------------------------------------------
    Original (α= 1) : -7.940 [ ● | ]
    Ablation (α= 0) : -7.939 [ ● | ] (Shift: +0.001 >)
    Steer (α=-4) : -7.936 [ ● | ] (Shift: +0.004 >)
    Steer (α=-2) : -7.938 [ ● | ] (Shift: +0.002 >)
    Steer (α=-1) : -7.939 [ ● | ] (Shift: +0.001 >)
    Steer (α=+2) : -7.941 [ ● | ] (Shift: -0.001 <)
    Steer (α=+4) : -7.942 [ ● | ] (Shift: -0.002 <)
    Group -7 (n= 67) | baseline mu = -6.935 sigma = 0.1585
    ---------------------------------------------------------------------------------
    Original (α= 1) : -6.935 [ ● | ]
    Ablation (α= 0) : -6.933 [ ● | ] (Shift: +0.003 >)
    Steer (α=-4) : -6.922 [ ● | ] (Shift: +0.013 >)
    Steer (α=-2) : -6.927 [ ● | ] (Shift: +0.008 >)
    Steer (α=-1) : -6.930 [ ● | ] (Shift: +0.005 >)
    Steer (α=+2) : -6.938 [ ● | ] (Shift: -0.003 <)
    Steer (α=+4) : -6.943 [ ● | ] (Shift: -0.008 <)
    Group -6 (n= 82) | baseline mu = -5.968 sigma = 0.2195
    ---------------------------------------------------------------------------------
    Original (α= 1) : -5.968 [ ● | ]
    Ablation (α= 0) : -5.964 [ ● | ] (Shift: +0.004 >)
    Steer (α=-4) : -5.947 [ ● | ] (Shift: +0.021 >)
    Steer (α=-2) : -5.956 [ ● | ] (Shift: +0.013 >)
    Steer (α=-1) : -5.960 [ ● | ] (Shift: +0.008 >)
    Steer (α=+2) : -5.973 [ ● | ] (Shift: -0.004 <)
    Steer (α=+4) : -5.981 [ ● | ] (Shift: -0.013 <)
    Group -5 (n= 45) | baseline mu = -5.030 sigma = 0.1520
    ---------------------------------------------------------------------------------
    Original (α= 1) : -5.030 [ ● | ]
    Ablation (α= 0) : -5.029 [ ● | ] (Shift: +0.001 >)
    Steer (α=-4) : -5.027 [ ● | ] (Shift: +0.003 >)
    Steer (α=-2) : -5.028 [ ● | ] (Shift: +0.002 >)
    Steer (α=-1) : -5.029 [ ● | ] (Shift: +0.001 >)
    Steer (α=+2) : -5.031 [ ● | ] (Shift: -0.001 <)
    Steer (α=+4) : -5.032 [ ● | ] (Shift: -0.002 <)
    Group -4 (n= 36) | baseline mu = -3.972 sigma = 0.1603
    ---------------------------------------------------------------------------------
    Original (α= 1) : -3.972 [ ● | ]
    Ablation (α= 0) : -3.971 [ ● | ] (Shift: +0.001 >)
    Steer (α=-4) : -3.968 [ ● | ] (Shift: +0.003 >)
    Steer (α=-2) : -3.970 [ ● | ] (Shift: +0.002 >)
    Steer (α=-1) : -3.970 [ ● | ] (Shift: +0.001 >)
    Steer (α=+2) : -3.972 [ ● | ] (Shift: -0.001 <)
    Steer (α=+4) : -3.974 [ ● | ] (Shift: -0.002 <)
    Group -3 (n= 49) | baseline mu = -3.019 sigma = 0.1516
    ---------------------------------------------------------------------------------
    Original (α= 1) : -3.019 [ ● | ]
    Ablation (α= 0) : -3.019 [ ● | ] (Shift: +0.000 >)
    Steer (α=-4) : -3.018 [ ● | ] (Shift: +0.001 >)
    Steer (α=-2) : -3.018 [ ● | ] (Shift: +0.001 >)
    Steer (α=-1) : -3.019 [ ● | ] (Shift: +0.000 >)
    Steer (α=+2) : -3.019 [ ● | ] (Shift: -0.000 <)
    Steer (α=+4) : -3.020 [ ● | ] (Shift: -0.001 <)
    Group -2 (n= 64) | baseline mu = -1.963 sigma = 0.1718
    ---------------------------------------------------------------------------------
    Original (α= 1) : -1.963 [ ● | ]
    Ablation (α= 0) : -1.963 [ ● | ] (Shift: +0.000 >)
    Steer (α=-4) : -1.963 [ ● | ] (Shift: +0.000 >)
    Steer (α=-2) : -1.963 [ ● | ] (Shift: +0.000 >)
    Steer (α=-1) : -1.963 [ ● | ] (Shift: +0.000 >)
    Steer (α=+2) : -1.963 [ ● | ] (Shift: +0.000 >)
    Steer (α=+4) : -1.963 [ ● | ] (Shift: +0.000 >)
    Group -1 (n= 56) | baseline mu = -0.879 sigma = 0.1642
    ---------------------------------------------------------------------------------
    Original (α= 1) : -0.879 [ ●| ]
    Ablation (α= 0) : -0.879 [ ●| ] (Shift: +0.000 >)
    Steer (α=-4) : -0.879 [ ●| ] (Shift: +0.000 >)
    Steer (α=-2) : -0.879 [ ●| ] (Shift: +0.000 >)
    Steer (α=-1) : -0.879 [ ●| ] (Shift: +0.000 >)
    Steer (α=+2) : -0.879 [ ●| ] (Shift: +0.000 >)
    Steer (α=+4) : -0.879 [ ●| ] (Shift: +0.000 >)
    Group +1 (n= 59) | baseline mu = +1.033 sigma = 0.1740
    ---------------------------------------------------------------------------------
    Original (α= 1) : +1.033 [ | ● ]
    Ablation (α= 0) : +1.033 [ | ● ] (Shift: +0.000 >)
    Steer (α=-4) : +1.033 [ | ● ] (Shift: +0.000 >)
    Steer (α=-2) : +1.033 [ | ● ] (Shift: +0.000 >)
    Steer (α=-1) : +1.033 [ | ● ] (Shift: +0.000 >)
    Steer (α=+2) : +1.033 [ | ● ] (Shift: +0.000 >)
    Steer (α=+4) : +1.033 [ | ● ] (Shift: +0.000 >)
    Group +2 (n= 61) | baseline mu = +2.020 sigma = 0.2088
    ---------------------------------------------------------------------------------
    Original (α= 1) : +2.020 [ | ● ]
    Ablation (α= 0) : +2.020 [ | ● ] (Shift: +0.000 >)
    Steer (α=-4) : +2.020 [ | ● ] (Shift: +0.000 >)
    Steer (α=-2) : +2.020 [ | ● ] (Shift: +0.000 >)
    Steer (α=-1) : +2.020 [ | ● ] (Shift: +0.000 >)
    Steer (α=+2) : +2.020 [ | ● ] (Shift: +0.000 >)
    Steer (α=+4) : +2.020 [ | ● ] (Shift: +0.000 >)
    Group +3 (n= 52) | baseline mu = +3.091 sigma = 0.1657
    ---------------------------------------------------------------------------------
    Original (α= 1) : +3.091 [ | ● ]
    Ablation (α= 0) : +3.091 [ | ● ] (Shift: +0.000 >)
    Steer (α=-4) : +3.091 [ | ● ] (Shift: +0.000 >)
    Steer (α=-2) : +3.091 [ | ● ] (Shift: +0.000 >)
    Steer (α=-1) : +3.091 [ | ● ] (Shift: +0.000 >)
    Steer (α=+2) : +3.091 [ | ● ] (Shift: +0.000 >)
    Steer (α=+4) : +3.091 [ | ● ] (Shift: +0.000 >)
    Group +4 (n= 48) | baseline mu = +4.097 sigma = 0.1863
    ---------------------------------------------------------------------------------
    Original (α= 1) : +4.097 [ | ● ]
    Ablation (α= 0) : +4.097 [ | ● ] (Shift: +0.000 >)
    Steer (α=-4) : +4.097 [ | ● ] (Shift: +0.000 >)
    Steer (α=-2) : +4.097 [ | ● ] (Shift: +0.000 >)
    Steer (α=-1) : +4.097 [ | ● ] (Shift: +0.000 >)
    Steer (α=+2) : +4.097 [ | ● ] (Shift: +0.000 >)
    Steer (α=+4) : +4.097 [ | ● ] (Shift: +0.000 >)
    Group +5 (n= 30) | baseline mu = +5.005 sigma = 0.2205
    ---------------------------------------------------------------------------------
    Original (α= 1) : +5.005 [ | ● ]
    Ablation (α= 0) : +5.005 [ | ● ] (Shift: +0.000 >)
    Steer (α=-4) : +5.005 [ | ● ] (Shift: +0.000 >)
    Steer (α=-2) : +5.005 [ | ● ] (Shift: +0.000 >)
    Steer (α=-1) : +5.005 [ | ● ] (Shift: +0.000 >)
    Steer (α=+2) : +5.005 [ | ● ] (Shift: +0.000 >)
    Steer (α=+4) : +5.005 [ | ● ] (Shift: +0.000 >)
    Group +6 (n= 97) | baseline mu = +6.065 sigma = 0.1866
    ---------------------------------------------------------------------------------
    Original (α= 1) : +6.065 [ | ● ]
    Ablation (α= 0) : +6.065 [ | ● ] (Shift: +0.000 >)
    Steer (α=-4) : +6.065 [ | ● ] (Shift: +0.000 >)
    Steer (α=-2) : +6.065 [ | ● ] (Shift: +0.000 >)
    Steer (α=-1) : +6.065 [ | ● ] (Shift: +0.000 >)
    Steer (α=+2) : +6.065 [ | ● ] (Shift: +0.000 >)
    Steer (α=+4) : +6.065 [ | ● ] (Shift: +0.000 >)
    Group +7 (n= 66) | baseline mu = +7.039 sigma = 0.1674
    ---------------------------------------------------------------------------------
    Original (α= 1) : +7.039 [ | ● ]
    Ablation (α= 0) : +7.039 [ | ● ] (Shift: +0.000 >)
    Steer (α=-4) : +7.039 [ | ● ] (Shift: +0.000 >)
    Steer (α=-2) : +7.039 [ | ● ] (Shift: +0.000 >)
    Steer (α=-1) : +7.039 [ | ● ] (Shift: +0.000 >)
    Steer (α=+2) : +7.039 [ | ● ] (Shift: +0.000 >)
    Steer (α=+4) : +7.039 [ | ● ] (Shift: +0.000 >)
    Group +8 (n= 45) | baseline mu = +8.038 sigma = 0.1778
    ---------------------------------------------------------------------------------
    Original (α= 1) : +8.038 [ | ● ]
    Ablation (α= 0) : +8.038 [ | ● ] (Shift: +0.000 >)
    Steer (α=-4) : +8.038 [ | ● ] (Shift: +0.000 >)
    Steer (α=-2) : +8.038 [ | ● ] (Shift: +0.000 >)
    Steer (α=-1) : +8.038 [ | ● ] (Shift: +0.000 >)
    Steer (α=+2) : +8.038 [ | ● ] (Shift: +0.000 >)
    Steer (α=+4) : +8.038 [ | ● ] (Shift: +0.000 >)
    Group +9 (n= 24) | baseline mu = +8.965 sigma = 0.1090
    ---------------------------------------------------------------------------------
    Original (α= 1) : +8.965 [ | ● ]
    Ablation (α= 0) : +8.965 [ | ● ] (Shift: +0.000 >)
    Steer (α=-4) : +8.965 [ | ● ] (Shift: +0.000 >)
    Steer (α=-2) : +8.965 [ | ● ] (Shift: +0.000 >)
    Steer (α=-1) : +8.965 [ | ● ] (Shift: +0.000 >)
    Steer (α=+2) : +8.965 [ | ● ] (Shift: +0.000 >)
    Steer (α=+4) : +8.965 [ | ● ] (Shift: +0.000 >)
    Group +10 (n= 18) | baseline mu = +9.814 sigma = 0.1806
    ---------------------------------------------------------------------------------
    Original (α= 1) : +9.814 [ | ● ]
    Ablation (α= 0) : +9.814 [ | ● ] (Shift: +0.000 >)
    Steer (α=-4) : +9.814 [ | ● ] (Shift: +0.000 >)
    Steer (α=-2) : +9.814 [ | ● ] (Shift: +0.000 >)
    Steer (α=-1) : +9.814 [ | ● ] (Shift: +0.000 >)
    Steer (α=+2) : +9.814 [ | ● ] (Shift: +0.000 >)
    Steer (α=+4) : +9.814 [ | ● ] (Shift: +0.000 >)
    =====================================================================================

[Figure: Feature #576]

                           Feature #1347                                        Feature #576
    Global rank            2048 / 2048                                          1753 / 2048
    Groups affected        All 20                                               ~4 clearly (groups −9 to −6)
    Ablation direction     Mostly flips across sign boundary                    Always toward zero
    Steering sensitivity   High: α=−4 shifts predictions by up to 3.8 units     Negligible at any α
    Role                   Encodes output magnitude universally                 Narrow detector for mid-negative outputs

Feature #1347 is causally load-bearing for the whole model. Feature #576 illustrates why the SAE needs 2048 dimensions — fine-grained detectors that handle specific output regions the broad features miss.
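The contrast between the two features can be put into numbers with a few lines of Python. The shift values below are copied from the ablation sections of the two verify_feature.py runs shown earlier. The 0.001 threshold and the helper names are assumptions chosen for illustration; they are not part of the pipeline.

```python
# Ablation shifts per group, ordered -10 ... -1, then +1 ... +10.
SHIFTS_1347 = [0.198, 0.159, 0.041, 0.192, 0.110, 0.211, -0.051, -0.216,
               0.010, 0.074, 0.023, -0.158, -0.299, -0.355, 0.021, -0.462,
               -0.465, -0.268, -0.127, -0.336]
SHIFTS_576 = [0.000, 0.002, 0.001, 0.003, 0.004, 0.001, 0.001,
              0.000, 0.000, 0.000] + [0.000] * 10

def groups_affected(shifts, threshold):
    """Count groups whose absolute ablation shift reaches the threshold."""
    return sum(abs(s) >= threshold for s in shifts)

def peak_shift(shifts):
    """Largest absolute ablation shift across all groups."""
    return max(abs(s) for s in shifts)

affected_1347 = groups_affected(SHIFTS_1347, 0.001)   # every group moves
affected_576 = groups_affected(SHIFTS_576, 0.001)     # only a few groups move
```

At the printed three-decimal precision, feature #1347 moves all 20 groups with a peak shift of 0.465, while feature #576 moves 6 groups with a peak of 0.004, roughly two orders of magnitude smaller.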


Output Files Summary

    File                          Size (typical)   Description
    results/feature_ranks.pt      ~0.5 MB          Per-group feature rankings and mean activations
    results/throttle_results.pt   ~145 MB          All ablation + steering prediction lists
    results/reports.html          ~11.6 MB         Self-contained ridgeline report viewer
    results/index.html            ~0.5 MB          Self-contained D3 network graph

Directory Structure

    Feature-Ranking-et-Throttle-Testing/
    ├── dataset/
    │   ├── mlp_test.xlsx          ← used by pipeline
    │   ├── mlp_train.xlsx
    │   ├── mlp_val.xlsx
    │   ├── extrap_test.xlsx
    │   ├── interp_test.xlsx
    │   ├── precision_test.xlsx
    │   └── scaling_test.xlsx
    ├── mlp/
    │   ├── mlp_definition.py      ← InterpretabilityMLP class
    │   └── perfect_mlp.pth        ← pre-trained weights
    ├── sae/
    │   ├── sae_definition.py      ← SparseAutoencoder class
    │   └── universal_sae.pth      ← pre-trained weights
    ├── results/                   ← generated by pipeline
    │   ├── feature_ranks.pt
    │   ├── throttle_results.pt
    │   ├── reports.html
    │   └── index.html
    ├── feature_ranking.py         ← step 2
    ├── throttle_testing.py        ← step 3
    ├── generate_reports.py        ← step 4a
    ├── network_graph.py           ← step 4b
    ├── verify_feature.py          ← manual verification (standalone)
    ├── workflow.bat               ← full pipeline runner
    ├── workflow.log               ← sample pipeline run log
    └── reqs.txt                   ← pip dependencies