Monosemanticity-MLP-Interpretability

Introduction

This repository is a small-scale replication of the dictionary-learning approach to monosemanticity: a 3-layer MLP is trained on a synthetic index-based arithmetic task, its 512-unit hidden activations are harvested, and an overcomplete Sparse Autoencoder (2048 features) is trained on those activations to decompose the polysemantic hidden layer into individually interpretable features.

Problem Statement

Each sample is a flat 10-digit input vector encoding two value columns and two pointer fields. The network must retrieve the value selected by each pointer and emit the absolute difference:

\[ \text{Target} = | \text{Col}_{1}[\text{Pointer}_{1}] - \text{Col}_{2}[\text{Pointer}_{2}] | \]
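A minimal sketch of how a sample can be generated under this scheme. The field layout below (two 4-value columns followed by two pointers) is an illustrative assumption; data_generator.py's actual encoding may differ:

```python
import random

def make_sample(n=4, max_val=9):
    """Build one index-based arithmetic sample: two value columns,
    two pointers into them, and the absolute-difference target."""
    col1 = [random.randint(0, max_val) for _ in range(n)]
    col2 = [random.randint(0, max_val) for _ in range(n)]
    p1, p2 = random.randrange(n), random.randrange(n)
    features = col1 + col2 + [p1, p2]      # flat 10-dim input vector (assumed layout)
    target = abs(col1[p1] - col2[p2])      # Target = |Col1[p1] - Col2[p2]|
    return features, float(target)
```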

Folder Structure

monosemanticity-mlp-interpretability/
├── dataset/
│   ├── data_generator.py       # Logic for creating the index-based arithmetic samples.
│   ├── data_loader.py          # Utility to parse Excel lists into PyTorch tensors.
│   ├── mlp_train.xlsx          # 8,000 samples for model optimization.
│   ├── mlp_test.xlsx           # 1,000 samples for final accuracy verification.
│   └── mlp_val.xlsx            # 1,000 samples for hyperparameter tuning.
├── mlp/
│   ├── mlp_definition.py       # 3-layer bottleneck architecture with activation hooks.
│   └── perfect_mlp.pth         # Trained weights achieving near-zero MSE.
├── sae/
│   ├── sae_definition.py       # Overcomplete Sparse Autoencoder (2048 hidden features).
│   └── sae_model.pth           # Trained weights after L1-penalized dictionary learning.
├── train_mlp.py                # Main script to optimize the MLP on the indexing task.
├── harvest_activations.py      # Extracts hidden-layer snapshots into a tensor file.
├── mlp_activations.pt          # The "activation dataset" used to train the SAE.
├── train_sae.py                # Script to train the SAE using reconstruction + L1 loss.
├── feature_probe.py            # Individual test tool to see which SAE features fire.
└── feature_reports.py          # Generates HTML visualizations of feature-neuron mappings.
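The activation tensor in the log (torch.Size([8000, 512])) fixes the harvested hidden layer at 512 units. Beyond that, the contents of mlp_definition.py are not shown here, so the following is only a plausible sketch of the 3-layer bottleneck-plus-hook pattern it describes; the bottleneck width and layer ordering are assumptions:

```python
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    """3-layer MLP; a forward hook snapshots the 512-unit hidden layer
    that harvest_activations.py would save to mlp_activations.pt."""
    def __init__(self, in_dim=10, hidden=512, bottleneck=64):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.squeeze = nn.Sequential(nn.Linear(hidden, bottleneck), nn.ReLU())
        self.head = nn.Linear(bottleneck, 1)
        self.last_hidden = None  # populated by the hook on every forward pass
        self.hidden.register_forward_hook(
            lambda mod, inp, out: setattr(self, "last_hidden", out.detach())
        )

    def forward(self, x):
        return self.head(self.squeeze(self.hidden(x))).squeeze(-1)
```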

Env Setup

conda create -n mlp python=3.11.6
conda activate mlp
python -m pip install -r reqs.txt

Execution Steps

The full pipeline is driven by workflow.bat, which chains seven stages: activate the environment, generate the dataset, train the MLP, harvest activations, train the SAE, run the feature probe, and generate the feature reports. The individual scripts (dataset/data_generator.py, train_mlp.py, harvest_activations.py, train_sae.py, feature_probe.py, feature_reports.py) can also be run manually in that order.

Execution Log

C:\Workspace\Git_Repos\monosemanticity-mlp-interpretability>workflow.bat
[1/7] Activating Environment...
[2/7] Generating Dataset...
Generating 8000 rows for mlp_train.xlsx...
Successfully saved mlp_train.xlsx
Generating 1000 rows for mlp_val.xlsx...
Successfully saved mlp_val.xlsx
Generating 1000 rows for mlp_test.xlsx...
Successfully saved mlp_test.xlsx
[3/7] Training MLP...
Epoch 50  | Val MSE: 0.709485
Epoch 100 | Val MSE: 0.813464
Epoch 150 | Val MSE: 0.447466
Epoch 200 | Val MSE: 0.347589
Epoch 250 | Val MSE: 0.199275
Epoch 300 | Val MSE: 0.185669
Epoch 350 | Val MSE: 0.190938
Epoch 400 | Val MSE: 0.154719
Epoch 450 | Val MSE: 0.150248
Epoch 500 | Val MSE: 0.147577
[4/7] Harvesting Activations...
Harvesting activations...
Success! Saved tensor of shape: torch.Size([8000, 512])
[5/7] Training Sparse Autoencoder (SAE)...
Loaded activations: torch.Size([8000, 512])
SAE Epoch [10/100]  | Loss: 0.004299
SAE Epoch [20/100]  | Loss: 0.003039
SAE Epoch [30/100]  | Loss: 0.002528
SAE Epoch [40/100]  | Loss: 0.002164
SAE Epoch [50/100]  | Loss: 0.001973
SAE Epoch [60/100]  | Loss: 0.001845
SAE Epoch [70/100]  | Loss: 0.001711
SAE Epoch [80/100]  | Loss: 0.001621
SAE Epoch [90/100]  | Loss: 0.001613
SAE Epoch [100/100] | Loss: 0.001476
SAE training complete. Weights saved.
[6/7] Running Feature Probe...
--- Interpretability Report ---
Sample Input: [8, 9, 5, 1, 3, 2, 9, 4, 7, 1] | Expected Output: 8.0
MLP Output: 7.4849
Number of active SAE features: 62
Top Active Features (Monosemantic Candidates):
  Feature #1649 | Activation: 0.6050
  Feature #1440 | Activation: 0.5738
  Feature #1608 | Activation: 0.4926
  Feature #2028 | Activation: 0.3303
  Feature # 725 | Activation: 0.3191
--- Interpretability Report ---
Sample Input: [8, 9, 5, 2, 3, 2, 8, 4, 7, 1] | Expected Output: 6.0
MLP Output: 5.6758
Number of active SAE features: 66
Top Active Features (Monosemantic Candidates):
  Feature #1440 | Activation: 0.5455
  Feature #1649 | Activation: 0.4891
  Feature # 725 | Activation: 0.3875
  Feature #1608 | Activation: 0.3647
  Feature #  72 | Activation: 0.3016
--- Interpretability Report ---
Sample Input: [8, 9, 5, 3, 3, 2, 7, 4, 7, 1] | Expected Output: 4.0
MLP Output: 3.4448
Number of active SAE features: 60
Top Active Features (Monosemantic Candidates):
  Feature #1440 | Activation: 0.5258
  Feature # 725 | Activation: 0.4585
  Feature #1649 | Activation: 0.4047
  Feature #1478 | Activation: 0.2813
  Feature #  72 | Activation: 0.2565
--- Interpretability Report ---
Sample Input: [8, 9, 5, 4, 3, 2, 5, 4, 7, 1] | Expected Output: 1.0
MLP Output: 0.8271
Number of active SAE features: 59
Top Active Features (Monosemantic Candidates):
  Feature #1440 | Activation: 0.4882
  Feature # 725 | Activation: 0.4855
  Feature #1649 | Activation: 0.3640
  Feature #1478 | Activation: 0.3147
  Feature #1212 | Activation: 0.1984
--- Interpretability Report ---
Sample Input: [8, 9, 5, 5, 3, 2, 4, 4, 7, 1] | Expected Output: 1.0
MLP Output: 1.3972
Number of active SAE features: 64
Top Active Features (Monosemantic Candidates):
  Feature # 725 | Activation: 0.4864
  Feature #1440 | Activation: 0.4512
  Feature #1649 | Activation: 0.3346
  Feature #1478 | Activation: 0.3299
  Feature #1212 | Activation: 0.3073
[7/7] Generating Feature Reports...
Tracing logic flow through the entire circuit...
Clean, centered report saved to circuit_trace_detailed.xlsx
Logic heatmap saved to: C:\Workspace\Git_Repos\monosemanticity-mlp-interpretability\logic_circuit_map.html
Stacked norm dist saved to: C:\Workspace\Git_Repos\monosemanticity-mlp-interpretability\circuit_bell_curves.html
Sankey diagram saved to: C:\Workspace\Git_Repos\monosemanticity-mlp-interpretability\uhd_bold_sankey.html
======================================================
Pipeline Complete: Monosemantic Features Identified.
======================================================
------------------------------------------------------
Execution Summary:
Started:  01:02:39
Finished: 01:23:48
Duration: 21 m 9 s
------------------------------------------------------
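For reference, step [5/7] is standard L1-penalized dictionary learning. The sketch below uses the dimensions from the repo (512-d activations, 2048 features); the learning rate, sparsity coefficient, and full-batch loop are assumptions, not train_sae.py's actual settings:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete SAE: 512-d activations -> 2048 sparse features -> 512-d."""
    def __init__(self, d_in=512, d_hidden=2048):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)   # assumed hyperparameters
acts = torch.load("mlp_activations.pt")             # torch.Size([8000, 512])

for epoch in range(100):
    recon, feats = sae(acts)
    # reconstruction fidelity + L1 sparsity penalty on the feature code
    loss = nn.functional.mse_loss(recon, acts) + 1e-3 * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Training on the full 8,000 x 512 activation matrix in one batch is feasible at this scale; a larger activation dataset would call for minibatching.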

Experimental Findings: Mechanistic Interpretability of MLP Circuits

Model Convergence & Reconstruction Fidelity

The MLP's validation MSE falls from 0.709 at epoch 50 to 0.148 at epoch 500 (apart from a transient rise at epoch 100), and the probe outputs track their targets to within about 0.6. The SAE's combined reconstruction + L1 loss drops monotonically from 0.0043 to 0.0015 over 100 epochs, indicating that the 2048-feature dictionary reconstructs the 512-dimensional activations with little information loss while staying sparse: roughly 60 of 2048 features are active per sample.

Identification of Monosemantic Features

A small, stable set of features dominates every probe: #1440, #1649, and #725 appear in the top five for all five inputs, with #1608, #1478, #72, #1212, and #2028 recurring as secondary candidates. As the expected output falls from 8 to 1, the ranking reorders smoothly: #1649's activation decreases monotonically (0.61 to 0.33) while #725's increases (0.32 to 0.49). This is consistent with these features encoding the pointed-to operands or their difference rather than raw input digits.
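The per-sample listings above can be produced with a few lines of probing code. This sketch encodes one harvested activation vector and ranks the nonzero features; feature_probe.py's actual activity threshold and formatting are assumptions:

```python
import torch

@torch.no_grad()
def top_features(sae, activation, k=5, threshold=1e-6):
    """Rank SAE features for one 512-d MLP activation vector."""
    feats = torch.relu(sae.encoder(activation))   # 2048 feature activations
    active = int((feats > threshold).sum())       # count of firing features
    vals, idxs = torch.topk(feats, k)
    print(f"Number of active SAE features: {active}")
    print("Top Active Features (Monosemantic Candidates):")
    for i, v in zip(idxs.tolist(), vals.tolist()):
        print(f"  Feature #{i:4d} | Activation: {v:.4f}")
```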

Structural Circuit Trace

feature_reports.py traces the logic flow through the full circuit and emits four artifacts: a per-feature trace table (circuit_trace_detailed.xlsx), a logic heatmap (logic_circuit_map.html), stacked activation-norm distributions (circuit_bell_curves.html), and a Sankey diagram (uhd_bold_sankey.html) mapping SAE features to the MLP neurons they connect to.
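As one illustration of how such a trace can be assembled, the sketch below links SAE features to MLP neurons via the magnitude of the SAE decoder weights and renders the result with Plotly's Sankey API. feature_reports.py's actual logic and styling are not shown in this snapshot, so treat this as an assumption-laden example:

```python
import plotly.graph_objects as go
import torch

def sankey_from_decoder(sae, features, neurons, path="uhd_bold_sankey.html"):
    """Link each chosen SAE feature to the MLP neurons its decoder row
    writes to, using |weight| as the link value."""
    W = sae.decoder.weight.detach()   # shape: [512 neurons, 2048 features]
    labels = [f"Feature #{f}" for f in features] + [f"Neuron {n}" for n in neurons]
    src, dst, val = [], [], []
    for i, f in enumerate(features):
        for j, n in enumerate(neurons):
            src.append(i)
            dst.append(len(features) + j)
            val.append(float(W[n, f].abs()))
    fig = go.Figure(go.Sankey(node=dict(label=labels),
                              link=dict(source=src, target=dst, value=val)))
    fig.write_html(path)

# e.g. the recurring features from the probe step:
# sankey_from_decoder(sae, features=[1440, 1649, 725], neurons=range(10))
```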

Conclusion

On this toy indexing task, L1-penalized dictionary learning recovers a sparse, reusable feature set from a dense 512-unit hidden layer: roughly 60 of 2048 features fire per input, and a handful of them vary systematically with the quantities the task manipulates. This reproduces, at MLP scale, the core result of Bricken et al. (2023): an overcomplete sparse autoencoder can decompose polysemantic neurons into substantially more monosemantic features.

References

If you use this codebase for research, please cite the original paper that inspired this architecture:

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., ... & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models with Dictionary Learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features