Diagnosing FinBERT's neutral-class blind spot

Financial sentiment models are usually marketed on a single headline number: overall accuracy. That number hides a lot. When I evaluated FinBERT on a multi-source benchmark of investor text drawn from Reddit, X/Twitter, and professional news, the aggregate looked healthy at 82.0% accuracy and 80.3% macro F1. The story underneath was more interesting, and it became the basis of my final-year project and a first-author journal paper.

The problem with averages

The neutral class was carrying the weakness. Recall on neutral text sat at just 53.3%, well below the polarised classes. In finance this matters more than it sounds: a model that quietly reclassifies "neutral" risk language as "negative" will manufacture pessimism that was never in the source text.
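To show how a healthy aggregate can hide a weak class, here is a minimal sketch that computes overall accuracy alongside per-class recall. The label names, helper functions, and toy predictions are assumptions made up for illustration; they are not the benchmark or the evaluation code behind the figures above.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")

def per_class_recall(y_true, y_pred, labels=LABELS):
    """Recall per class: correct predictions / true instances of that class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in labels if support[c]}

def accuracy(y_true, y_pred):
    """Fraction of examples where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data, invented for illustration (not the benchmark from this post):
# half of the neutral examples are pulled into the negative class.
y_true = ["negative"] * 4 + ["neutral"] * 4 + ["positive"] * 4
y_pred = (["negative"] * 4
          + ["neutral", "neutral", "negative", "negative"]
          + ["positive"] * 4)

overall = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)
print(f"accuracy: {overall:.3f}")
for label, value in recalls.items():
    print(f"recall[{label}]: {value:.3f}")
```

In this toy case the overall accuracy still looks respectable while neutral recall is only one half, which is the same shape of problem described above.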

Rather than treat that as an unavoidable cost, I wanted to know why it happened.

Using explainability as a diagnostic, not decoration

LIME is often bolted on at the end of a project to produce a nice highlight-the-words figure. I used it as an instrument. Tracing the token attributions on misclassified neutral examples revealed a single, consistent failure mode: contextually neutral risk vocabulary (words like "risk", "exposure", "volatility") was being pulled toward the negative class regardless of how it was used.
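The aggregation step behind that kind of diagnosis might look like the sketch below. It assumes the attributions have already been computed and exported as (token, weight) pairs per example, with a positive weight meaning the token pushed the prediction toward the negative class. The function name, the min_count threshold, and the numbers are hypothetical; this is not the project's actual analysis code and it does not call LIME itself.

```python
from collections import defaultdict

def rank_negative_drivers(explanations, min_count=2):
    """Rank tokens by mean attribution toward the negative class.

    `explanations` holds one entry per misclassified neutral example; each
    entry is a list of (token, weight) pairs. Tokens seen fewer than
    `min_count` times are dropped so one-off words do not dominate.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for pairs in explanations:
        for token, weight in pairs:
            key = token.lower()  # treat "Risk" and "risk" as the same token
            totals[key] += weight
            counts[key] += 1
    means = {t: totals[t] / counts[t] for t in totals if counts[t] >= min_count}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical attributions, invented for illustration.
explanations = [
    [("risk", 0.5), ("disclosed", 0.0), ("exposure", 0.25)],
    [("Risk", 0.5), ("volatility", 0.5), ("quarter", -0.25)],
    [("exposure", 0.25), ("volatility", 0.25), ("quarter", 0.0)],
]

ranking = rank_negative_drivers(explanations)
for token, mean_weight in ranking:
    print(f"{token}: {mean_weight:+.3f}")
```

A ranking dominated by a small, thematically related set of tokens is what turns a pile of individual explanations into a nameable failure mode.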

Once you can name the failure mode, you can target it.

A small, surgical fine-tune

The fix did not require a larger model or a bigger dataset. A targeted fine-tune on a disjoint 300-sample corpus, weighted toward exactly that neutral risk language, raised neutral recall by 18.4 percentage points, from 53.3% to 71.7%, and lifted overall accuracy from 82.0% to 86.5%, with negligible degradation on the polarised classes.

To make sure this was a real effect and not noise, the evaluation used paired significance testing (McNemar, p = 0.002) and stratified-bootstrap confidence intervals, with benchmark labels validated by both LLM cross-annotation and an independent human annotator.
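For readers unfamiliar with McNemar's test, here is a minimal sketch of the exact (binomial) form. It compares two models on the same examples and looks only at the discordant pairs, the cases where exactly one model is correct. Whether the paper used the exact or the chi-square variant is not stated here, so treat the choice as an assumption; the counts below are invented and are not the study's data.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: examples the baseline got right and the fine-tuned model got wrong.
    c: examples the baseline got wrong and the fine-tuned model got right.
    Under the null hypothesis each discordant pair is a fair coin flip,
    so the smaller count follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts, invented for illustration.
p_value = mcnemar_exact(b=3, c=15)
print(f"p = {p_value:.4f}")
```

Because the test is paired, examples that both models get right or both get wrong drop out entirely, which is why it suits a before-and-after comparison on a fixed benchmark.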

What I took from it

The result I am proudest of is not the accuracy figure. It is the workflow: measure beyond the average, use explainability to localise the actual fault, then intervene precisely. That loop transfers well beyond sentiment analysis, and it is the way I now approach most applied AI problems.

This work is written up as a first-author paper, "Diagnosing and mitigating the neutral-class deficit in financial sentiment analysis", currently submitted to Data Science in Finance and Economics (AIMS Press).