Inspect model & pipeline

The trained NER model doesn't perform well on unseen 'real world' data. While its performance on the evaluation data is > 0.90 (F1 score), its performance on real-world data is significantly lower. This might indicate issues in the dataset, which should be analyzed.
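One possible first check is whether the evaluation set mostly contains entities the model already saw during training, which could inflate the evaluation score relative to real-world text. The sketch below is only an illustration under assumptions: the function name `seen_entity_ratio`, the `(entity_text, label)` data layout, and the toy data are all hypothetical and not taken from the actual project.

```python
from typing import Iterable, List, Tuple

# Each example is a list of (entity_text, label) pairs. This layout is an
# assumption for illustration, not the format of the original dataset.
Example = List[Tuple[str, str]]


def seen_entity_ratio(train: Iterable[Example], other: Iterable[Example]) -> float:
    """Fraction of entity mentions in `other` whose (text, label) pair also occurs in `train`."""
    # Lowercase the entity text so that casing differences do not hide overlap.
    train_entities = {(text.lower(), label) for ex in train for text, label in ex}
    mentions = [(text.lower(), label) for ex in other for text, label in ex]
    if not mentions:
        return 0.0
    seen = sum(1 for mention in mentions if mention in train_entities)
    return seen / len(mentions)


# Toy data, invented for the example.
train = [[("Berlin", "LOC"), ("Siemens", "ORG")], [("Paris", "LOC")]]
eval_set = [[("Berlin", "LOC")], [("Siemens", "ORG"), ("Paris", "LOC")]]
real_world = [[("Lisbon", "LOC"), ("Berlin", "LOC")], [("Acme", "ORG"), ("Oslo", "LOC")]]

eval_ratio = seen_entity_ratio(train, eval_set)
real_ratio = seen_entity_ratio(train, real_world)
print(f"eval: {eval_ratio:.2f}, real world: {real_ratio:.2f}")
# prints "eval: 1.00, real world: 0.25"
```

A much higher ratio on the evaluation set than on the real-world sample would suggest that the evaluation split is too similar to the training data to predict real-world performance. It would not prove this is the cause, but it narrows down where to look next.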