Changed eval metric to F1 score, because accuracy rewards images with a lot of background
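
A minimal sketch of the motivation, assuming a binary segmentation-style task where label 0 is background and 1 is foreground. The task type, the helper names `pixel_accuracy` and `f1_score`, and the toy masks are illustrative assumptions, not code from this merge request.

```python
def pixel_accuracy(y_true, y_pred):
    """Fraction of pixels whose predicted label matches the ground truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive (foreground) class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    # Convention chosen for this sketch: 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


# Hypothetical flattened mask: 95 background pixels, 5 foreground pixels.
y_true = [0] * 95 + [1] * 5
# A degenerate model that predicts background everywhere.
y_pred = [0] * 100

acc = pixel_accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")  # accuracy=0.95, f1=0.00
```

Under these assumptions, the all-background prediction scores 95% accuracy while finding no foreground at all. F1 on the foreground class ignores true negatives, so it drops to 0 in the same case.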
