Neural networks often learn to make predictions that rely on social attributes such as gender and race, which makes the model socially biased. While previous work tackles this issue by training an unbiased model, a stronger and more common requirement in practice is the more challenging setting where the model weights cannot be changed. In this paper, we propose a post-processing debiasing framework and explore new ideas for achieving algorithmic fairness. Our observations show that the biased model learns the bias feature and makes biased decisions based on it. We propose an instance debiasing method based on adversarial attacks that removes the information in the test sample that would activate the model's bias feature, and requires the features of the new instance to deactivate the model's biased decision-making ReLUs, so that the model makes predictions based only on the target feature. We demonstrate the effectiveness of this method in both laboratory scenarios and real applications.
We summarize our main contributions as follows:
The observation is that "not all results are based on task-irrelevant features in the input." Instead, when the bias attribute of a test sample is difficult to recognize, the unfair target model outputs fair results. This suggests a direct idea: we can hide the bias information in the sample to turn the model's output from unfair to fair. Based on this observation, we consider using adversarial attacks to manipulate the image with the goal of removing bias information, as sketched below.
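A minimal sketch of this idea, assuming a pretrained probe classifier `bias_probe` that predicts the bias attribute (e.g. gender); the PGD-style loop and its hyper-parameters are illustrative assumptions, not the paper's exact attack.

```python
import torch
import torch.nn.functional as F

def hide_bias_attribute(x, bias_probe, eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD-style perturbation that pushes the bias probe towards a uniform
    (uninformative) prediction, so the bias attribute becomes hard to recognize."""
    x_adv = x.clone().detach()
    with torch.no_grad():
        num_classes = bias_probe(x).shape[-1]
    uniform = torch.full((x.size(0), num_classes), 1.0 / num_classes, device=x.device)

    for _ in range(steps):
        x_adv.requires_grad_(True)
        log_probs = F.log_softmax(bias_probe(x_adv), dim=-1)
        # Loss: divergence from a uniform distribution over the bias attribute.
        loss = F.kl_div(log_probs, uniform, reduction="batchmean")
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Descend on the loss, i.e. move towards an uninformative bias prediction.
        x_adv = x_adv.detach() - alpha * grad.sign()
        # Stay within the epsilon-ball around the original image and in valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```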
When we use adversarial attacks alone to achieve instance debiasing, the debiasing performance is insufficient and far from our goal. However, when we test only on the Natural Unbiased Instances, the model bias is much lower than the bias obtained on the entire test set, and the method achieves excellent debiasing performance.
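The comparison can be expressed, for example, as evaluating the same bias metric on the full test set and on the naturally unbiased subset only. The demographic-parity gap below is an illustrative choice of metric (the paper's exact metric may differ), and `unbiased_mask` is a hypothetical indicator of the Natural Unbiased Instances.

```python
import torch

def demographic_parity_gap(preds, groups):
    """|P(yhat = 1 | g = 0) - P(yhat = 1 | g = 1)| for a binary group attribute."""
    p0 = (preds[groups == 0] == 1).float().mean()
    p1 = (preds[groups == 1] == 1).float().mean()
    return (p0 - p1).abs().item()

def compare_bias(model, x_test, groups, unbiased_mask):
    """Bias on the full test set vs. only the naturally unbiased instances."""
    with torch.no_grad():
        preds = model(x_test).argmax(dim=-1)
    full_bias = demographic_parity_gap(preds, groups)
    natural_bias = demographic_parity_gap(preds[unbiased_mask], groups[unbiased_mask])
    return full_bias, natural_bias
```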
Based on the PCA manifold space, we can observe the relation between the adversarial attack direction and the manifold feature direction. We measure it with the mean cosine similarity and observe that the mean cosine similarity between the "Loss Gradient of I" and the "Loss Gradient of M" is 0.36. This low cosine similarity means that the direction of the adversarial attack is not the feature direction, which leads to underperforming debiasing.
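As a hedged sketch of this measurement: take the loss gradient with respect to the input ("Loss Gradient of I") and compare it against the manifold-space gradient ("Loss Gradient of M", passed in directly here, since its exact construction from the PCA basis is not spelled out in this section) by averaging the cosine similarity over the batch.

```python
import torch
import torch.nn.functional as F

def input_loss_gradient(model, x, y):
    """Gradient of the task loss w.r.t. the input ("Loss Gradient of I")."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    return torch.autograd.grad(loss, x)[0]

def mean_cosine_similarity(grad_i, grad_m):
    """Mean cosine similarity between two batches of flattened gradient directions."""
    a, b = grad_i.flatten(1), grad_m.flatten(1)
    return F.cosine_similarity(a, b, dim=1).mean().item()
```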
Real-world applications do not allow modifying the model for debiasing, so the only available debiasing approach is post-processing. We compare the performance of our algorithm with three typical post-processing debiasing algorithms. In laboratory scenarios, debiasing algorithms can access the data preparation and training stages to retrain the model, so to evaluate effectiveness we also consider several typical pre-processing and in-processing debiasing solutions. We evaluate the proposed AEDA solution on both simulated and real-world bias scenarios, and all experimental results support that Post-fairness successfully obtains unbiased results from the biased model without modifying the original model.