Enhancing Traceability Link Recovery with Unlabeled Data

Abstract

Traceability link recovery (TLR) is an important software engineering task for developing trustworthy and reliable software systems. Recently proposed deep learning (DL) models have proven more effective than traditional information retrieval-based methods. However, DL models rely heavily on sufficient labeled data for training, and manually labeling traceability links is time-consuming, labor-intensive, and requires specific knowledge from domain experts. As a result, real-world projects typically have only a small amount of labeled data alongside a large amount of unlabeled data. Our hypothesis is that artifacts are semantically similar if they have the same linked artifact(s). This paper presents TraceFUN, a new approach to enhance traceability link recovery with unlabeled data. TraceFUN first measures the similarities between unlabeled and labeled artifacts using one of two similarity prediction methods (i.e., vector space model and contrastive learning). Then, based on these similarities, new labeled links are generated between the unlabeled artifacts and the linked objects of the labeled artifacts. The generated links are then used as additional data for TLR model training. We have evaluated TraceFUN on three GitHub projects with two state-of-the-art DL models (i.e., Trace BERT and TraceNN). The results show that TraceFUN is effective, improving F1-score by up to 21% for Trace BERT and by up to 1,088% for TraceNN.
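The link-generation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain TF-IDF vector space model with cosine similarity, whitespace tokenization, and a fixed similarity threshold for deciding which unlabeled artifacts receive links. The function names (`tfidf_vectors`, `cosine`, `generate_links`), the `threshold` parameter, and the data layout are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) for a list of token lists."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def generate_links(labeled, unlabeled, links, threshold=0.5):
    """Propose new links for unlabeled artifacts.

    labeled, unlabeled: dict mapping artifact id -> text.
    links: dict mapping labeled artifact id -> set of linked target ids.
    threshold: assumed cutoff; the paper's selection rule may differ.
    Returns a sorted list of (unlabeled_id, target_id) pairs.
    """
    all_texts = {**labeled, **unlabeled}
    ids = list(all_texts)
    vectors = dict(zip(ids, tfidf_vectors(
        [all_texts[i].lower().split() for i in ids])))

    new_links = set()
    for u_id in unlabeled:
        for l_id in labeled:
            # If the two artifacts are similar enough, the unlabeled one
            # inherits the linked objects of the labeled one.
            if cosine(vectors[u_id], vectors[l_id]) >= threshold:
                for target in links.get(l_id, ()):
                    new_links.add((u_id, target))
    return sorted(new_links)
```

Calling `generate_links` with a few labeled issue texts, a few unlabeled ones, and the known issue-to-commit links returns candidate pairs that could be appended to the training set. The threshold trades off the number of generated links against their likely correctness, so it would need tuning per project.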

Publication
2022 IEEE 33rd International Symposium on Software Reliability Engineering (ISSRE)