This blog post is an introduction to the evaluation measures which can be used to ensure that NLP models perform well at ranking the textual data they receive. This importance ranking of the textual inputs our models process is one of the building blocks Spoke uses to generate valuable, high-quality summaries for our users – saving them significant amounts of time and frustration by avoiding e.g. duplicate work. We tried a few different evaluation measures for this ranking step and put together an overview of the different approaches.
Ranking text can in general be treated either as an ordinal classification task or as a linear regression task; the respective measures for evaluating a model's performance can be found below. Some evaluation measures are more complex than others (to implement and to apply to a model), and you will find a table (along with the associated Python libraries) to help you choose what best suits your use case.
Ranking text or sentences in a document, in order to extract importance information for extractive summarisation, can be treated as either a classification or a regression task. If we consider the rank of a sentence as an integer on a continuous scale, we can use the linear regression approach. Alternatively, the ranks can be placed on an ordinal scale, and a classification approach can then be used to classify sentences by importance within a document. A document could, for instance, consist of multiple sentences (as shown below), and to extract the top-priority or most important sentences, we would like to classify them:
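As a purely hypothetical sketch (the sentences and labels below are made up for illustration, not taken from real Spoke data), such a document and the desired per-sentence importance classification might look like this:

```python
# A hypothetical document split into sentences, together with the importance
# class we would like a model to assign to each sentence (labels are illustrative).
document = [
    ("The migration to the new data pipeline is complete.", "Top Priority"),
    ("We also tidied up a few variable names along the way.", "Not Important"),
    ("Two customers reported that exports now run twice as fast.", "Important"),
]

# For extractive summarisation we would keep only the most important sentences.
summary = [sentence for sentence, label in document if label == "Top Priority"]
print(summary)  # ['The migration to the new data pipeline is complete.']
```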
At Spoke, we produce AI-powered summaries for different types of users and use cases. For one of these use cases, we collect the progress our users have made on specific projects and work with a variety of powerful NLP models to produce summaries from these updates, improving alignment within teams and across organisations.
The aim is to produce valuable (extractive) summaries in order to reduce the time project leads spend on collecting updates and manually summarising them. A necessary cog in the larger summarisation wheel is ranking the collected progress updates by importance, e.g. classifying the updates in an ordinal manner on an importance scale (a Likert scale) with four values such as Top Priority, Very Important, Important and Not Important, or predicting an integer rank (1 to 4) treated as a continuous target using regression.
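As a hedged sketch of these two framings (the label-to-rank mapping below, with 1 as the least important, is an assumption for illustration rather than Spoke's actual encoding):

```python
# The four-point importance scale (a Likert-style scale) used as ordinal classes.
IMPORTANCE_LABELS = ["Not Important", "Important", "Very Important", "Top Priority"]

# The same scale encoded as integer ranks 1..4 for the regression framing
# (the direction of the mapping, 1 = least important, is an assumption).
IMPORTANCE_SCALE = {label: rank for rank, label in enumerate(IMPORTANCE_LABELS, start=1)}

# Classification target: one of the four labels; regression target: its integer rank.
label = "Very Important"
print(label, "->", IMPORTANCE_SCALE[label])  # Very Important -> 3
```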
As explained above, ranking can either be treated as a classification or a regression task – naturally the evaluation measures differ for these two approaches:
The standard evaluation measures used for nominal classification, e.g. binary classification, penalise all misclassifications made by an ML/DL model equally. In ordinal classification, however, the penalty for misclassifying a “high rank” data point as “low rank” should be much higher than for misclassifying it as “mid rank”.
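To make this concrete, here is a small, hedged illustration (using made-up ranks, and mean absolute error as one simple distance-aware measure rather than any specific measure from the paper): plain accuracy treats an error of one rank and an error of three ranks identically, while MAE over the rank values penalises the larger error more.

```python
from sklearn.metrics import accuracy_score, mean_absolute_error

# True and predicted ranks on the 1..4 importance scale (hypothetical values).
y_true = [4, 4, 2, 1, 3]
y_pred_near = [3, 4, 2, 1, 3]  # one prediction is off by a single rank
y_pred_far = [1, 4, 2, 1, 3]   # one prediction is off by three ranks

# Accuracy treats both mistakes identically: one error out of five.
print(accuracy_score(y_true, y_pred_near))  # 0.8
print(accuracy_score(y_true, y_pred_far))   # 0.8

# A distance-aware measure such as MAE penalises the larger rank error more.
print(mean_absolute_error(y_true, y_pred_near))  # 0.2
print(mean_absolute_error(y_true, y_pred_far))   # 0.6
```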
The following three criteria are to be considered before using any evaluation measure for your use case (cf. ACL paper):
A fourth criterion to be considered concerns whether the evaluation measure handles cases in which the ranks/classes are not equidistant – this fourth aspect is out of scope for this post.
The evaluation measures, along with their ease of implementation and which of the criteria above they fulfil, are shown in the table below. All of these measures are computed from a |C|×|C| confusion matrix of the ML/DL model's predictions, where |C| denotes the number of unique rank values (or simply put, labels) the data is being classified into. For the detailed calculation of each measure, please refer to the paper referenced above.
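As a minimal, hedged sketch of that starting point (it only builds the confusion matrix with scikit-learn; the measures from the paper themselves are not reproduced here, and the ranks are made up):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted ranks on the four-point scale, so C = {1, 2, 3, 4}.
labels = [1, 2, 3, 4]
y_true = [4, 4, 3, 2, 1, 1, 3, 2]
y_pred = [4, 3, 3, 2, 1, 2, 4, 2]

# |C| x |C| confusion matrix: rows are true ranks, columns are predicted ranks.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[1 1 0 0]
#  [0 2 0 0]
#  [0 0 1 1]
#  [0 0 1 1]]

# Each ordinal evaluation measure is then computed from this matrix, typically
# weighting off-diagonal cells by how far they lie from the diagonal.
```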
For the regression approach, the ranking is done on a continuous scale and the predicted output is a numeric (integer-valued) variable with a finite range. The evaluation measures are as follows:
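As a hedged sketch, standard regression measures such as mean absolute error, mean squared error and R² can be computed with scikit-learn; the snippet below is an illustration on made-up values, not necessarily the exact set of measures we use:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true ranks and continuous model predictions within the 1..4 range.
y_true = [4, 3, 2, 1, 4, 2]
y_pred = [3.6, 3.1, 2.4, 1.2, 3.0, 2.0]

print(mean_absolute_error(y_true, y_pred))  # average absolute rank error
print(mean_squared_error(y_true, y_pred))   # squares larger errors more heavily
print(r2_score(y_true, y_pred))             # proportion of variance explained
```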
After internal discussions on the various evaluation measures, we decided to implement a few different ones – both for ordinal classification and for regression (the first three in the table and in the ranked list above) – keeping in mind which criteria each of them fulfils. We chose the measures that fulfil all or most of the criteria, in order to get a good sense of how well our models perform at ranking text.
In case you would like to implement any of these, or would like to know more about this topic, please feel free to reach out to Nishtha or anyone else on the Spoke Team with your questions – we're always happy to chat!